Hi everyone, I'm at JSM 2018 right now, so feel free to drop by my session or say hello in the halls! Just send me a tweet!

Back to the meat and potatoes of this post. A while ago I was running good old sample and comparing its performance to my lpm2_kdtree function in the BalancedSampling package (Grafström and Lisic, 2016). In doing so, I noticed that sample was in some cases slower than my balanced sampling method when sampling weights were supplied.

library(BalancedSampling)
library(ggplot2)

# sampling proportion
f <- 0.1

# population size 
N_values <- seq(from=5000, to=100000, length.out=10)

set.seed(99)

# inclusion-probability weights (note: this masks the built-in constant pi)
pi <- runif(100000)

result <- c()

for( N in N_values) {
  # get sample size 
  n <- round(f*N)

  ## run lpm2-kdtree

  start_time <- proc.time()[3]
  # to make things even we normalize pi so it sums to n
  pi_tmp <- pi[1:N] * n/sum(pi[1:N])
  # we probably don't want to balance by pi_tmp, but 
  # for checking speed this is fine
  lpm2_kdtree(pi_tmp,pi_tmp)
  end_time <- proc.time()[3] - start_time

  result <- rbind(result, c(f, N, 1, end_time ))

  ## run sample 

  index <- 1:N

  start_time <- proc.time()[3]
  # reuse the normalized pi_tmp from above as sampling weights
  sample(index, size=n, prob=pi_tmp)
  end_time <- proc.time()[3] - start_time

  result <- rbind(result, c(f, N, 2, end_time ))
}


colnames(result) <-
  c('SamplingFraction', 'PopulationSize', 'Method','RunTime')
result <- as.data.frame(result)

result[c('SamplingFraction','Method')] <-
  lapply(result[c('SamplingFraction','Method')], as.factor)
levels(result$Method) <- c("lpm2_kdtree","sample")

ggplot( aes(x=PopulationSize, y=RunTime, group=Method), data=result) +
  geom_line(aes(color=Method)) +
  ylab("Run Time (s)") +
  xlab("Population Size")

Figure 1: lpm2_kdtree vs. R's sample.

How could this be?

Let's consider a few possible reasons.

Are you running an old version of R?

No, this was done with R 3.5.0, and the results hold on 3.5.1 as well.

Is this because you are a freedom hating Mac user?

Nope, I get similar results on Linux.

Is this due to the revsort in ProbSampleNoReplace found in random.c?

How amazingly prescient of you! If you track down the code path, the method used for weighted sampling calls revsort right at the beginning (line 323). Some basic testing identifies this as the likely culprit. Therefore, it might be possible to speed up sampling by first checking whether the probabilities are already sorted.
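To see the slow path in isolation, here is a quick sketch comparing sample()'s weighted, without-replacement path against the unweighted path at the same N and n. Timings are machine-dependent, but the gap is typically large:

```r
# Weighted sampling without replacement goes through
# ProbSampleNoReplace in random.c; unweighted sampling takes a
# much cheaper code path.
set.seed(1)
N <- 1e5
p <- runif(N)

t_weighted   <- system.time(sample(N, size = N / 10, prob = p))["elapsed"]
t_unweighted <- system.time(sample(N, size = N / 10))["elapsed"]

round(c(weighted = t_weighted, unweighted = t_unweighted), 3)
```

On my machine the weighted call is orders of magnitude slower, which is consistent with the plot above.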

Keep in mind that there are likely good reasons for R to use this method over others, including the alternative method below. One is computational stability: split sampling through the pivotal method tends to work well when sample sizes aren't extremely small. However, I don't have any good comparisons of these methods; if anyone knows of one, please let me know, since doing a literature search on this is pretty far down my todo list.
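For intuition, here is a minimal, unoptimized sketch of the splitting idea behind the pivotal method. This is not the package implementation, just an illustration: repeatedly resolve a pair of fractional inclusion probabilities until every unit's probability is 0 or 1.

```r
# Hypothetical sketch of the sequential pivotal method.  Each step
# takes two "live" (fractional) probabilities and pushes one of them
# to 0 or 1, preserving the expected sample size sum(pi).
pivotal_sketch <- function(pi) {
  eps <- 1e-9
  repeat {
    live <- which(pi > eps & pi < 1 - eps)
    if (length(live) < 2) break
    i <- live[1]; j <- live[2]
    s <- pi[i] + pi[j]
    if (s <= 1) {
      # one of the two units is knocked out (set to 0)
      if (runif(1) < pi[j] / s) {
        pi[j] <- s; pi[i] <- 0
      } else {
        pi[i] <- s; pi[j] <- 0
      }
    } else {
      # one of the two units is selected (set to 1)
      if (runif(1) < (1 - pi[j]) / (2 - s)) {
        pi[i] <- 1; pi[j] <- s - 1
      } else {
        pi[j] <- 1; pi[i] <- s - 1
      }
    }
  }
  which(pi >= 1 - eps)  # indices of the selected units
}
```

Each update preserves the marginal inclusion probabilities (you can verify that the expected value of each pi stays fixed), and when sum(pi) is an integer n, exactly n units end up selected.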

That's great, but what can I do to improve my weighted or PPS sampling?

Well, I wrote a quick pivotal-method implementation (think LPM2 without balancing) that seems to work. It's available as split_sample in my GitHub fork of BalancedSampling. There is also UPrandompivotal in the sampling package (Tillé and Matei, 2016), but my implementation seems to be considerably faster. We will take a closer look at this in another blog post.

Finally, keep in mind that split_sample probably isn't as well tested as UPrandompivotal, so I'm happy to take any feedback.

library(devtools)
install_github("jlisic/BalancedSampling")
library(BalancedSampling)
library(ggplot2)
library(sampling)

# sampling proportion
f <- 0.1

# population size 
N_values <- seq(from=5000, to=100000, length.out=10)

set.seed(99)

# inclusion-probability weights (note: this masks the built-in constant pi)
pi <- runif(100000)

result <- c()

for( N in N_values) {
  # get sample size 
  n <- round(f*N)

  ## run lpm2-kdtree

  start_time <- proc.time()[3]
  # to make things even we normalize pi so it sums to n
  pi_tmp <- pi[1:N] * n/sum(pi[1:N])
  # we probably don't want to balance by pi_tmp, but 
  # for checking speed this is fine
  lpm2_kdtree(pi_tmp,pi_tmp)
  end_time <- proc.time()[3] - start_time

  result <- rbind(result, c(f, N, 1, end_time ))


  ## run pivotal method (BalancedSampling) 


  start_time <- proc.time()[3]
  # to make things even we normalize pi so it sums to n
  pi_tmp <- pi[1:N] * n/sum(pi[1:N])
  split_sample(pi_tmp)
  end_time <- proc.time()[3] - start_time

  result <- rbind(result, c(f, N, 2, end_time ))


  ## run pivotal sampling (sampling) 


  start_time <- proc.time()[3]
  # to make things even we normalize pi so it sums to n
  pi_tmp <- pi[1:N] * n/sum(pi[1:N])
  UPrandompivotal(pi_tmp)
  end_time <- proc.time()[3] - start_time

  result <- rbind(result, c(f, N, 3, end_time ))
}


colnames(result) <-
  c('SamplingFraction', 'PopulationSize', 'Method','RunTime')
result <- as.data.frame(result)

result[c('SamplingFraction','Method')] <-
  lapply(result[c('SamplingFraction','Method')], as.factor)
levels(result$Method) <-
  c("lpm2_kdtree","BalancedSampling::split_sample",
    "sampling::UPrandompivotal")

ggplot( aes(x=PopulationSize, y=RunTime, group=Method), data=result) +
  geom_line(aes(color=Method)) +
  ylab("Run Time (s)") +
  xlab("Population Size")

Figure 2: lpm2_kdtree vs. BalancedSampling::split_sample vs. sampling::UPrandompivotal.

Have a great JSM!

References

Grafström, A., Lisic, J.: BalancedSampling: balanced and spatially balanced sampling. R package version 1.5.2. https://CRAN.R-project.org/package=BalancedSampling. Accessed August 28, 2018. (2016)

Tillé, Y., Matei, A.: sampling: Survey Sampling. R package version 2.8. https://CRAN.R-project.org/package=sampling. Accessed August 28, 2018. (2016)