
How to pick a random sample from a list

A problem that crops up fairly frequently is that we want to pick n random elements from a given list or array. For example, a quiz program might have 1000 possible questions and need to pick 20 of them. We would generally like each element to have an equal chance of being picked.

Probably the most common method is what I call the "naive" approach: repeatedly pick a random element from the whole list and, if that element was already picked on a previous iteration, simply try again until we find one that hasn't yet been picked. It should be obvious that as the number of elements to pick grows relative to the number available, this method becomes dreadfully inefficient. If you insist on using this method, then at least store the currently selected sample in a HashSet or other structure offering efficient lookup. About the only saving grace of this method is that it is easy to understand.

The semi-naive approach: shuffle and slice

In Java, there's usually not much reason to use the naive approach, because a more efficient approach can be coded in a couple of lines using calls from the collections library. The technique, which I call "shuffle and slice", is simply the following, assuming the elements to choose from are in an array: shuffle the entire array, then take the first n elements as the sample.
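A minimal sketch of shuffle-and-slice might look like this (the method name pickSample and the caller-supplied Random are my own choices for illustration; Collections.shuffle works on the List view returned by Arrays.asList, so the original array is shuffled in place):

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.Random;

public class ShuffleAndSlice {
    // Shuffle the whole array in place, then "slice" off the first n
    // elements. Note that this disturbs the order of the original array.
    public static <T> T[] pickSample(T[] population, int n, Random r) {
        Collections.shuffle(Arrays.asList(population), r);
        return Arrays.copyOfRange(population, 0, n);
    }

    public static void main(String[] args) {
        // e.g. 1000 quiz questions, pick 20 of them
        Integer[] questions = new Integer[1000];
        for (int i = 0; i < questions.length; i++) {
            questions[i] = i + 1;
        }
        Integer[] sample = pickSample(questions, 20, new Random());
        System.out.println(sample.length);  // prints 20
    }
}
```

Because every permutation of the array is equally likely after the shuffle, each element has the same chance of ending up among the first n positions.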
This technique is efficient in the sense that the time taken increases linearly with either n or the number of elements to choose from. And it has the significant advantage of being based on library methods: in other words, you pretty much "can't go wrong". A minor inconvenience in some cases is that it disturbs the order of the data being sampled from. It turns out that there is another method which is a little quicker on average, and which preserves the order of the elements being sampled from.

(Slightly) more efficient sampling

Shuffling the entire array as in the previous method scales linearly with the number of elements to choose from, because shuffling requires generating a random number for every position in the array. Another technique, which also scales linearly with the number of elements being chosen from but on average requires fewer calculations per element, is as follows:
At each position, we want the probability of selecting the given element to be the number of elements still needed divided by the number of elements still left. For example, suppose we want to pick 3 random elements from a total of 15. At the first index, the chance of picking that element should be 3/15. If that element is indeed picked, then at the next index we need to pick 2 out of the remaining 14 elements, so the chance of the second element being picked is 2/14. If the first element wasn't picked, then we still need to pick 3 out of the remaining 14 elements, so the chance of the second one being picked is 3/14, and so on. An example implementation in Java goes something like this (assuming imports of java.lang.reflect.Array and java.util.Random):

  public static <T> T[] pickSample(T[] population, int nSamplesNeeded, Random r) {
    @SuppressWarnings("unchecked")
    T[] ret = (T[]) Array.newInstance(
        population.getClass().getComponentType(), nSamplesNeeded);
    int nPicked = 0, i = 0, nLeft = population.length;
    while (nSamplesNeeded > 0) {
      int rand = r.nextInt(nLeft);
      if (rand < nSamplesNeeded) {
        ret[nPicked++] = population[i];
        nSamplesNeeded--;
      }
      nLeft--;
      i++;
    }
    return ret;
  }

On average, this method is a little quicker than shuffling the entire array, since there is a chance that we'll pick the required number of elements before we've scanned right to the end of the array. Thus, fewer random numbers generally need to be generated. On average, the proportion of the elements that we need to scan in order to pick n of them is roughly n/(n+1): to pick 2 random elements, we need to scan roughly 2/3 of the data; to pick 3, roughly 3/4; to pick 4, roughly 4/5, and so on. So the method is likely to give a tangible advantage only when we need to pick a very small sample from a large number of elements.

Written by Neil Coffey. Copyright © Javamex UK 2013. All rights reserved.