Reservoir Sampling

Reservoir sampling is a great random sampling algorithm every data engineer should know. It extracts a random sample of a specified size from a dataset that is too large to hold in memory all at once, and whose total size is unknown at the time the sample is taken.

Over on Wikipedia you can check out a nice explanation of the idea behind reservoir sampling. It presents the simplest version of the algorithm. Let me regurgitate it in C#:

public static T[] TakeSample<T>(IEnumerable<T> input, int sampleCount) {
	if (sampleCount <= 0)
		throw new ArgumentOutOfRangeException ("sampleCount");

	Random rand = new Random ();
	var samples = new T[sampleCount];
	int i = 0;
	foreach (T item in input) {
		if (i < sampleCount) {
			// The first sampleCount items fill the reservoir directly.
			samples [i] = item;
		} else {
			// Pick j uniformly from [0, i]; item i replaces a reservoir
			// slot with probability sampleCount / (i + 1).
			int j = rand.Next (i + 1);
			if (j < sampleCount)
				samples [j] = item;
		}
		i++;
	}

	if (i < sampleCount)
		throw new InvalidOperationException ("Input data does not have enough items to sample");

	return samples;
}

To some this algorithm may seem botched, as the later items appear to have a smaller chance of being selected. In fact, every item ends up in the sample with exactly the same probability, k/n (where k is the sample count and n the total number of items). The items more likely to enter the reservoir early on (with absolute certainty for the first k items) are also more likely to be replaced by a successor item, and the two effects cancel out exactly.
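If you don't trust the math, a quick simulation can confirm it empirically. The sketch below (my own demo, not part of the original algorithm) repeatedly samples 10 of 100 items and counts how often each item is picked — every count should land near trials × k/n:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class UniformityCheck
{
	// Same simple reservoir algorithm as above, inlined so this demo is
	// self-contained; a seeded Random is passed in for repeatability.
	public static int[] TakeSample(IEnumerable<int> input, int sampleCount, Random rand)
	{
		var samples = new int[sampleCount];
		int i = 0;
		foreach (int item in input) {
			if (i < sampleCount) {
				samples[i] = item;
			} else {
				int j = rand.Next(i + 1);
				if (j < sampleCount)
					samples[j] = item;
			}
			i++;
		}
		return samples;
	}

	static void Main()
	{
		var rand = new Random(42);
		var hits = new int[100];          // how often each of 100 items is sampled
		const int trials = 20000;
		for (int t = 0; t < trials; t++)
			foreach (int x in TakeSample(Enumerable.Range(0, 100), 10, rand))
				hits[x]++;

		// Each item is expected about trials * 10/100 = 2000 times.
		Console.WriteLine(hits.Min());
		Console.WriteLine(hits.Max());
	}
}
```

With 20,000 trials each count should sit close to 2,000, confirming the distribution is uniform across the whole stream.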

Jeffrey Scott Vitter has published some alternative algorithms (notably Algorithm Z) that optimise the performance of the sampling by computing how many records to skip, instead of drawing a random number for every single item.

Distributed Reservoir Sampling

When you are dealing with BIG data and have distributed infrastructure at your disposal, you're not going to stream the entire dataset into a single sequence. There is a crafty map-reduce approach to the sampling problem over on Stack Overflow. I really like this approach, as it combines fundamental computer science know-how with big-data technology to get the job done!
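The gist of the distributed approach, as I would sketch it (the method names and structure here are my own, not taken from the Stack Overflow answer): tag every item with a uniform random key, have each partition keep the k items with the smallest keys, then merge the partial results by again keeping the k smallest keys overall. Taking the k smallest of n uniform random keys is equivalent to a uniform sample of size k, and the merge preserves that property:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

class DistributedSample
{
	// Map step: tag each item in a partition with a uniform random key and
	// keep the k pairs with the smallest keys.  This is the partition's
	// local sample, small enough to ship to a single reducer.
	public static List<(T Item, double Key)> MapPartition<T>(
		IEnumerable<T> partition, int k, Random rand)
	{
		return partition
			.Select(item => (Item: item, Key: rand.NextDouble()))
			.OrderBy(p => p.Key)
			.Take(k)
			.ToList();
	}

	// Reduce step: merge the local samples and again keep the k smallest
	// keys.  The result is distributed exactly like a single-machine
	// uniform sample of size k over the whole dataset.
	public static T[] Reduce<T>(IEnumerable<List<(T Item, double Key)>> locals, int k)
	{
		return locals
			.SelectMany(l => l)
			.OrderBy(p => p.Key)
			.Take(k)
			.Select(p => p.Item)
			.ToArray();
	}
}
```

In a real cluster the map step runs once per partition on separate machines and only the tiny (item, key) lists travel over the network, which is exactly what makes the approach scale.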