Friday, September 16, 2011

Sampling a discrete distribution

The following cute question came up at lunch today (all credit goes to Aaron Archer).


Informal challenge: You are given a discrete distribution with n possible outputs (and you know the probability pi of each output), and you want to sample from the distribution. You can use a primitive that returns a random real number in [0,1].

This was the way the problem came to me, and the goal was to find a very fast practical solution. Think of n as around 106 and use a regular machine as your machine model (real number means a "double" or something like that).


Formal challenge: To formalize this completely for a bounded-precision computer, we can say that we work on a w-bit machine and probabilities have w-bit precision. The distribution is given by n integers xi, and these values sum to exactly 2w. Output "i" should be produced with probability exactly xi / 2w. You are given a primitive that produces a uniformly distributed w-bit number.

The challenge is to get O(1) worst-case time. (NB: some fancy data structures required.)

11 comments:

Anonymous said...

That is a cute question.

No fancy data structures are required using the solution from slides 41--48 of this PDF: http://www.eurandom.nl/events/workshops/2009/quantative_models/BusicSIMParf.pdf

The solution there, known as the Walker method or alias method, has O(n log n) preprocessing and O(1) query time (both worst-case). The preprocessing can easily be improved to O(n), which was done by Kronmal and Peterson or can be easily done in a couple minutes.

Jim Apple said...

How much space am I allowed to use?

Anonymous said...

For the challenge, what pre-processing time are you allowing us?

Mihai said...

The space is O(n) and the preprocessing time=sorting time.

Anon Rex, I don't quite see how this solves the problem. (But it's likely that I'm misunderstanding the algorithm).

If many columns have height 1/N + eps, and some are much below 1/N, you need to move mass from many high columns to a low one. Then each column has many breaks inside it, and you need something like predecessor search to select among the breaks. Can you clarify what the algorithm really does?

Jim Apple said...

Mihai, I think the idea is that in step 1, if the tallest column has height 1/n + eps and the shortest one has height delt, you don't just move eps, but 1/n - delt. Then each column has only one break. (That's what I got from the pictures in the PDF, at least)

I would love to hear the fancy data structure solution!

Anonymous said...

Here's a 2-page note by Warren Smith, who appears to have independently (but subsequently) discovered the linear-time version of Walker's method. It works a lot like quicksort partitioning.

Anonymous said...

Jim is right. It is explained a little better here: http://www.scribd.com/doc/55772825/62/Alias-Method

Basically you want to cut up the probabilities p_i into 2n pieces y_jk for k=1,2, j=1,...,n, in such a way that y_j1+y_j2=1/n, and the sum of y_jk that are associated with p_i sum up to p_i. (Then sampling in constant time is easy.) This turns out to be easy: Just take the smallest p_i (say p_min), define y_11 = p_min and y_12 = 1/n-p_min. Associate y_11 with p_min and y_12 with p_max (the largest p_i). Set p_max = p_max - 1/n + p_min (which is larger than 0 since p_max >= 1/n). Now recurse on the probabilities except p_min.

Apologies for the ugly writeup of the very beautiful idea. And thanks Rex! I learned something today!

ilyaraz said...

Isn't the obvious O(log n)-time algorithm practical enough?

Anonymous said...

The Alias method does solve this problem, at least as it is described in Non-Uniform Random Variate Generation by Luc Devroye http://cg.scs.carleton.ca/~luc/rnbookindex.html

The key insight is that any discrete distribution over n possibilities can be rewritten as a mixture with uniform mixing proportions over 2-possibility discrete distributions. Then the cost for a sample (after setup) is 1 or 2 table lookups and 1 or 2 calls to the primitive random number source. The setup procedure can be done in linear time and only requires saving 2 tables of O(n) space since WLOG we can order the first components of each mixture component by index.

Rasmus Pagh said...

Not quite answering your question: See page 3 of this paper, section "Refinement", for a solution with expected constant time and no fancy data structure (only a trie with precomputed values).

Mihai said...

My solution went as follows. In principle, each possible output gets an interval inside [0,1] and the generated sample is the interval stabbed by the random number. Trivially, this translates to predecessor search, which is not constant time. But we are allowed to shuffle the intervals around, and this will make predecessor search O(1)-time. (It turns out that this magical shuffling is just sorting by length.)

Assume all probabilities are between p and 2p. Then we can easily solve the problem in O(1) time.

Observation 1: Given a point in [0,1], we can locate it inside intervals of roughly equal size (up to a factor of 2) in constant time.

Proof: We impose a grid of precision p and store the interval stabbed by every grid point. Then round the query point to the grid, and do a linear search among the intervals starting from where the nearest grid point lands. This cannot inspect more than O(1) intervals.

Observation 2: Bucket probabilities between 2^k/2^w and 2^(k+1)/2^w for every k. Apply the solution above inside a bucket. The query first needs to figure out which bucket the random number landed in. This is predecessor search among w start points (the beginning of the first interval of each bucket). But this only needs constant time with fusion trees.

While fusion trees are a non-elementary data structure and probably not practical, one can say they "unite theory and practice". In practice, binary search among 32 values is not O(1) time, it's more like zero time because everything is in cache.

As an anonymous commenter points out, I overcomplicate things, and there is a very nice elementary solution already known in the literature. I'll summarize this in a later comment.