Monday, September 19, 2011

Follow-up: Sampling a discrete distribution

This is a follow-up to my last post, on a puzzle about "Sampling a discrete distribution" (see my comment there for the solution I originally thought of).


As an anonymous commenter (Rex?) points out, a nice elementary solution that has been known for a long time. I admit to not knowing this solution ahead of time, and use the fact that it's not really my field as an excuse. Here I want to summarize this solution for educational purposes.

Imagine each sample in the distribution is a vase containing liquid proportional to the probability of the outcome.

We classify these vases into 3 types:
LO —  probability less than 1/n
GOOD –  probability precisely 1/n
HI –  probability greater then 1/n

In the ideal case when every vase is GOOD, we are done (uniform sampling). If not, we  move towards this ideal case by inductively applying the following subroutine:

** Pick a LO vase (say, with x amount of liquid) and a HI vase (with y amount of liquid).

Grab a new vase and fill it up to the 1/n mark: pour all the liquid from the LO vase, and (1/n)-x liquid from the HI vase (remember that a HI vase has liquid y>1/n, so this is certainly possible). Note that the HI vase may now become LO, GOOD, or stay HI.

/end of **

Observe that the process terminates in n steps, since at each step we increment the number of GOOD vases.  If we maintain 3 linked lists with samples of each type, each step can be done in  O(1) time, so O(n) time total.

At the end all vases are GOOD so we use uniform sampling. If we get a random x in [0,1], we use ⌊x·n⌋ to indicate that vase and x·n mod 1 to pick a molecule of  water inside that vase. If we trace back the original vase from where this molecule came, we have sampled according to the initial distribution.

Now we describe how to "undo" the water pouring at query time, i.e. tracing the origin of each water molecule. Observe that operation (**) is never applied to a GOOD vase. So when a vase becomes GOOD, it has reached final form. But the water in a GOOD vase can only come from two distinct vases. So we can essentially draw a water-mark on the vase, to separate the LO water from HI water and store pointers to the original vases whose water got poured into the GOOD vase. Then the sampling algorithm only need one comparison, i.e. it compares x·n mod 1 to the mark on the vase ⌊x·n⌋, and follows a pointer from this vase to the input vases from which water was poured. Note that only one pointer is followed, since a GOOD vase is never touched again by operation (**). Constant time!

5 comments:

Anonymous said...

that's a pretty cool solution to this problem, thanks for explaining it!

-KO

Anonymous said...

wonderfully simple

Anonymous said...

and it is quite practical too, think for instance to simulation applications where this technique can be used to save an O(log n) factor over the trivial solution.

liana said...

nice...

Anonymous said...

Possible simpler solution:

First, assume we know that we want to sample k ahead of time. Then, generate k uniformly random numbers R, sort them, and iterate over the n cumulative probabilities CP in order. For each cp in CP, test if the smallest r in R falls in this range. If it does, advance the R pointer to the next smallest, else advance to the next CP. Sorting R is O(k) w.h.p., and the runtime is therefore O(n + k).

To solve the dynamic problem, assume k is n, giving O(n) preprocessing and O(n) reprocessing every n queries => amortized O(1) queries with high probability.

This can be de-amortized easily too...