WebDiarios de Motocicleta: January 2010

Wednesday, January 27, 2010

Basic Hashtables

To understand the state of the art in hash tables, you must understand the holy trinity of the area: chaining, linear probing, and cuckoo hashing. Chaining is the one that amateurs know, and shows up frequently in code. Linear probing is what you use when performance really matters. And cuckoo hashing is the theoretician's darling, providing the playground for a constant stream of papers.

Here is a basic description of the three hash tables, if you don't know them. There are, of course, many variations.

chaining: Each item x is hashed to one of b bins, where b=Ω(n). Each bin is stored as a linked list, with pointers to the head of each list stored in an array A[1..b]. In practice, you would store the head of each list in A[i], to save a pointer and a cache miss.
linear probing: We hold an array A[1..b] of records, where b ≥ (1+ε)n. When inserting x, we try to place it at A[h(x)]; if that location is empty, try A[h(x)+1], A[h(x)+2], ..., until you find an empty location. This addresses the main performance issues of chaining: there are no cache misses (we walk a contiguous region, not a linked list), and the space is better (no pointers). But, intuitively, it demands much more robustness from the hash function: now some elements hashing to location k can interfere negatively with elements hashing to a close k'.
cuckoo hashing: We hold two arrays A[1..b] and B[1..b] and use two hash functions, h and g. When x arrives, we try to put it at A[h(x)]. If that location already contains y, try to move y to B[g(y)]. If that location already contains z, try to move z to A[h(z)], and so on until you find a free spot. Observe that the query for x is worst-case constant time: just look for x in A[h(x)] and B[g(x)]!

Let us assume, for now, that hash functions are truly random. Moment bounds, which I discussed last time, are useful for understanding both chaining and linear probing (see below). For cuckoo hashing, the analysis is quite different.

Chaining. It is trivial to argue that the expected running time of insertions
and deletions is constant. Focus on some element q. For i≠q, let X_i be the indicator that h(i)=h(q). Then, the time it takes to insert or query q is O(1 + ΣX_i).

Therefore, the expected time is bounded by E[ΣX_i] = ΣE[X_i] = n/b = O(1), since h(x)=h(i) only happens with probability 1/b.

What we have just argued is that the expected number of elements that collide with x is O(1). Another way to state this is that the variance of a bin's size is O(1), a fact that we proved last time. To see this connection, let B_i be the number of elements in bin i. Observe that:

E[Σ(B_i)²] = n + E[#colliding pairs] = n + n · E[#elements colliding with x] = n + n²/b

By uniformity of the hash function, E[(B_i)²] = n/b + n²/b². We have obtained the variance: Var[B_i] = E[(B_i)²] - E[B_i]² = n/b.

Perfect hashing. A very cool consequence of this variance analysis is the well-known dictionary of [Fredman, Komlós, Szemerédi FOCS'82]. Their idea was to construct a static dictionary using randomization, but then have the query be completely deterministic. (Later work has focused on obtaining deterministic queries even in dynamic dictionaries, as in cuckoo hashing, and on completely eliminating randomness.)

The basic idea is that, if we had space 2n², perfect static dictionaries would be trivial. Indeed, the expected number of collisions is n² / b = 1/2, so, by Markov, the hash function is collision-free with probability at least 1/2. For the construction, we can simply generate hash functions until we see a perfect one (a constant number of iterations, in expectation).

To bring the space down to O(n), remember that our variance analysis showed E[Σ(B_i)²] = O(n). Thus, instead of storing the items mapping to A[i] as a linked list, we should store a mini-hashtable of quadratic size inside each A[i]. These mini-tables provide perfect hashing, but their total size is just linear!

Linear probing. The relevance of moments to linear probing was only recognized in a recent breakthrough paper [Pagh, Pagh, Ruzic STOC'07]. I will show the analysis for b=3n to ease notation; it is easy to extend to any load.

In true data-structures style, we consider a perfect binary tree spanning the array A[1..b]. A node on level k has 2^k array positions under it, and (1/3)·2^k items were originally hashed to them in expectation. (Here I am counting the original location h(x) of x, not where x really appears, which may be h(x)+1, h(x)+2, ...). Call the node "dangerous" if at least (2/3)·2^k elements hashed to it.

Now say that we are dealing with element q (a query or an update). We must bound the contiguous run of elements that contain the position h(q). The key observation is that, if this run contains between 2^k and 2^k+1 elements, either the ancestor of h(q) at level k-2 is dangerous, or one of its siblings in an O(1) neighborhood is dangerous.

Let's say this run goes from A[i] to A[j], i≤h(q)≤j. The interval [i,j] is spanned by 4—9 nodes on level k-2. Assume for contradiction that none are dangerous. The first node, which is not completely contained in the interval, contributes less than (2/3)·2^k-2 elements to the run (it the most extreme case, this many elements hashed to the last location of that node). But the next nodes all have more than 2^k-2/3 free locations in their subtree, so 2 more nodes absorb all excess elements. Thus, the run cannot go on for 4 nodes, contradiction.

Now, the expected running time of an operation is clearly:

Σ_k O(2^k)·Pr[h(q) is in a run of 2^k to 2^k+1 elements].

As argued above, this probability is at most O(1) times the probability that a designated node at level k-2 is dangerous.

The rest is a simple balls-in-bins analysis: we want the probability that a bin, of expected size μ=2^k-2/3, actually contains 2μ elements. Last time, we showed that Chebyshev bounds this probability by O(1/μ). Unfortunately, this is not enough, since Σ_k 2^k·O(1/2^k-2) = O(lg n).

However, if we go to the 4th moment, we obtain a probability bound of O(1/μ²). In this case, the running time is Σ_k2^k·O(1/2^2(k-2)) = Σ O(2^-k) = O(1). So the 4th moment is enough to make this series decay geometrically.

Tuesday, January 26, 2010

Moments

This post is a fairly basic review of common probability notions. Things will get more interesting in future posts. Somebody who wants to do TCS but has not studied probability yet can read this post carefully and reach about the same level of formal training that I've had :)

Say we distribute n items into b bins randomly. Fixing our attention on one bin (e.g. the first), how can be bound the number of items landing there?

The expected number of items is n/b. Thus, by the Markov bound, we get:

Pr[bin contains ≥ 2n/b items] ≤ 1/2

Variance. To strengthen the bound, we may look at the variance and apply the Chebyshev bound. Let X be the number of elements in the first bin. Also let μ=E[X]=n/b. The variance of X is defined as E[(X-μ)²], and this is exactly what we need to compute for Chebyshev.

How can we compute the variance? We let X_i be the indicator for whether the i-th item falls in our bin (indicator means the variable is 1 if the event happens, and 1 otherwise). Then, X=X₁+...+X_n.

Since we are interested in X-μ, not X, it is more convenient to define Y_i = X_i-(1/b). With this definition, X-μ=Y₁+...+Y_n. Observe that E[Y_i] = 0.

We can not break up the variance:

Var[X] = E[(X-μ)²] = E[( ΣY_i )²] = Σ E[Y_iY_j]

We are down to analyzing E[Y_iY_j], which is simple. If i≠j, Y_i and Y_j are independent random variables. Thus, the expectation commutes with the product:

E[Y_iY_j] = E[Y_i] E[Y_j] = 0

In the case i=j, we use brute force calculation. Y_i=-1/b with probability 1-(1/b), and equals 1-(1/b) with probability 1/b. Thus, E[(Y_i)²] = O(1/b).

We have found the variance to be E[(X-μ)²]=O(n/b)=O(μ). Then:

Pr[X ≥ 2n/b] = Pr[X-μ ≥ μ] ≤ Pr[(X-μ)² ≥ μ²] ≤ O(μ) / μ² = O(1/μ)

Observe that the Chebyshev bound is nothing more than Markov on the variable (X-μ)².

Third moment. If stronger bounds are required, we can try to look at higher moments, E[(X-μ)^k]. Unfortunately, moving to the 3rd moment (or any other odd moment) does not really help: the variable (X-μ)³ is no longer positive, so Markov cannot apply.

One way to fix this is to look at the absolute third moment: E[|X-μ|³]. It is no longer easy to compute this moment, since we cannot break up |ΣY_i|³ into components, due to the absolute value. Thus, we do not commonly use absolute moments.

However, I have come across absolute moments once, in the following interesting application. The central limit theorem states that the average of N i.i.d. variables tends to the normal distribution as N→∞. The Berry-Esseen theorem quantifies this phenomenon: it gives a fairly strong bound on the speed of the convergence, assuming the third absolute moment of the summands is bounded.

Fourth moment. To strengthen the variance bound, one most commonly looks at the 4th moment. To get a feeling for it, let's see how we can bound the 4th moment in the balls and bins case:

E[(X-μ)⁴] = E[( ΣY_i )⁴] = Σ_ijkl E[ Y_iY_jY_kY_l ]

We can understand the terms generated by an (i,j,k,l) tuple by case analysis:

If one of the elements appears exactly once, the term is zero. For instance, let's say i∉ {j,k,l}. Then Y_i is independent of the rest, so the expectation commutes with product: E[Y_iY_jY_kY_l] = E[Y_i] E[Y_jY_kY_l] = 0.
If all elements are equal (i=j=k=l), E[(Y_i)⁴] = O(1/b).
If the tuple consists of two equal pairs, for instance (i,i,j,j), we have E[(Y_i)² (Y_j)²] = E[(Y_i)²] E[(Y_j)²] = O(1/b²).

We have n terms of type 2, and O(n²) terms of type 3. Thus, the fourth moment is O(n² / b²) = O(μ²).

To bound the bin size, we can now apply Markov on the fourth moment:

Pr[X ≥ 2n/b] = Pr[X-μ ≥ μ] ≤ Pr[(X-μ)⁴ ≥ μ⁴] ≤ O(μ²) / μ⁴ = O(1/μ²)

Thus, our bounds have improved from 1/2 for Markov, to O(1/μ) for Chebyshev, and to O(1/μ²) for the 4th moment. Going to the 6th, 8th, etc yields the predictable improvement.

Chernoff. The next step to improving our bounds is to go to the Chernoff bound. This bound has many forms, in particular two rather different instantiations for additive and relative error.

Let me quote an uncommon, but nifty version of the theorem:

Let X₁, ..., X_n be independent random variables bounded in [0,1]. Let X=ΣX_i and μ=E[X].

If Z≥μ, then Pr[X≥Z] ≤ e^Z-μ (μ/Z)^Z.

If Z≤μ, then Pr[X≤Z] ≤ e^Z-μ (μ/Z)^Z.

In our case, we are interested in Z=2μ. Thus, the upper bound on the probability is e^μ / 2^2μ ≈ 1 / 1.47^μ. We have obtained an exponential bound, instead of the polynomial bound possible by constant moments.

If we have been interested in showing that the bin gets at least Z=μ/2 elements, the second part of the theorem gives a probability bound of e^-μ/2 2^μ/2 ≈ 1 / 1.17^μ. Note how the two terms of the bound trade places: e^Z-μ is pushing the probability down in the second case, while (μ/Z)^Z is making sure it is small in the first case.

The proof of Chernoff is a bit technical, but conceptually easy. As before, we define Y_i = X_i - E[X_i], so that X-μ=ΣY_i. Instead of looking at E[(X-μ)^k], we will not look at E[α^X-μ] (where α>1 is a parameter than can be optimized at the end). This quantity is easy to compute since E[α^ΣYi] = E[Π α^Yi] = Π E[α^Yi]. At the end, we can just apply Markov on the positive variable α^X-μ.

High probability. In TCS, we are passionate about bounds that hold "with high probability" (w.h.p.), which means probability 1 - 1/n^c, for any constant c. For instance,

Algorithm A runs in O(n) time w.h.p.

formally means the following:

There exists a function f(.) such that, if you choose any constant c, I can prove that algorithm A runs in time f(c)·n with probability 1 - 1/n^c.

While such bounds may seem weird at first, they do make a lot of sense: think of applying some randomized procedure a polynomial number of times. Also, these bounds make a lot more sense when dealing with many experiments over huge data sets (the essence of Computer Science) than adopting the convention from statistics, which asks for bounds that hold with 95% probability.

Since we are usually happy with whp bounds, one often hears that Chernoff is morally equivalent to O(lg n)-moments. Indeed, such a moment will give us a bound of the form 1/μ^{O(lg n)}, which is high probability even in the hardest case when μ is a constant.

The paper of [Schmidt, Siegel, Srinivasan SODA'93] is often cited for this. Their Theorem 5 shows that you can get the same bounds as Chernoff (up to constants) if you look at a high enough moment.

Thursday, January 21, 2010

Applications

This is the time when many young people are fretting about their applications, be they for undergrad admission, PhD admission, or academic jobs. (At the same time, marginally older people are happy to be done with recommendation letters.)

First of all, I wanted to suggest that you channel the energy (or, should I say, agitation) of this moment into positive directions. My interview spring was also when I came up with some pretty neat ideas and wrote 4 FOCS papers.

Second of all, I wanted to share an anecdote that I heard during my job hunt, which may or may not help. As the story goes, a department chair comes back after the break and finds a couple of hundred applications sitting on his desk. He looks at the pile, shakes his head, and promptly commits half of them to the recycle bin. He then announces that, "This university does not need unlucky people."

If you doubt that this could be true, let me offer statistical evidence from my wife's hospital: on her year, the doctors that got jobs were Maria, Miriam, Mariam, and Milos. (Evidently, the secretary had sorted the pile before putting it on the chair's desk.)

Amusingly, their workplace also offers the best illustration of a hash function collision that I have yet seen. They have three Romanian doctors which go by Mira, Mira, and Mera (hash codes for Mirabela, Miriam, and Merima -- all exceedingly rare names in Romania).