Friday, February 12, 2010

The 6 Months News Cycle

The STOC accepted papers, with some pointers to online versions, can be found here. I had two papers: Towards Polynomial Lower Bounds for Dynamic Problems and Changing Base without Losing Space. More on them later.

The FOCS call for papers is here. An experimental feature is the lack of page limits. While nominally the reviewers are only obliged to look at the first 10 pages, we all know such formalism is irrelevant. Reviewers look at anywhere between 0.5 and 50 pages, until they form some opinion about the paper. As an author, I'm happy to stop playing the childish \baselinestretch and \appendix games, which were not helping anyone.

On the other hand, I vividly remember a discussion during my tenure on the FOCS PC when people made a strong case for a strict 10-page submission limit (no appendix). Many authors don't give a damn about teaching the reader anything, and seem driven by the desire to make their paper look as long and technical as possible. The reasoning was that a strict 10-page limit might actually force them to explain some ideas. Instead of a strict page limit, I would actually advocate using reviewers that are not easily intimidated, and stop accepting papers just because "it looks hard and these people seem to have put a lot of effort into it." In practice, of course, this ideal is hard to implement.

Forward by one conference, and I am on the SODA PC. This seems to be a standard punishment for having 4 papers in the conference. Speaking of which, I have recently uploaded slides for all these papers.

I am travelling to Europe in the coming weeks, so my series on hashing will continue at a slower pace.

Tuesday, February 2, 2010

Cuckoo Hashing

Today, we will prove bounds for the basic cuckoo hashing. We are going to place our n keys into two arrays A[1..b] and B[1..b], where b=2n. For this, we will use two hash functions, h and g. When some element x arrives, we first try to put it in A[h(x)]. If that location already contains some element y, try to move y to B[g(y)]. If that location already contains some z, try to move z to A[h(z)], and so on until you find a free spot.

The proper image to have in mind is that of a random bipartite graph. The graph will have b nodes on each side, corresponding to the locations of A and B. In this view, a key x is an edge from the left vertex h(x) to the right vertex g(x).

Simple paths. As a warm-up, let's deal with the case of simple paths: upon inserting x, the update path finds an empty spot without intersecting itself. It turns out that the update time of cuckoo hashing behaves like a geometric random variable:
The probability that insert(x) traverses a simple path of length k is 2-Ω(k).
I will prove this by a cute encoding analysis (you know I like encoding proofs). Let's say you want to encode the two hash codes for each of the n keys. As the hash functions h and g are truly random, this requires H=2nlg b bits on average (the entropy). But what if, whenever some event E happened, I could encode the hash codes using H-Δ bits? This would prove that Pr[E]=O(2): there are only O(2H-Δ) bad outcomes that lead to event E, out of 2H possible ones. Thus, the task of proving a probability upper bound becomes the task of designing an algorithm.

In our case, E={insert(x) traverses a simple path of length k} and we will achieve a saving of Δ=Ω(k). Here is what we put in the encoding:
  • one bit, saying whether the path grows from A[h(x)] or B[g(x)];
  • the value k, taking O(lg k) bits;
  • all edges of the path, in order, taking (k-1)lg n bits.
  • all vertices of the path, in order, taking (k+1)lg b bits.
  • the hash codes for all keys not on the path, in order, taking (n-k)·2lg b bits.
Observe that the hash codes of the k keys on the path are specified using essentially half the information, since a hash code is shared between two edges. Since lg n = lg(b/2) = lg b - 1, the encoding saves Δ=k-O(lg k) bits compared to H.

The intuition for why a k-path occurs with probability 2-Ω(k) is simple. Say I've reached edge y and I'm on the right side. Then, the probability that B[g(y)] is collision free is at least 1/2, since there are only n keys mapped to a space of 2n. In other words, at each point the path stops with probability half. This is exactly what the encoding is saying: we can save one bit per edge, since it takes lgn to encode an edge, but lg(2n) to encode an endpoint.

One cycle. Let us now deal with the case that the connected component of x contains one cycle. It is tempting to say that cuckoo hashing fails in this case, but it does not. Here is what might happen to the update path in case a cycle is part of the component (see figure):
  1. the path luckily avoids the cycle and finds a free location without intersection itself. Cool.
  2. the path reaches B[g(x)], which is occupied by some key y. Note that this has closed the cycle through the x edge, but the x edge is not actually traversed. Following y to A[h(y)] must eventually reach a free spot (no more cycles).
  3. the path intersects itself. Then, it will start backtracking, flipping elements back to their position before Insert(x). Eventually, it reaches A[h(x)], where the algorithm had originally placed x. Following the normal cuckoo rules, x is moved to B[g(x)] and the exploration from there on must find an empty spot.
As before, we want to show that an update takes time k with probability 2-Ω(k). In cases 1 and 2, we can simply apply the encoding from before, since the path is simple. In case 3, let ℓ be the number of edges until the path meets itself. The number of edges after B[g(x)] is at least k-2ℓ. Thus, we have a simple path from x of length max{ℓ, k-2ℓ} = Ω(k), so the old argument applies.

Two cycles. We now arrive at the cases when cuckoo hashing really fails: the bipartite graph contains as a subgraph (1) a cycles with a chord; or (2) two cycles connected by a path (possibly a trivial path, i.e. the cycles simply share a vertex).
From the figure we see that, by removing two edges, we can always turn the bad subgraph into two paths starting at x. We first encode those two paths as above, saving Ω(k), where k=size of the subgraph. Now we can add to the encoding the two infringing edges. For each, we can specify its identity with lg n bits, and its two end points with O(lg k) bits (a lower order loss compared to
the Ω(k) saving). In return, we know their two hash codes, which are worth 4lg b bits. Thus, the overall saving is at least 2lg n bits.

We have shown that an insertion fails with probability O(1/n2). By a union bound, cuckoo hashing will handle any fixed set of n elements with probability 1-O(1/n).

This bound is actually tight. Indeed, if three keys x,y,z have h(x)=h(y)=h(z) and g(x)=g(y)=g(z), then cuckoo hashing fails (this is the simplest obstruction subgraph). But such a bad event happens with probability (n choose 3)·b2 / b6 = Θ(1/n).

Better Guarantees for Chaining and Linear Probing

Last time we showed that chaining and linear probing enjoy O(1) expected running times per operation. However, this guarantee is fairly weak, and certainly does not explain the popularity of the schemes. If you had a computation that was expected to run for 4 hours, would you be happy to hear, say, that it might take 10 hours with 10% probability?

Worst case w.h.p. So, what bound can be put on the running time that hold with high probability? For chaining, simply apply the Chernoff bound. Given that a bin contains μ=O(1) elements in expectation, the probability that it contain Z elements is at most eZ-μ / (μ/Z)Z = (O(1)/Z)Z. To make this be n-c, set Z=Θ(lg n/lglg n).

Observe that this bound is tight. There are (n choose Z) ≈ nZ/Z! combinations of keys that can land in a designated bin, and some Z keys land there with probability b-Z=(2n)-Z. Thus, in expectation, at least one bin will see Z=Ω(lg n/lglg n) elements.

For linear probing, we can also reduce this to a balls-in-bins analysis, by defining a bin as L=O(lg n) consecutive locations in the array. In the running example b=2n, such a bin is full when Z=2μ=L. By Chernoff, this happens with probability eZ-μ / (μ/Z)Z = (e/4)L/2. Thus, the maximal run in O(lg n) w.h.p. Again, this is tight by a direct calculation.


Amortized bounds. Unfortunately, these worst-case1 bounds are not too satisfactory either, since O(lg n) is trivial to get by binary trees. If you grant me some lenience for my modeling, I will prove an O(1) amortized bound w.h.p., which I believe explains the power of the algorithms much better.
1 There is an unfortunate tendency of some TCS papers to use "worst case" when they really mean deterministic. I am much happier with the convention that worst case is opposite of amortized, at least when your paper has any connection to data structures.
Formally, I will prove the following, rather trivial statement:
Let T≥clg n, for a large enough constant c. Any T operations in chaining hashtables only touch O(T) memory w.h.p.
Fix the hash codes of the T elements, and define our "bin" to consist of these ≤T distinct hash codes. All other elements are still totally random, and we expect μ≤Tn/b=Θ(lg n) to fall into the bin. If μ≤Z/(2e), the Chernoff bound is eZ-μ/(μ/Z)Z ≤ 2 = high probability.

But does this actually mean that chaining has O(1) amortized running time? Formally, no: if I repeat a single query T times, the running time will be T times the running time of that query, i.e. a geometric random variable with no good concentration. Here is where I invoke a bit of lenience in my modeling: in practice, it is ridiculous to worry about repeating one of the last O(lg n) operations! The memory used by these recent updates will be fresh in the first level of cache, making a repetition cost essentially zero. (One may formally say that chaining has amortized O(1) running time w.h.p. in the external memory model with a cache of size Ω(lg n).)

A pretty way to understand the amortized bound is as a "buffer size" guarantee. The most demanding applications of hashtables are in analyzing a continuous stream of data, when operations need to be super-fast to keep up with the line speed. In such application, if the design is at all sensible, there will be a buffer between the network interface and our CPU. The goal is not necessarily to take O(1) time for every single operation, but to keep the buffer small. Our proof says that the buffer will not grow to more than T=O(lg n) w.h.p., if you can afford the average time/operation.

For linear probing, we can instead show:
Let T=Ω(lg3n). Any T operations in chaining hashtables only touch O(T) memory w.h.p.
Remember from last time that we analyzed linear probing by building a binary tree over the array, and bounding the number of nodes that become dangerous (two-thirds full).

Let μ be the number of keys we expect under some node. First of all, if μ≫lg n, we do not need to worry about the node: it doesn't become dangerous w.h.p. Otherwise, we showed that the node becomes dangerous with probability O(1/μ2); if it does, we will pay a cost of μ.

Looking at T elements, I am dealing with ≤T nodes on each level, and I expect O(T/μ2) nodes to be dangerous. As long as T/μ2clg n, Chernoff tells me that only O(T/μ2) nodes are dangerous w.h.p. Since we only deal with μ=O(lg n), I needed to set T=Ω(lg3n). With this bound, the number of memory locations accessed by the T operations is Σμ=2^i O(T/μ2)·μ = O(T) w.h.p.

I end with a question to my knowledgeable readers. By being more careful, I can prove T=Ω(lg1+εn) suffices for linear probing. Is it possible to prove T=Ω(lg n), as in chaining? (Perhaps I'm missing something obvious.)