To understand the state of the art in hash tables, you must understand the holy trinity of the area: chaining, linear probing, and cuckoo hashing. Chaining is the one that amateurs know, and shows up frequently in code. Linear probing is what you use when performance really matters. And cuckoo hashing is the theoretician's darling, providing the playground for a constant stream of papers.
Here is a basic description of the three hash tables, if you don't know them. There are, of course, many variations.
- chaining
- Each item x is hashed to one of b bins, where b=Ω(n). Each bin is stored as a linked list, with pointers to the head of each list stored in an array A[1..b]. In practice, you would store the head of each list in A[i], to save a pointer and a cache miss.
- linear probing
- We hold an array A[1..b] of records, where b ≥ (1+ε)n. When inserting x, we try to place it at A[h(x)]; if that location is empty, try A[h(x)+1], A[h(x)+2], ..., until you find an empty location. This addresses the main performance issues of chaining: there are no cache misses (we walk a contiguous region, not a linked list), and the space is better (no pointers). But, intuitively, it demands much more robustness from the hash function: now some elements hashing to location k can interfere negatively with elements hashing to a close k'.
- cuckoo hashing
- We hold two arrays A[1..b] and B[1..b] and use two hash functions, h and g. When x arrives, we try to put it at A[h(x)]. If that location already contains y, try to move y to B[g(y)]. If that location already contains z, try to move z to A[h(z)], and so on until you find a free spot. Observe that the query for x is worst-case constant time: just look for x in A[h(x)] and B[g(x)]!
Chaining. It is trivial to argue that the expected running time of insertions
and deletions is constant. Focus on some element q. For i≠q, let Xi be the indicator that h(i)=h(q). Then, the time it takes to insert or query q is O(1 + ΣXi).
Therefore, the expected time is bounded by E[ΣXi] = ΣE[Xi] = n/b = O(1), since h(x)=h(i) only happens with probability 1/b.
What we have just argued is that the expected number of elements that collide with x is O(1). Another way to state this is that the variance of a bin's size is O(1), a fact that we proved last time. To see this connection, let Bi be the number of elements in bin i. Observe that:
E[Σ(Bi)2] = n + E[#colliding pairs] = n + n · E[#elements colliding with x] = n + n2/bBy uniformity of the hash function, E[(Bi)2] = n/b + n2/b2. We have obtained the variance: Var[Bi] = E[(Bi)2] - E[Bi]2 = n/b.
Perfect hashing. A very cool consequence of this variance analysis is the well-known dictionary of [Fredman, Komlós, Szemerédi FOCS'82]. Their idea was to construct a static dictionary using randomization, but then have the query be completely deterministic. (Later work has focused on obtaining deterministic queries even in dynamic dictionaries, as in cuckoo hashing, and on completely eliminating randomness.)
The basic idea is that, if we had space 2n2, perfect static dictionaries would be trivial. Indeed, the expected number of collisions is n2 / b = 1/2, so, by Markov, the hash function is collision-free with probability at least 1/2. For the construction, we can simply generate hash functions until we see a perfect one (a constant number of iterations, in expectation).
To bring the space down to O(n), remember that our variance analysis showed E[Σ(Bi)2] = O(n). Thus, instead of storing the items mapping to A[i] as a linked list, we should store a mini-hashtable of quadratic size inside each A[i]. These mini-tables provide perfect hashing, but their total size is just linear!
Linear probing. The relevance of moments to linear probing was only recognized in a recent breakthrough paper [Pagh, Pagh, Ruzic STOC'07]. I will show the analysis for b=3n to ease notation; it is easy to extend to any load.
In true data-structures style, we consider a perfect binary tree spanning the array A[1..b]. A node on level k has 2k array positions under it, and (1/3)·2k items were originally hashed to them in expectation. (Here I am counting the original location h(x) of x, not where x really appears, which may be h(x)+1, h(x)+2, ...). Call the node "dangerous" if at least (2/3)·2k elements hashed to it.
Now say that we are dealing with element q (a query or an update). We must bound the contiguous run of elements that contain the position h(q). The key observation is that, if this run contains between 2k and 2k+1 elements, either the ancestor of h(q) at level k-2 is dangerous, or one of its siblings in an O(1) neighborhood is dangerous.
Let's say this run goes from A[i] to A[j], i≤h(q)≤j. The interval [i,j] is spanned by 4—9 nodes on level k-2. Assume for contradiction that none are dangerous. The first node, which is not completely contained in the interval, contributes less than (2/3)·2k-2 elements to the run (it the most extreme case, this many elements hashed to the last location of that node). But the next nodes all have more than 2k-2/3 free locations in their subtree, so 2 more nodes absorb all excess elements. Thus, the run cannot go on for 4 nodes, contradiction.
Now, the expected running time of an operation is clearly:
Σk O(2k)·Pr[h(q) is in a run of 2k to 2k+1 elements].As argued above, this probability is at most O(1) times the probability that a designated node at level k-2 is dangerous.
The rest is a simple balls-in-bins analysis: we want the probability that a bin, of expected size μ=2k-2/3, actually contains 2μ elements. Last time, we showed that Chebyshev bounds this probability by O(1/μ). Unfortunately, this is not enough, since Σk 2k·O(1/2k-2) = O(lg n).
However, if we go to the 4th moment, we obtain a probability bound of O(1/μ2). In this case, the running time is Σk2k·O(1/22(k-2)) = Σ O(2-k) = O(1). So the 4th moment is enough to make this series decay geometrically.