Wednesday, January 27, 2010

Basic Hashtables

To understand the state of the art in hash tables, you must understand the holy trinity of the area: chaining, linear probing, and cuckoo hashing. Chaining is the one that amateurs know, and shows up frequently in code. Linear probing is what you use when performance really matters. And cuckoo hashing is the theoretician's darling, providing the playground for a constant stream of papers.

Here is a basic description of the three hash tables, if you don't know them. There are, of course, many variations.

chaining
Each item x is hashed to one of b bins, where b=Ω(n). Each bin is stored as a linked list, with pointers to the head of each list stored in an array A[1..b]. In practice, you would store the head of each list in A[i], to save a pointer and a cache miss.

linear probing
We hold an array A[1..b] of records, where b ≥ (1+ε)n. When inserting x, we try to place it at A[h(x)]; if that location is empty, try A[h(x)+1], A[h(x)+2], ..., until you find an empty location. This addresses the main performance issues of chaining: there are no cache misses (we walk a contiguous region, not a linked list), and the space is better (no pointers). But, intuitively, it demands much more robustness from the hash function: now some elements hashing to location k can interfere negatively with elements hashing to a close k'.

cuckoo hashing
We hold two arrays A[1..b] and B[1..b] and use two hash functions, h and g. When x arrives, we try to put it at A[h(x)]. If that location already contains y, try to move y to B[g(y)]. If that location already contains z, try to move z to A[h(z)], and so on until you find a free spot. Observe that the query for x is worst-case constant time: just look for x in A[h(x)] and B[g(x)]!
Let us assume, for now, that hash functions are truly random. Moment bounds, which I discussed last time, are useful for understanding both chaining and linear probing (see below). For cuckoo hashing, the analysis is quite different.

Chaining. It is trivial to argue that the expected running time of insertions
and deletions is constant. Focus on some element q. For i≠q, let Xi be the indicator that h(i)=h(q). Then, the time it takes to insert or query q is O(1 + ΣXi).

Therefore, the expected time is bounded by E[ΣXi] = ΣE[Xi] = n/b = O(1), since h(x)=h(i) only happens with probability 1/b.

What we have just argued is that the expected number of elements that collide with x is O(1). Another way to state this is that the variance of a bin's size is O(1), a fact that we proved last time. To see this connection, let Bi be the number of elements in bin i. Observe that:
E[Σ(Bi)2] = n + E[#colliding pairs] = n + n · E[#elements colliding with x] = n + n2/b
By uniformity of the hash function, E[(Bi)2] = n/b + n2/b2. We have obtained the variance: Var[Bi] = E[(Bi)2] - E[Bi]2 = n/b.


Perfect hashing. A very cool consequence of this variance analysis is the well-known dictionary of [Fredman, Komlós, Szemerédi FOCS'82]. Their idea was to construct a static dictionary using randomization, but then have the query be completely deterministic. (Later work has focused on obtaining deterministic queries even in dynamic dictionaries, as in cuckoo hashing, and on completely eliminating randomness.)

The basic idea is that, if we had space 2n2, perfect static dictionaries would be trivial. Indeed, the expected number of collisions is n2 / b = 1/2, so, by Markov, the hash function is collision-free with probability at least 1/2. For the construction, we can simply generate hash functions until we see a perfect one (a constant number of iterations, in expectation).

To bring the space down to O(n), remember that our variance analysis showed E[Σ(Bi)2] = O(n). Thus, instead of storing the items mapping to A[i] as a linked list, we should store a mini-hashtable of quadratic size inside each A[i]. These mini-tables provide perfect hashing, but their total size is just linear!


Linear probing. The relevance of moments to linear probing was only recognized in a recent breakthrough paper [Pagh, Pagh, Ruzic STOC'07]. I will show the analysis for b=3n to ease notation; it is easy to extend to any load.

In true data-structures style, we consider a perfect binary tree spanning the array A[1..b]. A node on level k has 2k array positions under it, and (1/3)·2k items were originally hashed to them in expectation. (Here I am counting the original location h(x) of x, not where x really appears, which may be h(x)+1, h(x)+2, ...). Call the node "dangerous" if at least (2/3)·2k elements hashed to it.

Now say that we are dealing with element q (a query or an update). We must bound the contiguous run of elements that contain the position h(q). The key observation is that, if this run contains between 2k and 2k+1 elements, either the ancestor of h(q) at level k-2 is dangerous, or one of its siblings in an O(1) neighborhood is dangerous.

Let's say this run goes from A[i] to A[j], i≤h(q)≤j. The interval [i,j] is spanned by 4—9 nodes on level k-2. Assume for contradiction that none are dangerous. The first node, which is not completely contained in the interval, contributes less than (2/3)·2k-2 elements to the run (it the most extreme case, this many elements hashed to the last location of that node). But the next nodes all have more than 2k-2/3 free locations in their subtree, so 2 more nodes absorb all excess elements. Thus, the run cannot go on for 4 nodes, contradiction.

Now, the expected running time of an operation is clearly:
Σk O(2k)·Pr[h(q) is in a run of 2k to 2k+1 elements].
As argued above, this probability is at most O(1) times the probability that a designated node at level k-2 is dangerous.

The rest is a simple balls-in-bins analysis: we want the probability that a bin, of expected size μ=2k-2/3, actually contains 2μ elements. Last time, we showed that Chebyshev bounds this probability by O(1/μ). Unfortunately, this is not enough, since Σk 2k·O(1/2k-2) = O(lg n).

However, if we go to the 4th moment, we obtain a probability bound of O(1/μ2). In this case, the running time is Σk2k·O(1/22(k-2)) = Σ O(2-k) = O(1). So the 4th moment is enough to make this series decay geometrically.

12 comments:

Unknown said...

Very nice. I can't wait for the next installment.

Anonymous said...

Can you please more details on the following

E[Σ(Bi)^2] = n + E[#colliding pairs] = n + n · E[#elements colliding with x] = n + n^2/b

Anonymous said...

I meant
"Can you please GIVE more details on the following"

sorry about that.

Mihai said...

Sure, requests for clarifications are always welcome.

If Bi is the number of elements in bin i, then the number of pairs (i,j) that collide, in the sense that h(i)=h(j), can be written as Σ(Bi)^2 - n. If some bin has 7 elements, for instance, it account for 7*6 = 7^2 - 7 collisions. I have thus shown E[Σ(Bi)^2] = n + E[#colliding pairs]

Now E[#colliding pairs] = E[#colliding pairs of the form (1,*)] + E[#colliding pairs of the form (2,*)] + ...

So E[#colliding pairs] = n · E[#elements colliding with some fixed element]

But the number of elements colliding with some fixed element is just what I calculated above, i.e. n/b.

Does this make sense now?

Anonymous said...

Thanks a lot for the clarifications.

I have one more question, though.
B_i is what you call X in the first post, right? However, the variance of X is somewhat different. Am I missing something?

Mihai said...

B_i is what you call X in the first post, right?

Yes.

However, the variance of X is somewhat different.

Hopefully not :) I proved in the first post that Var[X] = O(n/b). Here I showed it without a constant, Var[B_i] = n/b.

Anonymous said...

Most probably I miss something but please allow me to continue.
If we work without O(), from the first post we have that
E[Y_i^2] = 1/b - 1/b^2
and Σ E[Y_i^2] = Var[X] = n/b - n/b^2

It is O(n/b) but not n/b as you proved for B

Mihai said...

You are absolutely right, I can't calculate :) The mistake is:

E[#elements colliding with x] = n/b.

It is in fact (n-1)/b, since there are only n-1 other elements...

Rasmus Pagh said...

Thanks for this nice expository post, Mihai! It compensates for my own lack of doing more on that front :-) A couple of comments on what to do when speed really matters:

- If the items in your hash table are big, don't use linear probing, as the cache won't help you anyway.

- A proper implementation of cuckoo hashing can have essentially the same cache-efficiency as linear probing for lookups, as the memory probes can be pipelined. But insertions are cache-inefficient, so don't use when there are many updates.

- There are cases where chaining can be useful, e.g. if you are storing (large) variable-size data. It seems that only linear probing can accommodate variable-size data, among open addressing schemes, but the performance degenerates for large items.

Mihai said...

Hi Rasmus!

I slightly disagree on large items and linear probing. You are still getting the advantage of the prefetcher. In our tests, linear probing with in-table record is still the fastest for items of 256 bytes (this was quite surprising to us). After that, you can switch to pointers.

If you have large, variable-sized records, then I don't think there is any real difference between linear probing and chaining. You are traversing the same number of pointers.

Anonymous said...

Dear Mihai,

in the textbooks they say linear probing might not be so good due to clustering and recommend quadratic probing and "double hashing". Can you comment on that? I'm asking this because I regularly teach these matters and don't want to be obsolete.

Thanks, Martin

Mihai said...

Those textbooks are obsolete. If you're going to do something that is not cache friendly, you should just do cuckoo hashing which has O(1) query time in the worst-case. There's no reason to do quadratic probing, etc. Those date back to the days when we didn't know any better :)