Monday, September 27, 2010

Retrieval-Only Dictionaries

We saw two cool applications of dictionaries without membership; now it's time to construct them. Remember that we are given a set S, where each element xS has some associated data[x], a k-bit value. We want a data structure of O(nk) bits which retrieves data[x] for any xS and may return garbage when queried for xS.

A conceptually simple solution is the "Bloomier filter" of [Chazelle, Kilian, Rubinfeld, Tal SODA'04]. This is based on the power of two choices, so you should first go back and review my old post giving an encoding analysis of cuckoo hashing.

Standard cuckoo hashing has two arrays A[1..2n], B[1..2n] storing keys, and places a key either at A[h(x)] or B[g(x)]. Instead of this, our arrays A and B will store k-bit values (O(nk) bits in total), and the query retrieve-data(x) will return A[h(x)] xor B[g(x)].

The question is whether we can set up the values in A and B such that any query xS returns data[x] correctly. This is a question about the feasibility of a linear system with n equations (one per key) and 4n variables (one per array entry).

Consider a connected component in the bipartite graph induced by cuckoo hashing. If this component is acyclic, we can fix A and B easily. Take an arbitrary node and make it "zero"; then explore the tree by DFS (or BFS). Each new node (an entry in A or B) has a forced value, since the edge advancing to it must return some data[x] and the parent node has been fixed already. As the component is acyclic, there is only one constraint on every new node, so there are no conflicts.

On the other hand, if a component has a cycle, we are out of luck. Remark that if we xor all cycle nodes by some Δ, the answers are unchanged, since the Δ's cancel out on each edge. So a cycle of length k must output k independent data values, but has only k-1 degrees of freedom.

Fortunately, one can prove the following about the cuckoo hashing graph:
  • the graph is acyclic with some constant probability. Thus, the construction algorithm can rehash until it finds an acyclic graph, taking O(n) time in expectation.

  • the total length of all cycles is O(lg n) with high probability. Thus we can make the graph acyclic by storing O(lg n) special elements in a stash. This gives construction time O(n) w.h.p., but the query algorithm is slightly more complicated (for instance, it can handle the stash by a small hash table on the side).
These statements fall out naturally from the encoding analysis of cuckoo hashing. A cycle of length k allows a saving of roughly k bits in the encoding: we can write the k keys on the cycle (klg n bits) plus the k hash codes (klg(2n) bits) instead of 2k hash codes (2klg(2n) bits).

Further remarks. Above, I ignored the space to store the hash functions h and g. You have to believe me that there exist families of hash functions representable in O(nε) space, which can be evaluated in constant time, and make cuckoo hashing work.

A very interesting goal is to obtain retrieval dictionaries with close to kn bits. As far as I know, the state of the art is given by [Pagh-Dietzfelbinger ICALP'08] and [Porat].


Pall Melsted said...

Excellent post! Do you know if there are any results on the dynamic problem? The dynamic part here would be restricted to insert-only and updating values stored.

Mihai said...

Excellent question ;)

In fact the dynamic problem was solved in two of my papers: On Dynamic Range Reporting in One Dimension and De Dictionariis Dynamicis Pauco Spatio Utentibus