In the previous post I described 3 ways of making the "van Emde Boas data structure" take linear space. I use quotes since there is no unique vEB structure, but rather a family of data structures inspired by the FOCS'75 paper of van Emde Boas. By the way, if you're curious who van Emde Boas is, here is a portrait found on his webpage.

In this post, I will describe a 4th method. You'd be excused for asking what the point is, so let me quickly mention that this technique has a great application (1D range reporting, which I will discuss in another post) and it introduces a nice tools you should know.

Here is a particularly simple variant of vEB, introduced by Willard as the "y-fast tree". Remember from the last post that the trie representing the set has

*n*-1 branching nodes connected by 2*n*-1 "active" paths; if we know the lowest branching ancestor of the query, we can find the predecessor in constant time. Willard's approach is to store a hash table with all O(*n*lg*u*) active nodes in the trie; for each node, we store a pointer to its lowest branching ancestor. Then, we can binary search for the height of the lowest active ancestor of the query, and follow a pointer to the lowest branching node above. As the trie height is O(lg*u*), this search takes O(lglg*u*) look-ups in the hash table.Of course, we can reduce the space from O(

*n*lg*u*) to O(*n*) by bucketing. But let's try something else. We could break the binary search into two phases:- Find
*v*, the lowest active ancestor of the query at some depth of the form i·√lg*u*(binary search on*i*). Say*v*is on the path*u*→*w*(where*u*,*w*are branching nodes). If*w*is not an ancestor of the query, return*u.* - Otherwise, the lowest branching ancestor of the query is found at some depth in [ i·√lg
*u*, (i+1)√lg*u*]. Binary search to find the lowest active ancestor in this range, and follow a pointer to the lowest active ancestor.

With this modification, we only need to store O(

*n*√lg*u*) active nodes in the hash table! To support step 1., we need active nodes at depths i·√lg*u*. To support step 2., we need active nodes whose lowest branching ancestor is only ≤ √lg*u*levels above. All other active nodes can be ignored.You could bring the space down to O(

*n*lg^{ε}*u*) by breaking the search into more segments. But to bring the space down to linear, we use heavier machinery:**Retrieval-only dictionaries.**Say we want a dictionary ("hash table") that stores a set of

*n*keys from the universe [

*u*], where each key has

*k*bits of associated data. The dictionary supports two operations:

- membership: is
*x*in the set? - retrieval: assuming
*x*is in the set, return data[*x*].

If we want to support both operations, the smallest space we can hope for is log(

*u*choose*n*) +*nk*≈*n*(lg*u*+*k*)*bits: the data structure needs to encode the set itself, and the data.*Somewhat counterintuitively, dictionaries that only support retrieval (without membership queries) are in fact useful. (The definition of such a dictionary is that retrieve(

*x*) may return garbage if*x*is not in the set.)Retrieval-only dictionaries can be implemented using only O(

*nk*) bits. I will describe this in the next post, but I hope it is believable enough.When is a retrieval-only dictionary helpful? When we can verify answers in some other way. Remember the data structure with space O(

*n*√lg*u*) from above. We will store branching nodes in a real hash table (there are only*n*-1 of them). But observe the following about the O(*n*√lg*u*) active nodes that we store:- We only need
*k*=O(lglg*u*) bits of associated data. Instead of storing a pointer to the lowest branching ancestor, we can just store the height difference (a number between 1 and lg*u*). This is effectively a pointer: we can compute the branching ancestor by zeroing out so many bits of the node. - We only need to store them in a retrieval-only dictionary. Say we query some node
*v*and find a height difference δ to the lowest branching ancestor. We can verify whether*v*was real by looking up the δ-levels-up ancestor of*v*in the hash table of branching nodes, and checking that*v*lies on one of the two paths descending from this branching node.

Therefore, the dictionary of active nodes only requires O(

*n*√lg*u*· lglg*u*) bits, which is o(*n*) words of space! This superlinear number of nodes take negligible space compared to the branching nodes.
## 1 comment:

Hi Mihai. Just a comment (since nobody else is leaving one) that I am greatly enjoying this set of technical posts you are writing. Thanks!

Michael

Post a Comment