Tuesday, September 21, 2010

vEB Space: Method 4

In the previous post I described 3 ways of making the "van Emde Boas data structure" take linear space. I use quotes since there is no unique vEB structure, but rather a family of data structures inspired by the FOCS'75 paper of van Emde Boas. By the way, if you're curious who van Emde Boas is, here is a portrait found on his webpage.

In this post, I will describe a 4th method. You'd be excused for asking what the point is, so let me quickly mention that this technique has a great application (1D range reporting, which I will discuss in another post) and it introduces a nice tools you should know.

Here is a particularly simple variant of vEB, introduced by Willard as the "y-fast tree". Remember from the last post that the trie representing the set has n-1 branching nodes connected by 2n-1 "active" paths; if we know the lowest branching ancestor of the query, we can find the predecessor in constant time. Willard's approach is to store a hash table with all O(n lg u) active nodes in the trie; for each node, we store a pointer to its lowest branching ancestor. Then, we can binary search for the height of the lowest active ancestor of the query, and follow a pointer to the lowest branching node above. As the trie height is O(lg u), this search takes O(lglg u) look-ups in the hash table.

Of course, we can reduce the space from O(n lg u) to O(n) by bucketing. But let's try something else. We could break the binary search into two phases:
1. Find v, the lowest active ancestor of the query at some depth of the form i·√lg u (binary search on i). Say v is on the path uw (where u, w are branching nodes). If w is not an ancestor of the query, return u.

2. Otherwise, the lowest branching ancestor of the query is found at some depth in [ i·√lg u , (i+1)√lg u ]. Binary search to find the lowest active ancestor in this range, and follow a pointer to the lowest active ancestor.
With this modification, we only need to store O(nlg u ) active nodes in the hash table! To support step 1., we need active nodes at depths i·√lg u. To support step 2., we need active nodes whose lowest branching ancestor is only ≤ √lg u levels above. All other active nodes can be ignored.

You could bring the space down to O(n lgεu) by breaking the search into more segments. But to bring the space down to linear, we use heavier machinery:

Retrieval-only dictionaries. Say we want a dictionary ("hash table") that stores a set of n keys from the universe [u], where each key has k bits of associated data. The dictionary supports two operations:
• membership: is x in the set?
• retrieval: assuming x is in the set, return data[x].
If we want to support both operations, the smallest space we can hope for is log(u choose n) + nk n(lg u + k) bits: the data structure needs to encode the set itself, and the data.

Somewhat counterintuitively, dictionaries that only support retrieval (without membership queries) are in fact useful. (The definition of such a dictionary is that retrieve(x) may return garbage if x is not in the set.)

Retrieval-only dictionaries can be implemented using only O(nk) bits. I will describe this in the next post, but I hope it is believable enough.

When is a retrieval-only dictionary helpful? When we can verify answers in some other way. Remember the data structure with space O(nlg u ) from above. We will store branching nodes in a real hash table (there are only n-1 of them). But observe the following about the O(nlg u ) active nodes that we store:
1. We only need k=O(lglg u) bits of associated data. Instead of storing a pointer to the lowest branching ancestor, we can just store the height difference (a number between 1 and lg u). This is effectively a pointer: we can compute the branching ancestor by zeroing out so many bits of the node.

2. We only need to store them in a retrieval-only dictionary. Say we query some node v and find a height difference δ to the lowest branching ancestor. We can verify whether v was real by looking up the δ-levels-up ancestor of v in the hash table of branching nodes, and checking that v lies on one of the two paths descending from this branching node.
Therefore, the dictionary of active nodes only requires O(nlg u · lglg u) bits, which is o(n) words of space! This superlinear number of nodes take negligible space compared to the branching nodes.

1 comment:

Michael Mitzenmacher said...

Hi Mihai. Just a comment (since nobody else is leaving one) that I am greatly enjoying this set of technical posts you are writing. Thanks!

Michael