Sunday, September 19, 2010

Van Emde Boas and its space complexity

In this post, I want to describe 3 neat and very different ways of making the space of the van Emde Boas (vEB) data structure linear. While this is not hard, it is subtle enough to confuse even seasoned researchers at times. In particular, it is the first bug I ever encountered in a class: Erik Demaine was teaching Advanced Data Structures at MIT in spring of 2003 (the first grad course I ever took!), and his solution for getting linear space was flawed.

Erik is the perfect example of how you can get astronomically high teaching grades while occasionally having bugs in your lectures. In fact, I sometimes suspected him of doing it on purpose: deliberately letting a bug slip by to make the course more interactive. Perhaps there is a lesson to be learned here.


Here is a quick review of vEB if you don't know it. Experienced readers can skip ahead.

The predecessor problem is to support a set S of |S|=n integers from the universe {1, ..., u} and answer:
predecessor(q) = max { x ∈ S | x ≤ q }
The vEB data structure can answer queries in O(lglg u) time, which is significantly faster than binary search for moderate universes.

The first idea is to divide the universe into √u segments of size √u. Let hi(x) = ⌊x/√u⌋ be the segment containing x, and lo(x) = x mod √u be the location of x within its segment. The data structure has the following components:
  • a hash table H storing hi(x) for all x ∈ S.
  • a top structure solving predecessor search among { hi(x) | x ∈ S }. This is the same as the original data structure, i.e. use recursion.
  • for each element α ∈ H, a recursive bottom structure solving predecessor search inside the α segment, i.e. among the keys { lo(x) | x ∈ S and hi(x) = α }.
The query algorithm first checks if hi(q) ∈ H. If so, all the action is in q's segment, so you recurse in the appropriate bottom structure. (You either find its predecessor there, or in the special case when q is less than the minimum in that segment, find the successor and follow a pointer in a doubly linked list.)

If q's segment is empty, all the action is at the segment level, and q's predecessor is the max in the preceding non-empty segment. So you recurse in the top structure.

In one step, the universe shrinks from u to √u, i.e. lg u shrinks to ½ lg u. Thus, in O(lglg u) steps the problem is solved.
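As a rough illustration of the review above, here is a minimal Python sketch. It is not the exact structure from the lecture: the universe is assumed to be {0, ..., 2^bits - 1}, a Python dict plays the role of the hash table H, and every structure caches its min and max. Instead of following a linked-list pointer in the special case, the query compares against the segment's min in O(1) and recurses in the top structure. The class name VEB and its methods are made up for this example. Note that insert deliberately uses the naive layout, in which a key goes into both a top and a bottom structure.

```python
class VEB:
    """Naive vEB sketch over the universe {0, ..., 2**bits - 1}."""

    def __init__(self, bits):
        self.bits = bits
        self.min = None               # smallest key stored here
        self.max = None               # largest key stored here
        self.lo_bits = bits // 2      # lo(x) keeps this many low bits
        self.bottom = {}              # hash table H: segment -> bottom structure
        # top structure over the non-empty segment numbers (none for 1-bit universes)
        self.top = VEB(bits - self.lo_bits) if bits > 1 else None

    def hi(self, x):
        return x >> self.lo_bits

    def lo(self, x):
        return x & ((1 << self.lo_bits) - 1)

    def insert(self, x):
        if self.min is None or x < self.min:
            self.min = x
        if self.max is None or x > self.max:
            self.max = x
        if self.bits == 1:
            return                    # min and max describe a subset of {0, 1}
        h, l = self.hi(x), self.lo(x)
        if h not in self.bottom:
            self.bottom[h] = VEB(self.lo_bits)
            self.top.insert(h)        # first key in this segment
        self.bottom[h].insert(l)

    def predecessor(self, q):
        """Largest stored key <= q, or None if there is none."""
        if self.min is None or q < self.min:
            return None
        if q >= self.max:
            return self.max
        if self.bits == 1:
            return self.min           # here min <= q < max
        h, l = self.hi(q), self.lo(q)
        seg = self.bottom.get(h)
        if seg is not None and seg.min <= l:
            # all the action is in q's segment
            return (h << self.lo_bits) | seg.predecessor(l)
        # q's segment has nothing <= q: take the max of the preceding
        # non-empty segment (it exists because self.min <= q)
        p = self.top.predecessor(h - 1)
        return (p << self.lo_bits) | self.bottom[p].max


t = VEB(8)
for key in (3, 9, 20, 200):
    t.insert(key)
print(t.predecessor(19))   # 9
print(t.predecessor(2))    # None
```

Each call to predecessor makes at most one recursive call, on a universe with half as many bits, which is where the O(lglg u) bound comes from.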


So what is the space of this data structure? As described above, each key appears in the hash table, and in 2 recursive data structures. So the space per key obeys the recursion S(u) = 1 + 2 S(√u). Taking logs: S'(lg u) = 1 + 2 S'(½ lg u), so the space is O(lg u) per key.
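To see the recursion concretely, the short sketch below evaluates S'(w) = 1 + 2 S'(w/2) for w = lg u. The base case S'(1) = 1 and the restriction to powers of two are assumptions made only to keep the arithmetic clean.

```python
def copies_per_key(w):
    """Evaluate S'(w) = 1 + 2 * S'(w / 2) with the assumed base case S'(1) = 1."""
    if w <= 1:
        return 1
    return 1 + 2 * copies_per_key(w // 2)


for w in (1, 2, 4, 8, 16, 32, 64):
    print(w, copies_per_key(w))   # the second column equals 2 * w - 1
```

Under these assumptions the solution is exactly 2 lg u - 1, matching the O(lg u) bound per key.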

How can we reduce this to space O(n)? Here are 3 very different ways:

Brutal bucketing. Group elements into buckets of O(lg u) consecutive elements. From each bucket, we insert the min into a vEB data structure. Once we find a predecessor in the vEB structure, we know the bucket where we must search for the real predecessor. We can use binary search inside the bucket, taking time O(lglg u). The vEB structure holds only n/lg u keys at O(lg u) space each, and the buckets themselves take O(n), so the total space is (n/lg u) · lg u + O(n) = O(n).
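A minimal static sketch of the bucketing idea follows. To stay self-contained it uses a sorted list with binary search as a stand-in for the vEB structure over the bucket minima; that substitution, the class name BucketedPredecessor, and the bucket_size parameter are assumptions for illustration only.

```python
from bisect import bisect_right


class BucketedPredecessor:
    """Static predecessor search: bucket minima on top, sorted buckets below."""

    def __init__(self, keys, bucket_size):
        keys = sorted(set(keys))
        # buckets of consecutive elements; bucket_size plays the role of lg u
        self.buckets = [keys[i:i + bucket_size]
                        for i in range(0, len(keys), bucket_size)]
        # stand-in for the vEB structure: it only ever sees one key per bucket
        self.mins = [bucket[0] for bucket in self.buckets]

    def predecessor(self, q):
        """Largest key <= q, or None if every key is larger than q."""
        i = bisect_right(self.mins, q) - 1      # predecessor among bucket minima
        if i < 0:
            return None
        bucket = self.buckets[i]
        # bucket[0] <= q, so the binary search always finds an element
        return bucket[bisect_right(bucket, q) - 1]


index = BucketedPredecessor([5, 17, 23, 42, 99, 100, 512], bucket_size=3)
print(index.predecessor(50))   # 42
print(index.predecessor(4))    # None
```

Only the list of minima would be replaced by a real vEB structure; the buckets stay plain sorted arrays.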

Better analysis. In fact, the data structure from above does take O(n) space if you analyze it better! For each segment, we need to remember the max inside the segment in the hash table, since a query in the top structure must translate the segment number into the real predecessor. But then there's no point in putting the max in the bottom structure: once the query accesses the hash table, it can simply compare with the max in O(1) time. (If the query is higher than the max in its segment, the max is the predecessor.)

In other words, every key is stored recursively in just one data structure: either the top structure (for each segment max) or the bottom structure (for all other keys). This means there are O(lglg u) copies of each element, so space O(n lglg u).

But note that copies get geometrically cheaper! At the first level, keys are lg u bits. At the second level, they are only ½ lg u bits; etc. Thus, the cost per key, in bits, is a geometric series, which is bounded by O(lg u). In other words, the cost is only O(1) words per key. (You may ask: even if the cost of keys halves every time, what about the cost of pointers, counters, etc? The cost of a pointer is O(lg n) bits, and n ≤ u in any recursive data structure.)
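The geometric series is easy to check numerically. The sketch below sums the key sizes over the levels of the recursion, assuming one copy per level and rounding half-sizes down; the function name is made up for this example.

```python
def bits_per_key(w):
    """Total bits for one copy per level: w + w/2 + w/4 + ... (rounded down)."""
    total = 0
    while w >= 1:
        total += w
        w //= 2
    return total


print(bits_per_key(64))   # 127, i.e. less than 2 * 64 bits, or two 64-bit words
```

So with w = lg u bits per word, the lglg u copies of a key together fit in fewer than 2 words.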

Be slick. Here's a trickier variation due to Rasmus Pagh. Consider the trie representing the set of keys (a trie is a perfect binary tree of depth lg u in which each key is a root-to-leaf path). The subtree induced by the keys has n-1 branching nodes, connected by 2n-1 unbranching paths. It suffices to find the lowest branching node above the query. (If each branching node stores pointers to its children, and the min and max values in its subtree, we can find the predecessor with constant work after knowing the lowest branching node.)
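The count of branching nodes can be checked with a few lines of Python. The sketch relies on one observation: every branching node is the lowest common ancestor of two keys that are adjacent in sorted order, namely the max of its left subtree and the min of its right subtree. Representing a node as a (prefix, height) pair is a choice made for this example only.

```python
def branching_nodes(keys):
    """Return the branching nodes of the binary trie as (prefix, height) pairs."""
    keys = sorted(set(keys))
    nodes = set()
    for a, b in zip(keys, keys[1:]):
        # a and b agree on all bits above their highest differing bit
        height = (a ^ b).bit_length()       # number of bits below the branching node
        nodes.add((a >> height, height))    # common prefix identifies the node
    return nodes


print(len(branching_nodes([1, 5, 6, 12])))   # 3 branching nodes for n = 4 keys
```

Distinct adjacent pairs diverge at distinct nodes, which gives n-1 branching nodes.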

We can afford space O(1) per path. The data structure stores:
  • a top structure, with all paths that begin and end above height ½ lg u.
  • a hash table H with the nodes at depth ½ lg u of every path crossing this depth.
  • for each α∈H, a bottom structure with all paths starting below depth ½ lg u which have α as prefix.
Observe that each path is stored in exactly one place, so the space is linear. But why can we query for the lowest branching node above some key? As the query proceeds, we keep a pointer p to the lowest branching node found so far. Initially p is the root. Here is the query algorithm:
  • if p is below depth ½ lg u, recurse in the appropriate bottom structure. (We have no work to do on this level.)
  • look in H for the node above the query at depth ½ lg u. If not found, recurse in the top structure. If found, let p be the bottom node of the path crossing depth ½ lg u which we just found in the hash table. Recurse to the appropriate bottom structure.
The main point is that a path is only relevant once, at the highest level of the recursion where the path crosses the middle line. At lower levels the path cannot be queried, since if you're on the path you already have a pointer to the node at the bottom of the path!


JeffE said...

Erik is the perfect example of how you can get astronomically high teaching grades while occasionally having bugs in your lectures.

I think there's a stronger principle at work here: Almost all the instructors I know with astronomically high teaching grades have buggy lectures. Conversely, almost all the instructors I know with bug-free lectures have merely terrestrial teaching scores.

(See also: confirmation bias.)

Another Example said...

Patrick Winston for example frequently misspells simple words on the blackboard to keep students attentive.

Mihai said...

A correction: Method 3 is originally due to Milan Ružić, from "Making deterministic signatures quickly".