Wednesday, September 29, 2010

Problem solving versus new techniques

This is a guest post by Mikkel Thorup:

--------------------------

I think there is nothing more inhibiting for problem solving than referees looking for new general techniques.

When I go to STOC/FOCS, I hope to see some nice solutions to important problems and some new general techniques. I am not interested in semi-new techniques for semi-important problems. A paper winning both categories is a wonderful but rare event.

Thus I propose a max-evaluation rather than a sum. If the strength of a paper is that it solves an important problem, then speculations on the generality of the approach are of secondary importance. Conversely, if the strength of the paper is some new general techniques, then I can forgive that it doesn't solve anything new and important.

One of the nice things about TCS is that we have problems that are important, not just as internal technical challenges, but because of their relation to computing. At the end of the day, we hope that our techniques will end up solving important problems.

Important problems should be solved whatever way comes natural. It may be deep problem specific understanding, and it may build on previous techniques. Why would we be disappointed if an old problem got solved by a surprising reuse of an old technique?

Elegant reuse of old techniques is not a bad thing. After all, when we develop new techniques, we are hoping they will be reused, and from a teaching perspective, it is great to discover that the same general technique applies to more diverse problems. The last thing we want is to encourage authors to hide reuse. Of course, reuse should be highly non-obvious for a strong conference, but for established open problems, non-obviousness will normally be self-evident.

Monday, September 27, 2010

Retrieval-Only Dictionaries

We saw two cool applications of dictionaries without membership; now it's time to construct them. Remember that we are given a set S, where each element xS has some associated data[x], a k-bit value. We want a data structure of O(nk) bits which retrieves data[x] for any xS and may return garbage when queried for xS.

A conceptually simple solution is the "Bloomier filter" of [Chazelle, Kilian, Rubinfeld, Tal SODA'04]. This is based on the power of two choices, so you should first go back and review my old post giving an encoding analysis of cuckoo hashing.

Standard cuckoo hashing has two arrays A[1..2n], B[1..2n] storing keys, and places a key either at A[h(x)] or B[g(x)]. Instead of this, our arrays A and B will store k-bit values (O(nk) bits in total), and the query retrieve-data(x) will return A[h(x)] xor B[g(x)].

The question is whether we can set up the values in A and B such that any query xS returns data[x] correctly. This is a question about the feasibility of a linear system with n equations (one per key) and 4n variables (one per array entry).

Consider a connected component in the bipartite graph induced by cuckoo hashing. If this component is acyclic, we can fix A and B easily. Take an arbitrary node and make it "zero"; then explore the tree by DFS (or BFS). Each new node (an entry in A or B) has a forced value, since the edge advancing to it must return some data[x] and the parent node has been fixed already. As the component is acyclic, there is only one constraint on every new node, so there are no conflicts.

On the other hand, if a component has a cycle, we are out of luck. Remark that if we xor all cycle nodes by some Δ, the answers are unchanged, since the Δ's cancel out on each edge. So a cycle of length k must output k independent data values, but has only k-1 degrees of freedom.

Fortunately, one can prove the following about the cuckoo hashing graph:
  • the graph is acyclic with some constant probability. Thus, the construction algorithm can rehash until it finds an acyclic graph, taking O(n) time in expectation.

  • the total length of all cycles is O(lg n) with high probability. Thus we can make the graph acyclic by storing O(lg n) special elements in a stash. This gives construction time O(n) w.h.p., but the query algorithm is slightly more complicated (for instance, it can handle the stash by a small hash table on the side).
These statements fall out naturally from the encoding analysis of cuckoo hashing. A cycle of length k allows a saving of roughly k bits in the encoding: we can write the k keys on the cycle (klg n bits) plus the k hash codes (klg(2n) bits) instead of 2k hash codes (2klg(2n) bits).

Further remarks. Above, I ignored the space to store the hash functions h and g. You have to believe me that there exist families of hash functions representable in O(nε) space, which can be evaluated in constant time, and make cuckoo hashing work.

A very interesting goal is to obtain retrieval dictionaries with close to kn bits. As far as I know, the state of the art is given by [Pagh-Dietzfelbinger ICALP'08] and [Porat].

Tuesday, September 21, 2010

Static 1D Range Reporting

Method 4 for implementing van Emde Boas with linear space, described in my last post, is due to [Alstrup, Brodal, Rauhe: STOC'01]. They worked on static range reporting in 1 dimension: preprocess a set of integers S, and answer query(a,b) = report all points in S ∩ [a,b]. This is easier than predecessor search: you can first find the predecessor of a and then output points in order until you exceed b. Using van Emde Boas, we would achieve a linear-space data structure with query time O(lglg u + k), where k is the number of points to be reported.

Alstrup, Brodal, and Rauhe showed the following surprising result:
Static 1D range reporting can be solved with O(n) space and O(1+k) query time.
I like this theorem a lot, since it is so easy to describe to anybody with minimal background in Computer Science, yet the result is not obvious. I have used it many times to answer questions like, "Tell me a surprising recent result from data structures."

The solution. We need a way to find some (arbitrary) key from S ∩ [a,b] in constant time. Once we have that, we can walk left and right in an ordered list until we go outside the interval.

Let's first see how to do this with O(n lg u) space; this was described by [Miltersen, Nisan, Safra, Wigderson: STOC'95]. Of course, we build the trie representing the set. Given the query [a,b] let us look at the lowest common ancestor (LCA) of a and b. Note that LCA(a,b) is a simple mathematical function of the integers a and b, and can be computed in constant time. (The height of the LCA is the most significant set bit in a xor b.)
  • if LCA(a,b) is a branching node, look at the two descendant branching nodes. If the interval [a,b] is nonempty, it must contain either the max in the tree of the left child, or the min in the tree of the right child.

  • if LCA(a,b) is an active node, go to its lowest branching ancestor, and do something like the the above.

  • if LCA(a,b) is not an active node, the interval [a,b] is certainly empty!
Thus, it suffices to find the lowest branching ancestor of LCA(a,b) assuming that LCA(a,b) is active. This is significantly easier than predecessor search, which needs the lowest branching ancestor of an arbitrary node.

The proposal of [Miltersen et al.] is to store all O(n lg u) active nodes in a hash table, with pointers to their lowest branching ancestors.

As in my last post, the technique of [Alstrup et al.] to achieve O(n) space is: store only O(nlg u) active nodes, and store them in a retrieval-only dictionary with O(lglg u) bits per node. We store the following active nodes:
  1. active nodes at depth i·√lg u ;
  2. active nodes less than √lg u levels below a branching node.
We first look up LCA(a,b) in the dictionary. If the lowest branching ancestor is less than √lg u levels above, LCA(a,b) is in the dictionary and we find the ancestor. If not, we truncate the depth of the LCA to a multiple of √lg u , and look up the ancestor at that depth. If [a,b] is nonempty, that ancestor must be an active node and it will point us to a branching ancestor.

vEB Space: Method 4

In the previous post I described 3 ways of making the "van Emde Boas data structure" take linear space. I use quotes since there is no unique vEB structure, but rather a family of data structures inspired by the FOCS'75 paper of van Emde Boas. By the way, if you're curious who van Emde Boas is, here is a portrait found on his webpage.

In this post, I will describe a 4th method. You'd be excused for asking what the point is, so let me quickly mention that this technique has a great application (1D range reporting, which I will discuss in another post) and it introduces a nice tools you should know.

Here is a particularly simple variant of vEB, introduced by Willard as the "y-fast tree". Remember from the last post that the trie representing the set has n-1 branching nodes connected by 2n-1 "active" paths; if we know the lowest branching ancestor of the query, we can find the predecessor in constant time. Willard's approach is to store a hash table with all O(n lg u) active nodes in the trie; for each node, we store a pointer to its lowest branching ancestor. Then, we can binary search for the height of the lowest active ancestor of the query, and follow a pointer to the lowest branching node above. As the trie height is O(lg u), this search takes O(lglg u) look-ups in the hash table.

Of course, we can reduce the space from O(n lg u) to O(n) by bucketing. But let's try something else. We could break the binary search into two phases:
  1. Find v, the lowest active ancestor of the query at some depth of the form i·√lg u (binary search on i). Say v is on the path uw (where u, w are branching nodes). If w is not an ancestor of the query, return u.

  2. Otherwise, the lowest branching ancestor of the query is found at some depth in [ i·√lg u , (i+1)√lg u ]. Binary search to find the lowest active ancestor in this range, and follow a pointer to the lowest active ancestor.
With this modification, we only need to store O(nlg u ) active nodes in the hash table! To support step 1., we need active nodes at depths i·√lg u. To support step 2., we need active nodes whose lowest branching ancestor is only ≤ √lg u levels above. All other active nodes can be ignored.

You could bring the space down to O(n lgεu) by breaking the search into more segments. But to bring the space down to linear, we use heavier machinery:

Retrieval-only dictionaries. Say we want a dictionary ("hash table") that stores a set of n keys from the universe [u], where each key has k bits of associated data. The dictionary supports two operations:
  • membership: is x in the set?
  • retrieval: assuming x is in the set, return data[x].
If we want to support both operations, the smallest space we can hope for is log(u choose n) + nk n(lg u + k) bits: the data structure needs to encode the set itself, and the data.

Somewhat counterintuitively, dictionaries that only support retrieval (without membership queries) are in fact useful. (The definition of such a dictionary is that retrieve(x) may return garbage if x is not in the set.)

Retrieval-only dictionaries can be implemented using only O(nk) bits. I will describe this in the next post, but I hope it is believable enough.

When is a retrieval-only dictionary helpful? When we can verify answers in some other way. Remember the data structure with space O(nlg u ) from above. We will store branching nodes in a real hash table (there are only n-1 of them). But observe the following about the O(nlg u ) active nodes that we store:
  1. We only need k=O(lglg u) bits of associated data. Instead of storing a pointer to the lowest branching ancestor, we can just store the height difference (a number between 1 and lg u). This is effectively a pointer: we can compute the branching ancestor by zeroing out so many bits of the node.

  2. We only need to store them in a retrieval-only dictionary. Say we query some node v and find a height difference δ to the lowest branching ancestor. We can verify whether v was real by looking up the δ-levels-up ancestor of v in the hash table of branching nodes, and checking that v lies on one of the two paths descending from this branching node.
Therefore, the dictionary of active nodes only requires O(nlg u · lglg u) bits, which is o(n) words of space! This superlinear number of nodes take negligible space compared to the branching nodes.

Sunday, September 19, 2010

Van Emde Boas and its space complexity

In this post, I want to describe 3 neat and very different ways of making the space of the van Emde Boas (vEB) data structure linear. While this is not hard, it is subtle enough to confuse even seasoned researchers at times. In particular, it is the first bug I ever encountered in a class: Erik Demaine was teaching Advanced Data Structures at MIT in spring of 2003 (the first grad course I ever took!), and his solution for getting linear space was flawed.

Erik is the perfect example of how you can get astronomically high teaching grades while occasionally having bugs in your lectures. In fact, I sometimes suspected him of doing it on purpose: deliberately letting a bug slip by to make the course more interactive. Perhaps there is a lesson to be learned here.

***

Here is a quick review of vEB if you don't know it. Experienced readers can skip ahead.

The predecessor problem is to support a set S of |S|=n integers from the universe {1, ..., u} and answer:
predecessor(q) = max { xS | xq }
The vEB data structure can answer queries in O(lglg u) time, which is significantly faster than binary search for moderate universes.

The first idea is to divide the universe into √u segments of size √u. Let hi(x) = ⌊x/√u⌋ be the segment containing x, and lo(x) = x mod √u be the location of x within its segment. The data structure has the following components:
  • a hash table H storing hi(x) for all xS.
  • a top structure solving predecessor search among { hi(x) | xS }. This is the same as the original data structure, i.e. use recursion.
  • for each element α∈H, a recursive bottom structure solving predecessor search inside the α segment, i.e. among the keys { lo(x) | xS and hi(x)=α }.
The query algorithm first checks if hi(q) ∈ H. If so, all the action is in q's segment, so you recurse in the appropriate bottom structure. (You either find its predecessor there, or in the special case when q is less than the minimum in that segment, find the successor and follow a pointer in a doubly linked list.)

If q's segment is empty, all the action is at the segment level, and q's predecessor is the max in the preceding non-empty segment. So you recurse in the top structure.

In one step, the universe shrinks from u to √u, i.e. lg u shrinks to ½ lg u. Thus, in O(lglg u) steps the problem is solved.

***

So what is the space of this data structure? As described above, each key appears in the hash table, and in 2 recursive data structures. So the space per key obeys the recursion S(u) = 1 + 2 S(√u). Taking logs: S'(lg u) = 1 + 2 S'(½ lg u), so the space is O(lg u) per key.

How can we reduce this to space O(n)? Here are 3 very different ways:

Brutal bucketing. Group elements into buckets of O(lg u) consecutive elements. From each bucket, we insert the min into a vEB data structure. Once we find a predecessor in the vEB structure, we know the bucket where we must search for the real predecessor. We can use binary search inside the bucket, taking time O(lglg u). The space is (n/lg u) ·lg u = O(n).


Better analysis. In fact, the data structure from above does take O(n) space if you analyze it better! For each segment, we need to remember the max inside the segment in the hash table, since a query in the top structure must translate the segment number into the real predecessor. But then there's no point in putting the max in the bottom structure: once the query accesses the hash table, it can simply compare with the max in O(1) time. (If the query is higher than the max in its segment, the max is the predecessor.)

In other words, every key is stored recursively in just one data structure: either the top structure (for each segment max) or the bottom structure (for all other keys). This means there are O(lglg u) copies of each element, so space O(n lglg u).

But note that copies get geometrically cheaper! At the first level, keys are lg u bits. At the second level, they are only ½ lg u bits; etc. Thus, the cost per key, in bits, is a geometric series, which is bounded by O(lg u). In other words, the cost is only O(1) words per key. (You may ask: even if the cost of keys halves every time, what about the cost of pointers, counters, etc? The cost of a pointer is O(lg n) bits, and nu in any recursive data structure.)


Be slick. Here's a trickier variation due to Rasmus Pagh. Consider the trie representing the set of keys (a trie is a perfect binary tree of depth lg u in which each key is a root-to-leaf path). The subtree induced by the keys has n-1 branching nodes, connected by 2n-1 unbranching paths. It suffices to find the lowest branching node above the query. (If each branching node stores a pointer to his children, and the min and max values in its subtree, we can find the predecessor with constant work after knowing the lowest branching node.)

We can afford space O(1) per path. The data structure stores:
  • a top structure, with all paths that begin and end above height ½ lg u.
  • a hash table H with the nodes at depth ½ lg u of every path crossing this depth.
  • for each α∈H, a bottom structure with all paths starting below depth ½ lg u which have α as prefix.
Observe that each path is stored in exactly one place, so the space is linear. But why can we query for the lowest branching node above some key? As the query proceeds, we keep a pointer p to the lowest branching node found so far. Initially p is the root. Here is the query algorithm:
  • if p is below depth ½ lg u, recurse in the appropriate bottom structure. (We have no work to do on this level.)
  • look in H for the node above the query at depth ½ lg u. If not found, recurse in the top structure. If found, let p be the bottom node of the path crossing depth ½ lg u which we just found in the hash table. Recurse to the appropriate bottom structure.
The main point is that a path is only relevant once, at the highest level of the recursion where the path crosses the middle line. At lower levels the path cannot be queried, since if you're on the path you already have a pointer to the node at the bottom of the path!

Tuesday, September 7, 2010

IOI Wrap-up

In the past 2 years, I have been a member of the Host Scientific Committee (HSC) of the IOI. This is the body that comes up with problems and test data. While it consists primarily of people from the host country (Bulgaria in 2009, Canada in 2010), typically the host will have a call-for-problems and invite the authors of problems they intend to use.

This year, I was elected member of the International Scientific Committee (ISC). This committee works together with the HSC on the scientific aspects, the hope being that a perenial body will maintain similar standards of quality from one year to another. There are 3 elected members in the ISC, each serving 3-year terms (one position is open each year).

I anticipate this will be a lot of fun, and you will probably hear more about the IOI during this time. When a call for problems comes up (will be advertised here), do consider submitting!

I will end with an unusual problem from this IOI:
Consider the largest 50 languages on Wikipedia. We picked 200 random articles in each language, and extracted an excerpt of 100 consecutive characters from each. You will receive these 10000 texts one at a time in random order, and for each you have to guess its language. After each guess, your algorithm learns the correct answer. The score is the percentage of correct guesses.

To discourage students from coding a lot of special rules, a random permutation is applied to the Unicode alphabet, and the language IDs are random values. So, essentially, you start with zero knowledge.
Considering the tiny amount of training data and the real-time nature of guessing, one might not expect too good solutions. However, it turns out that one can get around 90% accuracy with relatively simple ideas.

My own approach was to define Score(text, language) as the minimal number of substrings seen previously in this language that compose the text. This can be computed efficiently by maintaining a suffix tree for each language, and using it to answer longest common prefix queries.