Wednesday, June 2, 2010

Representing a vector

This is the second post about "Changing Base without Losing Space." The main result of the paper is, if I am allowed to say it, rather amazing:
On a binary computer, you can represent a vector of N decimal digits using ⌈N·log₂10⌉ bits, and support reading/writing any digit in constant time.
Of course, this works for any alphabet Σ; decimal digits are just an example.

Unlike the prefix-free coding from last time, this may never find a practical application. However, one can hardly imagine a more fundamental problem than representing a vector on a binary computer --- this is a statement that could have been discovered in 1950 or in 2050 :) Somebody characterized our paper as "required reading for computer scientists." (In fact, if you teach Math-for-CS courses, I would certainly encourage you to teach this kind of result for its entertainment value.)

To understand the solution, let's think back to the table for prefix-free coding from last time:
Here is the reasoning process behind the first few operations:
  1. I have two symbols from alphabet B+1. I need an output alphabet of B, so let's split them into a letter from B, and whatever is left (in particular, a letter from B+3).
  2. I have a letter from B+3. Can I combine it with another letter to make something close to a power of 2? Yes, I should use a letter from alphabet B-3, since (B+3)(B-3) is close to B².
  3. How can I get a letter from B-3? Take the next two input symbols, and split them into a letter from B-3, and whatever is left (in particular, a letter from B+6).
To generalize, let's assume we want to code something from an alphabet X (not a power of two). We can store it in binary without loss of space, if we have another, unrelated, symbol to code (an information carrier). In a picture:
If I want to output M bits (M≫lg X), I have to combine the symbol from X with a symbol from Y=⌊2^M/X⌋. The loss of entropy will be lg(Y+1) - lg Y = O(1/Y), since the floor function could convert Y+0.9999 into Y.

Now I have to get a symbol from Y. This is possible if my information carrier came from some alphabet T ≫ Y. Then, I can break it into a symbol from Y, and one from Z=⌈T/Y⌉. Again, the entropy loss is lg Z - lg(Z-1) = O(1/Z), since the ceiling can convert Z-0.9999 into Z.

To balance the losses, set Y≈Z≈√T. That is, by having a large enough information carrier, I can make the loss negligible. In particular, if I apply the information carrier N times, I could set T≈N², meaning that I only lose O(1/N) bits per application, and only a fraction of a bit overall! (This fraction of a bit will become one entire bit at the end, since I need to write the last symbol in binary.)

Thus, in principle, I could encode an array of digits by grouping them into blocks of O(lg N) digits (making T=the alphabet of a block be large enough). Then, I can iteratively use one block as the information carrier for the leftover of the previous block (the Z value from the previous block). The crucial observation is that, to decode block i, we only need to look at memory locations i (giving the Y component) and i+1 (giving the Z component). Thus, we have constant time access!
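To make this concrete, here is a toy Python sketch of the block scheme just described (the parameter choices are my own, picked so that Y and Z hover around √T; the paper tunes things more carefully). Word i packs the Y-part of block i together with the Z-leftover of block i-1, so any block is recovered from two consecutive words:

def build_tables(num_blocks, T):
    """Precompute the data-independent alphabets Y[i] and word sizes M[i] (in bits).
    They depend only on the parameters (number of blocks, T), never on the digits."""
    half = max(1, T.bit_length() // 2)        # aim for Y ~ Z ~ sqrt(T)
    Y, M = [], []
    prev_Z = 1                                # no leftover before the first block
    for _ in range(num_blocks):
        m = (prev_Z - 1).bit_length() + half  # room to pack (leftover, Y-part)
        y = (1 << m) // prev_Z                # Y_i = floor(2^M_i / Z_{i-1})
        Y.append(y)
        M.append(m)
        prev_Z = -(-T // y)                   # Z_i = ceil(T / Y_i): the new leftover alphabet
    return Y, M

def encode(digits, base, g):
    """Encode base-`base` digits, g per block (assume len(digits) divisible by g)."""
    T = base ** g
    blocks = []
    for i in range(0, len(digits), g):        # block value = one symbol of alphabet T
        v = 0
        for d in digits[i:i + g]:
            v = v * base + d
        blocks.append(v)
    Y, _ = build_tables(len(blocks), T)
    words, leftover = [], 0                   # leftover = z_{i-1}, trivially 0 at first
    for i, b in enumerate(blocks):
        y, z = b % Y[i], b // Y[i]            # split block i into (Y-part, Z-part)
        words.append(leftover * Y[i] + y)     # word i packs (z_{i-1}, y_i) in M[i] bits
        leftover = z
    return words, leftover                    # the last leftover is flushed separately

def read_block(words, last_leftover, Y, i):
    """Constant-time access: block i is recovered from words i and i+1 alone."""
    y = words[i] % Y[i]
    z = words[i + 1] // Y[i + 1] if i + 1 < len(words) else last_leftover
    return z * Y[i] + y

# Tiny usage example: 12 decimal digits, blocks of g=3 digits, so T = 1000.
digits = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8]
words, last = encode(digits, base=10, g=3)
Y, M = build_tables(len(digits) // 3, 10 ** 3)   # the decoder rebuilds the same tables
for i in range(4):
    assert read_block(words, last, Y, i) == digits[3*i]*100 + digits[3*i+1]*10 + digits[3*i+2]

On this toy input the words take 5+10+10+10 = 35 bits and the final leftover another 5, i.e. 40 bits for 12 decimal digits, matching ⌈12·lg 10⌉ = 40. Note that build_tables is exactly the list of precomputed constants that the next paragraph complains about.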

The one questionable feature of this scheme is that it requires O(N) precomputed constants, which is cheating. Indeed, the alphabets Y and Z change chaotically from one iteration to the next (the next Y is dictated by the previous Z, "filling it" to a power of two). There seems to be no pattern to these values, so I actually need to store them.

One can get away with just O(lg N) constants by applying the information carrier idea in a tree fashion. The alphabets will vary from level to level, but are identical on one level by symmetry. See the paper for complete details.

I will talk about my second STOC paper ("Towards Polynomial Lower Bounds for Dynamic Problems") after the conference.

Prefix-Free Codes

I am told the least this blog could do is to talk about my own results :) So here goes.

Next week at STOC, I am presenting "Changing Base without Losing Space," a joint paper with Mikkel Thorup and Yevgeniy Dodis. The paper and slides can be found here. The paper contains two results achieved by the same technique. Today, I will talk about the simpler one: online prefix-free codes.

The problem is to encode a vector of bits of variable length in a prefix-free way; that is, the decoder should be able to tell when the code ends. (Note on terminology: In information theory, this is called "universal coding"; prefix-free is about coding letters from a fixed alphabet, e.g. the Huffman code is prefix-free.)

Let N be the (variable) length of the bit vector. Here are some classic solutions (known as Elias codes):
  1. A code of 2N bits: after each data bit, append one bit that is 0 for end-of-file (EOF) or 1 if more data is coming;
  2. A code of N+2lg N bits: at the beginning of the message, write N by code 1; then write the bit vector.
  3. A code of N+lg N+2lglg N bits: at the beginning, write N by code 2; then write the bit vector.
  4. Recursing, one obtains the optimal size of N+lg N+lglg N+...+O(lg*N)
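To make the first two codes concrete, here is a small Python sketch (the function names are mine; codes 3 and 4 just iterate the same idea on the length field):

def code1_encode(bits):
    """Code 1 (2N bits): after each data bit, a flag: 1 = more data, 0 = EOF.
    (As in the description above, this assumes a non-empty message.)"""
    out = []
    for i, b in enumerate(bits):
        out += [b, 1 if i + 1 < len(bits) else 0]
    return out

def code1_decode(stream, pos=0):
    """Returns (decoded bits, position just past the end of the code)."""
    bits = []
    while True:
        bits.append(stream[pos])
        flag = stream[pos + 1]
        pos += 2
        if flag == 0:
            return bits, pos

def code2_encode(bits):
    """Code 2 (N + 2lg N bits): the length N written with code 1, then the raw data."""
    length_bits = [int(c) for c in bin(len(bits))[2:]]
    return code1_encode(length_bits) + list(bits)

def code2_decode(stream, pos=0):
    length_bits, pos = code1_decode(stream, pos)
    n = int(''.join(map(str, length_bits)), 2)
    return stream[pos:pos + n], pos + n

# The decoder finds the end of the code even if more data follows it.
msg = [1, 0, 1, 1, 0, 0, 1]
enc = code2_encode(msg)
dec, end = code2_decode(enc + [1, 1, 0, 1])   # trailing bits belong to some next code
assert dec == msg and end == len(enc)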
Prefix-freeness turns out to be very useful in (practical) cryptography. For instance, if I want a signature of a file, I could use some standard cipher that works on small blocks (say, AES on 128 bits), and chain it:
However, this is only secure if the input is prefix-free; otherwise we are vulnerable to extension attacks:
This creates the need for online prefix-free codes: I want to encode a stream of data (in real time, with little buffering), whose length is unknown in advance. In this setting, the simplest solution using 2N bits still works, but the others don't, since they need to write N at the beginning. In fact, one can "rebalance" the 2N solution into an online code of size N+O(√N): append a bit after each block of size √N, wasting a partially-filled block at the end. Many people (ourselves included) believed this to be optimal for quite some time...

However, our paper presents an online code with ideal parameters: the size is N+lg N+O(lglg N), the memory is only O(lg N), and the encoding is real time (constant time per symbol). Since the solution is simple and practical, there is even reason to hope that it will become canonical in future standards!

So, how do we do it? I will describe the simplest version, which assumes the input comes in blocks of b bits and that b≫2lg N (quite reasonable for b=128 as in AES). Each block is a symbol from an alphabet of size B=2^b. We can augment this alphabet with an EOF symbol; in principle, this should not cost much, since lg(B+1)≈lg B for large B. More precisely, N symbols from an alphabet of B+1 have entropy N·lg(B+1) = N·b+O(N/B) bits, so there's negligible loss if B≫N.

The problem, though, is to "change base without losing space": how can we change from base B+1 (not a power of two) into bits in real time? A picture is worth 1000 words:
We can think of two continuously running processes that regroup two symbols into two symbols of different alphabets:
  • Split: Two input symbols in alphabet B+1 are changed into two symbols in alphabets B-3i and B+3(i+1), for i=0,1,2,... This works as long as (B-3i)(B+3i+3) ≥ (B+1)², i.e. as long as 9i(i+1) ≤ B-1; since i < N, this always holds when N² ≪ B (hence the assumption b≫2lg N).

  • Merge: Two input symbols in alphabet B-3i and B+3i are regrouped into two symbols in alphabet B, which can be written out in binary (b bits each). This is always possible, since (B-3i)(B+3i) ≤ B².
Encoding and decoding are fast and simple: they amount to a few multiplications and divisions for each block. And that's it!
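Here is a toy Python sketch of this Split/Merge pipeline (my own simplification: it works on Python integers rather than an actual bit stream, and the final pending value is simply handed to the decoder instead of being flushed in ⌈lg(B+3i)⌉ bits):

def encode(symbols, b):
    """Online base change: an even-length stream of symbols from alphabet B+1
    (B = 2**b) becomes a stream of b-bit words, plus one pending value to flush."""
    B = 1 << b
    out, pending = [], None                 # pending lives in alphabet B + 3i
    for i in range(len(symbols) // 2):
        pair = symbols[2*i] * (B + 1) + symbols[2*i + 1]   # a value in [0, (B+1)^2)
        lo = B - 3 * i                      # Split: alphabets B-3i and B+3(i+1)
        y, carry = pair % lo, pair // lo    # needs (B+1)^2 <= (B-3i)(B+3i+3)
        if pending is None:
            out.append(y)                   # i = 0: y is already a base-B digit
        else:
            merged = pending * lo + y       # Merge: (B+3i)(B-3i) <= B^2
            out += [merged // B, merged % B]   # two b-bit output words
        pending = carry
    return out, pending                     # pending is in alphabet B + 3·(#pairs)

def decode(out, pending, b, num_pairs):
    B = 1 << b
    symbols, pos = [], 1
    y_prev, lo_prev = out[0], B             # stage 0 produced a single word
    for i in range(1, num_pairs):
        lo = B - 3 * i
        merged = out[pos] * B + out[pos + 1]
        pos += 2
        carry_prev, y = merged // lo, merged % lo   # undo the Merge of stage i
        pair = carry_prev * lo_prev + y_prev        # undo the Split of stage i-1
        symbols += [pair // (B + 1), pair % (B + 1)]
        y_prev, lo_prev = y, lo
    pair = pending * lo_prev + y_prev       # the last carry was flushed at the end
    symbols += [pair // (B + 1), pair % (B + 1)]
    return symbols

# Toy usage with b = 8, i.e. B = 256; the value 256 plays the role of EOF.
data = [256, 17, 255, 3, 0, 129, 42, 7]
words, pending = encode(data, b=8)
assert decode(words, pending, b=8, num_pairs=len(data) // 2) == data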

Tuesday, June 1, 2010

MADALGO summer school

MADALGO is organizing a Summer School on Geometric Data Structures on Aug 16-19 in Aarhus, Denmark. Registration and local food are free (with a capacity limit), but you have to get there on your own dime.


The speakers are Timothy Chan, Sariel Har-Peled, John Iacono, and yours truly. We will strive to make this a very instructive event, and I would encourage you to attend.

Avner Magen

On May 29, Avner died in an avalanche while climbing in Alaska. A memorial blog has been set up here.


Like research, climbing is about fighting soulless nature. A mountain peak or a proof takes no pity on our attempts --- they are as hard to reach as they have been since time immemorial. When we succeed in reaching the top, our flesh and soul are the mark of our superiority, for we can feel a joy that nature does not share. But when we fail, our flesh and soul are our sorrow.

Rest in peace, Avner. Researchers and climbers have enjoyed your company.

Friday, February 12, 2010

The 6 Months News Cycle

The STOC accepted papers, with some pointers to online versions, can be found here. I had two papers: Towards Polynomial Lower Bounds for Dynamic Problems and Changing Base without Losing Space. More on them later.


The FOCS call for papers is here. An experimental feature is the lack of page limits. While nominally the reviewers are only obliged to look at the first 10 pages, we all know such formalism is irrelevant. Reviewers look at anywhere between 0.5 and 50 pages, until they form some opinion about the paper. As an author, I'm happy to stop playing the childish \baselinestretch and \appendix games, which were not helping anyone.

On the other hand, I vividly remember a discussion during my tenure on the FOCS PC when people made a strong case for a strict 10-page submission limit (no appendix). Many authors don't give a damn about teaching the reader anything, and seem driven by the desire to make their paper look as long and technical as possible. The reasoning was that a strict 10-page limit might actually force them to explain some ideas. Instead of a strict page limit, I would actually advocate using reviewers that are not easily intimidated, and stop accepting papers just because "it looks hard and these people seem to have put a lot of effort into it." In practice, of course, this ideal is hard to implement.

Forward by one conference, and I am on the SODA PC. This seems to be a standard punishment for having 4 papers in the conference. Speaking of which, I have recently uploaded slides for all these papers.

I am travelling to Europe in the coming weeks, so my series on hashing will continue at a slower pace.

Tuesday, February 2, 2010

Cuckoo Hashing

Today, we will prove bounds for the basic cuckoo hashing. We are going to place our n keys into two arrays A[1..b] and B[1..b], where b=2n. For this, we will use two hash functions, h and g. When some element x arrives, we first try to put it in A[h(x)]. If that location already contains some element y, try to move y to B[g(y)]. If that location already contains some z, try to move z to A[h(z)], and so on until you find a free spot.

The proper image to have in mind is that of a random bipartite graph. The graph will have b nodes on each side, corresponding to the locations of A and B. In this view, a key x is an edge from the left vertex h(x) to the right vertex g(x).
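In code, the insertion procedure looks roughly as follows (a toy Python sketch; random lookup tables stand in for the truly random h and g of the analysis, and the retry loop in build stands in for the rehashing a real implementation would do on failure):

import random

class CuckooHashTable:
    """A minimal sketch of the scheme above: arrays A and B of size b = 2n."""

    def __init__(self, keys):
        self.b = 2 * len(keys)
        self.h = {x: random.randrange(self.b) for x in keys}   # "truly random" h
        self.g = {x: random.randrange(self.b) for x in keys}   # "truly random" g
        self.A = [None] * self.b
        self.B = [None] * self.b

    def lookup(self, x):
        # only keys from the fixed universe can be queried in this toy
        return self.A[self.h[x]] == x or self.B[self.g[x]] == x

    def insert(self, x, max_steps=500):
        cur, side = x, 0                       # side 0: try A[h(.)], side 1: B[g(.)]
        for _ in range(max_steps):
            if side == 0:
                pos = self.h[cur]
                cur, self.A[pos] = self.A[pos], cur   # place cur, evict the occupant
            else:
                pos = self.g[cur]
                cur, self.B[pos] = self.B[pos], cur
            if cur is None:                    # the evicted "occupant" was a free spot
                return True
            side = 1 - side                    # the evicted key goes to the other side
        return False                           # stuck (e.g. two cycles): rebuild

def build(keys):
    """Insert all keys, redrawing h and g on the (probability O(1/n)) failure."""
    while True:
        table = CuckooHashTable(keys)
        if all(table.insert(x) for x in keys):
            return table

table = build(list(range(1000)))
assert all(table.lookup(x) for x in range(1000))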

Simple paths. As a warm-up, let's deal with the case of simple paths: upon inserting x, the update path finds an empty spot without intersecting itself. It turns out that the update time of cuckoo hashing behaves like a geometric random variable:

The probability that insert(x) traverses a simple path of length k is 2^(-Ω(k)).
I will prove this by a cute encoding analysis (you know I like encoding proofs). Let's say you want to encode the two hash codes for each of the n keys. As the hash functions h and g are truly random, this requires H=2n·lg b bits on average (the entropy). But what if, whenever some event E happened, I could encode the hash codes using H-Δ bits? This would prove that Pr[E]=O(2^(-Δ)): there are only O(2^(H-Δ)) bad outcomes that lead to event E, out of 2^H possible ones. Thus, the task of proving a probability upper bound becomes the task of designing an algorithm.

In our case, E={insert(x) traverses a simple path of length k} and we will achieve a saving of Δ=Ω(k). Here is what we put in the encoding:
  • one bit, saying whether the path grows from A[h(x)] or B[g(x)];
  • the value k, taking O(lg k) bits;
  • all edges of the path, in order, taking (k-1)lg n bits.
  • all vertices of the path, in order, taking (k+1)lg b bits.
  • the hash codes for all keys not on the path, in order, taking (n-k)·2lg b bits.
Observe that the hash codes of the k keys on the path are specified using essentially half the information, since a hash code is shared between two edges. Indeed, using lg n = lg(b/2) = lg b - 1, the encoding totals O(lg k) + (k-1)(lg b - 1) + (k+1)·lg b + (n-k)·2lg b = H - (k-1) + O(lg k) bits, so it saves Δ = k - O(lg k) bits compared to H.

The intuition for why a k-path occurs with probability 2^(-Ω(k)) is simple. Say I've reached edge y and I'm on the right side. Then, the probability that B[g(y)] is collision free is at least 1/2, since there are only n keys mapped to a space of 2n. In other words, at each point the path stops with probability half. This is exactly what the encoding is saying: we can save one bit per edge, since it takes lg n bits to encode an edge, but lg(2n) to encode an endpoint.

One cycle. Let us now deal with the case that the connected component of x contains one cycle. It is tempting to say that cuckoo hashing fails in this case, but it does not. Here is what might happen to the update path in case a cycle is part of the component (see figure):
  1. the path luckily avoids the cycle and finds a free location without intersecting itself. Cool.
  2. the path reaches B[g(x)], which is occupied by some key y. Note that this has closed the cycle through the x edge, but the x edge is not actually traversed. Following y to A[h(y)] must eventually reach a free spot (no more cycles).
  3. the path intersects itself. Then, it will start backtracking, flipping elements back to their position before Insert(x). Eventually, it reaches A[h(x)], where the algorithm had originally placed x. Following the normal cuckoo rules, x is moved to B[g(x)] and the exploration from there on must find an empty spot.
As before, we want to show that an update takes time k with probability 2^(-Ω(k)). In cases 1 and 2, we can simply apply the encoding from before, since the path is simple. In case 3, let ℓ be the number of edges until the path meets itself. The number of edges after B[g(x)] is at least k-2ℓ. Thus, we have a simple path from x of length max{ℓ, k-2ℓ} = Ω(k), so the old argument applies.

Two cycles. We now arrive at the cases when cuckoo hashing really fails: the bipartite graph contains as a subgraph (1) a cycle with a chord; or (2) two cycles connected by a path (possibly a trivial path, i.e. the cycles simply share a vertex).
From the figure we see that, by removing two edges, we can always turn the bad subgraph into two paths starting at x. We first encode those two paths as above, saving Ω(k), where k=size of the subgraph. Now we can add to the encoding the two infringing edges. For each, we can specify its identity with lg n bits, and its two endpoints with O(lg k) bits (a lower order loss compared to the Ω(k) saving). In return, we know their two hash codes, which are worth 4·lg b bits. Thus, the overall saving is at least 2·lg n bits.

We have shown that an insertion fails with probability O(1/n²). By a union bound, cuckoo hashing will handle any fixed set of n elements with probability 1-O(1/n).

This bound is actually tight. Indeed, if three keys x,y,z have h(x)=h(y)=h(z) and g(x)=g(y)=g(z), then cuckoo hashing fails (this is the simplest obstruction subgraph). But such a bad event happens with probability (n choose 3)·b²/b⁶ = Θ(1/n).

Better Guarantees for Chaining and Linear Probing

Last time we showed that chaining and linear probing enjoy O(1) expected running times per operation. However, this guarantee is fairly weak, and certainly does not explain the popularity of the schemes. If you had a computation that was expected to run for 4 hours, would you be happy to hear, say, that it might take 10 hours with 10% probability?

Worst case w.h.p. So, what bound can be put on the running time that holds with high probability? For chaining, simply apply the Chernoff bound. Given that a bin contains μ=O(1) elements in expectation, the probability that it contains Z elements is at most e^(Z-μ)·(μ/Z)^Z = (O(1)/Z)^Z. To make this be n^(-c), set Z=Θ(lg n/lglg n).

Observe that this bound is tight. There are (n choose Z) ≈ n^Z/Z! combinations of keys that can land in a designated bin, and some Z keys land there with probability b^(-Z)=(2n)^(-Z). Thus, in expectation, at least one bin will see Z=Ω(lg n/lglg n) elements.
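If you want to see the Θ(lg n/lglg n) behavior with your own eyes, here is a tiny, purely illustrative simulation of the balls-in-bins experiment with b=2n bins:

import random
from math import log

def max_chain_length(n, trials=20):
    """Throw n keys into b = 2n chains with a truly random hash; report the worst chain."""
    worst = 0
    for _ in range(trials):
        load = [0] * (2 * n)
        for _ in range(n):
            load[random.randrange(2 * n)] += 1
        worst = max(worst, max(load))
    return worst

n = 1 << 16
print(max_chain_length(n), log(n) / log(log(n)))   # both quantities are Θ(lg n/lglg n)

Over a few trials the observed maximum stays within a small constant factor of lg n/lglg n, as predicted.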

For linear probing, we can also reduce this to a balls-in-bins analysis, by defining a bin as L=O(lg n) consecutive locations in the array. In the running example b=2n, such a bin is full when Z=2μ=L. By Chernoff, this happens with probability e^(Z-μ)·(μ/Z)^Z = (e/4)^(L/2). Thus, the maximal run is O(lg n) w.h.p. Again, this is tight by a direct calculation.


Amortized bounds. Unfortunately, these worst-case¹ bounds are not too satisfactory either, since O(lg n) is trivial to get by binary trees. If you grant me some lenience for my modeling, I will prove an O(1) amortized bound w.h.p., which I believe explains the power of the algorithms much better.

¹ There is an unfortunate tendency of some TCS papers to use "worst case" when they really mean deterministic. I am much happier with the convention that worst case is the opposite of amortized, at least when your paper has any connection to data structures.
Formally, I will prove the following, rather trivial statement:
Let T ≥ c·lg n, for a large enough constant c. Any T operations in chaining hashtables only touch O(T) memory w.h.p.
Fix the hash codes of the T elements, and define our "bin" to consist of these ≤T distinct hash codes. All other elements are still totally random, and we expect μ ≤ T·n/b = T/2 of them to fall into the bin. If μ ≤ Z/(2e), the Chernoff bound gives e^(Z-μ)·(μ/Z)^Z ≤ 2^(-Z); picking Z = Θ(T) ≥ c·lg n, this is high probability.

But does this actually mean that chaining has O(1) amortized running time? Formally, no: if I repeat a single query T times, the running time will be T times the running time of that query, i.e. a geometric random variable with no good concentration. Here is where I invoke a bit of lenience in my modeling: in practice, it is ridiculous to worry about repeating one of the last O(lg n) operations! The memory used by these recent updates will be fresh in the first level of cache, making a repetition cost essentially zero. (One may formally say that chaining has amortized O(1) running time w.h.p. in the external memory model with a cache of size Ω(lg n).)

A pretty way to understand the amortized bound is as a "buffer size" guarantee. The most demanding applications of hashtables are in analyzing a continuous stream of data, when operations need to be super-fast to keep up with the line speed. In such applications, if the design is at all sensible, there will be a buffer between the network interface and our CPU. The goal is not necessarily to take O(1) time for every single operation, but to keep the buffer small. Our proof says that the buffer will not grow to more than T=O(lg n) w.h.p., if you can afford the average time/operation.

For linear probing, we can instead show:
Let T=Ω(lg³n). Any T operations in linear probing hashtables only touch O(T) memory w.h.p.
Remember from last time that we analyzed linear probing by building a binary tree over the array, and bounding the number of nodes that become dangerous (two-thirds full).

Let μ be the number of keys we expect under some node. First of all, if μ≫lg n, we do not need to worry about the node: it doesn't become dangerous w.h.p. Otherwise, we showed that the node becomes dangerous with probability O(1/μ²); if it does, we will pay a cost of μ.

Looking at T elements, I am dealing with ≤T nodes on each level, and I expect O(T/μ²) nodes to be dangerous. As long as T/μ² ≥ c·lg n, Chernoff tells me that only O(T/μ²) nodes are dangerous w.h.p. Since we only deal with μ=O(lg n), I needed to set T=Ω(lg³n). With this bound, the number of memory locations accessed by the T operations is Σ_{μ=2^i} O(T/μ²)·μ = O(T) w.h.p.

I end with a question to my knowledgeable readers. By being more careful, I can prove T=Ω(lg^(1+ε) n) suffices for linear probing. Is it possible to prove T=Ω(lg n), as in chaining? (Perhaps I'm missing something obvious.)