Sunday, September 30, 2007

China (III)

I do not write about politics, since I could never be as good as some other people who do. A link to fellow theorist Bernard Chazelle, and the captivating essays on his webpage, is in order. (I hesitated posting a link not because I disagree with some of his arguments, which indeed I do, but because you will probably get distracted and not read the rest of my post... But render unto Caesar the things which are Caesar’s, I already said there are many people who write better than me.)

So, I did go to China, and I am back, and now I feel compelled to write a few words.

My friends in Romania ask me how the Chinese can stand their government, and why they are not fighting. Of course, the obvious guess on everybody's lips is panem et circenses -- the Chinese economy has been making strides forward, which allowed the government to keep citizens happy.


But another reason that cannot be ignored lies in the remarkable efficiency of Chinese propaganda among its citizens. A prime accomplishment of the communist machine is the fact that political discussion is so touchy among Chinese inteligentia in the US. The party has managed to convince even truly brilliant people that the state is the party and the party is the state. Then, if somebody challenges Chinese communism, they are attacking the Chinese patriotic sentiment. Love of the Vaterland becomes a shield of the regime.

Indeed, they are careful and proactive about propaganda. In the new age of communication, they have set up the world's best known firewall to keep the citizens away from subversive information. A non-technical person will not be able to get to Wikipedia, or to blogspot, or to any "unofficial" description of Tiananmen 1989, and will not be able to use encryption (ssh).

In fact, brainwashing and censorship were such a durable and effective efforts, that it became hard to get censors! A censor recently let a reference to Tiannmen be published because she thought June 4, 1989 was a mining disaster [Reuters].

Tiananmen has become the place where "On this spot in 1989, nothing happened".


Yet despite the brainwashing, there are those who are fighting (and currently losing). Taiwan's situation is constantly becoming worse, since every dollar that the booming Chinese economy earns is another dollar behind the economic blackmail in the One China policy. With Burkina Faso as the largest country to officially recognize Taiwan, their backing sadly reminds us of the grand Coalition of the Willing.

Inside the People's Republic, a cultural fight is led. In good Soviet tradition, the Party's policy was: grab as much land and people as possible, and rebrand everything as Chinese. (Han Chinese, of course.)

Tibet and Xinjiang are just about as Chinese as Finland. But in the past half-century, the Party has managed to bring 7 million or so Han into Xinjiang, a land populated by Turkic people (Uyghur and Kazakhs). This raised the percentage of Han from zeroish to 40%, and will soon make Xinjiang "Chinese", a feat that would have made the Father of Nations, Joseph Stalin, exceedingly proud.

As for Tibet, it may not even be needed to resort to ethnic invasion. Here is the simple story. The Dalai Lama and the Panchen Lama assure a continuity in Tibetan spiritual leadership. When one dies, the other searches for a reincarnation, and "takes care of business," while the young boy found to be the reincarnation grows up, and is educated to be a great Lama. When Panchen Lama died, the Dalai Lama (from exile in India) selected a new Panchen Lama in communication with a search commitee wandering across Tibet.

In 1995, the young boy became a political prisoner at the unlikely age of 6, and he has not been seen since. In its infinite wisdom and understanding of Tibetan culture, the Party chose the next Panchen Lama, and took the boy to Beijing. From there, he has been making periodic statements about the great freedom of religion enjoyed by the Chinese people living in Tibet.

Now, when the Dalai Lama dies, can anyone guess on what criteria the Panchen Lama (Beijing version) will select a reincarnation?

In the mean time, any attempt to cross into India and communicate with the Dalai Lama is a treason to the Chinese state. Here is a ghastly piece of footage caught by a Romanian cameraman while climbing in the Himalayas (which pedictably won some accolades at the Emmy Awards this year).

It shows a Chinese border patrol executing Tibetans who are trying to cross the mountain. Via the New York Times, we learn that China acknowledged the incident, and justified it as self-defense after the monks tried to attack the soldiers.



Thursday, September 20, 2007

Slepian, Wolf and relatives (III)

To wrap up this thread of posts, let me discuss some motivation and directions for asymmetric communication channels, described in post II.

In my talk in China, I mentioned the following naive motivation. We have two sensors in a river. Given the average velocity of the flow, we can deduce that the readings of the sensor upstream at some time t, will be correlated with the readings of the sensor downstream at some time tt. The correlation will not be perfect (otherwise we wouldn't use two sensors), but it is enough to compress the message. However, the sensor downstream doesn't know the distribution of his own input, since he doesn't know the other sensor's reading at an earlier time.

This example is simple and intuitive enough that it should work as PR. Here are the more serious motivations that get thrown around:
  1. sensor networks (in some abstract way)
  2. internet connections (most ISPs give you faster download than upload; downloading from a satellite is significantly easier than uploading)
  3. stenography and combating consorship. This comes from the paper: [Feamster, Balazinska, Harfst, Balakrishnan, Karger: Infranet: Circumventing Web Censorship and Surveillance. USENIX Security Symposium 2002]
About 1., an Anonymous commenter on my earlier post believes that the academic work on sensor networks is mostly PR. Piotr explains that the problem is that you want to do some source processing, as opposed to just sending some large sample, and it's hard to understand any distribution after source processing. This is a very valid concern for both the Slepian-Wolf model, and the asymmetric communication model.

Moving on to 2., I have not seen examples where this approach might be useful at the scale of realistic ISP links, nor examples of how you might get the server to know the distribution better than you.

About 3., I have a feeling that it is simply accidental cobranding. To get your paper accepted to a practical conference, it helps to use some fancy theoretical work somewhere (maybe a place where you don't even need it), and theoretical work can claim practical importance if somebody uses it in a supposedly practical conference. I am reminded of a funny quote: "I would gladly work on the practical side of academia, but so little of practice actually applies to practice..."

Thus, unfortunately, I do not find the motivations I heard so compelling (outside of theoretical beauty with which I will fully agree).

What is probably needed is more attention towards the EE end of the discipline. Is anybody aware of any EE-work on this problem? In our abstract setting (Bob knows a distribution), any algorithm is ultimately impractical, because nobody works with perfect knowledge of a distribution (a distribution is a big object!). What we need is better modelling of what "understanding a distribution" means, and algorithms based on that. To give the prototypical sily example, the distribution may be (roughly) normal, so Bob only needs to know the mean and standard deviation.

On the lower bound side, I have some rough ideas to prove that the lower bound holds even if the distribution is the trajectory of a Markov chain of polynomal size (as opposed to the distribution in our paper, which really took exponentially many bits to describe).

Wednesday, September 19, 2007

Slepian, Wolf and relatives (II)

I didn't really get the answers I was hoping for in my original post on Slepian-Wolf coding, but let's move to the second part.

This post is about "asymmetric communication channels". Assume Alice receives an input x in {0,1}n, and she wants to communicate it to Bob. If x comes from some distribution D of less with entropy less than n, Alice and Bob should agree to communicate using a Huffman code and send ≤ H(D)+1 bits.

But what if D is only known to Bob, and not to Alice? Can Bob send something intelligent, which is still much less than the 2n description of the distribution, but which allows Alice to communicate efficiently? If you want to hear motivations and applications of this setup, I will discuss them in my next post. Personally, I find this "can you communicate a Huffman code" question sufficiently natural and intriguing.

It is not hard to show that the total communication needs to be at least n. Thus, a nontrivial protocol is only possible in an asymmetric setting, where Bob is allowed to communicate more than Alice.

Upper bounds. The problem was first studied by Adler and Maggs [FOCS'98], who gave a protocol in which Bob sends expected O(n) bits, and Alice sends expected O(H) bits. This is a surprising result: Bob is sending some sketch of the distribution that is logarithmic in its description, but with this sketch Alice can compress her input down to entropy.

There is one catch, though: the protocol uses O(1) rounds in expectation. It does not provide a guarantee for any fixed number of rounds.

With this catch, you can start to get a glimpse of how the protocol might run. Up to constant factors, a distribution with entropy H looks like a distribution putting probability 2-(H+1) on some 2H samples (mass 1/2), then probability 2-(H+2) on other 2H samples (total mass 1/4), probability 2-(H+3) on other 2H samples (mass 1/8) and so on. Bob is going to send a hash function {0,1}n -> {0,1}H which is injective on the 2H samples on the first level, and Alice returns the hash code. Bob sends the sample that maps to that hash code, and if Alice says he was wrong, they continue to the second level (which will only happen with probability 1/2). This continues, reaching level i (and round i) with probability 2-O(i).

There has been a lot of follow-up work, which tries to reduce the constants in the communication. Some papers have Alice communicating only H+2 bits, for instance. However, I do not like these results because they use roughly H rounds in expectation. In real life, a round of communication is much much more expensive than sending a few extra bits.

In SODA'05, Adler extended the protocol to the case where there are k Alice's, which have samples x1, ..., xk to communicate. These samples come from a joint distribution, known only to Bob.

Lower bounds. Micah Adler came to a problem session organized by Erik, and told us about the problem, highlighting the weird behavior that the number of rounds is geometric. Personally, I was fascinated by the question of whether this was tight.

Unfortunately, the proof eventually required not only some nice ideas, but also a lot of work in balancing everything just right. I spent a few months paying long daily visits to Nick's office, before we finally managed to make it go through. In the slides for my talk in China, I associated this problem with a bas-relief depicting galley slaves.

The proof uses round elimination and message switching (which I will talk about in future posts), but in an unusual context: you need to keep recursing in these smaller and smaller probability spaces.

Eventually, we showed that i rounds are needed with probability 2-O(i lg i), which is fairly close to optimal, and demonstrates that any fixed number of rounds is not enough. In the good tradition of sharp phase transitions, the lower bound holds even if Bob can communicate an outrageous amount: 2^{n1-ε}.

Our paper appeared in SODA'06. The paper and slides have fairly good intuition for how the bound works.

Job Market

Since I have announced that I am graduating this year and looking for a job, people have been asking me who else is on the market. I am in China now, with all the brilliant graduate students, so I took the opportunity to ask who will be on the job market. Here is the list of people who said they will be applying for faculty jobs in the US in 2007. This list does not include people who announced they are only looking for postdocs, or who want jobs outside US.

Alphabetically:
If you know other people not on the China list (eg people who are now postdocs), feel free to comment.

I have heard (rumours) about the following people:

Monday, September 17, 2007

China (II)

As announced, I am attending China Theory Week in Beijing.

Sadly, this puts me behind the Great Firewall of China, which hates blogspot. I will wait until I leave China to express my opinion on this :) Lots and lots of thanks to Erik and Elad Verbin (currently postdoc here), for teaching me how to get around the firewall.

My talk here was: "Round Elimination — A Proof, A Concept, A Direction."
Round elimination, which can be traced back to [Ajtai 1989], has proven to be one of the most useful tools in communication complexity, playing a role similar to PCPs in approximation algorithms. We describe the history of its many variants, including the simplest known proof, and the most illustrative applications. We also describe possible future work, including conjectures that would bring significant progress in applications.

Based on literature survey, and joint publications with Micah Adler, Amit Chakrabarti, Erik Demaine, Sudipto Guha, TS Jayram, Nick Harvey, and Mikkel Thorup.
Slides are available in pptx or exported as ppt. The PDF didn't look decent enough to be posted; sorry.

If you know my usual style of giving talks, you know the slides are somewhat unorthodox. Since I will hopefully give a nonzero number of job talks in the spring, I am especially curious what readers think of the advice on slide 2 :)
Learn from Holywood: sex and violence are the key to a successful job talk.

Friday, September 14, 2007

Slepian, Wolf and relatives (I)

The [Splepian-Wolf'73] coding scheme is a famous result in coding theory, that inspired the field of distributed source coding. I guess this topic would be better for Michael's blog, but I am curious to ask some questions, and I want to describe a slightly related beast in the next post.

Review. You have k clients, and a server. The clients have some inputs x1, ..., xk, each of n bits, and they want to communicate all inputs to the server. Now assume these inputs come from some joint distribution with entropy much less than nk. Can one use similarly little communication?

If the distribution is rectangular, i.e. the xi's are independent, each client i can perform source coding (say, Huffman coding) based on his own distribution, achieving H(xi) communication. With dependencies, if clients broadcast their messages in order, each client can perform source coding based on his distribution conditioned on the previous messages. This achieves optimal total communication, since Sumi H(xi | x1, ..., xi-1) = H(x1, ..., xk).

Slepian and Wolf showed that it is possible to achieve optimal communication even if all clients communicate one message simmultaneously, and the messages are only seen by the server.

In fact, there is a whole simplex of possible scenarios for optimal communication. Consider k=2 to make notation easy. Then, it is possible for client 1 to send m1 bits, and client to to send m2 bits as long as:
  • m1+m2 ≥ H(x1, x2)
  • m1 ≥ H(x1 | x2)
  • m2 ≥ H(x2 | x1)
Now, this is the perfect kind of result if you want to start a field: it shows that something is possible in principle, but through a nonconstructive probabilistic technique which isn't really feasible to use in practice.

Since then, people have developed efficently encodable and decodable distributed codes for all sorts of important special cases. A "special case" means what makes the inputs correlated (they are close in what metric?). To give realistic examples, the inputs may be close in Hamming distance, or in edit distance, or they may be numbers that are close in value (say, two noisy observations of the same event).

Questions. I never actually worked on this (but on something related, see next post), so I only have some meta questions. The field is usually sold as being important in sensor networks. Is this "for real", or just theoretical PR? People interested in sensor networks on our side of theory (eg the streaming folks) do not seem too aware or interested in this field. Why?

Athena and TCS

In the ancient days of radical democracy, Αθήνα was run by her people through direct involvement. The school of thought of this democracy held that the people governing themselves could not be wrong. The ἐκκλησία or Ήλιαία could not possibly judge wrongly, even when making impulsive decisions governed by mob mentality. Instead, if one of these institutions reached a decision shown later to be objectively wrong, it was because the people did not posses enough information at the time, or because they were misled.

In the scientific community (especially in theoretical branches), we are running our own experiment in radical democracy. We have essentially no outside constraints, no obligation to finish some research about some topic by some deadline. We are judging our own work, and allocating the resources among ourselves.

So if a PC rejects your paper, or NSF decides not to fund your grant proposal, you should be angry and make creative use of English adjectives over beer. But after this ritual dedicated to Πανάκεια, you should understand that your colleagues rejected your work not because they are ἰδιώτες (pun intended -- idiot originally meant people who were not involved in community decisions, preferring a private life), but because they didn't know better.

So give talks, lobby for your field over beer, write a 5-page introduction to every paper, or you know, maybe start a blog ;)

If the field really consists of narrow-minded people who permanently refuse to understand what you're talking about, you are in the wrong field.

3n+1

A vivid example that the halting problem is hard... :)

Is there some n for which the following algorithm does not halt?
int n;
scanf("%d", &n);
while(n > 1)
n = (n % 2) ? 3*n+1 : n/2;
Of course, this is the famous 3n+1 problem, or the Collatz problem. It dates back to 1937, and many people are betting that it's "the next Fermat", if it is lucky enough to stay unsolved for a few more centuries. Erdős said "Mathematics is not yet ready for such problems."

Like many people I know, I too spent some time in high school trying to tackle it. Other people, who presumably spent quite a bit of time on the problem after high school, showed various interesting things. For example, if a cycle exists, it has size at least 275000 [Lagarias 85].

While this may look like a CS problem, I suggest we do not claim it, and let the mathematicians worry about it for a while.

Monday, September 10, 2007

A word from our sponsors (I)

With the MIT tuition bill arriving, our IBMish paper being accepted to SODA, and Michael running a similar series, I figured it was high time to speak about money and sponsors.

I am pleased to announce that Erik and I sold our soul to Google this year, and in return won a Google Research Award. Our grant proposal was on "Fundamental Data Structures". If this didn't scare Google, they must be pretty open to foundational research, and any myth to the contrary seems outdated :)

Our proposal consisted of a list of some papers we wrote (together and separately), with a 3-line summary for each. We worked quite a bit on putting together the list, with the goal to extract an exclamation from any reader at one point or another. As usual, Erik is the perfect co-writer for any piece of text (it's a pity he won't do blogging...).

The same team (Erik as PI, and yours truly as "Senior Researcher") tried a similar stunt with NSF (different topic though), without success. The comments were a maximal set of strong and contradictory statements from a couple of panels.

My favorites (yes, I know you don't usually make this list public):
  • This is a very well-written proposal.
  • My tendency is to recommend not to fund this proposal in order that the PIs write it better.
  • Even the most optimistic progress will only [..] have little practical application.
  • This is a highly important research direction that [..] will allow to develop more practically feasible geometric algorithms.
  • What are the broader impacts of the proposed activity? None!
  • Faster algorithms [..] might be much faster in practice. In fact, I believe that such improvements might lead to more practical and realistic algorithms.
My general modus operandi is to challenge people, make them think again, and push them to reevaluate their holiest convictions. Polarizing the audience is a clear sign of success. Thus, this seems like a decently successful proposal, with the obvious drawback of not getting funded. I leave the last word to another reviewer comment, which summarizes our approach to writing this grant:
  • Very strong proposal - it's definitely not "more of the same" like many other proposals.

Sunday, September 9, 2007

SODA 2008 (updated)

The list of accepted papers has been posted.

If you get this weird feeling that something is missing, remember that he was on the PC.

On a rough count, there are about 13 data structure papers (cf. 16 last year), matching my prediction that the number of papers is not really affected by PC composition. From what I understand, most PCs are not actively trying to balance accept ratios across fields, so this phenomenon has a nonobvious explanation. (Maybe it is proof that the PC mechanism "works" in some abstract sense.)

I am also happy to report that our paper was accepted.

Saturday, September 8, 2007

Love thy predecessor (IV): Approx Nearest Neighbor

The message of this post is:
Predecessor search is equivalent to (1+ε)-approximate nearest neighbor in any constant dimension d.
Perhaps the most appropriate citation is Timothy Chan [SODA 02] "Closest-point problems simplified on the RAM" (but I think this post sheds more light into the matter).

This is a pretty cool proof, and seems rather underappreciated and under-taught. A nice bonus is that it is based on space-filling curves, which some people on the practical side of nearest neighbor love to talk about (though Piotr believes k-d trees are preferable).

[Incidentally, we'll see that the curve mapping is obtained by bitwise operations, again making the point that when people want to solve some problem better, they couldn't care less about our comparison model.]


Basic idea. Never mind the 1+ε part, let's go for O(1)-approximation. Say we have two numbers, a and b, and want to see if |a-b|<2k. Well, the obvious idea is to compute a-b :) The dumb and less obvious idea is to see if a and b are identical in the first w-k bits. Of course, this does not work (consider 111111 - 011111). But it almost works; e.g., the first w-k-1 bit are identical with constant probability if we add a random shift to the numbers.

In 2-d, the dumb idea starts to look better. We're going to define a mapping F(x,y) that takes the bits of x and y and interleaves them into an integer of 2w bits. Now, to test whether |X-x|<2k and |Y-y|<2k (with some constant approximation), you can perform just one test on F(X,Y) and F(x,y): test whether they start with the same ~2(w-k) bits.

Details. F is reducing a 2-d problem to a 1-d problem. We do not actually care about k: we can map the input set through F, and then search for the predecessor of F(query). Either the predecessor or successor has the longest prefix in common with the query, leading to a near neighbor.

Observe that we are actually heading towards a solution for the L nearest-neighbor problem, because the common prefix in F(.) outputs is sensitive to the largest distance on some coordinate. But that's Okay, since L and L2 are constant factors away for d=constant.

Now what approximation is possible in L? Remember we said we should perform a random translation. In d dimensions, we want all coordinates to have a long common prefix. A bad event with 1/d probability will happen, so on some coordinate i, xi and Xi may have only w-k-lg d initial bits in common. Thus, we should have an O(d) approximation.

Deterministic! To derandomize this and simplify the analysis, we use the pigeonhole principle. There are d dimensions, so d bad events. Then, if we're careful, we can find d+1 fixed translations, such that at least one of them preserves approximate nearest neighbors through the F map, with an approximation of exactly d+1.

For i=0,...,d, translation #i adds to every coordinate floor(i 2w/(d+1)). Simple enough.

To see why it works, first observe that the true nearest neighbor will not straddle a bad approximation barrier for at least one of these shifts (pigeonhole). Then, if it is beat by another point with that shift, that other point cannot be too bad, because the map cannot make points look closer (if there's a difference on one coordinate, it is preserved through interleaving). I won't work out the math here.

Space-filling curves. Alright, but where are these space filling curves, anyway? Well, that's just a nice way to view the algorithm. Note that F is a bijection between the interval [2dw] and the cube [2w]x...x[2w]. Thus, the graph of F-1 (the inverse disassociating the bits) is 1-d curve filling the space.

The algorithm looks like this. In turn, for each of d+1 space-filling curves (differing by translation) do:
  • map the input points to the linear order of the curve, and construct a predecessor structure.
  • given a query, consider its predecessor and successor on the curve as potential near neighbors.
Discussion: approximation. So far we've talked about f(d) approximation. In practice, it is claimed that this is enough. This is not because f(d)-approximation is acceptable, but because practical data sets don't have too many candidates masquerading as the nearest neighbor. If the nearest neighbor is at distance r, there are O(1) points at distance O(r). [[NB: this assumption is a special case of the "bounded doubling dimension" assumption. ]] In this case, we can check all "candidates", by considering more adjacent points on the space-filling curve, beyond the successor and predecessor.

In theory, there are a bunch of things you can do, increasing the space by
1/εO(d). Maybe the most obvious is gridding (storing a big lookup table with a grid around each point). Checking additional points left or right does not work per se (imagine the nearest neighbor at distance 1; you find an approximate neighbor at distance 2, and there are many many point very close to this neighbor). However, it does work if we walk the curve considering only points that are appropriately far from their predecessor. See Timothy's paper for details (though note: since we're only considering O(1) "levels" I don't think you need priority search trees...). The bottom line is that we need to consider about 1/εd additional points.

Discussion: equivalence. It is not hard to observe that predecessor search is a lower bound for (say) 2-approximate nearest neighbor, even in 1-d. We already know that predecessor search is equivalent to finding the longest common prefix. A 2-approximation to the nearest-neighbor essentially gives you somebody with the same longest common prefix (with some O(1) extra work).

Wednesday, September 5, 2007

Top 10 Theory Impact

Rasmus asks for a Top 10 list of theory results that have had a profound impact outside theory. Especially, he wants such a list with results from the past 25 years only.

That is something that we should all be armed with, yet I cannot find a good list. I was hoping to find something on TheoryMatters, but that project seems to have been hijacked by the lens-to-sciences view, and has very little concrete things about what we have already done. While that may be good for NSF and the US administration, we should also be prepared to show that Theory Matters for Computer Science.

(Incidentally, this reminds me of a great quote by Shang-Hua Teng: "When I came to the US, I learned about the notion of discrimination. But before I discovered discrimination against Blacks, Hispanics, etc, I discovered discrimination against theory in CS departments.")

So either:
  • post a link to some similar lists out there

  • or suggest your favorite entries for such a list. Ideally, the top 10 list would be compelling enough to be more like 10 lines than 10 pages with explanations, but maybe I'm asking for too much...

Tuesday, September 4, 2007

Chernoff about STOC/FOCS

There is a constant stream of attacks on the CS publication model -- especially by Math purists, I feel. I have certainly seen a lot of commentary after the recent Koblitz nonsense. Though I have many qualms with our publication model, I strongly believe Math is a bad example of an ideal. Here's my analysis of it.

In Math, it's mainly about the results. Math is old, it knows exactly what it's doing, it has these problems that it's been obsessing about for ages, and its painfully slow intake of any real-world influence makes it very stationary. Thus, what matters most is that they get some new results. The best strategy is to stress young people out, and make them come up with one new and hard contribution. Sure, many fail and go to Wall Street. But rationally for "the good of Mathematics", the results matter and people don't. (There is of course a lot of revering of old Mathematicians with some amazing result, which is again rational because that's exactly how you motivate young people to try to have a similar achievement).

In CS, I believe, it's about the people. We:
  • are a new field, and still sorting out fundamental aspects from nonsense.
  • need people with strong influence and personality to shape and guide a large community in good research directions.
  • are very closely related to one of the world's prime industries. Thus, the reality of this industry is our reality -- think of how theoretical research in parallel computers died with the temporary explosion in performance of single-processor machines, and how research in streaming flourished with dot-com businesses. We cannot and will not be a stationary field, and our in-take of real-world desiderata is high.
Thus, we are not in the business of squeezing one major publication out of PhD students and junior faculty. We are in the business of selecting the best people to stay in the field, and continue to do brilliant and relevant work, by a changing standard of relevance.

And how do you select the best people? Well, Chernoff bounds tell you that you should not rely on one trial. You should rely on several works (publications) to assess how fit the person is for this type of research. If we expect students to publish 3-5 STOC/FOCS papers by the time they graduate, and maybe 5 more as junior faculty, that's already enough to have reasonable confidence in this publication measurement. We then have a pretty good idea of how a person thinks, and what his skillset is.

Monday, September 3, 2007

Love thy predecessor (III): van Emde Boas

The data structure of van Emde Boas is by every definition a classic:
  • it is solving a fundamental problem optimally (see my paper with Mikkel)
  • it is very simple and elegant
  • it has 101 known variants
  • it has been extremely impactful in wildly different research directions (see cache oblivious layout, Tango trees).
  • it is important in practice (see my last post about IP lookup and routing).
  • and of course, it got Knuth excited ;)
    On March 22, 1977, as I was drafting Section 7.1 of The Art of Computer Programming, I read four papers by Peter van Emde Boas that turned out to be more appropriate for Chapter 8 than Chapter 7. I wrote a five-page memo entitled ``Notes on the van Emde Boas construction of priority deques: An instructive use of recursion,'' and sent it to Peter on March 29, with copies also to Bob Tarjan and John Hopcroft.
Next time you teach a general algorithms course, consider teaching it. Below I describe the simplest variant that I know. There are many more complicated ones (and the original description in van Emde Boas is particularly painful).

We want to search predecessors among a set S of n integers of w bits each. We will build a data structure that has O(n) space, and O(lg w) query time. (The "universe" of the input is U=2w, so the bound is often cited as O(lglg U), which seems more impressive and less practical due to the double log; since the structure is neither "impressive", nor impractical, I don't like this notation.)

Another thing that I don't like is that people take the word "integers" literally. It actually means "values stored with precision w". If you read about how floating point numbers are represented in computers (I think there's an IEEE standard), you realize that comparing two floating point numbers is exactly the same as comparing two integers with the same bit representation. So forget that your data was floating point, and simply pretend it was integers as far as predecessor is concerned. [[ Personal note: I made quite an impression among Romanian IOIers when I realized this in the 9th grade :) It allowed everybody to use radix sort for real numbers, and everybody knew radix sort is simpler and faster than comparison-sorting. ]]

Longest common prefix. Predecessor search is equivalent to finding the longest common prefix between the query q and a value is S (where numbers are seen as strings of bits). To realize that, think of the binary trie of depth w that represents our universe. The longest common prefix p is a node in this trie. If q continues with a one, the predecessor is the maximum value in S with prefix p, which can be precomputed and stored at every node. If q continues with a zero, the successor is the minimum value in S with prefix p, and once we have the successor, we can use a linked list to get the predecessor.

Here is my artful depiction of the first case (query continues to the right of the longest common prefix):

Now, we're looking for the (length of the) longest common prefix. What strategy do we know for finding a number? Binary search. So we're going to binary search for the length of the longest common prefix, i.e. binary search on the vertical of the trie.

To do that, we construct a dictionary (aka hash table) in which we store every prefix of every value in S. If we conjecture the longest common prefix has length l, we take the first l bits of q, and see if this prefix exists in the dictionary. If it does, the longest common prefix is at least l, otherwise it is at most l.

The space. What I described above takes space O(nw), because we store every prefix of every number. We can fix that with a simple, yet very effective idea: bucketing.

Group every w consecutive values into a bucket, and let S' be the set of the smallest value in each bucket, |S'| = n/w. We first build the above data structure to search for predecessors among S'. The space is O((n/w) w) = O(n). After finding the predecessor in S', we know the correct bucket where the answer is, and we can binary search inside that bucket in additional time O(lg w).