I do not write about politics, since I could never be as good as some other people who do. A link to fellow theorist Bernard Chazelle, and the captivating essays on his webpage, is in order. (I hesitated posting a link not because I disagree with some of his arguments, which indeed I do, but because you will probably get distracted and not read the rest of my post... But render unto Caesar the things which are Caesar’s, I already said there are many people who write better than me.)

So, I did go to China, and I am back, and now I feel compelled to write a few words.

My friends in Romania ask me how the Chinese can stand their government, and why they are not fighting. Of course, the obvious guess on everybody's lips is panem et circenses -- the Chinese economy has been making strides forward, which allowed the government to keep citizens happy.

But another reason that cannot be ignored lies in the remarkable efficiency of Chinese propaganda among its citizens. A prime accomplishment of the communist machine is the fact that political discussion is so touchy among Chinese inteligentia in the US. The party has managed to convince even truly brilliant people that the state is the party and the party is the state. Then, if somebody challenges Chinese communism, they are attacking the Chinese patriotic sentiment. Love of the Vaterland becomes a shield of the regime.

Indeed, they are careful and proactive about propaganda. In the new age of communication, they have set up the world's best known firewall to keep the citizens away from subversive information. A non-technical person will not be able to get to Wikipedia, or to blogspot, or to any "unofficial" description of Tiananmen 1989, and will not be able to use encryption (ssh).

In fact, brainwashing and censorship were such a durable and effective efforts, that it became hard to get censors! A censor recently let a reference to Tiannmen be published because she thought June 4, 1989 was a mining disaster ^{[Reuters]}.

Tiananmen has become the place where "On this spot in 1989, nothing happened".

Yet despite the brainwashing, there are those who are fighting (and currently losing). Taiwan's situation is constantly becoming worse, since every dollar that the booming Chinese economy earns is another dollar behind the economic blackmail in the One China policy. With Burkina Faso as the largest country to officially recognize Taiwan, their backing sadly reminds us of the grand Coalition of the Willing.

Inside the People's Republic, a cultural fight is led. In good Soviet tradition, the Party's policy was: grab as much land and people as possible, and rebrand everything as Chinese. (Han Chinese, of course.)

Tibet and Xinjiang are just about as Chinese as Finland. But in the past half-century, the Party has managed to bring 7 million or so Han into Xinjiang, a land populated by Turkic people (Uyghur and Kazakhs). This raised the percentage of Han from zeroish to 40%, and will soon make Xinjiang "Chinese", a feat that would have made the Father of Nations, Joseph Stalin, exceedingly proud.

As for Tibet, it may not even be needed to resort to ethnic invasion. Here is the simple story. The Dalai Lama and the Panchen Lama assure a continuity in Tibetan spiritual leadership. When one dies, the other searches for a reincarnation, and "takes care of business," while the young boy found to be the reincarnation grows up, and is educated to be a great Lama. When Panchen Lama died, the Dalai Lama (from exile in India) selected a new Panchen Lama in communication with a search commitee wandering across Tibet.

In 1995, the young boy became a political prisoner at the unlikely age of 6, and he has not been seen since. In its infinite wisdom and understanding of Tibetan culture, the Party chose the next Panchen Lama, and took the boy to Beijing. From there, he has been making periodic statements about the great freedom of religion enjoyed by the Chinese people living in Tibet.

Now, when the Dalai Lama dies, can anyone guess on what criteria the Panchen Lama (Beijing version) will select a reincarnation?

In the mean time, any attempt to cross into India and communicate with the Dalai Lama is a treason to the Chinese state. Here is a ghastly piece of footage caught by a Romanian cameraman while climbing in the Himalayas (which pedictably won some accolades at the Emmy Awards this year).

It shows a Chinese border patrol executing Tibetans who are trying to cross the mountain. Via the New York Times, we learn that China acknowledged the incident, and justified it as self-defense after the monks tried to attack the soldiers.

## Sunday, September 30, 2007

### China (III)

Posted by Mihai at 7:53 AM 8 comments

## Thursday, September 20, 2007

### Slepian, Wolf and relatives (III)

To wrap up this thread of posts, let me discuss some motivation and directions for asymmetric communication channels, described in post II.

In my talk in China, I mentioned the following naive motivation. We have two sensors in a river. Given the average velocity of the flow, we can deduce that the readings of the sensor upstream at some time t, will be correlated with the readings of the sensor downstream at some time t+Δt. The correlation will not be perfect (otherwise we wouldn't use two sensors), but it is enough to compress the message. However, the sensor downstream doesn't know the distribution of his own input, since he doesn't know the other sensor's reading at an earlier time.

This example is simple and intuitive enough that it should work as PR. Here are the more serious motivations that get thrown around:

- sensor networks (in some abstract way)
- internet connections (most ISPs give you faster download than upload; downloading from a satellite is significantly easier than uploading)
- stenography and combating consorship. This comes from the paper: [Feamster, Balazinska, Harfst, Balakrishnan, Karger: Infranet: Circumventing Web Censorship and Surveillance. USENIX Security Symposium 2002]

Moving on to 2., I have not seen examples where this approach might be useful at the scale of realistic ISP links, nor examples of how you might get the server to know the distribution better than you.

About 3., I have a feeling that it is simply accidental cobranding. To get your paper accepted to a practical conference, it helps to use some fancy theoretical work somewhere (maybe a place where you don't even need it), and theoretical work can claim practical importance if somebody uses it in a supposedly practical conference. I am reminded of a funny quote: "I would gladly work on the practical side of academia, but so little of practice actually applies to practice..."

Thus, unfortunately, I do not find the motivations I heard so compelling (outside of theoretical beauty with which I will fully agree).

What is probably needed is more attention towards the EE end of the discipline. Is anybody aware of any EE-work on this problem? In our abstract setting (Bob knows a distribution), any algorithm is ultimately impractical, because nobody works with perfect knowledge of a distribution (a distribution is a big object!). What we need is better modelling of what "understanding a distribution" means, and algorithms based on that. To give the prototypical sily example, the distribution may be (roughly) normal, so Bob only needs to know the mean and standard deviation.

On the lower bound side, I have some rough ideas to prove that the lower bound holds even if the distribution is the trajectory of a Markov chain of polynomal size (as opposed to the distribution in our paper, which really took exponentially many bits to describe).

Posted by Mihai at 12:29 PM 0 comments

## Wednesday, September 19, 2007

### Slepian, Wolf and relatives (II)

I didn't really get the answers I was hoping for in my original post on Slepian-Wolf coding, but let's move to the second part.

This post is about "asymmetric communication channels". Assume Alice receives an input x in {0,1}^{n}, and she wants to communicate it to Bob. If x comes from some distribution D of less with entropy less than n, Alice and Bob should agree to communicate using a Huffman code and send ≤ H(D)+1 bits.

But what if D is only known to Bob, and not to Alice? Can Bob send something intelligent, which is still much less than the 2^{n} description of the distribution, but which allows Alice to communicate efficiently? If you want to hear motivations and applications of this setup, I will discuss them in my next post. Personally, I find this "can you communicate a Huffman code" question sufficiently natural and intriguing.

It is not hard to show that the total communication needs to be at least n. Thus, a nontrivial protocol is only possible in an asymmetric setting, where Bob is allowed to communicate more than Alice.

Upper bounds. The problem was first studied by Adler and Maggs [FOCS'98], who gave a protocol in which Bob sends expected O(n) bits, and Alice sends expected O(H) bits. This is a surprising result: Bob is sending some sketch of the distribution that is logarithmic in its description, but with this sketch Alice can compress her input down to entropy.

There is one catch, though: the protocol uses O(1) rounds in expectation. It does not provide a guarantee for any fixed number of rounds.

With this catch, you can start to get a glimpse of how the protocol might run. Up to constant factors, a distribution with entropy H looks like a distribution putting probability 2^{-(H}^{+1)} on some 2^{H} samples (mass 1/2), then probability 2^{-(H}^{+2)} on other 2^{H} samples (total mass 1/4), probability 2^{-(H}^{+3)} on other 2^{H} samples (mass 1/8) and so on. Bob is going to send a hash function {0,1}^{n} -> {0,1}^{H} which is injective on the 2^{H} samples on the first level, and Alice returns the hash code. Bob sends the sample that maps to that hash code, and if Alice says he was wrong, they continue to the second level (which will only happen with probability 1/2). This continues, reaching level i (and round i) with probability 2^{-}^{O(}^{i}^{)}.

There has been a lot of follow-up work, which tries to reduce the constants in the communication. Some papers have Alice communicating only H+2 bits, for instance. However, I do not like these results because they use roughly H rounds in expectation. In real life, a round of communication is much much more expensive than sending a few extra bits.

In SODA'05, Adler extended the protocol to the case where there are k Alice's, which have samples x_{1}, ..., x_{k} to communicate. These samples come from a joint distribution, known only to Bob.

Lower bounds. Micah Adler came to a problem session organized by Erik, and told us about the problem, highlighting the weird behavior that the number of rounds is geometric. Personally, I was fascinated by the question of whether this was tight.

Unfortunately, the proof eventually required not only some nice ideas, but also a lot of work in balancing everything just right. I spent a few months paying long daily visits to Nick's office, before we finally managed to make it go through. In the slides for my talk in China, I associated this problem with a bas-relief depicting galley slaves.

The proof uses round elimination and message switching (which I will talk about in future posts), but in an unusual context: you need to keep recursing in these smaller and smaller probability spaces.

Eventually, we showed that i rounds are needed with probability 2^{-O(i lg i)}, which is fairly close to optimal, and demonstrates that any fixed number of rounds is not enough. In the good tradition of sharp phase transitions, the lower bound holds even if Bob can communicate an outrageous amount: 2^{n^{1-ε}}.

Our paper appeared in SODA'06. The paper and slides have fairly good intuition for how the bound works.

Posted by Mihai at 11:25 PM 0 comments

### Job Market

Since I have announced that I am graduating this year and looking for a job, people have been asking me who else is on the market. I am in China now, with all the brilliant graduate students, so I took the opportunity to ask who will be on the job market. Here is the list of people who said they will be applying for faculty jobs in the US in 2007. This list does not include people who announced they are only looking for postdocs, or who want jobs outside US.

Alphabetically:

- Nina Balcan
- Andrej Bogdanov
- Mark Braverman (maybe)
- Constantinos Daskalakis
- Nick Harvey
- Philipp Hertel (maybe)
- Adriana Karagiozova (maybe)
- Anup Rao (maybe)
- Paul Valiant
- Ryan Williams
- Amir Yehudayoff (maybe)

I have heard (rumours) about the following people:

Posted by Mihai at 2:01 AM 3 comments

## Monday, September 17, 2007

### China (II)

As announced, I am attending China Theory Week in Beijing.

Sadly, this puts me behind the Great Firewall of China, which hates blogspot. I will wait until I leave China to express my opinion on this :) Lots and lots of thanks to Erik and Elad Verbin (currently postdoc here), for teaching me how to get around the firewall.

My talk here was: "*Round Elimination — A Proof, A Concept, A Direction*."

Round elimination, which can be traced back to [Ajtai 1989], has proven to be one of the most useful tools in communication complexity, playing a role similar to PCPs in approximation algorithms. We describe the history of its many variants, including the simplest known proof, and the most illustrative applications. We also describe possible future work, including conjectures that would bring significant progress in applications.Slides are available in pptx or exported as ppt. The PDF didn't look decent enough to be posted; sorry.

Based on literature survey, and joint publications with Micah Adler, Amit Chakrabarti, Erik Demaine, Sudipto Guha, TS Jayram, Nick Harvey, and Mikkel Thorup.

If you know my usual style of giving talks, you know the slides are somewhat unorthodox. Since I will hopefully give a nonzero number of job talks in the spring, I am especially curious what readers think of the advice on slide 2 :)

Learn from Holywood: sex and violence are the key to a successful job talk.

Posted by Mihai at 9:27 PM 2 comments

## Friday, September 14, 2007

### Slepian, Wolf and relatives (I)

The [Splepian-Wolf'73] coding scheme is a famous result in coding theory, that inspired the field of distributed source coding. I guess this topic would be better for Michael's blog, but I am curious to ask some questions, and I want to describe a slightly related beast in the next post.

Review. You have k clients, and a server. The clients have some inputs x_{1}, ..., x_{k}, each of n bits, and they want to communicate all inputs to the server. Now assume these inputs come from some joint distribution with entropy much less than nk. Can one use similarly little communication?

If the distribution is rectangular, i.e. the x_{i}'s are independent, each client i can perform source coding (say, Huffman coding) based on his own distribution, achieving H(x_{i}) communication. With dependencies, if clients broadcast their messages in order, each client can perform source coding based on his distribution conditioned on the previous messages. This achieves optimal total communication, since Sum_{i} H(x_{i} | x_{1}, ..., x_{i-1}) = H(x_{1}, ..., x_{k}).

Slepian and Wolf showed that it is possible to achieve optimal communication even if all clients communicate one message simmultaneously, and the messages are only seen by the server.

In fact, there is a whole simplex of possible scenarios for optimal communication. Consider k=2 to make notation easy. Then, it is possible for client 1 to send m_{1} bits, and client to to send m_{2} bits as long as:

- m
_{1}+m_{2}≥ H(x_{1}, x_{2}) - m
_{1}≥ H(x_{1}| x_{2}) - m
_{2}≥ H(x_{2}| x_{1})

Since then, people have developed efficently encodable and decodable distributed codes for all sorts of important special cases. A "special case" means what makes the inputs correlated (they are close in what metric?). To give realistic examples, the inputs may be close in Hamming distance, or in edit distance, or they may be numbers that are close in value (say, two noisy observations of the same event).

Questions. I never actually worked on this (but on something related, see next post), so I only have some meta questions. The field is usually sold as being important in sensor networks. Is this "for real", or just theoretical PR? People interested in sensor networks on our side of theory (eg the streaming folks) do not seem too aware or interested in this field. Why?

Posted by Mihai at 3:36 PM 4 comments

### Athena and TCS

In the ancient days of radical democracy, Αθήνα was run by her people through direct involvement. The school of thought of this democracy held that the people governing themselves could not be wrong. The *ἐκκλησία *or Ήλιαία could not possibly judge wrongly, even when making impulsive decisions governed by mob mentality. Instead, if one of these institutions reached a decision shown later to be objectively wrong, it was because the people did not posses enough information at the time, or because they were misled.

In the scientific community (especially in theoretical branches), we are running our own experiment in radical democracy. We have essentially no outside constraints, no obligation to finish some research about some topic by some deadline. We are judging our own work, and allocating the resources among ourselves.

So if a PC rejects your paper, or NSF decides not to fund your grant proposal, you should be angry and make creative use of English adjectives over beer. But after this ritual dedicated to Πανάκεια, you should understand that your colleagues rejected your work not because they are ἰδιώτες (pun intended -- idiot originally meant people who were not involved in community decisions, preferring a private life), but because they didn't know better.

So give talks, lobby for your field over beer, write a 5-page introduction to every paper, or you know, maybe start a blog ;)

If the field really consists of narrow-minded people who permanently refuse to understand what you're talking about, you are in the wrong field.

Posted by Mihai at 3:52 AM 2 comments

### 3n+1

A vivid example that the halting problem is hard... :)

Is there some n for which the following algorithm does not halt?

int n;Of course, this is the famous 3n+1 problem, or the Collatz problem. It dates back to 1937, and many people are betting that it's "the next Fermat", if it is lucky enough to stay unsolved for a few more centuries. Erdős said "Mathematics is not yet ready for such problems."

scanf("%d", &n);

while(n > 1)

n = (n % 2) ? 3*n+1 : n/2;

Like many people I know, I too spent some time in high school trying to tackle it. Other people, who presumably spent quite a bit of time on the problem after high school, showed various interesting things. For example, if a cycle exists, it has size at least 275000 [Lagarias 85].

While this may look like a CS problem, I suggest we do not claim it, and let the mathematicians worry about it for a while.

Posted by Mihai at 3:04 AM 1 comments

## Monday, September 10, 2007

### A word from our sponsors (I)

With the MIT tuition bill arriving, our IBMish paper being accepted to SODA, and Michael running a similar series, I figured it was high time to speak about money and sponsors.

I am pleased to announce that Erik and I sold our soul to Google this year, and in return won a Google Research Award. Our grant proposal was on "Fundamental Data Structures". If this didn't scare Google, they must be pretty open to foundational research, and any myth to the contrary seems outdated :)

Our proposal consisted of a list of some papers we wrote (together and separately), with a 3-line summary for each. We worked quite a bit on putting together the list, with the goal to extract an exclamation from any reader at one point or another. As usual, Erik is the perfect co-writer for any piece of text (it's a pity he won't do blogging...).

The same team (Erik as PI, and yours truly as "Senior Researcher") tried a similar stunt with NSF (different topic though), without success. The comments were a maximal set of strong and contradictory statements from a couple of panels.

My favorites (yes, I know you don't usually make this list public):

- This is a very well-written proposal.
- My tendency is to recommend not to fund this proposal in order that the PIs write it better.
- Even the most optimistic progress will only [..] have little practical application.
- This is a highly important research direction that [..] will allow to develop more practically feasible geometric algorithms.
- What are the broader impacts of the proposed activity? None!
- Faster algorithms [..] might be much faster in practice. In fact, I believe that such improvements might lead to more practical and realistic algorithms.

- Very strong proposal - it's definitely not "more of the same" like many other proposals.

Posted by Mihai at 6:54 PM 2 comments

## Sunday, September 9, 2007

### SODA 2008 (updated)

The list of accepted papers has been posted.

If you get this weird feeling that something is missing, remember that he was on the PC.

On a rough count, there are about 13 data structure papers (cf. 16 last year), matching my prediction that the number of papers is not really affected by PC composition. From what I understand, most PCs are not actively trying to balance accept ratios across fields, so this phenomenon has a nonobvious explanation. (Maybe it is proof that the PC mechanism "works" in some abstract sense.)

I am also happy to report that our paper was accepted.

Posted by Mihai at 4:13 PM 2 comments

## Saturday, September 8, 2007

### Love thy predecessor (IV): Approx Nearest Neighbor

The message of this post is:

Predecessor search is equivalent to (1+ε)-approximate nearest neighbor in any constant dimension d.Perhaps the most appropriate citation is Timothy Chan [SODA 02] "Closest-point problems simplified on the RAM" (but I think this post sheds more light into the matter).

### This is a pretty cool proof, and seems rather underappreciated and under-taught. A nice bonus is that it is based on space-filling curves, which some people on the practical side of nearest neighbor love to talk about (though Piotr believes k-d trees are preferable).

[Incidentally, we'll see that the curve mapping is obtained by bitwise operations, again making the point that when people want to solve some problem better, they couldn't care less about our comparison model.]

Basic idea. Never mind the 1+ε part, let's go for O(1)-approximation. Say we have two numbers, a and b, and want to see if |a-b|<2

^{k}. Well, the obvious idea is to compute a-b :) The dumb and less obvious idea is to see if a and b are identical in the first w-k bits. Of course, this does not work (consider 111111 - 011111). But it almost works; e.g., the first w-k-1 bit are identical with constant probability if we add a random shift to the numbers.

In 2-d, the dumb idea starts to look better. We're going to define a mapping F(x,y) that takes the bits of x and y and interleaves them into an integer of 2w bits. Now, to test whether |X-x|<2

^{k}and |Y-y|<2

^{k}(with some constant approximation), you can perform just one test on F(X,Y) and F(x,y): test whether they start with the same ~2(w-k) bits.

Details. F is reducing a 2-d problem to a 1-d problem. We do not actually care about k: we can map the input set through F, and then search for the predecessor of F(query). Either the predecessor or successor has the longest prefix in common with the query, leading to a near neighbor.

Observe that we are actually heading towards a solution for the L

_{∞}nearest-neighbor problem, because the common prefix in F(.) outputs is sensitive to the largest distance on some coordinate. But that's Okay, since L

_{∞}and L

_{2}are constant factors away for d=constant.

Now what approximation is possible in L

_{∞}? Remember we said we should perform a random translation. In d dimensions, we want all coordinates to have a long common prefix. A bad event with 1/d probability will happen, so on some coordinate i, x

_{i}and X

_{i}may have only w-k-lg d initial bits in common. Thus, we should have an O(d) approximation.

Deterministic! To derandomize this and simplify the analysis, we use the pigeonhole principle. There are d dimensions, so d bad events. Then, if we're careful, we can find d+1 fixed translations, such that at least one of them preserves approximate nearest neighbors through the F map, with an approximation of exactly d+1.

For i=0,...,d, translation #i adds to every coordinate floor(i 2

^{w}/(d+1)). Simple enough.

To see why it works, first observe that the true nearest neighbor will not straddle a bad approximation barrier for at least one of these shifts (pigeonhole). Then, if it is beat by another point with that shift, that other point cannot be too bad, because the map cannot make points look closer (if there's a difference on one coordinate, it is preserved through interleaving). I won't work out the math here.

Space-filling curves. Alright, but where are these space filling curves, anyway? Well, that's just a nice way to view the algorithm. Note that F is a bijection between the interval [2

^{dw}] and the cube [2

^{w}]x...x[2

^{w}]. Thus, the graph of F

^{-1}(the inverse disassociating the bits) is 1-d curve filling the space.

The algorithm looks like this. In turn, for each of d+1 space-filling curves (differing by translation) do:

- map the input points to the linear order of the curve, and construct a predecessor structure.
- given a query, consider its predecessor and successor on the curve as potential near neighbors.

In theory, there are a bunch of things you can do, increasing the space by 1/ε

^{O(d)}. Maybe the most obvious is gridding (storing a big lookup table with a grid around each point). Checking additional points left or right does not work per se (imagine the nearest neighbor at distance 1; you find an approximate neighbor at distance 2, and there are many many point very close to this neighbor). However, it does work if we walk the curve considering only points that are appropriately far from their predecessor. See Timothy's paper for details (though note: since we're only considering O(1) "levels" I don't think you need priority search trees...). The bottom line is that we need to consider about 1/ε

^{d}additional points.

Discussion: equivalence. It is not hard to observe that predecessor search is a lower bound for (say) 2-approximate nearest neighbor, even in 1-d. We already know that predecessor search is equivalent to finding the longest common prefix. A 2-approximation to the nearest-neighbor essentially gives you somebody with the same longest common prefix (with some O(1) extra work).

Posted by Mihai at 3:03 PM 0 comments

## Wednesday, September 5, 2007

### Top 10 Theory Impact

Rasmus asks for a Top 10 list of theory results that have had a profound impact outside theory. Especially, he wants such a list with results from the past 25 years only.

That is something that we should all be armed with, yet I cannot find a good list. I was hoping to find something on TheoryMatters, but that project seems to have been hijacked by the lens-to-sciences view, and has very little concrete things about what we have already done. While that may be good for NSF and the US administration, we should also be prepared to show that Theory Matters for Computer Science.

(Incidentally, this reminds me of a great quote by Shang-Hua Teng: "When I came to the US, I learned about the notion of discrimination. But before I discovered discrimination against Blacks, Hispanics, etc, I discovered discrimination against theory in CS departments.")

So either:

- post a link to some similar lists out there
- or suggest your favorite entries for such a list. Ideally, the top 10 list would be compelling enough to be more like 10 lines than 10 pages with explanations, but maybe I'm asking for too much...

Posted by Mihai at 11:57 AM 9 comments

## Tuesday, September 4, 2007

### Chernoff about STOC/FOCS

There is a constant stream of attacks on the CS publication model -- especially by Math purists, I feel. I have certainly seen a lot of commentary after the recent Koblitz nonsense. Though I have many qualms with our publication model, I strongly believe Math is a bad example of an ideal. Here's my analysis of it.

In Math, it's mainly about the results. Math is old, it knows exactly what it's doing, it has these problems that it's been obsessing about for ages, and its painfully slow intake of any real-world influence makes it very stationary. Thus, what matters most is that they get some new results. The best strategy is to stress young people out, and make them come up with one new and hard contribution. Sure, many fail and go to Wall Street. But rationally for "the good of Mathematics", the results matter and people don't. (There is of course a lot of revering of old Mathematicians with some amazing result, which is again rational because that's exactly how you motivate young people to try to have a similar achievement).

In CS, I believe, it's about the people. We:

- are a new field, and still sorting out fundamental aspects from nonsense.
- need people with strong influence and personality to shape and guide a large community in good research directions.
- are very closely related to one of the world's prime industries. Thus, the reality of this industry is our reality -- think of how theoretical research in parallel computers died with the temporary explosion in performance of single-processor machines, and how research in streaming flourished with dot-com businesses. We cannot and will not be a stationary field, and our in-take of real-world desiderata is high.

And how do you select the best people? Well, Chernoff bounds tell you that you should not rely on one trial. You should rely on several works (publications) to assess how fit the person is for this type of research. If we expect students to publish 3-5 STOC/FOCS papers by the time they graduate, and maybe 5 more as junior faculty, that's already enough to have reasonable confidence in this publication measurement. We then have a pretty good idea of how a person thinks, and what his skillset is.

Posted by Mihai at 2:48 PM 4 comments

## Monday, September 3, 2007

### Love thy predecessor (III): van Emde Boas

The data structure of van Emde Boas is by every definition a classic:

- it is solving a fundamental problem optimally (see my paper with Mikkel)
- it is very simple and elegant
- it has 101 known variants
- it has been extremely impactful in wildly different research directions (see cache oblivious layout, Tango trees).
- it is important in practice (see my last post about IP lookup and routing).
- and of course, it got Knuth excited ;)
On March 22, 1977, as I was drafting Section 7.1 of The Art of Computer Programming, I read four papers by Peter van Emde Boas that turned out to be more appropriate for Chapter 8 than Chapter 7. I wrote a five-page memo entitled ``Notes on the van Emde Boas construction of priority deques: An instructive use of recursion,'' and sent it to Peter on March 29, with copies also to Bob Tarjan and John Hopcroft.

We want to search predecessors among a set S of n integers of w bits each. We will build a data structure that has O(n) space, and O(lg w) query time. (The "universe" of the input is U=2

^{w}, so the bound is often cited as O(lglg U), which seems more impressive and less practical due to the double log; since the structure is neither "impressive", nor impractical, I don't like this notation.)

Another thing that I don't like is that people take the word "integers" literally. It actually means "values stored with precision w". If you read about how floating point numbers are represented in computers (I think there's an IEEE standard), you realize that comparing two floating point numbers is exactly the same as comparing two integers with the same bit representation. So forget that your data was floating point, and simply pretend it was integers as far as predecessor is concerned. [[ Personal note: I made quite an impression among Romanian IOIers when I realized this in the 9th grade :) It allowed everybody to use radix sort for real numbers, and everybody knew radix sort is simpler and faster than comparison-sorting. ]]

Longest common prefix. Predecessor search is equivalent to finding the longest common prefix between the query q and a value is S (where numbers are seen as strings of bits). To realize that, think of the binary trie of depth w that represents our universe. The longest common prefix p is a node in this trie. If q continues with a one, the predecessor is the maximum value in S with prefix p, which can be precomputed and stored at every node. If q continues with a zero, the successor is the minimum value in S with prefix p, and once we have the successor, we can use a linked list to get the predecessor.

Here is my artful depiction of the first case (query continues to the right of the longest common prefix):

Now, we're looking for the (length of the) longest common prefix. What strategy do we know for finding a number? Binary search. So we're going to binary search for the length of the longest common prefix, i.e. binary search on the vertical of the trie.

To do that, we construct a dictionary (aka hash table) in which we store every prefix of every value in S. If we conjecture the longest common prefix has length l, we take the first l bits of q, and see if this prefix exists in the dictionary. If it does, the longest common prefix is at least l, otherwise it is at most l.

The space. What I described above takes space O(nw), because we store every prefix of every number. We can fix that with a simple, yet very effective idea: bucketing.

Group every w consecutive values into a bucket, and let S' be the set of the smallest value in each bucket, |S'| = n/w. We first build the above data structure to search for predecessors among S'. The space is O((n/w) w) = O(n). After finding the predecessor in S', we know the correct bucket where the answer is, and we can binary search inside that bucket in additional time O(lg w).

Posted by Mihai at 5:41 PM 5 comments