Saturday, September 8, 2007

Love thy predecessor (IV): Approx Nearest Neighbor

The message of this post is:

Predecessor search is equivalent to (1+ε)-approximate nearest neighbor in any constant dimension d.
Perhaps the most appropriate citation is Timothy Chan [SODA 02], "Closest-point problems simplified on the RAM" (but I think this post sheds more light on the matter).

This is a pretty cool proof, and seems rather underappreciated and under-taught. A nice bonus is that it is based on space-filling curves, which some people on the practical side of nearest neighbor love to talk about (though Piotr believes k-d trees are preferable).

[Incidentally, we'll see that the curve mapping is obtained by bitwise operations, again making the point that when people want to solve some problem better, they couldn't care less about our comparison model.]


Basic idea. Never mind the 1+ε part, let's go for O(1)-approximation. Say we have two numbers, a and b, and want to see if |a-b| < 2^k. Well, the obvious idea is to compute a-b :) The dumb and less obvious idea is to see if a and b are identical in the first w-k bits. Of course, this does not work (consider 111111 - 011111). But it almost works; e.g., the first w-k-1 bits are identical with constant probability if we add a random shift to the numbers.

In 2-d, the dumb idea starts to look better. We're going to define a mapping F(x,y) that takes the bits of x and y and interleaves them into an integer of 2w bits. Now, to test whether |X-x| < 2^k and |Y-y| < 2^k (with some constant approximation), you can perform just one test on F(X,Y) and F(x,y): test whether they start with the same ~2(w-k) bits.
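
To make the dumb idea concrete, here is a tiny Python sketch (my own illustration, not code from Chan's paper) of the interleaving map and the prefix test for 2-d points with w-bit coordinates. As in 1-d, the test can fail without a shift of the coordinates, which is discussed below.

    # Interleave the bits of x and y into one 2w-bit integer (a Morton / Z-order code).
    def interleave(x, y, w):
        z = 0
        for i in reversed(range(w)):          # from the most significant bit down
            z = (z << 1) | ((x >> i) & 1)
            z = (z << 1) | ((y >> i) & 1)
        return z

    # Heuristic test for max(|X-x|, |Y-y|) < 2^k: do the interleaved codes
    # agree on their first 2*(w-k) bits?
    def same_prefix(a, b, bits, total_bits):
        return (a >> (total_bits - bits)) == (b >> (total_bits - bits))

    w, k = 8, 3
    X, Y, x, y = 100, 37, 102, 35             # hypothetical points, within 2^3 on each coordinate
    print(same_prefix(interleave(X, Y, w), interleave(x, y, w), 2 * (w - k), 2 * w))   # True
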

Details. F is reducing a 2-d problem to a 1-d problem. We do not actually care about k: we can map the input set through F, and then search for the predecessor of F(query). Either the predecessor or successor has the longest prefix in common with the query, leading to a near neighbor.

Observe that we are actually heading towards a solution for the L∞ nearest-neighbor problem, because the common prefix in F(.) outputs is sensitive to the largest distance on some coordinate. But that's okay, since L∞ and L2 are constant factors away for d=constant.

Now what approximation is possible in L∞? Remember we said we should perform a random translation. In d dimensions, we want all coordinates to have a long common prefix. A bad event of probability ~1/d will happen on some coordinate i: x_i and X_i may have only w-k-lg d initial bits in common. Thus, we should expect an O(d) approximation.

Deterministic! To derandomize this and simplify the analysis, we use the pigeonhole principle. There are d dimensions, so d bad events. Then, if we're careful, we can find d+1 fixed translations, such that at least one of them preserves approximate nearest neighbors through the F map, with an approximation of exactly d+1.

For i=0,...,d, translation #i adds to every coordinate floor(i·2^w/(d+1)). Simple enough.

To see why it works, first observe that the true nearest neighbor will not straddle a bad approximation barrier for at least one of these shifts (pigeonhole). Then, if it is beaten by another point under that shift, that other point cannot be too bad, because the map cannot make points look closer (if there's a difference on one coordinate, it is preserved through interleaving). I won't work out the math here.

Space-filling curves. Alright, but where are these space-filling curves, anyway? Well, that's just a nice way to view the algorithm. Note that F is a bijection between the interval [2^(dw)] and the cube [2^w] x ... x [2^w]. Thus, the graph of F^(-1) (the inverse, which de-interleaves the bits) is a 1-d curve filling the space.

The algorithm looks like this (a short code sketch follows the list). In turn, for each of the d+1 space-filling curves (differing by translation), do:
  • map the input points to the linear order of the curve, and construct a predecessor structure.
  • given a query, consider its predecessor and successor on the curve as potential near neighbors.
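
Here is a minimal Python sketch of the whole scheme, under the assumptions above: L∞ distance, the d+1 shifts floor(i·2^w/(d+1)), and a plain sorted list standing in for the fast predecessor structure. This is my own illustration, not the engineered solution from Chan's paper.

    from bisect import bisect_right

    def interleave(point, bits):
        # The map F: interleave the bits of all coordinates into one integer.
        z = 0
        for i in reversed(range(bits)):
            for c in point:
                z = (z << 1) | ((c >> i) & 1)
        return z

    def linf(p, q):
        return max(abs(a - b) for a, b in zip(p, q))

    class ApproxNN:
        def __init__(self, points, d, w):
            self.d, self.w = d, w
            # Translation #i adds floor(i * 2^w / (d+1)) to every coordinate.
            self.shifts = [(i << w) // (d + 1) for i in range(d + 1)]
            self.curves = []
            for s in self.shifts:
                # Shifted coordinates need w+1 bits.  The sorted list of curve
                # positions stands in for a fast predecessor structure.
                keyed = sorted((interleave([c + s for c in p], w + 1), p) for p in points)
                self.curves.append(keyed)

        def query(self, q):
            best = None
            for s, curve in zip(self.shifts, self.curves):
                key = interleave([c + s for c in q], self.w + 1)
                j = bisect_right(curve, (key, tuple(q)))
                # Candidates: the predecessor and the successor of q on this curve.
                for _, cand in curve[max(j - 1, 0):j + 1]:
                    if best is None or linf(cand, q) < linf(best, q):
                        best = cand
            return best    # an O(d)-approximate nearest neighbor in L_infinity

    points = [(3, 7), (50, 60), (100, 200)]
    print(ApproxNN(points, d=2, w=8).query((48, 62)))    # prints (50, 60)
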
Discussion: approximation. So far we've talked about f(d) approximation. In practice, it is claimed that this is enough. This is not because f(d)-approximation is acceptable, but because practical data sets don't have too many candidates masquerading as the nearest neighbor. If the nearest neighbor is at distance r, there are O(1) points at distance O(r). [[NB: this assumption is a special case of the "bounded doubling dimension" assumption. ]] In this case, we can check all "candidates", by considering more adjacent points on the space-filling curve, beyond the successor and predecessor.

In theory, there are a bunch of things you can do, increasing the space by (1/ε)^O(d). Maybe the most obvious is gridding (storing a big lookup table with a grid around each point). Checking additional points left or right does not work per se (imagine the nearest neighbor at distance 1; you find an approximate neighbor at distance 2, and there are many, many points very close to this neighbor). However, it does work if we walk the curve considering only points that are appropriately far from their predecessor. See Timothy's paper for details (though note: since we're only considering O(1) "levels", I don't think you need priority search trees...). The bottom line is that we need to consider about 1/ε^d additional points.

Discussion: equivalence. It is not hard to observe that predecessor search is a lower bound for (say) 2-approximate nearest neighbor, even in 1-d. We already know that predecessor search is equivalent to finding the longest common prefix. A 2-approximation to the nearest-neighbor essentially gives you somebody with the same longest common prefix (with some O(1) extra work).

Wednesday, September 5, 2007

Top 10 Theory Impact

Rasmus asks for a Top 10 list of theory results that have had a profound impact outside theory. Especially, he wants such a list with results from the past 25 years only.

That is something that we should all be armed with, yet I cannot find a good list. I was hoping to find something on TheoryMatters, but that project seems to have been hijacked by the lens-to-sciences view, and has very few concrete things about what we have already done. While that may be good for NSF and the US administration, we should also be prepared to show that Theory Matters for Computer Science.

(Incidentally, this reminds me of a great quote by Shang-Hua Teng: "When I came to the US, I learned about the notion of discrimination. But before I discovered discrimination against Blacks, Hispanics, etc, I discovered discrimination against theory in CS departments.")

So either:

  • post a link to some similar lists out there

  • or suggest your favorite entries for such a list. Ideally, the top 10 list would be compelling enough to be more like 10 lines than 10 pages with explanations, but maybe I'm asking for too much...

Tuesday, September 4, 2007

Chernoff about STOC/FOCS

There is a constant stream of attacks on the CS publication model -- especially by Math purists, I feel. I have certainly seen a lot of commentary after the recent Koblitz nonsense. Though I have many qualms with our publication model, I strongly believe Math is a bad example of an ideal. Here's my analysis of it.

In Math, it's mainly about the results. Math is old, it knows exactly what it's doing, it has these problems that it's been obsessing about for ages, and its painfully slow intake of any real-world influence makes it very stationary. Thus, what matters most is that they get some new results. The best strategy is to stress young people out, and make them come up with one new and hard contribution. Sure, many fail and go to Wall Street. But rationally for "the good of Mathematics", the results matter and people don't. (There is of course a lot of revering of old Mathematicians with some amazing result, which is again rational because that's exactly how you motivate young people to try to have a similar achievement).

In CS, I believe, it's about the people. We:

  • are a new field, and still sorting out fundamental aspects from nonsense.
  • need people with strong influence and personality to shape and guide a large community in good research directions.
  • are very closely related to one of the world's prime industries. Thus, the reality of this industry is our reality -- think of how theoretical research in parallel computers died with the temporary explosion in performance of single-processor machines, and how research in streaming flourished with dot-com businesses. We cannot and will not be a stationary field, and our intake of real-world desiderata is high.
Thus, we are not in the business of squeezing one major publication out of PhD students and junior faculty. We are in the business of selecting the best people to stay in the field, and continue to do brilliant and relevant work, by a changing standard of relevance.

And how do you select the best people? Well, Chernoff bounds tell you that you should not rely on one trial. You should rely on several works (publications) to assess how fit the person is for this type of research. If we expect students to publish 3-5 STOC/FOCS papers by the time they graduate, and maybe 5 more as junior faculty, that's already enough to have reasonable confidence in this publication measurement. We then have a pretty good idea of how a person thinks, and what his skillset is.

Monday, September 3, 2007

Love thy predecessor (III): van Emde Boas

The data structure of van Emde Boas is by every definition a classic:

  • it is solving a fundamental problem optimally (see my paper with Mikkel)
  • it is very simple and elegant
  • it has 101 known variants
  • it has been extremely impactful in wildly different research directions (see cache oblivious layout, Tango trees).
  • it is important in practice (see my last post about IP lookup and routing).
  • and of course, it got Knuth excited ;)
    On March 22, 1977, as I was drafting Section 7.1 of The Art of Computer Programming, I read four papers by Peter van Emde Boas that turned out to be more appropriate for Chapter 8 than Chapter 7. I wrote a five-page memo entitled ``Notes on the van Emde Boas construction of priority deques: An instructive use of recursion,'' and sent it to Peter on March 29, with copies also to Bob Tarjan and John Hopcroft.
Next time you teach a general algorithms course, consider teaching it. Below I describe the simplest variant that I know. There are many more complicated ones (and the original description in van Emde Boas is particularly painful).

We want to search predecessors among a set S of n integers of w bits each. We will build a data structure that has O(n) space, and O(lg w) query time. (The "universe" of the input is U = 2^w, so the bound is often cited as O(lg lg U), which seems more impressive and less practical due to the double log; since the structure is neither "impressive", nor impractical, I don't like this notation.)

Another thing that I don't like is that people take the word "integers" literally. It actually means "values stored with precision w". If you read about how floating point numbers are represented in computers (I think there's an IEEE standard), you realize that comparing two floating point numbers is exactly the same as comparing two integers with the same bit representation. So forget that your data was floating point, and simply pretend it was integers as far as predecessor is concerned. [[ Personal note: I made quite an impression among Romanian IOIers when I realized this in the 9th grade :) It allowed everybody to use radix sort for real numbers, and everybody knew radix sort is simpler and faster than comparison-sorting. ]]
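
For the curious, the trick looks like this in Python (my own illustration, using the IEEE 754 double format; as written it only handles nonnegative values, and negative floats need the usual sign-bit fix-up):

    import struct

    def float_bits(x):
        # Reinterpret the 8 bytes of an IEEE 754 double as an unsigned 64-bit integer.
        return struct.unpack('>Q', struct.pack('>d', x))[0]

    a, b = 2.5, 2.7
    # For nonnegative doubles, comparing the bit patterns as integers
    # gives the same order as comparing the floats themselves.
    print((a < b) == (float_bits(a) < float_bits(b)))    # True
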

Longest common prefix. Predecessor search is equivalent to finding the longest common prefix between the query q and a value in S (where numbers are seen as strings of bits). To realize that, think of the binary trie of depth w that represents our universe. The longest common prefix p is a node in this trie. If q continues with a one, the predecessor is the maximum value in S with prefix p, which can be precomputed and stored at every node. If q continues with a zero, the successor is the minimum value in S with prefix p, and once we have the successor, we can use a linked list to get the predecessor.

Here is my artful depiction of the first case (query continues to the right of the longest common prefix):

Now, we're looking for the (length of the) longest common prefix. What strategy do we know for finding a number? Binary search. So we're going to binary search for the length of the longest common prefix, i.e. binary search on the vertical of the trie.

To do that, we construct a dictionary (aka hash table) in which we store every prefix of every value in S. If we conjecture the longest common prefix has length l, we take the first l bits of q, and see if this prefix exists in the dictionary. If it does, the longest common prefix is at least l, otherwise it is at most l.
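
A minimal Python sketch of this binary search (my own illustration; a plain Python set plays the dictionary, and the min/max/linked-list bookkeeping of the previous paragraph is omitted):

    def build_prefix_table(S, w):
        # Store every prefix (keyed by its length) of every value in S.  Space O(n*w) for now.
        prefixes = set()
        for x in S:
            for l in range(w + 1):
                prefixes.add((l, x >> (w - l)))
        return prefixes

    def lcp_length(q, prefixes, w):
        # Binary search for the length of the longest prefix of q present in the table:
        # O(lg w) dictionary lookups.
        lo, hi = 0, w      # invariant: the prefix of length lo exists; the answer is at most hi
        while lo < hi:
            mid = (lo + hi + 1) // 2
            if (mid, q >> (w - mid)) in prefixes:
                lo = mid
            else:
                hi = mid - 1
        return lo

    S = [0b0010, 0b0111, 0b1100]
    table = build_prefix_table(S, 4)
    print(lcp_length(0b0100, table, 4))   # 2: the longest common prefix with S is "01"
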

The space. What I described above takes space O(nw), because we store every prefix of every number. We can fix that with a simple, yet very effective idea: bucketing.

Group every w consecutive values into a bucket, and let S' be the set of the smallest value in each bucket, |S'| = n/w. We first build the above data structure to search for predecessors among S'. The space is O((n/w) w) = O(n). After finding the predecessor in S', we know the correct bucket where the answer is, and we can binary search inside that bucket in additional time O(lg w).
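
And the bucketing wrapper, again as a rough sketch (a sorted list of bucket minima stands in here for the O(lg w) structure above; the point is the two-level search and the space accounting):

    from bisect import bisect_right

    def build(S, w):
        S = sorted(S)
        buckets = [S[i:i + w] for i in range(0, len(S), w)]   # w consecutive values per bucket
        mins = [b[0] for b in buckets]       # S': one representative per bucket, |S'| = n/w
        # The prefix structure of the previous sketch would be built on `mins`,
        # using O((n/w) * w) = O(n) space; a sorted list stands in for it here.
        return mins, buckets

    def predecessor(q, mins, buckets):
        i = bisect_right(mins, q) - 1        # which bucket the answer lives in
        if i < 0:
            return None                      # q is smaller than everything in S
        j = bisect_right(buckets[i], q) - 1  # O(lg w) binary search inside the bucket
        return buckets[i][j]

    mins, buckets = build([5, 9, 14, 22, 31, 47, 58, 61], w=4)
    print(predecessor(30, mins, buckets))    # 22
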

Friday, August 31, 2007

Led Zeppelin

I have typically not been alarmed by the prospects of oil running out. Europe has shown that railroads can work, so personally I couldn't care less if driving a car becomes 10x more expensive. The only issue is --- I really want my air travel. And I'm sure most researchers, living the conference life, will agree.

Now, I have always assumed that balloons will provide a solution. Sure, we'll take it slower, and spend more time in the air (never mind those students), but we'll continue to get there. Unfortunately, it seems I was wrong...

Zeppelin NT (Neue Technologie) seems to be the most efficient commercially available "balloon" today. It is a semi-rigid design, German-engineered to optimize fuel efficiency and speed. It can:

  • carry about 6 passengers in addition to 1 pilot, if you want to go up to 2000m~6600ft (which you do, because there are mountains on most interesting routes).

  • cruise at 80 km/h ~ 50mph, relative to the wind
To understand that this thing is really built with great care, consider that it costs about $7M apiece (around 100x more than the high-end blimps in American skies).

Now, let's say you want to travel BOS->SFO to collaborate with the Berkeley people (a quick sanity check of the numbers follows the list).
  • the distance is 4339 km
  • ... which takes you 54 hours (which has the positive effect that you're actually going to read their papers before asking them questions). NB: this assumes no wind, so you may be able to do slightly better by finding a nice altitude with good wind.

  • according to these technical specifications, the thing needs 60 kg of gas/hour at cruising speed, plus 50 kg for takeoff and landing, so you spend 3304 kg of gas.
  • with gasoline density 737.22 kg/m^3, this is 4481 liters ~ 1184 gallons.
  • at current gas prices of $2.7/gallon, the trip costs $3197
  • ... that is, $533 USD / person. Of course, I am ignoring the cost of the pilot, investment in the zeppelin, airport taxes etc --- which for current airlines are more than half of the cost.
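
If you want to check the arithmetic, the whole estimate fits in a few lines (same assumptions: no wind, 3.785 liters per gallon, and ignoring the pilot, the hardware and the taxes):

    distance_km  = 4339
    speed_kmh    = 80
    hours        = distance_km / speed_kmh            # ~54 hours
    fuel_kg      = 60 * hours + 50                    # 60 kg/h cruising, plus 50 kg takeoff/landing
    fuel_gallons = fuel_kg / 737.22 * 1000 / 3.785    # density 737.22 kg/m^3, 3.785 liters/gallon
    cost         = fuel_gallons * 2.7                 # $2.7 per gallon
    print(round(hours), round(fuel_kg), round(cost), round(cost / 6))   # 54 3304 3197 533
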
Economy of scale. Of course, this calculation is missing the economy of scale. Right now we don't have huge zeppelins because nobody will pay to build them. But what if we did? Well, the useful load grows proportional to volume, i.e. like the cube of the radius. On the other hand, the drag force (which is what the engines are fighting) grows like the area, i.e. like the square of the radius. So this is sublinear growth!

For instance, if we carry 27 times more people, the price per person will drop by roughly a factor of 3 (yes, yes, I'm ignoring the pilot). Note that 6·27 = 162 passengers is really a lot for a BOS->SFO flight today; US airlines learned that US travellers are very spoiled, and the recipe for success is frequent small flights.
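
In numbers (a back-of-the-envelope check of the scaling: useful load ~ r^3, drag and hence fuel ~ r^2):

    scale    = 27 ** (1 / 3)          # 27x more passengers means the radius grows 3x
    fuel     = scale ** 2             # drag, and so fuel burn, grows like r^2, i.e. 9x
    per_head = 533 * fuel / 27        # fuel cost split among 27x more passengers
    print(round(per_head))            # ~178
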

In any case, this means you should pay on the order of $178 USD / person. That is not bad, but it's about what you're paying today for a regular airline, with services included. This means the zeppelin is not actually saving gas, and has the same oil bottleneck.

Seems like there is no Stairway to Heaven, and we might have to sell our soul.

Thursday, August 30, 2007

Love thy predecessor (II): The comparison model

As mentioned already, the super-standard way to solve the predecessor problem is by binary search. This has led people to define the comparison model: you have a bunch of abstract values, and all you can do with them is compare two values; the cost of your algorithm is equal to the number of comparisons performed.

Needless to say, actual computers store values with finite precision, so there are a lot more things that you can do besides compare two values (for example, computing a hash function). The most famous example of a faster algorithm that uses more than comparisons is the beautiful data structure of [van Emde Boas, FOCS 1975], which I will talk about some other time. Until then, suffice it to mention that Don Knuth finds this data structure very exciting in his forthcoming volume 4.

Doubting Thomases. Now, a rather large section of the theory community has developed the belief that we should stick to the comparison model in designing algorithms, ignoring faster algorithms that do more than compare elements. The most vocal adherent to this doctrine that I've met is David. (oh yes, I'm forgetting an anonymous NSF panel...)

Their argument is roughly as follows: in real life, algorithms that get implemented only use comparison search. The finite-precision algorithms aren't actually worth it in practice. So we should study what actually matters in practice, which is comparison algorithms.

Nonsense!
We are doing the theory of the hard case, not the easy case. Say a programmer comes to you: "I have this array of 100 values, and I want to sort them before I run this exponential backtracking algorithm on them. What algorithm do you recommend?"... At this point, I hope you say "Bubble sort, dude!"... But after that, are you going to go to your office and study the beautiful properties of bubble sort?

Quite likely, all the smart algorithms we ever develop are not really needed in 99.99% of the cases, because that's not where the bottleneck is. If you want to do the theory of the easy case, we should stick to teaching intro to programming. We can include binary search in the course, and make 99.99% of the programmers happy.

Fortunately, there are hard problems in this world, and there is a place for us. And where there are hard problems, there are no rules. If somebody wants something faster than binary search, he wants a fast algorithm, he doesn't want a fast comparison-based algorithm.

Case study. Now, there is a highly compelling motivation for designing fast predecessor search: IP lookup. To route a packet on the Internet, the router has to figure out where the destination IP address fits in its routing table. This is precisely predecessor search -- details later. Thus, a predecessor search is required for every packet going through every router on the whole Internet. It is quite safe to say that predecessor search is the most executed problem on the planet.
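
The post defers the details, but here is a rough sketch of the reduction (my own illustration, not from any particular paper): expand each routing prefix into the interval of addresses it covers, precompute the best (longest) matching prefix for every elementary interval between consecutive endpoints, and a lookup becomes a single predecessor query among those endpoints.

    from bisect import bisect_right

    def build_table(prefixes):
        # prefixes: list of (address, prefix_length, next_hop); addresses are 32-bit ints.
        endpoints = set()
        for addr, length, _ in prefixes:
            endpoints.add(addr)                          # first covered address
            endpoints.add(addr + (1 << (32 - length)))   # one past the last covered address
        points = sorted(endpoints)
        # For the elementary interval starting at each endpoint, record the longest
        # covering prefix (quadratic preprocessing, just to keep the sketch short).
        answers = []
        for p in points:
            best = None
            for addr, length, hop in prefixes:
                if addr <= p < addr + (1 << (32 - length)) and (best is None or length > best[0]):
                    best = (length, hop)
            answers.append(None if best is None else best[1])
        return points, answers

    def lookup(ip, points, answers):
        i = bisect_right(points, ip) - 1                 # a single predecessor query
        return answers[i] if i >= 0 else None

    # 10.0.0.0/8 routed to 'A', 10.10.0.0/16 routed to 'B' (made-up example).
    table = build_table([(0x0A000000, 8, 'A'), (0x0A0A0000, 16, 'B')])
    print(lookup(0x0A0A0101, *table), lookup(0x0A000001, *table))    # prints B A
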

With Gigabit speeds, this is actually a hard problem (you have on the order of 100 machine cycles for each search). Comparison search is definitely out of the question, to the extent that it's not even a footnote in the papers, or a column in the table with experimental results.

A bunch of implementations used Patricia tries, and, as you might expect, adding heuristics to Patricia tries became a popular sport. Unfortunately, this was still not fast enough, and for serious routers, people threw in a lot of expensive hardware to tackle the problem.

The paper finally providing the theory sanity-check was [Mikael Degermark, Andrej Brodnik, Svante Carlsson, Stephen Pink: Small Forwarding Tables for Fast Routing Lookups]. The paper appeared in SIGCOMM, the authoritative forum of the networking community (yes, it is harder to get into SIGCOMM than into STOC/FOCS, though maybe for the wrong reasons).

Essentially, they are saying: take all that expensive hardware out, and start using a smarter algorithm. Their algorithm starts exactly with a van Emde Boas step (surprise), of course with the proper amount of engineering. In theory, van Emde Boas search is exponentially faster than Patricia tries, which we have known since 1975.

Love thy predecessor (I): In the beginning was the Word

Thinking about the predecessor problem is the CS-equivalent of a journey within, when you rediscover the very axioms of your existence. It is a center of the data-structure universe, yet it is so ingrained in us that few of us take the time to notice the problem exists. From this central, well-studied location, it shines a light on many controversies and research angles in the field.

The problem. Preprocess a set S of n numbers, such that the following query is answered efficiently: Given a number x, return pred_S(x) = max { y in S | y ≤ x }.

The most trivial solution is of course to sort S during preprocessing, and use binary search to answer queries. If we want a dynamic S, we can use a binary search tree instead. In fact, if somebody asks you what problem binary search trees solve, predecessor search is what you probably think of first.
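
In code, the static version is a two-liner on top of a sorted array (Python's bisect, shown here just for illustration):

    from bisect import bisect_right

    def predecessor(S_sorted, x):
        # pred_S(x) = max { y in S | y <= x }, or None if x is smaller than everything in S.
        i = bisect_right(S_sorted, x)
        return S_sorted[i - 1] if i > 0 else None

    print(predecessor([2, 3, 5, 7, 11], 6))   # 5
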

The problem can be thought of as the online version of sorting (insert one number in a sorted list). It has countless small applications everywhere, plus some rather notable ones -- more later. A lot of very interesting mathematical understanding has been developed around the problem.

But before delving deeper, we have an interesting topic of discussion if we only consider the problem name...

The name. Of course, some people persistently call this "the successor problem". Whatever :) The real issue is that some people use "binary search" or "binary search tree" when they should say predecessor search. For example, it is popular to specify steps of algorithms like "... now binary search for this value in the set".

This is evil! Let us remind ourselves that we study:

  • the dictionary problem, not hash tables.
  • the predecessor problem, not binary search (trees).
  • the nearest neighbor problem, not k-d trees.
  • etc
The Book was written before time, and does not contain references to our meager inventions. Its chapters are dedicated to problems, and solving the problems (through one data structure or another) is what we're supposed to do. Studying some particular data structure we invented is missing the forest for the trees. Not to mention that it is a clear act of vanity, and after all superbia is the very first in SALIGIA.

I am always disappointed by programmers who don't understand the difference, and I tend to jump at the neck of academics who make the confusion :)

To give a recent example, a machine-learning expert was presenting a clustering algorithm in the terms "... and now you use the k-d tree to search for the nearest neighbor to this point...". With this mentality, it is understandable why the sport of adding heuristics to k-d trees is more popular than searching for a more efficient solution to the problem with a less myopic lens.


PS: This whole "misbranding" reminds me of Romania, where "sport shoes" are called "adidasi", "copy machines" are "xeroxuri", and so on. This doesn't seem to happen so much in English, but I'm sure there are examples. I look forward to hearing "I'll google it up on Yahoo".