Saturday, September 8, 2007

Love thy predecessor (IV): Approx Nearest Neighbor

The message of this post is:

Predecessor search is equivalent to (1+ε)-approximate nearest neighbor in any constant dimension d.
Perhaps the most appropriate citation is Timothy Chan [SODA 02] "Closest-point problems simplified on the RAM" (but I think this post sheds more light into the matter).
This is a pretty cool proof, and seems rather underappreciated and under-taught. A nice bonus is that it is based on space-filling curves, which some people on the practical side of nearest neighbor love to talk about (though Piotr believes k-d trees are preferable).

[Incidentally, we'll see that the curve mapping is obtained by bitwise operations, again making the point that when people want to solve some problem better, they couldn't care less about our comparison model.]

Basic idea. Never mind the 1+ε part, let's go for O(1)-approximation. Say we have two numbers, a and b, and want to see if |a-b|<2k. Well, the obvious idea is to compute a-b :) The dumb and less obvious idea is to see if a and b are identical in the first w-k bits. Of course, this does not work (consider 111111 - 011111). But it almost works; e.g., the first w-k-1 bit are identical with constant probability if we add a random shift to the numbers.

In 2-d, the dumb idea starts to look better. We're going to define a mapping F(x,y) that takes the bits of x and y and interleaves them into an integer of 2w bits. Now, to test whether |X-x|<2k and |Y-y|<2k (with some constant approximation), you can perform just one test on F(X,Y) and F(x,y): test whether they start with the same ~2(w-k) bits.

Details. F is reducing a 2-d problem to a 1-d problem. We do not actually care about k: we can map the input set through F, and then search for the predecessor of F(query). Either the predecessor or successor has the longest prefix in common with the query, leading to a near neighbor.

Observe that we are actually heading towards a solution for the L nearest-neighbor problem, because the common prefix in F(.) outputs is sensitive to the largest distance on some coordinate. But that's Okay, since L and L2 are constant factors away for d=constant.

Now what approximation is possible in L? Remember we said we should perform a random translation. In d dimensions, we want all coordinates to have a long common prefix. A bad event with 1/d probability will happen, so on some coordinate i, xi and Xi may have only w-k-lg d initial bits in common. Thus, we should have an O(d) approximation.

Deterministic! To derandomize this and simplify the analysis, we use the pigeonhole principle. There are d dimensions, so d bad events. Then, if we're careful, we can find d+1 fixed translations, such that at least one of them preserves approximate nearest neighbors through the F map, with an approximation of exactly d+1.

For i=0,...,d, translation #i adds to every coordinate floor(i 2w/(d+1)). Simple enough.

To see why it works, first observe that the true nearest neighbor will not straddle a bad approximation barrier for at least one of these shifts (pigeonhole). Then, if it is beat by another point with that shift, that other point cannot be too bad, because the map cannot make points look closer (if there's a difference on one coordinate, it is preserved through interleaving). I won't work out the math here.

Space-filling curves. Alright, but where are these space filling curves, anyway? Well, that's just a nice way to view the algorithm. Note that F is a bijection between the interval [2dw] and the cube [2w]x...x[2w]. Thus, the graph of F-1 (the inverse disassociating the bits) is 1-d curve filling the space.

The algorithm looks like this. In turn, for each of d+1 space-filling curves (differing by translation) do:
• map the input points to the linear order of the curve, and construct a predecessor structure.
• given a query, consider its predecessor and successor on the curve as potential near neighbors.
Discussion: approximation. So far we've talked about f(d) approximation. In practice, it is claimed that this is enough. This is not because f(d)-approximation is acceptable, but because practical data sets don't have too many candidates masquerading as the nearest neighbor. If the nearest neighbor is at distance r, there are O(1) points at distance O(r). [[NB: this assumption is a special case of the "bounded doubling dimension" assumption. ]] In this case, we can check all "candidates", by considering more adjacent points on the space-filling curve, beyond the successor and predecessor.

In theory, there are a bunch of things you can do, increasing the space by
1/εO(d). Maybe the most obvious is gridding (storing a big lookup table with a grid around each point). Checking additional points left or right does not work per se (imagine the nearest neighbor at distance 1; you find an approximate neighbor at distance 2, and there are many many point very close to this neighbor). However, it does work if we walk the curve considering only points that are appropriately far from their predecessor. See Timothy's paper for details (though note: since we're only considering O(1) "levels" I don't think you need priority search trees...). The bottom line is that we need to consider about 1/εd additional points.

Discussion: equivalence. It is not hard to observe that predecessor search is a lower bound for (say) 2-approximate nearest neighbor, even in 1-d. We already know that predecessor search is equivalent to finding the longest common prefix. A 2-approximation to the nearest-neighbor essentially gives you somebody with the same longest common prefix (with some O(1) extra work).