Sunday, October 28, 2007

Rational Search

We all know the number guessing game. I think of an integer between 1 and n, and you try to guess it by comparison queries, "is it larger than x?". Binary search takes ceiling(log2 n) queries, which is of course optimal.

But what if I think of a rational number a/b , with 0<a<b<n? Can you still guess it with a few comparison queries? Take a few seconds to show that you can guess it with 2log2 n + O(1) queries, and that's optimal. (It's a 1-line proof.)

x
x
x
x
x
x
This space intentionally left blank.
x
x
x
x
x
x


Two fractions with denominators up to n cannot be closer than 1/n2. Thus, you binary search for k such that the number is in the interval [k/n2 , (k+1)/n2). This interval contains a single fraction, so you can just go through all fractions and find the one inside. On the lower-bound side, there are on the order of n2 fractions, so it takes at least 2log2 n - O(1) queries with binary output.

The moral of the story is, of course: be careful about the difference between actual running time and query complexity. But now let us show that you can actually find the unique fraction in the small interval with running time O(lg n). There are several papers doing this, including:
  • Christos Papadimitriou, Efficient search for rationals, IPL 1979.
  • Steven P. Reiss, Efficient search for rationals, IPL 1979
  • Stephen Kwek and Kurt Mehlhorn, Optimal Search for Rationals, IPL 2003.
Here, I'm going to describe a nice solution based on Farey fractions. The Farey sequence of order n is the sorted list of irreducible fractions a/b, with abn. For example, for n=4, we have 0/1, 1/4, 1/3, 1/2, 2/3 3/4, 1/1.

The really cool property of irreducible fractions is that if a/b and c/d are consecutive in some Farey sequence, then the "mediant" (a+b)/(c+d) is the first fraction that appears between them in a higher-order sequence. For example, the first fraction to appear between 1/2 and 2/3 will be 3/5.

With this property, we can construct the Stern-Brocot tree, which is an infinite tree exploring all fractions less than 1. The following picture from Wikipedia should explain everything:

The intuitive algorithm is to go down this tree, looking for our fraction. Unfortunately, fractions with small denominators can appear very deep in the tree -- for example, 1/n is at depth n-1. Thus, we get O(n) time.

To improve this, we first show that the number of "turns" of the exploration is at most O(lg n). By turn, I mean switching from a going down on left children, to going down on right children (note that the 1/n example above had zero turns). Looking at the tree, it is easy to see that a turn essentially doubles denominators, so the property is easy to show.

Now we have to explore the straight descents efficiently. Let's say the fraction falls between a/b and c/d, and we take t steps to the left. This will land us at precisely (ta+b)/(tc+d). We can solve this equation for t, and find the minimum value which makes the fraction less than the target interval. Thus, we can calculate the number of consecutive left steps in constant time.

If we're too lazy to solve the equation, we can binary-search for t. Here we use galopping binary search, i.e. we take time O(lg t) to guess the unknown t.

Surprisingly, this will not give O(lg2n) time in total, but still takes O(lg n) time. This follows because the sum of the lg t values traces the number of times the denominator doubles, and it can only double O(lg n) times.

Overall, we have query complexity, Farey sequences, doubling arguments, galopping binary search, and summing up logarithms. This probably makes very nice material for an undergraduate "Math for CS" course.

(As a person who has never taken, nor taught such a course, I of course feel entitled to have opinions about it.)

Thursday, October 11, 2007

Range Queries (I)

I am now in Bonn, spending some time talking to Yakov Nekritch about range queries. Since this is a field I care about a lot, I will write a series of posts on it.

First, let me begin with my spiel on the importance of studying range queries. This is taken from the beginning of my talk at UW in May.

Once upon a time, you go to a bar in NY and you're trying to pick up a nice woman. She tells you she works in HR, and you tell her you are a theoretical computer scientist. She confesses she has not idea what that means, at which point you clarify "Oh, I am trying to understand the fundamental nature of computation". The blank stare you get is a clear indication that some more clarification is needed.

What do you then explain at this crucial moment of the night? Though I have not tried the following options myself, the prevailing opinion is that they are ill-advised:
  • I am trying to prove lower bounds for constant depth circuits with mod-6 gates.
  • I am trying to extract pure randomness from 2 independent sources with small min entropy.
  • I am trying to design codes of subexponential size with efficient local decodability.
  • I am trying to show that a diagonalization argument cannot separate P and NP, even if it is given a low degree extension of an oracle.
One can complain that too few people understand what computing and computation actually mean. That is a valid complaint, and we should indeed work with precollege education to change things.

But here you're in luck. The gorgeous HR woman actually uses computers every day, and she knows exactly what they're supposed to do. Chances are, she even has some exposure to a programming language! That is of course SQL, which is very useful for her work. The only trouble is -- for her, and for most of the planet, computers are a great tool for data storage and manipulation, not a great tool for finding Hamiltonian paths.

So what do you tell her? Anybody who has attended 5 minutes of a SQL tutorial has seen examples like:
SELECT * FROM employees
WHERE salary <= 70000 AND startdate <= 1998
In fact, she probably runs code like this on a daily basis. Thus, she already understands orthogonal range queries -- formally, you are given a set of points, and you ask for points in some rectangle.

You can now explain that your work is about how fast such queries can be executed, and, with her curiosity satisfied, move to more charming topics.


PS: This post is not about bad-mouthing complexity theory. For a long time, I have considered myself a complexity theorist, and I am interested in many topics in the field, well beyond my own research. The funny references are to some very particular young-and-eager complexity theorists, who have occasionally exasperated me with comments like "Yeap, data structures can be interesting sometimes, but really our goal is to understand computation". The point, if it was not clear enough, is that data structures are certainly not more artificial as our other computational questions -- maybe most computational questions that computer users care about are data-structure (database) questions.

Friday, October 5, 2007

Schedule

For the past two weeks, I have been enjoying the unique atmosphere of Romania. I promise to blog a bit about Romania and its universities next week, since most people are probably not aware of this parallel academic universe.

Now it is time to get back into the real world, and possibly do some work. Of course, on the way to doing work, I might as well have some fun. My upcoming schedule is as follows:
  • Oct 6: Timisoara (my first visit to this reportedly beautiful city in West Romania)
  • Oct 7: Budapest
  • Oct 8: Bratislava (first time in Slovakia)
  • Oct 9: Amsteram
  • Oct 10 -- Oct 16: Bonn
    I am visiting Yakov Nekritch, whose homepage picture reminds me of pre-1989 Romania (a mid-level Party administrator for agriculture, looking proud of his crops).

    On Monday, Oct 15, at 10am, I am giving the following talk:

    Dynamic Graph Algorithms invade Geometry
    After dynamic graph algorithms has been an area of intense research in data structures for almost 30 years, the time seems ripe for the field to tackle problems of a geometric nature. In dynamic graph algorithms, we have a graph that is being changed through updates (typically edge insertions and deletions), and we want to query it for various graph properties (say, whether two nodes are connected, the cost of an MST, the shortest path between two nodes etc).

    Now consider a scenario in which we have sensors distributed in the space, and each sensor can only talk to another sensor at radius r. The connectivity is defined by the intersection graph of discs of radius r/2 around every point. Can we handle a dynamic environment, e.g. a situation when new sensors can be inserted, or some sensors run
    out of battery?

    Our results apply to the intersection graphs of a very general class of objects (includings disks, segments, rectangles) and support the dynamic connectivity problem in time O(nc) per update and query, where c is a constant less than 1 depending on the geometric shape in question. This opens a very broad and promising research direction:
    • can new graph queries be supported (e.g. shortest path queries)?
    • what is the best running time for each important geometric shape?

    Joint work with Timothy Chan and Liam Roditty.

  • Oct 16 -- Oct 20: Brussels (first time in Belgium)

    Here, I am visiting Stefan Langerman. On Thrusday, Oct 18 at 12:30, I am giving a talk:

    Farey Sequences and Counting Primitive Lattice Points
    A primitive lattice point is a point (x,y) with x, y integers and gcd(x,y)=1. "Counting" primitive lattice points inside various planar shapes has been an active area of mathematics in the past decades. But by virtue of seeking an elementary formula, the mathematical definition of "counting" is "approximating asymptotically". We now present the first results on the informatics view of this problem: give an efficient algorithm to count primitive lattice points exactly.

    The problem is related to another entertaining mathematical problem that has crossed the border into informatics. The Farey sequence of order N is the sorted sequence of irreducible fractions a/b, with abN. There are well-known algorithms for generating this sequence in its entirety. We now describe algorithms for finding just one value of the sequence (say the K-th value) significantly faster.

    Joint work with Jakub Pawlewicz.

  • Oct 20 -- Oct 23: FOCS in Providence.
    The talk is on Monday morning at 8:30, and I am giving it (Mikkel and I use the alternation principle for choosing the speaker). The paper is here.

    Planning for Fast Connectivity Updates
    Understanding how a single edge deletion can affect the connectivity of a graph amounts to finding the graph bridges. But when faced with d>1 deletions, can we establish as easily how the connectivity changes? When planning for an emergency, we want to understand the structure of our network ahead of time, and respond swiftly when an emergency actually happens.

    We describe a linear-space representation of graphs which enables us to determine how a batch of edge updates can impact the graph. Given a set of d edge updates, in time O(d.polylgn) we can obtain the number of connected components, the size of each component, and a fast oracle for answering connectivity queries in the updated graph. The initial representation is polynomial-time constructible.
I will blog about each paper after the talk. If you want to meet in Brussels, Bonn, or Providence, drop me a line.

Tuesday, October 2, 2007

CRA Award

In 2005, I received CRA's Outstanding Undergraduate Award (my citation, with a nicely kempt picture).

I have been asked a bunch of times (most recently yesterday) for "insider" tips, my nomination materials, opinions etc. So why not post all of these publicly on the blog, to promote competition...

Here is the research statement and the CV that I submitted back then (2004). It's nice to see the CV has changed a bit since then, not only in the contact info section ;)

Below is some "advice" and personal opinions that I wrote in emails. But there is a catch that you must be aware of -- I certainly did not have a typical application. I had a bunch of research even back then, and I got the award based on work in my sophomore year, which is very very atypical. Thus, you can either think that (1) the application was good enough to earn me the award so early, so my advice is trustworthy; or (2) my circumstances were sufficiently weird that the application didn't really matter, so my advice is worthless.

For what it's worth it, here was the formula I used:
  • I got the advice (maybe it was even official advice from CRA) to describe carefully what my contribution was on the papers, so I do that to some extent in the statement. At a later stage in life (post PhD), this would probably be unusual.

  • I was also told that some contribution to humanity might not hurt, so the CV emphasized my involvement in organizing high school computer olympiads. Still I decided not to dramatize it too much.

  • I did not mention intended future work, broader impact, and the like. My reasoning was that empty talk is a skill that senior researchers need to develop for grants. It is understood that empty talk is needed, so including it will not hurt your reputation.

    But if young students write the kind of things we normally write in grants, they end up looking sily and childish. They should compete based on objective things they actually did, and get some respect based on objective criteria before the hype can begin.
Looking at the research statement in retrospect, I severely underestimated the impact of my predecessor lower bound with Mikkel (then in the writing). That was a fundamental paper --- the first separation between quasilinear space (say, n polylg n) and polynomial space (say n2). In data structures, all the action tends to happen below quadratic space, so this paper has already generated a lot of fall-out, and will probably be remembered into the future.

Still, the fact that I did not speculate on future impact was probably the right thing to do. As I said already, I believe the tone of a statement at the student level should be strongly objective.