Friday, August 31, 2007

Led Zeppelin

I have typically not been alarmed by the prospects of oil running out. Europe has shown that railroads can work, so personally I couldn't care less if driving a car becomes 10x more expensive. The only issue: I really want my air travel. And I'm sure most researchers, living the conference life, will agree.

Now, I have always assumed that balloons will provide a solution. Sure, we'll take it slower, and spend more time in the air (never mind those students), but we'll continue to get there. Unfortunately, it seems I was wrong...

Zeppelin NT (Neue Technologie) seems to be the most efficient commercially available "balloon" today. It is a semi-rigid design, German-engineered to optimize fuel efficiency and speed. It can:

  • carry about 6 passengers in addition to 1 pilot, if you want to go up to 2000 m ~ 6600 ft (which you do, because there are mountains on most interesting routes).

  • cruise at 80 km/h ~ 50 mph, relative to the wind

To understand that this thing is really built with great care, consider that it costs about $7M apiece (around 100x more than the high-end blimps in American skies).

Now, let's say you want to travel BOS->SFO to collaborate with the Berkeley people.
  • the distance is 4339 km
  • ... which takes you 54 hours (which has the positive effect that you're actually going to read their papers before asking them questions). NB: this assumes no wind, so you may be able to do slightly better by finding a nice altitude with good wind.

  • according to these technical specifications, the thing needs 60 kg of gas/hour at cruising speed, plus 50 kg for takeoff and landing, so you spend 3304 kg of gas.
  • with gasoline density 737.22 kg/m^3, this is 4481 liters ~ 1184 gallons.
  • at current gas prices of $2.7/gallon, the trip costs $3197
  • ... that is, $533 USD / person (a quick sanity check is below). Of course, I am ignoring the cost of the pilot, the investment in the zeppelin, airport taxes etc., which for current airlines make up more than half of the cost.
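If you want to check the arithmetic, it fits in a few lines of Python (all constants are the ones assumed in this post):

# Back-of-the-envelope cost of the BOS->SFO zeppelin trip.
distance_km = 4339
speed_kmh = 80                 # cruising speed, assuming no wind
burn_kg_per_h = 60             # fuel burn at cruise
extra_kg = 50                  # takeoff + landing
density_kg_per_l = 0.73722     # gasoline
usd_per_gallon = 2.7           # 2007 prices
liters_per_gallon = 3.785
passengers = 6

hours = distance_km / speed_kmh                            # ~54.2 h
fuel_kg = hours * burn_kg_per_h + extra_kg                 # ~3304 kg
gallons = fuel_kg / density_kg_per_l / liters_per_gallon   # ~1184 gal
total = gallons * usd_per_gallon                           # ~$3197
print(f"${total:.0f} total, ${total / passengers:.0f} per person")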
Economy of scale. Of course, this calculation is missing the economy of scale. Right now we don't have huge zeppelins because nobody will pay to build them. But what if we did? Well, the useful load grows proportionally to the volume, i.e. like the cube of the radius. On the other hand, the drag force (which is what the engines are fighting) grows like the surface area, i.e. like the square of the radius. So fuel consumption grows sublinearly in the useful load!
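In symbols, for an airship scaled up by a factor r in linear dimensions (a sketch of the scaling argument, ignoring everything except drag):

\text{load} \propto r^3, \qquad \text{drag} \propto r^2 \;\Longrightarrow\; \frac{\text{fuel}}{\text{passenger}} \propto \frac{r^2}{r^3} = r^{-1} \propto \text{capacity}^{-1/3}

so multiplying the capacity by k divides the per-passenger fuel cost by k^{1/3}.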

For instance, if we carry 27 times more people, the price per person will drop by a factor of roughly 3 (yes, yes, I'm ignoring the pilot). Note that 162 passengers is really a lot for a BOS->SFO flight today; US airlines learned that US travellers are very spoiled, and the recipe for success is frequent small flights.

In any case, this means you should pay on the order of $178 USD / person. That is not bad, but it's about what you're paying today for a regular airline, with services included. This means the zeppelin is not actually saving gas, and it hits the same oil bottleneck.

Seems like there is no Stairway to Heaven, and we might have to sell our souls.

Thursday, August 30, 2007

Love thy predecessor (II): The comparison model

As mentioned already, the super-standard way to solve the predecessor problem is by binary search. This has led people to define the comparison model: you have a bunch of abstract values, and all you can do with them is compare two values; the cost of your algorithm is equal to the number of comparisons performed.

Needless to say, actual computers store values with finite precision, so there are a lot more things you can do besides comparing two values (for example, computing a hash function). The most famous example of a faster algorithm that uses more than comparisons is the beautiful data structure of [van Emde Boas, FOCS 1975], which I will talk about some other time. Until then, suffice it to mention that Don Knuth finds this data structure very exciting in his forthcoming volume 4.

Doubting Thomases. Now, a rather large section of the theory community has developed the belief that we should stick to the comparison model in designing algorithms, ignoring faster algorithms that do more than compare elements. The most vocal adherent to this doctrine that I've met is David. (oh yes, I'm forgetting an anonymous NSF panel...)

Their argument goes roughly as follows: in real life, algorithms that get implemented only use comparison search. The finite-precision algorithms aren't actually worth it in practice. So we should study what actually matters in practice, which is comparison algorithms.

Nonsense!
We are doing the theory of the hard case, not the easy case. Say a programmer comes to you: "I have this array of 100 values, and I want to sort them before I run this exponential backtracking algorithm on them. What algorithm do you recommend?"... At this point, I hope you say "Bubble-sort, dude!"... But after that, are you going to go to your office and study the beautiful properties of bubble sort?

Quite likely, all the smart algorithms we ever develop are not really needed in 99.99% of the cases, because that's not where the bottleneck is. If you want to do the theory of the easy case, we should stick to teaching intro to programming. We can include binary search in the course, and make 99.99% of the programmers happy.

Fortunately, there are hard problems in this world, and there is a place for us. And where there are hard problems, there are no rules. If somebody wants something faster than binary search, he wants a fast algorithm, he doesn't want a fast comparison-based algorithm.

Case study. Now, there is a highly compelling motivation for designing fast predecessor search: IP lookup. To route a packet on the Internet, the router has to figure out where the destination IP address fits in its routing table. This is precisely predecessor search -- details later. Thus, a predecessor search is required for every packet going through every router on the whole Internet. It is quite safe to say that: predecessor search is the most executed problem on the planet.
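To make the reduction concrete, here is a toy sketch (the routing table and addresses are made up; real routers do longest-prefix matching, which can be preprocessed into disjoint address ranges like these, and I'm using plain binary search as a stand-in for whatever fast predecessor structure the router actually runs):

import bisect

# Flattened routing table: disjoint ranges of 32-bit addresses;
# range i starts at starts[i] and is routed to next_hops[i].
starts = [0x0A000000, 0x0A800000, 0xC0A80000]     # hypothetical
next_hops = ["eth0", "eth1", "eth2"]

def route(dst):
    i = bisect.bisect_right(starts, dst) - 1      # predecessor of dst
    return next_hops[i] if i >= 0 else "default"

print(route(0x0A123456))   # falls in the first range -> eth0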

With Gigabit speeds, this is actually a hard problem (you have on the order of 100 machine cycles for each search). Comparison search is definitely out of the question, to the extent that it's not even a footnote in the papers, or a column in the table with experimental results.

A bunch of implementations used Patricia tries, and, as you might expect, adding heuristics to Patricia tries became a popular sport. Unfortunately, this was still not fast enough, and for serious routers, people threw in a lot of expensive hardware to tackle the problem.

The paper finally providing the theory sanity-check was [Mikael Degermark, Andrej Brodnik, Svante Carlsson, Stephen Pink: Small Forwarding Tables for Fast Routing Lookups]. The paper appeared in SIGCOMM, the authoritative forum of the networking community (yes, it is harder to get into SIGCOMM than into STOC/FOCS, though maybe for the wrong reasons).

Essentially, they are saying: take all that expensive hardware out, and start using a smarter algorithm. Their algorithm starts exactly with a van Emde Boas step (surprise), of course with the proper amount of engineering. In theory, van Emde Boas search is exponentially faster than Patricia tries, which we have known since 1975.
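To make "a van Emde Boas step" concrete, here is a minimal Python sketch of the textbook structure (hash-table clusters, min/max shortcuts; this is emphatically not the engineered version from the paper). The recursion halves the number of bits, so a query costs O(lg w) = O(lg lg u) on w-bit keys:

class VEB:
    """Minimal van Emde Boas structure: pred(x) = max { y in S | y <= x }."""
    def __init__(self, w):
        self.w = w
        self.min = None          # the min is NOT stored in any cluster
        self.max = None
        self.summary = None      # records which clusters are nonempty
        self.clusters = {}       # high half of key -> VEB(w // 2)

    def insert(self, x):
        if self.min is None:
            self.min = self.max = x
            return
        if x == self.min or x == self.max:
            return
        if x < self.min:
            x, self.min = self.min, x       # new min; old min recurses instead
        if self.w > 1:
            half = self.w // 2
            h, l = x >> half, x & ((1 << half) - 1)
            c = self.clusters.setdefault(h, VEB(half))
            if c.min is None:               # cluster was empty: note it in summary
                if self.summary is None:
                    self.summary = VEB(self.w - half)
                self.summary.insert(h)
            c.insert(l)
        if x > self.max:
            self.max = x

    def pred(self, x):
        if self.min is None or x < self.min:
            return None
        if x >= self.max:
            return self.max
        if self.w == 1:                     # min <= x < max forces x == min here
            return self.min
        half = self.w // 2
        h, l = x >> half, x & ((1 << half) - 1)
        c = self.clusters.get(h)
        if c is not None and c.min is not None and l >= c.min:
            return (h << half) | c.pred(l)  # answer lives in x's own cluster
        ph = self.summary.pred(h - 1) if self.summary else None
        if ph is None:
            return self.min                 # only the unstored min is smaller
        return (ph << half) | self.clusters[ph].max

v = VEB(16)
for y in [3, 10, 1000, 4096]:
    v.insert(y)
print(v.pred(999), v.pred(4096))   # -> 10 4096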

Love thy predecessor (I): In the beginning was the Word

Thinking about the predecessor problem is the CS-equivalent of a journey within, when you rediscover the very axioms of your existence. It is a center of the data-structure universe, yet it is so ingrained in us that few of us take the time to notice the problem exists. From this central, well-studied location, it shines a light on many controversies and research angles in the field.

The problem. Preprocess a set S of n numbers, such that the following query is answered efficiently: given a number x, return pred_S(x) = max { y in S | y ≤ x }.

The most trivial solution is of course to sort S during preprocessing, and use binary search to answer queries. If we want a dynamic S, we can use a binary search tree instead. In fact, if somebody asks you what problem binary search trees solve, predecessor search is what you probably think of first.
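In code, the comparison-based solution is a few lines (a sketch using Python's bisect; this is the baseline that everything below competes against):

import bisect

def predecessor(sorted_s, x):
    # max { y in sorted_s | y <= x }, or None if everything is bigger
    i = bisect.bisect_right(sorted_s, x)
    return sorted_s[i - 1] if i > 0 else None

s = sorted([2, 7, 19, 42])
print(predecessor(s, 18), predecessor(s, 1))   # -> 7 None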

The problem can be thought of as the online version of sorting (insert one number in a sorted list). It has countless small applications everywhere, plus some rather notable ones -- more later. A lot of very interesting mathematical understanding has been developed around the problem.

But before delving deeper, we have an interesting topic of discussion if we only consider the problem name...

The name. Of course, some people persistently call this "the successor problem". Whatever :) The real issue is that some people use "binary search" or "binary search tree" when they should say predecessor search. For example, it is popular to specify steps of algorithms like "... now binary search for this value in the set".

This is evil! Let us remind ourselves that we study:

  • the dictionary problem, not hash tables.
  • the predecessor problem, not binary search (trees).
  • the nearest neighbor problem, not k-d trees.
  • etc
The Book was written before time, and does not contain references to our meager inventions. Its chapters are dedicated to problems, and solving the problems (through one data structure or another) is what we're supposed to do. Studying some particular data structure we invented is missing the forest for the trees. Not to mention that it is a clear act of vanity, and after all superbia is the very first in SALIGIA.

I am always disappointed by programmers who don't understand the difference, and I tend to jump at the neck of academics who make the confusion :)

To give a recent example, a machine-learning expert was presenting a clustering algorithm in the terms "... and now you use the k-d tree to search for the nearest neighbor to this point...". With this mentality, it is understandable why the sport of adding heuristics to k-d trees is more popular than searching for a more efficient solution to the problem with a less myopic lens.


PS: This whole "misbranding" reminds me of Romania, where "sport shoes" are called "adidasi", "copy machines" are "xeroxuri", and so on. This doesn't seem to happen so much in English, but I'm sure there are examples. I look forward to hearing "I'll google it up on Yahoo".

Wednesday, August 22, 2007

Alan Turing and the nature of the beast

Between Scott's my-paper-was-rejected-from-FOCS post and the hot-topics discussion on Michael's blog, I have seen one too many references to Alan Turing as the great pioneer, founder of computer science, who would not have his paper accepted today in our sorry excuse for a scientific community. To best describe my opinion about this, I will borrow the following image from Scott's job-talk at MIT:
To see the problem with the argument, we need to look no further than the title:

  • Alan M. Turing, ``On Computable Numbers, With an Application to the Entscheidungsproblem,'' Proc. London Math. Soc., 2(42) (1936), 230-265.
In other words, Turing is trying to solve a very explicit open problem, posed in [Hilbert and Ackermann, 1928]. His writing follows the "standards" of STOC/FOCS papers today, including a serious attempt to convince the reader that the proof technique is only superficially similar to Gödel's, and a reminder that his result is technically stronger. I see no grand claims of the sort "we initiate the study of [buzz] of computability, as a way to understand the fundamental nature of the universe / human reasoning / ... [buzzz] and creating a field that will generate countless open problems [buzzzz]".

Let's get over the dichotomy between the misunderstood genius who spawns new fields with a stroke of the pen, and the number cruncher who generates pages of unreadable technicalities to improve some bound. (The rest of the world got over romanticism about a century ago.) We're not disparaging Turing by saying he was only after an open problem!

The progress that really matters (be it on open-ended or very well defined problems) is based on an original view of the problem, plus some work that anybody could have done (given time). Any serious attempt to solve a hard problem has the "what the heck do we actually mean to prove?" component. If you want to prove you have the skill to answer such questions, there's hardly any need to invent a new field.

Our beasts (open problems) have but one nature: an appetite for brilliance.

Tuesday, August 21, 2007

Cute Problem (I)

Update: This is a "cute problem" only under the additional constraint that the degree of the tree T is small (say, lg n). I got myself confused over this when I wrote the post originally. ---


I will post cute problems (mostly from Informatics Olympiads) at random moments in time. You can spend a few minutes thinking about them to stay "in shape", or give them as assignments, or just ignore them.

[From IOI'07, which happened last week]
You are given a connected graph G with weights on edges, and a spanning tree T. You must remove a set of edges from G\T, such that no even-length cycle remains in G. Minimize the weight of the removed edges.

Here cycle = a closed walk which doesn't go through any edge or node twice.

Sunday, August 19, 2007

Informatics Olympiads: The End

[updated: more IMO people added]

I was told I should have a list of links to the posts, so here goes:

Please, spread the word... Informatics olympiads are a part of our intellectual tradition, and a part that we can be proud of. A theory community that is aware of the olympiads is a stronger, richer community.

Finally, I cannot end without proper credit to the IMO, the parallel Math contest which had already been running for 4 decades when IOI started. Not surprisingly, it has produced some top-notch (not to mention prize-laden) CS theoreticians along the years. Here are a few that I am aware of:

Theorists with prizes:
Theorists with jobs:
Pre-job theorists:

Thursday, August 16, 2007

Informatics Olympiads (IV): Besides the IOI

IOI is the crown jewel of the CS competitions, but there are quite a few other competitions one should know about.

Other high-school olympiads. Several regional olympiads have been created, modelled after the IOI. The ones I know are:

  • CEOI (Central European)
  • BOI (Balkan)
  • BOI again (Baltic)
If people from Asia or South America know other regional olympiads, please post a comment.

Policies for who goes to these olympiads vary from country to country and from year to year. Typical policies (Romania has tried all of them in different years):
  • send the top 4 to prepare them for the IOI
  • send younger kids (nonseniors) who haven't qualified to the IOI, so that they get experience for next year
  • send people on places 5-8 out of fairness (to allow other good students to get medals and build up a CV)
Regardless of which rule is employed each year, the students attending are obviously among the best. These olympiads are very competitive, and getting a medal in one of them is a serious achievement.

CEOI, in particular, is insane. It includes only the best countries from Europe, plus USA as a special guest (these countries typically get only gold and silver at the IOI). Getting a medal at CEOI is often likened to being among the top golds at the IOI.

Post high-school. Like the Putnam in mathematics, there are some olympiads for college students, but they are somewhat different in flavor:
  • the ACM Programming Contest: teams of 3 people try to solve 8 problems in 5 hours. By design, some problems are meant to have easy algorithms but painful implementation (the prototypical examples are from computational geometry).

  • TopCoder: individual contest over a long period of time, with many rounds conducted online. Huge prizes (in the 100K range). Each round is short (1 hour) so you have to be insanely fast at implementation.

  • IPSC (Internet Problem Solving Contest): teams of 3, working from home. You don't submit code, only answers to some truly massive problems.
Like the IOI, these are also highly competitive contests. Many IOIers participate, capitalizing on their previous training. Even more importantly, many students who weren't quite ready to reach the stars during high school are extremely motivated to add some prizes to their CV in their university years.

In Eastern Europe, where most tenured faculty haven't heard the word research and attending classes is outright depressing, these competitions may be the only opportunity for students to not let their university years go to waste. (The other option is to work a full-time job as a programmer, for one of the myriad outsourcing companies.)

Unfortunately, these contests say less about algorithmic skill than the high school olympiads. They rely on lots of coding under time pressure, so many pure algorithmists feel turned off. The fact that teams matter so much also makes results harder to interpret. (The successful teams typically include a very fast coder, and a theoretician who doesn't code... But who's who?)

Personally, I am an outsider (by choice) to these contests. With my slow coding and problems dealing with authority in teams, I would have been hopeless :)

Tuesday, August 14, 2007

Informatics Olympiads (III): Understanding Results

Assuming I convinced you that you should pay attention to Informatics Olympiads, here is your practical guide to how to interpret results from the various olympiads out there...

In the sports Olympiad, there are dozens of competitions (gymnastics, lifting, running etc etc), and 3 medals per competition. In science olympiads, there is one competition with more medals. Specifically:

  • the top 1/12 of the competitors get gold
  • the next 2/12 silver
  • the next 3/12 bronze
  • the bottom 6/12 gets nothing (and their names are not public)
Remember that this is among the best students from each country, so even placing in the top half (bronze) is a very significant achievement.

Competitor profiles generally fall in two different categories:
  • Students from very competitive countries (the big Eastern European countries, USA, China, Russia etc) generally have few, but "good" medals. In these countries, the competition to qualify for the IOI is fierce. Simply making it to the top 4 is much more difficult than doing well at the IOI (for reference, 80% of Romanians who qualified got silver or gold). There are many brilliant students who never qualify, and most students qualify only in senior year (after 4 years of training), thus getting a single shot at a medal.

  • People from less competitive countries have more, but usually weaker medals. It is not unusual for the best students from those countries to qualify each year of high school. On the other hand, at the IOI they are fighting an uphill battle, as they don't have access to the top-notch preparation and months-long training camps that the competitive countries provide.
To give a pertinent example of the need to adjust results by country... Romanian students spend up to 2 months in training camps each year, taking an IOI-like exam every other day. If the students bring back fewer than 3 gold+silver medals, there is a public outcry about the diminishing quality of education (not that there normally isn't...). Across the border, Moldova had never dreamt of a silver medal before Alex Andoni won 2 of them, and isn't expecting another one soon.

Informatics Olympiads (II): Why Should I Care?

Well, first of all, the IOI is "coming" whether you care or not :). The IMO, the parallel Math olympiad started in 1959 (in Romania!), is a major force in Mathematics. It generates instant respect for winners, reinforced by its track record of producing 8 Fields medalists (among them Gowers and Tao, plus pseudo-medalist Perelman), and countless other celebrities (László Lovász comes to mind).

The IOI was born much more recently (1989) and the medalists are only now "coming of age". Still there are quite a few IOI-turned-theorists that have already won acclaim (sorted by IOI years):

  • Mohammad Mahdian [ 1992, 1993 ], the quiet game theorist, now at Yahoo! Research
  • Mihai Bădoiu [ 1995; 1996 ], relentless metric embedder, now at Google Research
  • Mohammad-Taghi Hajiaghayi [ 1997 ], the force single-handedly responsible for biasing SODA statistics, now at AT&T Research
  • Alex Andoni [ 1999, 1998, 1997 ], high-dimensional algorithmist, still PhD@mit
  • Krzysztof Onak [ 2000, 1999 ], early in his PhD@mit but world famous for this.
  • your humble(?) blogger [ 2001, 2000, 1999 ]
    ... and I'm sure I'm missing people. Please post a comment.
Probably there are many more great examples in systems, AI etc. But let's stick to theory for now.

So there are top-notch people coming from IOI. But how are IOI people doing on average, you may ask? I sent out the word and in-came this REMARKABLE DATA about a cross-section of the IOI medalists that I had access to (namely, the Romanians). We have almost complete data about the educational and career paths of all 56 Romanians that ever won a medal at the IOI!

Here are a few quick facts:
  • ≥ 2 faculty (one Cluj, one Utah); ≥ 4 expressed intention for academic career post-PhD

  • ≥ 9 people at Google :)

  • ≥ 13 PhDs -- MIT (6), Princeton (2), Berkeley (1), the Netherlands (2), Romania (2)

  • undergrad -- MIT (10), Caltech (2), Princeton (1), Brown (1), Waterloo (1)
    Bucharest Polytechnic (13), Bucharest Univ. (5), Bremen, Germany (4)
I am sure you will agree that we are looking at a very successful group of people! Congratulations to all!

The take-home message. If I have not made it clear enough, the take-home message is that we want to be competing with Google, and get the IOI people as PhD students. The IOI is young, but it has already proven to be a good indicator of success, and it is ready to enter our community's consciousness.

This is especially true in theory -- the IOI is mainly about algorithmic skill, and that is precisely what we want. The love of solving algorithmic problems is a shared value in the IOI and the theory community. The IOI is ours more than it belongs to any other field in Computer Science.

Informatics Olympiads (I): The Contest

Ok, so many of you have heard about Informatics Olympiads from me or other people. But what exactly are these things? This is your quick survival guide.

At a glance. The IOI (International Olympiad in Informatics) is the Computer Science equivalent of the IMO (International Math Olympiad). Each country sends 4 high-school students, who spend 5 hours on 2 consecutive days solving problems (3 problems are assigned each day). This is an individual contest (no team rankings, no collaboration).

The problems. A typical problem looks like this: write a program that takes an input of n .... things (numbers, edges in a graph, whatever), and outputs ..... based on the input in less than ..... milliseconds. At the end of the 5 hours, each contestant submits 3 programs. The committee has prepared a bunch of tests for each problem, and the grading is entirely automatic -- your programs are run on all tests, and your score is the percentage of tests that were answered correctly within the given time bound.

The tests typically have a growing complexity (growing n). As a contestant, you reason like this: the problem says n ≤ 10000, and 50 milliseconds / test. An algorithm of complexity, say, O(n^2) will definitely handle all tests; one of complexity O(n^2 lg^2 n) will probably also work for most tests (getting say 80% of the score); one of complexity O(n^3) definitely runs out of time for the bigger tests, but still should get around maybe 50% of the score; an exponential algorithm 2^O(n) will only get the very small tests (say 10% of the score).
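In code, the contestant's mental arithmetic looks something like this (operation counts only; how many operations fit in the limit depends on the machine and the constant factors, so the cutoffs above are illustrative):

from math import log2

n = 10_000
for name, ops in [("n^2", n**2),
                  ("n^2 lg^2 n", n**2 * log2(n) ** 2),
                  ("n^3", n**3)]:
    print(f"{name:>12}: ~{ops:.1e} elementary operations")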

Topics. The ideal problem is 90% algorithmic inspiration and 10% programming perspiration. The faster the algorithm, the more points you get, so the main challenge is algorithmic. It is assumed that contestants know a lot of algorithms theory (certainly anything in [CLRS] is fair game, and usually more). That is why you can't really trick the contestants by giving a problem where a classic algorithm applies. The good problems are very creative and contestants have to discover some interesting idea specific to the given problem.

An important requirement is that the solution needs to have an easy implementation. Programming should be a small component of the contest, and most of the 5 hours each day are spent thinking. The only important programming skill is that you can implement something correctly. A brilliant algorithmic idea plus a bug will not earn you points, because the program will output incorrect results...

Monday, August 13, 2007

Math in the news

At long last, the New York Times heard about the handshaking lemma. I feel avenged for the countless times I tried to explain this to people.

Thursday, August 9, 2007

Binary Search Trees (IV): Splay Trees

Though we can't prove splay trees are anywhere close to dynamic optimality, they are undoubtedly the star of the game, and their "cool factor" has driven most of the early research on BSTs.

The splay tree is an amazing algorithm that does something extremely simple, and seems to get everything right without trying at all. If you haven't seen splay trees, go read what they do. The algorithm is described entirely by 2 drawings.
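For the impatient, here is the restructuring step in Python (a sketch of the classic top-down splay from Sleator and Tarjan's simple variant; a full tree would add insert/delete, and every access consists of splaying the accessed key to the root):

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None

def splay(t, key):
    # Moves the node with `key` (or the last node on its search path)
    # to the root, via the zig / zig-zig / zig-zag rotations.
    if t is None:
        return None
    header = Node(None)            # temporarily collects left/right subtrees
    left = right = header
    while True:
        if key < t.key:
            if t.left is None:
                break
            if key < t.left.key:   # zig-zig: rotate right
                y = t.left
                t.left = y.right
                y.right = t
                t = y
                if t.left is None:
                    break
            right.left = t         # link right
            right = t
            t = t.left
        elif key > t.key:
            if t.right is None:
                break
            if key > t.right.key:  # zig-zig: rotate left
                y = t.right
                t.right = y.left
                y.left = t
                t = y
                if t.right is None:
                    break
            left.right = t         # link left
            left = t
            t = t.right
        else:
            break
    left.right, right.left = t.left, t.right   # reassemble
    t.left, t.right = header.right, header.left
    return t

# tiny demo: splay 1 to the root of the path 3 -> 2 -> 1
root = Node(3); root.left = Node(2); root.left.left = Node(1)
root = splay(root, 1)
print(root.key)   # -> 1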

Top reasons to get excited about splay trees (these are all proven theorems):

1. static optimality: they run in O(first-order entropy of the access sequence).

This means splay trees are as good as the best static tree (see post I). If we knew the access frequencies in advance, we could build the optimal static tree by Knuth's dynamic program. But splay trees don't know the frequencies, don't try to learn them in any explicit way, and still match the entropy.

This has profound consequences in applications, since it means we can assume trees are "biased" in whatever way is best for us. Example: for dynamic connectivity in forests (a data structure used by flow algorithms), replacing red-black trees by splay trees automatically reduces the bounds from O(lg^2 n) to O(lg n).

2. dynamic finger: the cost is O(sum_i lg |x_i - x_{i-1}|).

So the smaller the jump between consecutive accesses, the faster you run. The practical consequence: you can forget about linked lists. Even if you tend to access consecutive elements, the splay tree is just as good (and of course better if you occasionally want to jump a longer distance).

Now, it is possible to build trees with the express purpose of achieving dynamic finger bounds, but they are rather nontrivial to implement. Proving that splay trees achieve dynamic finger gives a simple algorithm, and moves the entire burden onto theory. Of course, it is quite a burden... The proof is one of those famous long-and-difficult proofs of Computer Science [Cole: On the Dynamic Finger Conjecture for Splay Trees, SICOMP 2000; STOC'90]. Notice the gap between the conference and journal version to see what I mean by long and difficult.

3. working set: Say you are going to access some x. If, since the last time you accessed x, you have only accessed k distinct elements, the amortized running time of this operation is O(lg k). This should be a practical scenario -- it is exactly the reason computers have caches (to improve the case when a few elements are accessed frequently).

Proofs for 1. and 3. are in the original paper of [Sleator, Tarjan] and are surprisingly cute and simple. NB: this paper is famous for frustrating students, by combining something really cool with lots of uncool things. You probably want to ignore everything outside pages 4-9.

Splay-tree theory. Going beyond these results, there is a whole industry of getting excited about splay trees. Maybe a landmark which is somewhere in sight, but not too close, is to prove a 2-finger bound: take two access sequences with a small finger bound, and interleave them arbitrarily; now show that splay trees are roughly as fast as on the individual sequences. You can generalize this to k fingers, combining with the working-set property, along the lines of "the running time is small if you access elements close to something in cache". I think the correct generalization is in [Iacono: Alternatives to splay trees, SODA'01], but it is not yet known if there exists a dynamic tree that achieves the bound. (Open problem!)

To get back to the splay industry, almost all work is technically very difficult; in addition, some of it is quite insightful, and appeals to people who like combinatorics. A nice example is Seth Pettie's recent proof [arXiv:0707.2160] of an O(lg alpha*(n)) running time for the 2-finger bound, in a special case when each finger only moves by ±1 in each step. (This is called the deque bound for obvious reasons.) Yep, this is really the inverse of Ackermann's function, starred.

My own take on the industry is that, while splay trees are cool and such, we must remember they are still just one particular algorithm. We can have fun proving these results, but I don't think they will make it into Encyclopedia Informatica. [[ But they will of course be referenced on Wikipedia :)... someone should update the page given Seth's result. ]]

Binary Search Trees (III): Online

So far I have talked about the following optimization problem: given a sequence of accesses x = (x_1, ..., x_m), compute the minimum cost needed by a dynamic BST to access those elements. This is an interesting enough problem (somewhat unusually, it is a Computer Science problem about Computer Science), and we should strive to solve it.

But the study of BSTs would not end with solving this problem: step 2 would be to get an online BST algorithm. This is the kind of "binary search tree" that we're all accustomed to: the algorithm gets an access request, does something, gets the next access, does something etc. Many, many online BST algorithms have been proposed, like red-black trees, AVL trees, 2-3 trees, BB[alpha] etc. Most of these only guarantee that the worst-case access takes O(lg n), i.e. they only try to keep the tree balanced.

The goal would be to not only solve the philosophical question of approximating the best possible cost for a given access sequence, but have an actual real tree that we can use, which always takes time close to the minimum possible cost.

Competitive analysis (background for nontheorists). Formally, our new problem falls in the field of online algorithms and competitive analysis. A competitive ratio of c means that for any access sequence x, your online algorithm (which sees accesses one at a time) does at most c times worse than the best possible cost for the sequence x. In other words, a c-competitive algorithm is in particular a c-approximation algorithm, with the bonus that it handles each access without knowing the future.

Note that it is not at all clear that competitive algorithms exist; if you don't know the future, what chance do you have to compete with someone who does? Competitive analysis is a well-studied field; for some problems good competitive algorithms exist, and for others we have impossibility results.

The metaphor for this field is the snowboard rental problem. Snowboards cost $30 to buy, and $1/day to rent. You do not know how many days of snow you'll have this year. What do you do? [Answer: rent for 30 days, then buy if there's still snow on the 31st day. If you are a little green man with a perfect weather forecast, you pay min{D,30}, where D = number of days with snow. With this algorithm, you pay at most 2*min{D,30}, i.e. you are 2-competitive. More insight.]
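A sketch of the argument in code, using the $30/$1 numbers from above:

BUY, RENT = 30, 1

def online(days):
    # rent for BUY days, then buy: the classic 2-competitive strategy
    return days * RENT if days <= BUY else BUY * RENT + BUY

def opt(days):
    # the little green man: rent throughout, or buy on day one
    return min(days * RENT, BUY)

print(max(online(d) / opt(d) for d in range(1, 200)))
# -> 2.0, attained when the snow stops right after you buy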
Competitive BSTs. The obvious bound is O(lg n)-competitiveness, since the worst-case is logarithmic for any balanced tree. How about something better?
  • splay trees [Sleator, Tarjan: Self-Adjusting Binary Search Trees, J. ACM 1985; STOC'83] are conjectured to be O(1)-competitive, but nobody has shown any o(lg n) bound. More in post IV.

  • the IAN greedy is also conjectured to be O(1)-competitive but nothing is known (not even a logarithmic bound). See below.

  • Tango trees [Demaine, Harmon, Iacono, Patrascu: Dynamic Optimality--Almost, SICOMP 2007; FOCS'04] give the first proved result: O(lglg n)-competitiveness. They are also the best known approximation algorithm for the offline problem. See post V.
Historically, O(1)-competitiveness is called "dynamic optimality". The "dynamic optimality conjecture" claims that splay trees in particular have this property.

Is IAN Online?? Good question. Both in Ian Munro's paper, and in an older paper that introduced the algorithm independently [Joan Marie Lucas, Canonical forms for competitive binary search tree algorithms, Rutgers technical report DCS-TR-250], it is an offline algorithm called "order by next request". Since you are reordering the elements considering future accesses, this is really offline. However, by the black magic of interpreting the algorithm as a greedy in the geometric view, and a more complicated conversion back from the geometric problem into a tree, we can show that the algorithm can be made online with O(1) slowdown. This is surprising: the old conjecture that IAN is an O(1)-approximation algorithm is actually a conjecture that it is dynamically optimal.

So there is another contender for the dynamic optimality title, competing with splay trees.

Tuesday, August 7, 2007

Binary Search Trees (II): Lower Bounds

In the first post on Binary Search Trees (BSTs), I talked about a geometric view of the following optimization problem: given an access sequence (x_1, ..., x_m), what is the best running time achievable by a dynamic BST on this sequence?

The typical approach to get an approximation algorithm is to identify a lower bound on the cost, and match it as close as possible. Traditionally, we have known two lower bounds for the BST problem, coming from [Robert E. Wilber: Lower Bounds for Accessing Binary Search Trees with Rotations. SICOMP 18(1): 56-67, 1989; FOCS'86].

In the rank-time geometric view, these two bounds and any other bounds we have been able to think of fall in a framework that we call cut bounds. Definitions:

  • a cut is a vertical segment from some (x,y) to (x,Y), say y < Y.

  • a cut system is an ordered list of cuts C_1, C_2, ... It will follow that one can assume w.l.o.g. that cuts are ordered by increasing Y, breaking ties by increasing x.

  • for some cut C_i defined by (x_i, y_i)--(x_i, Y_i), a point (x, y) is in range of the cut if y_i ≤ y < Y_i.

  • the point (x, y) is accessible to the cut C_i if it is in range of the cut, and for every cut C_j, j < i, that the point is in range of, the point is between C_i and C_j, i.e. min{x_i, x_j} ≤ x ≤ max{x_i, x_j}. In the example below, any point in the hashed area is inaccessible to C_7.
  • for every C_i, sort all accessible points of the access sequence by y coordinate. Let L_i be the number of transitions in this sorted list between points on the left of C_i and points on the right. In the example, L_7 = 4:
Theorem: Cost of best BST = Omega(sum_i L_i)
Proof: There is a very meaningful proof, to be discussed in a later post. --QED

Thus, the lower bound proven by some cut system is sum_i L_i. Wilber I is obtained by a cut system which is independent of the access sequence (discussion in a future post). Wilber II is obtained by considering a cut starting at every point, and going up to -∞. If you think about it, for each point (= cut), it counts the number of left-right transitions in a sequence of points converging towards it, when looking up (back in time).

This is annoyingly close (but not quite) to the greedy IAN. The cost of the greedy is essentially equal to the total number of points in a converging sequence, while the lower bound is equal to the number of alternations. Despite the similarity, we do not know any bound on the ratio of this upper and lower bound, and we do not know any separation.

The optimal cut bound. Since all lower bounds we know are cut bounds, the obvious question is to find the best bound achievable in this framework. For a given access sequence, define the following undirected graph:
  • the nodes are the unsatisfied rectangles (rectangles with two access points at opposite corners, containing no other point)

  • there is an edge between two rectangles iff one contains in its interior a corner of the other. Note this corner must be empty, otherwise the rectangle is satisfied and is not a node. Thus, there are only two possible cases for rectangles joined by an edge. We call the rectangles on the left positive-type and those on the right negative-type.
Theorem: The highest possible cut bound = the maximum independent set in the graph of rectangles.

Proof of "≤": For any cut bound, we find an independent set of the same size: take every transition across a cut, and draw the unsatisfied rectangle defined by the points on the left and right. (5 minutes with pencil and paper will convince you the rectangles are independent.)

Proof of "≥": For any independent set, we need to construct a cut system giving the same bound. For every rectangle, consider a cut with the same vertical range, that is close to the left edge for negative-type rectangles, and close to the right edge for positive-type. Order cuts by increasing Y, breaking ties by increasing x. Each cut will have a lower bound of exactly 1, paying for its rectangle (5 more minutes with pencil and paper). --QED

Note a very pleasing thing about the maximum independent set: it is flip- and rotation-invariant. Rectangle satisfiability is by definition flip- and rotation-invariant, so the optimal BST must have this property. Thus, whatever the right lower bound is, it must have this property also.

Now, it does not seem too useful to know that the best lower bound is a maximum independent set of some geometric graph, since that is both too hard to compute and too difficult to reason about. We have a neat greedy that approximates the maximum independent set within constant factors (it is similar to the IAN greedy, but tailored to this problem). Thus, the best lower bound is rather explicit. Unfortunately, we cannot construct a matching upper bound.

Open problems:
  • can we come up with any non-cut lower bounds? (that are better?)
  • is Wilber II asymptotically equal to the best bound (the maximum independent set)? Wilber conjectured so.
  • does IAN come anywhere close to the best lower bound? is there a separation?

SODA

An anonymous commenter on the Complexity Blog recently complained that on the SODA'08 PC, there's nobody whom you'd really describe as a data-structures person. That seems to be roughly correct. I can add my own experience to support the claim: I got an unprecedented number of review requests, namely 6. Most of these had nothing to do with my research, and just had the keyword "data structures" somewhere in the intro, which seems to indicate a lack of expertise in the area among PC members.

In line with the current trend of doing objective statistics on our conferences (namely citation counts), it will be interesting to watch the list of accepted papers. The presence of data-structure papers at SODA is usually large, and I have heard the phrase "SODA is the data structures conference" quite a few times. I do not know whether the lack of PC members in the area increases or decreases the number of accepted papers, but this is our opportunity to find out. Even if the numbers come out right, the quality of the papers may be more mixed, since presumably an inexperienced PC makes more random decisions. That would be harder to assess.


[[Obligatory note: I hold no direct interest in this discussion, having only submitted one streaming paper.]]

Friday, August 3, 2007

Binary Search Trees (I)

Binary search trees are something we all know (and love??) from Algorithms 101. To search for elements of a set S, we can build a binary tree, and write the elements of S in the nodes, such that the in-order traversal of the tree gives the elements of S in sorted order. Classically, people say "the tree is in symmetric order", but I do not understand where the terminology comes from. Since searching is by comparisons, we may just as well assume S = {1, .., n}.

Now, let's say I'm going to search for x_1, x_2, ..., x_m in this order. How well can a binary tree perform? Well, it depends what you mean:

  1. there are sequences where you need Omega(m lg n) comparisons (every comparison reveals one bit of entropy), and that is achieved by any balanced tree. But this worst-case analysis is not answering the question of how well you can do on some given access sequence x1, ..., xm.

  2. If you intend to build a static tree and use it for the entire access sequence, all that matters about the access sequence are the frequencies f_1, ..., f_n -- how many times you access each element i. Then, the famous dynamic program of Knuth finds the optimum tree [Optimum Binary Search Trees, Acta Inf. 1: 14-25 (1971)]. Asymptotically, the cost is described by the entropy: sum_i f_i lg(m/f_i). (A sketch of this DP appears right after this list.)

  3. But who says your tree needs to be fixed? We all know trees can be dynamic, e.g. being changed by rotations. The rest of the post is about computing the minimum cost needed by a dynamic binary tree to solve a given access sequence.
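Since item 2 is such a classic, here is a sketch of Knuth's quadratic dynamic program (with the monotonicity-of-roots trick that brings the naive O(n^3) down to O(n^2); cost[i][j] is the optimal weighted search cost for keys i..j-1):

def optimal_bst_cost(freq):
    n = len(freq)
    pref = [0] * (n + 1)                     # prefix sums of frequencies
    for i, f in enumerate(freq):
        pref[i + 1] = pref[i] + f
    cost = [[0] * (n + 1) for _ in range(n + 1)]
    root = [[0] * (n + 1) for _ in range(n + 1)]
    for i in range(n):
        cost[i][i + 1] = freq[i]
        root[i][i + 1] = i
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length
            best = float("inf")
            # Knuth: the optimal root is monotone, so only scan this window
            for r in range(root[i][j - 1], root[i + 1][j] + 1):
                c = cost[i][r] + cost[r + 1][j]
                if c < best:
                    best, root[i][j] = c, r
            # every key in i..j-1 sits one level deeper: add its frequency once
            cost[i][j] = best + pref[j] - pref[i]
    return cost[0][n]

print(optimal_bst_cost([10, 1, 1, 10]))   # -> 37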
Model. We can come up with lots of models for how to charge binary trees for their work, but they are all equivalent up to constant factors. So let's stick to a simple one: charge cost 1 for every node that you "touch" during every access (which includes touching by a comparison or a rotation). Note that if you consider any fancy transformation that reconfigures a subtree of k nodes, you can implement it with about 2k rotations [Sleator, Tarjan, Thurston: Rotation Distance, Triangulations, and Hyperbolic Geometry, STOC 1986].

Geometric picture. Trees are ugly and I find it totally impossible to think about them. So let's represent the access sequence as a 2D diagram, like the following. Let A denote the set of (red) points in the diagram.
We are now going to summarize the behavior of the binary search tree in this diagram. If at time t (while performing access x_t), the algorithm touches a node with value z, we are going to put a blue dot at coordinate (z, t). Let B denote the union of red and blue dots. For instance:
Lemma: Pick any two points from B not on the same row or column. The rectangle having the two points as opposite corners contains at least one other point from B. (Above you see some examples of rectangles.)

Proof: Say some rectangle defined by (z_1, t_1) and (z_2, t_2) contains only these two points. Essentially, you can't access z_1 at time t_1 and z_2 at time t_2 without accessing their lowest common ancestor at some point (either to go to the other node, or to move the other node through some rotations). See paper for details. --QED


That's nice. But here's the real surprise:

Lemma: Call a set B orthogonally satisfied if for any two points in B, the rectangle defined by them contains another point. For any set B ⊇ A (the access sequence) that is orthogonally satisfied, B describes a valid behavior of a dynamic binary search tree.

Proof: Nontrivial, but doable. The issue is that the blue points on a row tell you what nodes to touch at that time. But you do not know what to do with those nodes -- whether and how to rotate them to be consistent with future blue points. We have an algorithm which looks at future blue points and rotates the current blue points in the best possible way. --QED


A cute research question. We have thus shown an equivalence. To compute the minimum cost of a dynamic binary tree for a given access sequence, we have to solve the following cute, simple, geometric problem:

*** Given a set A of points in 2D, with one point per row, find a minimum superset B which is orthogonally satisfied.

So what do we know about this problem? Not much:
  • there is an O(lglg n) approximation algorithm (in a later post).
  • starting with arbitrary A, the problem is NP-hard. Alas, if A has more than one (red) point per row, it doesn't describe a real access sequence. Thus, this generalized problem is not very meaningful for binary trees.
  • there is a very natural greedy algorithm that we can't analyze.
Plenty of open problems for your enjoyment...

The greedy algorithm. Go down the rows one by one, and at every step add points to the current row to satisfy all rectangles formed with one point from this row and one point above. A picture is worth a thousand words... The green points are the ones added in this step, and the hashed region is irrelevant (not "visible" by empty rectangles).
We like to call this algorithm IAN, as it originates in a paper of Ian Munro, where it is described in the language of trees [On the Competitiveness of Linear Search, ESA 2000]. Interpreting that algorithm in the language of 2D diagrams, we get this natural greedy approach (a sketch in code is below).
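Here is a sketch of the greedy in code (my own rendering, not anyone's official implementation; columns are the key values 1..n, rows are time steps, and last[y] remembers the most recent row in which column y received a point):

def ian_greedy(accesses, n):
    # For each access (= row), add the minimal set of points satisfying
    # all rectangles formed with points from earlier rows; return them.
    last = {}                       # column -> most recent row with a point
    added = []
    for t, x in enumerate(accesses):
        for side in (range(x - 1, 0, -1), range(x + 1, n + 1)):
            m = -1                  # latest row seen strictly between y and x
            for y in side:
                ly = last.get(y, -1)
                if ly > m:          # rectangle (x,t)-(y,ly) is empty: add a point
                    added.append((y, t))
                    last[y] = t
                m = max(m, ly)
        last[x] = t
    return added

print(ian_greedy([3, 1, 2], n=3))   # -> [(3, 1), (1, 2), (3, 2)]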

There are simple examples where the algorithm's cost is an additive O(m) away from the optimum, but no worse examples are known. This means it may very well be an O(1)-approximation, but we do not know how to show any nontrivial bound.


Meet the team. These results are due to the MIT Binary-Trees Team(TM), famous around CSAIL for perfecting the art of drawing diagrams on the boards in G6, to the point of installing a projector to display a grid on the board. The team includes Erik Demaine (now with his own dot-org domain), Dion Harmon (former student of Erik's), John Iacono (a most exceptional computer scientist, both in skill and personal style, and a good friend), and more recently Daniel Kane, rising mathematical star. Right now, the best pointer to the work is Dion's thesis. A publication is forthcoming -- except that two of us just got tenure, one just graduated and went to industry, and another is busy writing blogs.

Wednesday, August 1, 2007

China

In September, I will be attending China Theory Week. The official goal of the workshop is to invite "all the brilliant graduates from different universities to communicate, and to present their research results in the field of theoretical computer science". The list of invitees is available online.

Now, this seems to me like a wonderful example of public service. Not only do you provide an opportunity for these people to talk about research, but you make it clear to the world who the brilliant graduate students are :) No need to carefully pore over applications when deciding who to get for an internship, or who to interview for a job. The list of all brilliant students is now online.

Poking fun at Chinglish aside, the more serious moral dilemma is whether to lend any further legitimacy to this government by attending. While it seems standard to treat this sort of question with great care, I feel the case of an oppressive communist regime is especially serious. Those of us who grew up on the other side of the Iron Curtain are probably particularly sensitive here.

In any case, I am going. My main defense is that Andy Yao is fighting a battle somewhat closer to home (promoting theory), and I can help. I cannot realistically put a dent in communism in China, but I can maybe help Andy's truly remarkable effort to open CS research to the said one billion people.

Speaking of this kind of laudable effort (ping to all my Eastern European and Indian friends), I talked to Andy about it when he visited MIT a couple of months ago. His opinion was crystal clear and memorable:

  1. going back to promote science is a most wonderful, dignifying pursuit.
  2. going back without a Turing award is, in one word, "suicidal".
Nevertheless, the braver among us have tried. Pranab Sen, of round elimination fame and formerly at NEC, is now at the Tata Institute in Mumbai. Good luck, Pranab!