Friday, August 31, 2007

Led Zeppelin

I have typically not been alarmed by the prospect of oil running out. Europe has shown that railroads can work, so personally I couldn't care less if driving a car becomes 10x more expensive. The only issue is --- I really want my air travel. And I'm sure most researchers, living the conference life, will agree.

Now, I have always assumed that balloons will provide a solution. Sure, we'll take it slower, and spend more time in the air (never mind those students), but we'll continue to get there. Unfortunately, it seems I was wrong...

Zeppelin NT (Neue Technologie) seems to be the most efficient commercially available "balloon" today. It is a semi-rigid design, German-engineered to optimize fuel efficiency and speed. It can:

  • carry about 6 passengers in addition to 1 pilot, if you want to go up to 2000 m ~ 6600 ft (which you do, because there are mountains on most interesting routes).

  • cruise at 80 km/h ~ 50 mph, relative to the wind
To understand that this thing is really built with great care, consider that it costs about $7M apiece (around 100x more than the high-end blimps in American skies).

Now, let's say you want to travel BOS->SFO to collaborate with the Berkeley people.
  • the distance is 4339 km
  • ... which takes you 54 hours (which has the positive effect that you're actually going to read their papers before asking them questions). NB: this assumes no wind, so you may be able to do slightly better by finding a nice altitude with good wind.

  • according to these technical specifications, the thing needs 60 kg of gas/hour at cruising speed, plus 50 kg for takeoff and landing, so you spend 3304 kg of gas.
  • with a gasoline density of 737.22 kg/m³, this is 4481 liters ~ 1184 gallons.
  • at current gas prices of $2.7/gallon, the trip costs $3197
  • ... that is, about $533 / person. Of course, I am ignoring the cost of the pilot, the investment in the zeppelin, airport taxes, etc. --- which for current airlines are more than half of the cost. (A quick back-of-the-envelope script below redoes the arithmetic.)
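For the curious, here is a short sanity-check script (a sketch in Python; the figures are the ones quoted above, and the gallon/liter conversion is the standard 3.785 l/gal):

```python
# Back-of-the-envelope cost of the BOS->SFO zeppelin trip, using the figures above.
distance_km = 4339.0
speed_kmh = 80.0              # cruising speed, assuming no wind
burn_kg_per_h = 60.0          # fuel consumption at cruising speed
takeoff_landing_kg = 50.0     # extra fuel for takeoff and landing
density_kg_per_m3 = 737.22    # gasoline density
price_per_gallon = 2.7        # 2007 gas price
passengers = 6

hours = distance_km / speed_kmh                        # ~54.2 h
fuel_kg = hours * burn_kg_per_h + takeoff_landing_kg   # ~3304 kg
gallons = fuel_kg / density_kg_per_m3 * 1000 / 3.785   # ~1184 gal
cost = gallons * price_per_gallon                      # ~$3197
print(f"{hours:.1f} h, {fuel_kg:.0f} kg, {gallons:.0f} gal, "
      f"${cost:.0f} total, ${cost / passengers:.0f} per person")
```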
Economy of scale. Of course, this calculation misses the economy of scale. Right now we don't have huge zeppelins because nobody will pay to build them. But what if we did? Well, the useful load grows in proportion to the volume, i.e. like the cube of the radius. On the other hand, the drag force (which is what the engines are fighting) grows like the surface area, i.e. like the square of the radius. So the fuel cost grows sublinearly in the number of passengers!

For instance, if we carry 27 times more people, the price per person drops by roughly a factor of 3: 27x the volume means 3x the radius, hence 9x the drag and roughly 9x the fuel, spread over 27x the passengers (yes, yes, I'm ignoring the pilot). Note that 162 passengers is really a lot for a BOS->SFO flight today; US airlines learned that US travellers are very spoiled, and the recipe for success is frequent small flights.
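The scaling law in a couple of lines (again a sketch; the exponent comes from fuel ~ drag ~ radius², while capacity ~ radius³, so the per-person cost scales like k^(-1/3) when capacity is scaled by k):

```python
# Per-person fuel cost when passenger capacity is scaled by a factor k.
base_cost_per_person = 533.0   # the 6-passenger figure computed above
for k in (1, 8, 27, 64):
    print(6 * k, "passengers ->", round(base_cost_per_person * k ** (-1 / 3)), "$/person")
```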

In any case, this means you should pay on the order of $178 / person. That is not bad, but it's about what you're paying today for a regular airline, with services included. This means the zeppelin is not actually saving gas, and it has the same oil bottleneck.

Seems like there is no Stairway to Heaven, and we might have to sell our souls.

Thursday, August 30, 2007

Love thy predecessor (II): The comparison model

As mentioned already, the super-standard way to solve the predecessor problem is by binary search. This has led people to define the comparison model: you have a bunch of abstract values, and all you can do with them is compare two values; the cost of your algorithm is equal to the number of comparisons performed.

Needless to say, actual computers store values with finite precision, so there are a lot more things that you can do besides compare two values (for example, computing a hash function). The most famous example of a faster algorithm that uses more than comparisons is the beautiful data structure of [van Emde Boas, FOCS 1975], which I will talk about some other time. Until then, suffice it to mention that Don Knuth finds this data structure very exciting in his forthcoming volume 4.

Doubting Thomases. Now, a rather large section of the theory community has developed the belief that we should stick to the comparison model in designing algorithms, ignoring faster algorithms that do more than compare elements. The most vocal adherent to this doctrine that I've met is David. (oh yes, I'm forgetting an anonymous NSF panel...)

Their argument is roughly as follows: in real life, the algorithms that get implemented only use comparison search. The finite-precision algorithms aren't actually worth it in practice. So we should study what actually matters in practice, which is comparison algorithms.

Nonsense!
We are doing the theory of the hard case, not the easy case. Say a programmer comes to you: "I have this array of 100 values, and I want to sort them before I run this exponential backtracking algorithm on them. What algorithm do you recommend?"... At this point, I hope you say "Bubble-sort, dude!"... But after that, are you going to go to your office and study the beautiful properties of bubble sort?

Quite likely, all the smart algorithms we ever develop are not really needed in 99.99% of the cases, because that's not where the bottleneck is. If we want to do the theory of the easy case, we should stick to teaching intro to programming. We can include binary search in the course, and make 99.99% of programmers happy.

Fortunately, there are hard problems in this world, and there is a place for us. And where there are hard problems, there are no rules. If somebody wants something faster than binary search, he wants a fast algorithm; he doesn't want a fast comparison-based algorithm.

Case study. Now, there is a highly compelling motivation for designing fast predecessor search: IP lookup. To route a packet on the Internet, the router has to figure out where the destination IP address fits in its routing table. This is precisely predecessor search -- details later. Thus, a predecessor search is required for every packet going through every router on the whole Internet. It is quite safe to say that predecessor search is the most executed problem on the planet.
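To make the reduction concrete, here is a toy sketch (mine, not from any real router; Python, with a made-up table, and assuming the prefixes have already been expanded into disjoint address ranges):

```python
import bisect

# Toy routing table: disjoint address ranges (one per expanded prefix),
# each mapped to a next hop.  Only the predecessor-search core is shown;
# real routers need something much faster than binary search (see below).
ranges = [
    (0x0A000000, "hop A"),   # 10.0.0.0/8 starts here
    (0x0A010000, "hop B"),   # 10.1.0.0/16 punches a hole in the /8
    (0x0A020000, "hop A"),   # back to the /8 route
    (0x0B000000, "hop C"),   # 11.0.0.0/8
]
starts = [s for s, _ in ranges]

def lookup(addr: int) -> str:
    # Predecessor search: find the largest range start <= addr.
    i = bisect.bisect_right(starts, addr) - 1
    return ranges[i][1] if i >= 0 else "no route"

print(lookup(0x0A01FFFF))   # an address in 10.1.x.x -> hop B
```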

With gigabit speeds, this is actually a hard problem (you have on the order of 100 machine cycles for each search). Comparison search is definitely out of the question, to the extent that it's not even a footnote in the papers, or a column in the tables with experimental results.

A bunch of implementations used Patricia tries, and, as you might expect, adding heuristics to Patricia tries became a popular sport. Unfortunately, this was still not fast enough, and for serious routers, people threw in a lot of expensive hardware to tackle the problem.

The paper finally providing the theory sanity-check was [Mikael Degermark, Andrej Brodnik, Svante Carlsson, Stephen Pink: Small Forwarding Tables for Fast Routing Lookups]. The paper appeared in SIGCOMM, the authoritative forum of the networking community (yes, it is harder to get into SIGCOMM than into STOC/FOCS, though maybe for the wrong reasons).

Essentially, they are saying: take all that expensive hardware out, and start using a smarter algorithm. Their algorithm starts exactly with a van Emde Boas step (surprise), of course with the proper amount of engineering. In theory, van Emde Boas search is exponentially faster than Patricia tries, which we have known since 1975.
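For readers who haven't seen it, here is a rough sketch of what a single van Emde Boas step buys you (my illustration, not the paper's; the sub-problems on half-length keys are faked with sorted lists just to keep the code short, whereas the real structure recurses on them, giving O(log w) query time on w-bit keys):

```python
import bisect

class OneVebStep:
    """One level of the van Emde Boas recursion for predecessor search on
    w-bit integer keys: a query on w bits is reduced to a query on w/2 bits,
    either inside the key's own bucket (low half) or in the summary of
    non-empty buckets (high half)."""

    def __init__(self, keys, w=32):
        self.half = w // 2
        self.buckets = {}                 # high half -> sorted list of low halves
        for x in sorted(keys):
            hi, lo = x >> self.half, x & ((1 << self.half) - 1)
            self.buckets.setdefault(hi, []).append(lo)
        self.summary = sorted(self.buckets)               # non-empty high halves
        self.max_in = {h: b[-1] for h, b in self.buckets.items()}

    def pred(self, x):
        hi, lo = x >> self.half, x & ((1 << self.half) - 1)
        b = self.buckets.get(hi)
        if b and b[0] <= lo:
            # The answer is in x's own bucket: recurse on the low half.
            i = bisect.bisect_right(b, lo) - 1
            return (hi << self.half) | b[i]
        # Otherwise recurse on the high half (the summary structure), then
        # take that bucket's maximum.
        j = bisect.bisect_left(self.summary, hi) - 1
        if j < 0:
            return None
        h = self.summary[j]
        return (h << self.half) | self.max_in[h]

veb = OneVebStep([3, 17, 1000, 70000, 70005], w=32)
print(veb.pred(70001), veb.pred(500), veb.pred(2))   # 70000 17 None
```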

Love thy predecessor (I): In the beginning was the Word

Thinking about the predecessor problem is the CS-equivalent of a journey within, when you rediscover the very axioms of your existence. It is a center of the data-structure universe, yet it is so ingrained in us that few of us take the time to notice the problem exists. From this central, well-studied location, it shines a light on many controversies and research angles in the field.

The problem. Preprocess a set S of n numbers, such that the following query is answered efficiently: given a number x, return pred_S(x) = max { y ≤ x | y in S }.

The most trivial solution is, of course, to sort S during preprocessing, and use binary search to answer queries. If we want a dynamic S, we can use a binary search tree instead. In fact, if somebody asks you what problem binary search trees solve, predecessor search is probably what you think of first.
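In code, the static solution is just a few lines (a quick sketch in Python; bisect does the binary search):

```python
import bisect

def make_pred(S):
    """Preprocess: sort S.  Query: the predecessor of x, or None if there is none."""
    A = sorted(S)
    def pred(x):
        i = bisect.bisect_right(A, x) - 1   # last position with A[i] <= x
        return A[i] if i >= 0 else None
    return pred

pred = make_pred([10, 3, 25, 7])
print(pred(8), pred(3), pred(2))   # 7 3 None
```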

The problem can be thought of as the online version of sorting (insert one number in a sorted list). It has countless small applications everywhere, plus some rather notable ones -- more later. A lot of very interesting mathematical understanding has been developed around the problem.

But before delving deeper, we have an interesting topic of discussion if we only consider the problem name...

The name. Of course, some people persistently call this "the successor problem". Whatever :) The real issue is that some people use "binary search" or "binary search tree" when they should say predecessor search. For example, it is popular to specify steps of algorithms like "... now binary search for this value in the set".

This is evil! Let us remind ourselves that we study:

  • the dictionary problem, not hash tables.
  • the predecessor problem, not binary search (trees).
  • the nearest neighbor problem, not k-d trees.
  • etc
The Book was written before time, and does not contain references to our meager inventions. Its chapters are dedicated to problems, and solving the problems (through one data structure or another) is what we're supposed to do. Studying some particular data structure we invented is missing the forest for the trees. Not to mention that it is a clear act of vanity, and after all superbia is the very first in SALIGIA.

I am always disappointed by programmers who don't understand the difference, and I tend to jump down the throats of academics who make this confusion :)

To give a recent example, a machine-learning expert was presenting a clustering algorithm in terms like "... and now you use the k-d tree to search for the nearest neighbor of this point...". With this mentality, it is understandable why the sport of adding heuristics to k-d trees is more popular than searching for a more efficient solution to the problem through a less myopic lens.


PS: This whole "misbranding" reminds me of Romania, where "sport shoes" are called "adidasi", "copy machines" are "xeroxuri", and so on. This doesn't seem to happen so much in English, but I'm sure there are examples. I look forward to hearing "I'll google it up on Yahoo".

Wednesday, August 22, 2007

Alan Turing and the nature of the beast

Between Scott's my-paper-was-rejected-from-FOCS post and the hot-topics discussion on Michael's blog, I have seen one too many references to Alan Turing as the great pioneer, founder of computer science, who would not have his paper accepted today in our sorry excuse for a scientific community. To best describe my opinion about this, I will borrow the following image from Scott's job talk at MIT:
To see the problem with the argument, we need to look no further than the title:

  • Alan M. Turing, "On Computable Numbers, with an Application to the Entscheidungsproblem," Proc. London Math. Soc., 2(42) (1936), 230-265.
In other words, Turing is trying to solve a very explicit open problem, posed in [Hilbert and Ackermann, 1928]. His writing follows the "standards" of STOC/FOCS papers today, including a serious attempt to convince the reader that the proof technique is only superficially similar to Gödel's, and a reminder that his result is technically stronger. I see no grand claims of the sort "we initiate the study of [buzz] of computability, as a way to understand the fundamental nature of the universe / human reasoning / ... [buzzz], creating a field that will generate countless open problems [buzzzz]".

Let's get over the dichotomy between the misunderstood genius who spawns new fields with a stroke of the pen, and the number cruncher who generates pages of unreadable technicalities to improve some bound. (The rest of the world got over romanticism about a century ago.) We're not disparaging Turing by saying he was only after an open problem!

The progress that really matters (be it on open-ended or very well defined problems) is based on an original view of the problem, plus some work that anybody could have done (given time). Any serious attempt to solve a hard problem has the "what the heck do we actually mean to prove?" component. If you want to prove you have the skill to answer such questions, there's hardly any need to invent a new field.

Our beasts (open problems) have but one nature: an appetite for brilliance.

Tuesday, August 21, 2007

Cute Problem (I)

Update: This is a "cute problem" only under the additional constraint that the degree of the tree T is small (say, lg n). I got myself confused over this when I wrote the post originally. ---


I will post cute problems (mostly from Informatics Olympiads) at random moments in time. You can spend a few minutes thinking about them to stay "in shape", or give them as assignments, or just ignore them.

[From IOI'07, which happened last week]
You are given a connected graph G with weights on edges, and a spanning tree T. You must remove a set of edges from G\T, such that no even-length cycle remains in G. Minimize the weight of the removed edges.

Here, a cycle is a closed walk that doesn't go through any edge or node twice.

Sunday, August 19, 2007

Informatics Olympiads: The End

[updated: more IMO people added]

I was told I should have a list of links to the posts, so here goes:

Please, spread the word... Informatics olympiads are a part of our intellectual tradition, and a part that we can be proud of. A theory community that is aware of the olympiads is a stronger, richer community.

Finally, I cannot end without proper credit to the IMO, the parallel math contest which had already been running for three decades when the IOI started. Not surprisingly, it has produced some top-notch (not to mention prize-laden) CS theoreticians over the years. Here are a few that I am aware of:

Theorists with prizes:
Theorists with jobs:
Pre-job theorists:

Thursday, August 16, 2007

Informatics Olympiads (IV): Besides the IOI

IOI is the crown jewel of the CS competitions, but there are quite a few other competitions one has to know about.

Other high-school olympiads. Several regional olympiads have been created, modelled after the IOI. The ones I know are:

  • CEOI (Central European)
  • BOI (Balkan)
  • BOI again (Baltic)
If people from Asia or South America know other regional olympiads, please post a comment.

Policies for who goes to these olympiads vary from country to country and from year to year. Typical policies (Romania has tried all of them in different years):
  • send the top 4 to prepare them for the IOI
  • send younger kids (non-seniors) who haven't qualified for the IOI, so that they get experience for next year
  • send the people in places 5-8, out of fairness (to allow other good students to get medals and build up a CV)
Regardless of which rule is employed each year, the students attending are obviously among the best. These olympiads are very competitive, and getting a medal in one of them is a serious achievement.

CEOI, in particular, is insane. It includes only the best countries from Europe, plus the USA as a special guest (these countries typically get only gold and silver at the IOI). Getting a medal at CEOI is often likened to being among the top golds at the IOI.

Post high-school. Like the Putnam in mathematics, there are some olympiads for college students, but they are somewhat different in flavor:
  • the ACM Programming Contest: teams of 3 people try to solve 8 problems in 5 hours. By design, some problems are meant to have easy algorithms but painful implementation (the prototypical examples are from computational geometry).

  • TopCoder: individual contest over a long period of time, with many rounds conducted online. Huge prizes (in the 100K range). Each round is short (1 hour) so you have to be insanely fast at implementation.

  • IPSC (Internet Problem Solving Contest): teams of 3, working from home. You don't submit code, only answers to some truly massive problems.
Like the IOI, these are also highly competitive contests. Many IOIers participate, capitalizing on their previous experience. Even more importantly, many students who weren't quite ready to make it to the stars during high school are extremely motivated to add some prizes to their CVs in their university years.

In Eastern Europe, where most tenured faculty haven't heard the word "research" and attending classes is outright depressing, these competitions may be the only opportunity for students not to let their university years go to waste. (The other option is to work a full-time job as a programmer, for one of the myriad outsourcing companies.)

Unfortunately, these contests are less telling about algorithmic skill than the high-school olympiads. They rely on lots of coding under time pressure, so many pure algorithmists feel turned off. The fact that teams matter so much also makes results harder to interpret. (The successful teams typically include a very fast coder and a theoretician who doesn't code... But who's who?)

Personally, I am an outsider (by choice) to these contests. With my slow coding and problems dealing with authority in teams, I would have been hopeless :)