Friday, August 28, 2009
Romania
Posted by Mihai at 9:28 AM 16 comments
Wednesday, August 26, 2009
Longest Multipermutation
Here's an algorithmic problem for your enjoyment:
Define a multipermutation to be a sequence of integers that contains every number from 1 to some K at least once, and nothing outside that range. Examples include (1,3,1,2,2) or (1,4,3,2), but not (1,2,3,5,5).
Given an array of n integers, find the longest multipermutation in O(n) time.
I saw a recent paper doing this in superlinear time, and of course I had to find a linear-time algorithm :) Can you also find one?
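For concreteness, here is a tiny Python check of the definition as I read it (the values must be exactly {1, ..., K}); the function name is mine, and this only illustrates the definition, it is not a solution to the puzzle.

```python
def is_multipermutation(seq):
    # A multipermutation contains every integer from 1 to K = max(seq)
    # at least once, and no value outside the range 1..K.
    if not seq:
        return False
    k = max(seq)
    return min(seq) >= 1 and len(set(seq)) == k

print(is_multipermutation([1, 3, 1, 2, 2]))  # True
print(is_multipermutation([1, 4, 3, 2]))     # True
print(is_multipermutation([1, 2, 3, 5, 5]))  # False: 4 is missing
```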
Posted by Mihai at 3:05 PM 17 comments
Sunday, August 23, 2009
A Technical Lemma
[You can also read this post in PDF.]
In a recent SODA submission, I had to prove a small technical lemma that I considered more or less straightforward. But what exactly is straightforward is in the eye of the beholder, and one of the referees did not like my (succinct and semicoherent) proof. I was thus asked to provide as detailed an explanation as I could, pronto. (Sound familiar? I maintain that we are all too often missing the forest for the trees, but that's a topic for another post.)
Anyway, I thought I would take the opportunity to explain this small technical lemma to my readers who fancy themselves working with entropy inequalities in the future. If you do lower bounds, such things will sooner or later feel quite standard (as standard as, say, maintaining some forward and backward pointers in your algorithm). So if you fancy yourself working in this area, you might consider going through this proof carefully as an exercise.
(A word of caution: Don't be discouraged if this seems overly technical. This is not "what we do" in lower bounds — the point is to discover non-straightforward stuff. Piano skills are not what makes a great composer.)
So, what exactly are we proving? Let A[0..n-1] be a bit vector, chosen uniformly at random. Let Q(i)=A[0]+...+A[i] be the partial sums of the vector (these were queries in my original problem).
I imagine the vector A being split into k blocks of n/k bits each. I denote by QΔ the Δ-th partial sum inside each block, i.e. the vector ( Q(Δ), Q(Δ + n/k), Q(Δ + 2n/k), ...).
We now look at Q0 (the first query in each block) and QΔ for some arbitrary Δ. The main technical question is: how much more efficient is it to encode Q0 and QΔ jointly, as opposed to encoding them separately?
Lemma 1. H(Q0) + H(QΔ) − H(Q0, QΔ) = Ω(k)
Proof. The lemma is simple and intuitive. First remark that H(a, a+b, a+b+c) = H(a, b, c); furthermore, this is H(a)+H(b)+H(c) if a, b, and c are independent. Thus, H(Q0) = H(sum in block 1) + H(sum in block 2) + ... Of course, the sum in a block is a binomial with n/k Bernoulli trials. Thus, H(Q0) = (k-1) * h(n/k), where I use h(m) to denote the entropy of a binomial on m trials.
Just the same, H(QΔ) = h(Δ) + (k-1) * h(n/k). Thus H(Q0) + H(QΔ) ≈ 2k * h(n/k).
Finally, H(Q0, QΔ) = h(Δ) + h(n/k - Δ) + h(Δ) + h(n/k - Δ) + ... ≈ k*h(Δ) + k*h(n/k-Δ). Indeed, I can break (Q0, QΔ) into 2k independent components: the sum up to Δ, the sum from there up to n/k, the sum from there up to n/k + Δ, the sum from there up to 2n/k, and so on.
To conclude, we look up the entropy of a binomial on Wikipedia and find that h(m) = 0.5 * ln(πem/2) + O(1/m). Thus, doubling the number of trials m increases the entropy by ln(2)/2 - O(1/m) = Ω(1) bits. The lemma follows: either Δ ≥ n/2k, and we have h(n/k) - h(n/k-Δ) = Ω(1), or otherwise h(n/k) - h(Δ) = Ω(1). □
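If you want a numerical sanity check of this Ω(1) gap, here is a small Python computation (an illustration only; I compute entropies in nats, where the gap should approach ln(2)/2 ≈ 0.3466).

```python
import math

def binomial_entropy_nats(m):
    # Shannon entropy, in nats, of Binomial(m, 1/2), computed from the pmf.
    h = 0.0
    for i in range(m + 1):
        p = math.comb(m, i) / 2.0 ** m
        h -= p * math.log(p)
    return h

for m in (50, 100, 200):
    gap = binomial_entropy_nats(2 * m) - binomial_entropy_nats(m)
    print(m, round(gap, 4))        # each gap is close to ln(2)/2
print(round(math.log(2) / 2, 4))   # 0.3466
```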
Now we get to the interesting part. I need to prove that there is a lot of loss in a separate encoding of Q0 and QΔ, even in a probabilistic universe where I've conditioned on some event. Of course, I need a lower bound on the probability of the event (for instance, if the event is A[0]=A[1]=...=0, all entropies are zero, so I can't prove anything).
How rare an event can I handle? The magic answer is Pr[E] ≥ 2^{-εk}, for some small constant ε. Why is this the right answer?
- First, I should be able to handle such an event. If I have some variable X with some entropy H(X), and I tell you some event E has happened, how much did I knock off the entropy of X? You've just learned roughly lg(1/Pr[E]) bits of entropy (the fact that E happened). Intuitively, H(X) cannot decrease by more than what you learned. Thus, if I had a lower bound of Ω(k) on some entropic quantities, I should be able to maintain it under an event of measure 2^{-εk}.
- Second, conditioning on such an event E is a very frequent trick in lower bounds — and 2^{-εk} is always the answer :) The event E fixes something (say, some M bits of memory have been read by the algorithm, so Pr[E] ≈ 2^{-M}). How can I bound the amount by which E simplified the problem? (In particular, I need to show the answer is not already known, otherwise there's no more lower bounding to do.)
A good way to bound the help E can render is to break the problem into 99*M subproblems. Intuitively, the M bits that the algorithm knows cannot help with 99*M independent subproblems; for some of them, the algorithm must be clueless. Thus, the logic goes in the other direction: I'm considering k blocks precisely because I need to argue about an algorithm that has learned εk bits, i.e. I am in a universe where we condition on an event of probability 2^{-εk}.
Lemma 2. For any event E with Pr[E] ≥ 2^{-εk}, we have H(Q0 | E) + H(QΔ | E) − H(Q0, QΔ | E) = Ω(k)
Proof. Remember the intuition from above: if I tell you that some event E happened, I am affecting your entropies by lg 1/Pr[E]. There's only one problem in implementing this intuition, namely that it's false. Imagine the following distribution: on odd days, a uniformly random vector of n bits; on even days, the vector (0,...,0). The entropy, which is an average, is n/2, i.e. quite high. But if I condition on an event with probability 1/2, I can bring the entropy down to zero.
What I need is concentration for entropies. And how do we always prove concentration? By using independence, which, luckily, we have here — remember how we broke the entropies into independent binomials. Of course, the formal implementation may not seem obvious yet, so let's do it carefully.
Let's assume Δ ≤ n/2k by symmetry. In this case, Lemma 1 boils down to: H(Q0) + H(QΔ) − H(Q0, QΔ) ≈ 2k * h(n/k) - k*h(Δ) - k*h(n/k-Δ) ≥ k * (h(n/k) - h(Δ)) = Ω(k). I want to exploit the independence of k quantities, each with expectation h(n/k) - h(Δ).
These quantities come from the definition of entropy. Remember that H(X) = E[ lg 1/p(X) ], where p(.) is the probability density function of X. Let's make the following convenient definitions:
- Xi = the sum of the bits in the i-th block. Let X = (X0, X1, ...).
- Yi = the sum of the first Δ bits in the i-th block. Let Y = (Y0, Y1, ...). Denote the probability densities for Y by q(.).
In the regime Δ ≤ n/2k, we used the lower bound H(Q0) + H(QΔ) − H(Q0, QΔ) ≥ H(X) - H(Y). Thus, I am interested in analyzing H(X|E) - H(Y|E).
To this end, I write H(X) - H(Y) = E[ lg q(Y)/p(X) ] = E[ lg (q(Y1) * q(Y2) * ...) / (p(X1) * p(X2) * ...) ] = E[ lg q(Y1)/p(X1) + lg q(Y2)/p(X2) + ... ].
This is in the desired form, the sum of k independent quantities, since (X1, Y1) is independent of (X2, Y2), etc. Each term has expectation h(n/k) - h(Δ) ≥ ln(2)/2 - O(k/n), as we showed in Lemma 1. Thus, I can apply Chernoff and conclude that the sum is Ω(k) with probability 1 - 2^{-Ω(k)}.
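Spelled out, with Z_i as my own shorthand for the terms (and glossing over the boundedness conditions that a Chernoff-type bound formally needs, just as the text does), the concentration step reads as follows; here c stands for the ln(2)/2 - O(k/n) bound from Lemma 1, under the standing assumption Δ ≤ n/2k:

```latex
\[
  Z_i := \lg \frac{q(Y_i)}{p(X_i)}, \qquad
  \lg \frac{q(Y)}{p(X)} = \sum_{i=1}^{k} Z_i, \qquad
  \mathbb{E}[Z_i] = h(n/k) - h(\Delta) \ge c = \Omega(1),
\]
\[
  \Pr\Big[\sum_{i=1}^{k} Z_i \le \tfrac{c}{2}\,k\Big] \le 2^{-\Omega(k)}.
\]
```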
This probability is so large that an event happening with probability 2^{-εk} does not scare me any more. Indeed, even if I condition on E, the sum is Ω(k) with probability 1-o(1): conditioning on E can inflate the failure probability by at most a factor 1/Pr[E] ≤ 2^{εk}, which still leaves it at 2^{-Ω(k)} for ε small enough. Thus, its expectation is also Ω(k).
Have I just shown H(X|E) - H(Y|E) = Ω(k)? Well, not quite. I've shown E[ lg q(Y)/p(X) | E ] = E[lg 1/p(X) | E] - E[lg 1/q(Y) | E] = Ω(k). But the probability density function p(.) is just another function when I condition on E. Under this conditioning, the real probability density is another function p'(.), i.e. H(X|E) = E[ lg 1/p'(X) | E ].
First remark that the probability mass inside E is a subset of the mass in the general universe, i.e. p(x) ≥ p'(x) * Pr[E], for any x. Thus p'(x) ≤ p(x) * 2^{εk} and lg 1/p'(x) ≥ lg 1/p(x) - εk. So I can switch between p'(.) and p(.): E[lg 1/p'(X) | E] ≥ E[lg 1/p(X) | E] - εk.
Switching between q(.) and q'(.) is more tricky, since I want an upper bound. It is true that for many y, q'(y) could be much smaller than q(y), thus increasing the "entropy" lg 1/q'(y). Let's call y bad if q'(y) ≤ q(y) / 2^{2εk}. How much mass could the bad occupy? We have Σ_y q(y) = 1, so Σ_{y ∈ Bad} q'(y) ≤ 2^{-2εk}. Thus, even inside the event E, the bad y's only have mass 2^{-εk}. Now H(Y|E) is the average over the good y's and the bad y's. For the good, lg 1/q'(y) ≤ lg 1/q(y) + 2εk. For the bad, lg 1/q'(y) ≤ n (the smallest event in my distribution has probability 2^{-n}). But the weight of the bad inside E is 2^{-εk} = o(1/n) for reasonable k. Thus, the contribution of the bad is o(1).
We have thus shown E[lg 1/q'(Y) | E] ≤ E[lg 1/q(Y) | E] + 2εk + o(1). We conclude that H(X|E) - H(Y|E) ≥ E[ lg q(Y)/p(X) | E ] - 3εk - o(1) = Ω(k). □
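For completeness, here is the final chain written out in one place (just the two switches above, combined):

```latex
\[
  H(X \mid E) - H(Y \mid E)
  = \mathbb{E}\Big[\lg \tfrac{1}{p'(X)} \,\Big|\, E\Big]
    - \mathbb{E}\Big[\lg \tfrac{1}{q'(Y)} \,\Big|\, E\Big]
  \ge \Big(\mathbb{E}\Big[\lg \tfrac{1}{p(X)} \,\Big|\, E\Big] - \varepsilon k\Big)
    - \Big(\mathbb{E}\Big[\lg \tfrac{1}{q(Y)} \,\Big|\, E\Big] + 2\varepsilon k + o(1)\Big)
\]
\[
  = \mathbb{E}\Big[\lg \tfrac{q(Y)}{p(X)} \,\Big|\, E\Big] - 3\varepsilon k - o(1)
  = \Omega(k).
\]
```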
That's it. My hope was to present the proof as a sequence of interesting ideas, rather than a long technical thing filled with calculations. I don't know if I succeeded in conveying this feeling; let me know if you need help with any of the steps.
Posted by Mihai at 5:41 AM 19 comments
Friday, August 21, 2009
An Evening in Bucharest
5:30pm. I am at the "Brewers' Court," meeting with an old friend from computer olympiads, now faculty at Bucharest Polytechnic. We order 1-liter beers and I describe an algorithm answering the main open problem from a recent paper. The first question I get is quite polite (I tend to get a lot of respect in Romania), but can roughly be translated as "Dude, does this make any sense?" As is often the case, I show that I cannot focus without coauthors. My algorithm is obviously nonsense, except that it contains enough original ideas that a correct algorithm can be reconstructed in 10 minutes.
7:00pm. I take a few more steps on a path that I have been pursuing for a long time: I try to communicate some of the energy, passion, and standards of US research (sorely lacking in the parts of Europe that I have had contact with). We both have 1-liter beers.
8:00pm. My godfather arrives. We all have 1-liter beers, and talk about work in networking (he is working in software at a major corporation in this field; I am at AT&T, of course).
9:00pm. Another friend from high school (CN Carol I, class of 2001) arrives. We have 1-liter beers with some food, and gossip about another high-school friend (now CEO of a software company).
11:00pm. Two Romanian re-pats (old friends from MIT) arrive. We have 1-liter beers, and talk about their start-up, the child that one has, etc.
12:30am. Another re-pat (Harvard) arrives. We have 1-liter beers, and talk about the cars that they drive at 150mph, the good old times at Physics Olympiads, and life.
1:30am. Yet another re-pat (MIT) arrives. We have 1-liter beers, and talk about the hedge fund where he works, the likely inflation for the dollar in the next 10 years, and life in Romania vs the US.
3:30am. I make it to my sister's apartment and write blog posts (at the end of my European trip, I am still jet-lagged).
Posted by Mihai at 8:08 PM 0 comments
Monday, August 3, 2009
More on Problem 2
You might be wondering what caused the hiatus in posting CEOI problems. My modus operandi as a researcher is to indulge in whatever obsession I currently have. For the past week, the raging obsession has been Problem 2 (tempered, of course, by the bureaucracy of joining AT&T).
Observation 1. In an optimal solution, any two rectangles are either disjoint or "nested" (one of them is taller and has the base included in the other). If two rectangles were to intersect properly, the shorter one can be shrunk horizontally until the rectangles become disjoint.
Let the points have coordinates (x[i], y[i]). Assume they are sorted by x.
Observation 2. We can construct an optimal solution in which every rectangle satisfies:
- it extends horizontally from some x[i] to some x[j];
- it extends vertically up to A / (x[j] - x[i]) (for i < j), i.e. as high as possible;
- it covers points i and j, i.e. y[i], y[j] ≤ A / (x[j] - x[i]).
An O(n^4) solution. We thus know that there are only O(n^2) distinct rectangles to consider. Denote them by R(i,j), though not all pairs (i,j) give valid rectangles.
If R(i,j) is used in the solution, we have a well-contained subproblem to solve: cover all points between x[i] and x[j], which are not already covered by R(i,j), that is, they are above the rectangle. Let A(i,j) be the minimal number of rectangles needed to cover all such points. This is our dynamic program.
We fill the dynamic program by increasing values of j-i. To compute A(i,j), let S be the set of points k (i < k < j) which are not covered by the rectangle. Note that i and j are always covered (by the third condition above). Thus, any information pertinent to covering S has already been computed by our dynamic program.
Let S = {X_1, X_2, ...}. To compute A(i,j), we need to find splitters a < b < c < ... which minimize A(X_1, X_a) + A(X_{a+1}, X_b) + A(X_{b+1}, X_c) + ...
This is a classic dynamic programming problem itself, or, if you wish, a shortest paths problem where A(x,y) is the length of the path from node x to node y. This subproblem can be solved in O(n^2) time (which is "linear time", since there are n^2 subproblems A(x,y)). Thus, the original dynamic program is filled in O(n^4).
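Since the problem statement is not restated in this post, here is a rough Python sketch of the O(n^4) dynamic program under my reading of it: cover all the given points (which lie above the x-axis) with as few rectangles as possible, each rectangle having its base on the x-axis and area at most A. The function names are mine, each group's count here includes its outermost rectangle, and this is only an illustration of the recursion above, not a reference solution.

```python
import math
from functools import lru_cache

def min_rectangles(points, A):
    pts = sorted(points)               # sort the points by x (Observation 2)
    n = len(pts)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def height(i, j):
        # Maximal height of a rectangle spanning x[i]..x[j]; a zero-width
        # rectangle has area 0, so any height is allowed.
        return math.inf if xs[j] == xs[i] else A / (xs[j] - xs[i])

    @lru_cache(maxsize=None)
    def cover(i, j):
        # Min #rectangles for a group whose outermost rectangle spans exactly
        # x[i]..x[j]: it must cover points i and j, and the points that stick
        # out above it are covered recursively.
        h = height(i, j)
        if ys[i] > h or ys[j] > h:
            return math.inf
        above = tuple(k for k in range(i + 1, j) if ys[k] > h)
        return 1 + split(above)

    @lru_cache(maxsize=None)
    def split(s):
        # Min #rectangles to cover the points with indices s (sorted by x),
        # cut by splitters into consecutive groups, one outer rectangle each.
        if not s:
            return 0
        return min(cover(s[0], s[a]) + split(s[a + 1:]) for a in range(len(s)))

    return split(tuple(range(n)))

# Tiny made-up example: one rectangle [0,2] x [0,3] (area 6) covers the first
# three points, and a degenerate one covers the point at x=5; prints 2.
print(min_rectangles([(0, 1), (1, 3), (2, 1), (5, 2)], A=6))
```

With memoization, the suffixes of each set S give O(n^3) states with O(n) work per state, matching the O(n^4) bound above.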
An O(n^3) solution. A common strategy for improving the running time is to try to maintain more information. Thus, I first reformulate the dynamic program above to use O(n^3) state.
Let T(i,j,k) be the optimal number of rectangles to cover all points (x,y) with x[i] ≤ x ≤ x[j] and y ≥ heights[k], where heights contains all y's in sorted order. Unlike the previous solution, I consider all heights k, not just the maximal one for a fixed i and j.
This dynamic program is easy to fill in O(n^4). For some triple (i,j,k), I will either:
- draw a rectangle from x[i] to x[j] of maximal height. If this height is more than heights[k], I get the optimum for covering the remaining points from T(i,j,new height).
- if there is no rectangle spanning x[i] to x[j], there must be a break point m, such that T(i,j,k) = T(i,m,k) + T(m+1,j,k). Iterate to find the optimal such m.
To compute the optimal T(i,j,0), proceed exactly as above, taking time O(n). The rest of T(i,j,k) for k>0 can be computed in constant time per entry. For each T(i,j,k+1), I have two possibilities:
- T(i,j,k+1) = T(i,j,k).
- T(i,j,k+1) < T(i,j,k). Thus, the optimal solution for k+1 is different from the optimal solution for k. But the sets of points that must be covered by these two solutions differ only by one point P (with y[P] = heights[k]). Thus, point P must not be covered by the solution for T(i,j,k+1) — if it were, the better solution for T(i,j,k+1) would also hold for T(i,j,k). Thus, x[P] is a break point in the solution T(i,j,k+1), and we have T(i,j,k+1) = T(i,P-1,k+1) + T(P+1,j,k+1).
Posted by Mihai at 3:13 PM 11 comments