Friday, August 28, 2009
Romania
Posted by Mihai at 9:28 AM 16 comments
Wednesday, August 26, 2009
Longest Multipermutation
Here's an algorithmic problem for your enjoyment:
Define a multipermutation to be a sequence of integers that contains every number from 1 to some K at least once, and nothing outside that range. Examples include (1,3,1,2,2) or (1,4,3,2), but not (1,2,3,5,5).
Given an array of n integers, find the longest multipermutation in O(n) time.
I saw a recent paper doing this in superlinear time, and of course I had to find a linear-time algorithm :) Can you also find one?
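For concreteness, here is a tiny Python check of the definition as I read it (the values must be exactly {1, ..., K}); the function name is mine, and this only illustrates the definition, it is not a solution to the puzzle.

```python
def is_multipermutation(seq):
    # A multipermutation contains every integer from 1 to K = max(seq)
    # at least once, and no value outside the range 1..K.
    if not seq:
        return False
    k = max(seq)
    return min(seq) >= 1 and len(set(seq)) == k

print(is_multipermutation([1, 3, 1, 2, 2]))  # True
print(is_multipermutation([1, 4, 3, 2]))     # True
print(is_multipermutation([1, 2, 3, 5, 5]))  # False: 4 is missing
```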
Posted by Mihai at 3:05 PM 17 comments
Sunday, August 23, 2009
A Technical Lemma
[You can also read this post in PDF.]
In a recent SODA submission, I had to prove a small technical lemma that I considered more or less straightforward. But what exactly is straightforward is in the eye of the beholder, and one of the referees did not like my (succinct and semicoherent) proof. I was thus asked to provide as detailed an explanation as I could, pronto. (Sound familiar? I maintain that we are all too often missing the forest for the trees, but that's a topic for another post.)
Anyway, I thought I would take the opportunity to explain this small technical lemma to my readers who fancy themselves working with entropy inequalities in the future. If you do lower bounds, such things will sooner or later feel quite standard (as standard as, say, maintaining some forward and backward pointers in your algorithm). So if you fancy yourself working in this area, you might consider going through this proof carefully as an exercise.
(A word of caution: Don't be discouraged if this seems overly technical. This is not "what we do" in lower bounds — the point is to discover non-straightforward stuff. Piano skills are not what makes a great composer.)
So, what exactly are we proving? Let A[0..n-1] be a bit vector, chosen uniformly at random. Let Q(i)=A[0]+...+A[i] be the partial sums of the vector (these were queries in my original problem).
I imagine the vector A being split into k blocks of n/k bits each. I denote by QΔ the Δ-th partial sum inside each block, i.e. the vector ( Q(Δ), Q(Δ + n/k), Q(Δ + 2n/k), ...).
We now look at Q0 (the first query in each block) and QΔ for some arbitrary Δ. The main technical question is: how much more efficient is it to encode Q0 and QΔ jointly, as opposed to encoding them separately?
Lemma 1. H(Q0) + H(QΔ) − H(Q0, QΔ) = Ω(k)
Proof. The lemma is simple and intuitive. First remark that H(a, a+b, a+b+c) = H(a, b, c); furthermore, this is H(a)+H(b)+H(c) if a, b, and c are independent. Thus, H(Q0) = H(sum in block 1) + H(sum in block 2) + ... Of course, the sum in a block is a binomial with n/k Bernoulli trials. Thus, H(Q0) = (k-1) * h(n/k), where I use h(m) to denote the entropy of a binomial on m trials.
Just the same, H(QΔ) = h(Δ) + (k-1) * h(n/k). Thus H(Q0) + H(QΔ) ≈ 2k * h(n/k).
Finally, H(Q0, QΔ) = h(Δ) + h(n/k - Δ) + h(Δ) + h(n/k - Δ) + ... ≈ k*h(Δ) + k*h(n/k-Δ). Indeed, I can break (Q0, QΔ) into 2k independent components: the sum up to Δ, the sum from there up to n/k, the sum from there up to n/k + Δ, the sum from there up to 2n/k, and so on.
To conclude, we look up the entropy of a binomial on Wikipedia and find that h(m) = 0.5 * ln(πem/2) + O(1/m). Thus, doubling the number of trials m increases the entropy by ln(2)/2 - O(1/m) = Ω(1) bits. The lemma follows: either Δ ≥ n/2k, and we have h(n/k) - h(n/k-Δ) = Ω(1), or otherwise h(n/k) - h(Δ) = Ω(1). □
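If you want a numerical sanity check of this Ω(1) gap, here is a small Python computation (an illustration only; I compute entropies in nats, where the gap should approach ln(2)/2 ≈ 0.3466).

```python
import math

def binomial_entropy_nats(m):
    # Shannon entropy, in nats, of Binomial(m, 1/2), computed from the pmf.
    h = 0.0
    for i in range(m + 1):
        p = math.comb(m, i) / 2.0 ** m
        h -= p * math.log(p)
    return h

for m in (50, 100, 200):
    gap = binomial_entropy_nats(2 * m) - binomial_entropy_nats(m)
    print(m, round(gap, 4))        # each gap is close to ln(2)/2
print(round(math.log(2) / 2, 4))   # 0.3466
```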
Now we get to the interesting part. I need to prove that there is a lot of loss in a separate encoding of Q0 and QΔ, even in a probabilistic universe where I've conditioned on some event. Of course, I need a lower bound on the probability of the event (for instance, if the event is A[0]=A[1]=...=0, all entropies are zero, so I can't prove anything).
How rare an event can I handle? The magic answer is Pr[E] ≥ 2^{-εk}, for some small constant ε. Why is this the right answer?
- First, I should be able to handle such an event. If I have some variable X with some entropy H(X), and I tell you some event E has happened, how much did I knock off the entropy of X? You've just learned roughly lg(1/Pr[E]) bits of entropy (the fact that E happened). Intuitively, H(X) cannot decrease by more than what you learned. Thus, if I had a lower bound of Ω(k) on some entropic quantities, I should be able to maintain it under an event of measure 2^{-εk}.
- Second, conditioning on such an event E is a very frequent trick in lower bounds — and 2^{-εk} is always the answer :) The event E fixes something (say, some M bits of memory have been read by the algorithm, so Pr[E] ≈ 2^{-M}). How can I bound the amount by which E simplified the problem? (In particular, I need to show the answer is not already known, otherwise there's no more lower bounding to do.)
A good way to bound the help E can render is to break the problem into 99*M subproblems. Intuitively, the M bits that the algorithm knows cannot help with 99*M independent subproblems; for some of them, the algorithm must be clueless. Thus, the logic goes in the other direction: I'm considering k blocks precisely because I need to argue about an algorithm that has learned εk bits, i.e. I am in a universe where we condition on an event of probability 2^{-εk}.
Lemma 2. For any event E with Pr[E] ≥ 2^{-εk}, we have H(Q0 | E) + H(QΔ | E) − H(Q0, QΔ | E) = Ω(k)
Proof. Remember the intuition from above: if I tell you that some event E happened, I am affecting your entropies by lg 1/Pr[E]. There's only one problem in implementing this intuition, namely that it's false. Imagine the following distribution: on odd days, a uniformly random vector of n bits; on even days, the vector (0,...,0). The entropy, which is an average, is n/2, i.e. quite high. But if I condition on an event with probability 1/2, I can bring the entropy down to zero.
What I need is concentration for entropies. And how do we always prove concentration? By using independence, which, luckily, we have here — remember how we broke the entropies into independent binomials. Of course, the formal implementation may not seem obvious yet, so let's do it carefully.
Let's assume Δ ≤ n/2k by symmetry. In this case, Lemma 1 boils down to: H(Q0) + H(QΔ) − H(Q0, QΔ) ≈ 2k * h(n/k) - k*h(Δ) - k*h(n/k-Δ) ≥ k * (h(n/k) - h(Δ)) = Ω(k). I want to exploit the independence of k quantities, each with expectation h(n/k) - h(Δ).
These quantities come from the definition of entropy. Remember that H(X) = E[ lg 1/p(X) ], where p(.) is the probability density function of X. Let's make the following convenient definitions:
- Xi = the sum of the bits in the i-th block. Let X = (X0, X1, ...).
- Yi = the sum of the first Δ bits in the i-th block. Let Y = (Y0, Y1, ...). Denote the probability densities for Y by q(.).
In the regime Δ ≤ n/2k, we used the lower bound H(Q0) + H(QΔ) − H(Q0, QΔ) ≥ H(X) - H(Y). Thus, I am interested in analyzing H(X|E) - H(Y|E).
To this end, I write H(X) - H(Y) = E[ lg q(Y)/p(X) ] = E[ lg (q(Y1) * q(Y2) * ...) / (p(X1) * p(X2) * ...) ] = E[ lg q(Y1)/p(X1) + lg q(Y2)/p(X2) + ... ].
This is in the desired form, the sum of k independent quantities, since (X1, Y1) is independent of (X2, Y2), etc. Each term has expectation h(n/k) - h(Δ) ≥ ln(2)/2 - O(k/n), as we showed in Lemma 1. Thus, I can apply Chernoff and conclude that the sum is Ω(k) with probability 1 - 2^{-Ω(k)}.
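Spelled out, with Z_i as my own shorthand for the terms (and glossing over the boundedness conditions that a Chernoff-type bound formally needs, just as the text does), the concentration step reads as follows; here c stands for the ln(2)/2 - O(k/n) bound from Lemma 1, under the standing assumption Δ ≤ n/2k:

```latex
\[
  Z_i := \lg \frac{q(Y_i)}{p(X_i)}, \qquad
  \lg \frac{q(Y)}{p(X)} = \sum_{i=1}^{k} Z_i, \qquad
  \mathbb{E}[Z_i] = h(n/k) - h(\Delta) \ge c = \Omega(1),
\]
\[
  \Pr\Big[\sum_{i=1}^{k} Z_i \le \tfrac{c}{2}\,k\Big] \le 2^{-\Omega(k)}.
\]
```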
This probability is so large that an event happening with probability 2^{-εk} does not scare me any more. Indeed, even if I condition on E, the sum is Ω(k) with probability 1-o(1): conditioning on E can inflate the failure probability by at most a factor 1/Pr[E] ≤ 2^{εk}, which still leaves it at 2^{-Ω(k)} for ε small enough. Thus, its expectation is also Ω(k).
Have I just shown H(X|E) - H(Y|E) = Ω(k)? Well, not quite. I've shown E[ lg q(Y)/p(X) | E ] = E[lg 1/p(X) | E] - E[lg 1/q(Y) | E] = Ω(k). But the probability density function p(.) is just another function when I condition on E. Under this conditioning, the real probability density is another function p'(.), i.e. H(X|E) = E[ lg 1/p'(X) | E ].
First remark that the probability mass inside E is a subset of the mass in the general universe, i.e. p(x) ≥ p'(x) * Pr[E], for any x. Thus p'(x) ≤ p(x) * 2^{εk} and lg 1/p'(x) ≥ lg 1/p(x) - εk. So I can switch between p'(.) and p(.): E[lg 1/p'(X) | E] ≥ E[lg 1/p(X) | E] - εk.
Switching between q(.) and q'(.) is more tricky, since I want an upper bound. It is true that for many y, q'(y) could be much smaller than q(y), thus increasing the "entropy" lg 1/q'(y). Let's call y bad if q'(y) ≤ q(y) / 2^{2εk}. How much mass could the bad occupy? We have Σ_y q(y) = 1, so Σ_{y ∈ Bad} q'(y) ≤ 2^{-2εk}. Thus, even inside the event E, the bad y's only have mass 2^{-εk}. Now H(Y|E) is the average over the good y's and the bad y's. For the good, lg 1/q'(y) ≤ lg 1/q(y) + 2εk. For the bad, lg 1/q'(y) ≤ n (the smallest event in my distribution has probability 2^{-n}). But the weight of the bad inside E is 2^{-εk} = o(1/n) for reasonable k. Thus, the contribution of the bad is o(1).
We have thus shown E[lg 1/q'(Y) | E] ≤ E[lg 1/q(Y) | E] + 2εk + o(1). We conclude that H(X|E) - H(Y|E) ≥ E[ lg q(Y)/p(X) | E ] - 3εk - o(1) = Ω(k). □
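For completeness, here is the final chain written out in one place (just the two switches above, combined):

```latex
\[
  H(X \mid E) - H(Y \mid E)
  = \mathbb{E}\Big[\lg \tfrac{1}{p'(X)} \,\Big|\, E\Big]
    - \mathbb{E}\Big[\lg \tfrac{1}{q'(Y)} \,\Big|\, E\Big]
  \ge \Big(\mathbb{E}\Big[\lg \tfrac{1}{p(X)} \,\Big|\, E\Big] - \varepsilon k\Big)
    - \Big(\mathbb{E}\Big[\lg \tfrac{1}{q(Y)} \,\Big|\, E\Big] + 2\varepsilon k + o(1)\Big)
\]
\[
  = \mathbb{E}\Big[\lg \tfrac{q(Y)}{p(X)} \,\Big|\, E\Big] - 3\varepsilon k - o(1)
  = \Omega(k).
\]
```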
That's it. My hope was to present the proof as a sequence of interesting ideas, rather than a long technical thing filled with calculations. I don't know if I succeeded in conveying this feeling; let me know if you need help with any of the steps.
Posted by Mihai at 5:41 AM 19 comments
Friday, August 21, 2009
An Evening in Bucharest
5:30pm. I am at the "Brewers' Court," meeting with an old friend from computer olympiads, now faculty at Bucharest Polytechnic. We order 1-liter beers and I describe an algorithm answering the main open problem from a recent paper. The first question I get is quite polite (I tend to get a lot of respect in Romania), but can roughly be translated as "Dude, does this make any sense?" As is often the case, I show that I cannot focus without coauthors. My algorithm is obviously nonsense, except that it contains enough original ideas that a correct algorithm can be reconstructed in 10 minutes.
7:00pm. I take a few more steps on a path that I have been pursuing for a long time: I try to communicate some of the energy, passion, and standards of US research (sorely lacking in the parts of Europe that I have had contact with). We both have 1-liter beers.
8:00pm. My godfather arrives. We all have 1-liter beers, and talk about work in networking (he is working in software at a major corporation in this field; I am at AT&T, of course).
9:00pm. Another friend from high school (CN Carol I, class of 2001) arrives. We have 1-liter beers with some food, and gossip about another high-school friend (now CEO of a software company).
11:00pm. Two Romanian re-pats (old friends from MIT) arrive. We have 1-liter beers, and talk about their start-up, the child that one has, etc.
12:30am. Another re-pat (Harvard) arrives. We have 1-liter beers, and talk about the cars that they drive at 150mph, the good old times at Physics Olympiads, and life.
1:30am. Yet another re-pat (MIT) arrives. We have 1-liter beers, and talk about the hedge fund where he works, the likely inflation for the dollar in the next 10 years, and life in Romania vs the US.
3:30am. I make it to my sister's apartment and write blog posts (at the end of my European trip, I am still jet-lagged).
Posted by Mihai at 8:08 PM 0 comments
Monday, August 3, 2009
More on Problem 2
You might be wondering what caused the hiatus in posting CEOI problems. My modus operandi as a researcher is to indulge in whatever obsession I currently have. For the past week, the raging obsession has been Problem 2 (tempered, of course, by the bureaucracy of joining AT&T).
Observation 1. In an optimal solution, any two rectangles are either disjoint or "nested" (one of them is taller and has the base included in the other). If two rectangles were to intersect properly, the shorter one can be shrunk horizontally until the rectangles become disjoint.
Let the points have coordinates (x[i], y[i]). Assume they are sorted by x.
Observation 2. We can construct an optimal solution in which every rectangle satisfies:
- it extends horizontally from some x[i] to some x[j];
- it extends vertically up to A / (x[j] - x[i]) (for i < j), i.e. as high as possible;
- it covers points i and j, i.e. y[i], y[j] ≤ A / (x[j] - x[i]).
An O(n^4) solution. We thus know that there are only O(n^2) distinct rectangles to consider. Denote them by R(i,j), though not all pairs (i,j) give valid rectangles.
If R(i,j) is used in the solution, we have a well-contained subproblem to solve: cover all points between x[i] and x[j], which are not already covered by R(i,j), that is, they are above the rectangle. Let A(i,j) be the minimal number of rectangles needed to cover all such points. This is our dynamic program.
We fill the dynamic program by increasing values of j-i. To compute A(i,j), let S be the set of points k (i < k < j) which are not covered by the rectangle. Note that i and j are always covered (by the third condition above). Thus, any information pertinent to covering S has already been computed by our dynamic program.
Let S = {X_1, X_2, ...}. To compute A(i,j), we need to find splitters a < b < c < ... which minimize A(X_1, X_a) + A(X_{a+1}, X_b) + A(X_{b+1}, X_c) + ...
This is a classic dynamic programming problem itself, or, if you wish, a shortest paths problem where A(x,y) is the length of the path from node x to node y. This subproblem can be solved in O(n^2) time (which is "linear time", since there are n^2 subproblems A(x,y)). Thus, the original dynamic program is filled in O(n^4).
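Since the problem statement is not restated in this post, here is a rough Python sketch of the O(n^4) dynamic program under my reading of it: cover all the given points (which lie above the x-axis) with as few rectangles as possible, each rectangle having its base on the x-axis and area at most A. The function names are mine, each group's count here includes its outermost rectangle, and this is only an illustration of the recursion above, not a reference solution.

```python
import math
from functools import lru_cache

def min_rectangles(points, A):
    pts = sorted(points)               # sort the points by x (Observation 2)
    n = len(pts)
    xs = [p[0] for p in pts]
    ys = [p[1] for p in pts]

    def height(i, j):
        # Maximal height of a rectangle spanning x[i]..x[j]; a zero-width
        # rectangle has area 0, so any height is allowed.
        return math.inf if xs[j] == xs[i] else A / (xs[j] - xs[i])

    @lru_cache(maxsize=None)
    def cover(i, j):
        # Min #rectangles for a group whose outermost rectangle spans exactly
        # x[i]..x[j]: it must cover points i and j, and the points that stick
        # out above it are covered recursively.
        h = height(i, j)
        if ys[i] > h or ys[j] > h:
            return math.inf
        above = tuple(k for k in range(i + 1, j) if ys[k] > h)
        return 1 + split(above)

    @lru_cache(maxsize=None)
    def split(s):
        # Min #rectangles to cover the points with indices s (sorted by x),
        # cut by splitters into consecutive groups, one outer rectangle each.
        if not s:
            return 0
        return min(cover(s[0], s[a]) + split(s[a + 1:]) for a in range(len(s)))

    return split(tuple(range(n)))

# Tiny made-up example: one rectangle [0,2] x [0,3] (area 6) covers the first
# three points, and a degenerate one covers the point at x=5; prints 2.
print(min_rectangles([(0, 1), (1, 3), (2, 1), (5, 2)], A=6))
```

With memoization, the suffixes of each set S give O(n^3) states with O(n) work per state, matching the O(n^4) bound above.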
An O(n^3) solution. A common strategy for improving the running time is to try to maintain more information. Thus, I first reformulate the dynamic program above to use O(n^3) state.
Let T(i,j,k) be the optimal number of rectangles to cover all points (x,y) with x[i] ≤ x ≤ x[j] and y ≥ heights[k], where heights contains all y's in sorted order. Unlike the previous solution, I consider all heights k, not just the maximal one for a fixed i and j.
This dynamic program is easy to fill in O(n^4). For some triple (i,j,k), I will either:
- draw a rectangle from x[i] to x[j] of maximal height. If this height is more than heights[k], I get the optimum for covering the remaining points from T(i,j,new height).
- if there is no rectangle spanning x[i] to x[j], there must be a break point m, such that T(i,j,k) = T(i,m,k) + T(m+1,j,k). Iterate to find the optimal such m.
To compute the optimal T(i,j,0), proceed exactly as above, taking time O(n). The rest of T(i,j,k) for k>0 can be computed in constant time per entry. For each T(i,j,k+1), I have two possibilities:
- T(i,j,k+1) = T(i,j,k).
- T(i,j,k+1) < T(i,j,k). Thus, the optimal solution for k+1 is different from the optimal solution for k. But the sets of points that must be covered by these two solutions differ only by one point P (with y[P] = heights[k]). Thus, point P must not be covered by the solution for T(i,j,k+1) — if it were, the better solution for T(i,j,k+1) would also hold for T(i,j,k). Thus, x[P] is a break point in the solution T(i,j,k+1), and we have T(i,j,k+1) = T(i,P-1,k+1) + T(P+1,j,k+1).
Posted by Mihai at 3:13 PM 11 comments