[You can also read this post in PDF.]
In a recent SODA submission, I had to prove a small technical lemma that I considered more or less straightforward. But what exactly is straightforward is in the eye of the beholder, and one of the referees did not like my (succinct and semicoherent) proof. I was thus asked to provide as detailed an explanation as I could, pronto. (Sound familiar? I maintain that we are all too often missing the forest for the trees, but that's a topic for another post.)
Anyway, I thought I would take the opportunity to explain this small technical lemma to my readers who fancy themselves working with entropy inequalities in the future. If you do lower bounds, such things will sooner or later feel quite standard (as standard as, say, maintaining some forward and backward pointers in your algorithm). So if you fancy yourself working in this area, you might consider going through this proof carefully as an exercise.
(A word of caution: Don't be discouraged if this seems overly technical. This is not "what we do" in lower bounds — the point is to discover non-straightforward stuff. Piano skills are not what makes a great composer.)
So, what exactly are we proving? Let A[0..n-1] be a bit vector, chosen uniformly at random. Let Q(i)=A+...+A[i] be the partial sums of the vector (these were queries in my original problem).
I imagine the vector A being split into k blocks of n/k bits each. I denote by QΔ the Δ-th partial sum inside each block, i.e. the vector ( Q(Δ), Q(Δ + n/k), Q(Δ + 2n/k), ...).
We now look at Q0 (the first query in each block) and QΔ for some arbitrary Δ. The main technical question is: how much more efficient is it to encode Q0 and QΔ jointly, as opposed to encoding them separately?
Lemma 1. H(Q0) + H(QΔ) − H(Q0, QΔ)) = Ω(k)
Proof. The lemma is simple and intuitive. First remark that H(a, a+b, a+b+c) = H(a, b, c); furthermore, this is H(a)+H(b)+H(c) if a, b, and c are independent. Thus, H(Q0) = H(sum in block 1) + H(sum in block 2) + ... Of course, the sum in a block is a binomial with n/k Bernoulli trials. Thus, H(Q0) = (k-1) * h(n/k), where I use h to denote the entropy of a binomial.
Just the same, H(QΔ) = h(Δ) + (k-1) * h(n/k). Thus H(Q0) + H(QΔ) ≈ 2k * h(n/k).
Finally, H(Q0, QΔ)) = h(Δ) + h(n/k - Δ) + h(Δ) + h(n/k - Δ) + ... ≈ k*h(Δ) + k*h(n/k-Δ). Indeed, I can break (Q0, QΔ) into 2k independent components: the sum up to Δ, the sum from there up to n/k, the sum from there up to n/k + Δ, the sum from there up to 2n/k, and so on.
To conclude, we look up the entropy of a binomial on Wikipedia and find that h(m) = 0.5 * ln(πem/2) + O(1/m). Thus, doubling the number of trials m increases the entropy by ln(2)/2 - O(1/m) = Ω(1) bits. The lemma follows: either Δ ≥ n/2k, and we have h(n/k) - h(n/k-Δ) = Ω(1), or otherwise h(n/k) - h(Δ) = Ω(1). □
Now we get to the interesting part. I need to prove that there is a lot of loss in a separate encoding of Q0 and QΔ, even in a probabilistic universe where I've conditioned on some event. Of course, I need a lower bound on the probability of the event (for instance, if the event is A=A=...=0, all entropies are zero, so I can't prove anything).
How rare an event can I handle? The magic answer is Pr[E] ≥ 2-εk, for some small constant ε. Why is this the right answer?
- First, I should be able to handle such an event. If I have some variable X with some entropy H(X), and I tell you some event E has happened, how much did I knock off the entropy of X? You've just learned roughly lg(1/Pr[E]) bits of entropy (the fact that E happened). Intuitively, H(X) cannot decrease by more than what you learned. Thus, if I had a lower bound of Ω(k) on some entropic quantities, I should be able to maintain it under an event of measure 2-εk.
- Second, conditioning on such an event E is a very frequent trick in lower bounds — and 2-εk is always the answer :) The event E fixes something (say, some M bits of memory have been read by the algorithm, so Pr[E] ≈ 2-M). How can I bound the amount by which E simplified the problem? (In particular, I need to show the answer is not already known, otherwise there's no more
lowerbounding to do.)
A good way to bound the help E can render is to break the problem into 99*M subproblems. Intuitively, the M bits that the algorithm knows cannot help with 99*M independent subproblems; for some of the them, the algorithm must be clueless. Thus, the logic goes in the other direction: I'm considering k blocks precisely because I need to argue about an algorithm that has learned εk bits, i.e. I am in a universe where we condition on an event of probability 2-εk.
Lemma 2. For any event E with Pr[E] ≥ 2-εk, we have H(Q0 | E) + H(QΔ | E) − H(Q0, QΔ) | E) = Ω(k)
Proof. Remember the intuition from above: if I tell you that some event E happened, I am affecting your entropies by lg 1/Pr[E]. There's only one problem in implementing this intuition, namely that it's false. Imagine the following distribution: on odd days, a uniformly random vector of n bits; on even days, the vector (0,...,0). The entropy, which is an average, is n/2, i.e. quite high. But if I condition on an event with probability 1/2, I can bring the entropy down to zero.
What I need is concentration for entropies. And how do we always prove concentration? By using independence, which, luckily, we have here — remember how we broke the entropies into independent binomials. Of course, the formal implementation may not seem obvious yet, so let's do it carefully.
Let's assume Δ ≤ n/2k by symmetry. In this case, Lemma 1 boils down to: H(Q0) + H(QΔ) − H(Q0, QΔ)) ≈ 2k * h(n/k) - k*h(Δ) + k*h(n/k-Δ) ≥ k * (h(n/k) - h(Δ)) = Ω(k). I want to exploit the independence of k quantities, each with expectation h(n/k) - h(Δ).
These quantities come from the definition of entropy. Remember that H(X) = E[ lg 1/p(X) ], where p(.) is the probability density function of X. Let's make the following convenient definitions:
- Xi = the sum of the bits in the i-th block. Let X = (X0, X1, ...).
- Yi = the sum of the first Δ bits in the i-th block. Let Y = (Y0, Y1, ...). Denote the probability densities for Y by q(.).
In the regime Δ ≤ n/2k, we used the lower bound H(Q0) + H(QΔ) − H(Q0, QΔ)) ≥ H(X) - H(Y). Thus, I am interested in analyzing H(X|E) - H(Y|E).
To this end, I write H(X) - H(Y) = E[ lg q(Y)/p(X) ] = E[ lg (q(Y1) * q(Y2) * ...) / (p(X1) * p(X2) * ...) ] = E[ lg q(Y1) / (p(X1) + lg q(Y2) / p(X2) + ...].
This is in the desired form, the sum of k independent quantities, since (X1, Y1) is independent of (X2, Y2), etc. Each term has expectation h(n/k) - h(Δ) ≥ ln(2)/2 - O(1/m), as we showed in Lemma 1. Thus, I can apply Chernoff and conclude that the sum is Ω(k) with probability 1 - 2-Ω(k).
This probability is so large, that an event happening with probability 2-εk does not scare me any more. Indeed, even if I condition on E, the sum is Ω(k) with probability 1-o(1). Thus, it's expectation is also Ω(k).
Have I just shown H(X|E) - H(Y|E) = Ω(k)? Well, not quite. I've shown E[ lg q(Y)/p(X) | E ] = E[lg 1/p(X) | E] - E[lg 1/q(Y) | E] = Ω(k). But the probability density function p(.) is just another function when I condition on E. Under this conditioning, the real probability density is another function p'(.), i.e. H(X|E) = E[ lg 1/p'(X) ].
First remark that the probability mass inside E is a subset of the mass in the general universe, i.e. p(x) ≥ p'(x) * Pr[E], for any x. Thus p'(x) ≤ p(x) * 2εk and lg 1/p'(x) ≥ lg 1/p(x) - εk. So I can switch between p'(.) and p(.): E[lg 1/p'(X) | E] ≥ E[lg 1/p(X) | E] - εk.
Switching between q(.) and q'(.) is more tricky, since I want an upper bound. It is true that for many y, q'(y) could be much smaller than q(y), thus increasing the "entropy" lg 1/q'(y). Let's call y bad if q'(y) ≤ q(y) / 22εk. How much mass could the bad occupy? We has Sumy q(y) = 1, so Sumy∈Bad q'(y) ≤ 2-2εk. Thus, even inside the event E, the bad y's only have mass 2-εk. Now H(Y|E) is the average over the good y's and the bad y's. For the good, lg 1/q'(y) ≤ lg 1/q(y) + 2εk. For the bad, lg 1/q'(y) ≤ n (the smallest event in my distribution has probability 2-n). But the weight of the bad inside E is 2-εk = o(1/n) for reasonable k. Thus, the contribution of the bad is o(1).
We have thus shown E[lg 1/q'(Y) | E] ≤ E[lg 1/q(Y)] + 2εk. We conclude that H(X|E) - H(Y|E) ≥ E[ lg q(Y)/p(X) | E ] - 3εk = Ω(k). □
That's it. My hope was to present the proof as a sequence of interesting ideas, rather than a long technical thing filled with calculations. I don't know if I succeeded in conveying this feeling; let me know if you need help with any of the steps.