Sunday, August 23, 2009

A Technical Lemma

[You can also read this post in PDF.]

In a recent SODA submission, I had to prove a small technical lemma that I considered more or less straightforward. But what exactly is straightforward is in the eye of the beholder, and one of the referees did not like my (succinct and semicoherent) proof. I was thus asked to provide as detailed an explanation as I could, pronto. (Sound familiar? I maintain that we are all too often missing the forest for the trees, but that's a topic for another post.)

Anyway, I thought I would take the opportunity to explain this small technical lemma to my readers who fancy themselves working with entropy inequalities in the future. If you do lower bounds, such things will sooner or later feel quite standard (as standard as, say, maintaining some forward and backward pointers in your algorithm). So if you fancy yourself working in this area, you might consider going through this proof carefully as an exercise.

(A word of caution: Don't be discouraged if this seems overly technical. This is not "what we do" in lower bounds — the point is to discover non-straightforward stuff. Piano skills are not what makes a great composer.)

So, what exactly are we proving? Let A[0..n-1] be a bit vector, chosen uniformly at random. Let Q(i)=A[0]+...+A[i] be the partial sums of the vector (these were queries in my original problem).

I imagine the vector A being split into k blocks of n/k bits each. I denote by QΔ the Δ-th partial sum inside each block, i.e. the vector ( Q(Δ), Q(Δ + n/k), Q(Δ + 2n/k), ...).

We now look at Q0 (the first query in each block) and QΔ for some arbitrary Δ. The main technical question is: how much more efficient is it to encode Q0 and QΔ jointly, as opposed to encoding them separately?

Lemma 1. H(Q0) + H(QΔ) − H(Q0, QΔ)) = Ω(k)

Proof. The lemma is simple and intuitive. First remark that H(a, a+b, a+b+c) = H(a, b, c); furthermore, this is H(a)+H(b)+H(c) if a, b, and c are independent. Thus, H(Q0) = H(sum in block 1) + H(sum in block 2) + ... Of course, the sum in a block is a binomial with n/k Bernoulli trials. Thus, H(Q0) = (k-1) * h(n/k), where I use h to denote the entropy of a binomial.

Just the same, H(QΔ) = h(Δ) + (k-1) * h(n/k). Thus H(Q0) + H(QΔ) ≈ 2k * h(n/k).

Finally, H(Q0, QΔ)) = h(Δ) + h(n/k - Δ) + h(Δ) + h(n/k - Δ) + ... ≈ k*h(Δ) + k*h(n/k-Δ). Indeed, I can break (Q0, QΔ) into 2k independent components: the sum up to Δ, the sum from there up to n/k, the sum from there up to n/k + Δ, the sum from there up to 2n/k, and so on.

To conclude, we look up the entropy of a binomial on Wikipedia and find that h(m) = 0.5 * ln(πem/2) + O(1/m). Thus, doubling the number of trials m increases the entropy by ln(2)/2 - O(1/m) = Ω(1) bits. The lemma follows: either Δ ≥ n/2k, and we have h(n/k) - h(n/k-Δ) = Ω(1), or otherwise h(n/k) - h(Δ) = Ω(1). □

Now we get to the interesting part. I need to prove that there is a lot of loss in a separate encoding of Q0 and QΔ, even in a probabilistic universe where I've conditioned on some event. Of course, I need a lower bound on the probability of the event (for instance, if the event is A[0]=A[1]=...=0, all entropies are zero, so I can't prove anything).

How rare an event can I handle? The magic answer is Pr[E] ≥ 2-εk, for some small constant ε. Why is this the right answer?

  • First, I should be able to handle such an event. If I have some variable X with some entropy H(X), and I tell you some event E has happened, how much did I knock off the entropy of X? You've just learned roughly lg(1/Pr[E]) bits of entropy (the fact that E happened). Intuitively, H(X) cannot decrease by more than what you learned. Thus, if I had a lower bound of Ω(k) on some entropic quantities, I should be able to maintain it under an event of measure 2-εk.

  • Second, conditioning on such an event E is a very frequent trick in lower bounds — and 2-εk is always the answer :) The event E fixes something (say, some M bits of memory have been read by the algorithm, so Pr[E] ≈ 2-M). How can I bound the amount by which E simplified the problem? (In particular, I need to show the answer is not already known, otherwise there's no more
    lowerbounding to do.)

    A good way to bound the help E can render is to break the problem into 99*M subproblems. Intuitively, the M bits that the algorithm knows cannot help with 99*M independent subproblems; for some of the them, the algorithm must be clueless. Thus, the logic goes in the other direction: I'm considering k blocks precisely because I need to argue about an algorithm that has learned εk bits, i.e. I am in a universe where we condition on an event of probability 2-εk.
Thus, what I really want to prove is:

Lemma 2. For any event E with Pr[E] ≥ 2-εk, we have H(Q0 | E) + H(QΔ | E) − H(Q0, QΔ) | E) = Ω(k)

Proof. Remember the intuition from above: if I tell you that some event E happened, I am affecting your entropies by lg 1/Pr[E]. There's only one problem in implementing this intuition, namely that it's false. Imagine the following distribution: on odd days, a uniformly random vector of n bits; on even days, the vector (0,...,0). The entropy, which is an average, is n/2, i.e. quite high. But if I condition on an event with probability 1/2, I can bring the entropy down to zero.

What I need is concentration for entropies. And how do we always prove concentration? By using independence, which, luckily, we have here — remember how we broke the entropies into independent binomials. Of course, the formal implementation may not seem obvious yet, so let's do it carefully.

Let's assume Δ ≤ n/2k by symmetry. In this case, Lemma 1 boils down to: H(Q0) + H(QΔ) − H(Q0, QΔ)) ≈ 2k * h(n/k) - k*h(Δ) + k*h(n/k-Δ) ≥ k * (h(n/k) - h(Δ)) = Ω(k). I want to exploit the independence of k quantities, each with expectation h(n/k) - h(Δ).

These quantities come from the definition of entropy. Remember that H(X) = E[ lg 1/p(X) ], where p(.) is the probability density function of X. Let's make the following convenient definitions:
  • Xi = the sum of the bits in the i-th block. Let X = (X0, X1, ...).

  • Yi = the sum of the first Δ bits in the i-th block. Let Y = (Y0, Y1, ...). Denote the probability densities for Y by q(.).

In the regime Δ ≤ n/2k, we used the lower bound H(Q0) + H(QΔ) − H(Q0, QΔ)) ≥ H(X) - H(Y). Thus, I am interested in analyzing H(X|E) - H(Y|E).

To this end, I write H(X) - H(Y) = E[ lg q(Y)/p(X) ] = E[ lg (q(Y1) * q(Y2) * ...) / (p(X1) * p(X2) * ...) ] = E[ lg q(Y1) / (p(X1) + lg q(Y2) / p(X2) + ...].

This is in the desired form, the sum of k independent quantities, since (X1, Y1) is independent of (X2, Y2), etc. Each term has expectation h(n/k) - h(Δ) ≥ ln(2)/2 - O(1/m), as we showed in Lemma 1. Thus, I can apply Chernoff and conclude that the sum is Ω(k) with probability 1 - 2-Ω(k).

This probability is so large, that an event happening with probability 2-εk does not scare me any more. Indeed, even if I condition on E, the sum is Ω(k) with probability 1-o(1). Thus, it's expectation is also Ω(k).

Have I just shown H(X|E) - H(Y|E) = Ω(k)? Well, not quite. I've shown E[ lg q(Y)/p(X) | E ] = E[lg 1/p(X) | E] - E[lg 1/q(Y) | E] = Ω(k). But the probability density function p(.) is just another function when I condition on E. Under this conditioning, the real probability density is another function p'(.), i.e. H(X|E) = E[ lg 1/p'(X) ].

First remark that the probability mass inside E is a subset of the mass in the general universe, i.e. p(x) ≥ p'(x) * Pr[E], for any x. Thus p'(x) ≤ p(x) * 2εk and lg 1/p'(x) ≥ lg 1/p(x) - εk. So I can switch between p'(.) and p(.): E[lg 1/p'(X) | E] ≥ E[lg 1/p(X) | E] - εk.

Switching between q(.) and q'(.) is more tricky, since I want an upper bound. It is true that for many y, q'(y) could be much smaller than q(y), thus increasing the "entropy" lg 1/q'(y). Let's call y bad if q'(y) ≤ q(y) / 22εk. How much mass could the bad occupy? We has Sumy q(y) = 1, so Sumy∈Bad q'(y) ≤ 2-2εk. Thus, even inside the event E, the bad y's only have mass 2-εk. Now H(Y|E) is the average over the good y's and the bad y's. For the good, lg 1/q'(y) ≤ lg 1/q(y) + 2εk. For the bad, lg 1/q'(y) ≤ n (the smallest event in my distribution has probability 2-n). But the weight of the bad inside E is 2-εk = o(1/n) for reasonable k. Thus, the contribution of the bad is o(1).

We have thus shown E[lg 1/q'(Y) | E] ≤ E[lg 1/q(Y)] + 2εk. We conclude that H(X|E) - H(Y|E) ≥ E[ lg q(Y)/p(X) | E ] - 3εk = Ω(k). □

That's it. My hope was to present the proof as a sequence of interesting ideas, rather than a long technical thing filled with calculations. I don't know if I succeeded in conveying this feeling; let me know if you need help with any of the steps.


Paul Beame said...

Thumbs up on your explanation overall.

A minor comment: Your first intuition point about knocking off εk-bits at most (which you later rescind) might be better expressed as an upper bound rather than a lower bound: "that's the most you can be expected to tolerate" because an adversary might reduce them by this much. Your point later in the proof is that an adversary might succeed even better than this if the entropy were distributed unevenly, but that doesn't contradict the right version of the intuition.

But what exactly is straightforward is in the eye of the beholder, and one of the referees did not like my (succinct and semicoherent) proof. I was thus asked to provide as detailed an explanation as I could, pronto.

Given the SODA call for papers it this shouldn't have been a surprise. The SODA CFP contains
" The submission must contain complete proofs. (Part of the proofs can be relegated to the appendix if they do not fit in the requisite number of pages.)"
"Straightforward" in the eye of the prover does not equal "complete". This is one of those lemmas that someone familiar with typical arguments in the area would expect to be true but that is different from having a "complete proof".

This requirement is likely going to be the norm for the future and seems to be a positive step. It will of course reduce some of the macho "I did the whole thing the night before" in future but that isn't much of a loss.

Anonymous said...

A complete proof should include both the forest and the trees. Since you did not follow the submission guidelines, I am surprised that they let you modify your submission during the refereeing process. For fairness they should have rejected it.

Anonymous said...

I would be more motivated to read this post if you told us what the main result of the paper is.

Mihai said...

Paul, very good idea of describing the intuition as an upper bound.

Anon: I did not skip the proof of this lemma (that would be quite dangerous, as things in lower bounds are often tricky). The problem was that my description was too succinct for a reader who had not done this before — and of course, that it used words like "clearly" and "standard." As I always complain, if you write "X", the referee finds it an obvious step, but if you write "One immediately sees that X", the referee says this is not obvious and a proof is needed.

The requirement to include full proofs is a good rule of thumb (and I actively advocated for it in the FOCS committee). But you cannot treat it as absolute: the readers are not theorem checkers, and what is a good proof for one, is something undecyphrable for another.

The main benefit of the complete-proofs rule is that the steps are formally stated. And then, the paper becomes falsifiable — a big difference.

Anonymous said...

Proofs written like this (verbose style) causes mainstream probabilists to often throw up their arms and start doubting if TCS people know what they are doing. I think mathematical proofs should be succint, lay out all or most of the steps, and avoid talking about intuition etc. If the proof is long it should be broken into steps using lemmas and/or propositions.
If something regarding the idea (or ideas) behind the proof needs to be explained, it can be done under a separate heading or as "Remarks". Mixing these things is usually not a good idea and should be discouraged as bad practice. TCS authors often indulge in this bad practice. The resulting proofs become error prone, difficult to verify, and then become subject of criticisms from people like Neal Koblitz (just one example, but ask almost any mathematician who tried to read TCS papers).

rgrig said...

I think mathematical proofs should be succint, [...] and avoid talking about intuition

I think that's why non-mathematicians throw up their arms and wonder if mathematicians are humans or machines. :p I'm not a TCS person and I very much enjoy reading posts in this style, which I can follow.

@Mihai: I read this to see if I remember anything from my probability courses. I got stuck at the Wikipedia step. I know I shouldn't "agonize" over a "well-known" result :) but I couldn't figure out an easy way to compute sum((n choose k) lh (n choose k)) and that bothers me.

Also, I think I never knew what "probability concentration" is and google results are lacking in tutorial-style material on this subject. Any pointers?

And thanks for posting this.

rgrig said...

OK, here's a very informal argument for the first lemma. I put it here because you said you want to convey intuitions and I would have liked reading this before your proof.

Q_Δ is made out of k numbers in the range 0..n/k so it takes k lg n/k bits.
To encode (Q_0,Q_Δ) we can use 2k numbers in the range 0..n/2k for a total of 2k lg n/2k = 2k lg n/k - 2k bits.

We have 2(k lg n/k) - (2k lg n/k - 2k) = 2k fewer bits when we encode Q_0 together with Q_Δ.

Pat Morin said...


Here's a good survey paper about concentration inequalities:

of the kind Mihai is referring to.

Mihai said...

Anon, I hope you are joking, in which case the delivery is quite good. On the off chance that you are serious, you are very seriously mistaken. Highly technical writing without a crisp explanation of why the math should hold has turned out to be wrong often enough.

Rgrig, by concentration I almost always mean the Chernoff bound :) Other concentration results that I have used are Azuma, Johnson-Lindenstrauss, and various tricks with finite moments.

As for the Wikipedia step, I do not need to calculate the exact constant. All I need is that, as I go from m trials to 2m trials, the entropy increases by Omega(1), which is pretty natural if your entropy behaves like c*lg n. Thus, I was not particularly interesting in reworking the entropy of the binomial; Wikipedia is good enough.

Anonymous said...

On the off chance that you are serious, you are very seriously mistaken. Highly technical writing without a crisp explanation of why the math should hold has turned out to be wrong often enough.

No I was not joking and you misunderstand my point. Mathematical papers should have
Definitions, Theorems, Proofs, as well as other explanatory texts (serving as Introduction, Summary of the main ideas, even philosophy etc.). However, just as there should not be any irredundancy and loose words in the Definitions and statements of the Theorems, the same should ideally hold for the proofs as well.

Of course you are free to be as verbose as you like outside
\begin{proof} ... \end{proof}
but a capable mathematical reader should be free to ignore this verbosity and still be able to understand the proof.

rgrig said...

@Pat Morin: Thanks for the link!

@Mihai: I understand the "Wikipedia is good enough" point of view, but it bothers me not to know where the result comes from :) (And you said we should ask for help if we need it, right? :p)

I didn't try to get the constant. I tried to get h(n)=Θ(lg n). The part h(n)=O(lg n) is easy, since you can put the count in an int with lg n bits. For the part h(n)=Ω(lg n) I tried to show h(n+1)-h(n)=Ω(1/n). But the difference still looks horrible. It feels like there must be some easy way to get h(n)=Θ(lg n), doesn't it?

Mihai said...

rgrig, for the Ω(lg n) part, you remember that the binomial has mass Ω(1/sqrt(n)) on all values in n/2 ± sqrt(n). This follows from Sterling (but is a very useful thing to remember).

Mihai said...

Anon, I don't think we should take the intuition outside the proofs, since they do tend to get long at times... Since you claim a good understanding of what a competent mathematician does, let me add that the competent TCS person only reads the intuition, if at all :)

Once we switch to electronic publishing, we might require that all formal claims in a proof are typeset in black, and all intuition in blue. But it would be bad to separate them.

In any case, the typical problem is that proofs are unreadable for being too technical, not unreadable because of too many explanations :)

Anonymous said...

Actually, the anon has a good point. I won't agree that intuition should categorically be kept out of begin/end proof. But generally it should be kept separate from the technical steps. Acceptable solutions are: putting it all before or immediately after the begin{proof} (depending on which works better), using remarks, or having separate paragraphs of intuition within the proof. What does not work is mixing sentences of intuition with formal sentences inside the proof. This easily becomes very confusing.

rgrig said...

@Mihai: Thanks, I can follow that. (Late reply because I've been traveling.)

ConfusedGradStudent said...

Dear Mihai,

I've got a question about the Soda paper associated with this post.

Suppose we have a static data structure for rank queries that takes up n+r bits. Let w be the number of bits in a word.

Through a probe elimination lemma you prove that r.w^O(t) > n. When I carry out the calculations however, I get r . w^O(1) . factorial(t) > n, which is a substantially weaker bound for super-constant t.

You claim that P_{i+1} = k_i . O(w). However I think it should be t times that since Probes(Q_0) = k_i . t_i, where t_i is the number of probes in the induction.

I checked this with a couple of friends here but we couldnt get this. Your help would be greatly appreciated. Thanks

Anonymous said...

sorry, I meant
"however, I get r . w^O(t) . factorial(t) > n,"

i.e. your bound divided by t factorial.


Mihai said...

@ConfusedGradStudent: I am certainly happy that people are reading my papers :)

There is indeed a factor of t! (I think even a bit worse, i.e. 2^O(tlg t) with a bigger constant).

But consider that t=O(lg n) [for t=lg n, there is nothing to prove, as we have a solution with O(1) redundancy]

Also, w=Omega(lg n). Thus, w^O(t) is in fact the same as w^O(t) * t!

Anonymous said...

Oh, silly me.

Thanks a lot!