A difficult inequality and 2 proofs

In a previous post, we showed how the maximum likelihood estimator can be motivated by the fact that if a distribution belongs to some parametric model \Theta, the true parameter \theta_0 can be shown to solve a certain variational problem:

\theta_0 = \mathrm{argmin}_{\theta \in \Theta}\, D(f(x | \theta_0) || g(x | \theta)) = \mathrm{argmin}_{\theta \in \Theta}\, \mathbb E_{x \sim f(x | \theta_0)}[ g(x | \theta)]

The reason why MLE works (under suitable regularity conditions), then, is that the quantity to be minimized on the right hand side can be directly estimated in the data via the analogy principle. Actually, this idea of characterizing parameters as solutions to some variational problem is quite general, and one natural way to produce the corresponding variational problem is by measuring the distance between distributions via some quasipseudometric.

For now, let us restrict ourselves to distributions absolutely continuous with respect to some measure (i.e. distributions with densities). One quasimetric we have already considered for this setting is the KL-divergence used for MLE:

D(f || g) = \int f \log (f/g)

While this is a natural distance measure from an information theoretic perspective, other metrics may be more relevant for other settings. For example, in the L^2 setting often considered in the semi-parametric efficiency literature, it is often more fruitful to consider the Hellinger distance defined by:

H(f, g)^2 = \frac 12 \int (\sqrt f - \sqrt g)^2

On the other, if we care about more about L^1 theory, then we may be primarily interested in the total variation metric, given by

V(f,g) = \int |f - g|

Exercise 1: Show that the Hellinger and total variation metrics are true metrics by checking the axioms.

Once we have multiple choices of distance measures, a useful exercise is to compare the topologies they generate. In particular, we want to know if convergence in one metric automatically implies convergence in another. If it can be shown that for two metrics d_1, d_2, we have

d_1(x,y) \leq f(d_2(x,y))

for some f continuous at 0 such that f(0) = 0, then we know that convergence in d_2 implies convergence in d_1. Some of these comparisons are easy:

Exercise 2: Show that H^2(f,g) \leq V(f,g) (Hint: use Jensen’s inequality)

Others are harder, as evidenced by the fact that they are named after people:

Proposition 3 (Pinsker’s inequality): V(f,g) \leq \sqrt{2 D(f || g)}

The difficulty of this proof stems from the fact that both the absolute value and the logarithm showing up in these distance measures are reasonably nonlinear functions. In what follows, I present two proofs of Pinsker’s inequality. The first is from this textbook, where I first encountered it, and the second is my modification, which I find to be more pedagogically useful.

Proof 1: It suffices to consider the case where f > 0 \implies g > 0, since otherwise, the KL-divergence is 0. Let \psi(x) = x\log x - x + 1. Then one can check (exercise) that \psi(0) = \psi'(0) = 0 while \psi''(x) > 0 for x > 0. Additionally, one can check that

\left(\frac43 + \frac23 x\right)\psi(x) \geq (x - 1)^2

Now, we get

\int |f - g| \leq \int g |f/g - 1| \leq \int g \sqrt{(f/g - 1)^2} \leq \int \sqrt{g\left(\frac43 + \frac23\right) f/g}\sqrt{g \psi(f/g)}

\leq \sqrt{\int f\left(\frac43 + \frac23\right)} \sqrt{\int g\left(\frac43 + \frac23\right) f/g}\sqrt{g \psi(f/g)} = \sqrt{2 D(f || g)}

where the inequality in the second to last step is Cauchy-Schartz.

I don’t like the proof mainly because I find it too “authoritarian” in the sense that some higher being tells you that \psi(x) = x \log x - x + 1 is the right bound to consider, and in fact, \left(\frac43 + \frac23\right) \psi(x) \geq (x-1)^2. The proof then follows from this brilliant insight. I prefer the following, more bottom up version:

Proof 2: We make the observation that

\int |f - g| = \int g |f/g - 1| = \int g \sqrt{(f/g - 1)^2}

It’s not terribly unrealistic that a mere mortal can make this step, since at the end of the day, we want to somehow take care of the largest non-linearity in the definition of KL-divergence, the \log(f/g) term. Let’s begin with an incorrect proof to demonstrate how we would ideally like our argument to go from here:

\int g \sqrt{(f/g - 1)^2} = \int \sqrt{g \left(\frac{(f/g - 1)^2}{f/g \log(f/g)}\right) g f/g \log(f/g)} \leq \sqrt{\int g\left(\frac{(f/g - 1)^2}{f/g \log(f/g)}\right)}\sqrt{\int g f/g \log(f/g)}

The second factor, then, will just be the square root of KL divergence as desired, while the first term can be further analyzed to yield a constant bound. Unfortunately, this does not quite work. The issue is that the square root is not defined for negative quantities, so the Cauchy-Schartz step above is invalid. At this point, we use the fact that x\log x is nicely bounded from below, so we have hope of applying a shift S the whole function into being strictly positive. However, we also need the transformation to be such that \int g x \log x + S(x) = \int g x \log x. As we might have guessed from proof 1, the right choice of S(x) will be 1 - x. Another way to see why this is a reasonable choice is to notice that the only place where x \log x < 0 is where x is close to 0, so subtracting x is more than compensated for by adding 1 where it matters. So now, applying the same arguments as above, we have that

\int g \sqrt{(f/g - 1)^2} \leq \sqrt{\int g\left(\frac{(f/g - 1)^2}{f/g \log(f/g) - f/g + 1}\right)}\sqrt{\int g\left(f/g \log (f / g) - f/g + 1\right)}

\leq \sqrt{\int g\left(\frac{(f/g - 1)^2}{f/g \log(f/g) - f/g + 1}\right)} \sqrt{D(f || g)}

Now, to show that the integral above is bounded in a way that does not depend on f,g, it suffices to show that

Exercise 4: Show that \xi(x) = \frac{(x - 1)^2}{x \log(x) - x + 1} is concave

Now by concavity, it can be bounded above by its first order Taylor expansion, which gives

\int g \xi(f/g) \leq \int g\cdot (\xi(a) + (f/g - a) \xi'(a)) \equiv \int g \cdot (A + B f/g) = A + B

since f,g are both probability densities. Provided \xi(a), \xi'(a) are finite, so are A, B and we already have a bound of the form V(f,g) \leq C \sqrt{D(f || g)} for some constant C not depending on f,g. For most topological purposes, this is sufficient, and to get the optimal constants above, all that is required is that we optimize our choice of a in the above Taylor expansion argument.

Exercise 5: Optimize the point at which we expand around to get the constant of \sqrt 2 in the statement of Pinsker’s inequality.

Proofs in the textbook style above are efficient for demonstrating that a result is true without providing much information about how a result was discovered. It is entirely reasonable that textbooks tend to opt for the former style, since proofs of the second style can be much longer to write down.

It seems to me, however that there are two main places where we might want to opt for a style 2 proof. The first is in proving important, but nontrivial results. This is because it is my view that textbook authors have a responsibility to show that the important results of their field are not the result of some inaccessible insight, but the result of careful thinking combined with the hard work necessary to stumble upon the handful of difficult, but ultimately accessible, discoveries required to make progress. The second setting where I think it is helpful to write proofs in the style of proof 2 above is in more introductory texts. In these settings, taking the time to fully motivate all of the optimizations and tweaks required to make the proof work has the added benefit of demonstrating to the reader how to think about problems. For relative beginners, the marginal benefit of this is probably high enough to justify the extra word count.

A good binary measure of whether someone deeply understands a proof is whether they are able to turn a style 1 proof into a style 2 proof. This is why this blog is such a useful tool for me. It has served as an excellent instrument for forcing myself to get to this level of understanding. While it is sometimes too tempting to be lazy and just present a top down textbook style proof, when possible, I try to hold myself to the standard of writing proofs of the second style as much as possible here.

Leave a comment