Asymptotic Normality in Parametric Models

Over the past series of posts, I have described the standard theory for proving consistency in a large class of parametric models. While this is useful for showing that the estimators we dream up are actually doing the right thing, it’s still a bit unsatisfactory: so far, for any fixed dataset, we don’t have a good grasp on how close we should expect our estimators to be to the truth; we don’t yet have a satisfactory way to describe our uncertainty about our estimators.

At the outset, describing our “uncertainty” about a parameter seems like a hopelessly complex task. It looks like it should be such a multifaceted concept that we would need an extremely high dimensional (or even infinite dimensional) object to capture all of the relevant aspects. As it will turn out, however, in many of the well behaved settings used in real world applications, this problem of describing uncertainty will have so much regularity to it that it will essentially boil down to a single number for each parameter.

Before diving into the theory, let’s take a step back and think conceptually about what it would mean to quantify uncertainty. A good start might be that uncertainty is a measure of the probability that our estimator is wrong about the true parameter. This seems like a reasonable start, but it can’t be the full story. If our distribution is continuous, the probability that we are wrong for most estimators is 1. So this criterion is too strict: we need to define something a bit less demanding. In particular, one reasonable relaxation is to think of uncertainty as the answer to the query: “How likely is my estimator to be within \delta of the true value?”. We now have a fairly operational definition of uncertainty (it corresponds to something conceptually clear and precise). To actually compute it, we would need to know the distribution of our estimator \hat\theta, i.e. we need to be able to compute P(\hat\theta \in S) for well behaved sets S. Computing this value exactly is too difficult in all but the most stylized models, but as it turns out, we can compute this value approximately in many cases given enough data.

At this point, it may be obvious to readers that I am alluding to the typical asymptotic normality results that many statistical estimators enjoy. The fact that we’re talking about normality is a good hint that we will want to bring in a central limit theorem somehow, but it may not be obvious how we do this. In this post, I will walk through a sketch of how the typical argument goes.

Remark 1: I have been procrastinating writing this post for a while now largely because in writing an initial draft, it struck me that the theory for convergence in distribution contains a lot of technicalities that I had taken for granted. Furthermore, unlike most of what we’ve seen so far, I do not know of a satisfactorily elementary way to deal with these technicalities. There will therefore be more handwaving than usual in this post, but I will try to at least call out the exact points where the arguments sweep things under the rug.

The first idea that we will need for connecting the CLT to the estimators we’ve been describing so far is the following famous “device” of Cramer and Wold:

Theorem 2 (Cramer-Wold Theorem): A sequence of random vectors X_n = (X_{1n}\cdots X_{kn}) \in \mathbb R^k converges in distribution to X = (X_1\cdots X_k) if and only if for every fixed t \in \mathbb R^k, we have that t \cdot X_n \overset{d}\to t \cdot X.

One of the directions in the above result should be clear:

Exercise 3: Show that if a sequence of random vectors X_n = (X_{1n}\cdots X_{kn}) \in \mathbb R^k converges in distribution to X = (X_1\cdots X_k), then for every fixed t \in \mathbb R^k, we have that t \cdot X_n \overset{d}\to t \cdot X.

The reverse direction is both more surprising and more technical to prove. Essentially, it follows from analyzing the characteristic functions of X_n and X and using the Levy continuity theorem I alluded to in a previous post on the central limit theorem.

This theorem is incredibly powerful because it establishes that in proving asymptotic normality results, there is no loss of generality in focusing on the scalar case. Thus, for example, the usual i.i.d. central limit theorem immediately generalizes to arbitrary dimensions.

The next key idea we will need is that the continuous mapping theorem holds for convergence in distribution:

Theorem 4 (Continuous mapping in Euclidean space): If X_n and X are Borel measurable random vectors in \mathbb R^k with X_n \overset{d}\to X, and g:\mathbb R^k \to \mathbb R^p is continuous, then g(X_n) \overset{d}\to g(X).

Proof Sketch: The main technicality here is that we will use the portmanteau theorem to give a characterization of convergence in distribution that is easy to work with. In particular, it is necessary and sufficient to show that for any open set

U \subseteq \mathbb R^p, \liminf_{n\to\infty} P(g(X_n) \in U) \geq P(g(X) \in U).

But this is equivalent to

\liminf_{n\to\infty} P(X_n \in g^{-1}(U)) \geq P(X \in g^{-1}(U)).

But by the continuity of g, g^{-1}(U) is open in \mathbb R^k, so this inequality holds by the portmanteau theorem applied (in the other direction) to X_n \overset{d}\to X, and we are done.

Remark 5: The basic idea of this proof is entirely topological in nature and thus works in much more abstract situations, provided the random variables are measurable with respect to the Borel \sigma-algebra (which itself is a natural way to construct a measure given only a topological structure).

Remark 6: The portmanteau theorem invoked in the above sketch is a fundamental result and gives a nice characterization of the topology of convergence in distribution. In particular, together with the Riesz-Markov-Kakutani representation theorem, it shows that the topology of convergence in distribution is exactly equal to a certain weak topology from linear functional analysis, and thus, much of the machinery of locally convex vector spaces can be used to answer questions about convergence in distribution.

With the following observation, the continuous mapping theorem immediately implies Slutsky’s lemma (a result so useful that even University of Chicago economics undergrads learn its statement).

Exercise 7: Show that for some constant c, X_n \overset{p}\to c is equivalent to X_n \overset{d}\to X where P(X = c) = 1.

Exercise 8: Use the above to show Slutsky’s lemma: suppose X_n \overset{d}\to X and Y_n \overset{p} \to c. Then

  1. X_n + Y_n \overset{d} \to X + c
  2. Y_nX_n \overset{d} \to cX
  3. X_n / Y_n \overset{d} \to X / c provided that c \neq 0

(Hint: addition, multiplication, and division are all continuous binary operations)
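To see Slutsky’s lemma at work numerically, here is a small simulation sketch. It assumes i.i.d. Exponential(1) data (so the mean and variance are both 1), a choice made purely for illustration; the function names are hypothetical. The CLT gives \sqrt n(\bar X_n - 1) \overset{d}\to \mathcal N(0,1), the sample standard deviation converges in probability to 1, and so by part 3 the studentized mean should also be approximately standard normal for large n.

```python
import math
import random

def studentized_mean(rng, n):
    """sqrt(n) * (sample mean - 1) / sample std for n Exponential(1) draws."""
    xs = [rng.expovariate(1.0) for _ in range(n)]
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * (mean - 1.0) / math.sqrt(var)

def fraction_within(n=400, reps=2000, z=1.96, seed=0):
    """Share of simulated statistics landing in [-z, z]."""
    rng = random.Random(seed)
    hits = sum(abs(studentized_mean(rng, n)) <= z for _ in range(reps))
    return hits / reps
```

If the normal limit is a good approximation at this sample size, fraction_within() should land near 0.95, since a standard normal puts about 95% of its mass in [-1.96, 1.96].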

Remark 9: This is the same Slutsky as the Slutsky decomposition from consumer theory. One of my favorite pieces of math history is that Slutsky was a Russian mathematician who worked on mathematical economics up until about the time of the Bolshevik revolution, when he suddenly developed an interest in probability theory.

With all of this machinery in place, we are finally ready to derive the interesting asymptotic normality results we’ve been building up towards. So far, the asymptotic normality theory we’ve looked at has all been restricted to linear situations. The CLT is about sample averages, which is a linear way to summarize data, and the Cramer-Wold device was about dot products, which are also linear. The idea for how to extend this to nonlinear settings is that we want to approximate nonlinear objects by linear ones. The question then becomes: when is this approximation a “good” one? But this question of when linear approximations are locally good is basically the definition of the derivative. In particular, we have

Proposition 10: A function f is differentiable at a point c with gradient \nabla f(c) if and only if its first order Taylor expansion is good in the sense that

f(X) = f(c) + \nabla f(c)^T(X - c) + g(X - c)

with \lim_{t\to0}g(t)/\|t\| = 0, i.e. the remainder vanishes faster than linearly.

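If \sqrt n(X_n - \mu) \overset{d}\to \mathcal N(0, \Sigma) (which in particular forces X_n \overset{p}\to \mu), this with Slutsky’s lemma immediately gives us the usual delta method.

Theorem 11 (Delta method, part 1): If \sqrt n(X_n - \mu) \overset{d}\to \mathcal N(0,\Sigma) and f is continuously differentiable at \mu, then

\sqrt n\left(f(X_n) - f(\mu)\right) \overset{d}\to \mathcal N\left(0, \nabla f(\mu)^T \Sigma \nabla f(\mu)\right)

Proof: Multiply the Taylor expansion above (with c = \mu) by \sqrt n to get \sqrt n(f(X_n) - f(\mu)) = \nabla f(\mu)^T \sqrt n(X_n - \mu) + \sqrt n g(X_n - \mu). Because f is differentiable at \mu, the remainder satisfies g(t)/\|t\| \to 0 as t \to 0, so \sqrt n g(X_n - \mu) = \|\sqrt n(X_n - \mu)\| \cdot g(X_n - \mu)/\|X_n - \mu\| \overset{p}\to 0: the first factor converges in distribution, and the second converges in probability to 0 by continuous mapping for convergence in probability. The result is now immediate from continuous mapping for convergence in distribution and Slutsky’s lemma.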
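A quick simulation can make the delta method concrete. The sketch below works in the \sqrt n-scaled form: if \sqrt n(\bar X_n - \mu) is approximately \mathcal N(0, \sigma^2), then \sqrt n(f(\bar X_n) - f(\mu)) should be approximately \mathcal N(0, f'(\mu)^2\sigma^2). The choices f(x) = x^2 and Exponential(1) data (\mu = 1, \sigma = 1) are assumptions made only for illustration, and the function names are hypothetical.

```python
import math
import random

def scaled_error(rng, n):
    """sqrt(n) * (f(sample mean) - f(mu)) with f(x) = x**2, Exponential(1) data, mu = 1."""
    mean = sum(rng.expovariate(1.0) for _ in range(n)) / n
    return math.sqrt(n) * (mean ** 2 - 1.0)

def simulated_sd(n=400, reps=2000, seed=1):
    """Standard deviation of the scaled error across independent replications."""
    rng = random.Random(seed)
    draws = [scaled_error(rng, n) for _ in range(reps)]
    center = sum(draws) / reps
    return math.sqrt(sum((d - center) ** 2 for d in draws) / (reps - 1))

# Delta-method prediction for the standard deviation: |f'(mu)| * sigma = 2 * 1 = 2.
```

Calling simulated_sd() should give a value close to the predicted 2, up to Monte Carlo noise and a small finite-sample correction.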

Clearly, the above generalizes to the case where the range of f is multidimensional. When \mu is a critical point for f (i.e. a point of 0 gradient), the above result is not technically wrong, but sort of useless, since the limiting variance is 0. However, if f has more smoothness, we can still easily compute an asymptotic distribution in the case where f is univariate. In particular, we will assume that f admits a second order Taylor expansion:

f(x) = f(\mu) + \frac{1}{2} f''(\mu) (x - \mu)^2 + g(x - \mu)

where \lim_{t\to0}g(t)/t^2 = 0.

Exercise 12 (Delta method, part 2): Show that if the above Taylor expansion is valid, and \sqrt n(X_n - \mu) \overset{d}\to \mathcal N(0, \sigma^2), then n\left(f(X_n) - f(\mu)\right) \overset{d}\to \frac{\sigma^2 f''(\mu)}{2}\chi^2_1 where \chi_1^2 is the chi squared distribution with one degree of freedom. (Hint: Remember that the chi squared distribution with one degree of freedom is just the distribution of a squared standard normal random variable).
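The second order delta method can also be checked by simulation. The sketch below assumes the standard Taylor coefficient f''(\mu)/2 and scaling by n, so that n(f(\bar X_n) - f(\mu)) is approximately \frac{\sigma^2 f''(\mu)}{2}\chi^2_1. The choice f = \cos at the critical point \mu = 0 with \mathcal N(0, \sigma^2) data is purely illustrative, and the function names are hypothetical. Here f''(0) = -1, so the limit is -\frac{\sigma^2}{2}\chi^2_1, which is never positive and has mean -\sigma^2/2.

```python
import math
import random

def scaled_gap(rng, n, sigma):
    """n * (cos(sample mean) - cos(0)) for n draws from N(0, sigma**2)."""
    mean = sum(rng.gauss(0.0, sigma) for _ in range(n)) / n
    return n * (math.cos(mean) - 1.0)

def simulate(n=400, reps=2000, sigma=2.0, seed=2):
    """Independent replications of the scaled gap."""
    rng = random.Random(seed)
    return [scaled_gap(rng, n, sigma) for _ in range(reps)]

def average(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)
```

With sigma = 2, average(simulate()) should be near -2, and every simulated value is at most 0, matching the one-sided support of the limit.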

In particular, Theorem 11 now immediately gives us an asymptotic distribution for sufficiently well behaved method of moments estimators.

Corollary 13: Consider a method of moments estimator using moment conditions \mathbb E[g(X;\theta)] = 0, and suppose this moment system can be exactly solved: \hat\theta = h\left(\frac{1}{n} \sum_{i=1}^n X_i\right) for some continuously differentiable h, so that \theta = h(\mu) with \mu = \mathbb E[X_1]. Suppose additionally that the X_i are i.i.d. and that the variance-covariance matrix of X_1 exists and is given by \Sigma. Then

\sqrt n(\hat\theta - \theta) \overset{d}\to \mathcal N(0, D h(\mu) \Sigma D h(\mu)^T)

where D h(\mu) is the derivative (Jacobian matrix) of h at \mu.
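As a concrete instance of this corollary, take the exponential distribution with rate \theta (a model assumed here only for illustration). The moment condition \mathbb E[X] = 1/\theta is solved by \hat\theta = h(\bar X_n) with h(x) = 1/x. Then h'(\mu) = -\theta^2 and \mathbb V\mathrm{ar}(X) = 1/\theta^2, so the predicted asymptotic variance of \sqrt n(\hat\theta - \theta) is \theta^4 \cdot \theta^{-2} = \theta^2. The following sketch, with hypothetical function names, compares that prediction with a simulation.

```python
import math
import random

def rate_estimate(rng, n, theta):
    """Method of moments estimate of an exponential rate: 1 / sample mean."""
    mean = sum(rng.expovariate(theta) for _ in range(n)) / n
    return 1.0 / mean

def scaled_error_sd(n=400, reps=2000, theta=2.0, seed=3):
    """Standard deviation of sqrt(n) * (theta_hat - theta) across replications."""
    rng = random.Random(seed)
    errs = [math.sqrt(n) * (rate_estimate(rng, n, theta) - theta) for _ in range(reps)]
    center = sum(errs) / reps
    return math.sqrt(sum((e - center) ** 2 for e in errs) / (reps - 1))

# Predicted limit: standard deviation theta (here 2.0).
```

scaled_error_sd() should come out close to 2, the value of \theta used in the simulation.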

Finally, we turn to the case of M-estimators. Recall that an M-estimator attempts to optimize an objective function Q(\bar X_n, \theta), where \theta_0 optimizes the population objective function Q(\mathbb E[X], \theta). The idea here will be that we will use the first order condition of the optimization problem to represent the M-estimator as a term that satisfies the conditions of the delta method plus some asymptotically negligible estimation error. Assuming sufficient smoothness, we have that \nabla_\theta Q(\bar X_n, \hat\theta_n) = 0. Suppose, additionally, that \hat\theta_n \overset{p}\to \theta_0 (i.e. the estimator is consistent). Then we have (using yet another version of Taylor’s theorem)

\nabla_\theta Q(\bar X_n, \theta_0) = \nabla_{\theta\theta} Q(\bar X_n, \bar\theta_n)(\theta_0 - \hat\theta_n)

where \bar \theta_n is between \hat\theta_n and \theta_0. Therefore, we must have that \bar\theta_n \overset{p}\to \theta_0 since \hat\theta_n \overset{p}\to \theta_0. Assuming the Hessian is invertible, we can invert this equation to get

\sqrt n (\hat\theta_n - \theta_0) = -\nabla_{\theta\theta} Q(\bar X_n, \bar\theta_n)^{-1} \sqrt n\nabla_\theta Q(\bar X_n,\theta_0)

Assuming all the relevant moments exist, and Q is twice continuously differentiable, by continuous mapping for convergence in probability, we have that \nabla_{\theta\theta} Q(\bar X_n, \bar\theta_n)^{-1}\overset{p}\to \nabla_{\theta\theta} Q(\mathbb E[X], \theta_0)^{-1} \equiv H^{-1} (assuming H is invertible) since we have already assumed that \hat \theta_n is consistent. Furthermore, since the population first order condition gives \nabla_\theta Q(\mathbb E[X], \theta_0) = 0, the delta method yields \sqrt n\nabla_\theta Q(\bar X_n, \theta_0) \overset{d}\to \mathcal N(0,\Sigma) where \Sigma = \nabla_{\theta,\mathbb E[X]} Q(\mathbb E[X],\theta_0)^T \mathbb V\mathrm{ar}(X) \nabla_{\theta,\mathbb E[X]} Q(\mathbb E[X],\theta_0). Finally, putting everything together using Slutsky’s lemma, we have our desired end result:

Theorem 13 (Asymptotic normality for sufficiently “regular” M-estimators): If \hat \theta_n \overset{p}\to \theta_0, Q is twice continuously differentiable, and all relevant moments exist, then

\sqrt n (\hat\theta_n - \theta_0) \overset{d}\to \mathcal N(0, H^{-1} \Sigma H^{-1})

Remark 14: The “sandwich” form of the asymptotic variance arises because \mathbb V\mathrm{ar}(AX) = A\,\mathbb V\mathrm{ar}(X) A^T for a fixed matrix A, applied here with A = -H^{-1} (which is symmetric because H is a Hessian).
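As a sanity check on the sandwich formula, it may help to work through the simplest possible case. The quadratic objective below is an illustrative assumption, not an example taken from the series: it is maximized at the sample mean, so the M-estimator result should collapse back to the ordinary CLT.

```latex
\begin{align*}
Q(x, \theta) &= -\tfrac{1}{2}(\theta - x)^2
  && \text{(maximized at } \hat\theta_n = \bar X_n\text{)} \\
\nabla_\theta Q(x, \theta) &= x - \theta,
  \qquad \nabla_\theta Q(\mathbb E[X], \theta_0) = 0 \iff \theta_0 = \mathbb E[X] \\
H &= \nabla_{\theta\theta} Q(\mathbb E[X], \theta_0) = -1,
  \qquad \nabla_{\theta, x} Q(\mathbb E[X], \theta_0) = 1 \\
\Sigma &= 1 \cdot \mathbb V\mathrm{ar}(X) \cdot 1 = \mathbb V\mathrm{ar}(X) \\
H^{-1} \Sigma H^{-1} &= (-1)\,\mathbb V\mathrm{ar}(X)\,(-1) = \mathbb V\mathrm{ar}(X)
\end{align*}
```

So the theorem gives \sqrt n(\bar X_n - \mathbb E[X]) \overset{d}\to \mathcal N(0, \mathbb V\mathrm{ar}(X)), which is exactly the scalar CLT, as it should be.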

Exercise 15: Use the above result to formulate and prove conditions under which the maximum likelihood estimator is consistent and asymptotically normal. Use the information matrix equality to simplify the expression for the asymptotic variance.

Exercise 16: Suppose your likelihood model is incorrectly specified, but the smoothness and finite moment assumptions still hold. What does the MLE now converge to? What happens to the asymptotic variance under misspecification? (Hint: the asymptotic object the MLE converges to is related to an information projection)

When Q is not as smooth as we’d like in finite samples, it still may be possible to recover the results above, provided we have some asymptotic smoothness assumptions. Additionally, there are other topics that merit discussion such as asymptotic efficiency, hypothesis testing, estimation of semi-parametric models, etc. But this seems to be a good stopping point for an exposition. Finally, I should add the caveat that everything developed here has been asymptotic. This theory is relevant for large datasets but may be misleading when n is small. How small is too small? That depends on the particular application and will vary. The most common tool for gaining intuition about this is the simulation study. Practitioners should always keep this caveat in mind when using asymptotic theory to analyze their estimators.
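Here is a minimal sketch of what such a simulation study might look like. It assumes Bernoulli data with success probability p = 0.05 and the usual normal-approximation interval \hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n}. Both the model and the function names are choices made for illustration. The study estimates how often the nominal 95% interval actually contains p at different sample sizes.

```python
import math
import random

def wald_covers(rng, n, p, z=1.96):
    """Does p_hat +/- z * sqrt(p_hat * (1 - p_hat) / n) contain the true p?"""
    p_hat = sum(rng.random() < p for _ in range(n)) / n
    half_width = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
    return abs(p_hat - p) <= half_width

def coverage(n, p=0.05, reps=2000, seed=4):
    """Fraction of simulated datasets whose interval contains p."""
    rng = random.Random(seed)
    return sum(wald_covers(rng, n, p) for _ in range(reps)) / reps
```

Comparing coverage(20) with coverage(1000) shows the caveat in action: at n = 20 the interval collapses to a single point whenever no successes are observed, so coverage falls far below 95%, while at n = 1000 it should be close to the nominal level.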

Before finishing this series on the basics of asymptotic statistics, I wanted to reexamine my initial goal when I started writing as well as provide some good starting points for further reading. In my first post in this series, I mused a bit about how different an alien civilization’s statistics would be. My undergrad self would have probably answered “quite different”. This view is mostly driven by the fact that my first college statistics course talked a lot about philosophical disagreements like frequentist vs Bayesian frameworks for thinking about data, the (somewhat unnatural) interpretation of p-values, etc. However, this view probably emphasizes the differences in opinion too much and downplays the common threads that have been present throughout: that there is a lot of regularity in randomness, that samples are often sufficient to reasonably accurately characterize the population, and that by studying this regularity, we can construct approximately apples to apples comparisons between seemingly disparate phenomena. As with any other branch of mathematics, the specific language might look different across different histories, but the laws of probability themselves are inescapable.

Finally, here is a list of texts for further study (these are just the texts I’ve learned from, so it is by no means comprehensive):

  1. Billingsley’s Probability and Measure is a classic text (and where I first learned measure theory). It’s a bit dense, the proofs are not always the most elegant, and the notation can be quite un-modern at times. Nonetheless, it is completely accessible to anyone with an undergraduate real analysis background and some persistence.
  2. A more modern book is Rick Durrett’s Probability: Theory and Examples, although I have not read through much of it and sadly, the link to a pdf of the book I once had no longer exists.
  3. My first course in mathematical statistics was from Jun Shao’s Mathematical Statistics (unfortunately, there does not seem to be an online copy, and I had to borrow it from my university’s library over winter break). Aside from the chapter on nonparametric statistics, it should be accessible after working through a text like Billingsley. The earlier chapters especially feature good discussions about many non-asymptotic questions I have neglected here such as some basic theory for exponential families and a discussion of things like sufficient statistics.
  4. This series has been largely inspired by Newey and McFadden’s 1994 Handbook of Econometrics chapter on large sample estimation. This chapter presents quite general results that try to tie together as many things into the same framework as possible, which makes for a fairly difficult read (although it should be doable if the books above are mastered). Much of this series has been an attempt at reworking the main arguments presented there (as well as filling in some gaps) in a way that (hopefully) trades off a bit of generality for cleaner exposition.
  5. Measure theory isn’t all about probability – fully rigorous theories of integration usually take measure as a starting point. I have found no better textbook for introducing measure theory than Terence Tao’s (publicly available) lecture notes. His writing style tends to emphasize how even seemingly crazy formulas usually have some natural and intuitive grounding. I’ve found this incredibly useful for demystifying math and making it less intimidating.
