Compactness and its statistical uses

In a previous post, I discussed parametric M-estimators, and proved a consistency result for a large class of such estimators in finite dimensional settings. A lot of the heavy lifting in the proof was done by a compactness assumption. In this post, I will run through some of the basic theory about compact sets that one might encounter in a real analysis course (I was intending to wrap up my discussion of general parametric statistics theory by discussing asymptotic normality results, but I’ve been thinking a lot about compactness lately and wanted to write some thoughts down while they are still relatively fresh in my mind). The goal will be to use these foundations to both get a deeper understanding of why compactness is so important in parametric settings as well as how it hints at the strategies necessary for developing much of the classical theory of nonparametric regression. We begin with the familiar definition:

Definition 1: A subset K of a topological space X is compact if for every collection of open sets \{U_\alpha\}_{\alpha \in I} (where I is an arbitrary cardinality index set) satisfying K \subseteq \cup_{\alpha \in I} U_\alpha (we say in this case that \{U_\alpha\}_{\alpha \in I} is an open cover of K, there exist some finite sub-collection such that K \subseteq \cup_{i = 1}^n U_{\alpha_i} (we say that \{U_{\alpha_i}\} is a finite sub-cover).

The fact that the sub-cover here is finite provides a hint that compactness is really a property of “smallness” in some topological sense. Why is topological smallness useful in probability? The easiest answer is that it is often easier to estimate the probability of small objects. Let me make this more precise.

Recall the following two axioms of probability:

  1. If A \subseteq B, the P(A) \leq P(B) (monotonicity)
  2. If A_i are mutually disjoint, then P\left(\cup_{i=1}^nA_i\right) = \sum_{i=1}^nP(A_i) where n is possibly infinity (countable additivity)

Exercise 2: Use the above axioms to show that probabilities are countably sub additive: P\left(\cup_{i=1}^n A_i\right) \leq \sum_{i=1}^n P(A_i)

When n is finite, the above inequality can be especially useful.

Exercise 3: Let n < \infty and consider the array of events A_i^m where i = 1,\ldots n and m = 1,\ldots,. \lim_{m\to\infty} P(A_i^m) = 0 for each i individually if and only if \lim_{m\to\infty} P(\cup_{i=1}^n A_i^m) = 0.

The proof of the uniform law of large numbers uses this fact. Compactness allowed us to get a finite set of parameter values \theta_1, \theta_2,\ldots,\theta_m and bound P\left(\cup_{i=1}^m\{h(X,\theta_j) - \mathbb E[h(X,\theta_j)] > \delta\}\right). The above exercise shows that once we have reduced to this finite case, we lose no power in considering each of these probabilities separately.

The discussion above has still been entirely trivial. But what happens when n = \infty? This infinity is only mildly concerning. We are still working with a countable infinity, so we have another trick up our sleeves. The contents of exercise 3 are no longer exactly true, but we can rescue the only if direction by inserting a stronger hypothesis.

Exercise 4: With the notation of exercise 3, show that if \lim_{m\to\infty} P(A_i^m) = 0 for each i individually, and \sum_{i=1}^\infty P(A_i^m) < \infty for each m individually, then \lim_{m\to\infty} P(\cup_{i=1}^n A_i^m) = 0.

The above exercise hints that we are able to tolerate mild violations to the “smallness” that compactness gives us and still recover results of interest. As currently stated, however, the idea of a “mild” violation is vague and imprecise. This is only natural as the definition of compactness we are using is itself quite “qualitative” (an easy way to convince yourself that it’s qualitative is to recognize that the definition can be made purely in words: a set is compact if any open cover has a finite sub-cover). However, we will see that there is a more “quantitative” way to characterize compactness in metric spaces. First a set of lemmas:

Lemma 5: Let K be compact and let C be a closed subset of K. Then C is also compact.

Proof: Fix some open cover of C. Then this open cover with the addition of C^c is an open cover of K since the complement of a closed set is open. By compactness of K, it has a finite sub-cover. If C^c is in this sub-cover, it is clear that the remaining sets are a finite sub-cover of the original open cover of C. Otherwise, the sub-cover covers all of K \supseteq C and is a finite sub-cover of the original cover of C. In either case, we have found a finite sub-cover.

Lemma 6: If K is compact in a metric space, then K is closed.

Proof: It suffices to show that K^c is open. To this end, fix p \in K^c. For each point q \in K, let r_q = \frac12 d(p,q). Then consider the balls B_{r_q}(q) \cap B_{r_q}(p) = \emptyset. Observe that the B_{r_q}(q) are an open cover for K so there is some finite sub cover B_{r_{q_1}}(q_1),\ldots B_{r_{q_n}}(q_n). It is clear that \cap_{i=1}^n B_{r_{q_i}}(p) \subseteq K^c (since it is disjoint from the finite sub cover) and is open since the finite intersection of open sets is open. Since p was arbitrary, K^c is open.

Remark 7: The above proof generalizes exactly if we relax “metric space” to “Hausdorff”  by replacing the balls from the above proof with some more qualitative topological concepts. However, since our whole goal here is to develop a more quantitative theory, we will not pursue this generalization here.

Exercise 8: Let K be a compact subset of some metric space (X,d). Show that K also satisfies:

  1. For any \varepsilon > 0, there is a finite set of points x_1,\ldots,x_n such that K \subseteq \cup_{i=1}^n B_\varepsilon(x_i) (total boundedness)
  2. For any x_1,x_2,\ldots \subseteq K, there exists some subsequence x_{n_1},x_{n_2},\ldots that converges to a point in K: x_{n_i} \to x \in K (sequential compactness).

The set of points in total boundedness is sometimes called an \varepsilon-net (Hint: for 2, consider the closure of x_1,x_2,\ldots as a set. Since K is closed, this closure is a subset of K and hence compact by lemma 5. If you are stuck at this step, the proof of lemma 9 may provide some inspiration.)

Total boundedness is of particular interest for the purposes of this post because it highlights the quantitative side of compactness. For a fixed value of \varepsilon, there is some minimal n points required to approximate any point in K to \varepsilon-precision. Informally speaking, this optimum covering allows us to use \varepsilon balls to measure the size of K.

As it turns out, “closed and totally bounded” and “sequentially compact” are both equivalent characterizations of compactness in metric spaces. To see this, it will be useful to first show the equivalence between sequential compactness and closed and totally bounded first:

Lemma 9: Let K be subset of a metric space (X,d). Then the following are equivalent:

  1. K is sequentially compact
  2. K is closed and totally bounded

Proof: Let us first show 1) => 2). By the definition of sequential compactness, we have that any convergent sequence in K converges to a point in K (all subsequences of convergent sequences converge to the same point). Thus K satisfies one of the metric space definitions of closed. To show total boundedness, fix \varepsilon> 0 and construct a sequence of points in the following inductive manner: Pick any x_1 \in K. Having picked x_1,\ldots x_{n-1}, pick x_n to be such that d(x_n,x_i) > \varepsilon. If this cannot be done at any step, it is clear that x_1,\ldots,x_{n-1} can be taken to be the \varepsilon-net as in the definition of total boundedness. Suppose x_n can be found for every natural number n. We have just constructed an infinite sequence, so by sequential compactness, the sequence has a convergent (and hence Cauchy) subsequence. But then we can find x_i, x_j with d(x_i,x_j) < \varepsilon (in fact, we can find infinitely many such pairs!). But this contradicts the construction of our sequence, and hence, the process described above must terminate. Now, we prove 2) => 1). Consider a sequence x_1,x_2,\ldots. Because total boundedness gives us a finite \varepsilon-net for each fixed \varepsilon and our sequence is infinite, it must be the case that this sequence ends up in an \varepsilon radius around at least one of the points infinitely often. Let us begin by taking \varepsilon = 1. Let p_1 be the point such that x_1,x_2,\ldots is within radius 1 of p_1 infinitely often as discussed above and let x_{n_{11}},x_{n_{12}},\ldots be the subsequence obtained by restricting the original sequence to only the elements within radius 1. Observe that:

  1. B_1(p_1) \cap K \subseteq K is totally bounded (exercise)
  2. x_{n_{11}},x_{n_{12}},\ldots is a sequence within a totally bounded set

It is clear that for each m = 1,2,\ldots, we can apply the above procedure with \varepsilon = \frac1m. We now have a nested collection of subsequences:

x_{n_{11}}, x_{n_{12}},\ldots

x_{n_{21}}, x_{n_{22}},\ldots

and so on.

Exercise 10: Complete the proof. In particular, show

  1. A subset of a totally bounded set is totally bounded
  2. The sub-sequence x_{n_{mm}}, m = 1,2,\ldots is convergent.
  3. By closedness of K, conclude that this sequence in fact converges to an element of K.

Remark 11: The proof of 2) => 1) required a good deal more creativity than the reverse direction, but it boiled down to two standard tricks in analysis:

  1. Combining the finiteness from total boundedness with the infinite nature of a sequence allowed us to use an infinite version of the pigeonhole principle.
  2. Using 1, we were able to construct a nested (infinite) array of subsequences. We turned this array into a single subsequence by taking the diagonal elements of this array, a common trick pioneered by Georg Cantor.

Remark 12: Stepping back a bit from the technicalities, the equivalence above stems from the fact that both definitions in a sense restrict our “movement” around the set. Sequential compactness disallows any “escape” to infinity by requiring that eventually, our sequence has to “bump into itself” forever. Similarly, total boundedness restricts escape to infinity by the finiteness of the \varepsilon-nets.

The reason why we first showed equivalence of “closed and totally bounded” and “sequentially compact” is that it will actually be easier to show that “closed and totally bounded and sequentially compact” implies “compact”. We will need one more lemma to do the heavy lifting:

Lemma 13 (Lebesgue number theorem): Let K be sequentially compact. Let \{U_\alpha\}_{\alpha \in I} be an open cover. Then there exists \delta > 0  (called the Lebesgue number) such that every set of the form B_\delta(x) \cap K for x \in K is a subset of some U_\alpha.

Proof: By contradiction. Suppose no such \delta exists. Then we can find a sequence x_1,x_2,\ldots such that B_{\frac1n}(x_n) is not a subset of any U_\alpha. By sequential compactness, there exists some convergent subsequence x_{n_1}, x_{n_2},\ldots that converges to some x \in K. Then since \{U_\alpha\}_{\alpha \in I} is a cover, x \in U_\alpha for some \alpha. But since U_\alpha is open, we have in fact x \in B_{\varepsilon}(x) \subseteq U_\alpha. But now, choose k sufficiently large such that

  1. d(x,x_{n_k}) < \varepsilon / 2
  2. \frac1{n_k} < \varepsilon / 2

Then we have that B_{n_k}(x_{n_k}) \subseteq U_\alpha, which contradicts our choice of x_1,x_2,\ldots.

The Lebesgue number theorem tells us that for every open cover, for some \delta > 0 each set of the form B_\delta(x) for x \in K can be identified with a member of the open cover. This mapping allows us to use what we know about \varepsilon coverings from the definition of total boundedness with the arbitrary open covers from the definition of compactness.

Exercise 14: Finish the proof that the following are equivalent in metric spaces:

  1. K is compact
  2. K is sequentially compact
  3. K is closed and totally bounded

Exercise 15: Prove that in Euclidean space, totally bounded is equivalent to bounded (a set is bounded if it is contained in some ball of radius M) and hence deduce the Heine-Borel theorem: A subset of \mathbb R^d is compact if and only if it’s closed and bounded. (Hint: it may be easier to equip \mathbb R^d with the equivalent \ell^\infty, or sup norm: |x| = \max_{i=1,\ldots,d} x_i.)

What have we gained for all of this effort? With our original definition of compactness, we had a binary – a set is either compact or not compact. Now, we have a way to compare the “degree” of compactness using the  \varepsilon-nets in the definition of totally bounded. In particular, we make the following definitions.

Definition 16: For a metric space (X,d), the \varepsilon-covering number N(\varepsilon, d, X) is given by the minimum cardinality set x_1,\ldots, x_n such that \{B_\varepsilon(x_i)\}_{i=1}^n covers X. The metric entropy is defined as \log N(\varepsilon,d,X).

Remark 17: One may wonder what this “entropy” has to do with thermodynamic or information entropy. It turns out, this naming convention only primarily has to do with the presence of the logarithm. A followup question is why log shows up so often. David Tong has a nice answer to this question in the context of statistical Physics:

Why do we take the log in the definition? One reason is that it makes the numbers less silly.

Indeed, in statistics, we take asymptotes so much that our numbers must eventually get “silly” (both in the sense of getting sillily large and sillily small).

Remark 18: We can now compare the “complexity” or “size” of two compact sets by comparing their metric entropies. A deeper answer to why we take logs has to do with the fact that by taking logs, we get more precise control over the complexity of our sets. In nonparametric statistics, a natural way to construct consistent estimators in infinite dimensional parameter spaces is to use “sieve” estimation where for each finite n, we restrict our estimator to lie in some finite dimensional subspace and grow the dimension to infinity as n grows to infinity (think for example, of a non-linear regression where we grow the degree of the polynomial as we get more data). The rate at which we should grow this finite dimensional subspace lies on something of a knife’s edge – grow the complexity too quickly and you might always overfit too much to even get consistency. Grow it too slowly and you will suffer from high approximation error and converge at suboptimal rates. Taking the logarithm turns out to give estimates of error that are sufficiently precise to make this tradeoff between complexity and approximation error optimally.

Remark 19: As a bit of a teaser, we can be even more specific about why the logarithm is the right transformation to make our analysis of error precise. A lot of non-parametric theory assumes that the random variables we’re dealing with are “sub-gaussian” in the sense that \mathbb P(|X - \mathbb E[X]| > t) \leq A\exp(-vt^2) or at least “sub-exponential” in the sense that \mathbb P(|X - \mathbb E[X]| > t) \leq A\exp(-vt) for constants A,v not depending on t (such distributions have quickly dying tails). Essentially, the logarithm comes in because it is the inverse of the exponential in these probability estimates.

Not only does this notion of metric entropy allow us to quantify the size of compact sets precisely, it also gives us a strategy to quantitatively characterize how “close” non-compact sets are to being compact. For example, if a set S is non-compact, but \sigma-compact (i.e. it is contained in the countable union of compact sets), then the estimates on the metric entropy of each set in the union can give us an idea of how well behaved we can expect the limit to be. This is essentially what the sieve estimators in remark 18 will try to do. But to do this, we will need to develop some theory to help us estimate the metric entropy of sets of interest.

Leave a comment