In a previous post, I discussed parametric M-estimators, and proved a consistency result for a large class of such estimators in finite dimensional settings. A lot of the heavy lifting in the proof was done by a compactness assumption. In this post, I will run through some of the basic theory about compact sets that one might encounter in a real analysis course (I was intending to wrap up my discussion of general parametric statistics theory by discussing asymptotic normality results, but I’ve been thinking a lot about compactness lately and wanted to write some thoughts down while they are still relatively fresh in my mind). The goal will be to use these foundations to both get a deeper understanding of why compactness is so important in parametric settings as well as how it hints at the strategies necessary for developing much of the classical theory of nonparametric regression. We begin with the familiar definition:
Definition 1: A subset of a topological space
is compact if for every collection of open sets
(where
is an arbitrary cardinality index set) satisfying
(we say in this case that
is an open cover of
, there exist some finite sub-collection such that
(we say that
is a finite sub-cover).
The fact that the sub-cover here is finite provides a hint that compactness is really a property of “smallness” in some topological sense. Why is topological smallness useful in probability? The easiest answer is that it is often easier to estimate the probability of small objects. Let me make this more precise.
Recall the following two axioms of probability:
- If
, the
(monotonicity)
- If
are mutually disjoint, then
where
is possibly infinity (countable additivity)
Exercise 2: Use the above axioms to show that probabilities are countably sub additive:
When is finite, the above inequality can be especially useful.
Exercise 3: Let and consider the array of events
where
and
.
for each
individually if and only if
.
The proof of the uniform law of large numbers uses this fact. Compactness allowed us to get a finite set of parameter values and bound
. The above exercise shows that once we have reduced to this finite case, we lose no power in considering each of these probabilities separately.
The discussion above has still been entirely trivial. But what happens when ? This infinity is only mildly concerning. We are still working with a countable infinity, so we have another trick up our sleeves. The contents of exercise 3 are no longer exactly true, but we can rescue the only if direction by inserting a stronger hypothesis.
Exercise 4: With the notation of exercise 3, show that if for each
individually, and
for each
individually, then
.
The above exercise hints that we are able to tolerate mild violations to the “smallness” that compactness gives us and still recover results of interest. As currently stated, however, the idea of a “mild” violation is vague and imprecise. This is only natural as the definition of compactness we are using is itself quite “qualitative” (an easy way to convince yourself that it’s qualitative is to recognize that the definition can be made purely in words: a set is compact if any open cover has a finite sub-cover). However, we will see that there is a more “quantitative” way to characterize compactness in metric spaces. First a set of lemmas:
Lemma 5: Let be compact and let
be a closed subset of
. Then
is also compact.
Proof: Fix some open cover of . Then this open cover with the addition of
is an open cover of
since the complement of a closed set is open. By compactness of
, it has a finite sub-cover. If
is in this sub-cover, it is clear that the remaining sets are a finite sub-cover of the original open cover of
. Otherwise, the sub-cover covers all of
and is a finite sub-cover of the original cover of
. In either case, we have found a finite sub-cover.
Lemma 6: If is compact in a metric space, then
is closed.
Proof: It suffices to show that is open. To this end, fix
. For each point
, let
. Then consider the balls
. Observe that the
are an open cover for
so there is some finite sub cover
. It is clear that
(since it is disjoint from the finite sub cover) and is open since the finite intersection of open sets is open. Since
was arbitrary,
is open.
Remark 7: The above proof generalizes exactly if we relax “metric space” to “Hausdorff” by replacing the balls from the above proof with some more qualitative topological concepts. However, since our whole goal here is to develop a more quantitative theory, we will not pursue this generalization here.
Exercise 8: Let be a compact subset of some metric space
. Show that
also satisfies:
- For any
, there is a finite set of points
such that
(total boundedness)
- For any
, there exists some subsequence
that converges to a point in
:
(sequential compactness).
The set of points in total boundedness is sometimes called an -net (Hint: for 2, consider the closure of
as a set. Since
is closed, this closure is a subset of
and hence compact by lemma 5. If you are stuck at this step, the proof of lemma 9 may provide some inspiration.)
Total boundedness is of particular interest for the purposes of this post because it highlights the quantitative side of compactness. For a fixed value of , there is some minimal
points required to approximate any point in
to
-precision. Informally speaking, this optimum covering allows us to use
balls to measure the size of
.
As it turns out, “closed and totally bounded” and “sequentially compact” are both equivalent characterizations of compactness in metric spaces. To see this, it will be useful to first show the equivalence between sequential compactness and closed and totally bounded first:
Lemma 9: Let be subset of a metric space
. Then the following are equivalent:
is sequentially compact
is closed and totally bounded
Proof: Let us first show 1) => 2). By the definition of sequential compactness, we have that any convergent sequence in converges to a point in
(all subsequences of convergent sequences converge to the same point). Thus
satisfies one of the metric space definitions of closed. To show total boundedness, fix
and construct a sequence of points in the following inductive manner: Pick any
. Having picked
, pick
to be such that
. If this cannot be done at any step, it is clear that
can be taken to be the
-net as in the definition of total boundedness. Suppose
can be found for every natural number
. We have just constructed an infinite sequence, so by sequential compactness, the sequence has a convergent (and hence Cauchy) subsequence. But then we can find
with
(in fact, we can find infinitely many such pairs!). But this contradicts the construction of our sequence, and hence, the process described above must terminate. Now, we prove 2) => 1). Consider a sequence
. Because total boundedness gives us a finite
-net for each fixed
and our sequence is infinite, it must be the case that this sequence ends up in an
radius around at least one of the points infinitely often. Let us begin by taking
. Let
be the point such that
is within radius 1 of
infinitely often as discussed above and let
be the subsequence obtained by restricting the original sequence to only the elements within radius 1. Observe that:
is totally bounded (exercise)
is a sequence within a totally bounded set
It is clear that for each , we can apply the above procedure with
. We now have a nested collection of subsequences:
and so on.
Exercise 10: Complete the proof. In particular, show
- A subset of a totally bounded set is totally bounded
- The sub-sequence
,
is convergent.
- By closedness of
, conclude that this sequence in fact converges to an element of
.
Remark 11: The proof of 2) => 1) required a good deal more creativity than the reverse direction, but it boiled down to two standard tricks in analysis:
- Combining the finiteness from total boundedness with the infinite nature of a sequence allowed us to use an infinite version of the pigeonhole principle.
- Using 1, we were able to construct a nested (infinite) array of subsequences. We turned this array into a single subsequence by taking the diagonal elements of this array, a common trick pioneered by Georg Cantor.
Remark 12: Stepping back a bit from the technicalities, the equivalence above stems from the fact that both definitions in a sense restrict our “movement” around the set. Sequential compactness disallows any “escape” to infinity by requiring that eventually, our sequence has to “bump into itself” forever. Similarly, total boundedness restricts escape to infinity by the finiteness of the -nets.
The reason why we first showed equivalence of “closed and totally bounded” and “sequentially compact” is that it will actually be easier to show that “closed and totally bounded and sequentially compact” implies “compact”. We will need one more lemma to do the heavy lifting:
Lemma 13 (Lebesgue number theorem): Let be sequentially compact. Let
be an open cover. Then there exists
(called the Lebesgue number) such that every set of the form
for
is a subset of some
.
Proof: By contradiction. Suppose no such exists. Then we can find a sequence
such that
is not a subset of any
. By sequential compactness, there exists some convergent subsequence
that converges to some
. Then since
is a cover,
for some
. But since
is open, we have in fact
. But now, choose
sufficiently large such that
Then we have that , which contradicts our choice of
.
The Lebesgue number theorem tells us that for every open cover, for some each set of the form
for
can be identified with a member of the open cover. This mapping allows us to use what we know about
coverings from the definition of total boundedness with the arbitrary open covers from the definition of compactness.
Exercise 14: Finish the proof that the following are equivalent in metric spaces:
is compact
is sequentially compact
is closed and totally bounded
Exercise 15: Prove that in Euclidean space, totally bounded is equivalent to bounded (a set is bounded if it is contained in some ball of radius ) and hence deduce the Heine-Borel theorem: A subset of
is compact if and only if it’s closed and bounded. (Hint: it may be easier to equip
with the equivalent
, or sup norm:
.)
What have we gained for all of this effort? With our original definition of compactness, we had a binary – a set is either compact or not compact. Now, we have a way to compare the “degree” of compactness using the -nets in the definition of totally bounded. In particular, we make the following definitions.
Definition 16: For a metric space , the
-covering number
is given by the minimum cardinality set
such that
covers
. The metric entropy is defined as
.
Remark 17: One may wonder what this “entropy” has to do with thermodynamic or information entropy. It turns out, this naming convention only primarily has to do with the presence of the logarithm. A followup question is why log shows up so often. David Tong has a nice answer to this question in the context of statistical Physics:
Why do we take the log in the definition? One reason is that it makes the numbers less silly.
Indeed, in statistics, we take asymptotes so much that our numbers must eventually get “silly” (both in the sense of getting sillily large and sillily small).
Remark 18: We can now compare the “complexity” or “size” of two compact sets by comparing their metric entropies. A deeper answer to why we take logs has to do with the fact that by taking logs, we get more precise control over the complexity of our sets. In nonparametric statistics, a natural way to construct consistent estimators in infinite dimensional parameter spaces is to use “sieve” estimation where for each finite , we restrict our estimator to lie in some finite dimensional subspace and grow the dimension to infinity as
grows to infinity (think for example, of a non-linear regression where we grow the degree of the polynomial as we get more data). The rate at which we should grow this finite dimensional subspace lies on something of a knife’s edge – grow the complexity too quickly and you might always overfit too much to even get consistency. Grow it too slowly and you will suffer from high approximation error and converge at suboptimal rates. Taking the logarithm turns out to give estimates of error that are sufficiently precise to make this tradeoff between complexity and approximation error optimally.
Remark 19: As a bit of a teaser, we can be even more specific about why the logarithm is the right transformation to make our analysis of error precise. A lot of non-parametric theory assumes that the random variables we’re dealing with are “sub-gaussian” in the sense that or at least “sub-exponential” in the sense that
for constants
not depending on
(such distributions have quickly dying tails). Essentially, the logarithm comes in because it is the inverse of the exponential in these probability estimates.
Not only does this notion of metric entropy allow us to quantify the size of compact sets precisely, it also gives us a strategy to quantitatively characterize how “close” non-compact sets are to being compact. For example, if a set is non-compact, but
-compact (i.e. it is contained in the countable union of compact sets), then the estimates on the metric entropy of each set in the union can give us an idea of how well behaved we can expect the limit to be. This is essentially what the sieve estimators in remark 18 will try to do. But to do this, we will need to develop some theory to help us estimate the metric entropy of sets of interest.