Some probability theory

A popular musing about math is to ask whether an advanced alien civilization would have the same mathematics as us. A common answer is that while the language and symbols would almost certainly be different, the most important underlying concepts would be isomorphic.

A quip I had after my first undergrad statistics class is that the real question we should be asking is whether an alien civilization would have a statistics that is even remotely similar to ours. The frustration I was getting at here is the fact that so much of what we learned seemed arbitrary and just based on the modeling decisions made by one or two statisticians. For example, would an alien civilization favor Bayesian or frequentist statistics? Maybe neither?

My current view is that these sorts of questions distract from a core of ideas so natural that if anyone who believes that aliens have isomorphic mathematics, would have to conclude that they have at least nearly isomorphic statistics. In the next series of posts, I hope to convince readers of this position. I will probably fail at this, but maybe along the way, we can learn some statistics.

The core idea behind much of statistics is that when a phenomenon is sufficiently regular (both in the sense of occurring over and again and in the sense of being “well behaved” in a technical sense), collecting more and more data should allow us to understand it arbitrarily well. The key way that this intuition gets formalized is by studying the “convergence” behavior of various random processes. We already saw this to an extent in the form of the central limit theorem. In that case, the the phenomenon we wanted to study was the mean of a distribution. The regularity was that we assumed that we could get data from the same distribution over and again, and that the distribution at least had finite variance. The convergence we obtained was a concept known as convergence in distribution. In one sense, the result is remarkably general. We had to make relatively few assumptions about the underlying distribution to get a very precise result. In another sense, it seems remarkably weak. The only phenomenon it allows us to study right now is expectations, and it requires us to adopt a certain view of convergence.

We will begin by studying convergence in more detail. To do so, we will need a more precise language. I will spend this post walking through some of the basics of how mathematicians formalize the intuitive notion of probability. To do this properly, we need to first introduce the concept of the \sigma-field:

Definition 1: Let \Omega be some set. A \sigma-field, \mathcal F is a collection of subsets of \Omega such that

  1. \Omega \in \mathcal F
  2. If F \in \mathcal F, then F^c \equiv \Omega \backslash F \in \mathcal F (closure under complements).
  3. If F_1,F_2,\ldots \in \mathcal F, then \cup_{i=1}^\infty F_i \in \mathcal F (closure under countable union).

The backslash here is set difference. Such a concept seems unnatural, but is needed in some form. Intuitively, \sigma-fields correspond to our definition of sets that are well behaved enough to warrant our study. Studying every possible set will turn out to be too difficult since it will in general lead to paradoxes. We therefore will need to restrict our study somewhat, but hopefully as little as possible. In particular, 2 and 3 guarantee that we can do logic on whatever collection \mathcal F we restrict our attention to since set complements corresponds to logical not, and set unions correspond to logical or. Using De Morgan’s laws, “or” and “not” are sufficient to construct other basic logical operations. But given this, one might ask why we require countable union, and not just finite union. This is largely motivated by the fact that it will allow us just enough structure to ensure that we will be able to study sequences we construct. Note that countable additivity is understood to include finite additivity as a special case (just set F_i = \emptyset for all sufficiently large i). Sets F\in\mathcal F will be called measurable which is suggestive of the interpretation that \mathcal F is our model of things that are well behaved enough for us to study. We can now define a probability space.

Definition 2: A probability space is a triple (\Omega, \mathcal F, \mathbb P) where \mathcal F is a \sigma-field on \Omega, and \mathbb P: \mathcal F \to [0,1] is a function that takes a measurable set to a probability satisfying:

  1. \mathbb P(\Omega) = 1.
  2. If F_1,F_2,\ldots are mutually disjoint (i.e. F_i \cap F_j = \emptyset for i \neq j), then \sum_{i=1}^\infty \mathbb P(F_i) = \mathbb P\left(\cup_{i=1}^\infty F_i\right)

The elements F \in \mathcal F are called events. 1 corresponds to the idea that something from \Omega has to have happened. This tells us that we have to design our probability spaces to model all possible outcomes.  2 captures an intuition we have about how probabilities should behave. If I roll a 6 sided die, the probability I roll a 1 or a 2 is the probability that I roll a 1 plus the probability I roll a 2. At this point, it should be noted how important our definition of \mathcal F was. It exactly corresponds to what was needed to to ensure the operations used in defining \mathbb P are well defined. I will adopt the convention that \omega is a generic element of \Omega As it stands, this definition seems sort of barren. We are missing a lot of other familiar facts that all probabilities should satisfy. The following should assuage these concerns:

Exercise 3: Show that probability spaces satisfy the following:

  1. \mathbb P(\emptyset) = 0 (null empty set)
  2. \mathbb P(F^c) = 1 - \mathbb P(F) (probability of complements)
  3. If F' \subseteq F, then \mathbb P(F') \leq \mathbb P(F) (monotonicity)
  4. For arbitrary F_1,F_2,\ldots\in\mathcal F, \mathbb P(\cup_{i=1}^\infty F_i) \leq \sum_{i=1}^\infty \mathbb P(F_i) (union bound)

Another useful property of \sigma-fields is that they are monotone in the following sense:

Lemma 4: Let F_1,F_2,\ldots be such that F_k \subset F_{k+1} for all k. Then \mathbb P(F_k) \to \mathbb P(\cup_{k=1}^\infty F_k).

Proof: The idea is to study related sets where we can use countable additivity. To that effect, construct F_k' = F_k \backslash F_{k-1}. It is clear that \cup_{k=1}^n F_k' = \cup_{k=1}^n F_k, and \cup_{k=1}^\infty F_k' = \cup_{k=1}^\infty F_k. Additionally, since the F_k are nested, we have that all the F_k' are disjoint. Then

\mathbb P(F_k) = \mathbb P(\cup_{k=1}^n F_k') = \sum_{k=1}^n \mathbb P(F_k') \to \sum_{k=1}^\infty P(F_k') = \mathbb P(\cup_{k=1}^\infty F_k') = \mathbb P(\cup_{k=1}^\infty F_k).

Exercise 5: The above result is called continuity from below. Prove continuity from above: If F_1,F_2,\ldots is such that F_{k+1} \subset F_k, then \mathbb P(F_k) \to \mathbb P(\cap_{k=1}^\infty F_k). (Hint: Use De Morgan’s laws to reduce this to the assumptions of lemma 4)

For the purposes of many results, it will be no loss to consider settings where some desired property holds for every \omega \in \Omega except for some exceptional set N such that \mathbb P(N) = 0. When a property satisfies this, we say that it holds almost surely or for almost all \omega.

At this point, textbooks about probability theory will walk through chapters of results on constructing probability spaces. I will spare you the gory details here. As an antidote to the abstractness here, I will give an example to play around with.

Example 6: Let \Omega = [0,1]. Let \mathcal F be the collection of all subsets of [0,1] that can be constructed using countable unions and intersections of the open subsets of [0,1]. Let \mathbb P((a,b)) = b - a and  \mathbb P(A) = \inf\{\mathbb P(F): F \in \mathcal F, F = \cap_{i=1}^m \cup_{j=1}^n (a_{ij},a_{ij})\}. This is the uniform measure on [0,1] and may be recognized as corresponding to the uniform distribution on [0,1].

Exercise 7: Show that the above is a probability space. In addition, show that every countable subset of [0,1] is in \mathcal F and in fact, has probability 0. Conclude that the probability of drawing a rational number from [0,1] is 0.

Given a probability space, a function f: \Omega \to \mathbb R is measurable if f^{-1}(U) \in \mathcal F (i.e. the preimage of U under f is measurable) for each open set U \subset \mathbb R. So measurable functions are just those functions that play well with both the topology of the real number line and the measurable sets at the same time.

Definition 8: A random variable X is any measurable function X: \Omega \to \mathbb R.

For an arbitrary topology, the Borel \sigma-field is the \sigma-field comprising all sets that can be constructed as countable unions and intersections of open and closed sets (show that this is indeed a \sigma-field). Sets in this \sigma-field are called Borel measurable, or Borel sets. If B is Borel measurable, then we will say that \mathbb P(X \in B) = \mathbb P(X^{-1}(B)). Hopefully, it is now immediately obvious why we defined random variables in this way: it is precisely what was needed to ensure that probabilities of the most well behaved sets (in topology, the open sets are intuitively the ones that are the most normal) are well defined. Thinking about random variables as functions is useful because it allows us to distinguish between the random variable and its realization. The random variable itself is the entire function and associated probability of taking on various values while the realization, X(\omega) is the evaluation of X at a particular \omega \in \Omega. This single distinction is one of the main reasons for the long detour through technical definitions. Usually, thinking about whether functions or sets are measurable is not needed for everyday users of probability. It’s amazingly difficult to construct examples of non-measurable objects. However, the distinction between random variables and realizations will be useful, and hence, I would like to make sure readers at least understand it at a conceptual level. The following result is trivial (in the precise sense of following straight from definitions) if you use the right definition of continuity. Showing it is a good check of if you grasp the definition sufficiently well.

Exercise 9: Show that if X_1,X_2,\ldots,X_k are random variables, and g:\mathbb R^k \to \mathbb R is continuous, then g(X_1,\ldots,X_k) is a random variable.

Often, we can ignore the underlying probability space altogether and just look at how the random variable itself behaves. We call the distribution of X the mapping U\mapsto \mathbb P(X \in U) for each Borel set U. In particular, it suffices to know the familiar c.d.f. F_X(x) = \mathbb P(X \leq x). I am now able to define two of the most useful concepts of convergence:

Definition 10:

  1. A sequence of random variables X_n converges in probability to X (denoted X_n \overset p\to X) if for every \varepsilon,\delta, there exists N such that for all n \geq N, \mathbb P(|X_n - X| < \delta) < \varepsilon (in light of exercise 7, |X_n - X| is a random variable. The set \{x: x < \delta\}\subset \mathbb R is open and hence the above probability makes sense).
  2. A sequence of random variables X_n converges almost surely to X (denoted X_n \overset{a.s.}\to X if \mathbb P(X_n \to X) = 1 (note here the usefulness of this concept of realization. More rigorously, the probability on the left is \mathbb P(\{\omega\in \Omega: X_n(\omega) \to X(\omega)\}). The definition thus presupposes the measurability of this set).

Lemma 11: X_n \overset{a.s.}\to X \implies X_n\overset p\to X.

Proof: Let A_k = \cup_{n \geq k} \{\omega: |X_n(\omega) - X(\omega)| > \varepsilon\}. Clearly A_{k+1} \subset A_k. Define A_\infty = \cap_{k=1}^\infty A_k. Translating the set notation into logic, A_\infty is the set of \omega where |X_n(\omega) - X(\omega)| > \varepsilon infinitely often. It is clear that X_n(\omega) \not\to X(\omega) if this is true, hence A_\infty \subset N = \{\omega: X_n(\omega) \not\to X(\omega)\}. By monotonicity and the definition of almost sure convergence, this means \mathbb P(A_\infty) = 0. Using continuity from above, we have \mathbb P(A_k) \to \mathbb P(A_\infty) = 0. Finally, note that \{\omega: |X_k(\omega) - X(\omega)| > \varepsilon\} \subset A_k. Thus, using monotonicity one more time \mathbb P(|X_k(\omega) - X(\omega)| > \varepsilon) \leq \mathbb P(A_k) \to 0, which is exactly the definition of convergence in probability.

A natural question to ask is whether or not the opposite implication holds. The following exercise asks you to show that indeed, it does not.

Exercise 12 (Typewriter sequence): Consider again, the uniform probability measure. Let

X_{m,n}(\omega) = \begin{cases} 1 & \omega < \frac mn \\ 0 & \text{otherwise}\end{cases}

for m = 1,2,\ldots,n. Show that X_{m,n} \overset p\to \mathbf 1 along the path n\to\infty, m\to n for each n (where \mathbf 0 here is understood to be the constant function that maps everything to 0), but X_n \overset{a.s}{\not\to} \mathbf 0

The definition of almost sure convergence is not easy to work with. It asks about the probability of some sequence of random variables converging. Convergence and probability are difficult enough concepts on their own, so thinking about both at the same time can be daunting. The following result is the most common strategy for showing convergence almost surely:

Lemma 13 (First Borel-Cantelli Lemma): Let A_1,A_2,\ldots be events in a probability space. Suppose \sum_{k=1}^n \mathbb P(A_k) < \infty. Then \mathbb P(\cap_{n=1}^\infty\cup_{k \geq n} A_k) = 0.

Proof: An equivalent condition to \sum_{k=1}^n \mathbb P(A_k) < \infty is that \sum_{k\geq n} \mathbb P(A_k) \to 0. Using continuity from above, we have P(\cap_{n=1}^\infty\cup_{k\geq n} A_k) = \lim_{n\to\infty} P(\cup_{k\geq n} A_k). But now, using countable subadditivity, P(\cup_{k\geq n} A_k \leq \sum_{k\geq n} \to 0. Putting these together gives us exactly that \mathbb P(\cap_{n=1}^\infty\cup_{k \geq n} A_k) = 0.

To see why this is exactly what is needed to show convergence almost surely, notice that taking the A_k to be as in the proof of Lemma 11 and \cap_{n=1}^\infty\cup_{k \geq n} A_k \equiv A_\infty, convergence almost surely is exactly the requirement that \mathbb P(A_\infty) = 0. The following exercise can be used to formalize the famous idea that “a monkey hitting random typewriter keys for an infinite amount of time will eventually write the complete works of William Shakespeare” (as well as the complete works of William Shakespeare with a single typo, etc).

Exercise 14 (Second Borel-Cantelli Lemma): Events A_1,A_2,\ldots are called independent if \mathbb P(\cup_{k=1}^\infty A_k) = \prod_{k=1}^\infty \mathbb P(A_k) (check that in the finite case, this corresponds to the intuition we are used to). Prove the following partial converse to the Borel-Cantelli lemma: Suppose \sum_{k=1}^\infty \mathbb P(A_k) = \infty for independent events A_1,\ldots. Then \mathbb P\left(\left(\cap_{n=1}^\infty\cup_{k\geq n} A_k\right)^c\right) = 0. (Hint: Use De Morgan’s law to convert \cap_{n=1}^\infty\cup_{k\geq n} A_k to a similar statement about A_k^c. Use the definition of independence to get a statement about a certain product of probabilities. Next, use the fact that 1 - x \leq e^{-x} for all x. This fact will help turn a product into a sum. Conclude by using the assumption we made about the sum.)

istock-18586699-monkey-computer_brick-16e5064d3378a14e0e4c2da08857efe03c04695e-s800-c85
My roommate said I should include more figures so here you go: a modern monkey at a modern typewriter.

I will close out this post by defining expected value, as this is the operation under which many of the standard convergence results will hold. To begin, let us consider the intuitive definition of expected value, or mean, in the case when X only takes on finitely many values. Suppose, for instance, I flip a coin with probability p of landing heads and probability 1- p of landing tails. Let X = 1 if it lands heads and X = -1 if it lands tails. Then intuitively, we want to define expected value so that the mean of X is p \cdot 1 + (1-p) \cdot -1 = 2p - 1. Similarly, the expected value of a fair die roll should be 1 \cdot 1/6 + 2 \cdot 1/6 + \cdots 6 \cdot 1/6 = 3.5. More generally, the formula we want to use is the probability weighted sum of values X takes on. Formally,

Definition 15: We say that a random variable X is discrete if it takes on a countable number of values. We say that it is simple if it takes on a finite number of values. For discrete random variables, we define the expected value of X taking on values n (with n possibly infinite) different values x_1,x_2,\ldots to be

\mathbb E[X] = \sum_{k=1}^n x_k \mathbb P(X = x_k)

provided that \sum_{k=1}^n |x_k| \mathbb P(X = x_k) < \infty

At first glance, it may be somewhat confusing why we need to restrict that the expected value sums to a finite number. However, if \sum_{k=1}^\infty |x_k| \mathbb P(X = x_k) = \infty while \sum_{k=1}^\infty x_k \mathbb P(X = x_k) < \infty, then a famous theorem of Riemann will show that the order of the x_k can be rearranged to make the sum any real number, or even \pm \infty. In this case, it is not very sensible to even talk about expectations since it is sensitive to how we order the possible values of x_k. Expectations are well behaved because they satisfy the following important property:

Lemma 16 (Linearity of Expectations): Let X,Y be arbitrary discrete random variables on a probability space. Then \mathbb E[X] + \mathbb E[Y] = \mathbb E[X + Y].

Proof: \mathbb E[X] + \mathbb E[Y] = \sum_{i=1}^n x_i \mathbb P(X = X_i) + \sum_{j=1}^m y_j \mathbb P(Y = y_j)

= \sum_{i=1}^n \sum_{j=1}^m x_i \mathbb P(X = X_i, Y = y_j) + \sum_{i=1}^n \sum_{j=1}^m y_j \mathbb P(X = X_i, Y = y_j)

= \sum_{i=1}^n \sum_{j=1}^m x_i + y_j \mathbb P(X = X_i, Y = y_j) = \sum_{k=1}^p (x+y)_k \mathbb P(X + Y = k)

= \mathbb E[X+Y]. As an exercise, justify the second equality more rigorously.

We would now like to extend the definition of integration to arbitrary measurable functions. The basic idea will be that measurable functions can be arbitrarily well approximated by simple functions. For example, we can show a result such as the following.

Proposition 17 (Approximation from below by simple functions): Let X be a non-negative measurable function. Then there exists a sequence of simple functions X_1,X_2,\ldots such that X_n < X and X_n \to X almost surely.

Proof: In fact, it is possible to get sure convergence. Define X_n to take on values in \frac1{2^n},\frac2{2^n},\ldots,n. Let X_n^{-1}\left(\{\frac{a}{2^n}\}\right) = X^{-1}\left(\left[\frac{a}{2^n},\frac{a+1}{2^n}\right)\right) except for the \frac{a}{2^n} = n. Let X_n^{-1}(\{n\}) = X^{-1}([n,\infty)).

The above construction is a common theme in the theory of integration in general and probability in general: in order to prove something about expected values (which turn out to be integrals) the following road map often works:

  1. First show the fact for indicator functions
  2. Extend to simple functions using linearity
  3. Extend to non-negative functions by approximation from below
  4. Extend to arbitrary functions by the fact that X = X^+ - X^- where X^+ = X \mathbf 1\{X \geq 0\} and X^- = -X \mathbf 1\{X \leq 0\}.

Motivated by this type of construction, we can define expectations in general:

Definition 18: Let X be an arbitrary random variable. Then

\mathbb E[X] = \sup\{\mathbb E[X^*]: X^* \text{ simple}, X^* \leq X\}

When the distribution of X has a p.d.f. (I will not rigorously define this here, but assume that the reader is familiar) for example, this definition gives precisely that E[X] = \int_{-\infty}^\infty x f_X(x)\,\mathrm dx. Readers with some familiarity with real analysis may recognize this as a special case of the definition of the Lebesgue integral. Approximation from below means that the linearity of expectations carries over from the discrete case to the general case. Additionally, other familiar properties of integrals  are preserved by this definition of expectations. There are three (four including the corollary) classical results that most contribute to the usefulness of this definition of expectation. Their proofs are somewhat long and technical, so I omit them here (although Wikipedia actually proves all of them if you are interested)

Theorem 19 (Monotone Convergence): Suppose X_1,X_2,\ldots are such that X_n \leq X_{n+1} and X_n \to X for almost all \omega. Then \mathbb E[X_n] \to \mathbb E[X].

Theorem 20 (Fatou’s Lemma): Let X_1,X_2,\ldots be such that \liminf_{n\to\infty} X_n = X. Then  \mathbb E[X] \leq \liminf_{n\to\infty} \mathbb E[X_n]

Corollary 21 (Reverse Fatou’s Lemma): Let X_1,X_2,\ldots be such that there is a non-negative Y \geq X_n for all n with \mathbb E[Y] < \infty. Then \limsup_{n\to\infty} \mathbb E[X_n] \leq \mathbb E[\limsup_{n\to\infty} X_n].

Theorem 22 (Dominated Convergence): Let X_1,X_2,\ldots be such that there is non-negative Y \geq |X_n| for all n with \mathbb E[Y] < \infty. Assume further that X_n \to X. Then \mathbb E[X_n] \to \mathbb E[X].

Monotone convergence tells us if the process by which the X_n converge to X is sufficiently well behaved, integration and taking limits may be swapped. Fatou’s lemma tells us that when swapping integrals and liminfs in general, mass may be lost, but not created. The Reverse Fatou Lemma gives some conditions under which mass can also not be created for limsups. Finally, dominated convergence tells us that if we can find some function that dominates the sequence (in the sense of being larger in absolute value) and this dominating function is well behaved, then this good behavior is passed on to the sequence it dominates. In practice, dominated convergence is the most commonly seen directly applied result.

In this post, I introduced a wide array of technical machinery that seems a bit disparate at this point. In a sequel, I will be able to discuss the laws of large numbers, which tie these concepts together. The long term goal will be to use these results to motivate a general framework for coming up with statistical estimators.

Leave a comment