Some probability theory

A popular musing about math is to ask whether an advanced alien civilization would have the same mathematics as us. A common answer is that while the language and symbols would almost certainly be different, the most important underlying concepts would be isomorphic.

A quip I had after my first undergrad statistics class is that the real question we should be asking is whether an alien civilization would have a statistics that is even remotely similar to ours. The frustration I was getting at here is the fact that so much of what we learned seemed arbitrary and just based on the modeling decisions made by one or two statisticians. For example, would an alien civilization favor Bayesian or frequentist statistics? Maybe neither?

My current view is that these sorts of questions distract from a core of ideas so natural that if anyone who believes that aliens have isomorphic mathematics, would have to conclude that they have at least nearly isomorphic statistics. In the next series of posts, I hope to convince readers of this position. I will probably fail at this, but maybe along the way, we can learn some statistics.

The core idea behind much of statistics is that when a phenomenon is sufficiently regular (both in the sense of occurring over and again and in the sense of being “well behaved” in a technical sense), collecting more and more data should allow us to understand it arbitrarily well. The key way that this intuition gets formalized is by studying the “convergence” behavior of various random processes. We already saw this to an extent in the form of the central limit theorem. In that case, the the phenomenon we wanted to study was the mean of a distribution. The regularity was that we assumed that we could get data from the same distribution over and again, and that the distribution at least had finite variance. The convergence we obtained was a concept known as convergence in distribution. In one sense, the result is remarkably general. We had to make relatively few assumptions about the underlying distribution to get a very precise result. In another sense, it seems remarkably weak. The only phenomenon it allows us to study right now is expectations, and it requires us to adopt a certain view of convergence.

We will begin by studying convergence in more detail. To do so, we will need a more precise language. I will spend this post walking through some of the basics of how mathematicians formalize the intuitive notion of probability. To do this properly, we need to first introduce the concept of the $\sigma$ -field:

Definition 1: Let $\Omega$ be some set. A $\sigma$ -field, $\mathcal F$ is a collection of subsets of $\Omega$ such that

$\Omega \in \mathcal F$
If $F \in \mathcal F$ , then $F^c \equiv \Omega \backslash F \in \mathcal F$ (closure under complements).
If $F_1,F_2,\ldots \in \mathcal F$ , then $\cup_{i=1}^\infty F_i \in \mathcal F$ (closure under countable union).

The backslash here is set difference. Such a concept seems unnatural, but is needed in some form. Intuitively, $\sigma$ -fields correspond to our definition of sets that are well behaved enough to warrant our study. Studying every possible set will turn out to be too difficult since it will in general lead to paradoxes. We therefore will need to restrict our study somewhat, but hopefully as little as possible. In particular, 2 and 3 guarantee that we can do logic on whatever collection $\mathcal F$ we restrict our attention to since set complements corresponds to logical not, and set unions correspond to logical or. Using De Morgan’s laws, “or” and “not” are sufficient to construct other basic logical operations. But given this, one might ask why we require countable union, and not just finite union. This is largely motivated by the fact that it will allow us just enough structure to ensure that we will be able to study sequences we construct. Note that countable additivity is understood to include finite additivity as a special case (just set $F_i = \emptyset$ for all sufficiently large $i$ ). Sets $F\in\mathcal F$ will be called measurable which is suggestive of the interpretation that $\mathcal F$ is our model of things that are well behaved enough for us to study. We can now define a probability space.

Definition 2: A probability space is a triple $(\Omega, \mathcal F, \mathbb P)$ where $\mathcal F$ is a $\sigma$ -field on $\Omega$ , and $\mathbb P: \mathcal F \to [0,1]$ is a function that takes a measurable set to a probability satisfying:

$\mathbb P(\Omega) = 1$ .
If $F_1,F_2,\ldots$ are mutually disjoint (i.e. $F_i \cap F_j = \emptyset$ for $i \neq j$ ), then $\sum_{i=1}^\infty \mathbb P(F_i) = \mathbb P\left(\cup_{i=1}^\infty F_i\right)$

The elements $F \in \mathcal F$ are called events. 1 corresponds to the idea that something from $\Omega$ has to have happened. This tells us that we have to design our probability spaces to model all possible outcomes. 2 captures an intuition we have about how probabilities should behave. If I roll a 6 sided die, the probability I roll a 1 or a 2 is the probability that I roll a 1 plus the probability I roll a 2. At this point, it should be noted how important our definition of $\mathcal F$ was. It exactly corresponds to what was needed to to ensure the operations used in defining $\mathbb P$ are well defined. I will adopt the convention that $\omega$ is a generic element of $\Omega$ As it stands, this definition seems sort of barren. We are missing a lot of other familiar facts that all probabilities should satisfy. The following should assuage these concerns:

Exercise 3: Show that probability spaces satisfy the following:

$\mathbb P(\emptyset) = 0$ (null empty set)
$\mathbb P(F^c) = 1 - \mathbb P(F)$ (probability of complements)
If $F' \subseteq F$ , then $\mathbb P(F') \leq \mathbb P(F)$ (monotonicity)
For arbitrary $F_1,F_2,\ldots\in\mathcal F$ , $\mathbb P(\cup_{i=1}^\infty F_i) \leq \sum_{i=1}^\infty \mathbb P(F_i)$ (union bound)

Another useful property of $\sigma$ -fields is that they are monotone in the following sense:

Lemma 4: Let $F_1,F_2,\ldots$ be such that $F_k \subset F_{k+1}$ for all $k$ . Then $\mathbb P(F_k) \to \mathbb P(\cup_{k=1}^\infty F_k)$ .

Proof: The idea is to study related sets where we can use countable additivity. To that effect, construct $F_k' = F_k \backslash F_{k-1}$ . It is clear that $\cup_{k=1}^n F_k' = \cup_{k=1}^n F_k$ , and $\cup_{k=1}^\infty F_k' = \cup_{k=1}^\infty F_k$ . Additionally, since the $F_k$ are nested, we have that all the $F_k'$ are disjoint. Then

$\mathbb P(F_k) = \mathbb P(\cup_{k=1}^n F_k') = \sum_{k=1}^n \mathbb P(F_k') \to \sum_{k=1}^\infty P(F_k') = \mathbb P(\cup_{k=1}^\infty F_k') = \mathbb P(\cup_{k=1}^\infty F_k)$ .

Exercise 5: The above result is called continuity from below. Prove continuity from above: If $F_1,F_2,\ldots$ is such that $F_{k+1} \subset F_k$ , then $\mathbb P(F_k) \to \mathbb P(\cap_{k=1}^\infty F_k)$ . (Hint: Use De Morgan’s laws to reduce this to the assumptions of lemma 4)

For the purposes of many results, it will be no loss to consider settings where some desired property holds for every $\omega \in \Omega$ except for some exceptional set $N$ such that $\mathbb P(N) = 0$ . When a property satisfies this, we say that it holds almost surely or for almost all $\omega$ .

At this point, textbooks about probability theory will walk through chapters of results on constructing probability spaces. I will spare you the gory details here. As an antidote to the abstractness here, I will give an example to play around with.

Example 6: Let $\Omega = [0,1]$ . Let $\mathcal F$ be the collection of all subsets of $[0,1]$ that can be constructed using countable unions and intersections of the open subsets of $[0,1]$ . Let $\mathbb P((a,b)) = b - a$ and $\mathbb P(A) = \inf\{\mathbb P(F): F \in \mathcal F, F = \cap_{i=1}^m \cup_{j=1}^n (a_{ij},a_{ij})\}$ . This is the uniform measure on $[0,1]$ and may be recognized as corresponding to the uniform distribution on $[0,1]$ .

Exercise 7: Show that the above is a probability space. In addition, show that every countable subset of $[0,1]$ is in $\mathcal F$ and in fact, has probability 0. Conclude that the probability of drawing a rational number from $[0,1]$ is 0.

Given a probability space, a function $f: \Omega \to \mathbb R$ is measurable if $f^{-1}(U) \in \mathcal F$ (i.e. the preimage of $U$ under $f$ is measurable) for each open set $U \subset \mathbb R$ . So measurable functions are just those functions that play well with both the topology of the real number line and the measurable sets at the same time.

Definition 8: A random variable $X$ is any measurable function $X: \Omega \to \mathbb R$ .

For an arbitrary topology, the Borel $\sigma$ -field is the $\sigma$ -field comprising all sets that can be constructed as countable unions and intersections of open and closed sets (show that this is indeed a $\sigma$ -field). Sets in this $\sigma$ -field are called Borel measurable, or Borel sets. If $B$ is Borel measurable, then we will say that $\mathbb P(X \in B) = \mathbb P(X^{-1}(B))$ . Hopefully, it is now immediately obvious why we defined random variables in this way: it is precisely what was needed to ensure that probabilities of the most well behaved sets (in topology, the open sets are intuitively the ones that are the most normal) are well defined. Thinking about random variables as functions is useful because it allows us to distinguish between the random variable and its realization. The random variable itself is the entire function and associated probability of taking on various values while the realization, $X(\omega)$ is the evaluation of $X$ at a particular $\omega \in \Omega$ . This single distinction is one of the main reasons for the long detour through technical definitions. Usually, thinking about whether functions or sets are measurable is not needed for everyday users of probability. It’s amazingly difficult to construct examples of non-measurable objects. However, the distinction between random variables and realizations will be useful, and hence, I would like to make sure readers at least understand it at a conceptual level. The following result is trivial (in the precise sense of following straight from definitions) if you use the right definition of continuity. Showing it is a good check of if you grasp the definition sufficiently well.

Exercise 9: Show that if $X_1,X_2,\ldots,X_k$ are random variables, and $g:\mathbb R^k \to \mathbb R$ is continuous, then $g(X_1,\ldots,X_k)$ is a random variable.

Often, we can ignore the underlying probability space altogether and just look at how the random variable itself behaves. We call the distribution of $X$ the mapping $U\mapsto \mathbb P(X \in U)$ for each Borel set $U$ . In particular, it suffices to know the familiar c.d.f. $F_X(x) = \mathbb P(X \leq x)$ . I am now able to define two of the most useful concepts of convergence:

Definition 10:

A sequence of random variables $X_n$ converges in probability to $X$ (denoted $X_n \overset p\to X$ ) if for every $\varepsilon,\delta$ , there exists $N$ such that for all $n \geq N$ , $\mathbb P(|X_n - X| < \delta) < \varepsilon$ (in light of exercise 7, $|X_n - X|$ is a random variable. The set $\{x: x < \delta\}\subset \mathbb R$ is open and hence the above probability makes sense).
A sequence of random variables $X_n$ converges almost surely to $X$ (denoted $X_n \overset{a.s.}\to X$ if $\mathbb P(X_n \to X) = 1$ (note here the usefulness of this concept of realization. More rigorously, the probability on the left is $\mathbb P(\{\omega\in \Omega: X_n(\omega) \to X(\omega)\})$ . The definition thus presupposes the measurability of this set).

Lemma 11: $X_n \overset{a.s.}\to X \implies X_n\overset p\to X$ .

Proof: Let $A_k = \cup_{n \geq k} \{\omega: |X_n(\omega) - X(\omega)| > \varepsilon\}$ . Clearly $A_{k+1} \subset A_k$ . Define $A_\infty = \cap_{k=1}^\infty A_k$ . Translating the set notation into logic, $A_\infty$ is the set of $\omega$ where $|X_n(\omega) - X(\omega)| > \varepsilon$ infinitely often. It is clear that $X_n(\omega) \not\to X(\omega)$ if this is true, hence $A_\infty \subset N = \{\omega: X_n(\omega) \not\to X(\omega)\}$ . By monotonicity and the definition of almost sure convergence, this means $\mathbb P(A_\infty) = 0$ . Using continuity from above, we have $\mathbb P(A_k) \to \mathbb P(A_\infty) = 0$ . Finally, note that $\{\omega: |X_k(\omega) - X(\omega)| > \varepsilon\} \subset A_k$ . Thus, using monotonicity one more time $\mathbb P(|X_k(\omega) - X(\omega)| > \varepsilon) \leq \mathbb P(A_k) \to 0$ , which is exactly the definition of convergence in probability.

A natural question to ask is whether or not the opposite implication holds. The following exercise asks you to show that indeed, it does not.

Exercise 12 (Typewriter sequence): Consider again, the uniform probability measure. Let

$X_{m,n}(\omega) = \begin{cases} 1 & \omega < \frac mn \\ 0 & \text{otherwise}\end{cases}$

for $m = 1,2,\ldots,n$ . Show that $X_{m,n} \overset p\to \mathbf 1$ along the path $n\to\infty$ , $m\to n$ for each $n$ (where $\mathbf 0$ here is understood to be the constant function that maps everything to $0$ ), but $X_n \overset{a.s}{\not\to} \mathbf 0$

The definition of almost sure convergence is not easy to work with. It asks about the probability of some sequence of random variables converging. Convergence and probability are difficult enough concepts on their own, so thinking about both at the same time can be daunting. The following result is the most common strategy for showing convergence almost surely:

Lemma 13 (First Borel-Cantelli Lemma): Let $A_1,A_2,\ldots$ be events in a probability space. Suppose $\sum_{k=1}^n \mathbb P(A_k) < \infty$ . Then $\mathbb P(\cap_{n=1}^\infty\cup_{k \geq n} A_k) = 0$ .

Proof: An equivalent condition to $\sum_{k=1}^n \mathbb P(A_k) < \infty$ is that $\sum_{k\geq n} \mathbb P(A_k) \to 0$ . Using continuity from above, we have $P(\cap_{n=1}^\infty\cup_{k\geq n} A_k) = \lim_{n\to\infty} P(\cup_{k\geq n} A_k)$ . But now, using countable subadditivity, $P(\cup_{k\geq n} A_k \leq \sum_{k\geq n} \to 0$ . Putting these together gives us exactly that $\mathbb P(\cap_{n=1}^\infty\cup_{k \geq n} A_k) = 0$ .

To see why this is exactly what is needed to show convergence almost surely, notice that taking the $A_k$ to be as in the proof of Lemma 11 and $\cap_{n=1}^\infty\cup_{k \geq n} A_k \equiv A_\infty$ , convergence almost surely is exactly the requirement that $\mathbb P(A_\infty) = 0$ . The following exercise can be used to formalize the famous idea that “a monkey hitting random typewriter keys for an infinite amount of time will eventually write the complete works of William Shakespeare” (as well as the complete works of William Shakespeare with a single typo, etc).

Exercise 14 (Second Borel-Cantelli Lemma): Events $A_1,A_2,\ldots$ are called independent if $\mathbb P(\cup_{k=1}^\infty A_k) = \prod_{k=1}^\infty \mathbb P(A_k)$ (check that in the finite case, this corresponds to the intuition we are used to). Prove the following partial converse to the Borel-Cantelli lemma: Suppose $\sum_{k=1}^\infty \mathbb P(A_k) = \infty$ for independent events $A_1,\ldots$ . Then $\mathbb P\left(\left(\cap_{n=1}^\infty\cup_{k\geq n} A_k\right)^c\right) = 0$ . (Hint: Use De Morgan’s law to convert $\cap_{n=1}^\infty\cup_{k\geq n} A_k$ to a similar statement about $A_k^c$ . Use the definition of independence to get a statement about a certain product of probabilities. Next, use the fact that $1 - x \leq e^{-x}$ for all $x$ . This fact will help turn a product into a sum. Conclude by using the assumption we made about the sum.)

istock-18586699-monkey-computer_brick-16e5064d3378a14e0e4c2da08857efe03c04695e-s800-c85 — My roommate said I should include more figures so here you go: a modern monkey at a modern typewriter.

I will close out this post by defining expected value, as this is the operation under which many of the standard convergence results will hold. To begin, let us consider the intuitive definition of expected value, or mean, in the case when $X$ only takes on finitely many values. Suppose, for instance, I flip a coin with probability $p$ of landing heads and probability $1- p$ of landing tails. Let $X = 1$ if it lands heads and $X = -1$ if it lands tails. Then intuitively, we want to define expected value so that the mean of $X$ is $p \cdot 1 + (1-p) \cdot -1 = 2p - 1$ . Similarly, the expected value of a fair die roll should be $1 \cdot 1/6 + 2 \cdot 1/6 + \cdots 6 \cdot 1/6 = 3.5$ . More generally, the formula we want to use is the probability weighted sum of values $X$ takes on. Formally,

Definition 15: We say that a random variable $X$ is discrete if it takes on a countable number of values. We say that it is simple if it takes on a finite number of values. For discrete random variables, we define the expected value of $X$ taking on values $n$ (with $n$ possibly infinite) different values $x_1,x_2,\ldots$ to be

$\mathbb E[X] = \sum_{k=1}^n x_k \mathbb P(X = x_k)$

provided that $\sum_{k=1}^n |x_k| \mathbb P(X = x_k) < \infty$

At first glance, it may be somewhat confusing why we need to restrict that the expected value sums to a finite number. However, if $\sum_{k=1}^\infty |x_k| \mathbb P(X = x_k) = \infty$ while $\sum_{k=1}^\infty x_k \mathbb P(X = x_k) < \infty$ , then a famous theorem of Riemann will show that the order of the $x_k$ can be rearranged to make the sum any real number, or even $\pm \infty$ . In this case, it is not very sensible to even talk about expectations since it is sensitive to how we order the possible values of $x_k$ . Expectations are well behaved because they satisfy the following important property:

Lemma 16 (Linearity of Expectations): Let $X,Y$ be arbitrary discrete random variables on a probability space. Then $\mathbb E[X] + \mathbb E[Y] = \mathbb E[X + Y]$ .

Proof: $\mathbb E[X] + \mathbb E[Y] = \sum_{i=1}^n x_i \mathbb P(X = X_i) + \sum_{j=1}^m y_j \mathbb P(Y = y_j)$

$= \sum_{i=1}^n \sum_{j=1}^m x_i \mathbb P(X = X_i, Y = y_j) + \sum_{i=1}^n \sum_{j=1}^m y_j \mathbb P(X = X_i, Y = y_j)$

$= \sum_{i=1}^n \sum_{j=1}^m x_i + y_j \mathbb P(X = X_i, Y = y_j) = \sum_{k=1}^p (x+y)_k \mathbb P(X + Y = k)$

$= \mathbb E[X+Y]$ . As an exercise, justify the second equality more rigorously.

We would now like to extend the definition of integration to arbitrary measurable functions. The basic idea will be that measurable functions can be arbitrarily well approximated by simple functions. For example, we can show a result such as the following.

Proposition 17 (Approximation from below by simple functions): Let $X$ be a non-negative measurable function. Then there exists a sequence of simple functions $X_1,X_2,\ldots$ such that $X_n < X$ and $X_n \to X$ almost surely.

Proof: In fact, it is possible to get sure convergence. Define $X_n$ to take on values in $\frac1{2^n},\frac2{2^n},\ldots,n$ . Let $X_n^{-1}\left(\{\frac{a}{2^n}\}\right) = X^{-1}\left(\left[\frac{a}{2^n},\frac{a+1}{2^n}\right)\right)$ except for the $\frac{a}{2^n} = n$ . Let $X_n^{-1}(\{n\}) = X^{-1}([n,\infty))$ .

The above construction is a common theme in the theory of integration in general and probability in general: in order to prove something about expected values (which turn out to be integrals) the following road map often works:

First show the fact for indicator functions
Extend to simple functions using linearity
Extend to non-negative functions by approximation from below
Extend to arbitrary functions by the fact that $X = X^+ - X^-$ where $X^+ = X \mathbf 1\{X \geq 0\}$ and $X^- = -X \mathbf 1\{X \leq 0\}$ .

Motivated by this type of construction, we can define expectations in general:

Definition 18: Let $X$ be an arbitrary random variable. Then

$\mathbb E[X] = \sup\{\mathbb E[X^*]: X^* \text{ simple}, X^* \leq X\}$

When the distribution of $X$ has a p.d.f. (I will not rigorously define this here, but assume that the reader is familiar) for example, this definition gives precisely that $E[X] = \int_{-\infty}^\infty x f_X(x)\,\mathrm dx$ . Readers with some familiarity with real analysis may recognize this as a special case of the definition of the Lebesgue integral. Approximation from below means that the linearity of expectations carries over from the discrete case to the general case. Additionally, other familiar properties of integrals are preserved by this definition of expectations. There are three (four including the corollary) classical results that most contribute to the usefulness of this definition of expectation. Their proofs are somewhat long and technical, so I omit them here (although Wikipedia actually proves all of them if you are interested)

Theorem 19 (Monotone Convergence): Suppose $X_1,X_2,\ldots$ are such that $X_n \leq X_{n+1}$ and $X_n \to X$ for almost all $\omega$ . Then $\mathbb E[X_n] \to \mathbb E[X]$ .

Theorem 20 (Fatou’s Lemma): Let $X_1,X_2,\ldots$ be such that $\liminf_{n\to\infty} X_n = X$ . Then $\mathbb E[X] \leq \liminf_{n\to\infty} \mathbb E[X_n]$

Corollary 21 (Reverse Fatou’s Lemma): Let $X_1,X_2,\ldots$ be such that there is a non-negative $Y \geq X_n$ for all $n$ with $\mathbb E[Y] < \infty$ . Then $\limsup_{n\to\infty} \mathbb E[X_n] \leq \mathbb E[\limsup_{n\to\infty} X_n]$ .

Theorem 22 (Dominated Convergence): Let $X_1,X_2,\ldots$ be such that there is non-negative $Y \geq |X_n|$ for all $n$ with $\mathbb E[Y] < \infty$ . Assume further that $X_n \to X$ . Then $\mathbb E[X_n] \to \mathbb E[X]$ .

Monotone convergence tells us if the process by which the $X_n$ converge to $X$ is sufficiently well behaved, integration and taking limits may be swapped. Fatou’s lemma tells us that when swapping integrals and liminfs in general, mass may be lost, but not created. The Reverse Fatou Lemma gives some conditions under which mass can also not be created for limsups. Finally, dominated convergence tells us that if we can find some function that dominates the sequence (in the sense of being larger in absolute value) and this dominating function is well behaved, then this good behavior is passed on to the sequence it dominates. In practice, dominated convergence is the most commonly seen directly applied result.

In this post, I introduced a wide array of technical machinery that seems a bit disparate at this point. In a sequel, I will be able to discuss the laws of large numbers, which tie these concepts together. The long term goal will be to use these results to motivate a general framework for coming up with statistical estimators.

Some probability theory

Published by statsmodel

Leave a comment Cancel reply

Share this:

Related

Published by statsmodel

Leave a comment Cancel reply