Multivariate Gaussian Conditioning Without Explicit Inversion

Suppose we have some random vector (X, Y)^T \in \mathbb R^{m+n} following some (non-degenerate in the sense of having positive definite covariance) multivariate Gaussian distribution with mean (\mu_X, \mu_Y) and covariance matrix

\begin{pmatrix}\Sigma_{xx}& \Sigma_{xy}\\ \Sigma_{xy}^T & \Sigma_{yy}\end{pmatrix}

It is a well known fact that the conditional distribution of Y | X is given by another multivariate Gaussian with mean and covariance matrix given respectively by

\mathbb E[Y | X] = \mu_Y + \Sigma_{xy}^T\Sigma_{xx}^{-1}(X - \mu_X)

\mathbb C\mathrm{ov}(Y | X) = \Sigma_{yy} - \Sigma_{xy}^T \Sigma_{xx}^{-1}\Sigma_{xy}

These formulas have some nice intuition behind them, especially when they interpreted as Bayesian rules for updating beliefs about Y upon seeing some relevant data, X. The top formula is basically saying that we should move in the direction of the best predictor given X, and our update should be more extreme the more unexpected X is. The bottom formula basically says that information about X must reduce our uncertainty about Y.

The usual proof of this is a tedious and boring calculation involving expressing a matrix inversion in terms of a Schur complement (this is not unlike how the typical proof Frisch-Waugh-Lovell is also just a computation with Schur complements). Here, I want to present an alternative proof along lines I feel are more typical to how econometricians do stats.

By demeaning our multivariate Gaussian, we will WLOG assume that \mathbb E[X] = 0 and \mathbb E[Y] = 0. Then by one definition of a multivariate Gaussian distribution, we have there are full rank matrixes A,B such that X = A U and Y = B U for some U \sim \mathcal N(0, I_{m+n}) such that

\begin{pmatrix}\Sigma_{xx}& \Sigma_{xy}\\ \Sigma_{xy}^T & \Sigma_{yy}\end{pmatrix} = \begin{pmatrix}A A^T & A B^T\\ B A^T & B B^T\end{pmatrix}

We will need two technical exercises to justify a few operations on submatrices:

Exercise 1: Show that if A is a positive definite matrix, then A is invertible and A^{-1} is also positive definite (Hint: use the spectral theorem).

Exercise 2: Show that if A is positive definite, then any square sub-matrix is also positive definite (Hint: use the definition that A is positive definite if and only if x^T A x > 0 for all vectors x).

With these two in hand, it is possible to prove:

Exercise 3: Show that Y | X follows another Gaussian distribution and \mathbb E[Y | X] is linear in X (Hint: use the conditional density formula. It suffices to show that the density must be proportional to a multivariate Gaussian density without showing what that density is).

Remark 4: Hopefully, the above exercise makes clear why the typical proof is to take inverses using a Schur complement. It would be the obvious next step after the conclusion of the exercise. However, I will proceed in a way that I think is more conceptually stimulating and less computationally burdensome.

We can now write the relationship between X,Y as a linear regression, giving us

BU = Y = DX + V = DAU + V

Then we are just solving the least squares problem:

D = \mathbb E[YX^T] \mathbb E[X X^T]^{-1} = B A^T (A A^T)^{-1}

Exercise 3 shows us that the conditional expectation is linear, so it is equivalent to the best leaner predictor, so \mathbb E[Y | X] = DX = B A^T (A A^T)^{-1} X. Additionally, we have that V is Gaussian and uncorrelated with X and hence independent of X. So \mathbb C\mathrm{ov}(Y | X) = \mathbb C\mathrm{ov}(V | X) = \mathbb C\mathrm{ov}(V). But we also have that

V = (B - DA) U = B(I - A^T(A A^T)^{-1}A)U

\implies \mathbb C\mathrm{ov}(V) = B(I - A^T(A A^T)^{-1}A) B^T = B B^T - B A^T (A A^T)^{-1} A B^T

which is the desired conditional covariance formula.

I will conclude by teasing a somewhat applied example (although admittedly, I don’t know the algorithms I’m referencing with any depth). Note that the conditional covariance formula above is nothing more than the Schur complement of \mathbb C\mathrm{ov}(Y) = B B^T, and it can be shown that if we write:

\begin{pmatrix}\Sigma_{xx}& \Sigma_{xy}\\ \Sigma_{xy}^T & \Sigma_{yy}\end{pmatrix}^{-1} = \begin{pmatrix}\Lambda_{xx}& \Lambda_{xy}\\ \Lambda_{xy}^T & \Lambda_{yy}\end{pmatrix}

we have that \Lambda_{yy} = \left(B B^T - B A^T (A A^T)^{-1} A B^T\right)^{-1}. Suppose Y = (Y_1, Y_2)^T \in \mathbb R^2. Then Y_1 \perp Y_2 | X \iff \mathbb C\mathrm{ov}(Y_1,Y_2) = 0 \iff \Lambda_{yy} is a diagonal matrix (since a matrix is diagonal if and only if its inverse is diagonal). The upshot is that if X is multivariate Gaussian system with \mathbb C\mathrm{ov}(X)^{-1} = \Lambda, then we have that

X_i \perp X_j | (X_k)_{k\not\in\{i,j\}} \iff \Lambda_{ij} = \Lambda_{ji} = 0

This nice property partially explains the popularity of a Gaussian assumption in the probabilistic graphical models literature. In particular, the above shows that an undirected graph where each component of X gets a node, and each pair of nodes i,j has an edge between them if and only if \Lambda_{ij} = 0 is a valid Markov random field representing the dependence structure of X. This construction allows us to use general purpose exact and approximate inference algorithms on variables in X, and these sort of ideas eventually motivate the actual algorithms that housekeeping robots and self driving cars use.

Leave a comment