Multivariate Gaussian Conditioning Without Explicit Inversion

Suppose we have some random vector $(X, Y)^T \in \mathbb R^{m+n}$ following some (non-degenerate in the sense of having positive definite covariance) multivariate Gaussian distribution with mean $(\mu_X, \mu_Y)$ and covariance matrix

$\begin{pmatrix}\Sigma_{xx}& \Sigma_{xy}\\ \Sigma_{xy}^T & \Sigma_{yy}\end{pmatrix}$

It is a well known fact that the conditional distribution of $Y | X$ is given by another multivariate Gaussian with mean and covariance matrix given respectively by

$\mathbb E[Y | X] = \mu_Y + \Sigma_{xy}^T\Sigma_{xx}^{-1}(X - \mu_X)$

$\mathbb C\mathrm{ov}(Y | X) = \Sigma_{yy} - \Sigma_{xy}^T \Sigma_{xx}^{-1}\Sigma_{xy}$

These formulas have some nice intuition behind them, especially when they interpreted as Bayesian rules for updating beliefs about $Y$ upon seeing some relevant data, $X$ . The top formula is basically saying that we should move in the direction of the best predictor given $X$ , and our update should be more extreme the more unexpected $X$ is. The bottom formula basically says that information about $X$ must reduce our uncertainty about $Y$ .

The usual proof of this is a tedious and boring calculation involving expressing a matrix inversion in terms of a Schur complement (this is not unlike how the typical proof Frisch-Waugh-Lovell is also just a computation with Schur complements). Here, I want to present an alternative proof along lines I feel are more typical to how econometricians do stats.

By demeaning our multivariate Gaussian, we will WLOG assume that $\mathbb E[X] = 0$ and $\mathbb E[Y] = 0$ . Then by one definition of a multivariate Gaussian distribution, we have there are full rank matrixes $A,B$ such that $X = A U$ and $Y = B U$ for some $U \sim \mathcal N(0, I_{m+n})$ such that

$\begin{pmatrix}\Sigma_{xx}& \Sigma_{xy}\\ \Sigma_{xy}^T & \Sigma_{yy}\end{pmatrix} = \begin{pmatrix}A A^T & A B^T\\ B A^T & B B^T\end{pmatrix}$

We will need two technical exercises to justify a few operations on submatrices:

Exercise 1: Show that if $A$ is a positive definite matrix, then $A$ is invertible and $A^{-1}$ is also positive definite (Hint: use the spectral theorem).

Exercise 2: Show that if $A$ is positive definite, then any square sub-matrix is also positive definite (Hint: use the definition that $A$ is positive definite if and only if $x^T A x > 0$ for all vectors $x$ ).

With these two in hand, it is possible to prove:

Exercise 3: Show that $Y | X$ follows another Gaussian distribution and $\mathbb E[Y | X]$ is linear in $X$ (Hint: use the conditional density formula. It suffices to show that the density must be proportional to a multivariate Gaussian density without showing what that density is).

Remark 4: Hopefully, the above exercise makes clear why the typical proof is to take inverses using a Schur complement. It would be the obvious next step after the conclusion of the exercise. However, I will proceed in a way that I think is more conceptually stimulating and less computationally burdensome.

We can now write the relationship between $X,Y$ as a linear regression, giving us

$BU = Y = DX + V = DAU + V$

Then we are just solving the least squares problem:

$D = \mathbb E[YX^T] \mathbb E[X X^T]^{-1} = B A^T (A A^T)^{-1}$

Exercise 3 shows us that the conditional expectation is linear, so it is equivalent to the best leaner predictor, so $\mathbb E[Y | X] = DX = B A^T (A A^T)^{-1} X$ . Additionally, we have that $V$ is Gaussian and uncorrelated with $X$ and hence independent of $X$ . So $\mathbb C\mathrm{ov}(Y | X) = \mathbb C\mathrm{ov}(V | X) = \mathbb C\mathrm{ov}(V)$ . But we also have that

$V = (B - DA) U = B(I - A^T(A A^T)^{-1}A)U$

$\implies \mathbb C\mathrm{ov}(V) = B(I - A^T(A A^T)^{-1}A) B^T = B B^T - B A^T (A A^T)^{-1} A B^T$

which is the desired conditional covariance formula.

I will conclude by teasing a somewhat applied example (although admittedly, I don’t know the algorithms I’m referencing with any depth). Note that the conditional covariance formula above is nothing more than the Schur complement of $\mathbb C\mathrm{ov}(Y) = B B^T$ , and it can be shown that if we write:

$\begin{pmatrix}\Sigma_{xx}& \Sigma_{xy}\\ \Sigma_{xy}^T & \Sigma_{yy}\end{pmatrix}^{-1} = \begin{pmatrix}\Lambda_{xx}& \Lambda_{xy}\\ \Lambda_{xy}^T & \Lambda_{yy}\end{pmatrix}$

we have that $\Lambda_{yy} = \left(B B^T - B A^T (A A^T)^{-1} A B^T\right)^{-1}$ . Suppose $Y = (Y_1, Y_2)^T \in \mathbb R^2$ . Then $Y_1 \perp Y_2 | X \iff \mathbb C\mathrm{ov}(Y_1,Y_2) = 0 \iff \Lambda_{yy}$ is a diagonal matrix (since a matrix is diagonal if and only if its inverse is diagonal). The upshot is that if $X$ is multivariate Gaussian system with $\mathbb C\mathrm{ov}(X)^{-1} = \Lambda$ , then we have that

$X_i \perp X_j | (X_k)_{k\not\in\{i,j\}} \iff \Lambda_{ij} = \Lambda_{ji} = 0$

This nice property partially explains the popularity of a Gaussian assumption in the probabilistic graphical models literature. In particular, the above shows that an undirected graph where each component of $X$ gets a node, and each pair of nodes $i,j$ has an edge between them if and only if $\Lambda_{ij} = 0$ is a valid Markov random field representing the dependence structure of $X$ . This construction allows us to use general purpose exact and approximate inference algorithms on variables in $X$ , and these sort of ideas eventually motivate the actual algorithms that housekeeping robots and self driving cars use.

Multivariate Gaussian Conditioning Without Explicit Inversion

Published by statsmodel

Leave a comment Cancel reply

Share this:

Related

Published by statsmodel

Leave a comment Cancel reply