The Multivariate Gaussian Distribution

Definition

Let \(\mathbf{x} \in \mathbb{R}^N\) be a vector whose elements, \(X_i\), are random variables (i.e., \(\mathbf{x}\) is a random vector). Because they are elements of the same random vector, we say that \(X_1, X_2, \dots, X_N\) are jointly distributed. Define the vector \(\mathbf{\mu} \in \mathbb{R}^N\) and the matrix \(\mathbf{\Sigma} \in \mathbb{R}^{N \times N}\) as \[ \begin{align*} \mathbf{\mu} &= (\mu_1, \mu_2, \dots, \mu_N)^\top = (\mathbb{E}(X_1), \mathbb{E}(X_2), \dots, \mathbb{E}(X_N))^\top \\ \mathbf{\Sigma} &= Cov(\mathbf{x}) = \mathbb{E}[(\mathbf{x} - \mathbf{\mu})(\mathbf{x} - \mathbf{\mu})^\top] = \Bigl[Cov(X_i, X_j) \Bigr]_{i, j = 1}^N. \end{align*} \]

If every linear combination of \(\mathbf{x}\)’s elements is a univariate normal variable, we say that \(\mathbf{x}\) follows the multivariate Gaussian distribution parameterized by \(\mathbf{\mu}\) and \(\mathbf{\Sigma}\), written \(\mathbf{x} \sim \mathcal{N}_N (\mathbf{\mu}, \mathbf{\Sigma})\). In particular, each marginal then satisfies \(X_i \sim \mathcal{N}(\mu_i, \mathbf{\Sigma}_{i, i})\).

Sampling Algorithm

A commonly used algorithm for sampling from a multivariate Gaussian distribution can be described as follows, given a vector \(\mathbf{\mu} \in \mathbb{R}^N\) and a positive-(semi)definite matrix \(\mathbf{\Sigma} \in \mathbb{R}^{N \times N}\):

  1. Compute the Cholesky decomposition \(\mathbf{\Sigma} = L L^\top\), where \(L\) is lower triangular (unique when \(\mathbf{\Sigma}\) is strictly positive definite).
  2. Generate \(\mathbf{z} = (Z_1, Z_2, \dots, Z_N)^\top\) where \(Z_1, Z_2, \dots, Z_N \overset{\text{i.i.d.}}{\sim} \mathcal{N}(0, 1)\).
  3. Compute \(\mathbf{x} = L\mathbf{z} + \mathbf{\mu}\).
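
One R-specific subtlety in step 1: R’s chol() returns the upper-triangular factor \(U\) satisfying \(U^\top U = \mathbf{\Sigma}\), so the lower-triangular \(L\) used in step 3 is its transpose. A minimal check, using the same example \(\mathbf{\Sigma}\) as the R Implementation section below:

Sigma <- matrix(c(1, 0.1, 0.3, 0.1, 1, 0.1, 0.3, 0.1, 1), 3, 3)
L <- t(chol(Sigma)) # lower-triangular factor
all.equal(L %*% t(L), Sigma) # TRUE, up to floating-point error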

Why does the algorithm work?

The claim is that \(\mathbf{x}\) is a draw from a multivariate Gaussian distribution parameterized by \(\mathbf{\mu}\) and \(\mathbf{\Sigma}\), i.e., \(\mathbf{x} \sim \mathcal{N}_N (\mathbf{\mu}, \mathbf{\Sigma})\). So, why does this work? First, we can show that \(\mathbb{E}(\mathbf{x}) = \mathbf{\mu}\):

\[ \begin{align*} \mathbb{E}(\mathbf{x}) &= \mathbb{E}(L\mathbf{z} + \mathbf{\mu}) \\ &= \mathbb{E}(L\mathbf{z}) + \mathbb{E}(\mathbf{\mu}) \\ &= L \cdot \mathbb{E}(\mathbf{z}) + \mathbf{\mu} \\ &= L \cdot 0 + \mathbf{\mu} \\ &= \mathbf{\mu}. \end{align*} \]
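
As a quick empirical check of this result, here is a small simulation; the particular \(L\) and \(\mathbf{\mu}\) below are arbitrary illustration values:

set.seed(1)
mu <- c(1, -2) # arbitrary mean vector
L <- matrix(c(2, 0.5, 0, 1), 2, 2) # arbitrary lower-triangular factor
X <- replicate(1e5, drop(L %*% rnorm(2)) + mu) # each column is one draw of x
rowMeans(X) # approximately (1, -2)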

Next, we can see that \(Cov(\mathbf{x}) = \mathbf{\Sigma}\):

Recall that \(Cov(\mathbf{x})\) is not a scalar; rather, it is the covariance matrix (or variance-covariance matrix) of \(\mathbf{x}\).

\[ \begin{align*} Cov(\mathbf{x}) &= \mathbb{E}[(\mathbf{x} - \mathbf{\mu})(\mathbf{x} - \mathbf{\mu})^\top] \\ &= \mathbb{E}[(L\mathbf{z} + \mathbf{\mu} - \mathbf{\mu})(L\mathbf{z} + \mathbf{\mu} - \mathbf{\mu})^\top] \\ &= \mathbb{E}[L\mathbf{z}\mathbf{z}^\top L^\top] \\ &= L\ \mathbb{E}[\mathbf{z}\mathbf{z}^\top]\ L^\top \\ &= L\ \mathbb{E}[(\mathbf{z} - 0)(\mathbf{z} - 0)^\top]\ L^\top \\ &= L\ Cov(\mathbf{z})\ L^\top \\ &= L I L^\top \\ &= L L^\top \\ &= \mathbf{\Sigma}. \end{align*} \]
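
The same simulation verifies this result: the sample covariance of the draws above should approximate \(L L^\top\).

round(cov(t(X)), 2) # approximately equal to...
L %*% t(L)          # ...the Sigma implied by this L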

This proof hinges on the definition of covariance and the independence of the \(Z_i\)’s. Remember that \(Cov(Z_i, Z_i) = Var(Z_i) = 1\), and that \(Cov(Z_i, Z_j) = 0\) for \(i \neq j\) because \(Z_i \perp Z_j\); together these give \(Cov(\mathbf{z}) = I\).
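
The claim \(Cov(\mathbf{z}) = I\) is also easy to check empirically:

round(cov(matrix(rnorm(3e5), ncol = 3)), 2) # approximately the 3x3 identity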

The following references (Lepidopterist, n.d.; hkBattousai, n.d.; Bilionis 2022) were helpful to me as I worked through understanding the algorithm and the proofs of the two results above.

Having shown that the expectation and (co)variance of \(\mathbf{x}\) are what we’d expect, we still need to show that \(\mathbf{x}\) qualifies as multivariate normal. A random vector is multivariate normal if any linear combination of its elements is normally distributed. By construction, \(\mathbf{x}\) is an affine transformation of a standard normal vector, i.e., \(\mathbf{z}\). We can decompose \(\mathbf{x}\)’s \(i\)-th element, \(X_i\), into the following:

\[ X_i = \sum_{j = 1}^{N} L_{i,j} Z_j + \mu_i, \] a sum of independent, scaled standard normal variables plus the constant \(\mu_i\), with variance \(Var(X_i) = \sum_{j = 1}^{N} L_{i,j}^2 = (L L^\top)_{i,i} = \mathbf{\Sigma}_{i,i}\).
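
The variance identity above is just the diagonal of \(L L^\top\); as a quick numerical confirmation, reusing the example \(\mathbf{\Sigma}\) defined after the sampling algorithm:

all.equal(rowSums(t(chol(Sigma))^2), diag(Sigma)) # TRUE: sum_j L[i,j]^2 = Sigma[i,i]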

It follows that \(X_i \sim \mathcal{N}(\mu_i, \mathbf{\Sigma}_{i,i})\), provided that scaling a normal variable by a constant and shifting it by a constant each yield another normal variable. This follows from theorems regarding the transformation of random variables (Casella and Berger 2024). I’ll demonstrate it for an arbitrary \(Z \sim \mathcal{N}(0, 1)\), \(\mu\), and \(\sigma > 0\) (omitting the subscript \(i\) for brevity). The probability density function (pdf) of the standard normal, \(f_Z(z)\), is defined as \[ f_Z(z) = \frac{1}{\sqrt{2\pi}}\ e^{-z^2 / 2}. \]

Define \(X' = g(Z) = \sigma Z\). Then \(g^{-1}(x) = \frac{x}{\sigma}\), and \(\frac{d}{dx}g^{-1}(x) = \frac{1}{\sigma}\). Thus, the probability density function \(f_{X'}(x)\) is

\[ \begin{align*} f_{X'}(x) &= f_Z(g^{-1}(x)) \frac{d}{dx}g^{-1}(x) \\ &= f_Z \Bigl(\frac{x}{\sigma} \Bigr) \cdot \frac{1}{\sigma} \\ &= \frac{1}{\sqrt{2\pi}}\ e^{-(\frac{x}{\sigma})^2 \cdot \frac{1}{2}} \cdot \frac{1}{\sigma} \\ &= \frac{1}{\sqrt{2\pi} \sigma}\ e^{-\frac{x^2}{2\sigma^2}}. \end{align*} \]

This is the pdf for \(\mathcal{N}(0, \sigma^2)\). Now define \(X = h(X') = X' + \mu\), where \(h^{-1}(x) = x - \mu\) and \(\frac{d}{dx}\ h^{-1}(x) = 1\). As with the above, we have \[ f_X(x) = f_{X'}(h^{-1}(x)) \frac{d}{dx}\ h^{-1}(x) = f_{X'}(x - \mu) = \frac{1}{\sqrt{2\pi} \sigma}\ e^{-\frac{(x - \mu)^2}{2\sigma^2}}, \]

which is the classical normal probability density function, i.e., \(X \sim \mathcal{N}(\mu, \sigma^2)\). This means that scaling a normal variable by a nonzero constant, or adding a constant to it, results in another normal variable. So each \(X_i\) is normal. Lastly, as one’s intuition might suspect, the sum of independent normal variables is also normal. I won’t go through the proof here, but several versions can be found on Wikipedia.
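
Both change-of-variables steps can be verified numerically; here is a minimal sketch with arbitrarily chosen \(\mu\) and \(\sigma\):

x <- seq(-5, 7, by = 0.5)
mu0 <- 1; sigma0 <- 2 # arbitrary illustration values
# (1 / sigma) * f_Z((x - mu) / sigma) should equal the N(mu, sigma^2) pdf
all.equal(dnorm((x - mu0) / sigma0) / sigma0, dnorm(x, mean = mu0, sd = sigma0)) # TRUE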

Thus, a linear combination \(Y = a_1 X_1 + \cdots + a_N X_N = \mathbf{a}^\top \mathbf{x} = (\mathbf{a}^\top L)\mathbf{z} + \mathbf{a}^\top \mathbf{\mu}\) is itself a constant plus a sum of independent, scaled normal variables, and is therefore normal. We can conclude that \(\mathbf{x}\) is a draw from the multivariate Gaussian distribution.
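
As a last sanity check, the sample mean and variance of such a \(Y\) should approximate \(\mathbf{a}^\top \mathbf{\mu}\) and \(\mathbf{a}^\top \mathbf{\Sigma} \mathbf{a}\); this sketch reuses X, L, and mu from the two-dimensional simulation above:

a <- c(0.5, -1) # arbitrary coefficients
Y <- drop(t(a) %*% X) # one Y per simulated draw
c(mean(Y), sum(a * mu)) # sample vs. theoretical mean
c(var(Y), drop(t(a) %*% (L %*% t(L)) %*% a)) # sample vs. theoretical variance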

R Implementation

set.seed(123)
mu <- c(1.1, 3, 8)
Sigma <- matrix(c(1, 0.1, 0.3, 0.1, 1, 0.1, 0.3, 0.1, 1), 3, 3)

mvrnorm <- function(n, mu, Sigma) {
  L <- t(chol(Sigma)) # chol() returns U with t(U) %*% U = Sigma, so L = t(U)
  d <- length(mu)

  out <- matrix(0, nrow = n, ncol = d)
  for (i in seq_len(n)) {
    z <- rnorm(d)                  # step 2: z ~ N(0, I)
    out[i, ] <- drop(L %*% z) + mu # step 3: x = Lz + mu
  }
  return(out)
}

mvrnorm(10, mu, Sigma)
           [,1]     [,2]     [,3]
 [1,] 0.5395244 2.714929 9.298527
 [2,] 1.1705084 3.135691 9.661861
 [3,] 1.5609162 1.787372 7.395843
 [4,] 0.6543380 4.173380 8.294725
 [5,] 1.5007715 3.150205 7.599224
 [6,] 2.8869131 3.674046 6.700175
 [7,] 1.8013559 2.599714 7.161280
 [8,] 0.8820251 1.957341 7.169001
 [9,] 0.4749607 1.259257 8.490846
[10,] 1.2533731 1.882905 9.158747
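
Finally, with many draws the sample moments from mvrnorm() should approximate the parameters, matching the two results proved above (values in comments are approximate):

draws <- mvrnorm(1e5, mu, Sigma)
round(colMeans(draws), 2) # approximately mu = (1.1, 3.0, 8.0)
round(cov(draws), 2)      # approximately Sigma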

References

Casella, G., and R. Berger. 2024. Statistical Inference. Chapman & Hall/CRC Texts in Statistical Science. CRC Press. https://books.google.com/books?id=cqUIEQAAQBAJ.
Lepidopterist. n.d. “Generating Multivariate Gaussian Samples – Why Does It Work?” Mathematics Stack Exchange. https://math.stackexchange.com/q/2691254.
hkBattousai. n.d. “Affine Transformation Applied to a Multivariate Gaussian Random Variable - What Is the Mean Vector and Covariance Matrix of the New Variable?” Mathematics Stack Exchange. https://math.stackexchange.com/q/332722.
Bilionis, Ilias. 2022. Introduction to Scientific Machine Learning. Self-published.