Eigendecomposition
A useful technique to know in machine learning is how to factorize a matrix. At first glance this seems to only complicate things: we are writing one matrix as a product of several matrices. However, it can help us gain pictorial insight from our equations. In this article, we explore a matrix decomposition method known as eigendecomposition.
Eigendecomposition
Suppose $A$ is an $N \times N$ matrix with $N$ linearly independent eigenvectors. For simplicity, suppose $A$ is $2 \times 2$. We will have the following two equations:
$ A \vec{u_1} = \lambda_1 \vec{u_1} $
$ A \vec{u_2} = \lambda_2 \vec{u_2} $
Let's define a matrix $\Lambda$:
$ \Lambda = \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} $
Let's also arrange our eigenvectors into their own matrix $U$:
$ U = [\vec{u_1} \vec{u_2}] = \begin{bmatrix} u_1^{(1)} & u_1^{(2)} \\ u_2^{(1)} & u_2^{(2)} \end{bmatrix} $
Here, the superscript denotes which eigenvector we are referring to, and the subscript denotes the component of that vector. Now, consider the following matrix multiplication:
$ U \Lambda = \begin{bmatrix} u_1^{(1)} & u_1^{(2)} \\ u_2^{(1)} & u_2^{(2)} \end{bmatrix} \begin{bmatrix} \lambda_1 & 0 \\ 0 & \lambda_2 \end{bmatrix} = \begin{bmatrix} \lambda_1 u_1^{(1)} & \lambda_2 u_1^{(2)} \\ \lambda_1 u_2^{(1)} & \lambda_2 u_2^{(2)} \end{bmatrix} = [\lambda_1 \vec{u_1} \; \lambda_2 \vec{u_2}] $
Now, consider this matrix multiplication:
$ A U = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \begin{bmatrix} u_1^{(1)} & u_1^{(2)} \\ u_2^{(1)} & u_2^{(2)} \end{bmatrix} = \begin{bmatrix} a_{11} u_1^{(1)} + a_{12} u_2^{(1)} & a_{11} u_1^{(2)} + a_{12} u_2^{(2)} \\ a_{21} u_1^{(1)} + a_{22} u_2^{(1)} & a_{21} u_1^{(2)} + a_{22} u_2^{(2)} \end{bmatrix} = [A \vec{u_1} \; A \vec{u_2}] $
Thus, $AU = U \Lambda$. Solving for $A$:
$A = U \Lambda U^{-1}$
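As a quick sanity check, here is a minimal NumPy sketch (the specific matrix is just a made-up example) verifying both $AU = U \Lambda$ and $A = U \Lambda U^{-1}$:

```python
import numpy as np

# A made-up 2x2 matrix with two distinct real eigenvalues (5 and 2).
A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

# np.linalg.eig returns the eigenvalues and a matrix whose columns are the
# eigenvectors -- the same column convention used for U above.
eigvals, U = np.linalg.eig(A)
Lambda = np.diag(eigvals)

print(np.allclose(A @ U, U @ Lambda))                 # A U = U Λ
print(np.allclose(A, U @ Lambda @ np.linalg.inv(U)))  # A = U Λ U^{-1}
```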
Understanding the multivariate Gaussian
One application [1] is to gain insight into how the multivariate Gaussian works. Recall its equation:
$N(\vec{x} \, | \, \vec{\mu}, \Sigma) = \frac{1}{(2 \pi)^{\frac{D}{2}}} \frac{1}{ | \, \Sigma \, | ^{\frac{1}{2}}} e^{-\frac{1}{2} (\vec{x} - \vec{\mu})^T \, \Sigma^{-1} \, (\vec{x} - \vec{\mu})}$
We can rewrite $\Sigma$, the covariance matrix, as:
$\Sigma = U \Lambda U^{-1}$
We'd like to examine the eigendecomposition of its inverse instead:
$\Sigma^{-1} = (U \Lambda U^{-1})^{-1} = (\Lambda U^{-1})^{-1} U^{-1} = U \Lambda^{-1} U^{-1}$
But since the covariance matrix is symmetric, $U$ is guaranteed to be orthogonal (we can pick our eigenvectors to be orthonormal unit vectors). This means $U^T U = I$, and thus $U^{-1} = U^T$. Therefore:
$\Sigma^{-1} = U \Lambda^{-1} U^{T} = \sum_i^{D} \frac{\vec{u_i} \; \vec{u_i}^T}{\lambda_i}$
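Here is a small numerical check of this outer-product form, using a hypothetical $2 \times 2$ covariance matrix:

```python
import numpy as np

# A hypothetical symmetric, positive-definite covariance matrix.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])

# For symmetric matrices, np.linalg.eigh returns real eigenvalues and
# orthonormal eigenvectors (the columns of U).
eigvals, U = np.linalg.eigh(Sigma)

# Sum of outer products u_i u_i^T / λ_i.
Sigma_inv = sum(np.outer(U[:, i], U[:, i]) / eigvals[i]
                for i in range(len(eigvals)))

print(np.allclose(Sigma_inv, np.linalg.inv(Sigma)))  # True
```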
Plugging this expression for $\Sigma^{-1}$ into the exponent of the Gaussian:
$ (\vec{x} - \vec{\mu})^T \, \big( \sum_i^{D} \frac{\vec{u_i} \; \vec{u_i}^T}{\sqrt{\lambda_i}^2} \big) \, (\vec{x} - \vec{\mu}) $
or
$ \sum_i^{D} \frac{ (\vec{x} - \vec{\mu})^T \vec{u_i} \; \vec{u_i}^T (\vec{x} - \vec{\mu}) }{\sqrt{\lambda_i}^2} $
If we let $y_i = \vec{u_i}^T (\vec{x} - \vec{\mu})$, then this becomes:
$ \sum_i^{D} {(\frac{y_i}{\sqrt{\lambda_i}})}^2 $
This is the equation of an ellipse (an ellipsoid in higher dimensions)! Its level sets are ellipsoids whose axes are aligned with the eigenvectors, and the length of each semi-axis is proportional to $\sqrt{\lambda_i}$, the square root of the corresponding eigenvalue.
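The identity is easy to check numerically. The sketch below evaluates the quadratic form directly and via the projections $y_i$; the covariance, mean, and query point are chosen purely for illustration:

```python
import numpy as np

# Illustrative covariance, mean, and query point.
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
mu = np.array([1.0, -1.0])
x = np.array([2.5, 0.5])

eigvals, U = np.linalg.eigh(Sigma)
d = x - mu

# Direct evaluation of (x - μ)^T Σ^{-1} (x - μ).
direct = d @ np.linalg.inv(Sigma) @ d

# Evaluation via y_i = u_i^T (x - μ), summing (y_i / sqrt(λ_i))^2.
y = U.T @ d
via_projections = np.sum((y / np.sqrt(eigvals)) ** 2)

print(np.isclose(direct, via_projections))  # True
```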
Now, let's take a step back and look at this from a bigger point of view. We begin with an arbitrary vector $\vec{x}$ in space. We feed it into the exponent of the Gaussian, perform a few operations, and get back a scalar value. But what is this scalar value we are getting back? Well, let's follow the steps.
The Translation
The first operation is to subtract the mean: $\vec{x}' = \vec{x} - \vec{\mu}$. We can think of $\vec{x}'$ as living in a shifted coordinate system whose origin is the center of the distribution.
The Rotation
What next? Well, next we compute $\vec{u_i} \cdot \vec{x}'$. What's going on here? Recall that a dot product can be interpreted geometrically as projecting one vector onto another; here, we are projecting $\vec{x}'$ onto the eigenvectors. But remember how the eigenvectors of a covariance matrix work: the eigenvector with the largest eigenvalue points in the direction of maximum variance, the next one does the same subject to being orthogonal to the first, and so on. In other words, we are rotating our coordinate system so that its axes line up with the eigenvectors. However, our components along these axes may still be very long. How long? As long as the axes of the multidimensional ellipsoid. This is why we scale next.
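To see the "direction of maximum variance" claim concretely, here is a sketch with synthetic data stretched along a known direction; the eigenvector attached to the largest eigenvalue of the sample covariance recovers that direction (up to sign):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data: independent axes with std 3.0 and 0.5, rotated by 45 degrees,
# so the true direction of maximum variance is (1, 1) / sqrt(2).
X = rng.normal(size=(5000, 2)) * np.array([3.0, 0.5])
theta = np.pi / 4
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
X = X @ R.T

eigvals, U = np.linalg.eigh(np.cov(X.T))
top = U[:, np.argmax(eigvals)]   # eigenvector with the largest eigenvalue
print(np.round(top, 2))          # roughly ±[0.71, 0.71]
```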
The Scaling
At this point, the distribution still looks ellipsoidal. But now we divide each component by the square root of its eigenvalue. If we do this, we arrive at a spherically symmetric layout, with unit variance along every direction. This is exactly what the $\frac{1}{\sqrt{\lambda_i}}$ factor does.
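Putting the three steps together, here is a sketch of the full translate, rotate, scale pipeline applied to samples from an assumed 2D Gaussian; after the transformation, the sample covariance is (approximately) the identity:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed mean and covariance for the illustration.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
X = rng.multivariate_normal(mu, Sigma, size=10_000)

eigvals, U = np.linalg.eigh(Sigma)

X_centered = X - mu              # translation: move the mean to the origin
Y = X_centered @ U               # rotation: components along the eigenvectors
Z = Y / np.sqrt(eigvals)         # scaling: divide each axis by sqrt(λ_i)

print(np.round(np.cov(Z.T), 2))  # approximately the identity matrix
```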
Mahalanobis Distance
Putting the three steps together, write $z_i = \frac{y_i}{\sqrt{\lambda_i}}$ for the centered, rotated, and scaled coordinates. Then:
$ \sum_i^{D} z_i^2 = (\vec{x} - \vec{\mu})^T \, \Sigma^{-1} \, (\vec{x} - \vec{\mu}) $
To make this distance have units that make sense, all we need to do is take the square root:
$ D_M = \sqrt{\sum_i^{D} z_i^2} = \sqrt{(\vec{x} - \vec{\mu})^T \, \Sigma^{-1} \, (\vec{x} - \vec{\mu})} $
This is the so-called Mahalanobis distance! Plugging it back into the Gaussian, the density can be written compactly as:
$N(\vec{x} \, | \, \vec{\mu}, \Sigma) = \frac{1}{(2 \pi)^{\frac{D}{2}}} \frac{1}{ | \, \Sigma \, | ^{\frac{1}{2}}} e^{-\frac{1}{2} {D_M}^2}$
That is, we are measuring distance in a new coordinate system, one in which the data have been normalized: we center, rotate, and stretch the data points, and then measure ordinary Euclidean distance in this new space.
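As a final check, the sketch below computes $D_M$ with NumPy, plugs it into the density formula, and compares against scipy.stats.multivariate_normal; the mean, covariance, and query point are again made-up values:

```python
import numpy as np
from scipy.stats import multivariate_normal

# Made-up mean, covariance, and query point.
mu = np.array([1.0, -1.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
x = np.array([2.5, 0.5])
D = len(mu)

d = x - mu
D_M = np.sqrt(d @ np.linalg.inv(Sigma) @ d)   # Mahalanobis distance

density = (2 * np.pi) ** (-D / 2) * np.linalg.det(Sigma) ** (-0.5) \
          * np.exp(-0.5 * D_M ** 2)

print(np.isclose(density, multivariate_normal(mu, Sigma).pdf(x)))  # True
```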
References
[1] K. Murphy, Machine Learning: A Probabilistic Perspective. See Section 4.1.2.