The Principal Axis Theorem

In the past few weeks, I have been occupied with a basic mathematical problem that comes up in every undergraduate class on linear algebra: the Principal Axis Theorem (which, in German, has the rather more impressive name „Hauptachsentransformationssatz“, or – as friends of the playwright Brecht might say – “der Satz von der Hauptachsentransformation”. Never mind.) I got there from a discussion with a colleague concerning a certain covariance matrix and what happens when you transform it in a certain way. The key to the discussion was the diagonalization of the covariance matrix via some orthogonal transformation. In the end, I was convinced that everything was alright, but I was sufficiently curious about where this powerful theorem is rooted – and sufficiently far removed from my undergraduate course that I was totally unable to remember the proof.

It didn’t take me long to understand the proof as it stands. It is an inductive proof constructing the orthogonal transformation, and it needs a theorem on the existence of an eigenvalue. Of course, my own handwritten lecture notes used the Fundamental Theorem of Algebra for that, as did the first book on linear algebra I could find. (Just as a side note: in all the years I spent at university, I never encountered a book on linear algebra that I could properly work with or that I would recommend. Some can at least be used as a quick reference, but none were actually useful for self-study to deepen a particular subject.)

This reliance on the Fundamental Theorem of Algebra bugged me. Once you have left your undergraduate studies behind, it is a theorem like many others: use it at will, it is correct. But flipping through my lecture notes, I saw that the proof of the Principal Axis Theorem hangs in midair there, because the Fundamental Theorem of Algebra was out of reach for that lecture. So I ventured to look for a proof that does without that deeper result from complex analysis – thereby ignoring the famous quote by Hadamard:

Le plus court chemin entre deux vérités dans le domaine réel passe par le domaine complexe.

(The shortest path between two truths in the real domain passes through the complex domain.)

In the end, I found it twice: once in the really nice book by Koecher, and again in the rougher book by Fischer. Fischer’s book was on my shelf already; I have owned it since my undergraduate studies, but that book and I never became good friends. Koecher’s book, on the other hand, is new to me. It is a little dated, but it contains many little-known side notes, historical remarks and results that are uncommon in the standard literature. I am still flipping through the rest of Koecher’s book, but I have come to like it. Sadly, the proof of the Principal Axis Theorem there is rather intricate – much deeper than the rest of the book, actually. A nice little quote is connected to the problem at hand:

Die Hauptachsentransformation ist ein so schlagkräftiges Hilfsmittel, dass man immer dann an sie denken sollte, wenn ein Problem über reelle symmetrische Matrizen ansteht

(The principal axis transformation is such a powerful tool that one should always think of it, whenever a problem on real symmetric matrices comes up). Let us therefore have a look at how you can prove this Theorem without leaving the realm of real numbers.

Let’s start with an example, taken from Koecher: the equation

\displaystyle 41x_1^2-24x_1x_2+34x_2^2 = 25

defines a curve in the plane \mathbb{R}^2. Which curve?
The defining equation can also be written as a quadratic form \left\langle x, Ax\right\rangle = 25, using the matrix A=\begin{pmatrix}\hphantom{-}41&-12\\ -12&\hphantom{-}34 \end{pmatrix}. Well, maybe we can apply some rotation to make the equation more accessible. Any rotation is represented by a matrix T=\begin{pmatrix}\cos\alpha& -\sin\alpha\\ \sin\alpha&\hphantom{-}\cos\alpha\end{pmatrix} for some \alpha\in\mathbb{R}. Writing the old coordinates as x = Ty in terms of new coordinates y, the quadratic form becomes \left\langle Ty, ATy\right\rangle = \left\langle y, T^tATy\right\rangle, and we shall denote B=T^tAT. Obviously, our goal is to make B as simple as possible, and among the simplest matrices are diagonal ones. So we want to see B=\begin{pmatrix}a&0 \\ 0&b\end{pmatrix}. What do the constant reals a and b need to be?

Using the abbreviation u=\cos\alpha and v=\sin\alpha, we find the equations

\begin{aligned}  41u^2-24uv+34v^2&=a\\  12v^2-7uv-12u^2&=0\\  41v^2+24uv+34u^2&=b.  \end{aligned}

Besides, ab=\det B = \det T^tAT = \det A = 1250. Those are four equations for four unknowns (note that non-linear dependencies are present; finding the solution is not particularly difficult, but a little cumbersome).

This is sufficient information to get \cos\alpha=\frac35, \sin\alpha=\frac 45, which first shows T=\frac15 \begin{pmatrix}3& -4\\4&\hphantom{-}3\end{pmatrix} and finally B=25\begin{pmatrix}1&0 \\ 0&2\end{pmatrix}. In particular, \alpha\approx 53^\circ, and the curve in transformed coordinates is y_1^2+2y_2^2=1. It’s an ellipse.

We have rotated the axes of \mathbb{R}^2 such that the ellipse appears in its simplest form, without the distracting cross term. Note that we have only rotated: the orthogonal transformation T keeps all distances and angles unchanged.
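For readers who want to check the arithmetic, here is a small sanity check of the rotation we just found – my own addition, using NumPy:

```python
import numpy as np

# Matrix of the quadratic form 41*x1^2 - 24*x1*x2 + 34*x2^2 from the example.
A = np.array([[41.0, -12.0],
              [-12.0, 34.0]])

# The rotation found above: cos(alpha) = 3/5, sin(alpha) = 4/5.
T = np.array([[3.0, -4.0],
              [4.0,  3.0]]) / 5.0

# B = T^t A T should be 25 * diag(1, 2).
B = T.T @ A @ T
print(np.round(B, 10))
```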

So much for our example; it has shown the underlying geometric ideas of what we shall do. Let us now state the theorem in its algebraic form; we shall return to the geometric interpretation later:

Theorem: For any symmetric matrix S\in \mathbb{R}^{n\times n}, one can find some orthogonal matrix T\in O(n), such that T^tST is a diagonal matrix.

For the proof, let us consider the values of the quadratic form \left\langle x, Sx\right\rangle for those x\in\mathbb{R}^n with \left|x\right|=1. This is a continuous function on a compact set (the unit sphere), so it attains its maximum: \mu(S) = \max_{\left|x\right|=1} \left\langle x, Sx\right\rangle is finite, and there is some v with \left|v\right|=1 and \left\langle v, Sv\right\rangle = \mu(S). Hence, \left\langle x, Sx\right\rangle \leq \left\langle v, Sv\right\rangle for any x with \left| x \right| = 1.

Now, let \tau\in(0,1], \sigma = \sqrt{1-\tau^2} and w \bot v with \left|w\right|=1. We consider the vector x = \sigma v + \tau w. We have \left |x\right|^2 = \sigma^2\left|v\right|^2 + \tau^2\left|w\right|^2 + 2\sigma \tau \left\langle v,w\right\rangle = \sigma^2 +\tau^2 = 1. Since S is symmetric, \left\langle Sv, w\right\rangle = \left\langle v, Sw\right\rangle and therefore,

\begin{aligned}  \left\langle v,Sv\right\rangle & \geq \left\langle x,Sx\right\rangle\\  &= \sigma^2\left\langle v,Sv\right\rangle + 2\sigma\tau\left\langle v,Sw\right\rangle + \tau^2\left\langle w,Sw\right\rangle.  \end{aligned}

Division by \tau>0 and the relation \sigma^2+\tau^2=1 show

2\sigma\left\langle w, Sv\right\rangle \leq \tau (\left\langle v, Sv\right\rangle - \left\langle w, Sw\right\rangle).

Without loss of generality, \left\langle w, Sv\right\rangle \geq 0 (we can choose -w instead of w if need be). Letting \tau\to 0 (so that \sigma\to 1), the right-hand side approaches 0, and hence 0 \leq 2\left\langle w, Sv\right\rangle \leq 0, i.e. \left\langle w, Sv\right\rangle=0.

In particular, Sv \bot w. But w was chosen from (\mathbb{R}v)^\bot and so Sv \in ((\mathbb{R}v)^\bot)^\bot = \mathbb{R}v.
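This extremal eigenvector can also be computed numerically by power iteration – a sketch of my own, not from the books, applied to the matrix A of the introductory example (both eigenvalues of A are positive, so the iteration converges to the maximizer of the quadratic form on the unit circle):

```python
import numpy as np

A = np.array([[41.0, -12.0],
              [-12.0, 34.0]])

# Power iteration: repeatedly apply A and renormalize. Since both
# eigenvalues of A are positive, the iterates converge to the unit
# vector v maximizing <x, Ax>.
x = np.array([1.0, 0.0])
for _ in range(100):
    x = A @ x
    x /= np.linalg.norm(x)

mu = x @ A @ x                       # the maximum mu(A) of the quadratic form
print(round(mu, 6))                  # 50.0
assert np.allclose(A @ x, mu * x)    # the maximizer is an eigenvector
```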

This was the first step in our proof of the Principal Axis Theorem. We now need a second auxiliary result before we get to the finale: Let a and b be two vectors in \mathbb{R}^n with \left|a\right|=\left|b\right|. Then there is an orthogonal matrix U such that Ua=b.

One can choose, for instance, the reflection

\displaystyle S_{a-b}(x) := x-2\frac{\left\langle a-b, x\right\rangle}{\left|{a-b}\right|^2}(a-b).

By assumption,

\displaystyle \left|a-b\right|^2 = \left|a\right|^2+\left|b\right|^2-2\left\langle a,b\right\rangle = 2\left\langle a,a\right\rangle-2\left\langle a,b\right\rangle = 2\left\langle a-b,a\right\rangle.

This shows

\displaystyle S_{a-b}(a) = a-2\frac{\left\langle a-b,a\right\rangle}{\left|a-b\right|^2}(a-b) = a-(a-b)=b

and vice versa S_{a-b}(b)=a. It is easy to see that S_{a-b} is actually orthogonal.
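As a tiny illustration (my own, in NumPy), here is the reflection swapping two concrete vectors of equal length:

```python
import numpy as np

def reflection(a, b):
    """The reflection S_{a-b}: x -> x - 2<a-b, x>/|a-b|^2 * (a-b)."""
    d = a - b
    return lambda x: x - 2.0 * np.dot(d, x) / np.dot(d, d) * d

a = np.array([3.0, 4.0])
b = np.array([5.0, 0.0])    # same length: |a| = |b| = 5
S = reflection(a, b)
print(S(a), S(b))           # [5. 0.] [3. 4.]
```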

One can even choose the orthogonal matrix to have positive determinant, but this is not important here.

The rest will be easy now. Let us use induction on the dimension of the matrix S.

The case n=1 is trivial: nothing needs to be done, the matrix is already diagonal. So let the theorem be proved for all symmetric real (n-1)\times(n-1) matrices, n\geq 2, and let S be a real symmetric n\times n matrix. We have already proved that there is some v with \left|v\right|=1 and some \lambda\in\mathbb{R} such that Sv = \lambda v (in fact, we may take \lambda = \mu(S)).

Besides, there is some orthogonal matrix T such that Te_1 = v (by the auxiliary result, since \left|e_1\right|=\left|v\right|=1). Then we find T^tSTe_1 = T^tSv = \lambda T^tv = \lambda e_1, which means T^tST = \begin{pmatrix}\lambda & q \\ 0&S_1\end{pmatrix} for some row vector q. The symmetry of S implies the symmetry of T^tST; hence q=0 and S_1 is a symmetric (n-1)\times(n-1) matrix. By induction, there is an orthogonal T_1 such that T_1^t S_1 T_1 is a diagonal matrix D_1.

With T' = \begin{pmatrix}1&0 \\ 0&T_1\end{pmatrix}, we have {T'}^tT^tSTT' = \begin{pmatrix}\lambda&0 \\ 0&D_1\end{pmatrix}.

The theorem is proved.
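The induction is constructive, so it translates directly into an algorithm. The following NumPy sketch is my own; for brevity, the extremal eigenvector comes from `numpy.linalg.eigh` instead of the maximization argument, but the structure mirrors the proof step by step:

```python
import numpy as np

def householder_to(v):
    """Orthogonal T with T e1 = v (v a unit vector): the reflection S_{e1 - v}."""
    n = len(v)
    e1 = np.zeros(n); e1[0] = 1.0
    d = e1 - v
    if np.dot(d, d) < 1e-12:                  # v is already e1
        return np.eye(n)
    return np.eye(n) - 2.0 * np.outer(d, d) / np.dot(d, d)

def diagonalize(S):
    """The proof as an algorithm: returns orthogonal T with T^t S T diagonal."""
    n = S.shape[0]
    if n == 1:                                # base case: already diagonal
        return np.eye(1)
    lam, vecs = np.linalg.eigh(S)
    v = vecs[:, -1]                           # eigenvector of the maximal eigenvalue
    T = householder_to(v)                     # T e1 = v, so T^t S T = [[lam, 0], [0, S1]]
    S1 = (T.T @ S @ T)[1:, 1:]                # symmetric (n-1) x (n-1) block
    Tp = np.eye(n)
    Tp[1:, 1:] = diagonalize(S1)              # induction step
    return T @ Tp

rng = np.random.default_rng(0)
M = rng.standard_normal((4, 4))
S = (M + M.T) / 2                             # a random symmetric matrix
T = diagonalize(S)
D = T.T @ S @ T
print(np.allclose(D, np.diag(np.diag(D))))   # True: D is diagonal
```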

The entries of the diagonal matrix D=\begin{pmatrix}\lambda_1&&0 \\ &\ddots& \\ 0&&\lambda_n\end{pmatrix} are obviously special – let’s have a closer look at them. They were defined via Sv_i = \lambda_iv_i for some extremal vector v_i; this is precisely the definition of an eigenvalue of S. If you are accustomed to another definition of eigenvalues, it also follows directly from the theorem: one finds

\displaystyle\det(\lambda E-S) = \det(T^t(\lambda E-S)T) = \det(\lambda E-D) = \prod_{i=1}^n(\lambda-\lambda_i).

In particular, if you remember your course on linear algebra: the entries of D are exactly the zeros of the characteristic polynomial of S – they are exactly the eigenvalues of S.
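For the matrix A of the introductory example, the characteristic polynomial is \lambda^2-75\lambda+1250 = (\lambda-25)(\lambda-50), and a quick numerical check (my own, with NumPy) confirms that its roots are exactly the diagonal entries found by the rotation:

```python
import numpy as np

A = np.array([[41.0, -12.0],
              [-12.0, 34.0]])

# eigvalsh returns the eigenvalues of a symmetric matrix in ascending order.
eigs = np.linalg.eigvalsh(A)
print(np.round(eigs, 8))   # [25. 50.]
```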

This is a natural way of introducing eigenvalues: not as an abstract concept, but via an application from inside mathematics. I do like this approach, although it makes the following statement look a lot less impressive than it is: any real symmetric n\times n matrix has n real eigenvalues (counted with multiplicity), and \mathbb{R}^n has an orthonormal basis of eigenvectors. This is not obvious at all.

The key to the proof of the theorem was the existence of an eigenvalue of the symmetric matrix. Without ever leaving the real numbers, we have produced a zero of a certain polynomial of arbitrary degree. As mentioned above, this existence statement is usually obtained via the Fundamental Theorem of Algebra – but we can do perfectly fine without that rather bulky instrument. We cannot, however, do without some input from calculus: we vitally needed the completeness of the real numbers, which guarantees that a continuous function attains its extremal values on a compact set. In particular, the theorem cannot work over an arbitrary field.

The proof of the existence of eigenvalues was taken from Fischer’s book – it comes without too much unnecessary overhead. Koecher used more machinery on positive definite matrices, which clouds the exact spot where the symmetry of the matrix is used. In the proof above, that spot is clear: we need symmetry to simplify the expression in the scalar product. Hence the theorem will, in general, be wrong for non-symmetric matrices. The conclusion of the proof given above was taken from Koecher again. We have used the best of several worlds here.
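A tiny counterexample (my own illustration) showing why symmetry is essential: a plane rotation by 90 degrees has no real eigenvalues at all, so no real change of basis, orthogonal or not, can diagonalize it over \mathbb{R}:

```python
import numpy as np

# Rotation by 90 degrees: not symmetric.
R = np.array([[0.0, -1.0],
              [1.0,  0.0]])

# The characteristic polynomial is lambda^2 + 1; its roots are +-i.
eigs = np.linalg.eigvals(R)
print(eigs)   # the two eigenvalues i and -i (in some order)
```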

On the other hand, the proof of the theorem could be made a lot shorter: there is an eigenvalue (because the characteristic polynomial splits over \mathbb{C}), by symmetry this eigenvalue must be real (because the standard inner product on \mathbb{C}^n is sesquilinear), then transform the eigenvector to a canonical basis vector and use induction (as we did above). That is the strategy in my handwritten lecture notes – at least it is easy to follow. But it hides the inner workings, and it uses heavier results than would be necessary. Different tools for different intents.

The orthogonal transformation comes from the fact that any two vectors of the same length can be transformed into each other. We did it with a reflection here, but a rotation (being the composition of two reflections) would have done the same trick. That even yields a slightly stronger theorem: the eigenvectors are not only orthogonal, but the transformation matrix can also be chosen with positive determinant.

To conclude, we give a little application of the theorem in geometry.

Corollary: Let \left\langle x, Sx\right\rangle + \left\langle x,a\right\rangle +\alpha be a general polynomial of degree 2 in n variables over the real numbers. Then, after an orthogonal change of coordinates, the polynomial can be brought into one of these normal forms:
\displaystyle\sum_{i=1}^n\lambda_ix_i^2 + \beta,
\displaystyle\sum_{i=1}^{n-1}\lambda_ix_i^2+2\gamma x_n.
In particular, any curve of degree 2 (the zero set of such a polynomial) can be classified in this manner.

Note here that the matrix S can be chosen symmetric without loss of generality: the polynomial does not change when each pair of corresponding off-diagonal entries is replaced by its arithmetic mean.
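A two-line numerical check (again my own, with NumPy) that this symmetrization does not change the values of the quadratic form:

```python
import numpy as np

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3))   # an arbitrary, non-symmetric matrix
S = (M + M.T) / 2                 # off-diagonal entries replaced by their means

x = rng.standard_normal(3)
print(np.isclose(x @ M @ x, x @ S @ x))   # True
```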

In the plane \mathbb{R}^2, this classification leads to the classical descriptions of a parabola, a hyperbola, an ellipse, straight lines or single points – there are no other curves of degree 2 in the plane.

We shall not prove the corollary; the details are a little messy (getting rid of the “linear term”, for instance) and not spectacular as they stand. But the proof of the Principal Axis Theorem within the reals is a collection of nice arguments, as we have seen.