What does the determinant have to do with transformed measures?

Let us consider transformations of the space \mathbb{R}^p. How does Lebesgue measure change by this transformation? And how do integrals change? The general case is answered by Jacobi’s formula for integration by substitution. We will start out slowly and only look at how the measure of sets is transformed by linear mappings.

It is folklore in the basic courses on linear algebra, when the determinant of a matrix is introduced, to convey the notion of size of the parallelogram spanned by the column vectors of the matrix. The following theorem shows why this folklore is true; this of course is based on the axiomatic description of the determinant which encodes the notion of size already. But coming from pure axiomatic reasoning, we can connect the axioms of determinant theory to their actual meaning in measure theory.

First, remember the definition of the pushforward measure. Let X and Y be measurable spaces, and f:X\to Y a measurable mapping (i.e. it maps Borel-measurable sets to Borel-measurable sets; we shall not deal with finer details of measure theory here). Let \mu be a measure on X. Then we define a measure on Y in the natural – what would \mu do? – way:

\displaystyle \mu_f(A) := \mu\bigl(f^{-1}(A)\bigr).

In what follows, X = Y = \mathbb{R}^p and \mu = \lambda the Lebesgue measure.

Theorem (Transformation of Measure): Let f be a bijective linear mapping and let A\subset\mathbb{R}^p be a measurable set. Then, the pushforward measure satisfies:

\displaystyle \lambda_f(A) = \left|\det f\right|^{-1}\lambda(A)\qquad\text{for any Borel set }A\in\mathcal{B}^p.

 

We will start with several Lemmas.

Lemma (The Translation-Lemma): Lebesgue measure is invariant under translations.

Proof: Let c,d\in\mathbb{R}^p with c\leq d component-wise. Let t_a be the shift by the vector a\in\mathbb{R}^p, i.e. t_a(v) = v+a and t_a(A) = \{x+a\in\mathbb{R}^p\colon x\in A\}. Then,

\displaystyle t_a^{-1}\bigl((c,d]\bigr) = (c-a,d-a],

where the interval is meant as the cartesian product of the component-intervals. For the Lebesgue-measure, we get

\displaystyle \lambda_{t_a}\bigl((c,d]\bigr) = \lambda\bigl((c-a,d-a]\bigr) = \prod_{i=1}^p\bigl((d-a)-(c-a)\bigr) = \prod_{i=1}^p(d-c) = \lambda\bigl((c,d]\bigr).

The measures \lambda, \lambda_{t_a} hence agree on the rectangles (open on their left-hand sides), i.e. on a semi-ring generating the \sigma-algebra \mathcal{B}^p. With the usual arguments (which might as well involve \cap-stable Dynkin-systems, for instance), we find that the measures agree on the whole \mathcal{B}^p.

q.e.d.

 

Lemma (The constant-multiple-Lemma): Let \mu be a translation-invariant measure on \mathcal{B}^p, and \alpha:=\mu\bigl([0,1]^p\bigr) < \infty. Then \mu(A) = \alpha\lambda(A) for any A\in\mathcal(B)^p.

 

Note that the Lemma only holds for finite measures on [0,1]^p. For instance, the counting measure is translation-invariant, but it is not a multiple of Lebesgue measure.

Proof: We divide the set (0,1]^p via a rectangular grid of sidelengths \frac1{n_i}, i=1,\ldots,p:

\displaystyle (0,1]^p = \bigcup_{\stackrel{k_j=0,\ldots,n_j-1}{j=1,\ldots,p}}\left(\times_{i=1}^p\left(0,\frac1{n_i}\right] + \left(\frac{k_1}{n_1},\ldots,\frac{k_p}{n_p}\right)\right).

On the right-hand side there are \prod_{i=1}^pn_i sets which have the same measure (by translation-invariance). Hence,

\displaystyle \mu\bigl((0,1]^p\bigr) = \mu\left(\bigcup\cdots\right) = n_1\cdots n_p \cdot \mu\left(\times_{i=1}^p \left(0,\frac1{n_i}\right]\right).

Here, we distinguish three cases.

 

Case 1: \mu\bigl((0,1]^p\bigr) = 1. Then,

\displaystyle\mu\left(\times_{i=1}^p (0,\frac1{n_i}]\right) = \prod_{i=1}^p\frac1{n_i} = \lambda\left(\times_{i=1}^p (0,\frac1{n_i}]\right).

By choosing appropriate grids and further translations, we get that \mu(I) = \lambda(I) for any rectangle I with rational bounds. Via the usual arguments, \mu=\lambda on the whole of \mathcal{B}^p.

Case 2: \mu\bigl((0,1]^p\bigr) \neq 1 and >0. By assumption, however, this measure is finite. Setting \gamma = \mu\bigl((0,1]^p\bigr), we can look at the measure \gamma^{-1}\mu, which of course has \gamma^{-1}\mu\bigl((0,1]^p\bigr) = 1. By Case 1, \gamma^{-1}\mu = \lambda.

Case 3: \mu\bigl((0,1]^p\bigr) = 0. Then, using translation invariance again,

\displaystyle \mu(\mathbb{R}^p) = \mu\bigl(\bigcup_{z\in\mathbb{Z}^p}((0,1]^p+z)\bigr) = \sum_{z\in\mathbb{Z}^p}\mu\bigl((0,1]^p\bigr) = 0.

Again, we get \mu(A) = 0 for all A\in\mathcal{B}^p.

 

That means, in all cases, \mu is equal to a constant multiple of \lambda, the constant being the measure of (0,1]^p. That is not quite what we intended, as we wish the constant multiple to be the measure of the compact set [0,1]^p.

Remember our setting \alpha:=\mu\bigl([0,1]^p\bigr) and \gamma := \mu\bigl((0,1]^p\bigr). Let A\in\mathcal{B}^p. We distinguish another two cases:

 

Case a) \alpha = 0. By monotony, \gamma = 0 and Case 3 applies: \mu(A) = 0 = \alpha\lambda(A).

Case b) \alpha > 0. By monotony and translation-invariance,

\displaystyle \alpha \leq \mu\bigl((-1,1]^p\bigr) = 2^p\gamma,

meaning \gamma\geq\frac{\alpha}{2^p}. Therefore, as \alpha>0, we get \gamma>0, and by Case 1, \frac1\gamma\mu(A) = \lambda(A). In particular,

\displaystyle \frac\alpha\gamma = \frac1\gamma\mu\bigl([0,1]^p\bigr) = \lambda\bigl([0,1]^p\bigr) = 1,

and so, \alpha = \gamma, meaning \mu(A) = \gamma\lambda(A) = \alpha\lambda(A).

q.e.d.

 

Proof (of the Theorem on Transformation of Measure). We will first show that the measure \lambda_f is invariant under translations.

We find, using the Translation-Lemma in (\ast), and the linearity of f before that,

\displaystyle    \begin{aligned}    \lambda_{t_a\circ f}(A) = \lambda_f(A-a) &\stackrel{\hphantom{(\ast)}}{=} \lambda\bigl(f^{-1}(A-a)\bigr) \\    &\stackrel{\hphantom{(\ast)}}{=} \lambda\bigl(f^{-1}(A) - f^{-1}(a)\bigr) \\    &\stackrel{(\ast)}{=} \lambda\bigl(f^{-1}(A)\bigr) \\    &\stackrel{\hphantom{(\ast)}}{=} \lambda_f(A),    \end{aligned}

which means that \lambda_f is indeed invariant under translations.

As [0,1]^p is compact, so is f^{-1}\bigl([0,1]^p\bigr) – remember that continuous images of compact sets are compact (here, the continuous mapping is f^{-1}). In particular, f^{-1}\bigl([0,1]^p\bigr) is bounded, and thus has finite Lebesgue measure.

We set c(f) := \lambda_f\bigl([0,1]^p)\bigr). By the constant-multiple-Lemma, \lambda_f is a multiple of Lebesgue measure: we must have

\displaystyle \lambda_f(A) = c(f)\lambda(A)\text{ for all }A\in\mathcal{B}^p.\qquad (\spadesuit)

We only have left to prove that c(f) = \left|\det f\right|^{-1}. To do this, there may be two directions to follow. We first give the way that is laid out in Elstrodt’s book (which we are basically following in this whole post). Later, we shall give the more folklore-way of concluding this proof.

We consider more and more general fashions of the invertible linear mapping f.

Step 1: Let f be orthogonal. Then, for the unit ball B_1(0),

\displaystyle c(f)\lambda\bigl(B_1(0)\bigr) \stackrel{(\spadesuit)}{=} \lambda_f\bigl(B_1(0)\bigr) = \lambda\left(f^{-1}\bigl(B_1(0)\bigr)\right) = \lambda\bigl(B_1(0)\bigr).

This means, that c(f) = 1 = \left|\det(f)\right|.

This step shows for the first time how the properties of a determinant encode the notion of size already: we have only used the basic lemmas on orthogonal matrices (they leave distances unchanged and hence the ball B_1(0) doesn’t transform; besides, their inverse is their adjoint) and on determinants (they don’t react to orthogonal matrices because of their multiplicative property and because they don’t care for adjoints).

Step 2: Let f have a representation as a diagonal matrix (using the standard basis of \mathbb{R}^p). Let us assume w.l.o.g. that f = \mathrm{diag}(d_1,\ldots,d_p) with d_i>0. The case of d_i<0 is only notationally cumbersome. We get

\displaystyle c(f)\lambda_f\bigl([0,1]^p\bigr) = \lambda\left(f^{-1}\bigl([0,1]^p\bigr)\right) = \lambda\bigl(\times_{i=1}^p[0,d_i^{-1}]\bigr) = \prod_{i=1}^pd_i^{-1} = \left|\det f\right|^{-1}.

Again, the basic lemmas on determinants already make use of the notion of size without actually saying so. Here, it is the computation of the determinant by multiplication of the diagonal.

Step 3: Let f be linear and invertible, and let f^\ast be its adjoint. Then f^\ast f is non-negative definite (since for x\in\mathbb{R}^p, x^\ast(f^\ast f)x = (fx)^\ast(fx) = \left\|fx\right\|^2\geq0). By the Principal Axis Theorem, there is some orthogonal matrix v and some diagonal matrix with non-negative entries d with f^\ast f = vd^2v^\ast. As f was invertible, no entry of d may vanish here (since then, its determinant would vanish and in particular, f would no longer be invertible). Now, we set

\displaystyle w:=d^{-1}v^\ast f^\ast,

which is orthogonal because of

\displaystyle ww^\ast = d^{-1} v^\ast f^\ast fv d^{-1} = d^{-1}v^\ast (vd^2v^\ast)v d^{-1} = d^{-1}d^2d^{-1}v^\ast v v^\ast v = \mathrm{id}.

As f = w^\ast dv, we see from Step 1

\displaystyle    \begin{aligned}    c(f) = \lambda_f\bigl([0,1]^p\bigr) &= \lambda\left(v^{-1}d^{-1}w\bigl([0,1]^p\bigr)\right) \\    &= \lambda\left(d^{-1}w\bigl([0,1]^p\bigr)\right)\\    &= \left|\det d\right|^{-1}\lambda\left(w\bigl([0,1]^p\bigr)\right) \\    &= \left|\det d\right|^{-1}\lambda\bigl([0,1]^p\bigr) \\    &= \left|\det f\right|^{-1}\lambda\bigl([0,1]^p\bigr) \\    &= \left|\det f\right|^{-1},    \end{aligned}

by the multiplicative property of determinants again (\det f = \det d).

q.e.d.(Theorem)

 

As an encore, we show another way to conclude in the Theorem, once all the Lemmas are shown and applied. This is the more folklore way alluded to in the proof, making use of the fact that any invertible matrix is the product of elementary matrices (and, of course, making use of the multiplicative property of determinants). Hence, we only consider those.

Because Step 2 of the proof already dealt with diagonal matrices, we only have to look at shear-matrices like E_{ij}(r) := \bigl(\delta_{kl}+r\delta_{ik}\delta_{jl}\bigr)_{k,l=1,\ldots,p}. They are the identity matrix with the (off-diagonal) entry r in row i and column j. One readily finds \bigl(E_{ij}(r)\bigr)^{-1} = E_{ij}(-r), and \det E_{ij}(r) = 1. Any vector v\in[0,1]^p is mapped to

\displaystyle E_{ij}(v_1,\ldots,v_i,\ldots, v_p)^t = (v_1,\ldots,v_i+rv_j,\ldots,v_p)^t.

This gives

\displaystyle    \begin{aligned}\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) &= \lambda\left(E_{ij}(-r)\bigl([0,1]^p\bigr)\right) \\    &= \lambda\left(\left\{x\in\mathbb{R}^p\colon x=(v_1,\ldots,v_i-rv_j,\ldots,v_p), v_k\in[0,1]\right\}\right).    \end{aligned}

This is a parallelogram that may be covered by n rectangles as follows: we fix the dimension i and one other dimension to set a rectangle of height \frac1n, width 1+\frac rn (all other dimension-widths = 1; see the image for an illustration). Implicitly, we have demanded that p\geq2 here; but p=1 is uninteresting for the proof, as there are too few invertible linear mappings in \mathbb{R}^1.

By monotony, this yields

\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) \leq n\frac1n\left(1+\frac{r}{n}\right) = 1+\frac rn\xrightarrow{n\to\infty}1.

On the other hand, this parallelogram itself covers the rectangles of width 1-\frac rn, and a similar computation shows that in the limit \lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr)\geq1.

In particular: \lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) = 1 = \left|\det E_{ij}(r)\right|^{-1}.

q.e.d. (Theorem encore)

 

Proving the multidimensional transformation formula for integration by substitution is considerably more difficult than in one dimension, where it basically amounts to reading the chain rule reversedly. Let us state the formula here first:

Theorem (The Transformation Formula, Jacobi): Let U,V\subset \mathbb{R}^p be open sets and let \Phi:U\to V be a \mathcal{C}^1-diffeomorphism (i.e. \Phi^{-1} exists and both \Phi and \Phi^{-1} are \mathcal{C}^1-functions). Let f:V\to\mathbb{R} be measurable. Then, f\circ\Phi:U\to\mathbb{R} is measurable and

\displaystyle\int_V f(t)dt = \int_U f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds.

 

At the core of the proof is the Theorem on Transformation of Measure that we have proved above. The idea is to approximate \Phi by linear mappings, which locally transform the Lebesgue measure underlying the integral and yield the determinant in each point as correction factor. The technical difficulty is to show that this approximation does no harm for the evaluation of the integral.

We will need a lemma first, which carries most of the weight of the proof.

 

The Preparatory Lemma: Let U,V\subset \mathbb{R}^p be open sets and let \Phi:U\to V be a \mathcal{C}^1-diffeomorphism. If X\subset U is a Borel set, then so is \Phi(X)\subset V, and

\displaystyle \lambda\bigl(\Phi(X)\bigr)\leq\int_X\left|\det\Phi'(s)\right|ds.

Proof: Without loss of generality, we can assume that \Phi, \Phi' and (\Phi')^{-1} are defined on a compact set K\supset U. We consider, for instance, the sets

\displaystyle U_k:=\left\{ x\in U\colon \left|x\right|<k, \mathrm{dist}\bigl(x,U^\complement\bigr)>\frac1k\right\}.

The U_k are open and bounded, \overline U_k is hence compact, and there is a chain U_k\subset\overline U_k\subset U_{k+1}\subset\cdots for all k, with U=\bigcup_kU_k. To each U_k there is, hence, a compact superset on which \Phi, \Phi' and (\Phi')^{-1} are defined. Now, if we can prove the statement of the Preparatory Lemma on X_k := X\cap U_k, it will also be true on X=\lim_kX_k by the monotone convergence theorem.

As we can consider all relevant functions to be defined on compact sets, and as they are continuous (and even more) by assumption, they are readily found to be uniformly continuous and bounded.

It is obvious that \Phi(X) will be a Borel set, as \Phi^{-1} is continuous.

 

Let us prove that the Preparatory Lemma holds for rectangles I with rational endpoints, being contained in U.

There is some r>0 such that for any a\in I, B_r(a)\subset U. By continuity, there is a finite constant M with

\displaystyle M:=\sup_{t\in I}\left\|\bigl(\Phi'(t)\bigr)^{-1}\right\|,

and by uniform continuity, r may be chosen small enough such that, for any \varepsilon>0, even

\displaystyle \sup_{x\in B_r(a)}\left\|\Phi'(x)-\Phi'(a)\right\|\leq\frac{\varepsilon}{M\sqrt p} \text{ for every }a\in I.

With this r, we may now sub-divide our rectangle I into disjoint cubes I_k of side-length d such that d<\frac{r}{\sqrt p}. In what follows, we shall sometimes need to consider the closure \overline I_k for some of the estimates, but we shall not make the proper distinction for reasons of legibility.

For any given b\in I_k, every other point c of I_k may at most have distance d in each of its components, which ensures

\displaystyle\left\|b-c\right\|^2 \leq \sum_{i=1}^pd^2 = pd^2 < r^2.

This, in turn, means, I_k\subset B_r(b) (and B_r(b)\subset U has been clear because of the construction of r).

Now, in every of the cubes I_k, we choose the point a_k\in I_k with

\displaystyle\left|\det\Phi'(a_k)\right| = \min_{t\in I_k}\left|\det\Phi'(t)\right|,

and we define the linear mapping

\displaystyle \Phi_k:=\Phi'(a_k).

Remember that for convex sets A, differentiable mappings h:A\to\mathbb{R}^p, and points x,y\in A, the mean value theorem shows

\displaystyle \left\|h(x)-h(y)\right\|\leq\left\|x-y\right\|\sup_{\lambda\in[0,1]}\left\|h'\bigl(x+\lambda(y-x)\bigr)\right\|.

Let a\in I_k be a given point in one of the cubes. We apply the mean value theorem to the mapping h(x):=\Phi(x)-\Phi_k(x), which is certainly differentiable, to y:=a_k, and to the convex set A:=B_r(a):

\displaystyle    \begin{aligned}    \left\|h(x)-h(y)\right\|&\leq\left\|x-y\right\|\sup_{\lambda\in[0,1]}\left\|h'\bigl(x+\lambda(y-x)\bigr)\right\|\\    \left\|\Phi(x)-\Phi_k(x)-\Phi(a_k)+\Phi_k(a_k)\right\| & \leq \left\|x-a_k\right\|\sup_{\lambda\in[0,1]}\left\|\Phi'\bigl(x+\lambda(x-a_k)\bigr)-\Phi'(a_k)\right\|\\    \left\|\Phi(x)-\Phi(a_k)-\Phi_k(x-a_k)\right\| &< \left\|x-a_k\right\| \frac{\varepsilon}{M\sqrt p}\qquad (\clubsuit).    \end{aligned}

Note that as a_k\in I_k\subset B_r(a), x+\lambda(x-a_k)\in B_r(a) by convexity, and hence the upper estimate of uniform continuity is applicable. Note beyond that, that \Phi_k is the linear mapping \Phi'(a_k) and the derivative of a linear mapping is the linear mapping itself.

Now, \left\|x-a_k\right\|< d\sqrt p, as both points are contained in I_k, and hence (\clubsuit) shows

\displaystyle    \begin{aligned}    \Phi(I_k) &\subset \Phi(a_k)+\Phi_k(I_k-a_k)+B_{\frac{\varepsilon}{M\sqrt p}d\sqrt p}(0) \\    &\subset \Phi(a_k)+\Phi_k(I_k-a_k)+B_{\frac{d\varepsilon}{M}}(0).    \end{aligned}

By continuity (and hence boundedness) of \Phi', we also have

\displaystyle \left\|(\Phi_k)^{-1}(x)\right\|\leq\left\|\bigl(\Phi'(a_k)\bigr)^{-1}\right\|\left\|x\right\|\leq M \left\|x\right\|,

which means B_{\frac{d\varepsilon}{M}}(0) = \Phi_k\left(\Phi_k^{-1}\bigl(B_{\frac{d\varepsilon}{M}}(0)\bigr)\right) \subset \Phi_k\bigl(B_{d\varepsilon}(0)\bigr).

Hence:

\displaystyle \Phi(I_k) \subset \Phi(a_k) + \Phi_k\bigl(I_k-a_k+B_{d\varepsilon}(0)\bigr).

Why all this work? We want to bound the measure of the set \Phi(I_k), and we can get it now: the shift \Phi(a_k) is unimportant by translation invariance. And the set I_k-a_k+B_{d\varepsilon}(0) is contained in a cube of side-length d+2d\varepsilon. As promised, we have approximated the mapping \Phi by a linear mapping \Phi_k on a small set, and the transformed set has become only slightly bigger. By the Theorem on Transformation of Measure, this shows

\displaystyle    \begin{aligned}    \lambda\bigl(\Phi(I_k)\bigr) &\leq \lambda\left(\Phi_k\bigl(I_k-a_k+Blat_{d\varepsilon}(0)\bigr)\right) \\    &=\left|\det\Phi_k\right|\lambda\bigl(I_k-a_k+B_{d\varepsilon}(0)\bigr)\\    &\leq \left|\det\Phi_k\right|d^p(1+2\varepsilon)^p \\    &= \left|\det\Phi_k\right|(1+2\varepsilon)^p\lambda(I_k).    \end{aligned}

Summing over all the cubes I_k of which the rectangle I was comprised, (remember that \Phi is a diffeomorphism and disjoint sets are kept disjoint; besides, a_k has been chosen to be the point in I_k of smallest determinant for \Phi')

\displaystyle    \begin{aligned}    \lambda\bigl(\Phi(I)\bigr) &\leq (1+2\varepsilon)^p\sum_{k=1}^n\left|\det \Phi_k\right|\lambda(I_k) \\    &= (1+2\varepsilon)^p\sum_{k=1}^n\left|\det \Phi'(a_k)\right|\lambda(I_k)\\    &= (1+2\varepsilon)^p\sum_{k=1}^n\int_{I_k}\left|\det\Phi'(a_k)\right|ds\\    &\leq (1+2\varepsilon)^p\int_I\left|\det\Phi'(s)\right|ds.    \end{aligned}

Taking \varepsilon\to0 yields to smaller subdivisions I_k and in the limit to the conclusion. The Preparatory Lemma holds for rectangles.

 

Now, let X\subset U be any Borel set, and let \varepsilon>0. We cover X by disjoint (rational) rectangles R_k\subset U, such that \lambda\bigl(\bigcup R_k \setminus X\bigr)<\varepsilon. Then,

\displaystyle    \begin{aligned}    \lambda\bigl(\Phi(X)\bigr) &\leq \sum_{k=1}^\infty \lambda\bigl(\Phi(R_k)\bigr)\\    &\leq\sum_{k=1}^\infty\int_{R_k}\left|\det \Phi'(s)\right|ds\\    &= \int_{\bigcup R_k}\left| \det\Phi'(s)\right| ds\\    &= \int_X\left| \det\Phi'(s)\right| ds + \int_{\bigcup R_k\setminus X}\left|\det\Phi'(s)\right|ds\\    &\leq \int_X\left| \det\Phi'(s)\right| ds + M\lambda\left(\bigcup R_k\setminus X\right)\\    &\leq \int_X\left| \det\Phi'(s)\right| ds + M\varepsilon.    \end{aligned}

If we let \varepsilon\to0, we see \lambda\bigl(\Phi(X)\bigr)\leq\int_X\bigl|\det\Phi'(s)\bigr|ds.

q.e.d. (The Preparatory Lemma)

 

We didn’t use the full generality that may be possible here: we already focused ourselves on the Borel sets, instead of the larger class of Lebesgue-measurable sets. We shall skip the technical details that are linked to this topic, and switch immediately to the

Proof of Jacobi’s Transformation Formula: We can focus on non-negative functions f without loss of generality (take the positive and the negative part separately, if needed). By the Preparatory Lemma, we already have

\displaystyle\begin{aligned}    \int_{\Phi(U)}\mathbf{1}_{\Phi(X)}(s)ds &= \int_{V}\mathbf{1}_{\Phi(X)}(s)ds\\    &= \int_{\Phi(X)}ds\\    &= \lambda\bigl(\Phi(X)\bigr)\\    &\leq \int_X\left|\det\Phi'(s)\right|ds\\    &= \int_U\mathbf{1}_X(s)\left|\det\Phi'(s)\right|ds\\    &= \int_U\mathbf{1}_{\Phi(X)}\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds,    \end{aligned}

which proves the inequality

\displaystyle \int_{\Phi(U)}f(t)dt \leq \int_U f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds,

for indicator functions f = \mathbf{1}_{\Phi(X)}. By usual arguments (linearity of the integral, monotone convergence), this also holds for any measurable function f. To prove the Transformation Formula completely, we apply this inequality to the transformation \Phi^{-1} and the function g(s):=f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|:

\displaystyle    \begin{aligned}    \int_Uf\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds &= \int_{\Phi^{-1}(V)}g(t)dt\\    &\leq \int_Vg\bigl(\Phi^{-1}(t)\bigr)\left|\det(\Phi^{-1})'(t)\right|dt\\    &=\int_{\Phi(U)}f\Bigl(\Phi\bigl(\Phi^{-1}(t)\bigr)\Bigr)\left|\det\Phi'\bigl(\Phi^{-1}(t)\bigr)\right|\left|\det(\Phi^{-1})'(t)\right|dt\\    &=\int_{\Phi(U)}f(t)dt,    \end{aligned}

since the chain rule yields \Phi'\bigl(\Phi^{-1}\bigr)(\Phi^{-1})' = \bigl(\Phi(\Phi^{-1})\bigr)' = \mathrm{id}. This means that the reverse inequality also holds. The Theorem is proved.

q.e.d. (Theorem)

 

There may be other, yet more intricate proofs of this Theorem. We shall not give any other of them here, but the rather mysterious looking way in which the determinant pops up in the transformation formula is not the only way to look at it. There is a proof by induction, given in Heuser’s book, where the determinant just appears from the inductive step. However, there is little geometric intuition in this proof, and it is by no means easier than what we did above (as it make heavy use of the theorem on implicit functions). Similar things may be said about the rather functional analytic proof in Königsberger’s book (who concludes the transformation formula by step functions converging in the L^1-norm, he found the determinant pretty much in the same way that we did).

 

Let us harvest a little of the hard work we did on the Transformation Formula. The most common example is the integral of the standard normal distribution, which amounts to the evaluation of

\displaystyle \int_{-\infty}^\infty e^{-\frac12x^2}dx.

This can happen via the transformation to polar coordinates:

\Phi:(0,\infty)\times(0,2\pi)\to\mathbb{R}^2,\qquad (r,\varphi)\mapsto (r\cos \varphi, r\sin\varphi).

For this transformation, which is surjective on all of \mathbb{R}^2 except for a set of measure 0, we find

\Phi'(r,\varphi) = \begin{pmatrix}\cos\varphi&-r\sin\varphi\\\sin\varphi&\hphantom{-}r\cos\varphi\end{pmatrix},\qquad \det\Phi'(r,\varphi) = r.

From the Transformation Formula we now get

\displaystyle    \begin{aligned}    \left(\int_{-\infty}^\infty e^{-\frac12x^2}dx\right)^2 &= \int_{\mathbb{R}^2}\exp\left(-\frac12x^2-\frac12y^2\right)dxdy\\    &= \int_{\Phi\left((0,\infty)\times(0,2\pi)\right)}\exp\left(-\frac12x^2-\frac12y^2\right)dxdy\\    &= \int_{(0,\infty)\times(0,2\pi)}\exp\left(-\frac12r^2\cos^2(\varphi)-\frac12r^2\sin^2(\varphi)\right)\left|\det\Phi'(r,\varphi)\right|drd\varphi\\    &= \int_{(0,\infty)\times(0,2\pi)}\exp\left(-\frac12r^2\right)rdrd\varphi\\    &= \int_0^\infty\exp\left(-\frac12r^2\right)rdr\int_0^{2\pi}d\varphi \\    &= 2\pi \left[-\exp\left(-\frac12r^2\right)\right]_0^\infty\\    &= 2\pi \left(1-0\right)\\    &= 2\pi.    \end{aligned}

In particular, \int_{-\infty}^\infty e^{-\frac12x^2}dx=\sqrt{2\pi}. One of the very basic results in probability theory.

 

Another little gem that follows from the Transformation Formula are the Fresnel integrals

\displaystyle \int_{0}^\infty \cos(x^2)dx = \int_{0}^\infty\sin(x^2)dx = \sqrt{\frac{\pi}{8}}.

 

They follow from the same basic trick given above for the standard normal density, but as other methods for deriving this result involve even trickier uses of similarly hard techniques (the Residue Theorem, for instance, as given in Remmert’s book), we shall give the proof of this here:

Consider

\displaystyle F(t)=\int_{0}^\infty e^{-tx^2}\cos(x^2)dx\qquad\text{and}\qquad\int_{0}^\infty e^{-tx^2}\sin(x^2)dx.

Then, the trigonometric identity \cos a+b = \cos a \cos b - \sin a\sin b tells us

\displaystyle    \begin{aligned}    \bigl(F(t)\bigr)^2 - \big(G(t)\bigr)^2 &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\cos(x^2)\cos(y^2)dxdy - \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\sin(x^2)\sin(y^2)dxdy\\    &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\bigl(\cos(x^2+y^2)+\sin(x^2)\sin(y^2) - \sin(x^2)\sin(y^2)\bigr)dxdy\\    &= \int_0^\infty\int_0^{\frac\pi2}e^{-tr^2}\cos(r^2)rd\varphi dr\\    &= \frac\pi2 \int_0^\infty e^{-tu}\cos u\frac12du.    \end{aligned}

This integral can be evaluated by parts to show

\displaystyle \int_0^\infty e^{-tu}\cos udu\left(1+\frac1{t^2}\right) = \frac1t,

which means

\displaystyle \bigl(F(t)\bigr)^2 - \bigl(G(t)\bigr)^2 = \frac\pi4\int_0^\infty\int_0^\infty e^{-tu}\cos udu = \frac\pi4\frac t{t^2+1}.

Then we consider the product F(t)G(t) and use the identity \sin(a+b) = \cos a\sin b + \cos b\sin a, as well as the symmetry of the integrand and integration by parts, to get

\displaystyle    \begin{aligned}    F(t)G(t) &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\bigl(\cos(x^2)\sin(y^2)\bigr)dxdy\\    &=2\int_0^\infty\int_0^ye^{-t(x^2+y^2)}\sin(x^2+y^2)dxdy\\    &=2\int_0^\infty\int_0^{\frac\pi8}e^{-tr^2}\sin(r^2)rd\varphi dr\\    &=\frac\pi4\int_0^\infty e^{-tr^2}\sin(r^2)r dr\\    &=\frac\pi4\int_0^\infty e^{-tu}\sin u\frac12du\\    &=\frac\pi8\frac1{1+t^2}.    \end{aligned}

We thus find by the dominated convergence theorem

\displaystyle    \begin{aligned}    \left(\int_0^\infty\cos x^2dx\right)^2-\left(\int_0^\infty\sin x^2dx\right)^2 &= \left(\int_0^\infty\lim_{t\downarrow0}e^{-tx}\cos x^2dx\right)^2-\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx}\sin x^2dx\right)^2 \\    &=\lim_{t\downarrow0}\left(\bigl(F(t)\bigr)^2-\bigl(G(t)\bigr)^2\right)\\    &=\lim_{t\downarrow0}\frac\pi4\frac{t}{t^2+1}\\    &=0,    \end{aligned}

and

\displaystyle    \begin{aligned}    \left(\int_0^\infty\cos x^2dx\right)^2 &= \left(\int_0^\infty\cos x^2dx\right)\left(\int_0^\infty\sin x^2dx\right)\\    &=\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx}\cos x^2dx\right)\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx}\sin x^2dx\right)\\    &=\lim_{t\downarrow0}F(t)G(t)\\    &=\lim_{t\downarrow0}\frac\pi8\frac1{1+t^2}\\    &=\frac\pi8.    \end{aligned}

One can easily find the bound that both integrals must be positive and from the first computation, we get

\int_0^\infty\cos x^2dx = \int_0^\infty\sin x^2dx,

from the second computation follows that the integrals have value \sqrt{\frac\pi8}.

q.e.d. (Fresnel integrals)

 

Even Brouwer’s Fixed Point Theorem may be concluded from the Transformation Formula (amongst a bunch of other theorems, none of which is actually as deep as this one though). This is worthy of a seperate text, mind you.

Johann Kepler and how planets move

Kepler’s Laws of planetary motion follow very smoothly from Newton’s Law of Gravitation. Very little tough mathematics is needed for the proofs, it can actually be done with ordinary differential calculus and some knowledge on path integration.

Of course, from a historical point of view, these laws appeared reversed. Kepler had neither Newton’s Law at his disposal, nor had he sufficient use of the calculus machinery that we have today. Instead, his way of coming up with his Laws included years of hard work on the astronomical tables compiled by Tycho Brahe; Kepler himself was unable to make astronomical observations himself, even with his self-invented telescope, as he was ill-sighted for all his life. Later, Newton could rely on Kepler’s results to find inspiration for his Law of Gravitation: indeed, unless he could relate his results to Kepler’s laws, he knew his results to be incomplete. Together with his many other achievements (for instance, Kepler was the first to state Simpson’s rule, which is accordingly called “Kepler’s barrel rule” sometimes) this makes him one of the most interesting minds of the early modern era. His interest in planet’s orbits was rooted in astrology and the construction of horoscopes; his observations and deductions led him to drop both Copernicus’ thought that the orbits were circles and, later, that there were no platonic solids involved. Both Copernicus and Kepler had revolutionized the thoughts on space itself, Copernicus by showing how much easier the orbits can be described when Earth is allowed to move itself (no more epicycles and the like), but Kepler made it even easier by dropping circles altogether.

To consider how hard the problem of finding the planet’s orbits is, consider that we have plenty of observational data of the planets, however the data contain two unknowns: the orbit of Earth and the orbit of the planet. On top of that, our observations only tell us about the angle under which the planet is observed, we don’t learn anything about the distances involved (unless we have Kepler’s third law at our disposal). In a very insightful talk for a wider audience, Terry Tao has explained some of Kepler’s ideas on this, especially how Kepler dealt with the orbit of Mars which had been for several reasons the most tricky one of the orbits in the models that preceded Kepler. Tao mentions that Einstein valued the finding of Kepler’s Laws one of the most shining moments in the history of human curiosity. “And if Einstein calls you a genius, you are really doing well.

From these Laws and from what Newton and his successors achieved, many things can be inferred that are impossible to measure directly. For instance the mass of the Sun and all planets can be computed from here, once the gravitational constant is known (which is tricky to pinpoint, actually). Voltaire is quoted with the sentence, regarding Newton’s achievements but this would fit to Kepler as well, that the insights gained “semblaient n’être pas faites pour l’esprit humain.

To give just a tiny bit of contrast, we mention that Kepler also had erroneous thoughts that show how deeply he was still rooted in ancient ideas of harmonics and aesthetics. For instance, Kepler tried to prove why the Solar system had exactly six planets (or to rephrase a little more accurately to his thinking: why God had found pleasure in creating exactly six planets). For some time, he believed that the reason was related to the fact that there are exactly five platonic solids which define the structure of the six orbits around the sun. Those were ideas also related to the integer harmonies of a vibrating string, as the planets were supposed to move in a harmonical way themselves. Of course, in these days the observations were limited up to Saturn, as the outer planets (and dwarf planets) cannot be found by eyesight or the telescopes at Kepler’s disposal; all such ideas were doomed to be incomplete. However, his quest for harmonics in the Solar system led him in the end to his Third Law. On another account, Kepler was mistaken in the deduction of his First Law, since he lacked the deep knowledge about integration that would be developed decades later; luckily, his mistake cancels out with another mistake later on: “Es ist schon atemberaubend, wie sich bei Keplers Rechnungen letztlich doch alles fügt.” (“It is stunning how everything in Kepler’s computations adds up in the end”; have a look at Sonar’s highly readable and interesting book on the history of calculus for this).

In what follows, we shall show how the Kepler’s Laws can be proved, assuming Newton’s Law of Gravitation, in a purely mathematical fashion. There will be no heuristics from physics or from astronomy, only the axiomatic mathematical deduction that mostly works without any intuition from the applications (though we will look at motivations for why some definitions are made the way they are).

As a nice aside, we can look at the mathematical descriptions of the conic sections on which the first Law relies. But here again, there’s no connection to why these curve are called this way.

Let us state Kepler’s Laws here first.

Kepler’s First Law of Planetary Motion: Planets orbit in ellipses, with the Sun as one of the foci.

Kepler’s Second Law of Planetary Motion: A planet sweeps out equal areas in equal times.

Kepler’s Third Law of Planetary Motion: The square of the period of an orbit is proportional to the cube of its semi-major axis.

Let us prove this, by following the account of Königsberger’s book on calculus. Many calculus books deal with Kepler’s Laws in a similar axiomatical fashion, yet we stick to this account as it appears to be the neatest one without conjuring up too much of physics.

We shall give a couple of technical lemmas first.

 

The Triangle-Lemma: The triangle marked by the points (x_1,y_1), (x_2,y_2), (x_3,y_3) has area \displaystyle \frac12\left|\mathrm{det}\begin{pmatrix}1&x_1&y_1\\1&x_2&y_2\\1&x_3&y_3\end{pmatrix}\right|.

Proof: The triangle together with the coordinate axes marks the parallelograms:

P_{13} with the points (x_1,y_1), (x_3, y_3), (x_3,0), (x_1,0);

P_{23} with the points (x_2,y_2), (x_3, y_3), (x_3,0), (x_2,0);

P_{12} with the points (x_1,y_1), (x_2, y_2), (x_2,0), (x_1,0).

dreiecksflaeche

Thus, the area of the triangle is:

F_{T} = P_{13}+P_{23}-P_{12}.

The sign represents the situation given in the figure. For other triangles, another permutation of the signs may be necessary, but there will always be exactly one negative sign. Other permutations of the sign only represent a re-numbering of the points and therefore a change of sign in the determinant given in the statement. As we put absolute values to our statement, we avoid any difficulties of this kind.

As each of the paralellograms has two of their points on the x-axis, we find

\begin{aligned}    F_{T} &= \frac{y_1+y_3}{2}(x_3-x_1)+\frac{y_2+y_3}{2}(x_2-x_3)-\frac{y_1+y_2}{2}(x_2-x_1)\\    &= \frac12\left(x_1\bigl((y_1+y_2)-(y_1+y_3)\bigr) + x_2\bigl((y_2+y_3)-(y_1+y_2)\bigr) + x_3\bigl((y_1+y_3)-(y_2+y_3)\bigr)\right)\\    &= \frac12\left(x_1\bigl((y_1-y_3) + (y_2 - y_1)\bigr) + x_2(y_3-y_1) + x_3(y_1-y_2)\right)\\    &= \frac12\left((x_2-x_1)(y_3-y_1) - (x_3-x_1)(y_2-y_1)\right)\\    &= \frac12\mathrm{det}\begin{pmatrix}x_2-x_1&y_2-y_1\\x_3-x_1&y_3-y_1\end{pmatrix}\\    &= \frac12\mathrm{det}\begin{pmatrix}1&x_1&y_1\\0&x_2-x_1&y_2-y_1\\0&x_3-x_1&y_3-y_1\end{pmatrix}\\    &= \frac12\mathrm{det}\begin{pmatrix}1&x_1&y_1\\1&x_2&y_2\\1&x_3&y_3\end{pmatrix}.    \end{aligned}

Q.e.d.

 

Lemma (Leibniz’ sector formula): Let \gamma:[a,b]\to\mathbb{R}^2 be a continuously differentiable path, \gamma(t) = \begin{pmatrix}x(t)\\y(t)\end{pmatrix}. Then the line segment from 0 to the points of the path sweeps the area \displaystyle\frac12\int_a^b(x\dot y - y\dot x)dt.

Note that we have used Newton’s notation for derivatives. One might also write the integral as \displaystyle\int_a^b\bigl(x(t)y'(t)-y(t)x'(t)\bigr)dt.

Proof: Let us clarify first, what we understand by “sweeping” line segments. Consider the path \gamma given in the image.

weg

As this path is not closed (it’s not a contour), it doesn’t contain an area. But if we take the origin into account, we can define an area that is related to where the path is:

fahrstrahl

Now, pick a partition of [a,b], such as \{a=t_0,t_1,\ldots,t_{n-1},b=t_n\} and make a polygon of the partition and the origin – the corresponding triangles form an area that approximates the area bounded by \gamma, as above.

fahrstrahl_proxy

As the partition gets finer, we expect that the polygon-area converges to the \gamma-area. And this is where the definition originates, of the area that is swept by a path:

For any \varepsilon>0 there shall be \delta>0, such that for every partition \{t_0,\ldots,t_n\} of [a,b] that is finer than \delta, we get

\displaystyle \left|\sum_{i=0}^{n-1}F_{T_i} - \frac12\int_a^b(x\dot y-y\dot x)dt\right| < \varepsilon.\qquad(\ast)

Here, F(T_i) is the area of the triangle bounded by \gamma_{t_i}, \gamma_{t_{i+1}} and the origin. By the Triangle-Lemma, its area is \displaystyle\frac12\mathrm{det}\begin{pmatrix}1&0&0\\1&x_i&y_i\\1&x_{i+1}&y_{i+1}\end{pmatrix}.

Because the orientation of the T_i might be of importance, F(T_i) may keep its sign in what follows (Imagine, for instance, a path that traverses a line segment once from left to right and once from right to left; in total, no area is covered).

Now, let us prove that (\ast) is true.

As \dot x and \dot y are continuous, choose L=\max\left(\max_{[a,b]}\left|\dot x(t)\right|, \max_{[a,b]}\left|\dot y(t)\right|\right) and take \delta = \frac{\varepsilon}{2L^2(b-a)}. Take a partition \{t_0,\ldots,t_n\} with t_0=a, t_0=b and \left|t_{k+1}-t_k\right| < \delta for k=0,\ldots,n-1. Then, for any such k,

\begin{aligned}    F(T_k) &= \frac12\mathrm{det}\begin{pmatrix}1&0&0\\1&x_k&y_k\\1&x_{k+1}&y_{k+1}\end{pmatrix}\\    &=\frac12(x_ky_{k+1}-x_{k+1}y_k)\\    &= \frac12\Bigl(x_k\bigl(y_k+(y_{k+1}-y_k)\bigr) - y_k\bigl(x_k+(x_{k+1}-x_k)\bigr)\Bigr)\\    &=\frac12\bigl(x_k(y_{k+1}-y_k)-y_k(x_{k+1}-x_k)\bigr)\\    &=\frac12\left(x_k\int_{t_k}^{t_{k+1}}\dot y(t)dt - y_k\int_{t_k}^{t_{k+1}}\dot x(t)dt\right)\\    &=\frac12\int_{t_k}^{t_{k+1}}\bigl(x_k\dot y(t)-y_k\dot x(t)\bigr)dt.    \end{aligned}

This yields, using the mean value theorem,

\begin{aligned}    \left|2F(T_k)-\int_{t_k}^{t_{k+1}}(x\dot y-y\dot x)dt\right| &= \left|\int_{t_k}^{t_{k+1}}(x_k\dot y-y_k\dot x-x\dot y+y\dot x)dt\right|\\    &\leq \left|\int_{t_k}^{t_{k+1}}(x_k-x)\dot ydt\right| + \left|\int_{t_k}^{t_{k+1}}(y-y_k)\dot xdt\right|\\    &\leq\int_{t_k}^{t_{k+1}}\max_{[a,b]}\left|\dot x(t)\right|\left|{t_k-t}\right|\left|\dot y(t)\right|dt + \\    &\hphantom{=}+\int_{t_k}^{t_{k+1}}\max_{[a,b]}\left|\dot y(t)\right|\left|{t_k-t}\right|\left|\dot x(t)\right|dt\\    &\leq L^2\left|t_{k+1}-t_k\right|(t_{k+1}-t_k) + L^2\left|t_{k+1}-t_k\right|(t_{k+1}-t_k)\\    &< 2L^2\delta(t_{k+1}-t_k).    \end{aligned}

We conclude

\displaystyle\left|2\sum_{k=0}^{n-1}F(T_k) - \int_a^b(x\dot y - y\dot x)dt\right| < 2L^2\delta(b-a) = \varepsilon.

Q.e.d.

One might as well prove this by applying Green’s Theorem, but in this case it just gets less elementary.

 

The \times-Lemma: Let a,b\in\mathbb{R}^3. We set

\displaystyle a\times b := \begin{pmatrix}a_2b_3-a_3b_2\\a_3b_1-a_1b_3\\a_1b_2-a_2b_1\end{pmatrix}.

Obviously, this is linear both in a and in b. We have

(i)\quad\displaystyle \left\langle (a\times b),c\right\rangle = \mathrm{det}(a,b,c).

(ii)\quad a\times b is orthogonal to a and to b.

(iii)\quad\displaystyle (a\times b)\times c = -\left\langle b,c\right\rangle a + \left\langle a,c\right\rangle b (Grassmann’s equation).

(iv)\quad\displaystyle (a\times b)^{\cdot} = \dot a\times b+a\times\dot b.

Proof: For (i):

\begin{aligned}    \left\langle (a\times b),c\right\rangle &= \left\langle\begin{pmatrix}a_2b_3-a_3b_2\\a_3b_1-a_1b_3\\a_1b_2-a_2b_1\end{pmatrix},\begin{pmatrix}c_1\\c_2\\c_3\end{pmatrix}\right\rangle\\    &=c_1(a_2b_3-a_3b_2)+c_2(a_3b_1-a_1b_3)+c_3(a_1b_2-a_2b_1)\\    &=c_1\mathrm{det}\begin{pmatrix}a_2&b_2\\a_3&b_3\end{pmatrix}-c_2\mathrm{det}\begin{pmatrix}a_1&b_1\\a_3&b_3\end{pmatrix}+c_3\mathrm{det}\begin{pmatrix}a_1&b_1\\a_2&b_2\end{pmatrix}\\    &=\mathrm{det}\begin{pmatrix}a_1&b_1&c_1\\a_2&b_2&c_2\\a_3&b_3&c_3\end{pmatrix}\\    &=\mathrm{det}(a,b,c).    \end{aligned}

For (ii): \left\langle a\times b, a\right\rangle = \mathrm{det}(a,b,a)=0 and \left\langle a\times b,b\right\rangle = \mathrm{det}(a,b,b)=0.

For (iii):

\begin{aligned}    (a\times b)\times c &=\begin{pmatrix}a_2b_3-a_3b_2\\a_3b_1-a_1b_3\\a_1b_2-a_2b_1\end{pmatrix}\times\begin{pmatrix}c_1\\c_2\\c_3\end{pmatrix}\\    &=\begin{pmatrix}(a_3b_1-a_1b_3)c_3-(a_1b_2-a_2b_1)c_2\\(a_1b_2-a_2b_1)c_1-(a_2b_3-a_3b_2)c_3\\(a_2b_3-a_3b_2)c_2-(a_3b_1-a_1b_3)c_1\end{pmatrix}\\    &=\begin{pmatrix}a_3b_1c_3-a_1b_3c_3-a_1b_2c_2+a_2b_1c_2\\a_1b_2c_1-a_2b_1c_1-a_2b_3c_3+a_3b_2c_3\\a_2b_3c_2-a_3b_2c_2-a_3b_1c_1+a_1b_3c_1\end{pmatrix}\\    &=\begin{pmatrix}a_1b_1c_1+a_2b_1c_2+a_3b_1c_3-a_1b_1c_1-a_1b_2c_2-a_1b_3c_3\\a_1b_2c_1+a_2b_2c_2+a_3b_2c_3-a_2b_1c_1-a_2b_2c_2-a_2b_3c_3\\a_1b_3c_1+a_2b_3c_2+a_3b_3c_3-a_3b_1c_1-a_3b_2c_2-a_3b_3c_3\end{pmatrix}\\    &=(a_1c_1+a_2c_2+a_3c_3)\begin{pmatrix}b_1\\b_2\\b_3\end{pmatrix} - (b_1c_1+b_2c_2+b_3c_3)\begin{pmatrix}a_1\\a_2\\a_3\end{pmatrix}\\    &=\left\langle a,c\right\rangle b - \left\langle b,c\right\rangle a.    \end{aligned}

For (iv):

\begin{aligned}    \bigl(a(t)\times b(t)\bigr)^\cdot &= \begin{pmatrix}a_2(t)b_3(t)-a_3(t)b_2(t)\\a_3(t)b_1(t)-a_1(t)b_3(t)\\a_1(t)b_2(t)-a_2(t)b_1(t)\end{pmatrix}^\cdot\\    &=\begin{pmatrix}\dot a_2b_3+a_2\dot b_3-\dot a_3b_2-a_3\dot b_2\\\dot a_3b_1+a_3\dot b_1-\dot a_1b_3-a_1\dot b_3\\\dot a_1b_2+a_1\dot b_2-\dot a_2b_1-a_2\dot b_1\end{pmatrix}\\    &=\begin{pmatrix}\dot a_2b_3-\dot a_3b_2\\\dot a_3b_1-\dot a_1b_3\\\dot a_1b_2-\dot a_2b_1\end{pmatrix}+\begin{pmatrix}a_2\dot b_3-a_3\dot b_2\\a_3\dot b_1-a_1\dot b_3\\ a_1\dot b_2-a_2\dot b_1\end{pmatrix} \\    &=\dot a\times b + a\times \dot b    \end{aligned}

Q.e.d.

 

Now, let us look at conic sections and define them mathematically. We will not be interested in what these things have to do with cones – as stated at the beginning: pure mathematics here.

We are going to work in \mathbb{R}^2 here. Let F be a point (the so-called focal point) and l be a line (the so-called directrix), and the distance of F and l shall be some p>0. We are looking for all those points in \mathbb{R}^2 for which the distance to F and the distance to l are proportional – formally: For any point (\xi,\eta)^t\in\mathbb{R}^2, we set

r=\mathrm{dist}\bigl(F,(\xi,\eta)^t\bigr)\qquad\text{and}\qquad d = \mathrm{dist}\bigl(l, (\xi,\eta)^t\bigr),

and for \varepsilon > 0 we demand

\displaystyle \frac{r}{d}=\varepsilon.

For simplicity, we will put F into the origin of our coordinate system, and l parallel to one of the axes, as in the figure. In particular, r^2 = \xi^2+\eta^2 and d = p+\xi. Our equation for the interesting points thus becomes:

\begin{aligned}    && r^2&=\varepsilon^2d^2\\    &\iff& \xi^2+\eta^2 &= \varepsilon^2p^2+2\varepsilon^2p\xi+\varepsilon^2\xi^2\\    &\iff&\xi^2(1-\varepsilon^2) &= \varepsilon^2p^2-\eta^2+2\varepsilon^2p\xi.    \end{aligned}

kegelschnitte

Let us distinguish the following cases:

Case \varepsilon = 1. Then we set x := \xi+\frac p2, y := \eta and find

\begin{aligned}    &&\xi^2 (1-\varepsilon^2) &= \varepsilon^2p^2-\eta^2 + 2\varepsilon^2 p \xi \\    &\iff& 0 &= p^2-y^2+2p\left(x-\frac p2\right)\\    &\iff& 0 &= p^2-y^2+2px-p^2\\    &\iff& y^2 &= 2px.    \end{aligned}

We see that the interesting points lie on a parabola which is open to the right (by choosing other coordinate systems, of course, any other parabola will appear; in some way, this is its normal form).

Case \varepsilon < 1. Here we set x:=\xi-\frac{p\varepsilon^2}{1-\varepsilon^2}, y:=\eta. Then we get

\begin{aligned}    &&\xi^2 (1-\varepsilon^2) &= \varepsilon^2p^2-\eta^2 + 2\varepsilon^2 p \xi \\    &\iff &\left(x+\frac{p\varepsilon^2}{1-\varepsilon^2}\right)^2(1-\varepsilon^2) &= \varepsilon^2p^2-y^2+2\varepsilon^2p\left(x+\frac{p\varepsilon^2}{1-\varepsilon^2}\right)\\    &\iff& x^2(1-\varepsilon^2)+2xp\varepsilon^2 + \frac{p^2\varepsilon^4}{1-\varepsilon^2} &= \varepsilon^2p^2-y^2+2\varepsilon^2px+\frac{2\varepsilon^4p^2}{1-\varepsilon^2}\\    &\iff& x^2(1-\varepsilon^2) + y^2 &= \varepsilon^2p^2+\frac{\varepsilon^4p^2}{1-\varepsilon^2}\\    &\iff& x^2(1-\varepsilon^2) + y^2 &= \frac{(1-\varepsilon^2)\varepsilon^2p^2+\varepsilon^4p^2}{1-\varepsilon^2}\\    &\iff& x^2(1-\varepsilon^2) + y^2 &= \frac{\varepsilon^2p^2}{1-\varepsilon^2}\\    &\iff& \frac{x^2(1-\varepsilon^2)^2}{\varepsilon^2p^2} + \frac{y^2(1-\varepsilon^2)}{\varepsilon^2p^2} &= 1\\    &\iff& \frac{x^2}{a^2} + \frac{y^2}{b^2} &= 1,    \end{aligned}

with a = \frac{\varepsilon p}{1-\varepsilon^2} and b = \frac{\varepsilon p}{\sqrt{1-\varepsilon^2}}.

We have found that the interesting points lie on an ellipse.

Case \varepsilon > 1. This is exactly the same as the case \varepsilon<1, except for the last step. We mustn’t set b as before, since 1-\varepsilon^2<0 and we cannot get a real square root of this. Thus, we use b = \frac{\varepsilon p}{\sqrt{\varepsilon^2-1}} and the resulting negative sign is placed in the final equation:

\displaystyle \frac{x^2}{a^2}-\frac{y^2}{b^2} = 1.

This is a hyperbola.

To conclude this part, we give the general representation of conic sections in polar coordinates. From the figure given above, we see d = p+r\cos\varphi and so

\displaystyle r = \varepsilon d = \varepsilon p + \varepsilon r\cos\varphi,

which means

\displaystyle r = \frac{\varepsilon p}{1-\varepsilon\cos\varphi}.

That yields the polar coordinates (only depending on parameters and on the variable \varphi:

\displaystyle re^{i\varphi} = \frac{\varepsilon p}{1-\varepsilon\cos\varphi}e^{i\varphi}.

 

Now, let us turn to our base for the proofs of Kepler’s Laws: Newton’s Law of Gravitation. Let m be the mass of a planet, M the mass of the Sun, \gamma a real constant (the gravitational constant), and let x(t) be a path (the planet’s orbit). By Newton’s Law we have the differential equation

\displaystyle m\ddot x = -\gamma Mm\frac{x}{\left\|x\right\|^3}.

On the left-hand side, there’s the definition of force as mass multiplied by acceleration. On the right-hand side is Newton’s Law stating the gravitational force between Sun and planet.

We define the vector-valued functions of t (note that x depends on t):

\displaystyle J = x\times m\dot x\qquad\text{and}\qquad A=\frac1{\gamma Mm}J\times\dot x+\frac{x}{\left\|x\right\|}.

(J is the angular momentum, A is an axis; but our math doesn’t care for either of those names or intentions).

The AJ-Lemma: As functions of t, A and J are constant.

Proof: Let us look at J first. Using the fact that by definition a\times a = 0, and Newton’s Law of Gravitation, we get

\begin{aligned}    \dot J \underset{\times\text{-Lemma}}{\overset{(iv)}{=}} (x\times m\dot x)^\cdot &= \dot x \times m\dot x + x\times m\ddot x\\    &= \dot x \times m\dot x + x\times\left(-\gamma Mm\frac{x}{\left\|x\right\|^3}\right)\\    &= m(\dot x\times \dot x) - \frac{\gamma Mm}{\left\|x\right\|^3}(x\times x)\\    &= 0 - 0 = 0.    \end{aligned}

Now, for A, we will need a side-result first.

\begin{aligned}    \frac{d}{dt}\frac{1}{\left\|x\right\|} &= \frac{d}{dt}\bigl(x_1^2(t)+x_2^2(t)+x_3^2(t)\bigr)^{-1/2}\\    &= -\frac12\bigl(x_1^2(t)+x_2^2(t)+x_3^2(t)\bigr)^{-3/2}\bigl(2x_1(t)\dot x_1(t)+2x_2(t)\dot x_2(t)+2x_3(t)\dot x_3(t)\bigr)\\    &= -\frac{\left\langle x,\dot x\right\rangle}{\left\|x\right\|^3}.    \end{aligned}

This yields

\begin{aligned}    \dot A &\underset{\hphantom{\times\text{-Lemma}}}{=} \left(\frac{1}{\gamma Mm}J\times \dot x + \frac{x}{\left\|x\right\|}\right)^\cdot \\    &\underset{\times\text{-Lemma}}{\overset{(iv)}{=}} \frac{1}{\gamma Mm}\bigl(\dot J\times\dot x + J\times\ddot x\bigr) + \left(\frac{1}{\left\|x\right\|}\dot x-\frac{\left\langle x,\dot x\right\rangle}{\left\|x\right\|^3}x\right)\\    &\underset{\hphantom{\times\text{-Lemma}}}{=} \frac1{\gamma Mm}\left(0+J\times\left(-\gamma M\frac{x}{\left\|x\right\|^3}\right)\right) + \left(\frac{1}{\left\|x\right\|}\dot x-\frac{\left\langle x,\dot x\right\rangle}{\left\|x\right\|^3}x\right)\\    &\underset{\hphantom{\times\text{-Lemma}}}{=}-\frac1m\left((x\times m\dot x)\times \frac{x}{\left\|x\right\|^3}\right) + \left(\frac{1}{\left\|x\right\|}\dot x-\frac{\left\langle x,\dot x\right\rangle}{\left\|x\right\|^3}x\right)\\    &\underset{\hphantom{\times\text{-Lemma}}}{=}-\left((x\times \dot x)\times \frac{x}{\left\|x\right\|^3}\right) + \left(\frac{1}{\left\|x\right\|}\dot x-\frac{\left\langle x,\dot x\right\rangle}{\left\|x\right\|^3}x\right)\\    &\underset{\times\text{-Lemma}}{\overset{(iii)}{=}} -\left\langle x,\frac{x}{\left\|x\right\|^3}\right\rangle \dot x + \left\langle \dot x,\frac{x}{\left\|x\right\|^3}\right\rangle x + \frac{1}{\left\|x\right\|}\dot x-\frac{\left\langle x,\dot x\right\rangle}{\left\|x\right\|^3}x \\    &\underset{\hphantom{\times\text{-Lemma}}}{=} -\frac1{\left\|x\right\|^3}\left\langle x,x\right\rangle \dot x + \frac1{\left\|x\right\|^3}\left\langle \dot x, x\right\rangle x + \frac1{\left\|x\right\|}\dot x-\frac1{\left\|x\right\|^3}\left\langle x,\dot x\right\rangle x\\    &\underset{\hphantom{\times\text{-Lemma}}}{=} -\frac1{\left\|x\right\|^3}\left\|x\right\|^2\dot x + \frac1{\left\|x\right\|}\dot x\\    &\underset{\hphantom{\times\text{-Lemma}}}{=} 0.    \end{aligned}

Q.e.d.

 

Now, we have all ingredients to prove Kepler’s Laws. We conclude axiomatically by assuming that planetary motion is governed by Newton’s Law of Gravitation (the differential equation given above).

Let’s start with

Theorem (Kepler’s First Law of Planetary Motion): Planets orbit in ellipses, with the Sun as one of the foci.

Proof: Let x(t) denote the orbit of a planet around the Sun. By the AJ-Lemma, J is constant. By definition of J = x\times m\dot x and by (ii) of the \times-Lemma, both x and \dot x are orthogonal to J; the orbit is therefore located in a two-dimensional plane. Let us introduce polar coordinates in this plane, with the Sun in the origin, and with axis A (this works, as A is located in the same plane as well: A\bot J by the definition of A).

planetenorbit

Now let \varphi(t) be the angle of x(t) and A, set \varepsilon := \left\|A\right\|. This means

\displaystyle \cos\varphi(t) = \frac{\left\langle x(t),A\right\rangle}{\left\|x\right\|\left\|A\right\|},

and so

\displaystyle \left\langle A,x(t)\right\rangle = \varepsilon \left\|x\right\|\cos\varphi(t).

By definition of A, we have

\begin{aligned}    \left\langle A,x(t)\right\rangle &\underset{\hphantom{\times\text{-Lemma}}}{=} \left\langle\frac{1}{\gamma Mm}J\times\dot x + \frac{x}{\left\|x\right\|}, x(t)\right\rangle\\    &\underset{\hphantom{\times\text{-Lemma}}}{=} \frac{1}{\gamma Mm}\left\langle J\times \dot x, x\right\rangle + \frac{1}{\left\|x\right\|}\left\langle x,x\right\rangle\\    &\underset{\times\text{-Lemma}}{\overset{(i)}{=}} \frac{1}{\gamma Mm}\mathrm{det}(J,\dot x,x)+\frac{\left\| x\right\|^2}{\left\|x\right\|}\\    &\underset{\hphantom{\times\text{-Lemma}}}{=} (-1)^3\frac{1}{\gamma Mm}\mathrm{det}(x,\dot x,J) + \left\|x(t)\right\|\\    &\underset{\times\text{-Lemma}}{\overset{(i)}{=}} -\frac1{\gamma Mm}\left\langle x\times \dot x,J\right\rangle + \left\|x(t)\right\|\\    &\underset{\hphantom{\times\text{-Lemma}}}{=} -\frac1{\gamma Mm^2}\left\langle x\times m\dot x, J\right\rangle + \left\|x(t)\right\|\\    &\underset{\hphantom{\times\text{-Lemma}}}{=} -\frac1{\gamma Mm^2}\left\langle J,J\right\rangle + \left\|x(t)\right\|\\    &\underset{\hphantom{\times\text{-Lemma}}}{=} const. + \left\|x(t)\right\|.    \end{aligned}

Now, if A=0, then we have found

\displaystyle \left\|x(t)\right\| = \frac1{\gamma Mm^2}\left\|J\right\|^2,

which means that the planet moves on a circular orbit.

If A\neq0, then we conclude

\begin{aligned}    &&\varepsilon \left\|x(t)\right\|\cos\varphi(t) &= -\frac1{\gamma Mm^2}\left\|J\right\|^2+\left\|x(t)\right\|\\    &\implies&\varepsilon\left\|x(t)\right\|\cos\varphi(t)&=-\frac1{\gamma Mm^2}\left\|J\right\|^2+\left\|x(t)\right\|\\    &\implies&\frac1{\gamma Mm^2}\left\|J\right\|^2&=\bigl(1-\varepsilon\cos\varphi(t)\bigr)\left\|x(t)\right\|\\    &\implies&\left\|x(t)\right\|&=\frac{\frac{\left\|J\right\|^2}{\gamma Mm^2}}{1-\varepsilon\cos\varphi(t)} = \frac{\varepsilon\frac{\left\|J\right\|^2}{\gamma Mm^2\left\|A\right\|}}{1-\varepsilon\cos\varphi(t)} = \frac{\varepsilon p}{1-\varepsilon\cos\phi(t)},    \end{aligned}

with p defined in the obvious fashion to make the last equation work.

Therefore, the planet moves on a conic section, with focus in the Sun. As the planet’s orbits are bounded, we have proved that it must follow an ellipse. Q.e.d.

Theorem (Kepler’s Second Law of Planetary Motion): A planet sweeps out equal areas in equal times.

Proof: We use cartesian coordinates in \mathbb{R}^3, such that e_1 is parallel to A and e_3 is parallel to J. Then x(t) is the plane \mathrm{span}(e_1,e_2) and the Sun is in (0,0). In particular, x_3(t) = 0 for all t by the proof of the First Law. Then,

\displaystyle \frac1m J = x\times \dot x = \begin{pmatrix}x_1\\x_2\\0\end{pmatrix}\times \begin{pmatrix}\dot x_1\\\dot x_2\\ 0\end{pmatrix} = \begin{pmatrix}0\\0\\x_1\dot x_2-x_2\dot x_1\end{pmatrix}.

By Leibniz’ sector formula, the line segment between times t_1 and t_2 sweeps the area

\displaystyle \frac12\left|\int_{t_1}^{t_2}(x_1\dot x_2-x_2\dot x_1)dt\right| = \frac1{2m}\left\|J\right\|(t_2-t_1).

This area only depends on the difference of times, as stated. Q.e.d.

Theorem (Kepler’s Third Law of Planetary Motion): The square of the period of an orbit is proportional to the cube of its semi-major axis.

Proof: By Leibniz’ sector formula (used similarly to the proof of the Second Law), the area contained in the planet’s entire orbit is

\displaystyle \frac12\left|\int_0^T (x_1\dot x_2-x_2\dot x_1)dt\right| = \frac1{2m}\left\|J\right\| T,

where T is the time taken for a full orbit around the Sun. By the First Law, this orbit is an ellipse, the area of which may be computed as follows: The cartesian coordinates of an ellipse are

\begin{pmatrix}x(t)\\y(t)\end{pmatrix}=\begin{pmatrix}a\cos t\\b\sin t\end{pmatrix},

with a and b real constants (the larger one is called the semi-major axis). This is actually an ellipse because of

\displaystyle \frac{x^2}{a^2}+\frac{y^2}{b^2} = \frac{a^2\cos^2 t}{a^2}+\frac{b^2\sin^2t}{b^2} = 1.

From the notations about normal forms of conic sections, we find b = \frac{\varepsilon p}{\sqrt{1-\varepsilon2}} = a\sqrt{1-\varepsilon^2}, which implies that a>b (as \varepsilon > 0 for any conic section). Now, the area of the ellipse is, by Leibniz’ sector formula again,

\begin{aligned}    \frac12\left|\int_0^{2\pi} (x_1\dot x_2-x_2\dot x_1)dt\right| &= \frac12\left|\int_0^{2\pi}\bigl(a \cos t \cdot b \cos t-b\sin t \cdot a (-\sin t)\bigr)dt\right| \\    &= \frac12 ab\int_0^{2\pi}dt \\    &= \pi ab.    \end{aligned}

Both representations of the area covered by the orbit now yield

\displaystyle T \frac1{2m}\left\|J\right\| = \pi a^2\sqrt{1-\varepsilon^2},

and so, using the definition of p obtained in the proof of the First Law,

\begin{aligned}    \displaystyle T^2 &= \frac{4m^2}{\left\|J\right\|^2}\pi^2a^4(1-\varepsilon^2) \\    &= \frac{4\pi^2 m^2a^3}{\left\|J\right\|^2}a(1-\varepsilon^2) \\    &= \frac{4\pi^2m^2a^3}{\left\|J\right\|^2}\frac{\varepsilon p}{1-\varepsilon^2}(1-\varepsilon^2) \\    &= \frac{4\pi^2m^2a^3}{\left\|J\right\|^2}\varepsilon\frac{\left\|J\right\|^2}{\gamma Mm^2\left\|A\right\|} \\    &= \frac{4\pi^2}{\gamma M}a^3.    \end{aligned}

The constant \frac{4\pi^2}{\gamma M} is identical for any planet travelling around the Sun, and thus is constant. Q.e.d.

Let us conclude with a brief remark of how beautiful and elegant those Laws are – made my day.

On numerical integration and convergence issues

The methods for numerical integration are of extreme practical interest for any mathematician. In any number of applications one is obliged to find the value of some definite integral. One of the most trivial examples should be the values of probability distributions, a little less obvious example are solutions to differential equations.

Especially since the advent of cheap but high-performing computer power those possibilities have risen immensely. Things that were unattainable some decades ago are used thoroughly by today’s standards. This is why I found it very necessary to make myself familiar with how one can find definite integrals numerically (basically that’s the same motivation for why I wanted to know about numerical algorithms for finding eigenvalues – and there is a nice connection between the two fields, as we shall see).

At first, I had wanted to compile some of the basic ideas for numerical integration here. However, I dived deeper down into some certain aspect of the theory and decided to cut that one short to present something a little less known on convergence of quadrature methods. But to begin with, a little review of what every mathematician should learn in his basic course on numerical analysis.

The numerical integration is sometimes called quadrature, as one tries to find the value of a given shape (the “area under the curve”) and therefore find a square with the same area – similar to what one does in “squaring the circle” which is called “quadraturam circuli” in latin. It’s just that we don’t have to be restricted to ruler and compass anymore.

 

Basics

Of course, the natural way of numerical integration is to evaluate the function f (to fix notation here) at certain points and make a weighted average of the values. This is related to the Riemann-definition of integrability, where you use sums of the form \sum_{i=1}^nf(\xi_i)(x_{i}-x_{i-1}) for \xi_i\in(x_{i-1},x_{i}) and let the grid x_i get finer. This is a starting point for what we can achieve, but the first question is, how can we choose the \xi_i smartly?

Let’s not delve into this too deeply, but the short answer from Riemann-integration is: do as you please (as long as f is continuous). In the end, it will converge. The most common first choice is the appropriately-called midpoint-rule (here’s a picture). If you don’t know anything further about f, why not. But one will soon find the trapezoidal rule (picture) and there is no general rule which one of them will be better in any instance. Let’s just write those down for further reference:

Midpoint-rule: \int_a^b f(t)dt \approx (b-a)f(\frac{a+b}2),

Trapezoidal rule: \int_a^b f(t)dt \approx \frac{b-a}{2}\bigl(f(a)+f(b)\bigr).

A certain mixture of those two rules can be found in

Simpson’s rule: \int_a^b f(t)dt \approx \frac{b-a}{6}(f(a)+4f(\frac{a+b}2)+f(b)\bigr).

All these rules are fit to integrate polynomials up to a certain degree exactly (which is why this feature is called the degree of exactness for the quadrature rule): the midpoint rule is exact for linear polynomials, the trapezoidal rule makes it up to degree 2 and Simpson’s rule can integrate cubic polynomials exactly.

Each of these rules can be constructed by interpolation of f with low-degree polynomials: one chooses the unique polynomial p for which p(t)=f(t) in those t that come up in the formulae given above and choose constants (“weights”) such that \int_a^b p(t)dt = \sum_{i=1}^n w_ip(t_i). Note, that this will not be a Riemann-sum anymore, in general. You will find that for p being of degree n=1, 2 or 3, you will end up with the three formulae given above.

Now, if you wish to be a little better in your interpolation, you can subdivide the interval [a,b] a little further and use any of those rules on the smaller intervals (interpolating repeatedly) to get the compound rules:

– Midpoint-rule: \int_a^b f(t)dt \approx \frac{b-a}{n}\sum_{i=1}^nf(\frac{x_{i}-x_{i-1}}2) with x_k = a+k\frac{b-a}{n},

– Trapezoidal rule: \int_a^b f(t)dt \approx \frac{b-a}{n}\sum_{i=1}^n\frac{f(x_{i-1})+f(x_i)}{2} = \frac12f(a)+\frac{b-a}{n}\sum_{i=1}^{n-1}f(x_i)+\frac12f(b) with x_k = a+k\frac{b-a}{n},

– Simpsons rule: \int_a^b f(t)dt \approx \frac{b-a}{6n}\bigl(f(a) + 2\sum_{i=1}^n f(x_i) + 4\sum_{i=1}^nf(\frac{x_{i-1}+x_{i}}{2}) + f(b)\bigr) with x_k = a+k\frac{b-a}{n}.

Note, that we have a constant grid-width here. This doesn’t have to be the case in general, as we shall see. Besides, note that in Simpson’s rule we have three evaluations of the f in each of the sub-intervals, and only the boundary points x_k are used twice (once as the left-hand point, once as the right-hand point). One might also define it slightly differently, by having the sub-intervals overlap and each evaluation of f is used twice (once as the midpoint, the other time as a boundary point), except for f(a) and f(b).

Now, any basic text on numerical analysis and Wikipedia will tell you everything there is to know about the error that you get from these quadrature formulae. Let’s not go there, except for some small remarks. First, each of these rules has an error that will diminish with more sub-divisions of [a,b]. The error will depend on how smooth f is, in terms of the norm \left\|f^{(k)}\right\|_\infty for some k\geq1. Therefore, to know about the error, you will need to conjure some information on the derivatives of f, which will rarely be possible in applications. However, if f is “only” continuous, you may find yourself in a bad position for error estimation: the quadrature rule might be just looking at the unusual points to evaluate f, which are not typical of the overall behaviour:

untypisch

(here’s an example going with the picture: take f to be some positive constant c, except for a tiny area around the evaluation, where it shall vanish; then \int_a^bf(t)dt \approx c(b-a) \neq 0 = \sum_{i=1}^n w_if(x_i) – though not impossible, it’s harder to do this the smoother f will be, and the derivatives of f will blow up, giving a worse estimate). This effect has been called “resonance” by Pólya.

Apart from ad hoc methods for the error estimation there is the method of Peano kernels which can be applied to bound quadrature formulae (and other linear functionals). Let there just be mentioned that this method requires rather little effort to yield error bounds. I wouldn’t quite call it elegant, possibly since I lack the background in functional analysis, but it gets the job nicely done.

Next natural question: which n should we use? Short answer: the more the better, though many evaluations of f are “expensive” (especially in applications from the sciences and economics where you might even be unable to get more measurements). Apart from the compound rules, which are used quite often in practice, one might think that interpolation using polynomials of higher degree should work out better and give finer error estimates. The quadrature stemming from this idea is called Newton-Cotes-formula (of which our previous rules are special cases). In principle, this is sound reasoning, but here’s the showstopper: interpolating polynomials need not be similar to the function that they interpolate, and hence the quadrature runs wild. The most famous example is Runge’s phenomenon: for f(x)=\frac{1}{1+x^2} the sequence of polynomials interpolating on an equidistant grid does not converge pointwise, which can be explained by the singularity at x = \pm i that stays invisible while working on the real numbers only. In particular, an interpolating polynomial of sufficiently high degree won’t tell us anything useful about the behaviour of f.

For numerical applications, the next problem is that the weights in Newton-Cotes-formulae become negative when n\geq 9 (that means, if you use a polynomial of higher degree to interpolate f on [a,b]). In effect, you will encounter annihilation in your computation from subtracting rather large numbers – there won’t be any guarantee for proper results anymore, even though in theory your method might work fine on your given function. Quadrature formulae with positive weights are called for.

Before the discussion of convergence of quadrature formulae, we will still look at what methods are best possible in terms of the degree of exactness. It’s the Gaussian integration (many practical things are named after Gauß, aren’t they?) which allows for exact integration of polynomials up to degree 2n-1 from n evaluations.

But first, why will we be content with a degree of exactness 2n-1, why not go further? Well, there is always at least one polynomial of degree 2n that can’t be exactly integrated by a quadrature formula on n evaluations. Let’s fix some quadrature formula, it will look like \sum_{i=1}^nw_if(x_i) with weights w_i and evaluations in x_i, and let’s look at the polynomial \omega^2(x) = \prod_{i=1}^n(x-x_i)^2 which has its zeroes at the points of evaluation of the quadrature formula. Then, we find

\displaystyle \sum_{i=1}^nw_i\omega^2(x_i) = 0 < \int_a^b \omega^2(t)dt.

Hence, there is always a polynomial of degree 2n that can’t be integrated exactly, and therefore 2n-1 is the best we could hope to achieve. Now, how do we get there? Up to now, the quadrature formulae had used some x_i (mostly because those x_i were the natural thing to do or because we can’t always chose the x_i at liberty, if they stem from experimental measurements, say) and the w_i were computed accordingly to make the polynomials up to degree n be integrated exactly. But we can of course choose the x_i wisely as well: they can act as a degree of freedom, and then we’ll have 2n clever choices to make – this seems to indicate that we can achieve the degree of exactness 2n-1. But can we? And how?

Let’s assume that there is some quadrature formula which has this degree of exactness, and let’s look at what it will do to the polynomials \omega p, where p is any polynomial of degree less than n. Restricting ourselves to the interval [a,b] = [-1,1] for simplicity, we should find

\displaystyle 0 = \sum_{i=1}^nw_i\omega(x_i)p(x_i) = \int_{-1}^1\omega(t)p(t)dt.

That shows that p would be orthogonal to \omega, no matter what p actually was (as the L^2-space is a Hilbert space, of course, with the integration as its scalar product). From the abstract theory of Hilbert spaces we can know that there are orthonormal bases of polynomials. For our setting here, we are led to use the Legendre polynomials as \omega; there are other orthonormal polynomials which will then use another weight function inside the integral – later more on that.

We have seen that the zeroes of the Legendre polynomials should be used to evaluate f. But how to find those? Here’s the pretty thing I alluded to above. There’s a recursion formula for the Legendre polynomials P_n which can be shown by induction:

\displaystyle \beta_{n+1}P_{n+1}(x) = xP_n(x) - \beta_nP_{n-1}(x), with \displaystyle P_1\equiv\frac1{\sqrt2} and \displaystyle P_2(x) = \frac32x,

and using \beta_n = \frac{n}{\sqrt{(2n-1)(2n+1)}}. This recursion may be re-written as a vector-matrix-shape:

\displaystyle\begin{bmatrix}0&\beta_1&0\\\beta_1&0&\beta_2&0\\0&\beta_2&0&\beta_3&0\\&&\ddots&\ddots&\ddots\\&&&\beta_{n-2}&0&\beta_{n-1}\\&&&0&\beta_{n-1}&0\end{bmatrix}\begin{bmatrix}P_0(x)\\P_1(x)\\P_2(x)\\\vdots\\P_{n-2}(x)\\P_{n-1}(x)\end{bmatrix}= x\begin{bmatrix}P_0(x)\\P_1(x)\\P_2(x)\\\vdots\\P_{n-2}(x)\\P_{n-1}(x)\end{bmatrix}-\begin{bmatrix}0\\0\\0\\\vdots\\0\\\beta_nP_n(x)\end{bmatrix}

The last vector drops out for the zeroes of P_n, and then we have an eigenvalue-problem. In particular: find the eigenvalues of the matrix on the left-hand-side to find the zeroes of P_n. That’s some rather unexpected connection – I found it marvellous when I first saw this.

Computing eigenvalues is a tricky thing to do, as everybody knows, and getting eigenvectors is even harder. It turns out that you’ll need the eigenvectors to compute the proper weights w_i for the quadrature formula, and many intelligent people have come up with ingenious algorithms to do so, such that we can have the proper values for the maximum degree of exactness that we have yearned for.

This shall suffice here. We can understand that these so-called Gauss-Legendre-formulae can’t be expressed in a closed form, but they yield satisfying results in practice. Of course, when there was no cheap computer power available, those formulae were more of an academic exercise, as you couldn’t easily evaluate any given function f in those crude points x_i (even though these x_i came in tables) and you can’t re-use those evaluations when you take other values for n.

A numerical advantage is that the weights will be positive for the Gaussian quadrature, which is easy to see, as a matter of fact: with the Lagrange-polynomials \ell_{ni}, that are used for explicit interpolation with polynomials and that have \ell_{ni}(x_j) = \delta_{ij} for the evaluation points, we find

\displaystyle w_i = \sum_{j=1}^nw_j\ell_{ni}^2(x_j) = \int_{-1}^1\ell_{ni}^2(t)dt > 0.

As an aside: there are some variants of Gauss-Legendre that may come in useful in some applications. For instance there is the Gauss-Chebyshev-formula which uses the weight function \frac1{\sqrt{1-x^2}} in the integration (not in the quadrature – weights and weight functions are not related directly). This means, if your function f is of the shape f(x) = \frac{g(x)}{\sqrt{1-x^2}}, use Gauss-Chebyshev on the function g, which may be easier or more stable; in that case, other x_i and w_i are necessary, of course. Similar things hold for Gauss-Laguerre (weight e^{-x}, the integral running on [a,b)=[0,\infty), which makes it possible to integrate on infinite intervals, being a non-trivial task at a first glance) and Gauss-Hermite (weight e^{-x^2}, (a,b)=(-\infty,\infty)). Other variants have certain fixed points of evaluation, which certainly reduces the degree of exactness, but which might help in solving differential equations (when you always evaluate in the point x_1=a, you save one evaluation as this is your starting-point of your multi-step method; Gauss-Radau does this, for instance, and similarly Gauss-Lobatto with x_1=a and x_n=b fixed).

Obvious linear transformations will enable us to use different values for a and b than the standard ones for which the various orthonormal polynomials are constructed. Let’s ignore this here.

Those were pretty much the basic facts of what any mathematician should know or at least should have heard once, without much of a proof stated here. Now for a little more specialized material. Who guarantees that the quadrature gets better with increasing numbers of evaluation? Evaluations are expensive, generally speaking, and maybe it doesn’t even do any good (as was the case with the Newton-Cotes-method – or does it)? It seems that Stieltjes first came up with this question and gave partial answers in 1884, while Pólya answered it rather exhaustively a couple of decades later in 1933. We rely on Pólya’s original article “Über die Konvergenz von Quadraturverfahren” (Math. Zeitschr. 37, 1933, 264-286)

 

Convergence

A quadrature method is, for all intents and purposes that are to come, a triangular scheme of weights w_{ij} and evaluation points x_{ij} like

\displaystyle \begin{matrix}w_{11}\hphantom{,}\\w_{21}, &w_{22}\hphantom{,}\\w_{31},&w_{32},&w_{33}\\\vdots\\w_{n1},&w_{n2},&\ldots&w_{nn}&\\\vdots&\end{matrix}\hspace{2em} and \hspace{2em}\displaystyle \begin{matrix}x_{11}\hphantom{,}\\x_{21}, &x_{22}\hphantom{,}\\x_{31},&x_{32},&x_{33}\\\vdots\\x_{n1},&x_{n2},&\ldots&x_{nn}&\\\vdots&\end{matrix}

For any n, we need a\leq x_{n1}<x_{n2}<\cdots<x_{nn}\leq b. Of course we shall combine those triangular schemes like this

\displaystyle Q_n[f] := \sum_{i=1}^nw_{ni}f(x_{ni})

and hold up our hope that \lim_{n\to\infty} Q_n[f] = \int_a^bf(t)dt. If so, the quadrature method converges for the funtion f. Of course, f needs to be at least Riemann-integrable (We are unable to focus on Lebesgue-integrable functions here, as no quadrature method can converge for any Lebesgue-integrable function: there are only countably many evaluation points on which f may be altered at will without effect on its integral).

It is clear that the Newton-Cotes-formulae fit in this definition. We can choose the x_{ij} at liberty and the w_{ij} follow from the demand that in stage n polynomials up to degree n-1 are integrated exactly. In particular, the Newton-Cotes-formulae converge for any polynomial by definition.

But there’s the question, what do the x_{ij} and w_{ij} need to satisfy in order for the quadrature method to converge on a given set of functions (any polynomial, any continuous or even any integrable function)? If you switch to a larger set, what are the additional necessary assumptions?

For polynomials, the case is quite easy. We look for \lim_{n\to\infty}\sum_{i=1}^nw_{ni}x_{ni}^k = \frac{1}{k+1}(b^{k+1}-a^{k+1}) for any k\geq0, as every polynomial is a linear combination of the x^k and both integration and quadrature are linear functionals. No big deal, and for k=0 in particular we find the important assumption \lim_{n\to\infty}\sum_{i=1}^nw_{ni} = b-a.

 

Pólya’s result that we shall examine in a little more detail is this

Theorem: The quadrature method converges for every continuous function, if and only if it converges for every polynomial and there is a constant \Gamma such that \sup_n\sum_{i=1}^n\left|w_{ni}\right| < \Gamma.

One direction of the proof is easy. Let \sup_n\sum_{i=1}^n\left|w_{ni}\right| < \Gamma and let f be any continuous function. We show that the method converges for f. We know by Weierstrass’ Approximation Theorem that there are uniformly close polynomials for f, in the sense that for every \varepsilon>0 we can get a polynomial p with \left\|f(t) - p(t)\right\|_\infty<\varepsilon. Thus,

\displaystyle \begin{aligned}    \left|\int_a^bf(t)dt - Q_n[f]\right| &\leq \int_a^b\left|f(t)-p(t)\right|dt + \left|\int_a^bp(t)dt - Q_n[p]\right| + \sum_{i=1}^n\left|w_{ni}\right|\left|p(x_{ni})-f(x_{ni})\right| \\    &\leq \varepsilon(b-a) + \varepsilon + \varepsilon\Gamma.    \end{aligned}

Here, we used that Q_n converges for the polynomial p and that for n sufficiently large \left\|f-p\right\|_\infty<\varepsilon. This shows that Q_n[f] is convergent.

The other direction of the proof is a little harder. We assume that there is no constant \Gamma such that \sup_n\sum_{i=1}^n\left|w_{ni}\right| < \Gamma – therefore \limsup_n\sum_{i=1}^n\left|w_{ni}\right| = \infty. We will construct a continuous function f for which Q_n[f] diverges.

This will involve the “resonance” that I alluded to earlier: choose a function that reacts as badly as possible on the evaluation points. For fixed k take g_k such that \left\|g_k\right\|_\infty = 1 and Q_k[g_k] = \sum_{i=1}^k\left|w_{ki}\right|.

resonanz

This ensures that Q_k[g_k] cannot possibly take a larger value on any function that itself is bounded by 1. Such g_k may be constructed, for instance, via g_k(x_{kj}) = \mathrm{sgn} w_{kj} and linear connections between the evaluation points.

Now, one of two possible cases may occur: either there is some k with \limsup_n\left|Q_n[g_k]\right|=\infty, or that expression stays bounded uniformly in k. In the first case, we have already found the continuous function which makes the quadrature method diverge and we are done. So, let us look at the second case only.

Let us inductively construct a subsequence of the g_k as follows. Let g_{k_1} = g_1. If we have got g_{k_{m-1}}, set

\displaystyle M_{m-1}:= \sup_nQ_n\left[\sum_{j=1}^{m-1}g_{k_j}3^{-j}\right].

As we have already decided we’re in the second case, M_{m-1} will be finite. Now, we wish to choose the next part g_{k_m} of the subsequence, and we ensure both k_m>k_{m-1} and (as \limsup_n\sum_{i=1}^n\left|w_{ni}\right| = \infty, we may find some index to make this sum of weights as large as we like)

\displaystyle \sum_{i=1}^{k_m}\left|w_{k_mi}\right| > 3^m 2(M_{m-1} + m).

With the subsequence g_{k_m} at our disposal, let’s define the function

\displaystyle f(x):=\sum_{j=1}^\infty 3^{-j}g_{k_j}(x).

This is a normally convergent series, as each of the g_{k_j} is bounded by 1 and thus for sufficiently large n:

\displaystyle\left\|\sum_{j=n}^\infty 3^{-j}g_{k_j}(x)\right\|_\infty\leq\sum_{j=n}^\infty 3^{-j} < C3^{-n}<\varepsilon.

As a uniformly convergent limit of continuous functions, f is continuous. Next, we invoke linearity of quadrature methods again and have a closer look at what the Q_n do to f:

\displaystyle \begin{aligned}    Q_{k_m}[f] &= Q_{k_m}\left[\sum_{j=1}^{m-1}g_{k_j}3^{-j}\right] + 3^{-m}Q_{k_m}\left[g_{k_m}\right] + Q_{k_m}\left[\sum_{j=m+1}^\infty g_{k_j}3^{-j}\right]\\    & > -M_{m-1} + 3^{-m}\sum_{j=1}^{k_m}\left|w_{k_mj}\right| - \sum_{\ell=1}^{k_m}\left|w_{k_m\ell}\right|\sum_{j=m+1}^\infty 1\cdot 3^{-j},    \end{aligned}

by the definition of M_{m-1} and by the construction of g_{k_m}. Then,

\displaystyle \begin{aligned}    Q_{k_m}[f] & = -M_{m-1} + 3^{-m}\sum_{j=1}^{k_m}\left|w_{k_mj}\right| - 3^{-m-1}\sum_{\ell=1}^{k_m}\left|w_{k_m\ell}\right|\cdot\frac1{1-\frac13}\\    & = -M_{m-1} + \frac123^{-m}\sum_{j=1}^{k_m}\left|w_{k_mj}\right|\\    & > m,    \end{aligned}

which may explain the strange bound that we used in the construction of the subsequence.

Letting m\to\infty, we see that the quadrature method diverges for f. The entire construction hinges on the assumption that \limsup_n\sum_{i=1}^n\left|w_{ni}\right| = \infty; in particular, the condition given in the theorem is necessary for the convergence of the quadrature method.

The theorem is proved. q.e.d.

 

The condition for the convergence on Riemann-integrable functions is rather technical and we won’t dwell on this for too long. The basic ideas of resonance are similar, and in applications there is often good reason to assume that your function is continuous (as opposed to differentiable which you usually can’t know). We’ll content ourselves with the remark that quadrature methods converge on any Riemann-integrable function if and only if they converge for every continuous function and \limsup_n\sum_{i\in I_n}\left|w_{ni}\right|=0 where I_n is a sequence of intervals which contains the x_{ni} and the total length of the intervals converges to 0.

A little more interesting is the following counter-example: There is an analytic function for which the Newton-Cotes-formulae on equidistant grids diverge.

Analytic functions are about the most well-behaved functions you may imagine, and even there the Newton-Cotes-method won’t do much good (as long as they’re computed on equidistant grids). Of course, by the theorem we proved, we can already see why: the weights are not bounded and the basic problem is what happened in Runge’s counterexample; the interpolating polynomials don’t have anything to do with the function f that we wish to integrate. As Ouspensky put it:

La formule de Cotes perd tout valeur pratique tant que le nombre des ordonnées devient considérable.

However, the problem is not inherent in Newton-Cotes: if you choose the points of evaluation wisely (as zeroes of Legendre polynomials, or the like, usually clustering them around the endpoints of the interval), the sequence will converge by the theorem that we have proved: the quadrature is then Gaussian, its weights are positive and integrate the constant function 1 properly (which even gives \sum_{i=1}^n w_{ni} = b-a and so b-a =: \Gamma <\infty) – hence, it must converge for any continuous (and in particular: analytic) function. The message is: avoid negative weights in quadrature methods and choose wisely whatever you can choose; not only will you avoid trouble in numerical computations, but you may also keep nice convergence properties.

 

Let us at least sketch the proofs of this counter-example.

Let us switch the notation very slightly and fix it like this: take x_{ni} = \frac in for i=0,\ldots,n, as we try to approximate \int_0^1f(t)dt. Then, Q_n[f] = \sum_{i=0}^nw_{ni}f(\frac in).

We’ll need two preliminaries. First: Asymptotics on the quadrature weights. Second: An estimate for the interpolating polynomial.

First preliminary: Asymptotics on the quadrature weights. As we use Newton-Cotes, the idea is to interpolate and then integrate the corresponding polynomial. The interpolating polynomial can be expressed via the Lagrange polynomials \ell_{ni}(x) := \prod_{j=1, j\neq i}^n\frac{x-x_{nj}}{x_{ni}-x_{nj}} and our idea narrows down to

\displaystyle \int_0^1f(t)dt\approx \int_0^1p(t)dt = \int_0^1\sum_{i=0}^nf(x_{ni})\ell_{ni}(t)dt = \sum_{i=0}^nf(x_{ni})\int_0^1\ell_{ni}(t)dt.

On the other hand, \int_0^1f(t)dt\approx Q_n[f] = \sum_{i=0}^nf(x_{ni})w_{ni}. Hence, the weights w_{ni} must equal \int_0^1\ell_{ni}(t)dt, or to write it down for once

\displaystyle w_{nk} = \int_0^1\frac{nt\cdot (nt-1)\cdots (nt-k+1)\cdot (nt-k-1)\cdots (nt-n)}{k(k-1)\cdots 1\cdot(-1)\cdots(k-n)}dt.

The really troublesome part is an asymptotic formula for the growth of these w_{nk}. Pólya uses all kinds of neat ideas from the theory of the gamma function, far-sighted variable transformations and a lot of computation. This has been done by Ouspensky first (“Sur les valeurs asymptotiques des coefficients de Cotes“; Bull. Amer. Math. Soc., 31, 1925, 145-156), who concluded with the quote given above, that the weights for Newton-Cotes get larger and larger in absolute value and therefore become increasingly useless. The result (which was at least a little generalized and simplified by Pólya) goes like this

\displaystyle w_{nk} = -\frac{1}{n(\log n)^2}\binom{n}{k}\left(\frac{(-1)^k}{k}+\frac{(-1)^{n-k}}{n-k}\right)(1+\eta_{nk})

with \lim_n\sup_k\eta_{nk}=0. In particular, a close inspection shows the special case

\displaystyle w_{n,\frac n2} > 2^n n^{-3}

for even n. Let’s keep this guy in mind.

Second preliminary: An estimate for the interpolating polynomial. As we wish to say something about analytic functions, we can invoke the pretty results of complex analysis. This can tell us about a representation of the interpolation polynomial. In what follows, \omega is, again, the polynomial with zeroes in the points of evaluation.  Of course, in the complex plane interpolation works with the Lagrange polynomials as well:

\displaystyle p(x) = \sum_{i=0}^nf(x_{ni})\prod_{j=0, j\neq i}^n\frac{x-x_{nj}}{x_{ni}-x_{nj}}.

Some close inspections show:

\displaystyle \begin{aligned}    p(x) &= \sum_{i=0}^nf(x_{ni})\frac{\prod_{j=0, j\neq i}^n(x-x_{nj})}{\prod_{j=0, j\neq i}^n(x_{ni}-x_{nj})}\\    &= \sum_{i=0}^nf(x_{ni})\frac{\prod_{j=0}^n(x-x_{nj})}{(x-x_{ni})\sum_{k=0}^n\prod_{\ell=0, \ell\neq k}^n(x_{ni}-x_{n\ell})}\\    &= \sum_{i=0}^n\frac{\omega(x)f(x_{ni})}{(x-x_{ni})\omega'(x_{ni})}.    \end{aligned}

For the sake of seeing how to invoke the residue theorem, we set F(t) := \frac{\omega(x)-\omega(t)}{x-t}f(t) and G(t):=\omega(t). This leads to

\displaystyle p(x) = \sum_{i=0}^n\frac{F(x_{ni})}{G'(x_{ni})} = \sum_{i=1}^n\mathrm{res}_{x_{ni}}\left(\frac{F}{G}\right).

The last equation followed from a little divine intuition – something we can’t avoid in this argument; in order to drop this intuition, we would have needed to start with an eerie expression that would have come from nowhere. So, how does it follow? The function \frac1G has simple poles in the x_{ni}, while F does not (the simple singularities in the x_{ni} cancel out in the numerator to leave us with a polynomial without further zeroes in the x_{ni}). The Laurent expansion of \frac FG starts with the a_{-1}(x-x_{ni})^{-1}+\cdots, a_{-1} being the residue. Multiplying with x-x_{ni}, using the Taylor expansion of G and taking the limit gives

\displaystyle \begin{aligned}    \mathrm{res}_{x_{ni}}\left(\frac{F}{G}\right) = \lim_{x\to x_{ni}}(x-x_{ni})\frac{F(x)}{G(x)} &= \lim_{x\to x_{ni}}(x-x_{ni})\frac{F(x)}{(x-x_{ni})G'(x_{ni})+\cdots} \\    &= \lim_{x\to x_{ni}}\frac{F(x)}{G'(x_{ni})+\cdots} = \frac{F(x_{ni})}{G'(x_{ni})}.    \end{aligned}

The argument looks slightly butchered as we read it the wrong way around – but it works nonetheless. Now comes the residue theorem into play, applying it on any reasonable domain D that contains the x_{ni} on its interior:

\displaystyle \begin{aligned}    p(x) = \sum_{i=1}^n\mathrm{res}_{x_{ni}}\left(\frac{F(\zeta)}{G(\zeta)}\right)&=\sum_{i=1}^n\mathrm{res}_{x_{ni}}\left(\frac{\bigl(\omega(x)-\omega(\zeta)\bigr)f(\zeta)}{(x-\zeta)\omega(\zeta)}\right)\\    &= \frac1{2\pi i}\int_{\partial D}\frac{\omega(x)-\omega(\zeta)}{(x-\zeta)\omega(\zeta)}f(\zeta)d\zeta    \end{aligned}

Re-writing this last expression with the help of Cauchy’s integral formula shows

\displaystyle \begin{aligned}p(x) &= \frac1{2\pi i}\int_{\partial D}\frac{\omega(x)f(\zeta)}{(x-\zeta)\omega(\zeta)}d\zeta - \frac1{2\pi i}\int_{\partial D}\frac{f(\zeta)}{x-\zeta}d\zeta \\    &= \frac1{2\pi i}\int_{\partial D}\frac{\omega(x)f(\zeta)}{(x-\zeta)\omega(\zeta)}d\zeta + f(x).    \end{aligned}

Now, consider a certain domain D in the complex plane

domain

(half-circles of radius 1 around z=0 and z=1, connected by straight lines), in which f is bounded by some constant M. Let f be analytic in this domain D and being interpolated by the polynomial p in the points x_{n0},\ldots,x_{nn} which are located in the interval [0,1]. For \zeta on the boundary of D, we are far away from the interval [0,1] and all x_{ni} in the sense that \left|\omega(\zeta)\right|\geq1 and \left|x-\zeta\right|\geq1, while for x in [0,1] we have \left|\omega(x)\right|<1. Hence,

\displaystyle \left|p(x)-f(x)\right|\leq \left|\frac1{2\pi i}\int_{\partial D}\frac{\omega(x)f(\zeta)}{(x-\zeta)\omega(\zeta)}d\zeta\right| \leq \frac1{2\pi}(2+2\pi)M.

Therefore

\displaystyle \left|\int_0^1p(t)dt\right|\leq\left|\int_0^1p(t)-f(t)dt\right|+\left|\int_0^1f(t)dt\right|\leq \frac{\pi+1}{\pi}M+M\leq 3M.

Let’s keep this one in mind as well.

From all of these preliminaries, let us state the promised counter-example: consider the function

\displaystyle f(x) := \sum_{k=4}^\infty a^{k!}\frac{\sin(k! \pi x)}{-\cos(\pi x)},

with a\in(\frac12, 1) and where the denominator and the starting index k=4 are chosen to ensure that the singularity in x=\frac12 is removable by a positive constant. This function is actually analytic as the uniform limit of analytic functions (the pretty complex analysis again; and here we need a<1 for the summands to actually diminish quickly enough). Besides, it is periodic and one can use this to find its maximum value on the horizontal bounds of the domain: by complex analysis (yet again!) the maximum must be located on the boundary, and if it were on the vertical bounds, we could (by periodicity) shift the domain without losing any information and then contradictively find the maximum on the inside of the domain. Therefore, any summand of which f is constructed may be bounded inside the domain by

\displaystyle \left|a^{k!}\frac{\sin(k! \pi x)}{-\cos(\pi x)}\right| \leq Ca^{k!}\exp(k!\pi \mathrm{Im}(x)) \leq Ca^{k!}\exp(k!\pi),

where the constant C helps to encode the minimum of the cosine on \mathrm{Im}z=1 (it won’t vanish anywhere near this line).

Now that we have bounded the analytic summand of our function f, from our second preliminary, we get

\displaystyle \left|\int_0^1 p(t)dt\right| < 3Ca^{k!}\exp(k!\pi)

and we shall consume the 3 into the constant C without further notice. On top of that, we are about to apply a Newton-Cotes-method; the integral of the interpolating polynomial is equal to what all the quadratures Q_n[f] do:

\displaystyle \left|\sum_{i=0}^nw_{ni}a^{k!}\frac{\sin(k! \pi \frac in)}{-\cos(\pi \frac in)}\right| = \left|\int_0^1 p(t)dt\right| < Ca^{k!}\exp(k!\pi).

Now we see

\displaystyle \begin{aligned}Q_{n!}[f] &= Q_{n!}\left[\sum_{k=n}^\infty a^{k!}\frac{\sin(k!\pi x)}{-\cos(\pi x)}\right]+Q_{n!}\left[\sum_{k=4}^{n-1} a^{k!}\frac{\sin(k!\pi x)}{-\cos(\pi x)}\right]\\    &\geq Q_{n!}\left[\sum_{k=n}^\infty a^{k!}\frac{\sin(k!\pi x)}{-\cos(\pi x)}\right]-\sum_{k=4}^{n-1}Ca^{k!}\exp(k!\pi)\\    &= w_{n!, \frac{n!}{2}}\sum_{k=n}^\infty k! a^{k!} -\sum_{k=4}^{n-1}Ca^{k!}\exp(k!\pi),    \end{aligned}

since in case k\geq n, \sin(k!\pi x)=0 for any relevant x_{n!,i}=\frac i{n!} and all terms in Q_{n!} vanish except for x_{n!,\frac12n!}=\frac12 where the denominator vanishes as well and the singularity is removable by taking the value k!.

This is the time to take our first preliminary out of the drawer:

\displaystyle \begin{aligned}    Q_{n!}[f] &> \frac{2^{n!}}{(n!)^3} n! a^{n!} - (n-4)C(ae^\pi)^{(n-1)!}\\    &= \frac{(2a)^{n!}}{(n!)^2}\left(1-\bigl(n!\bigr)^2(n-4)C\left(\frac{ae^\pi}{(2a)^n}\right)^{(n-1)!}\right)\end{aligned}.

As a>\frac12, the term in the large brackets converges to 1 (n! grows slower than (2a)^{n!} does), and the entire term diverges. In particular, the Newton-Cotes-method diverges for f. That’s what we wanted to show. Oof. q.e.d.

 

So, we have found necessary and sufficient conditions for quadrature methods to converge and we dealt with a striking counter-example for a plausible method on a reasonably nice class of functions to make us dismiss this plausible method. However, whenever we spoke about convergence, we didn’t deal with the speed of convergence – and we couldn’t have. As Davis and Rabinovitz have rather un-nervedly put it: Nothing positive can be said about the degree of convergence. Lipow and Stenger (“How slowly can quadrature Formulas converge?“; Math. Comp. 26, 1972, 917-922) have shown that all quadrature methods will converge arbitrarily slowly on the space of continuous functions: for every sequence a_n converging to 0, there is a continuous function f such that \int_0^1f(t)dt - Q_n[f] \geq \varepsilon_n.

Before closing this chapter, let’s say that in practical applications it’s hard to know whether your approximation to the integral is sufficient. The convergence theorems don’t speak about the speed of convergence, the error estimates deal with sup-norms of derivatives. Most applications deal with adaptive methods (which we haven’t covered here) to have a closer look where the function might become erratic, and in the end the methods are used with a little hope in the back of the user’s minds that stable methods with a high degree of exactness will do sufficiently well. There can be no certainty, however. But still, it’s some pretty mathematics to spend your time with.

On approximating functions – Part Two

The last time, we have proved the Stone-Weierstass-Theorem on how to find approximating functions to any continuous function on a compact interval only using a certain algebra of “easy” functions – such as polynomials.

Now that the hard, abstract work is done, we shall note that there are other, less abstract ways to prove Weierstrass’ Approximation Theorem. I am unaware of a less abstract method for Stone-Weierstrass, in particular the method given above is non-constructive. You could go through most of these things, but the way that we used compactness is not constructable in general – how do you choose the finite covering, as you only know there should exist one?

Let us therefore look at Bernstein Polynomials, which are associated with a given function f on [0,1]:

\displaystyle f_n(x) = \sum_{k=0}^n\binom{n}{k}f\left(\frac kn\right)x^k(1-x)^{n-k}.

One can prove:

 

Proposition: Let f:[0,1]\to\mathbb{R} be continuous. Then the Bernstein Polynomials converge uniformly to f.

 

Proof: Since f is continuous on a compact interval, it is also uniformly continuous. Let \varepsilon>0. Then, there is some \delta>0 with \left|f(x)-f(y)\right|<\varepsilon if \left|x-y\right|<\delta. Besides, for any x\in[0,1],

\displaystyle \left|f(x)-f_n(x)\right| \leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right|x^k(1-x)^{n-k},

where we have used the standard fact that

\displaystyle\sum_{k=0}^n\binom nk x^k(1-x)^{n-k} = \bigl(x+(1-x)\bigr)^n = 1,

which will (if used again and together with uniform continuity) yield

\displaystyle \left|f(x)-f_n(x)\right| \leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right|x^k(1-x)^{n-k}\mathbf{1}_{\left|x-\frac kn\right|\geq\delta} + \varepsilon.

So, we have split our problem in two parts: in the first part, if we want to approximate f well, it’s easy because an interpolation point for f_n is close by. If this is not the case, the interpolation point is far away, meaning:

\displaystyle \left|x-\frac kn\right|\geq\delta\quad\iff\quad 1 \leq \frac{\bigl(x-\frac kn\bigr)^2}{\delta^2}.

On top of that, by continuity, f will be bounded by a constant M, say. This gives:

\begin{aligned}    \left|f(x)-f_n(x)\right| &\leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right| x^k(1-x)^{n-k}\frac{\bigl(x-\frac kn\bigr)^2}{\delta^2} + \varepsilon \\    &\leq \frac{2M}{\delta^2}\sum_{k=0}^n\binom{n}{k}\left(x-\frac kn\right)^2 x^k(1-x)^{n-k} + \varepsilon.    \end{aligned}

As far as ideas are concerned, we’re done. All that is left to do is to shove terms and binomial coefficients from left to right and to find out that they behave properly. We won’t do that here, as it’s not particularly hard. One will find:

\displaystyle\left |f(x)-f_n(x)\right| \leq \frac{2M}{\delta^2}\left(x^2-2x^2+x^2+\frac{x(1-x)}{n}\right) + \varepsilon\leq \frac{2M}{\delta^2}\frac1{4n}+\varepsilon < 2\varepsilon,

if n is chosen sufficiently large. q.e.d.

 

There is a variant of the proof which avoids some of these messier calculations that we didn’t pull through above. Well, this variant doesn’t avoid the calculations, it rather shoves them away to another concept: the Weak Law of Large Numbers. We rely on Klenke’s book on Probability for this (but the idea is due to Bernstein).

 

Alternative Proof: Let x\in[0,1], and let x_1,\ldots,X_N be independent random variables with a Bernoulli distribution with parameter x (the random experiment of throwing one coin and its chance of landing on heads is x). Then, the sum S_n = \sum_{i=1}^nX_i has a binomial distribution B_{n,x}. Therefore,

\displaystyle E\left[f\left(\frac{S_n}n\right)\right] = \sum_{k=0}^nf\left(\frac kn\right)P[S_n=k] = f_n(x).

We find, again with M = \left\|f\right\|_\infty,

\displaystyle \left|f\left(\frac{S_n}n\right) - f(x)\right| \leq \varepsilon + 2M \mathbf{1}_{|\frac{S_n}n-x|\geq\delta}

and hence

\displaystyle \left|f_n(x)-f(x)\right| \leq E\left[\left|f\left(\frac{S_n}n\right)-f(x)\right|\right] \leq \varepsilon + 2MP\left[\left|\frac{S_n}{n}-x\right|\geq\delta\right].

By the Chebyshev Inequality (which is just shy from using the weak law of large numbers itself), since E[\frac{S_n}n] = x and \mathrm{Var}[\frac{S_n}n] = \frac1{n^2}nx(1-x)\leq\frac1{4n}, we have

\displaystyle P\left(\left|\frac{S_n}n-x\right|\geq\delta \right) = P\left(\left|\frac{S_n}n-E\left[\frac{S_n}n\right]\right|\geq\delta\right) \leq \mathrm{Var}\left[\frac{S_n}n\right]\delta^{-2} = \frac{1}{4n\delta^2}.

Hence,

\displaystyle \left|f_n(x)-f(x)\right| \leq \varepsilon + \frac{2M}{4\delta^2n} < 2\varepsilon,

if n is taken large enough. q.e.d.

 

These two have been more direct ways of proving the classical Weierstrass Approximation Theorem. It is what Werner’s book on Functional Analysis does, and so does the treatment of Körner’s on Fourier Series. The pro side is, it’s a direct and simple proof. The downside is, however, you won’t get the stronger theorems given at the start of this text; it doesn’t even say anything about trigonometric polynomials.

But, there’s a slight hope: actually, in the non-probabilistic approach, we only needed to prove that Bernstein Polynomials can deal well with the monomials 1, x and x^2 (that’s the part that we have avoided, since there is some computation to be done). Maybe one can generalize this one – and one can: this is Korovkin’s approach.

 

Theorem (Korovkin): Let T_n:\mathcal{C}_{\mathbb{R}}(K)\to\mathcal{C}_{\mathbb{R}}(K) a sequence of positive linear operators, and let q\subset \mathcal{C}_{\mathbb{R}}(K) be a test set that satisfies certain conditions on positivity and on its zero set.

If for any q\in Q, we have \lim_{n\to\infty}\left\|T_nq-q\right\|_\infty=0, then we also have \lim_{n\to\infty}\left\|T_nf-f\right\|_\infty=0 for all f\in\mathcal{C}_{\mathbb{R}}(K).

 

Remember that an operator is just a function on a function space (plug a function in, get a function out). It is linear by the usual definition of that term, and by positive one means T_nf\geq0 as long as f\geq0 everywhere it is defined.

Korovkin tells us that you only need to approximate certain functions well with a cleverly chosen operator, to make the classical Weierstrass-Theorem work. This is what can be achieved by Bernstein Polynomials (they are a sequence of operators which only use finitely many values of a function) and the test set q=\{1, x, x^2\}. You can also prove the Theorem on trigonometric polynomials with the operator T_n chosen as the Fejér-kernel and the test set q=\{1, \cos, \sin\}. We don’t delve into this here, especially not in the test set q; let’s just say that there is one and in the given examples you can’t make it smaller. In the detailed proof, there is some computation required and you can actually learn many things about Fourier series (so go into it; for instance in the books by Werner on Functional Analysis, or in Hämmerlin’s book on Numerics – however, to place a quote by Gauss in here: Most readers may well wish in some places for greater clarity in the presentation).

We won’t prove that here because for the main applications there is not much to gain on top of what we already have. But the approach has a certain elegance to it which is especially captured by functional analysts. I am not aware of generalizations of its applicability, I’d be very interested to hear about them.

 

As a final alternate proof of the full real version of Stone-Weierstrass Theorem (not just Weierstrass’ Theorem), we shall look at an elementary approach suggested by Brosowski and Deutsch in 1981 (see the original paper here). There is no dealing with convergent series, operators and the like; the proof only hinges on definitions of compactness and continuity and three easily proved ideas: closed subsets of compact spaces are compact, continuous positive functions have a positive minimum on compact sets, and the Bernoulli inequality (1+h)^n\geq 1+nh for any n\in\mathbb{N} and h\geq-1. We will give two lemmas before we are able to prove the real version of Stone-Weierstrass.

The proof will actually show a little bit more, since we need not suppose that K\subset\mathbb{R}. In fact, Stone-Weierstrass holds for any real-valued function on a compact topological space (it’s just that we didn’t make a fuss about that earlier on – most applications deal with functions on \mathbb{R}, as a matter of fact).

 

The pointwise nearly indicator Lemma: Let x_0\in K and let U be a neighborhood of x_0. Then, for any \varepsilon>0 there is some neighborhood V(x_0) of x_0 with V\subset U and some function g\in\mathcal{A} such that

  • 0\leq g(x)\leq 1 for x\in K,
  • g(x)<\varepsilon for x\in V(t_0),
  • g(x)>1-\varepsilon for x\in K\setminus U.

 

The lemma got its name because g serves as an indicator function (or characteristic function for the non-probabilists out there) for the set K\setminus U. Of course, the algebra \mathcal A is not rich enough for a proper indicator function, since that wouldn’t be continuous. But the algebra approximates this indicator of K\setminus U, and we can guarantee the function g to nearly vanish in some neighborhood of the given point x_0. We’ll improve on that soon.

 

Proof: First of all, we need to find a function that takes its values only in the interval [0,1] and which already has a controllably small value in x_0. What easier way to pick that small value as 0?

For any y\in K\setminus U, by separation, there is some function g_y\in\mathcal{A} with g_y(y)\neq g_y(x_0). Besides, the function h_y(x) = g_y(x)-g_y(x_0) is in \mathcal{A} and of course h_y(y)\neq h_y(x_0) = 0. Even the function p_y(x) = \frac1{\left\|h_y^2\right\|_\infty}h_y^2(x) is in \mathcal{A}, we have p_y(x_0) = 0, p_y(y)>0 and 0\leq p_y\leq 1.

That was the first important step. p_y vanishes at x_0 and it doesn’t vanish in some point outside of U. Now, we need to push the “doesn’t vanish” part to make our function take values near 1, uniformly outside of U.

Consider the set W(y)=\{x\in K : p_y(x)>0\}. This is a neighborhood of y, hence its name. We can look at any of those W(y), whenever y\in K\setminus U. Now, K\setminus U is compact (it’s a closed subset of a compact space), and hence finitely many W(y) will suffice to cover K\setminus U, let’s say K\setminus U \subset\bigcup_{k=1}^m W(y_k). The function p(x) = \frac1m\sum_{k=1}^mp_{t_k}(x) is in \mathcal{A}, we still have 0\leq p\leq1, p(x_0)=0 and p(x)>0 for any x\in K\setminus U.

Because of compactness, p must take a positive minimum on the compact set K\setminus U, let’s call it \delta. Then, for this \delta\in(0,1], we look at the set V=\{x\in K : p(x)<\frac\delta2\}. Obviously, V\subset U and V is a neighborhood of x_0.

Let us look at the integer K = \left\lceil \frac1\delta+1\right\rceil. Thus, K-1\leq\frac1\delta and K\leq\frac{1+\delta}{\delta} \leq \frac2\delta. In total, 1< k\delta \leq 2. We can therefore, for every n\in\mathbb{N}, define the functions q_n(x) = \bigl(1-p^n(x)\bigr)^{k^n}, which, of course, are in \mathcal{A}, which have 0\leq q_n\leq 1 and which have q_n(x_0)=1.

Let x\in V. Then, Kp(x)< \frac{k\delta}2\leq 1 and Bernoulli’s inequality (used with n := k^n and h:= -p^n(x)\geq -1) shows q_n(x) \geq 1-k^np^n(x)\geq 1-\left(\frac{k\delta}2\right)^n. Therefore, \lim_{n\to\infty}q_n(x) = 1, and this limit is uniform in x\in V.

Let x\in K\setminus U. Then, Kp(x)\geq k\delta> 1 and Bernoulli’s inequality shows

\begin{aligned}    q_n(x) &= \frac{1}{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n} k^np^n(x) \\    &\leq \frac1{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n}\bigl(1+k^np^n(x)\bigr)\\    &\leq \frac{1}{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n}\bigl(1+p^n(x)\bigr)^{k^n} \\    &= \frac{1}{k^np^n(x)}\bigl(1-p^{2n}(x)\bigr)^{k^n} \\    &\leq \frac{1}{k^n\delta^n},    \end{aligned}

and this converges to 0 uniformly.

In particular, for any \varepsilon>0, we can find some n large enough such that q_n\leq\varepsilon on K\setminus U and q_n\geq 1-\varepsilon on V. Now, to conclude the proof, pick V(x_0) = V (remember that V was a neighborhood of x_0 by construction) and g(x) = 1-q_n(x). q.e.d.

 

This one felt quite chore-ish. It was a little messy and technical, but note the necessary tools: no operator theory, no convergence of series, just stuff from high school calculus and the very basics of the theory of compact sets. That would seem like a value in and of itself.

 

The setwise nearly indicator Lemma: Let A,B\subset K disjoint and closed. Let \varepsilon\in(0,1). Then, there is some g\in\mathcal{A} with

  • 0\leq g(x)\leq 1 for x\in K,
  • g(x)<\varepsilon for x\in A,
  • g(x)>1-\varepsilon for x\in B.

 

Obviously, this lemma improves on the first one, we have nearly an indicator function on B which nearly vanishes on A. In this respect, it resembles Urysohn’s Lemma from elementary topology, but Urysohn isn’t restricted to the algebra \mathcal A and can hence do even a little better. Never mind.

 

Proof: This is going to be a lot quicker: you know the drill. Let \varepsilon>0. Let U=K\setminus B and let y\in A. We can choose neighborhoods V(y) as in the previous Lemma. Of course, since A is compact, finitely many of those V(y) suffice to cover A, let’s say A\subset\bigcup_{k=1}^m V(y_k). For each of the V(y_k), there is a function g_k\in\mathcal{A} such that 0\leq g_k\leq 1, g_k(x)>1-\frac\varepsilon m for all x\in K\setminus U = B and g_k(x)<\frac\varepsilon m for all x\in V(y_k).

The function g(x) = \prod_{k=1}^mg_k(x) will now do the trick: Of course, we have 0\leq g\leq1.

Let x\in A\subset\bigcup_{k=1}^mV(y_k), in particular x\in V(y_{n_0}), say. Then

\displaystyle g(x) = \prod_{k=1}^mg_k(x)= g_{n_0}(x)\prod_{k=1, k\neq n_0}^mg_k(x) \leq \frac\varepsilon m\prod_{k=1, k\neq n_0}^m1 \leq\varepsilon.

Now, let x\in B. Then Bernoulli tells us that g(x)=\prod_{k=1}^mg_k(x)\geq\left(1-\frac\varepsilon m\right)^m \geq 1-\varepsilon. Note that we need \varepsilon\leq m here (which is not a very hard restriction anyway – nevertheless, we included some condition like that in the assertion).

That’s all we needed to show. q.e.d.

 

Note that the notion of compactness has been used rather slightly differently from the very first proof we gave for Stone-Weierstrass. Then, we had the Approximation Lemma which used continuity and direct neighborhoods where we could approximate f adequately. Here, we have decomposed the underlying set K by using the level sets of f to approximate it. Then, compactness tells us every time that we can do with finitely many sets – but the nature of those sets is ever so slightly different. The math, however, is not.

Now, let’s harvest the ideas:

 

Proof of the Stone-Weierstrass, real version:

Let f\in\mathcal{C}_{\mathbb{R}}(K) and let \varepsilon>0. Without loss of generality, we may assume that f\geq 0, since we might as well replace it by f+\left\|f\right\|_\infty. Besides, \varepsilon <\frac13 is not much of a restriction.

Let n\in\mathbb{N} be such that (n-1)\varepsilon>\left\|f\right\|_\infty. Let, for j=0,\ldots,n, A_j and B_j be the sets

\displaystyle A_j = \left\{x\in K : f(x)\leq \left(j-\frac13\right)\varepsilon\right\}

\displaystyle B_j = \left\{x\in K : f(x)\geq \left(j+\frac13\right)\varepsilon\right\}.

Obviously, those sets are disjoint for fixed j, and the A_j are increasing to A_n = K, while the B_j are decreasing to B_n = \emptyset. The setwise nearly indicator Lemma then says that there is some g_j\in\mathcal{A} with 0\leq g_j\leq 1, g_j(x) < \frac\varepsilon n for x\in A_j and g_j(x) > 1-\frac{\varepsilon}{n} for x\in B_j.

That’s all well and good – but how will this help us to prove Stone-Weierstrass? Imagine the space K decomposed like an onion into the A_j. The functions g_j will have very small values on the A_j, but large values outside, in the B_j. We will combine these small and large values properly, to have similar values as the function f in every ring of this onion. Note that the values of f are closely attached to the g_j – by the very definition of A_j and B_j.

Now consider the function g(x) = \varepsilon\sum_{k=0}^n g_j(x), which of course is in \mathcal{A}. Let x\in K, then there is some j\geq 1 with x\in A_j\setminus A_{j-1}, and so

\displaystyle \left(j-\frac43\right)\varepsilon < f(x) \leq \left(j-\frac13\right)\varepsilon,

and of course g_i(x) < \frac\varepsilon n for all i\geq j. Besides, our fixed x\in K is in B_i for i\leq j-2. Therefore, g_i(x)\geq 1-\frac\varepsilon n. All of this tells us (remember that \varepsilon<\frac13)

\begin{aligned}    g(x) = \varepsilon\sum_{k=0}^{j-1}g_k(x) + \varepsilon\sum_{k=j}^ng_k(x)&\leq j\varepsilon + \varepsilon(n-j+1)\frac{\varepsilon}{n} \\    &\leq j\varepsilon + \varepsilon^2 < j\varepsilon + \frac\varepsilon3 = \left(j+\frac13\right)\varepsilon.    \end{aligned}

Now, if j\geq2,

\begin{aligned}    g(x) \geq \varepsilon\sum_{k=0}^{j-2}g_k(x) &\geq (j-1)\varepsilon \left(1-\frac\varepsilon n\right) \\    &= (j-1)\varepsilon - \left(\frac{j-1}{n}\right)\varepsilon^2 > (j-1)\varepsilon - \varepsilon^2 > \left(j-\frac43\right)\varepsilon.    \end{aligned}

Note that, for j = 1, g(x) > \bigl(j-\frac43\bigr)\varepsilon appears to be correct.

To conclude, we wish to show something about \left|f(x)-g(x)\right|. Let us consider the case f(x)\geq g(x) first:

\displaystyle\left|f(x)-g(x)\right| = f(x)-g(x) \leq \left(j-\frac13\right)\varepsilon - \left(j-\frac43\right)\varepsilon = \varepsilon.

The other case is f(x) < g(x):

\displaystyle\left|f(x)-g(x)\right| = g(x) - f(x) \leq \left(j+\frac13\right)\varepsilon - \left((j-2)+\frac13\right)\varepsilon = 2\varepsilon.

This shows, that in every case f and g differ by at most 2\varepsilon, uniformly on K. q.e.d.

 

To conclude, let us mention some little question that is related to the classical Weierstrass Aproximation Theorem. If we have all monomials x^n at our disposal, we can uniformly approximate any continuous function; but what if some monomials are missing? With how few of them can the Approimation Theorem still hold? The surprisingly neat answer is given in Müntz’ Theorem (sometimes called the Müntz-Szász-Theorem for the sake of more signs on top of the letters):

 

Theorem (Müntz): Let \lambda_k be a strictly increasing sequence of natural numbers with \lambda_1>0. Then, we have

\displaystyle \overline{\mathrm{span}\{1, x^{\lambda_1},x^{\lambda_2},\ldots,\}} = \mathcal{C}_{\mathbb{R}}([0,1]) \qquad\iff\qquad \sum_{k=1}^\infty\lambda_k^{-1}=\infty.

 

The span is meant as the linear space generated by the given monomials, i.e. the polynomials that can be constructed using only those monomials.  Hence, the Weierstrass Theorem can even hold if you disallow to use infinitely many monomials. However, you will always need the monomial x^0=1, since otherwise your approximating functions could only satisfy f(0)=0.

The proof of that theorem is not particularly easy, especially the part where you show that the series must diverge if you can approximate any function. For the other direction there has been a very elementary proof that we shall give here because of its inherent beauty:

Proof of Müntz’ Theorem, \Leftarrow-part:

Let \sum_{k=1}^\infty\lambda_k^{-1}=\infty.

Let q\in\mathbb{N} be such that q\notin\{\lambda_1,\lambda_2,\ldots\}. Without loss of generality, we assume that q is smaller than any of the \lambda_k: just take away those \lambda_k which are smaller than q and re-enumerate the \lambda_k starting from k=1; the condition \sum_{k=1}^\infty\lambda_k^{-1}=\infty will then still hold.

We will show that x^q\in\overline{\mathrm{span}\{x^{\lambda_1},x^{\lambda_2},\ldots\}}. This will imply that all monomials x^n are in the closure of that span, and if we have that, Weierstrass tells us that we can approximate any continuous function.

Note that we don’t need the constant function 1 to approximate x^q, since for our purposes the relevant functions all have f(0)=0.

Now consider the function

\displaystyle Q_n(x) := x^q-\sum_{k=1}^na_{k,n}x^{\lambda_k}

with certain a_{k,n}\in\mathbb{R}. Then first of all, Q_0(x) = x^q, and if we choose certain a_{k,n}, we find the recursion

\displaystyle Q_n(x) = (\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt.

Let’s prove this by induction:

\begin{aligned}    (\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt &= (\lambda_n-q)x^{\lambda_n}\int_x^1\left(t^q-\sum_{k=1}^{n-1}a_{k,n-1}t^k\right)t^{-1-\lambda_n}dt\\    &=(\lambda_n-q)x^{\lambda_n}\int_x^1 t^{q-1-\lambda_n}-\sum_{k=1}^{n-1}a_{k,n-1}t^{k-1-\lambda_n}dt\\    &=(\lambda_n-q)x^{\lambda_n}\left(\left[\frac1{q-1-\lambda_n+1}t^{q-1-\lambda_n+1}\right]_x^1\right. \\    &\hphantom{=}\left.- \sum_{k=1}^{n-1}a_{k,n-1}\left[\frac1{k-1-\lambda_n+1}t^{k-1-\lambda_n+1}\right]_x^1\right)\\    &=(\lambda_n-q)x^{\lambda_n}\left(\frac1{q-\lambda_n}(1-x^{q-\lambda_n}) \vphantom{\sum_{k=1}^n}\right.\\    &\hphantom{=}-\left.\sum_{k=1}^{n-1}a_{k,n-1}\frac1{k-\lambda_n}(1-x^{k-\lambda_n})\right)\\    &=x^{\lambda_n}(x^{q-\lambda_n}-1)-(\lambda_n-q)\sum_{k=1}^{n-1}\frac{a_{k,n-1}}{k-\lambda_n}(1-x^{k-\lambda_n})\\    &=x^q-x^{\lambda_n}-(\lambda_n-q)\sum_{k=1}^{n-1}\frac{a_{k,n-1}}{k-\lambda_n}(1-x^{k-\lambda_n})\\    &=: x^q - \sum_{k=1}^na_{k,n}x^{\lambda_k}.    \end{aligned}

Obviously, one could try to define the a_{k,n} appropriately or to give some closed formula expression; but it doesn’t seem to be useful for the problem at hand.

Now, we have \left\|Q_0\right\|_\infty = 1, and

\begin{aligned}    \left\|Q_n\right\|_\infty &= \left\|(\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt\right\|_\infty \\    &\leq \left|\lambda_n-q\right|\left\|x^{\lambda_n}\right\|_\infty\left\|Q_{n-1}\right\|_\infty\left\|\int_x^1t^{-1-\lambda_n}dt\right\|_\infty\\    &=\left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\left\|-\frac1{\lambda_n}(1-x^{-\lambda_n})\right\|_\infty\\    &=\left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\frac1{\left|\lambda_n\right|}\left\|1-x^{-\lambda_n}\right\|_\infty \\    &=\left|1-\frac q{\lambda_n}\right|\left\|Q_{n-1}\right\|_\infty.    \end{aligned}

This shows

\displaystyle \left\|Q_n\right\|_\infty\leq\prod_{k=1}^n\left|1-\frac q{\lambda_k}\right|,

and hence

\displaystyle \lim_{n\to\infty}\left\|Q_n\right\|_\infty \leq \prod_{k=1}^\infty\left|1-\frac q{\lambda_k}\right|.

But the infinite product will be 0, because 0<\frac q{\lambda_k}<1 by assumption, and because of the basic inequality 1-x\leq e^{-x} for any x>0, we have

\displaystyle \prod_{k=1}^\infty\left|1-\frac q{\lambda_k}\right| \leq \prod_{k=1}^\infty\exp\left(-\frac q{\lambda_k}\right) = \exp\left(-\sum_{k=1}^\infty \frac q{\lambda_k}\right) = 0.

Hence, Q_n vanishes uniformly as n\to\infty, and so x^q is being approximated uniformly by some polynomials using only the allowed monomials x^{\lambda_k}. This concludes the proof. q.e.d.

 

At first, I was excited because I thought one could reverse this idea to show the other direction of Müntz’ Theorem as well. However, it doesn’t work because of the second-to-last inequality given above: If the product is finite, that doesn’t say anything about how well (or how badly) x^q is approximated. The proof of the reverse direction is harder and I am only aware of deeper ideas from Fourier Analysis (that’s historically the first proof) or using the Hahn-Banach Theorem (as in Rudin’s book).

But the elementary idea given here is actually very neat, even though it doesn’t, to the best my knowledge, solve any real-world problems. A very pretty academic finger-exercise. And this should be a good place to stop.

On approximating functions – Part One

Let us have a closer look at the Stone-Weierstrass-Theorem. It is a generalization of Weierstrass’ classical approximation theorem on polynomials which stands at the base of approximation theory. Stone took Weierstrass’ classical statement and rendered it more abstractly, so it can be used for a number of things in proofs about continuous functions. In practice, I have rarely seen it used in its pure form – maybe because it lacks a direct statement about the approximation error, and because you can base more powerful statements upon it.

Let us state the Stone-Weierstrass-Theorem in its pure form first, and its two most useful corollaries next, before we turn to different methods to prove them and to a deeper rendition of their significance.

We shall denote by \mathcal{C}(K) the set of continuous complex-valued functions on the set K, and by \mathcal{C}_{\mathbb{R}}(K) the set of real-valued continuous functions on the set K.

Remember that an algebra \mathcal{A} over a field f is a set which satisfies the following conditions:

  • for any f\in \mathcal{A} and any \alpha\in F, \alpha f\in \mathcal{A}, and
  • for any f, g\in \mathcal{A}, f+g\in \mathcal{A} and f\cdot g\in \mathcal{A}.

Also, we shall call a family of functions on the set x separating, if for any x, y\in X, with x\neq y, there is some f\in \mathcal{A} such that f(x)\neq f(y).

 

Stone-Weierstrass-Theorem (real version): Let K\subset\mathbb{R} be a compact set and let \mathcal{A}\subset\mathcal{C}_{\mathbb{R}}(K) a separating algebra with the constant function 1\in\mathcal{A}.

Then, \mathcal{A} is dense in \mathcal{C}_{\mathbb{R}}(K) in the topology of uniform convergence.

Equivalently, \overline{\mathcal{A}} = \mathcal{C}_{\mathbb{R}}(K). Equivalently, for every \varepsilon>0 and every f\in\mathcal{C}_{\mathbb{R}}(K), there is some g\in\mathcal{A} with \left\|f-g\right\|_\infty<\varepsilon.

 

Stone-Weierstrass-Theorem (complex version): Let K\subset\mathbb{C} be a compact set and let \mathcal{A}\subset\mathcal{C}(K) a separating algebra with the constant function 1\in\mathcal{A} and for any g\in\mathcal{A} let \overline g\in\mathcal{A}.

Then, \mathcal{A} is dense in \mathcal{C}(K) in the topology of uniform convergence.

 

Corollary: Weierstrass’ classical Theorem: For every continuous real-valued function f on an interval [a,b] there is a sequence of polynomials p_n which converges uniformly to f on [a,b].

 

Corollary: On trigonometric polynomials: For every continuous 2\pi-periodic real-valued function f there is a sequence of trigonometric polynomials \sum_{j=0}^n \alpha_j\cos(jx)+\beta_j\sin(jx) which converges uniformly to f.

 

In this second corollary, the objects don’t look like polynomials at all, but in the proof we will give a hint on why the series can be called that. It has to do with the representation (e^{iz})^j.

The conditions in the theorem are quite natural. You need some sort of richness in your set of approximating functions, this is achieved by the separation: if there were two points x\neq y and you couldn’t find a separating f\in\mathcal A, then how could you properly approximate any given continuous function which takes different values at x and y? You could never, well, separate the values with what the algebra provides you. Hence, this is obviously a necessary condition. As Hewitt/Stromberg put out nicely, this obviously necessary condition is also sufficient in the real case – that makes some of the beauty and elegance of the theorem.

In the complex version, there is some sort of extra-richness that we must provide for the theorem to hold. We obviously can’t do without that: Let us think of the family of polynomials on the unit disk in \mathbb{C}. They are an algebra, alright, they contain the continuous functions, and they are separating (even the identity polynomial f(x)=x can do that). But they are holomorphic as well, and the uniform limit on a compact set of holomorphic functions will be holomorphic again (this follows very naturally from Cauchy’s Integral Formula) – but certainly, there are continuous functions on the unit disk which are not holomorphic, so there needs to be an extra assumption in the Stone-Weierstrass-Theorem. And it follows very smoothly, that the lacking assumption can already be met by demanding that complex conjugates be contained in \mathcal{A} as well. Let us show this, why not:

Proof of the complex version basing on the real version

The real and imaginary part of any g\in\mathcal{A} can be represented as

\displaystyle\mathrm{Re}g=\frac{g+\overline g}{2}

and

\displaystyle\mathrm{Im}g=\frac{g-\overline g}{2i},

which means that \mathrm{Re}g, \mathrm{Im}g \in \mathcal{A}, by assumption. Moreover, if x, y\in K and x\neq y, then by separation there is some g\in\mathcal{A} which gives us g(x)\neq g(y), and that means at least \mathrm{Re}g(x)\neq\mathrm{Re}g(y) or \mathrm{Im}g(x)\neq\mathrm{Im}g(y). So the family of the real parts of the functions in \mathcal{A} is separating and thus meets the conditions of the real version of the theorem, and so does the family of imaginary parts. The theorem yields that either one of these families is dense in \mathcal{C}_{\mathbb{R}}(K).

Hence, for any f\in\mathcal{C}(K) and for any \varepsilon>0, there are some g_0, g_1\in\mathcal{A} such that

\displaystyle\left\|\mathrm{Re}f - g_0\right\|_\infty < \frac\varepsilon2

and

\displaystyle\left\|\mathrm{Im}f - g_1\right\|_\infty < \frac\varepsilon2.

Thus,

\begin{aligned}\displaystyle\left\|f - (g_0+ig_1)\right\|_\infty &= \left\|\mathrm{Re}f - g_0 + i(\mathrm{Im}f - g_1)\right\|_\infty \\    &\leq \left\|\mathrm{Re}f - g_0\right\|_\infty + \left\|\mathrm{Im}f - g_1\right\|_\infty \\ &< \varepsilon.    \end{aligned}

Since g_0+ig_1\in\mathcal{A}, the complex version is proved. q.e.d.

 

While we’re at it, let’s give the proofs for the corollaries now.

Proof of Weierstrass’ classical Theorem

That’s rather simple: the set of polynomials is obviously an algebra, and it’s separating because you can find the identity polynomial g(x)=x in it. The conditions of the theorem are met – the corollary follows. q.e.d.

 

Proof of the statement on trigonometric polynomials

This follows from the complex version. Let us look at some 2\pi-periodic function f:\mathbb{R}\to\mathbb{R}. The domain of f is not compact, but all that matters is the compact interval [0,2\pi].

Now consider the set of functions \sum_{j=-n}^nc_jz^j with z\in [0,2\pi], which obviously form a separating algebra. The conditions of the real version are thus met. Those functions can approximate the function f on [0,2\pi] uniformly. We are interested in trigonometric polynomials, however, and so far, for any \varepsilon>0, we’ve only found some appropriate n and c_j with

\displaystyle\left\|f(z) - \sum_{j=-n}^nc_jz^j\right\|_\infty < \varepsilon.

Of course, that’s not news. We already knew that c_j=0 for j<0 from the real version. But we can now consider the bijection z\mapsto e^{iz}, which transforms our setting to the compact complex set S^1. Let us take g(e^{iz}):=f(z) which maps g:S^1\to\mathbb{R}. We try to apply the complex version to this function g and this time we use the approximating functions \sum_{j=-n}^nc_je^{ijz} (in order to show the resemblance with the approximating functions for f, they can also be written like \sum_{j=-n}^nc_k(e^{iz})^j). They are still an algebra containing constant functions, they are separating because e^{iz} is a bijection on S^1, and now we need to ensure the extra assumption on complex conjugation – but that’s not a problem because of \overline{e^{ijz}} = e^{-ijz}. The complex version thus holds and we can find some appropriate n and c_j with

\displaystyle\left\|g(e^{iz}) - \sum_{j=-n}^nc_je^{ijz}\right\|_\infty < \varepsilon.

Now, we can represent c_j = a_j+ib_j with a_j,b_j\in\mathbb{R}, and e^{ijz} = \cos(jz)+i\sin(jz), which yields

\displaystyle\left\|g(e^{iz}) - \sum_{j=-n}^n\bigl(a_j\cos(jz)-b_j\sin(jz)\bigr) - i\sum_{j=-n}^n\bigl(a_j\sin(jz)+b_j\cos(jz)\bigr)\right\|_\infty < \varepsilon,

but we had chosen g(e^{iz})=f(z) and this is real by assumption (no imaginary part). So:

\displaystyle\left\|f(z) - \sum_{j=-n}^n\bigl(a_j\cos(jz)-b_j\sin(jz)\bigr)\right\|_\infty < \varepsilon.

Finally, we employ the relations \cos(-x) = \cos(x) and \sin(-x) = -\sin(x) to find

\displaystyle\left\|f(z) - a_0 - \sum_{j=1}^n \bigl((a_j+a_{-j}) \cos(jz) - (b_j-b_{-j})\sin(jz)\bigr)\right\|_\infty < \varepsilon.

By appropriate definitions of the coefficients \alpha_j and \beta_j, we have proved the corollary:

\displaystyle\left\|f(z) - \sum_{j=0}^n \bigl(\alpha_j\cos(jz)+\beta_j\sin(jz)\bigr)\right\|_\infty < \varepsilon. q.e.d.

 

In a way, that was surprisingly non-smooth for the proof of a corollary. But the Fourier series people are thankful for a proof like that. We shall see a sketch of another proof of that later, but this one here does without explicit computations of integrals and fuzzing around with sine-cosine-identities.

Up until now, everything has hinged on the real version of the Stone-Weierstrass-Theorem. In order to give a self-contained proof of this, we are going to need several lemmas and a little work. We take the proof from Königsberger’s book on Calculus (it’s very similar to the proof given in Heuser’s book). Let us start with the

 

The square root Lemma: Define the generalized binomial coefficient via \binom{\alpha}{n}:=\frac{\alpha(\alpha-1)\cdots(\alpha-n+1)}{n!}. Then we have the generalized binomial formula (1+x)^\alpha=\sum_{k=0}^\infty\binom{\alpha}{k}x^k for any \left|x\right|<1.

For \alpha>0, uniform and absolute convergence even holds for any \left|x\right|\leq1

In particular, for \alpha=\frac12 and any x\in[-1,1],

\displaystyle\sqrt{1+x} = \sum_{k=0}^\infty\binom{\frac12}{k}x^k.

 

Proof: One part is easy, if we apply the approriate machinery. Since (1+x)^\alpha is holomorphic with a possible singularity at x=-1, we can find its absolutely convergent power series in the point x=0 with a radius of convergence at least 1:

\displaystyle (1+x)^{\alpha} = \sum_{k=0}^\infty \left.\frac{d^k}{dx^k}(1+x)^{\alpha}\right|_{x=0} \frac1{k!}x^k,\qquad\left|x\right|<1.

But the derivatives in x=0 are easily found to be \frac{d^k}{dx^k}(1+x)^{\alpha} = \prod_{j=0}^{k-1}(\alpha-j). This proves the series representation.

This can also be shown in an elementary way, as seen in Hewitt/Stromberg.

We still need to prove something about convergence on the boundary of the circle for \alpha>0. Let us consider the series \sum_{k=0}^\infty\left|\binom{\alpha}k\right|. We have, for K\geq\alpha:

\displaystyle\frac{\left|\binom{\alpha}{k+1}\right|}{\left|\binom{\alpha}k\right|} = \frac{\left|\alpha-k\right|}{k+1} = \frac{k-\alpha}{k+1}.

In particular, for those large K:

\displaystyle k\left|\binom{\alpha}k\right|-(k+1)\left|\binom{\alpha}{k+1}\right| = \alpha \left|\binom{\alpha}k\right| > 0

The sequence K\left|\binom{\alpha}k\right| is thus eventually decreasing in K, it’s bounded and therefore has a limit \gamma\geq0. Now look at the telescoping series

\displaystyle \sum_{k=0}^\infty \left(k\left|\binom{\alpha}k\right| - (k+1)\left|\binom{\alpha}{k+1}\right|\right),

whose p-th partial sum is -(p+1)\left|\binom{\alpha}{p+1}\right|, which converges to $-\gamma$ (however, only the first few terms of it are negative, which gives the negative limit). So, the telescoping series converges, and we find

\begin{aligned}\displaystyle \sum_{k=0}^\infty\left|\binom{\alpha}k\right| &= \sum_{k=0}^{\left\lfloor\alpha\right\rfloor}\left|\binom{\alpha}k\right| + \sum_{k=\left\lfloor\alpha\right\rfloor+1}^\infty\left|\binom{\alpha}k\right| \\    &= C_1 + \frac1\alpha\sum_{k=\left\lfloor\alpha\right\rfloor+1}^\infty\left(k\left|\binom{\alpha}k\right| - (k+1)\left|\binom{\alpha}{k+1}\right|\right) \\    &= C_1 + C_2.    \end{aligned}

This shows that the series \sum_{k=0}^\infty\left|\binom{\alpha}k\right| is convergent. For in x\in[-1,1], we are actually interested in

\displaystyle \left|\sum_{k=0}^\infty\binom{\alpha}{k}x^k\right|\leq \sum_{k=0}^\infty\left|\binom{\alpha}{k}\right|,

which is convergent. The series thus defines a uniformly and absolutely convergent function on the compact interval [-1,1].

The fact that (1+x)^\alpha = \sum_{k=0}^\infty\binom{\alpha}{k}x^k even for x=-1 and x=1 follows by continuity on [-1,1] and by equality of both sides on (-1,1). q.e.d.

 

The Closure Lemma: Let \mathcal A be an algebra of continuous functions on a compact set and \overline{\mathcal A} its closure in the topology of uniform convergence. Then, for any f,g\in\overline{\mathcal A}:

f+g, fg, \left|f\right|, \max(f,g) and \min(f,g) are in \overline{\mathcal A}.

 

Proof: Let \varepsilon>0. There are p,q\in\mathcal A with \left\|f-p\right\|_\infty<\varepsilon and \left\|g-q\right\|_\infty<\varepsilon. Therefore,

\displaystyle\left\|f+g-(p+q)\right\|_\infty \leq \left\|f-p\right\|_\infty + \left\|g-q\right\|_\infty < 2\varepsilon,

and

\begin{aligned}    \displaystyle\left\|fg-pq\right\|_\infty &= \left\|fg-fq+fq-pq\right\|_\infty \\    &\leq \left\|f-p\right\|_\infty\left\|q\right\|_\infty + \left\|f\right\|_\infty\left\|g-q\right\|_\infty \\    &\leq \left\|f-p\right\|_\infty\bigl(\left\|g\right\|_\infty+\left\|g-q\right\|_\infty\bigr) + \left\|f\right\|_\infty\left\|g-q\right\|_\infty\\    &\leq \varepsilon\bigl(\left\|g\right\|_\infty + \varepsilon + \left\|f\right\|_\infty\bigr) < C\varepsilon.    \end{aligned}

The right-hand side can be made arbitrarily small, which proves the first two statements.

To deal with \sqrt{f}, we employ the compactness of the set: f will be bounded. We can therefore consider the function \phi = \frac{f}{\left\|f\right\|_\infty} taking values in [-1,1]. Note that the constant function \left\|f\right\|_\infty^{-1} belongs to \mathcal A, and since f\in\overline{\mathcal A}, \phi\in\overline{\mathcal A}. By the Square root Lemma,

\displaystyle\left|\phi(x)\right| = \sqrt{\left|\phi(x)\right|^2} = \sqrt{1+\bigl(\phi^2(x)-1\bigr)} = \sum_{k=0}^\infty\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k,

with an absolutely and uniformly convergent series representation. There is some N_0 with

\displaystyle \left\|\left|\phi(x)\right| - \sum_{k=0}^{N_0}\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k\right\|_\infty < \frac\varepsilon2.

By the arguments given above, together with induction, one sees that the partial sum up to N_0 belongs to \overline{\mathcal A}, if \phi does. Hence, there is some r\in\mathcal A with

\displaystyle\left\|r - \sum_{k=0}^{N_0}\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k\right\|_\infty < \frac\varepsilon2.

This yields

\begin{aligned}    \displaystyle \bigl\|\left|\phi(x)\right| - r\bigr\|_\infty &\leq \left\|\left|\phi(x)\right| - \sum_{k=0}^{N_0}\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k\right\|_\infty + \left\|\sum_{k=0}^{N_0}\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k - r\right\|_\infty \\    &< \frac\varepsilon2+\frac\varepsilon2 = \varepsilon.    \end{aligned}

Thus, \left|\phi\right|\in\overline{\mathcal A}, and therefore \left|f\right|\in\overline{\mathcal A}.

Finally, we apply the equalities

\displaystyle \max(f,g) = \frac12\bigl(f+g\left|f-g\right|\bigr),\qquad \min(f,g) = \frac12\bigl(f+g-\left|f-g\right|\bigr),

to show the final statements. q.e.d.

 

Note that we need \overline{\mathcal{A}} to find these functions, even if f,g\in\mathcal{A}. The algebra \mathcal{A} itself is not rich enough, in general, to contain the functions in the statement of the Closure Lemma. Besides, we have made frequent use of \overline{\overline{\mathcal{A}}}=\overline{\mathcal{A}}, which is at the heart of why the Closure Lemma works.

 

The Approximation Lemma: Let \mathcal A\subset\mathcal{C}_{\mathbb{R}}(K) be a separating algebra and let f\in\mathcal{C}_{\mathbb{R}}(K), x\in K, \varepsilon>0. Then, there is some q_x\in\overline{\mathcal A} with the properties

q_x(x) = f(x) and q_x(y) \leq f(y)+\varepsilon for any y\in K.

 

Proof: First of all, for any z\in K there is some function h_z\in\mathcal A with h_z(x)=a and h_z(z) = b. Take

\displaystyle h_z(y) = (b-a)\frac{g(y)-g(x)}{g(z)-g(x)}+a

with an appropriate g\in\mathcal{A} for instance (separation allows for this to be well-defined: pick g such that g(z)\neq g(x)). In particular, one can choose a=f(x), b=f(z), which makes h_z coincide with f at least in the points x and z.

Since h_z and f are continuous, there is some interval I_z around z, with h_z(y)\leq f(y)+\varepsilon for all y\in I_z. Now, any point z\in K is at least contained in the interval I_z, and by compactness, finitely many of those intervals suffice to cover K. Say, K\subset\bigcup_{i=1}^nI_{z_i}. Then consider

\displaystyle q_x:=\min(h_{z_1},\ldots,h_{z_n}),

which is in \overline{\mathcal A} by the Closure Lemma. Besides, q_x(x) = \min\bigl(h_{z_1}(x),\ldots,h_{z_n}(x)\bigr) = \min\bigl(f(x),\ldots,f(x)\bigr) = f(x) by construction, and for every y\in K, there is some j, such that y\in I_{z_j}, and so q_x(y) = \min\bigl(h_{z_1}(y),\ldots,h_{z_n}(y)\bigr) \leq h_{z_j}(y) \leq f(y)+\varepsilon. q.e.d.

 

Proof of the Stone-Weierstrass-Theorem, real version: For any x\in K, choose some q_x\in\overline{\mathcal A} as in the Approximation Lemma (in particular: q_x(x)=f(x)). Then, by continuity, pick an open interval U_x around x, such that for any of the y\in U_x\cap K,

\displaystyle q_x(y) \geq f(y)-\frac\varepsilon2.

By compactness, finitely many of the U_{x_i} suffice to cover K. Consider

\displaystyle g:=\max(q_{x_1},\ldots,q_{x_n}),

which is in \overline{\mathcal A} by the Closure Lemma. For any z\in K, we find some i with z\in U_{x_i}, and thus

\displaystyle g(z) = \max\bigl(q_{x_1}(z),\ldots,q_{x_n}(z)\bigr) \geq q_{x_i}(z) \geq f(z)-\frac\varepsilon2.

By the Approximation Lemma, however,

\displaystyle g(z) = \max\bigl(q_{x_1}(z),\ldots,q_{x_n}(z)\bigr) \leq \max\bigl(f(z)+\varepsilon,\ldots,f(z)+\varepsilon\bigr) = f(z)+\varepsilon.

Now, since g\in\overline{\mathcal A}, we can pick some p\in\mathcal A with \left\|g-p\right\|_\infty<\varepsilon. Then, we have found

\displaystyle\left\|f-p\right\|_\infty\leq\left\|f-g\right\|_\infty+\left\|g-p\right\|_\infty < \varepsilon+\varepsilon = 2\varepsilon.

The Stone-Weierstrass-Theorem is proved. q.e.d.

 

A slightly different proof is given in Hewitt/Stromberg. The idea is about the same, but the compactness argument works on the image of f, w.l.o.g. the interval [0,1]. Here, we have used the compactness of the definition space K. Of course, Hewitt/Stromberg can’t do without the compactness of K, since they need that f takes its maximum and minimum value. The downside of their approach is that they need to iterate their argument, so their approximating function is actually a series of those max-min-functions that we have used. But there are not too many differences in the proofs, so let’s just stick to that. I have actually found the proof given above a little slicker; but I have come to admire the book by Hewitt/Stromberg for its clarity and style. There are several nice things hidden inside there to come back to.

For now, we’ll stop. But there’s some more backstory about the Stone-Weierstrass-Theorem to come soon.

Math is important

There have been many instances when people told me about how much they disliked math at school – especially when I speak highly of my studies and that I do abstract math frequently even today. On top of that, I have talked to many students (either B.Sc. students or teachers-to-be) who speak similarly about math at university-level. And finally, I know several math teachers who chime in to that tune as well. Among the most usual comments that I get are:

  • math at school was so boring and dry, it doesn’t have anything to do with what happens in real life
  • it would be sufficient if I had just learned the basic computations (addition, subtraction, multiplication and division) and to deal with percentages – that’s all I’ll ever need
  • math at school was so hard, I never understood why I should have come up with these particular tricks
  • the exercises at the university were so hard, I never managed any one of them on my own
  • it’s no use to learn so many hard topics in math that I will never use in my later work (and, oh boy, I had to go to so freaking weird lectures)

Here’s my random rant about why none of these comments is valid – or why any of these comments would apply just the same to any other high school subject. I just had to collect those arguments once in a single place, to get them off my heart. And for further reference. None of these arguments is originally mine, which might just show how deep this anti-math feeling is rooted.

Before I start, let me make something clear: I do not intend to say that the whole universe of mathematics is interesting in and of itself and that everybody should make themselves acquainted with that just for the sake of it. There is a lot of mathematics in which I have lost my interest as time went by (or in which I never had any interest in the first place). On top of that, most of what I have to say applies accordingly to other sciences or topics: I myself would never have become much of a biologist or theatre critic. In what follows, I will rather talk about things that should be standard knowledge to anybody who consider themselves an educated person: things that are part of what mankind has achieved, be it the Mona Lisa, be it Romeo and Juliet or be it Newton’s Law of Gravitation. In some way, many people tend to say that the first two are “must-knows” but the last one is not. You can live your life perfectly at ease if you’re unaware of Newton’s Law of Gravitation, but if you don’t recognize the Mona Lisa or have never seen it, some would think you’ve been living with eyes closed – how’s that? A painting may be easier to approach than a theorem, but both require some sort of education to properly grasp them and both will be (aesthetically) pleasing to the educated beholder. And there also is really deep beauty in maths, there is elegance, there are unexpected connections. But it is hard work, no doubt about that.

On top of that, there are many people who work in some shade of a scientific profession. That not only applies to scientists but also to engineers, programmers, actuaries, and – yes – especially (math) teachers. All these people must possess a certain way of thinking to do their job properly: an analytic, logical train of thought which allows them to cut through a problem and to make their reasoning clear to other people (it hardly matters whether you have to explain your work to your boss or to a bunch of children). Besides, these people have learnt to actually tackle a problem and not run away if they don’t understand it at first sight. In practical professional uses these trains of thought will be what people have learnt – this is what employers look for in scientists: they have learnt to think and they can’t be scared away by some hard and unsolved problem. This way of thinking is being taught by considering mathematical problems, but the actual skill is generally not mathematical. If the way of logical thinking could as well be taught by applying for a truck driver’s licence, then employers would also look for truck drivers, why not.

So, this is one of my first points that I’d like to make: in learning or studying math, you’re supposed to learn to think. The fact that you learn math is just incidental, if you will. You’re supposed to be suspicious about what people just tell you without proof – you’re supposed to think for yourself and to convince you of the truth for yourself. In social studies, in theology, in psychology, the teacher might shoot down any of your questions with a phrase like “I’m right because that’s the general opinion and you’re not yet aware of it”, but in mathematics and in the sciences any student is able to convince the teacher of mistakes that he’s made. I am aware that I’m over-simplifying here, since there is no such thing as an absolute truth in psychology and social sciences (theology is somewhat tricky in this respect) and there is a living culture of discussions – those professions will teach you a different, albeit similarly important way of thinking. Discussions are equally important in math, but in the end there is only one answer (apart from very special cases that usually aren’t taught) and the students are supposed to learn arguments towards that answer. In philosophy, no-one needs to be convinced by arguments since no-one is wrong; maybe different or on shaky ground, but not wrong. My point is not to diminish the non-mathematical sciences, my exact point is that you can learn argumentation and discussion on topics that actually have universal truth hidden inside them. This is how you can uncover false arguments, how to fix arguments gone wrong, how to think twice at crucial points – and how to wonder about unconvincing statements other people may make (and how to question these statements).

Another issue made about math is that it is considered hard. Well, that’s right. Math is hard and exercises are hard. But if your mind is to be trained to a purely logical way of thinking, then how else would you do the training? If you want to run a marathon, you also need to do your training (and yes, of course, running a marathon is hard as well). On top of that: problems in work are harder than the math exercises. Applications of mathematics are even harder than abstract math, since the abstraction is gone. There are no proper circles in the real world with a circumference of 2\pi r. In applications you always have to account for measurement errors, lack of data quality – and you have to stop dealing with those inaccuracies at some point because you couldn’t possibly get all of them right (but at least you should give an estimate then about how big your error is). Your math exercises are supposed to give indications about what can and what cannot be done, and how you can deal with that. About using math successfully, to discuss and to convince others by your logic. And without training, you can’t do anything properly. Learning is when you do things that you can’t do yet.

Now, some people will say they don’t have to deal with circles or anything similar in their everyday life, so this example doesn’t apply. Maybe so. But then they have to deal with something else and time and again something changes. People are not robots who receive their programming once and then do the same thing day after day – and every time something changes, you will have to adapt and to think about that. It might be useful to be able to think about that properly: what am I supposed to do now, how do I do that, how can I optimize that (i.e. have less trouble or effort with it). This doesn’t need to be mathematical thinking – it’s just that math and exercises may have taught you how to deal efficiently with “new things that you don’t yet understand”.

The other point of view to this is: math is universal because it’s abstract. Some things and techniques have been invented for a special purpose, but when you take the special situation away from it, the technique can be used more generally. I won’t start the long line of examples here, it would probably just distract from my point. Abstraction is hard, but it unveils the underlying principles and once you’ve understood those, many things will just fall into your hands like that. Or, to phrase it differently: you should make things as simple as possible – but not any simpler. Applications are not simple because they may be concrete – in this respect, math is simple because it is not concrete. And to give a quote from one of my high school math teachers: “The task of mathematics is to make life as simple as possible.”

About math exercises, some complain that, sometimes, tricks are required and you can’t come up with the trick in the first place. That’s right. But no-one expects you to invent a mesmerizing trick on your own. You’re supposed to look at the exercise and think about it, try to solve it. If you have spent some time with the exercise, sometimes you will come up with a solution (and sometimes a neater solution uses a dirty trick that you didn’t see). But there will always be exercises that you can’t solve. That’s not a problem. When you are given the solution, you are supposed to think about that, too. You wouldn’t understand it properly, if you hadn’t thought about the question in the first place – and you will understand the trick, so it will be part of your own quiver of tricks from now on. If there is a similar exercise later, you can use the trick since you’ve been taught it now. And, yes, there are many tricks. It takes experience and effort to know which can be used in a given situation and which can’t. This, in turn, enables you to deal with more complex problems that you couldn’t tackle without that quiver of tricks. That would bring us back to the training-argument from above: you can’t run a marathon, if you just walk around your house once a week. If you want to become better in something, this will take effort and time.

Much of what I said about some profession or science as a given example can be replaced by something else and still keep the same meaning. I actually intend to say, that a broad education is essential for anything an educated person does – you need to be really good in your special subject, but you also need to try and see the whole picture (or understand why you can’t see it). It’s rather my own point of view that made this posting start about mathematics. It actually applies to most other educated professions I can think of, and many people shun away from these “alien” educated professions when a discussion crosses those. I myself am often confronted with the mathematical angst of many people, which made me write this text.

Seeing the bigger picture (whatever your personal profession and specialty is), will also help you in another aspect. You always need to know a little more than you actually have to use. Thus, you will know the boundaries of what you’re doing (and the shortcuts that you may have taken). It will be immensely helpful to have an understanding of why the world works as it does, what the principles are – that doesn’t mean you have to properly understand each and every bit of it, your time is finite of course, but the basics should be standard knowledge. That makes it possible to appreciate the really hard stuff (e.g. it’s not an easy thing to encrypt your online banking access data; it’s not easy to write a song that will rank top in the charts; it’s not easy to understand why everyone actually puts any trust in paper-money instead of gold, bread or shells).

Especially teachers must know more than their students, and they must know more than what they actually teach their students. On top of that, teachers need to know how to deal with children and how to keep them awake and attentive in their classes. That’s why this job is so hard, after all.

Similar reasoning applies to languages. I can hardly overstate the need to speak foreign languages. You will understand the culture of a foreign country much better (if only at all) if you can make yourself familiar with its language. And that also applies to dead languages such as Latin or Ancient Greek. German educator Wilhelm von Humboldt once said “Having learnt Greek can be no less useful to a carpenter, than making tables is useful to a professor”. You learn to think analytically by learning the grammar and you learn about another culture. How can that be entirely useless, even though no-one will have a conversation with you in Latin? Of course, your time is finite and you’d rather deal with something else – but apart from very legitimate indulgent things like watching some football game or spending time with your friends: is a soap-opera really the better way to kill your time as opposed to learning a new language, playing chess, doing a painting or thinking about a science problem? What do you want to do with your time?

Finally, one paragraph about the boundaries of math. Mathematics is not a description of the world. Mathematics is a set of assumptions (called axioms) which are taken at their face value because no-one doubts them and you can’t prove anything without having a starting point. From those axioms, things are proven by logical conclusions. The construction is powerful enough that people have drawn useful and sensible conclusions from it to describe the world. But mathematics is not the world. It is a model (albeit quite a good one) and there are boundaries. Dealing with mathematics allows you to find out where those boundaries are and where your model assumptions break down (do they break down in your application?). You mustn’t use math without a sufficiently deep knowledge, and this is why economists, psychologists and many others have to deal with statistics, even though their computers will tell them about the significance of their test statistics. Still, you have to know the limits and deal with them.

That’s why people are supposed to deal with math.

Eigenvalues and how to find them

Recently, I have taken some time to look into numerical eigenvalue algorithms. This is a subject that has interested me for some time now, since it is not regularly taught in basic lectures on numerics, and there are many instances where you actually need to compute eigenvalues. Of course, any reasonable math software will have built-in algorithms that do the trick for you – but time and again, I’d like to know what makes them tick. So how do they work?

In what follows, I refer to chapter 3 of the basic-ish book by Hämmerlin on Numerical Mathematics. It is not very well known, and its notation is somewhat intricate, but at least it started at roughly the same level where I stood. For more advanced ideas, the book by Stoer and Bulirsch might be more applicable, or so I’m told – however, I haven’t had a closer look at it yet.

To fix notation, we look at some real or complex n\times n matrix A = (a_{ij})_{i,j=1,\ldots,n}, and we are interested in its eigenvalues \lambda_i with corresponding eigenvectors v_i, i=1\ldots,n. We denote the n-dimensional identity matrix by I.

As it turns out, these eigenvalue-algorithms are rather intricate beings. There are different ones for slightly different purposes, and if you are interested in eigenvectors as well – that’s almost always the bottleneck. So, unless you have a great package of built-in algorithms (such as Lapack, which is the foundation to what R and Matlab do with their eigen functions), you should really think twice before having a look at eigenvectors.

The most basic algorithm is the Power Method, which will yield the largest eigenvalue and its corresponding eigenvector (large in the sense of absolute value). It is based upon the concept of eigen-decomposition of vectors: if there are n eigenvalues \lambda_i and a base of corresponding eigenvalues x_i, then we have for any initial vector v_0: A^k v_0 = \sum_{i=1}^n\lambda_i^k x_i. Here, the largest eigenvalue, \lambda_1 say, will dominate the others and you can extract its value from the iterates A^k v_0. The iterates will also give you the corresponding eigenvector – that is, if the eigenspace is one-dimensional. If there are other eigenvalues that have the same absolute values as \lambda_1, things get rather messy. Manageable, yes, but it will not be as nice as before (here, Hämmerlin goes into some detail on the convergence issues especially of the eigenvectors). Related to this problem is the concept of deflation, where your n-dimensional space splits into lower dimensional ones to deal with the eigenvalue-problem in lower dimensions only (this can sometimes happen, depending on the exact structure of the matrix). This will improve the feasibility of all the algorithms discussed here, but we won’t go into this much deeper.

This algorithm may be what you need, if you are only interested in the largest eigenvalue. This might happen in physics or when dealing with the spectral norm, but in my own applications in statistics, I usually have to know all of the eigenvalues. Thankfully, there are more possibilities. For instance, there are tailored algorithms that work particularly well on tridiagonal matrices (which may arise in problems on ODEs for instance). Those can be based on a non-explicit computation of the characteristic polynomial – you mustn’t evaluate that polynomial straight-forwardly, since you will get all kinds of rubbish for your answer. The problem is highly instable from that point of view. But a recursive evaluation will do just fine, because the structure of the problem can be broken down to an evaluation of the characteristic polynomial and its derivative at any given point, thus allowing for Newton’s method to take over.

Statisticians will use eigenvalue-algorithms for real symmetric matrices (or suspected covariance matrices, if you will), which have the additional property of yielding n real eigenvalues (remember the principal axis theorem?). It will be interesting to find out, for statisticians at least, if all the eigenvalues are positive. Here, a clever use of orthogonal transformations will be useful, the Givens rotation, which will iteratively tone down the off-diagonal entries of your matrix (the application of this transformation in this application is called the Jacobi Algorithm). In the limit, you can read the eigenvalues off the diagonal. The proof of this fact is not too hard, but we won’t give it here. In practice, the question is when to stop your iterations.

For the Givens rotation, as well as for the Newton method, approximate facts about the eigenvalues would be helpful. Surprisingly, something can be said without much effort: the Gershgorin disks. All eigenvalues \lambda of the matrix A must lie in the union of the disks \left\{z\in\mathbb{C} : \left|z-a_{kk}\right|\leq r_k\right\}, where the radii are r_k=\sum_{i=1, i\neq k}^n \left|a_{ki}\right|.

The proof is remarkably simple: the eigenvalue equation Ax=\lambda x can be written as (\lambda-a_{kk})x_k=\sum_{i=1, i\neq k}^n a_{ki}x_i. One of the indices k must be such that x_k=\left\|x\right\|_\infty, then \left|\lambda-a_{kk}\right| \leq\left|\sum_{i=1, i\neq k}^na_{ki}\frac{x_i}{x_k}\right| \leq\sum_{i=1, i\neq k}^n\left|a_{ki}\right|. q.e.d.

These Gershgorin disks can tell you when your iterative algorithm has successfully split two eigenvalues. It will be rather undecisive with multiple zeros of the characteristic polynomial. But, as I mentioned above, those tricky situations will be troublesome for all of the algorithms.

Another fancy idea is an advanced power method. If you apply it to A^{-1} instead of A, it will give you the smallest eigenvalue. And you might even use it on (A-c I)^{-1} to find the eigenvalue closest to c\in\mathbb{C}. Gershgorin will tell you some proper choices of c to start with. On top of that, you can find the eigenvalues from the power method, as I mentioned above. That’s something that the Jacobi method will not provide – but of course you can use the power method on an eigenvalue approximation that you have gathered from one of the other algorithms. The bad thing is, that you will need to apply the power method about n times for this, and you might not even have the proper choices of c yet, so this will take even more effort. If you’re unlucky, you still have to distinguish multiple eigenvalues – it’s a hard problem with many nasty special cases.

The most famous algorithm, and probably the most widely used one, is the QR algorithm. It unites ideas from the power method and the Givens rotation (when you dig down deeply, that is), it is quite fast and highly stable. But still, this algorithm doesn’t make a tough problem easy. My own implementation in R works just fine on usual matrices, but it struggles with several problems of convergence in special cases and it uses quite some memory. But then again, I didn’t really try properly… The basic idea lies in the QR-decomposition of the matrix A = QR, which can also be used for solving linear equations and whatnot. As the names should suggest, R is an upper triangular matrix, while Q is orthogonal. Then, one will compute the new matrix A_{i+1} := R_iQ_i from A_i = Q_iR_i, and iterate. One can show that all the A_i have the same eigenvalues (they differ by orthogonal transformations only) and that the iterates will converge to some upper triangular matrix from which you can read the eigenvalues.

The proof is somewhat intricate – already in the base case that was proved by Wilkinson in the 1960’s. That is why we will not discuss it here – but most of the books I have consulted on this give the same proof and only in the special case of symmetric real matrices with some additional technical assumptions that you couldn’t check in the first place. The proof hinges on many QR-decompositions of various matrices performed in a row, and on the uniqueness of this decomposition.

It is rather tricky to decide when the algorithm should stop. You cannot expect the elements under the diagonal to vanish, since complex eigenvalues will not show up on the diagonal (how would they, the algorithm is real if you work on real matrices) but form boxes of dimension 2 from which you can still read off the eigenvalues. But then, fluctuations on the first subdiagonal can either come from two conjugate complex eigenvalues or from the algorithm not having properly converged yet. I am not aware of a way to decide what happens during runtime. But, apparently, sophisticated methods have been found to cope with this in the usual implementations.

Principally, the algorithm works in the general case as well, and it works quite fast to give all the eigenvalues at the same time. The decomposition itself will take about n^3 operations in each step of the iterations, which is on the faster side of eigenvalue algorithms (the power method takes n^2 operations for the multiplications, but you’ll need to do the iterations with n different matrices at least). People improve on the convergence by using the QR-decomposition of the matrix A_i - rI, where r is the so-called shift and is often applied such that the lower right element of A_i is set to 0. Depending on how deep you like to delve into this, you can also use more sophisticated shifts than that one. I am afraid, I didn’t get why this shifting should work, but, in practice, on many matrices it does wonders for the speed of convergence.

Finally, note that wikipedia has a list of eigenvalue-algorithms that goes beyond what we have covered here. It is yet another question whether this is an exhaustive list in any sense.

In the end, I have come to a much deeper appreciation of these algorithms. As I said earlier, you can’t do the first thing with the definition of an eigenvalue, you need more sophisticated algorithms for this problem. And these algorithms may be trivial to implement on easy matrices, but to make them work on “random” matrices is already very hard work – who guarantees that you found all the eigenvalues, and that they are reasonably exact? And that doesn’t include any thoughts about why these algorithms should work. In all of them, there is really hard work to be done, not only for the very many special cases that arise in your applications. All of this came as a surprise to me when I first thought about it – after all eigenvalues are a very basic, very natural and rather easy concept in linear algebra. By now, I have found and appreciated that for the numerical tackling of the problem, many intelligent people have spent huge amounts of time and effort on this highly important problem. Sometimes, the gap between theory and practice is really large.

Scientific Videos

Several years ago, I encountered a collection of youtube videos about the chemical elements, a periodic table of videos. These are made by chemists in Britain who keep their videos regularly updated, and who show the respective elements, do some experiments with them and so forth. Really fine stuff, especially for a non-specialist like me – and you can tell how excited these people get when they talk about their profession.

Now, other scientists have caught on: there are similar videos on physics and on mathematics. Of course, there can be no similarly exhausting list of videos, but they talk about their fields of expertise for a general audience. And again, these people are genuinely excited to talk about their field. Being a mathematician myself, the math videos bring few totally new things to me – but that doesn’t stop me from enjoying them a lot. Their team is a quite studded cast, comprising Barry Mazur, Ken Ribet and Don Knuth, next to professors and PhD Students. This makes me think of the physicists and chemists being similarly famous, it’s just I don’t know them… which is sad, actually. But even in mathematics, you get to know the superstars of your own field and you hardly get to know the other stars. It’s not such a small peer group, after all, until you’ve specialized.

Polynomials aren’t well-behaved after all

A remarkable problem can be encountered in the context of finding the eigenvalues of a given quadratic matrix A\in\mathbb{C}^{n\times n}. As everybody knows, an eigenvalue is some \lambda\in\mathbb{C} that satisfies the equation Av=\lambda v for some vector v\neq0 (which is appropriately called the eigenvector). Every book on linear algebra will tell you the theoretical properties of these eigenvectors, for instance how you can find them: look at the characteristic polynomial \det (A-x I) and find the zeros. As we’re dealing with complex numbers, this polynomial will have complex roots and these are the eigenvalues (because if \lambda is a root, A-\lambda I will have a non-trivial kernel, which means in turn that there is some vector v\neq0 with (A-\lambda I)v = 0).

So far, so simple. For small matrices, you can do just that with a pencil and paper. For larger matrices, you can do just that in a computer. But there are some difficulties to overcome: computing determinants is quite expensive, and you will need to do this often for your iterative zero-finders. Not to mention that the derivative of this polynomial is not easily at hand. You really would need a symbolic machine before you can start with the numerics. But there is another, much bigger problem.

Everybody has an intuitive impression that the zeros of a polynomial depend continuously on its coefficients (proving that lies a little deeper – I’d start with the theorem on implicit functions which might show a little more than that. That doesn’t really matter here.). But this problem is highly unstable in the case of characteristic polynomials. This is what makes this construction break down when you try to find eigenvalues numerically – you will be forced to use different methods in order to get useful results. Hence, the chapters on computing eigenvalues in texts on numerical analysis will hardly ever use the definition of eigenvalues given above. Maybe the proofs will, but the algorithms will avoid that wherever possible. This is one not-so-rare instance where you want to get some mathematical object but you can’t do the first thing with its definition – you need to dig a lot deeper (which is why the eigenvalue-algorithms were never taught in the lectures on numerical mathematics that I was fed with – everybody avoided those because of the necessary overhead; and, not to mention, I hadn’t been that interested in these methods after all. I was young and enthusiastic about theory, not numerics – I even deemed the quadrature formulae un-useful. People change.).

We won’t cover the standard eigenvalue algorithms here. This post is about the counterexample to the slogan “Polynomials are well-behaved”. Let’s look at Wilkinson’s polynomial \prod_{i=1}^{20}(x-i). This is rather nice: well-separated zeros, all of them real, low degree… what more could you wish for?

These are the reasons why Wilkinson looked at this thing in the 1950’s. He wanted some easy test case for a zero-finder. So, he ignored his built-in knowledge of the zeros of this polynomial and let his routine look for the zeros on a 9-digit-precision machine. The results were devastatingly bad. On Wikipedia you can find the computed zeros of the polynomial, some of which have rather rough errors, others have moved to the complex plane.

The basic problem lies in the representation of the explicit polynomial in machine numbers. The constant term of the polynomial is 20!, and this isn’t even the largest of them. On Wilkinson’s machine, several coefficients encountered massive loss of precision and even in today’s double precision, some of the numbers aren’t represented correctly. Hence the root of the ill-conditioning of the problem: you can’t even put the polynomial properly in a computer. Even evaluating the polynomial in a small area around one of the precise zeros will yield results that are dominated by rounding errors (including an almost pseudo-random sign).

Wilkinson published a survey article on the problem which he labelled “The perfidious polynomial” and where he also conducts some analysis of the problem sloppily described above. Besides, you can find there some examples of where the numerical problems can take their origin – it may sometimes be necessary to determine the coefficients of an explicit polynomial much more accurately than the data are: the eigenvalue equation \det(A-x I) itself is rather insensitive to small perturbations of the matrix, but the transformation to the explicit characteristic polynomial may demand for massively higher accuracy. This makes sense heuristically: the matrix contains n^2 entries which are condensed to n coefficients of the characteristic polynomial. In theory, no information is lost by this transformation, but in practice you might need to take a closer look than usual.

All this serves to show the care that needs to be taken in numerical analysis. Even things that are as well-behaved and well-understood as polynomials can show unanticipated behavior, notwithstanding fancy theorems like Weierstrass’s Approximation Theorem and others. You can hardly ever be careful enough.

The Principal Axis Theorem

In the past few weeks, I have been trapped by a basic mathematical problem that comes up in each undergraduate class on linear algebra: the Principal Axis Theorem (which, in German, has the rather more impressive name „Hauptachsentransformationssatz“, or – as friends of the playwright Brecht might say – “der Satz von der Hauptachsentransformation”. Never mind.) I got there from a discussion I’ve had with a colleague, concerning a certain covariance matrix and what happens when you transform it in a certain way. The key to the discussion was the diagonalization of the covariance matrix via some orthogonal transformation – in the end, I was aware that everything was alright, but I was sufficiently curious about where this powerful theorem is rooted (and I was sufficiently distant to my undergraduate course, so I was totally unable to remember the proof).

It didn’t take me long to understand the proof as it is. It’s an inductive proof constructing the orthogonal transformation and you need a theorem on the existence of some eigenvalue. Of course, my own handwritten lecture notes used the Fundamental Theorem of Algebra, as did the next best book on linear algebra that I could find. (Just as a side note: in all the years that I spent at university, I never encountered a book on linear algebra that I could properly work with or which I would recommend. Some can at least be used as a quick reference, but none were actually useful for my self-studies to deepen a special subject).

This reliance on the Fundamental Theorem of Algebra bugged me. Once you have left your undergraduate studies behind you, it’s a theorem like many others: use it at will, it’s correct. But flipping through my lecture notes, I saw: the proof of the Principal Axis Theorem hangs in midair, because the Fundamental Theorem of Algebra was out of reach for that lecture. So I ventured to look for a proof that could do without that deeper result from complex analysis – thereby ignoring the famous quote by Hadamard:

Le plus court chemin entre deux vérités dans le domaine réel passe par le domaine complexe.

In the end, I found it twice: once in the really nice book by Koecher, another time in the rougher book by Fischer. Fischer’s book was on my shelf, I own it since my undergraduate studies – that book and I have never become good friends. Koecher’s book on the other hand, is new to me, although it is a little more dated, but it contains a lot of little known side notes, historical remarks and results that are not so common in the standard literature. I am still flipping through the rest of Koecher’s book, but I have come to like it. Sadly, the proof of the Principal Axis Theorem is rather intricate – much deeper that the rest of the book is, actually. A nice little quote is connected to the problem at hand here:

Die Hauptachsentransformation ist ein so schlagkräftiges Hilfsmittel, dass man immer dann an sie denken sollte, wenn ein Problem über reelle symmetrische Matrizen ansteht

(The principal axis transformation is such a powerful tool that one should always think of it, whenever a problem on real symmetric matrices comes up). Let us therefore have a look at how you can prove this Theorem without leaving the realm of real numbers.

Let’s start with an example, taken from Koecher: the equation

41x_1^2-24x_1x_2+34x_2^2=25

defines a curve in the plane \mathbb{R}^2. Which curve?
The defining equation can also be written as a quadratic form, using the matrix \begin{pmatrix}\hphantom{-}41&-12\\ -12&\hphantom{-}34 \end{pmatrix}, which yields \left\langle x, Ax\right\rangle = 25. Well, maybe we can apply some rotation to make the equation more accessible. Any rotation is represented by the matrix T=\begin{pmatrix}\cos\alpha& -\sin\alpha\\ \sin\alpha&\hphantom{-}\cos\alpha\end{pmatrix}, for some \alpha\in\mathbb{R}. Now, after applying the linear map T to the coordinates (x_1,x_2)^t, we arrive at new coordinates y, say. In these new coordinates, the quadratic form looks like \left\langle Tx, ATx\right\rangle = \left\langle x, T^tATx\right\rangle, and we shall denote B=T^tAT. Obviously, our goal is to make B as simple as possible. Among the simplest matrices are diagonal matrices. So, we will want to see \begin{pmatrix}a&0 \\ 0&b\end{pmatrix}. What do the constant reals a and b need to be?

Using the abbreviation u=\cos\alpha and v=\sin\alpha, we find the equations

\begin{aligned}  41u^2-24uv+34v^2&=a\\  12v^2-7uv-12u^2&=0\\  41v^2+24uv+34v^2&=b.  \end{aligned}

Besides, ab=\det B = \det T^tAT = \det A = 1250. Those are four equations for four unknows (note: there are non-linear dependencies present! Finding the solution is not particularly difficult, but a little cumbersome).

This is sufficient information to get \cos\alpha=\frac35, \sin\alpha=\frac 45, which first shows T=\frac15 \begin{pmatrix}3& -4\\4&\hphantom{-}3\end{pmatrix} and finally B=25\begin{pmatrix}1&0 \\ 0&2\end{pmatrix}. In particular, \alpha\approx 53^\circ, and the curve in transformed coordinates is y_1^2+2y_2^2=1. It’s an ellipse.

We have rotated the axis of \mathbb{R}^2 such that the ellipse appears in its simplest form, without distracting linear terms. In particular: we only have rotated: the orthogonal transformation T has kept all distances and angles unchanged.

So much for our example. It has shown the underlying ideas of what we shall do geometrically. Let us now state the theorem in its algebraic form, we shall return to the geometric interpretation later:

Theorem: For any symmetric matrix S\in \mathbb{R}^{n\times n}, one can find some orthogonal matrix T\in O(n), such that T^tST is a diagonal matrix.

For the proof, let us consider the set of values of the quadratic form \left\langle x, Sx\right\rangle, which come from those x\in\mathbb{R}^n, for which \left|x\right|=1. Thus, we consider a continuous function on a compact set: there must be some \mu(S) = \max_{\left|x\right|=1} \left\langle x, Sx\right\rangle. In particular, this maximum \mu(S) must be finite and there must be some v with \left\langle v, Sv\right\rangle = \mu(S). Hence, \left\langle x, Sx\right\rangle \leq \left\langle v, Sv\right\rangle for any x with \left| x \right| = 1.

Now, let \tau\in[0,1], \sigma = \sqrt{1-\tau^2} and w \bot v with \left|w\right|=1. We consider the vector x = \sigma v + \tau w. We have \left |x\right|^2 = \sigma^2\left|v\right|^2 + \tau^2\left|w\right|^2 + 2\sigma \tau \left\langle v,w\right\rangle = \sigma^2 +\tau^2 = 1. Since S is symmetric, \left\langle Sv, w\right\rangle = \left\langle v, Sw\right\rangle and therefore,

\begin{aligned}  \left\langle v,Sv\right\rangle & \geq \left\langle x,Sx\right\rangle\\  &= \sigma^2\left\langle v,Sv\right\rangle + 2\sigma\tau\left\langle v,Sw\right\rangle + \tau^2\left\langle w,Sw\right\rangle.  \end{aligned}

Division by \tau and the relation between \tau and \sigma show

2\sigma\left\langle w, Sv\right\rangle \leq \tau (\left\langle v, Sv\right\rangle - \left\langle w, Sw\right\rangle).

Without loss of generality, \left\langle w, Sv\right\rangle \geq 0 (we can choose -w instead of w if need be). From here and from letting \tau\to 0, we find the right-hand side approaching 0, and hence \left\langle w, Sv\right\rangle=0.

In particular, Sv \bot w. But w was chosen from (\mathbb{R}v)^\bot and so Sv \in ((\mathbb{R}v)^\bot)^\bot = \mathbb{R}v.

This was the first step in our proof of the Principal Axis Theorem. We now need a second auxiliary result before we get to the finale: Let a and b two vectors in \mathbb{R}^n with \left|a\right|=\left|b\right|. Then there is an orthogonal matrix U such that Ua=b.

One can choose, for instance, the reflection

\displaystyle S_{a-b}(x) := x-2\frac{\left\langle a-b, x\right\rangle}{\left|{a-b}\right|^2}x.

By assumption,

\displaystyle \left|a-b\right|^2 = \left|a\right|^2+\left|b\right|^2-2\left\langle a,b\right\rangle = 2\left\langle a,a\right\rangle-2\left\langle a,b\right\rangle = 2\left\langle a-b,a\right\rangle.

This shows

\displaystyle S_{a-b}(a) = a-2\frac{\left\langle a-b,a\right\rangle}{\left|a-b\right|^2}a = a-(a-b)=b

and vice versa S_{a-b}(b)=a. It is easy to see that S_{a-b} is actually orthogonal.

One can even choose the orthogonal matrix to have positive determinant, but this is not important here.

The rest will be easy now. Let us use induction on the dimension of the matrix S.

The case n=1 is trivial: nothing is to be done, the matrix is diagonal already. Let the theorem be proved for any symmetric real (n-1)\times(n-1) matrix, n\geq 2. Let S be some real symmetric n\times n matrix. We have already proved that there is some v with \left|v\right|=1 and some \lambda\in\mathbb{R} such that Sv = \lambda v (actually, we have shown that \mu(S) is this \lambda).

Besides, there is some orthogonal matrix T such that Te_1 = v. Then we find T^tSTe_1 = T^t\lambda v=\lambda e_1, which means T^tST = \begin{pmatrix}\lambda & q \\ 0&S_1\end{pmatrix}. The symmetry of S implies the symmetry of T^tST, and hence S_1 is a symmetric (n-1)\times(n-1)-matrix, q=0. By induction, T_1^t S_1 T_1 is a diagonal matrix D_1, using an orthogonal transformation T_1.

With T' = \begin{pmatrix}1&0 \\ 0&T_1\end{pmatrix}, we have {T'}^tT^tSTT' = \begin{pmatrix}\lambda&0 \\ 0&D_1\end{pmatrix}.

The theorem is proved.

The entries of the diagonal matrix D=\begin{pmatrix}\lambda_1&&0 \\ &\ddots& \\ 0&&\lambda_n\end{pmatrix} obviously are special – let’s have a closer look at them: they were defined via Sv_i = \lambda_iv_i for some extremal vector v_i. This is the definition of an eigenvalue of S. If you are accustomed to another definition of eigenvalues: it follows directly from the theorem: One finds

\displaystyle\det(\lambda E-S) = \det(T^t(\lambda E-S)T) = \det(\lambda E-D) = \prod_{i=1}^n(\lambda-\lambda_i).

In particular, if you remember your course on linear algebra: the entries of D are exactly the zeros of the characteristic polynomial of S – they are exactly the eigenvalues of S.

This is a natural way of introducing eigenvalues, not as an abstract concept, but via an application from inside mathematics. I do like this approach, although it makes the following statement a lot less impressive: Any symmetric real n\times n-matrix has n real eigenvalues, and it allows for an orthonormal basis of eigenvectors. This is not obvious at all.

The key to the proof of the theorem was the existence of eigenvalues to the symmetric matrix. We haven’t even left the real numbers and we produced the existence of zeros to a certain polynomial of arbitrary degree. As I mentioned above, this statement of existence is often done via the Fundamental Theorem of Algebra – but we can do perfectly fine without this rather bulky instrument. We can’t compete at all without some information from calculus. In this case, we vitally needed the structure of the real numbers as a complete field, allowing for the existence of extremal values of a continuous function on a compact set. In particular, the theorem can’t work over any arbitrary field.

The proof of existence of eigenvalues was taken from Fischer’s book – it comes without too much unnecessary overhead information. Koecher used more information on positive definite matrices – which clouded the exact spot where we use the symmetry of our matrix. In the proof above, the spot is clear: we need to simplify the expression in the scalar product. Hence, the theorem will be wrong, in general, for non-symmetric matrices. The conclusion of the proof given above was taken from Koecher again. We have used the best of many worlds here.

On the other hand, the proof of the theorem could be done a lot shorter: there is an eigenvalue (because the characteristic polynomial splits over \mathbb{C}), by symmetry this eigenvalue must be real (because the inner product is sesquilinear), transform the eigenvector to a canonical base vector and use induction (as we did above). That’s the strategy from my handwritten lecture notes – at least it’s easy to see. But it also covers any inner relations and it uses tougher results than would be necessary. Different tools for different intents.

The orthogonal transformation comes from the fact that you can transform any two vectors with the same length into each other. We did it with a reflection here, but a rotation (being the concatenation of two reflections) could have done the same trick. That can yield an even stronger theorem: the eigenvectors are not only orthogonal but the transformation matrix also has positive determinant.

For conclusion, we give a little application of the theorem for use in geometry.

Corollary: Let \left\langle x, Sx\right\rangle + \left\langle x,a\right\rangle +\alpha be a general polynomial in n variables over the real numbers. Then, the polynomial can be brought in one of these normal forms:
\displaystyle\sum_{i=1}^n\lambda_ix_i^2 + \beta,
\displaystyle\sum_{i=1}^{n-1}\lambda_ix_i^2+2\gamma x_n.
In particular, any curve of degree 2 (which is the zero set of a general polynomial), can be classified in this manner.

Note here, that the matrix S can be chosen as symmetric without loss of generality: the polynomial will not change when you choose the corresponding off-diagonal entries as their arithmetic mean.

In the plane \mathbb{R}^2, this classification leads to the classical descriptions of a parabola, a hyperbola, an ellipse, straight lines or single points – there are no other curves of degree 2 in the plane.

We shall not prove the corollary, the details are a little messy (getting rid of the “linear term”, for instance) and are not spectacular as they stand. But the proof in the reals of the Principal Axis Theorem is a collection of nice arguments, as we have seen.