Brouwer’s Fixed Point Theorem

Recently, we concluded the text on the Transformation Formula by remarking that it is a tool in an elementary proof of Brouwer’s Fixed Point Theorem. Let’s have a closer look at that.

Brouwer’s Fixed Point Theorem is at the core of many insights in topology and functional analysis. Like many other powerful theorems, it can be stated and understood very easily; the proof, however, is quite deep. The conclusions that are drawn from it are considered deeper still. As we shall see, Brouwer’s theorem can be shown in an elementary fashion, where the Transformation Formula, the Inverse Function Theorem and Weierstrass’ Approximation Theorem are the toughest stepping stones; note that we have given a perfectly elementary proof of Weierstrass’ Theorem before. This makes Brouwer’s theorem accessible to undergraduate calculus students (even though, of course, these stepping stones already mean bringing the big guns to the fight). The downside is that the proof, even though elementary, is quite long-ish. The undergraduate student needs to exercise some patience.

Theorem (Brouwer, 1910): Let $K:=\{x\in\mathbb{R}^p\colon \left|x\right|\leq1\}$ be the compact unit ball, and let $f:K\to K$ be continuous. Then $f$ has a fixed point, i.e. there is some $x\in K$ such that $f(x)=x$.

There are many generalizations of the Theorem, considering more complex sets instead of $K$, or taking place in infinite-dimensional spaces. We shall get back to that later. First, we shall look at a perfectly trivial and then a slightly less trivial special case.

For $p=1$, the statement asks us to find a fixed point of a continuous mapping $f:[0,1]\to[0,1]$; w.l.o.g. we have shrunk the set $K$ to $[0,1]$ instead of $[-1,1]$ to avoid some useless notational difficulty. This is a standard exercise on the intermediate value theorem, applied to the function $g(x):=f(x)-x$. Either $f(1)=1$ is a fixed point, or else $f(1)<1$, meaning $g(1)<0$ and $g(0)=f(0)\geq0$. As $g$ is continuous, the intermediate value theorem yields some $\xi$ with $0=g(\xi) = f(\xi)-\xi$, and hence $f(\xi)=\xi$. q.e.d. ($p=1$)
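This one-dimensional argument is effectively an algorithm: bisecting on the sign of $g$ homes in on a fixed point. Here is a minimal numerical sketch; the choice $f=\cos$ (which maps $[0,1]$ into itself) is merely an illustrative example:

```python
import math

def fixed_point_1d(f, tol=1e-10):
    """Locate a fixed point of a continuous f: [0,1] -> [0,1] by
    bisecting on the sign of g(x) = f(x) - x, as in the IVT argument."""
    a, b = 0.0, 1.0
    if f(b) == b:
        return b          # f(1) = 1 is itself a fixed point
    while b - a > tol:
        m = 0.5 * (a + b)
        if f(m) - m >= 0.0:
            a = m         # g(m) >= 0: a zero of g lies in [m, b]
        else:
            b = m         # g(m) < 0: a zero of g lies in [a, m]
    return 0.5 * (a + b)

# cos maps [0,1] into [cos 1, 1] and has a unique fixed point there,
# the so-called Dottie number 0.739085...
x_star = fixed_point_1d(math.cos)
```

Note that this only works in dimension one, where the intermediate value theorem is available; it is precisely this constructive handle that is missing in higher dimensions.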

For $p=2$, things are still easy to see, even though a little less trivial. This is an application of homotopy theory (even though one doesn’t need to know much about it). The proof, however, is by contradiction. We will show an auxiliary statement first: there is no continuous mapping $h:K\to\partial K$ which is the identity on $\partial K$, i.e. $h(x)=x$ for $x\in\partial K$. If there were, we would set

$\displaystyle H(t,s):=h(se^{it}), \qquad t\in[0,2\pi], s\in[0,1].$

$H$ is a homotopy of the constant curve $h(0) = H(t,0)$ to the circle $e^{it} = h(e^{it}) = H(t,1)$. This means, we can continuously transform the constant curve to the circle. This is a contradiction, as the winding number of the constant is $0$, but the winding number of the circle is $1$. There can be no such $h$.

Now, we turn to the proof of the actual statement of Brouwer’s Theorem: if $f$ had no fixed point, we could define a continuous mapping as follows: let $x\in K$, and consider the line through $x$ and $f(x)$ (which is well-defined by assumption). This line crosses $\partial K$ in the point $h(x)$; actually there are two such points, and we shall use the one that is closer to $x$ itself. Clearly, $h(x)=x$ for $x\in\partial K$. By the auxiliary statement, there is no such $h$, and the assumption fails: $f$ must have a fixed point. q.e.d. ($p=2$)

For the general proof, we shall follow the lines of Heuser, who found this elementary approach in the 1970s and made it accessible in the second volume of his book on calculus. It is interesting that most of the standard literature for undergraduate students shies away from any proof of Brouwer’s theorem. Often, the theorem is stated without proof and then some conclusions and applications are drawn from it. Sometimes, a proof via differential forms is given (such as in Königsberger’s book, where it is somewhat downgraded to an exercise after the proof of Stokes’ Theorem), which I wouldn’t call elementary because of the theory which needs to be developed first. The same holds for proofs using homology groups and the like (even though this is one of the simplest ways to prove the auxiliary statement given above – it was done in my topology class, but it is by no means elementary).

A little downside is the non-constructiveness of the proof we are about to give. It is a proof by contradiction and it won’t give any indication on how to find the fixed point. For many applications, even the existence of a fixed point is already a gift (think of Peano’s theorem on the existence of solutions to a differential equation, for instance). On the other hand, there are constructive proofs as well, a fact that is quite in the spirit of Brouwer.

In some way, the basic structure of the following proof is similar to the proof that we gave for the case $p=2$. We will apply the same reasoning that concluded the proof of the special case (after the auxiliary statement); we will just add a little more formality to show that the mapping in question is actually continuous and well-defined. The trickier part in higher dimensions is the counterpart of the auxiliary statement from which the contradiction followed. Our auxiliary statement within this previous proof involved the non-existence of a certain continuous mapping, which is called a retraction: for a subset $A$ of a topological space $X$, $f:X\to A$ is called a retraction of $X$ to $A$ if $f(x)=x$ for all $x\in A$. We have found that there is no retraction from $K$ to $\partial K$. As a matter of fact, Brouwer’s Fixed Point Theorem and the non-existence of such a retraction are equivalent (we’ll get back to that at the end).

The basic structure of the proof is like this:

• we reduce the problem to polynomials, so we only have to deal with those functions instead of a general continuous $f$;
• we formalize the geometric intuition that came across in the special case $p=2$ (this step is in essence identical to what we did above): based on the assumption that Brouwer’s Theorem is wrong, we define a mapping quite similar to a retraction of $K$ to $\partial K$;
• we show that this almost-retraction is locally bijective;
• we find, via the Transformation Formula, a contradiction: there can be no retraction and there must be a fixed point.

Steps 3 and 4 are the tricky part. They may be replaced by some other argument that yields a contradiction (homology theory, for instance), but we’ll stick to the elementary parts. Let’s go.

Lemma (The polynomial simplification): It will suffice to show Brouwer’s Fixed Point Theorem for those functions $f:K\to K$, whose components are polynomials on $K$ and which have $f(K)\subset\mathring K$.

Proof: Let $f:K\to K$ be continuous, with components $f = (f_1,\ldots,f_p)$, each of which takes the arguments $x_1,\ldots,x_p$. By Weierstrass’ Approximation Theorem, for any $\varepsilon>0$ there are polynomials $p_k^\varepsilon$ such that $\left|f_k(x)-p_k^\varepsilon(x)\right| < \varepsilon$, $k=1,\ldots,p$, for any $x\in K$. In particular, there are polynomials $\varphi_{k,n}$ such that

$\displaystyle \left|f_k(x)-\varphi_{k,n}(x)\right| < \frac{1}{\sqrt{p}n}\qquad\text{for any }x\in K.$

If we define the function $\varphi_n:=(\varphi_{1,n},\ldots,\varphi_{p,n})$ which maps $K$ to $\mathbb{R}^p$, we get

$\displaystyle \begin{aligned} \left|f(x)-\varphi_n(x)\right|^2 &= \sum_{k=1}^p\left|f_k(x)-\varphi_{k,n}(x)\right|^2 \\ &< \frac{p}{pn^2} \\ &= \frac1{n^2},\qquad\text{for any }x\in K \end{aligned}$

and in particular $\varphi_n\to f$ uniformly in $K$.

Besides,

$\displaystyle \left|\varphi_n(x)\right|\leq\left|\varphi_n(x)-f(x)\right| + \left|f(x)\right| < \frac1n + \left|f(x)\right| \leq \frac1n + 1 =:\alpha_n.$

This allows us to set

$\displaystyle \psi_n(x) = \frac{\varphi_n(x)}{\alpha_n}$.

This function also converges uniformly to $f$, as for any $x\in K$,

$\displaystyle \begin{aligned} \left|\psi_n(x)-f(x)\right| &= \left|\frac{\varphi_n(x)}{\alpha_n} - f(x)\right| = \frac1{\alpha_n}\left|\varphi_n(x)-\alpha_nf(x)\right|\\ &\leq \frac1{\alpha_n}\left|\varphi_n(x)-f(x)\right| + \frac1{\alpha_n}\left|f(x)\right|\left|1-\alpha_n\right|\\ &< \frac1{\alpha_n}\frac1n + \frac1{\alpha_n}\left|f(x)\right|\frac1n \leq \frac1n+\frac1n = \frac2n\\ &< \varepsilon \qquad\text{for }n\gg0, \end{aligned}$

where we have used $\alpha_n = 1+\frac1n\geq1$ and $\left|f(x)\right|\leq1$.

Finally, for $x\in K$, by construction, $\left|\varphi_n(x)\right|\leq\alpha_n$, and so $\left|\psi_n(x)\right| = \frac{\left|\varphi_n(x)\right|}{\alpha_n} < 1$, which means that $\psi_n:K\to\mathring K$.

The point of this lemma is that if we show Brouwer’s Fixed Point Theorem for every such function $\psi_n:K\to\mathring K$ whose components are polynomials, we will have proved it for the general continuous function $f$. This can be seen as follows:

So suppose Brouwer’s Theorem holds for the $\psi_n$; then there is a sequence $(x_n)\subset K$ with $\psi_n(x_n) = x_n$. As $K$ is (sequentially) compact, there is a convergent subsequence $(x_{n_j})\subset(x_n)$ with $\lim_jx_{n_j} = x_0\in K$. For sufficiently large $j$, we see

$\displaystyle \left|\psi_{n_j}(x_{n_j})-f(x_0)\right| \leq\left|\psi_{n_j}(x_{n_j})-f(x_{n_j})\right| + \left|f(x_{n_j})-f(x_0)\right| < \frac\varepsilon2 + \frac\varepsilon2.$

The first bound follows from the fact that $\psi_{n_j}\to f$ uniformly, the second bound is the continuity of $f$ itself, with the fact that $x_{n_j}\to x_0$. In particular,

$\displaystyle x_0 = \lim_{j} x_{n_j} = \lim_{j} \psi_{n_j}(x_{n_j}) = f(x_0).$

So, $f$ has the fixed point $x_0$, which proves Brouwer’s Theorem.

In effect, it suffices to deal with functions like the $\psi_n$ for the rest of this text. q.e.d.
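The lemma’s rescaling trick can be watched numerically. In the sketch below (dimension $p=1$, with $K=[-1,1]$), a degree-7 Taylor polynomial of $f=\sin$ stands in for the Weierstrass approximant $\varphi_n$; both choices are merely illustrative assumptions, not part of the proof:

```python
import math

def taylor_sin(x, terms=4):
    # degree-7 Taylor polynomial of sin, standing in for phi_n
    return sum((-1)**k * x**(2*k + 1) / math.factorial(2*k + 1)
               for k in range(terms))

n = 1000                    # target accuracy 1/n on K = [-1, 1]
alpha = 1.0 + 1.0 / n       # alpha_n = 1/n + 1, as in the lemma
grid = [i / 500.0 - 1.0 for i in range(1001)]   # sample points of K

# phi approximates f = sin within 1/n on K ...
assert all(abs(taylor_sin(x) - math.sin(x)) < 1.0 / n for x in grid)

# ... and psi = phi / alpha_n still approximates f within 2/n, but now
# maps K strictly into the open ball: |psi(x)| < 1 everywhere.
psi = [taylor_sin(x) / alpha for x in grid]
assert all(abs(p) < 1.0 for p in psi)
assert all(abs(p - math.sin(x)) < 2.0 / n for p, x in zip(psi, grid))
```

The division by $\alpha_n$ costs almost nothing in accuracy but buys the strict inequality $\left|\psi_n(x)\right|<1$, which the later steps of the proof rely on.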

Slogan (The geometric intuition): If Brouwer’s Fixed Point Theorem is wrong, then there is “almost” a retraction of $K$ to $\partial K$.

Or, rephrased as a proper lemma:

Lemma: For $f$ polynomially simplified as in the previous lemma, and assuming $x\neq f(x)$ for every $x\in K$, we can construct continuously differentiable functions $g_t:K\to K$, $t\in[0,1]$, with $g_t(x)=x$ for $x\in\partial K$. These functions are given via

$\displaystyle g_t(x) =x + t\lambda(x)\bigl(x-f(x)\bigr),$

$\displaystyle \lambda(x)=\frac{-x\cdot\bigl(x-f(x)\bigr)+\sqrt{\left(x\cdot\bigl(x-f(x)\bigr)\right)^2+\bigl(1-\left|x\right|^2\bigr)\left|x-f(x)\right|^2}}{\left|x-f(x)\right|^2}.$

For fixed $x$, the mapping $t\mapsto g_t(x)$ parametrizes the straight segment from $x$ to $\partial K$ along the line through $f(x)$ and $x$; $\lambda(x)$ is the parameter in the straight line that defines the intersection with $\partial K$.

Proof: As we suppose Brouwer’s Fixed Point Theorem is wrong, the continuous function $\left|x-f(x)\right|$ is positive for any $x\in K$. Because of continuity, for every $y\in \partial K$ there is some $\varepsilon = \varepsilon(y) > 0$ such that still $\left|x-f(x)\right|>0$ in the neighborhood $U_{\varepsilon(y)}(y)$.

Here, we have been in technical need of a continuation of $f$ beyond $K$. As $f$ is only defined on $K$ itself, we might take $f(x):=f\bigl(\frac{x}{\left|x\right|}\bigr)$ for $\left|x\right|>1$. We still have $\left|f(x)\right| < 1$ and $f(x)\neq x$, which means that we don’t get contradictions to our assumptions on $f$. Let’s not dwell on this for longer than necessary.

On the compact set $\partial K$, finitely many of the neighborhoods $U_{\varepsilon(y)}(y)$ suffice to cover $\partial K$. Their union, together with $\mathring K$, is an open set containing the compact set $K$, and hence contains some ball $U_\delta(0)$ with $\delta>1$. We thus find: there is an open set $U = U_\delta(0)\supset K$ with $\left|x-f(x)\right| >0$ for all $x\in U$.

Let us define for any $x\in U$

$\displaystyle d(x):=\frac{\left(x\cdot\bigl(x-f(x)\bigr)\right)^2+\bigl(1-\left|x\right|^2\bigr)\left|x-f(x)\right|^2}{\left|x-f(x)\right|^4}$.

It is well-defined by assumption. We distinguish three cases:

$a)$ $\left|x\right|<1$: Then $1-\left|x\right|^2>0$ and the numerator of $d(x)$ is positive.

$b)$ $\left|x\right|=1$: Then the numerator of $d(x)$ is

$\displaystyle \left(x\cdot\bigl(x-f(x)\bigr)\right)^2 = \bigl(x\cdot x - x\cdot f(x)\bigr)^2 = \bigl(\left|x\right|^2-x\cdot f(x)\bigr)^2 = \bigl(1-x\cdot f(x)\bigr)^2,$

where by Cauchy-Schwarz and by assumption on $f$, we get

$\displaystyle x\cdot f(x) \leq \left|x\right|\left|f(x)\right| = \left|f(x)\right| < 1\qquad (\spadesuit).$

In particular, the numerator of $d(x)$ is strictly positive.

$c)$ $\left|x\right|>1$: This case is not relevant for what’s to come.

We have seen that $d(x)>0$ for all $\left|x\right|\leq 1$. Since $d$ is continuous, a compactness argument similar to the one above shows that there is some $V = V_{\delta'}(0)\supset K$ with $d(x)>0$ for all $x\in V$. Replacing $\delta'$ by $\delta$ if $\delta$ is smaller, we find: $d$ is positive and well-defined on $V$.

The reason why we have looked at $d$ is not clear yet. Let us first get a grasp of the geometry. Let $x\in V$ and let $\Gamma_x = \left\{x+\lambda\bigl(x-f(x)\bigr)\colon\lambda\in\mathbb{R}\right\}$ be the straight line through $x$ and $f(x)$. If we look for the intersection of $\Gamma_x$ with $\partial K$, we solve the equation

$\displaystyle\left|x+\lambda\bigl(x-f(x)\bigr)\right| = 1.$

The intersection “closer to” $x$ corresponds to some parameter $\lambda\geq0$.

This equation comes down to

$\displaystyle \begin{aligned} && \left(x+\lambda\bigl(x-f(x)\bigr)\right) \cdot \left(x+\lambda\bigl(x-f(x)\bigr)\right) &=1 \\ &\iff& \left|x\right|^2 + 2\lambda x\cdot\bigl(x-f(x)\bigr) + \lambda^2\left|x-f(x)\right|^2 &=1\\ &\iff& \lambda^2\left|x-f(x)\right|^2 + 2\lambda x\cdot\bigl(x-f(x)\bigr) &= 1-\left|x\right|^2\\ &\iff& \left(\lambda+\frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\right)^2 &= \frac{1-\left|x\right|^2}{\left|x-f(x)\right|^2} + \left(\frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\right)^2 \\ &\iff& \left(\lambda+\frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\right)^2 &= \frac{\bigl(1-\left|x\right|^2\bigr)\left|x-f(x)\right|^2+\left(x\cdot\bigl(x-f(x)\bigr)\right)^2}{\left|x-f(x)\right|^4} \\ &\iff& \left(\lambda+\frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\right)^2 &= d(x). \end{aligned}$

As $x\in V$, $d(x)>0$, and hence there are two real solutions to the last displayed equation. Let $\lambda(x)$ be the larger one (to get the intersection with $\partial K$ closer to $x$), then we find

$\displaystyle \begin{aligned} \lambda(x) &= \sqrt{d(x)} - \frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\\ &= \frac{-x\cdot\bigl(x-f(x)\bigr)+\sqrt{\left(x\cdot\bigl(x-f(x)\bigr)\right)^2+\bigl(1-\left|x\right|^2\bigr)\left|x-f(x)\right|^2}}{\left|x-f(x)\right|^2}. \end{aligned}$

By construction,

$\displaystyle \left|x+\lambda(x)\bigl(x-f(x)\bigr)\right| = 1,\qquad\text{for all }x\in V.\qquad(\clubsuit)$

Let us define

$\displaystyle g_t(x) = x+t\lambda(x)\bigl(x-f(x)\bigr),\qquad t\in[0,1],~x\in V.$

This is (at least) a continuously differentiable function: we simplified $f$ to be a polynomial, the denominator in $\lambda(x)$ is bounded away from $0$, and the argument of the square root is positive on $V$. Trivially and by construction, $g_0(x)=x$ and $\left|g_1(x)\right| = 1$ for all $x\in V$.

For $\left|x\right|<1$ and $t<1$, we have

$\displaystyle \begin{aligned} \left|x+t\lambda(x)\bigl(x-f(x)\bigr)\right| &\stackrel{\hphantom{(\clubsuit)}}{=} \left|\bigl(t+(1-t)\bigr)x + t\lambda(x)\bigl(x-f(x)\bigr)\right|\\ &\stackrel{\hphantom{(\clubsuit)}}{=}\left|t\left(x+\lambda(x)\bigl(x-f(x)\bigr)\right)+(1-t)x\right|\\ &\stackrel{\hphantom{(\clubsuit)}}{\leq} t\left|x+\lambda(x)\bigl(x-f(x)\bigr)\right|+(1-t)\left|x\right|\\ &\stackrel{(\clubsuit)}{=} t+(1-t)\left|x\right|\\ &\stackrel{\hphantom{(\clubsuit)}}{<} t+(1-t) = 1\qquad (\heartsuit). \end{aligned}$

Hence, $\left|g_t(x)\right|<1$ for $\left|x\right|<1$ and $t\in[0,1)$. This means $g_t(\mathring K)\subset\mathring K$ for $t<1$.

For $\left|x\right|=1$, we find (notice that $x\cdot\bigl(x-f(x)\bigr)>0$ for $\left|x\right|=1$, by $(\spadesuit)$):

$\displaystyle \begin{aligned} \lambda(x) &= \frac{-x\cdot\bigl(x-f(x)\bigr)+\sqrt{\left(x\cdot\bigl(x-f(x)\bigr)\right)^2}}{\left|x-f(x)\right|^2} \\ &= \frac{-x\cdot\bigl(x-f(x)\bigr)+x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2} = 0. \end{aligned}$

This is geometrically entirely obvious: $\lambda(x)$ is the line parameter of the intersection with $\partial K$; if $x\in\partial K$, this parameter is clearly $0$.

We have seen that $g_t(x)=x$ for $\left|x\right|=1$ and any $t\in[0,1]$. Hence, $g_t(\partial K)=\partial K$ for all $t$: $g_t$ is almost a retraction, and $g_1$ actually is one. q.e.d.
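The closed form for $\lambda(x)$ can be sanity-checked numerically: property $(\clubsuit)$ and the vanishing of $\lambda$ on $\partial K$ only require $x\neq f(x)$, so we can test them with an arbitrary, hypothetical polynomial self-map of the disk ($p=2$), even though such an $f$ necessarily has a fixed point somewhere:

```python
import math, random

def lam(x, fx):
    # lambda(x) from the lemma: the larger root of |x + l*(x - f(x))| = 1,
    # assuming x != f(x) and |x| <= 1
    d = (x[0] - fx[0], x[1] - fx[1])
    dd = d[0]*d[0] + d[1]*d[1]            # |x - f(x)|^2
    xd = x[0]*d[0] + x[1]*d[1]            # x . (x - f(x))
    xx = x[0]*x[0] + x[1]*x[1]            # |x|^2
    return (-xd + math.sqrt(xd*xd + (1.0 - xx)*dd)) / dd

def f(x):
    # an arbitrary polynomial map of the disk into its interior
    return (0.5*x[0]*x[1], 0.25*(x[0]**2 - x[1]))

random.seed(0)
for _ in range(1000):
    r, t = math.sqrt(random.random()), 2*math.pi*random.random()
    x = (r*math.cos(t), r*math.sin(t))    # random point of the disk
    fx = f(x)
    l = lam(x, fx)
    g1 = (x[0] + l*(x[0] - fx[0]), x[1] + l*(x[1] - fx[1]))
    assert l >= -1e-12                    # the larger root is >= 0 on K
    assert abs(math.hypot(g1[0], g1[1]) - 1.0) < 1e-9   # (club) holds

# on the boundary, lambda vanishes, so every g_t fixes boundary points
assert abs(lam((1.0, 0.0), f((1.0, 0.0)))) < 1e-12
```

The sample simply avoids the fixed point; that no continuous self-map of the disk can avoid one everywhere is, of course, the theorem itself.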

Note how tricky the general formality gets, compared to the more compact and descriptive proof that we gave in the special case $p=2$. Still, the arguments in the lemma and in the special case are essentially identical.

Lemma (The bijection): Let $\hat K$ be a closed ball around $0$ with $K\subset\hat K\subset V$. The function $g_t$ is a bijection on $\hat K$ for $t\geq0$ sufficiently small.

Proof: We first show that $g_t$ is injective. Let us define $h(x):=\lambda(x)\bigl(x-f(x)\bigr)$, for reasons of legibility. As we saw above, $h$ is (at least) continuously differentiable. We have

$\displaystyle g_t(x) = x+th(x),\qquad g_t'(x)=\mathrm{Id}+th'(x).$

As $\hat K$ is compact, $h'$ is bounded by $\left|h'(x)\right|\leq C$, say. By enlarging $C$ if necessary, we can take $C\geq1$. Now let $x,y\in\hat K$ with $g_t(x)=g_t(y)$. That means $x+th(x)=y+th(y)$ and so, by the mean value theorem,

$\displaystyle \left|x-y\right| = t\left|h(x)-h(y)\right|\leq tC\left|x-y\right|.$

By setting $\varepsilon:=\frac1C$ and taking $t\in[0,\varepsilon)$, we get $\left|x-y\right| = 0$. $g_t$ is injective for $t<\varepsilon$.

Our choice of $t$ also gives $\left|th'(x)\right| \leq tC < 1$. Let us briefly look at the convergent Neumann series $\sum_{k=0}^\infty\bigl(-th'(x)\bigr)^k$, having the limit $s$, say. We find

$\displaystyle -s\,th'(x) = \sum_{k=0}^\infty\bigl(-th'(x)\bigr)^{k+1} = s-\mathrm{Id},$

which tells us

$\displaystyle \mathrm{Id} = s+s\cdot th'(x) = s\bigl(\mathrm{Id}+th'(x)\bigr).$

In particular, $g_t'(x) = \mathrm{Id}+th'(x)$ is invertible, with the inverse $s$. Therefore, $\det g_t'(x)\neq0$. Since this determinant is a continuous function of $t$, and $\det g_0'(x) = \det\mathrm{Id} = 1$, we have found

$\displaystyle \det g_t'(x) > 0 \text{ for any }t\in[0,\varepsilon),~x\in\hat K$.
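The Neumann series step can be watched on a concrete matrix. In this sketch, a random $2\times2$ matrix $H$ stands in for $h'(x)$, and $t$ is chosen small enough that the series converges; truncating $\sum_k(-tH)^k$ then yields an inverse of $\mathrm{Id}+tH$. All concrete values are illustrative assumptions:

```python
import random

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]

def madd(A, B):
    return [[A[i][j] + B[i][j] for j in range(2)] for i in range(2)]

I = [[1.0, 0.0], [0.0, 1.0]]
random.seed(2)
H = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(2)]
t = 0.1                        # small enough that |tH| < 1

A = [[-t * H[i][j] for j in range(2)] for i in range(2)]   # A = -tH
s, term = I, I
for _ in range(60):            # s = sum of A^k, truncated at k = 60
    term = matmul(term, A)
    s = madd(s, term)

# s inverts Id + tH, mirroring the invertibility of g_t'(x) above
check = matmul(s, madd(I, [[t * H[i][j] for j in range(2)] for i in range(2)]))
assert all(abs(check[i][j] - I[i][j]) < 1e-12
           for i in range(2) for j in range(2))
```

The geometric decay of the series terms is what makes the truncation error negligible here; for $t$ close to $1/C$ the convergence would slow down accordingly.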

Now, let us show that $g_t$ is surjective. As $\det g_t'(x)$ never vanishes on $\hat K$, $g_t$ is an open mapping (by an argument involving the inverse function theorem; $g_t$ can be inverted locally in any point, hence no point can be a boundary point of the image). This means that $g_t(\mathring K)$ is open.

Let $z\in K$ with $z\notin g_t(\mathring K)$; this makes $z$ the test case for non-surjectivity. Let $y\in g_t(\mathring K)$; there is some such $y$ due to $(\heartsuit)$. The straight line between $y$ and $z$ is

$\displaystyle \overline{yz}:=\left\{(1-\lambda)y+\lambda z\colon \lambda\in[0,1]\right\}.$

As $g_t$ is continuous, there must be some point $v\in\partial g_t(\mathring K)\cap\overline{yz}$; we have to leave the set $g_t(\mathring K)$ somewhere. Let us walk the line until we do, and then set

$\displaystyle v=(1-\lambda_0)y+\lambda_0z,\qquad\text{with }\lambda_0=\sup\left\{\lambda\in[0,1]\colon(1-\mu)y+\mu z\in g_t(\mathring K)\text{ for all }\mu\in[0,\lambda]\right\}.$

Now, continuous images of compact sets remain compact: $g_t(K)$ is compact and hence closed. Therefore, we can conclude

$\displaystyle g_t(\mathring K)\subset g_t(K)\quad\implies\quad \overline{g_t(\mathring K)}\subset g_t(K)\quad\implies\quad v\in\overline{g_t(\mathring K)}\subset g_t(K).$

This means that there is some $u\in K$ such that $v=g_t(u)$. As $g_t(\mathring K)$ is open, $u\in\partial K$ (since otherwise, $v\notin\partial g_t(\mathring K)$ which contradicts the construction). Therefore, $\left|u\right|=1$, and since $g_t$ is almost a retraction, $g_t(u)=u$. Now,

$\displaystyle v=g_t(u) = u \quad\implies\quad v\in\partial K.$

But by construction, $v$ lies on the segment between $z\in K$ and $y\in g_t(\mathring K)$, and $y\notin\partial K$, since $g_t(\mathring K)\subset\mathring K$. Due to the convexity of $K$, every point strictly between the interior point $y$ and $z$ is itself interior, so we have no choice but $v=z\in\partial K$, and by the retraction property again, $g_t(z)=z$. In particular, $z\in g_t(\partial K)$.

We have shown that if $z\notin g_t(\mathring K)$, then $z\in g_t(\partial K)$. In particular, $z\in g_t(K)$ for any $z\in K$. $g_t$ is surjective. q.e.d.

Lemma (The Integral Application): The real-valued function

$\displaystyle V(t)=\int_K\det g_t'(x)dx$

is a polynomial and satisfies $V(1)>0$.

Proof: We have already seen in the previous lemma that $\det g_t'(x)>0$ for $x\in\hat K$ and $t<\varepsilon$. This fact allows us to apply the transformation formula to the integral:

$\displaystyle V(t) = \int_{g_t(K)}1dx.$

As $g_t$ is surjective for these small $t$, we have $g_t(K) = K$, and therefore

$\displaystyle V(t) = \int_K1dx = \mu(K).$

In particular, this no longer depends on $t$, which implies $V(t)>0$ for any $t<\varepsilon$.

By the Leibniz representation of the determinant, $\det g_t'(x)$ is a polynomial in $t$, and therefore, so is $V(t)$. The identity theorem shows that $V$ is constant altogether: in particular $V(1)=V(0)>0$. q.e.d.

Now we can readily conclude the proof of Brouwer’s Fixed Point Theorem, and we do it in a rather unexpected way. After the construction of $g_t$, we had found $\left|g_1(x)\right|=1$ for all $x\in V$. Let us write this in its components and take a partial derivative ($j=1,\ldots,p$)

$\displaystyle \begin{aligned} &&1 &= \sum_{k=1}^p\bigl(g_{1,k}(x)\bigr)^2\\ &\implies& 0 &= \frac\partial{\partial x_j}\sum_{k=1}^p\bigl(g_{1,k}(x)\bigr)^2 = \sum_{k=1}^p2\frac{\partial g_{1,k}(x)}{\partial x_j}g_{1,k}(x). \end{aligned}$

This last line is a homogeneous system of linear equations, that we might also write like this

$\displaystyle \begin{pmatrix}\frac{\partial g_{1,1}(x)}{\partial x_1}&\cdots &\frac{\partial g_{1,p}(x)}{\partial x_1}\\ \vdots&&\vdots\\ \frac{\partial g_{1,1}(x)}{\partial x_p}&\cdots&\frac{\partial g_{1,p}(x)}{\partial x_p}\end{pmatrix} \begin{pmatrix}\xi_1\\\vdots\\\xi_p\end{pmatrix} = 0,$

and our computation has shown that the vector $\bigl(g_{1,1}(x),\ldots,g_{1,p}(x)\bigr)$ is a solution. But the vector $0$ is a solution as well. These solutions are distinct because of $\left|g_1(x)\right| = 1$. If a homogeneous system of linear equations has two distinct solutions, its matrix must be singular (it is not injective), and the determinant of the linear system vanishes:

$\displaystyle 0 = \det \begin{pmatrix}\frac{\partial g_{1,1}(x)}{\partial x_1}&\cdots &\frac{\partial g_{1,p}(x)}{\partial x_1}\\ \vdots&&\vdots\\ \frac{\partial g_{1,1}(x)}{\partial x_p}&\cdots&\frac{\partial g_{1,p}(x)}{\partial x_p}\end{pmatrix} = \det g_1'(x).$

This means

$\displaystyle 0 = \int_K\det g_1'(x)dx = V(1) > 0$.

A contradiction, which stems from the basic assumption that Brouwer’s Fixed Point Theorem were wrong. The Theorem is thus proved. q.e.d.

Let us make some concluding remarks. Our proof made crucial use of the fact that if there is a retraction, Brouwer’s Theorem must be wrong (this is where we got our contradiction in the end: the retraction cannot exist). The proof may also be run the other way round. If we had proved Brouwer’s Theorem without reference to retractions (this is how Elstrodt does it), we could conclude that there is no retraction from $K$ to $\partial K$ as follows: if there were a retraction $g:K\to\partial K$, we could consider the mapping $-g$. It is, in particular, a mapping of $K$ to itself, but it does not have any fixed point (a fixed point would satisfy $x=-g(x)$, hence $\left|x\right|=1$ and thus $g(x)=x$, so $x=-x=0$, contradicting $\left|x\right|=1$) – a contradiction to Brouwer’s Theorem.

Brouwer’s Theorem, as we have stated it here, is not yet ready to drink. For many applications, the set $K$ is too much of a restriction. It turns out, however, that the hardest work has been done. Some little approximation argument (which in the end amounts to continuous projections) allows one to formulate, for instance:

• Let $C\subset\mathbb{R}^p$ be convex, compact and $C\neq\emptyset$. Let $f:C\to C$ be continuous. Then $f$ has a fixed point.
• Let $E$ be a normed space, $K\subset E$ convex and $\emptyset\neq C\subset K$ compact. Let $f:K\to C$ be continuous. Then $f$ has a fixed point.
• Let $E$ be a normed space, $K\subset E$ convex and $K\neq\emptyset$. Let $f:K\to K$ be continuous. Let either $K$ be compact or $K$ bounded and $f(K)$ relatively compact. Then $f$ has a fixed point.

The last two statements are called Schauder’s Fixed Point Theorems; they are often applied in functional analysis and famously used in proofs of Peano’s Theorem in differential equations. But at the core of all of them is Brouwer’s Theorem. This seems like a good place to end.

On approximating functions – Part Two

Last time, we proved the Stone-Weierstrass Theorem on how to find approximating functions for any continuous function on a compact interval, using only a certain algebra of “easy” functions – such as polynomials.

Now that the hard, abstract work is done, we shall note that there are other, less abstract ways to prove Weierstrass’ Approximation Theorem. I am unaware of a less abstract method for Stone-Weierstrass; in particular, the method given there is non-constructive. You could go through most of these steps explicitly, but the way that we used compactness is not constructive in general – how would you choose the finite subcover, when you only know that one exists?

Let us therefore look at Bernstein Polynomials, which are associated with a given function $f$ on $[0,1]$:

$\displaystyle f_n(x) = \sum_{k=0}^n\binom{n}{k}f\left(\frac kn\right)x^k(1-x)^{n-k}.$

One can prove:

Proposition: Let $f:[0,1]\to\mathbb{R}$ be continuous. Then the Bernstein Polynomials converge uniformly to $f$.

Proof: Since $f$ is continuous on a compact interval, it is also uniformly continuous. Let $\varepsilon>0$. Then, there is some $\delta>0$ with $\left|f(x)-f(y)\right|<\varepsilon$ if $\left|x-y\right|<\delta$. Besides, for any $x\in[0,1]$,

$\displaystyle \left|f(x)-f_n(x)\right| \leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right|x^k(1-x)^{n-k},$

where we have used the standard fact that

$\displaystyle\sum_{k=0}^n\binom nk x^k(1-x)^{n-k} = \bigl(x+(1-x)\bigr)^n = 1,$

which will (if used again and together with uniform continuity) yield

$\displaystyle \left|f(x)-f_n(x)\right| \leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right|x^k(1-x)^{n-k}\mathbf{1}_{\left|x-\frac kn\right|\geq\delta} + \varepsilon.$

So, we have split our problem in two parts: in the first part, if we want to approximate $f$ well, it’s easy because an interpolation point for $f_n$ is close by. If this is not the case, the interpolation point is far away, meaning:

$\displaystyle \left|x-\frac kn\right|\geq\delta\quad\iff\quad 1 \leq \frac{\bigl(x-\frac kn\bigr)^2}{\delta^2}.$

On top of that, by continuity on a compact interval, $f$ is bounded by a constant $M$, say. This gives:

$\displaystyle \begin{aligned} \left|f(x)-f_n(x)\right| &\leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right| x^k(1-x)^{n-k}\frac{\bigl(x-\frac kn\bigr)^2}{\delta^2} + \varepsilon \\ &\leq \frac{2M}{\delta^2}\sum_{k=0}^n\binom{n}{k}\left(x-\frac kn\right)^2 x^k(1-x)^{n-k} + \varepsilon. \end{aligned}$

As far as ideas are concerned, we’re done. All that is left to do is to shove terms and binomial coefficients from left to right and to find out that they behave properly. We won’t do that here, as it’s not particularly hard; one only needs the standard identities $\sum_{k=0}^n\binom nk \frac kn x^k(1-x)^{n-k} = x$ and $\sum_{k=0}^n\binom nk \frac{k^2}{n^2} x^k(1-x)^{n-k} = x^2+\frac{x(1-x)}n$. One will find:

$\displaystyle\left |f(x)-f_n(x)\right| \leq \frac{2M}{\delta^2}\left(x^2-2x^2+x^2+\frac{x(1-x)}{n}\right) + \varepsilon\leq \frac{2M}{\delta^2}\frac1{4n}+\varepsilon < 2\varepsilon,$

if $n$ is chosen sufficiently large. q.e.d.
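To see the convergence in action, here is a small numerical sketch of the Bernstein construction; the test function $f(x)=\left|x-\frac12\right|$ is an arbitrary choice, picked because it is continuous but not differentiable:

```python
import math

def bernstein(f, n, x):
    # the n-th Bernstein polynomial of f, evaluated at x in [0, 1]
    return sum(math.comb(n, k) * f(k / n) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

def sup_error(f, n, samples=200):
    # sup-norm distance between f and f_n, estimated on a grid
    return max(abs(bernstein(f, n, i / samples) - f(i / samples))
               for i in range(samples + 1))

f = lambda x: abs(x - 0.5)
errs = [sup_error(f, n) for n in (10, 40, 160)]
# the error shrinks as n grows (for this f roughly like 1/sqrt(n))
assert errs[0] > errs[1] > errs[2]
assert errs[2] < 0.05
```

The slow $1/\sqrt n$-type decay visible here is typical: Bernstein polynomials converge for every continuous $f$, but not particularly fast.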

There is a variant of the proof which avoids some of the messier calculations that we didn’t pull through above. Well, this variant doesn’t exactly avoid the calculations; it rather shoves them off to another concept: the Weak Law of Large Numbers. We rely on Klenke’s book on Probability for this (but the idea is due to Bernstein).

Alternative Proof: Let $x\in[0,1]$, and let $X_1,\ldots,X_n$ be independent random variables with a Bernoulli distribution with parameter $x$ (the random experiment of throwing one coin whose chance of landing on heads is $x$). Then, the sum $S_n = \sum_{i=1}^nX_i$ has a binomial distribution $B_{n,x}$. Therefore,

$\displaystyle E\left[f\left(\frac{S_n}n\right)\right] = \sum_{k=0}^nf\left(\frac kn\right)P[S_n=k] = f_n(x).$

We find, again with $M = \left\|f\right\|_\infty$,

$\displaystyle \left|f\left(\frac{S_n}n\right) - f(x)\right| \leq \varepsilon + 2M \mathbf{1}_{|\frac{S_n}n-x|\geq\delta}$

and hence

$\displaystyle \left|f_n(x)-f(x)\right| \leq E\left[\left|f\left(\frac{S_n}n\right)-f(x)\right|\right] \leq \varepsilon + 2MP\left[\left|\frac{S_n}{n}-x\right|\geq\delta\right].$

By the Chebyshev Inequality (which is just shy of using the weak law of large numbers itself), since $E[\frac{S_n}n] = x$ and $\mathrm{Var}[\frac{S_n}n] = \frac1{n^2}nx(1-x)\leq\frac1{4n}$, we have

$\displaystyle P\left(\left|\frac{S_n}n-x\right|\geq\delta \right) = P\left(\left|\frac{S_n}n-E\left[\frac{S_n}n\right]\right|\geq\delta\right) \leq \mathrm{Var}\left[\frac{S_n}n\right]\delta^{-2} \leq \frac{1}{4n\delta^2}.$

Hence,

$\displaystyle \left|f_n(x)-f(x)\right| \leq \varepsilon + \frac{2M}{4\delta^2n} < 2\varepsilon,$

if $n$ is taken large enough. q.e.d.
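The identity $E\left[f\left(\frac{S_n}n\right)\right] = f_n(x)$ at the heart of this argument can also be checked by simulation; the concrete $f$, $n$ and $x$ below are arbitrary illustrative choices:

```python
import math, random

def bernstein(f, n, x):
    # f_n(x), written out as the expectation E[f(S_n / n)] suggests
    return sum(math.comb(n, k) * f(k / n) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

def mc_estimate(f, n, x, trials=50_000):
    # simulate S_n as a sum of n Bernoulli(x) coin flips, average f(S_n / n)
    total = 0.0
    for _ in range(trials):
        s = sum(random.random() < x for _ in range(n))
        total += f(s / n)
    return total / trials

random.seed(1)
f, n, x = math.sqrt, 30, 0.37
exact = bernstein(f, n, x)
approx = mc_estimate(f, n, x)
assert abs(exact - approx) < 5e-3   # agreement up to Monte Carlo noise
```

This is exactly Bernstein’s reading of the polynomial: $f_n(x)$ is the expected value of $f$ evaluated at the empirical head frequency of $n$ coin flips.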

These were two more direct ways of proving the classical Weierstrass Approximation Theorem. This is what Werner’s book on Functional Analysis does, and so does Körner’s treatment of Fourier series. On the pro side, it’s a direct and simple proof. The downside, however, is that you won’t get the stronger theorems given at the start of this text; it doesn’t even say anything about trigonometric polynomials.

But there’s a slight hope: actually, in the non-probabilistic approach, we only needed to prove that Bernstein Polynomials deal well with the monomials $1$, $x$ and $x^2$ (that’s the part that we have avoided, since there is some computation to be done). Maybe one can generalize this – and one can: this is Korovkin’s approach.

Theorem (Korovkin): Let $T_n:\mathcal{C}_{\mathbb{R}}(K)\to\mathcal{C}_{\mathbb{R}}(K)$ be a sequence of positive linear operators, and let $Q\subset \mathcal{C}_{\mathbb{R}}(K)$ be a test set that satisfies certain conditions on positivity and on its zero set.

If for any $q\in Q$, we have $\lim_{n\to\infty}\left\|T_nq-q\right\|_\infty=0$, then we also have $\lim_{n\to\infty}\left\|T_nf-f\right\|_\infty=0$ for all $f\in\mathcal{C}_{\mathbb{R}}(K)$.

Remember that an operator is just a function on a function space (plug a function in, get a function out). It is linear in the usual sense of that term, and positive means that $T_nf\geq0$ whenever $f\geq0$ everywhere it is defined.

Korovkin tells us that you only need to approximate certain test functions well with a cleverly chosen operator to make the classical Weierstrass Theorem work. This is what can be achieved by Bernstein Polynomials (they are a sequence of operators which only use finitely many values of a function) and the test set $Q=\{1, x, x^2\}$. You can also prove the Theorem on trigonometric polynomials with the operator $T_n$ chosen as convolution with the Fejér kernel and the test set $Q=\{1, \cos, \sin\}$. We don’t delve into this here, especially not into the conditions on the test set $Q$; let’s just say that there are such sets, and in the given examples you can’t make them smaller. In the detailed proof, there is some computation required, and you can actually learn many things about Fourier series (so go into it; for instance in the books by Werner on Functional Analysis, or in Hämmerlin’s book on Numerics – however, to place a quote by Gauss here: most readers may well wish in some places for greater clarity in the presentation).
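For the Bernstein operators, the hypothesis of Korovkin’s theorem on the test set $\{1,x,x^2\}$ can be verified directly: they reproduce $1$ and $x$ exactly, while $T_n x^2 = x^2 + \frac{x(1-x)}{n}$, so the defect is at most $\frac1{4n}$ and tends to $0$. A small numerical sketch:

```python
import math

def T(n, q, x):
    # the Bernstein operator applied to a test function q
    return sum(math.comb(n, k) * q(k / n) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

def sup_dist(n, q, samples=100):
    # sup-norm distance between T_n q and q, estimated on a grid
    return max(abs(T(n, q, i / samples) - q(i / samples))
               for i in range(samples + 1))

tests = [lambda x: 1.0, lambda x: x, lambda x: x * x]
for n in (5, 50, 200):
    assert sup_dist(n, tests[0]) < 1e-12             # T_n 1 = 1 exactly
    assert sup_dist(n, tests[1]) < 1e-9              # T_n x = x exactly
    assert sup_dist(n, tests[2]) <= 0.25 / n + 1e-9  # defect x(1-x)/n
```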

We won’t prove that here because for the main applications there is not much to gain on top of what we already have. But the approach has a certain elegance to it which is especially appreciated by functional analysts. I am not aware of generalizations of its applicability; I’d be very interested to hear about them.

As a final alternate proof of the full real version of Stone-Weierstrass Theorem (not just Weierstrass’ Theorem), we shall look at an elementary approach suggested by Brosowski and Deutsch in 1981 (see the original paper here). There is no dealing with convergent series, operators and the like; the proof only hinges on definitions of compactness and continuity and three easily proved ideas: closed subsets of compact spaces are compact, continuous positive functions have a positive minimum on compact sets, and the Bernoulli inequality $(1+h)^n\geq 1+nh$ for any $n\in\mathbb{N}$ and $h\geq-1$. We will give two lemmas before we are able to prove the real version of Stone-Weierstrass.
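Of the three ingredients, the Bernoulli inequality is the one that does the actual work below; a quick numerical sanity check (my own, not from the paper):

```python
import random

random.seed(1)
# Bernoulli: (1 + h)^n >= 1 + n*h for every integer n >= 0 and every h >= -1
ok = all((1 + h) ** n >= 1 + n * h - 1e-12   # tiny slack for float rounding
         for _ in range(1000)
         for h in [random.uniform(-1.0, 3.0)]
         for n in [random.randrange(0, 30)])
```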

The proof will actually show a little bit more, since we need not suppose that $K\subset\mathbb{R}$. In fact, Stone-Weierstrass holds for the continuous real-valued functions on any compact topological space (it’s just that we didn’t make a fuss about that earlier on – most applications deal with functions on subsets of $\mathbb{R}$, as a matter of fact).

The pointwise nearly indicator Lemma: Let $x_0\in K$ and let $U$ be a neighborhood of $x_0$. Then, for any $\varepsilon>0$ there is some neighborhood $V(x_0)$ of $x_0$ with $V\subset U$ and some function $g\in\mathcal{A}$ such that

• $0\leq g(x)\leq 1$ for $x\in K$,
• $g(x)<\varepsilon$ for $x\in V(x_0)$,
• $g(x)>1-\varepsilon$ for $x\in K\setminus U$.

The lemma got its name because $g$ serves as an indicator function (or characteristic function for the non-probabilists out there) for the set $K\setminus U$. Of course, the algebra $\mathcal A$ is not rich enough for a proper indicator function, since that wouldn’t be continuous. But the algebra approximates this indicator of $K\setminus U$, and we can guarantee the function $g$ to nearly vanish in some neighborhood of the given point $x_0$. We’ll improve on that soon.

Proof: First of all, we need to find a function that takes its values only in the interval $[0,1]$ and which already has a controllably small value at $x_0$. What easier way than to pick that small value to be $0$?

For any $y\in K\setminus U$, by separation, there is some function $g_y\in\mathcal{A}$ with $g_y(y)\neq g_y(x_0)$. Besides, the function $h_y(x) = g_y(x)-g_y(x_0)$ is in $\mathcal{A}$ and of course $h_y(y)\neq h_y(x_0) = 0$. Even the function $p_y(x) = \frac1{\left\|h_y^2\right\|_\infty}h_y^2(x)$ is in $\mathcal{A}$, we have $p_y(x_0) = 0$, $p_y(y)>0$ and $0\leq p_y\leq 1$.

That was the first important step. $p_y$ vanishes at $x_0$ and it doesn’t vanish in some point outside of $U$. Now, we need to push the “doesn’t vanish” part to make our function take values near $1$, uniformly outside of $U$.

Consider the set $W(y)=\{x\in K : p_y(x)>0\}$. This is a neighborhood of $y$, hence its name. We can look at any of those $W(y)$, whenever $y\in K\setminus U$. Now, $K\setminus U$ is compact (it’s a closed subset of a compact space), and hence finitely many $W(y)$ will suffice to cover $K\setminus U$, let’s say $K\setminus U \subset\bigcup_{k=1}^m W(y_k)$. The function $p(x) = \frac1m\sum_{k=1}^mp_{y_k}(x)$ is in $\mathcal{A}$, we still have $0\leq p\leq1$, $p(x_0)=0$ and $p(x)>0$ for any $x\in K\setminus U$.

Because of compactness, $p$ must take a positive minimum on the compact set $K\setminus U$, let’s call it $\delta$. Then, for this $\delta\in(0,1]$, we look at the set $V=\{x\in K : p(x)<\frac\delta2\}$. Since $p\geq\delta$ on $K\setminus U$, the set $V$ cannot meet $K\setminus U$, so $V\subset U$; and $V$ is a neighborhood of $x_0$ because $p(x_0)=0$.

Let us look at the integer $k = \left\lfloor \frac1\delta\right\rfloor+1$, the smallest integer with $k\delta>1$. Thus, $k-1\leq\frac1\delta$ and $k\leq\frac{1+\delta}{\delta} \leq \frac2\delta$. In total, $1< k\delta \leq 2$. We can therefore, for every $n\in\mathbb{N}$, define the functions $q_n(x) = \bigl(1-p^n(x)\bigr)^{k^n}$, which, of course, are in $\mathcal{A}$, which have $0\leq q_n\leq 1$ and which have $q_n(x_0)=1$.

Let $x\in V$. Then, $kp(x)< \frac{k\delta}2\leq 1$ and Bernoulli’s inequality (used with the exponent $k^n$ and $h:= -p^n(x)\geq -1$) shows $q_n(x) \geq 1-k^np^n(x) = 1-\bigl(kp(x)\bigr)^n\geq 1-\left(\frac{k\delta}2\right)^n$. Therefore, $\lim_{n\to\infty}q_n(x) = 1$, and this limit is uniform in $x\in V$.

Let $x\in K\setminus U$. Then, $kp(x)\geq k\delta> 1$ and Bernoulli’s inequality shows

\begin{aligned} q_n(x) &= \frac{1}{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n} k^np^n(x) \\ &\leq \frac1{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n}\bigl(1+k^np^n(x)\bigr)\\ &\leq \frac{1}{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n}\bigl(1+p^n(x)\bigr)^{k^n} \\ &= \frac{1}{k^np^n(x)}\bigl(1-p^{2n}(x)\bigr)^{k^n} \\ &\leq \frac{1}{k^n\delta^n}, \end{aligned}

and this converges to $0$ uniformly.

In particular, for any $\varepsilon>0$, we can find some $n$ large enough such that $q_n\leq\varepsilon$ on $K\setminus U$ and $q_n\geq 1-\varepsilon$ on $V$. Now, to conclude the proof, pick $V(x_0) = V$ (remember that $V$ was a neighborhood of $x_0$ by construction) and $g(x) = 1-q_n(x)$. q.e.d.

This one felt quite chore-ish. It was a little messy and technical, but note the necessary tools: no operator theory, no convergence of series, just stuff from high school calculus and the very basics of the theory of compact sets. That would seem like a value in and of itself.
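The construction can actually be watched numerically. In the following sketch (my own toy example, not from the text), take $K=[0,1]$, $x_0=0$, $U=[0,\tfrac12)$ and $p(x)=x^2$; then $\delta=\tfrac14$, the integer $k=5$ satisfies $1<k\delta\leq2$, and $q_n=(1-p^n)^{k^n}$ indeed tends to $1$ on $V=\{p<\tfrac\delta2\}$ and to $0$ on $K\setminus U$:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 2001)
p = x**2                      # p(x_0)=0 at x_0=0, p>0 outside U=[0,1/2)
delta = 0.25                  # minimum of p on K\U = [1/2,1]
k = 5                         # integer with 1 < k*delta <= 2
n = 20
q_n = (1.0 - p**n) ** (k**n)  # q_n = (1 - p^n)^(k^n), an element of the algebra

on_V  = q_n[x <= 0.35]        # V = {p < delta/2} contains [0, 0.35]
off_U = q_n[x >= 0.5]         # K \ U
# q_n is nearly 1 on V and nearly 0 outside U
```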

The setwise nearly indicator Lemma: Let $A,B\subset K$ be disjoint and closed. Let $\varepsilon\in(0,1)$. Then, there is some $g\in\mathcal{A}$ with

• $0\leq g(x)\leq 1$ for $x\in K$,
• $g(x)<\varepsilon$ for $x\in A$,
• $g(x)>1-\varepsilon$ for $x\in B$.

Obviously, this lemma improves on the first one, we have nearly an indicator function on $B$ which nearly vanishes on $A$. In this respect, it resembles Urysohn’s Lemma from elementary topology, but Urysohn isn’t restricted to the algebra $\mathcal A$ and can hence do even a little better. Never mind.

Proof: This is going to be a lot quicker: you know the drill. Let $\varepsilon>0$. Let $U=K\setminus B$ and let $y\in A$. We can choose neighborhoods $V(y)$ as in the previous Lemma. Of course, since $A$ is compact, finitely many of those $V(y)$ suffice to cover $A$, let’s say $A\subset\bigcup_{k=1}^m V(y_k)$. For each of the $V(y_k)$, there is a function $g_k\in\mathcal{A}$ such that $0\leq g_k\leq 1$, $g_k(x)>1-\frac\varepsilon m$ for all $x\in K\setminus U = B$ and $g_k(x)<\frac\varepsilon m$ for all $x\in V(y_k)$.

The function $g(x) = \prod_{k=1}^mg_k(x)$ will now do the trick: Of course, we have $0\leq g\leq1$.

Let $x\in A\subset\bigcup_{k=1}^mV(y_k)$, in particular $x\in V(y_{n_0})$, say. Then

$\displaystyle g(x) = \prod_{k=1}^mg_k(x)= g_{n_0}(x)\prod_{k=1, k\neq n_0}^mg_k(x) \leq \frac\varepsilon m\prod_{k=1, k\neq n_0}^m1 \leq\varepsilon.$

Now, let $x\in B$. Then Bernoulli tells us that $g(x)=\prod_{k=1}^mg_k(x)\geq\left(1-\frac\varepsilon m\right)^m \geq 1-\varepsilon$. Note that we need $\varepsilon\leq m$ here, which is guaranteed anyway by the assumption $\varepsilon\in(0,1)$ – that’s why we included a condition like that in the assertion.

That’s all we needed to show. q.e.d.

Note that the notion of compactness has been used slightly differently from the very first proof we gave for Stone-Weierstrass. There, we had the Approximation Lemma, which used continuity and direct neighborhoods where we could approximate $f$ adequately. Here, we have decomposed the underlying set $K$ by using the level sets of $f$ to approximate it. Each time, compactness tells us that we can do with finitely many sets – but the nature of those sets is ever so slightly different. The math, however, is not.

Now, let’s harvest the ideas:

Proof of the Stone-Weierstrass, real version:

Let $f\in\mathcal{C}_{\mathbb{R}}(K)$ and let $\varepsilon>0$. Without loss of generality, we may assume that $f\geq 0$, since we might as well replace it by $f+\left\|f\right\|_\infty$. Besides, $\varepsilon <\frac13$ is not much of a restriction.

Let $n\in\mathbb{N}$ be such that $(n-1)\varepsilon>\left\|f\right\|_\infty$. Let, for $j=0,\ldots,n$, $A_j$ and $B_j$ be the sets

$\displaystyle A_j = \left\{x\in K : f(x)\leq \left(j-\frac13\right)\varepsilon\right\}$

$\displaystyle B_j = \left\{x\in K : f(x)\geq \left(j+\frac13\right)\varepsilon\right\}.$

Obviously, those sets are disjoint for fixed $j$, and the $A_j$ are increasing to $A_n = K$, while the $B_j$ are decreasing to $B_n = \emptyset$. The setwise nearly indicator Lemma then says that there is some $g_j\in\mathcal{A}$ with $0\leq g_j\leq 1$, $g_j(x) < \frac\varepsilon n$ for $x\in A_j$ and $g_j(x) > 1-\frac{\varepsilon}{n}$ for $x\in B_j$.

That’s all well and good – but how will this help us to prove Stone-Weierstrass? Imagine the space $K$ decomposed like an onion into the $A_j$. The functions $g_j$ will have very small values on the $A_j$, but large values outside, in the $B_j$. We will combine these small and large values properly, to have similar values as the function $f$ in every ring of this onion. Note that the values of $f$ are closely attached to the $g_j$ – by the very definition of $A_j$ and $B_j$.

Now consider the function $g(x) = \varepsilon\sum_{k=0}^n g_k(x)$, which of course is in $\mathcal{A}$. Let $x\in K$; then there is some $j\geq 1$ with $x\in A_j\setminus A_{j-1}$ (note that $A_0=\emptyset$, since $f\geq0$), and so

$\displaystyle \left(j-\frac43\right)\varepsilon < f(x) \leq \left(j-\frac13\right)\varepsilon$,

and of course $g_i(x) < \frac\varepsilon n$ for all $i\geq j$. Besides, our fixed $x\in K$ is in $B_i$ for $i\leq j-2$. Therefore, $g_i(x)\geq 1-\frac\varepsilon n$. All of this tells us (remember that $\varepsilon<\frac13$)

\begin{aligned} g(x) = \varepsilon\sum_{k=0}^{j-1}g_k(x) + \varepsilon\sum_{k=j}^ng_k(x)&\leq j\varepsilon + \varepsilon(n-j+1)\frac{\varepsilon}{n} \\ &\leq j\varepsilon + \varepsilon^2 < j\varepsilon + \frac\varepsilon3 = \left(j+\frac13\right)\varepsilon. \end{aligned}

Now, if $j\geq2$,

\begin{aligned} g(x) \geq \varepsilon\sum_{k=0}^{j-2}g_k(x) &\geq (j-1)\varepsilon \left(1-\frac\varepsilon n\right) \\ &= (j-1)\varepsilon - \left(\frac{j-1}{n}\right)\varepsilon^2 > (j-1)\varepsilon - \varepsilon^2 > \left(j-\frac43\right)\varepsilon. \end{aligned}

Note that, for $j = 1$, $g(x) > \bigl(j-\frac43\bigr)\varepsilon$ holds trivially, since $g\geq 0 > -\frac\varepsilon3$.

To conclude, we wish to show something about $\left|f(x)-g(x)\right|$. Let us consider the case $f(x)\geq g(x)$ first:

$\displaystyle\left|f(x)-g(x)\right| = f(x)-g(x) \leq \left(j-\frac13\right)\varepsilon - \left(j-\frac43\right)\varepsilon = \varepsilon.$

The other case is $f(x) < g(x)$:

$\displaystyle\left|f(x)-g(x)\right| = g(x) - f(x) \leq \left(j+\frac13\right)\varepsilon - \left(j-\frac43\right)\varepsilon = \frac53\varepsilon \leq 2\varepsilon.$

This shows that in every case, $f$ and $g$ differ by at most $2\varepsilon$, uniformly on $K$. q.e.d.
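The onion decomposition can be simulated. In the sketch below (my own illustration), the $g_j$ are replaced by idealized nearly-indicators – $0$ on $A_j$, $1$ on $B_j$, a linear ramp in between, which is even better than what the lemma guarantees – and the resulting $g=\varepsilon\sum_j g_j$ stays within $2\varepsilon$ of $f$, just as in the proof:

```python
import numpy as np

eps = 0.1
x = np.linspace(0.0, 1.0, 1001)
f = np.abs(np.sin(7 * x)) + 0.3 * x      # some nonnegative continuous f
n = int(np.ceil(np.max(f) / eps)) + 2    # ensures (n-1)*eps > ||f||

def g_j(j):
    # idealized nearly-indicator: 0 on A_j = {f <= (j-1/3)eps},
    # 1 on B_j = {f >= (j+1/3)eps}, linear ramp in between
    lo, hi = (j - 1 / 3) * eps, (j + 1 / 3) * eps
    return np.clip((f - lo) / (hi - lo), 0.0, 1.0)

g = eps * sum(g_j(j) for j in range(n + 1))
err = float(np.max(np.abs(f - g)))       # stays below 2*eps
```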

To conclude, let us mention some little question that is related to the classical Weierstrass Approximation Theorem. If we have all monomials $x^n$ at our disposal, we can uniformly approximate any continuous function; but what if some monomials are missing? With how few of them can the Approximation Theorem still hold? The surprisingly neat answer is given in Müntz’ Theorem (sometimes called the Müntz-Szász-Theorem for the sake of more signs on top of the letters):

Theorem (Müntz): Let $\lambda_k$ be a strictly increasing sequence of natural numbers with $\lambda_1>0$. Then, we have

$\displaystyle \overline{\mathrm{span}\{1, x^{\lambda_1},x^{\lambda_2},\ldots\}} = \mathcal{C}_{\mathbb{R}}([0,1]) \qquad\iff\qquad \sum_{k=1}^\infty\lambda_k^{-1}=\infty.$

The span is meant as the linear space generated by the given monomials, i.e. the polynomials that can be constructed using only those monomials. Hence, the Weierstrass Theorem can still hold even if you disallow infinitely many monomials. However, you will always need the monomial $x^0=1$, since otherwise your approximating functions could only satisfy $f(0)=0$.

The proof of that theorem is not particularly easy, especially the part where you show that the series must diverge if you can approximate any function. For the other direction there is a very elementary proof that we shall give here because of its inherent beauty:

Proof of Müntz’ Theorem, $\Leftarrow$-part:

Let $\sum_{k=1}^\infty\lambda_k^{-1}=\infty$.

Let $q\in\mathbb{N}$ be such that $q\notin\{\lambda_1,\lambda_2,\ldots\}$. Without loss of generality, we assume that $q$ is smaller than any of the $\lambda_k$: just take away those $\lambda_k$ which are smaller than $q$ and re-enumerate the $\lambda_k$ starting from $k=1$; the condition $\sum_{k=1}^\infty\lambda_k^{-1}=\infty$ will then still hold.

We will show that $x^q\in\overline{\mathrm{span}\{x^{\lambda_1},x^{\lambda_2},\ldots\}}$. This will imply that all monomials $x^n$ are in the closure of that span, and if we have that, Weierstrass tells us that we can approximate any continuous function.

Note that we don’t need the constant function $1$ to approximate $x^q$, since for our purposes the relevant functions all have $f(0)=0$.

Now consider the function

$\displaystyle Q_n(x) := x^q-\sum_{k=1}^na_{k,n}x^{\lambda_k}$

with certain $a_{k,n}\in\mathbb{R}$. Then, first of all, $Q_0(x) = x^q$, and for a suitable choice of the $a_{k,n}$ we find the recursion

$\displaystyle Q_n(x) = (\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt.$

Let’s prove this by induction:

\begin{aligned} (\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt &= (\lambda_n-q)x^{\lambda_n}\int_x^1\left(t^q-\sum_{k=1}^{n-1}a_{k,n-1}t^{\lambda_k}\right)t^{-1-\lambda_n}dt\\ &=(\lambda_n-q)x^{\lambda_n}\int_x^1 t^{q-1-\lambda_n}-\sum_{k=1}^{n-1}a_{k,n-1}t^{\lambda_k-1-\lambda_n}dt\\ &=(\lambda_n-q)x^{\lambda_n}\left(\left[\frac1{q-\lambda_n}t^{q-\lambda_n}\right]_x^1 - \sum_{k=1}^{n-1}a_{k,n-1}\left[\frac1{\lambda_k-\lambda_n}t^{\lambda_k-\lambda_n}\right]_x^1\right)\\ &=(\lambda_n-q)x^{\lambda_n}\left(\frac1{q-\lambda_n}(1-x^{q-\lambda_n}) - \sum_{k=1}^{n-1}a_{k,n-1}\frac1{\lambda_k-\lambda_n}(1-x^{\lambda_k-\lambda_n})\right)\\ &=x^{\lambda_n}(x^{q-\lambda_n}-1)-(\lambda_n-q)\sum_{k=1}^{n-1}\frac{a_{k,n-1}}{\lambda_k-\lambda_n}(x^{\lambda_n}-x^{\lambda_k})\\ &=x^q-x^{\lambda_n}-(\lambda_n-q)\sum_{k=1}^{n-1}\frac{a_{k,n-1}}{\lambda_k-\lambda_n}(x^{\lambda_n}-x^{\lambda_k})\\ &=: x^q - \sum_{k=1}^na_{k,n}x^{\lambda_k}. \end{aligned}

Obviously, one could try to define the $a_{k,n}$ appropriately or to give some closed formula expression; but it doesn’t seem to be useful for the problem at hand.
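One can, however, let a computer do that bookkeeping. Matching coefficients in the recursion gives $a_{k,n} = \frac{\lambda_n-q}{\lambda_n-\lambda_k}\,a_{k,n-1}$ for $k<n$ and $a_{n,n} = 1-\sum_{k<n}a_{k,n}$ (my own reading of the computation; treat the sketch below accordingly), and with that the shrinking of $\left\|Q_n\right\|_\infty$ can be watched numerically:

```python
import numpy as np

q = 1
lam = [2 * k for k in range(1, 9)]   # lambda_k = 2,4,...,16; sum of 1/lambda_k diverges

a = []                               # coefficients a_{k,n} of the current Q_n
for n, ln in enumerate(lam, start=1):
    # a_{k,n} = (lam_n - q)/(lam_n - lam_k) * a_{k,n-1} for k < n
    a = [(ln - q) / (ln - lam[k]) * a[k] for k in range(n - 1)]
    a.append(1.0 - sum(a))           # coefficient of x^{lam_n}

x = np.linspace(0.0, 1.0, 2001)
Qn = x**q - sum(a[k] * x**lam[k] for k in range(len(lam)))
sup_norm = float(np.max(np.abs(Qn)))
bound = float(np.prod([1 - q / l for l in lam]))   # prod |1 - q/lam_k|
# sup_norm stays below the bound, and the bound tends to 0 with more exponents
```

For these eight exponents the bound is already below $0.2$; adding further $\lambda_k$ pushes it towards $0$, exactly as the divergence of $\sum\lambda_k^{-1}$ predicts.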

Now, we have $\left\|Q_0\right\|_\infty = 1$, and

\begin{aligned} \left\|Q_n\right\|_\infty &= \left\|(\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt\right\|_\infty \\ &\leq \left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\left\|x^{\lambda_n}\int_x^1t^{-1-\lambda_n}dt\right\|_\infty\\ &=\left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\left\|x^{\lambda_n}\frac1{\lambda_n}\bigl(x^{-\lambda_n}-1\bigr)\right\|_\infty\\ &=\left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\frac1{\lambda_n}\left\|1-x^{\lambda_n}\right\|_\infty \\ &=\left|1-\frac q{\lambda_n}\right|\left\|Q_{n-1}\right\|_\infty. \end{aligned}

Note that the factor $x^{\lambda_n}$ must stay inside the norm: the integral $\int_x^1t^{-1-\lambda_n}dt$ alone is unbounded near $x=0$, but multiplied by $x^{\lambda_n}$ it stays in $[0,\frac1{\lambda_n}]$.

This shows

$\displaystyle \left\|Q_n\right\|_\infty\leq\prod_{k=1}^n\left|1-\frac q{\lambda_k}\right|,$

and hence

$\displaystyle \lim_{n\to\infty}\left\|Q_n\right\|_\infty \leq \prod_{k=1}^\infty\left|1-\frac q{\lambda_k}\right|$.

But the infinite product will be $0$: we have $0<\frac q{\lambda_k}<1$ by assumption, and because of the basic inequality $1-x\leq e^{-x}$ for any $x>0$, we get

$\displaystyle \prod_{k=1}^\infty\left|1-\frac q{\lambda_k}\right| \leq \prod_{k=1}^\infty\exp\left(-\frac q{\lambda_k}\right) = \exp\left(-\sum_{k=1}^\infty \frac q{\lambda_k}\right) = 0.$

Hence, $Q_n$ vanishes uniformly as $n\to\infty$, and so $x^q$ is being approximated uniformly by some polynomials using only the allowed monomials $x^{\lambda_k}$. This concludes the proof. q.e.d.

At first, I was excited because I thought one could reverse this idea to show the other direction of Müntz’ Theorem as well. However, it doesn’t work because of the second-to-last inequality given above: If the product is finite, that doesn’t say anything about how well (or how badly) $x^q$ is approximated. The proof of the reverse direction is harder and I am only aware of deeper ideas from Fourier Analysis (that’s historically the first proof) or using the Hahn-Banach Theorem (as in Rudin’s book).

But the elementary idea given here is actually very neat, even though it doesn’t, to the best of my knowledge, solve any real-world problems. A very pretty academic finger-exercise. And this should be a good place to stop.

On approximating functions – Part One

Let us have a closer look at the Stone-Weierstrass-Theorem. It is a generalization of Weierstrass’ classical approximation theorem on polynomials which stands at the base of approximation theory. Stone took Weierstrass’ classical statement and rendered it more abstractly, so it can be used for a number of things in proofs about continuous functions. In practice, I have rarely seen it used in its pure form – maybe because it lacks a direct statement about the approximation error, and because you can base more powerful statements upon it.

Let us state the Stone-Weierstrass-Theorem in its pure form first, and its two most useful corollaries next, before we turn to different methods to prove them and to a deeper rendition of their significance.

We shall denote by $\mathcal{C}(K)$ the set of continuous complex-valued functions on the set $K$, and by $\mathcal{C}_{\mathbb{R}}(K)$ the set of real-valued continuous functions on the set $K$.

Remember that an algebra $\mathcal{A}$ over a field $F$ is a set which satisfies the following conditions:

• for any $f\in \mathcal{A}$ and any $\alpha\in F$, $\alpha f\in \mathcal{A}$, and
• for any $f, g\in \mathcal{A}$, $f+g\in \mathcal{A}$ and $f\cdot g\in \mathcal{A}$.

Also, we shall call a family of functions on the set $X$ separating, if for any $x, y\in X$ with $x\neq y$, there is some $f$ in the family such that $f(x)\neq f(y)$.

Stone-Weierstrass-Theorem (real version): Let $K\subset\mathbb{R}$ be a compact set and let $\mathcal{A}\subset\mathcal{C}_{\mathbb{R}}(K)$ be a separating algebra with the constant function $1\in\mathcal{A}$.

Then, $\mathcal{A}$ is dense in $\mathcal{C}_{\mathbb{R}}(K)$ in the topology of uniform convergence.

Equivalently, $\overline{\mathcal{A}} = \mathcal{C}_{\mathbb{R}}(K)$. Equivalently, for every $\varepsilon>0$ and every $f\in\mathcal{C}_{\mathbb{R}}(K)$, there is some $g\in\mathcal{A}$ with $\left\|f-g\right\|_\infty<\varepsilon$.

Stone-Weierstrass-Theorem (complex version): Let $K\subset\mathbb{C}$ be a compact set and let $\mathcal{A}\subset\mathcal{C}(K)$ be a separating algebra with the constant function $1\in\mathcal{A}$, which is closed under conjugation: for any $g\in\mathcal{A}$, also $\overline g\in\mathcal{A}$.

Then, $\mathcal{A}$ is dense in $\mathcal{C}(K)$ in the topology of uniform convergence.

Corollary: Weierstrass’ classical Theorem: For every continuous real-valued function $f$ on an interval $[a,b]$ there is a sequence of polynomials $p_n$ which converges uniformly to $f$ on $[a,b]$.

Corollary: On trigonometric polynomials: For every continuous $2\pi$-periodic real-valued function $f$ there is a sequence of trigonometric polynomials $\sum_{j=0}^n \alpha_j\cos(jx)+\beta_j\sin(jx)$ which converges uniformly to $f$.

In this second corollary, the objects don’t look like polynomials at all, but in the proof we will give a hint on why the series can be called that. It has to do with the representation $(e^{iz})^j$.

The conditions in the theorem are quite natural. You need some sort of richness in your set of approximating functions, this is achieved by the separation: if there were two points $x\neq y$ and you couldn’t find a separating $f\in\mathcal A$, then how could you properly approximate any given continuous function which takes different values at $x$ and $y$? You could never, well, separate the values with what the algebra provides you. Hence, this is obviously a necessary condition. As Hewitt/Stromberg put it nicely, this obviously necessary condition is also sufficient in the real case – that makes some of the beauty and elegance of the theorem.

In the complex version, there is some sort of extra-richness that we must provide for the theorem to hold. We obviously can’t do without that: Let us think of the family of polynomials on the unit disk in $\mathbb{C}$. They are an algebra, alright, they contain the constant functions, and they are separating (even the identity polynomial $f(x)=x$ can do that). But they are holomorphic as well, and the uniform limit on a compact set of holomorphic functions will be holomorphic again (this follows very naturally from Cauchy’s Integral Formula) – but certainly, there are continuous functions on the unit disk which are not holomorphic, so there needs to be an extra assumption in the Stone-Weierstrass-Theorem. And it works out very smoothly that the missing assumption can already be met by demanding that complex conjugates be contained in $\mathcal{A}$ as well. Let us show this, why not:

Proof of the complex version based on the real version

The real and imaginary part of any $g\in\mathcal{A}$ can be represented as

$\displaystyle\mathrm{Re}g=\frac{g+\overline g}{2}$

and

$\displaystyle\mathrm{Im}g=\frac{g-\overline g}{2i}$,

which means that $\mathrm{Re}g, \mathrm{Im}g \in \mathcal{A}$, by assumption. Moreover, if $x, y\in K$ and $x\neq y$, then by separation there is some $g\in\mathcal{A}$ which gives us $g(x)\neq g(y)$, and that means at least $\mathrm{Re}g(x)\neq\mathrm{Re}g(y)$ or $\mathrm{Im}g(x)\neq\mathrm{Im}g(y)$. Since $\mathrm{Im}g = \mathrm{Re}(-ig)$ and $-ig\in\mathcal{A}$, the family of the real parts of the functions in $\mathcal{A}$ is separating and thus meets the conditions of the real version of the theorem. The theorem yields that this family is dense in $\mathcal{C}_{\mathbb{R}}(K)$.

Hence, for any $f\in\mathcal{C}(K)$ and for any $\varepsilon>0$, there are some $g_0, g_1\in\mathcal{A}$ such that

$\displaystyle\left\|\mathrm{Re}f - g_0\right\|_\infty < \frac\varepsilon2$

and

$\displaystyle\left\|\mathrm{Im}f - g_1\right\|_\infty < \frac\varepsilon2$.

Thus,

\begin{aligned}\displaystyle\left\|f - (g_0+ig_1)\right\|_\infty &= \left\|\mathrm{Re}f - g_0 + i(\mathrm{Im}f - g_1)\right\|_\infty \\ &\leq \left\|\mathrm{Re}f - g_0\right\|_\infty + \left\|\mathrm{Im}f - g_1\right\|_\infty \\ &< \varepsilon. \end{aligned}

Since $g_0+ig_1\in\mathcal{A}$, the complex version is proved. q.e.d.

While we’re at it, let’s give the proofs for the corollaries now.

Proof of Weierstrass’ classical Theorem

That’s rather simple: the set of polynomials is obviously an algebra containing the constant function $1$, and it’s separating because you can find the identity polynomial $g(x)=x$ in it. The conditions of the theorem are met – the corollary follows. q.e.d.

Proof of the statement on trigonometric polynomials

This follows from the complex version. Let us look at some $2\pi$-periodic function $f:\mathbb{R}\to\mathbb{R}$. The domain of $f$ is not compact, but all that matters is the compact interval $[0,2\pi]$.

Now consider the ordinary polynomials $\sum_{j=0}^nc_jz^j$ on the compact interval $[0,2\pi]$; they obviously form a separating algebra containing the constants, so the conditions of the real version are met. Those functions can approximate the function $f$ on $[0,2\pi]$ uniformly. We are interested in trigonometric polynomials, however, and so far, for any $\varepsilon>0$, we’ve only found some appropriate $n$ and $c_j$ with

$\displaystyle\left\|f(z) - \sum_{j=0}^nc_jz^j\right\|_\infty < \varepsilon.$

Of course, that’s not news – it is just the classical Weierstrass Theorem again, and it ignores the periodicity of $f$. But we can now consider the mapping $z\mapsto e^{iz}$, which transforms our setting to the compact complex set $S^1$. Let us take $g(e^{iz}):=f(z)$, which is well-defined by periodicity and maps $g:S^1\to\mathbb{R}$. We try to apply the complex version to this function $g$ and this time we use the approximating functions $\sum_{j=-n}^nc_je^{ijz}$ (in order to show the resemblance with polynomials, they can also be written like $\sum_{j=-n}^nc_j(e^{iz})^j$). They are still an algebra containing the constant functions, they are separating because the identity $w\mapsto w$ on $S^1$ is among them, and now we need to ensure the extra assumption on complex conjugation – but that’s not a problem because of $\overline{e^{ijz}} = e^{-ijz}$. The complex version thus holds and we can find some appropriate $n$ and $c_j$ with

$\displaystyle\left\|g(e^{iz}) - \sum_{j=-n}^nc_je^{ijz}\right\|_\infty < \varepsilon.$

Now, we can represent $c_j = a_j+ib_j$ with $a_j,b_j\in\mathbb{R}$, and $e^{ijz} = \cos(jz)+i\sin(jz)$, which yields

$\displaystyle\left\|g(e^{iz}) - \sum_{j=-n}^n\bigl(a_j\cos(jz)-b_j\sin(jz)\bigr) - i\sum_{j=-n}^n\bigl(a_j\sin(jz)+b_j\cos(jz)\bigr)\right\|_\infty < \varepsilon,$

but we had chosen $g(e^{iz})=f(z)$ and this is real by assumption (no imaginary part). So:

$\displaystyle\left\|f(z) - \sum_{j=-n}^n\bigl(a_j\cos(jz)-b_j\sin(jz)\bigr)\right\|_\infty < \varepsilon.$

Finally, we employ the relations $\cos(-x) = \cos(x)$ and $\sin(-x) = -\sin(x)$ to find

$\displaystyle\left\|f(z) - a_0 - \sum_{j=1}^n \bigl((a_j+a_{-j}) \cos(jz) - (b_j-b_{-j})\sin(jz)\bigr)\right\|_\infty < \varepsilon.$

By appropriate definitions of the coefficients $\alpha_j$ and $\beta_j$, we have proved the corollary:

$\displaystyle\left\|f(z) - \sum_{j=0}^n \bigl(\alpha_j\cos(jz)+\beta_j\sin(jz)\bigr)\right\|_\infty < \varepsilon.$ q.e.d.

In a way, that was surprisingly non-smooth for the proof of a corollary. But the Fourier series people are thankful for a proof like that. We shall see a sketch of another proof of it later, but this one here does without explicit computations of integrals and fussing around with sine-cosine-identities.
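As a numerical aside, the trigonometric approximation can be made concrete with Fejér means (the operator already mentioned in the Korovkin section). The sketch below (my own illustration, with the sample function $\left|\sin x\right|$ as an assumption) builds the Cesàro-weighted trigonometric polynomial from FFT coefficients and watches the uniform error shrink:

```python
import numpy as np

M = 4096
x = 2 * np.pi * np.arange(M) / M
f = np.abs(np.sin(x))                 # continuous and 2*pi-periodic
c = np.fft.fft(f) / M                 # c[j] approximates the j-th Fourier coefficient

def fejer_mean(N):
    # sigma_N(x) = sum_{|j|<=N} (1 - |j|/(N+1)) c_j e^{ijx}, a trig. polynomial
    s = np.full(M, c[0])
    for j in range(1, N + 1):
        w = 1 - j / (N + 1)
        s = s + w * (c[j] * np.exp(1j * j * x) + c[-j] * np.exp(-1j * j * x))
    return s.real

err10 = float(np.max(np.abs(fejer_mean(10) - f)))
err200 = float(np.max(np.abs(fejer_mean(200) - f)))
# the uniform error decreases as the degree grows
```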

Up until now, everything has hinged on the real version of the Stone-Weierstrass-Theorem. In order to give a self-contained proof of this, we are going to need several lemmas and a little work. We take the proof from Königsberger’s book on Calculus (it’s very similar to the proof given in Heuser’s book). Let us start with the

The square root Lemma: Define the generalized binomial coefficient via $\binom{\alpha}{n}:=\frac{\alpha(\alpha-1)\cdots(\alpha-n+1)}{n!}$. Then we have the generalized binomial formula $(1+x)^\alpha=\sum_{k=0}^\infty\binom{\alpha}{k}x^k$ for any $\left|x\right|<1$.

For $\alpha>0$, uniform and absolute convergence even holds for any $\left|x\right|\leq1$.

In particular, for $\alpha=\frac12$ and any $x\in[-1,1]$,

$\displaystyle\sqrt{1+x} = \sum_{k=0}^\infty\binom{\frac12}{k}x^k$.

Proof: One part is easy, if we apply the appropriate machinery. Since $(1+x)^\alpha$ is holomorphic with a possible singularity at $x=-1$, we can find its absolutely convergent power series in the point $x=0$ with a radius of convergence at least $1$:

$\displaystyle (1+x)^{\alpha} = \sum_{k=0}^\infty \left.\frac{d^k}{dx^k}(1+x)^{\alpha}\right|_{x=0} \frac1{k!}x^k,\qquad\left|x\right|<1.$

But the derivatives at $x=0$ are easily found to be $\left.\frac{d^k}{dx^k}(1+x)^{\alpha}\right|_{x=0} = \prod_{j=0}^{k-1}(\alpha-j)$. This proves the series representation.

This can also be shown in an elementary way, as seen in Hewitt/Stromberg.

We still need to prove something about convergence on the boundary of the circle for $\alpha>0$. Let us consider the series $\sum_{k=0}^\infty\left|\binom{\alpha}k\right|$. We have, for $k\geq\alpha$:

$\displaystyle\frac{\left|\binom{\alpha}{k+1}\right|}{\left|\binom{\alpha}k\right|} = \frac{\left|\alpha-k\right|}{k+1} = \frac{k-\alpha}{k+1}.$

In particular, for those large $k$:

$\displaystyle k\left|\binom{\alpha}k\right|-(k+1)\left|\binom{\alpha}{k+1}\right| = \alpha \left|\binom{\alpha}k\right| > 0$

The sequence $k\left|\binom{\alpha}k\right|$ is thus eventually decreasing in $k$; it’s bounded and therefore has a limit $\gamma\geq0$. Now look at the telescoping series

$\displaystyle \sum_{k=0}^\infty \left(k\left|\binom{\alpha}k\right| - (k+1)\left|\binom{\alpha}{k+1}\right|\right),$

whose $p$-th partial sum telescopes to $-(p+1)\left|\binom{\alpha}{p+1}\right|$, which converges to $-\gamma$ (only the first few terms of the series can be negative, which is how the partial sums stay non-positive even though the terms are eventually positive). So, the telescoping series converges, and we find

\begin{aligned}\displaystyle \sum_{k=0}^\infty\left|\binom{\alpha}k\right| &= \sum_{k=0}^{\left\lfloor\alpha\right\rfloor}\left|\binom{\alpha}k\right| + \sum_{k=\left\lfloor\alpha\right\rfloor+1}^\infty\left|\binom{\alpha}k\right| \\ &= C_1 + \frac1\alpha\sum_{k=\left\lfloor\alpha\right\rfloor+1}^\infty\left(k\left|\binom{\alpha}k\right| - (k+1)\left|\binom{\alpha}{k+1}\right|\right) \\ &= C_1 + C_2. \end{aligned}

This shows that the series $\sum_{k=0}^\infty\left|\binom{\alpha}k\right|$ is convergent. For $x\in[-1,1]$, we are actually interested in

$\displaystyle \left|\sum_{k=0}^\infty\binom{\alpha}{k}x^k\right|\leq \sum_{k=0}^\infty\left|\binom{\alpha}{k}\right|,$

which is convergent. The series $\sum_{k=0}^\infty\binom{\alpha}{k}x^k$ thus converges absolutely and uniformly on the compact interval $[-1,1]$ and defines a continuous function there.

The fact that $(1+x)^\alpha = \sum_{k=0}^\infty\binom{\alpha}{k}x^k$ even for $x=-1$ and $x=1$ follows by continuity on $[-1,1]$ and by equality of both sides on $(-1,1)$. q.e.d.
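The boundary convergence can be checked numerically; here is a small sketch (my own) summing the series for $\alpha=\frac12$ at the endpoints $x=\pm1$, where it should produce $\sqrt2$ and $\sqrt0=0$:

```python
import math

# sum the binomial series for alpha = 1/2 at the endpoints x = 1 and x = -1
alpha, N = 0.5, 20000
coef = 1.0               # generalized binomial coefficient C(alpha, k), k = 0
at_plus1 = 0.0
at_minus1 = 0.0
for k in range(N):
    at_plus1 += coef
    at_minus1 += coef * (-1.0) ** k
    coef *= (alpha - k) / (k + 1)   # C(alpha, k+1) from C(alpha, k)
```

At $x=1$ the series is eventually alternating and converges quickly; at $x=-1$ the terms go like $k^{-3/2}$, so the convergence to $0$ is visibly slower.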

The Closure Lemma: Let $\mathcal A$ be an algebra of continuous functions on a compact set and $\overline{\mathcal A}$ its closure in the topology of uniform convergence. Then, for any $f,g\in\overline{\mathcal A}$:

$f+g$, $fg$, $\left|f\right|$, $\max(f,g)$ and $\min(f,g)$ are in $\overline{\mathcal A}$.

Proof: Let $\varepsilon>0$. There are $p,q\in\mathcal A$ with $\left\|f-p\right\|_\infty<\varepsilon$ and $\left\|g-q\right\|_\infty<\varepsilon$. Therefore,

$\displaystyle\left\|f+g-(p+q)\right\|_\infty \leq \left\|f-p\right\|_\infty + \left\|g-q\right\|_\infty < 2\varepsilon,$

and

\begin{aligned} \displaystyle\left\|fg-pq\right\|_\infty &= \left\|fg-fq+fq-pq\right\|_\infty \\ &\leq \left\|f-p\right\|_\infty\left\|q\right\|_\infty + \left\|f\right\|_\infty\left\|g-q\right\|_\infty \\ &\leq \left\|f-p\right\|_\infty\bigl(\left\|g\right\|_\infty+\left\|g-q\right\|_\infty\bigr) + \left\|f\right\|_\infty\left\|g-q\right\|_\infty\\ &\leq \varepsilon\bigl(\left\|g\right\|_\infty + \varepsilon + \left\|f\right\|_\infty\bigr) < C\varepsilon. \end{aligned}

The right-hand side can be made arbitrarily small, which proves the first two statements.

To deal with $\left|f\right|$, we employ the compactness of the set: $f$ will be bounded (if $f=0$, there is nothing to show). We can therefore consider the function $\phi = \frac{f}{\left\|f\right\|_\infty}$ taking values in $[-1,1]$. Since algebras are closed under multiplication by scalars and $f\in\overline{\mathcal A}$, we get $\phi\in\overline{\mathcal A}$. By the Square root Lemma,

$\displaystyle\left|\phi(x)\right| = \sqrt{\left|\phi(x)\right|^2} = \sqrt{1+\bigl(\phi^2(x)-1\bigr)} = \sum_{k=0}^\infty\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k,$

with an absolutely and uniformly convergent series representation. There is some $N_0$ with

$\displaystyle \left\|\left|\phi(x)\right| - \sum_{k=0}^{N_0}\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k\right\|_\infty < \frac\varepsilon2.$

By the arguments given above, together with induction, one sees that the partial sum up to $N_0$ belongs to $\overline{\mathcal A}$, if $\phi$ does. Hence, there is some $r\in\mathcal A$ with

$\displaystyle\left\|r - \sum_{k=0}^{N_0}\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k\right\|_\infty < \frac\varepsilon2.$

This yields

\begin{aligned} \displaystyle \bigl\|\left|\phi(x)\right| - r\bigr\|_\infty &\leq \left\|\left|\phi(x)\right| - \sum_{k=0}^{N_0}\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k\right\|_\infty + \left\|\sum_{k=0}^{N_0}\binom{\frac12}{k}\bigl(\phi^2(x)-1\bigr)^k - r\right\|_\infty \\ &< \frac\varepsilon2+\frac\varepsilon2 = \varepsilon. \end{aligned}

Thus, $\left|\phi\right|\in\overline{\mathcal A}$, and therefore $\left|f\right|\in\overline{\mathcal A}$.
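The quality of these truncated binomial series can be observed numerically. A small sketch that approximates $|t|$ on $[-1,1]$ by the partial sums above (the grid and the truncation orders are arbitrary choices; the error is worst near $t=0$ and decays slowly, roughly like $n^{-1/2}$):

```python
def binom_half_coeffs(n):
    """Generalized binomial coefficients binom(1/2, k), k = 0..n, computed iteratively."""
    coeffs = [1.0]
    for k in range(n):
        coeffs.append(coeffs[-1] * (0.5 - k) / (k + 1))
    return coeffs

def abs_approx(t, coeffs):
    # partial sum of  sqrt(1 + (t^2 - 1)) = |t|
    u = t * t - 1.0
    s, power = 0.0, 1.0
    for c in coeffs:
        s += c * power
        power *= u
    return s

ts = [i / 500 - 1 for i in range(1001)]  # grid on [-1, 1]

def sup_error(n):
    coeffs = binom_half_coeffs(n)
    return max(abs(abs_approx(t, coeffs) - abs(t)) for t in ts)

print(sup_error(20), sup_error(200))  # the sup-error shrinks as n grows
```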

Finally, we apply the equalities

$\displaystyle \max(f,g) = \frac12\bigl(f+g+\left|f-g\right|\bigr),\qquad \min(f,g) = \frac12\bigl(f+g-\left|f-g\right|\bigr),$

to show the final statements. q.e.d.
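These two identities are exact pointwise equalities; a quick numerical sanity check (the sample values are arbitrary):

```python
import random

random.seed(0)
pairs = [(random.uniform(-5, 5), random.uniform(-5, 5)) for _ in range(1000)]

# max(a,b) = (a + b + |a - b|)/2  and  min(a,b) = (a + b - |a - b|)/2
max_ok = all(abs(max(a, b) - (a + b + abs(a - b)) / 2) < 1e-12 for a, b in pairs)
min_ok = all(abs(min(a, b) - (a + b - abs(a - b)) / 2) < 1e-12 for a, b in pairs)
print(max_ok, min_ok)
```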

Note that we need $\overline{\mathcal{A}}$ to find these functions, even if $f,g\in\mathcal{A}$. The algebra $\mathcal{A}$ itself is not rich enough, in general, to contain the functions in the statement of the Closure Lemma. Besides, we have made frequent use of $\overline{\overline{\mathcal{A}}}=\overline{\mathcal{A}}$, which is at the heart of why the Closure Lemma works.

The Approximation Lemma: Let $\mathcal A\subset\mathcal{C}_{\mathbb{R}}(K)$ be a separating algebra and let $f\in\mathcal{C}_{\mathbb{R}}(K)$, $x\in K$, $\varepsilon>0$. Then, there is some $q_x\in\overline{\mathcal A}$ with the properties

$q_x(x) = f(x)$ and $q_x(y) \leq f(y)+\varepsilon$ for any $y\in K$.

Proof: First of all, for any $z\in K$ with $z\neq x$ and any prescribed values $a,b\in\mathbb{R}$, there is some function $h_z\in\mathcal A$ with $h_z(x)=a$ and $h_z(z)=b$. Take

$\displaystyle h_z(y) = (b-a)\frac{g(y)-g(x)}{g(z)-g(x)}+a$

with an appropriate $g\in\mathcal{A}$, for instance (separation makes this well-defined: pick $g$ such that $g(z)\neq g(x)$). In particular, one can choose $a=f(x)$ and $b=f(z)$, which makes $h_z$ coincide with $f$ at least in the points $x$ and $z$; for $z=x$ itself, the constant function $h_x:=f(x)$ does the job.

Since $h_z(z)=f(z)$ and both $h_z$ and $f$ are continuous, there is some interval $I_z$ around $z$ with $h_z(y)\leq f(y)+\varepsilon$ for all $y\in I_z$. In particular, any point $z\in K$ is contained in its own interval $I_z$, and by compactness, finitely many of these intervals suffice to cover $K$; say, $K\subset\bigcup_{i=1}^nI_{z_i}$. Then consider

$\displaystyle q_x:=\min(h_{z_1},\ldots,h_{z_n}),$

which is in $\overline{\mathcal A}$ by the Closure Lemma. Besides, $q_x(x) = \min\bigl(h_{z_1}(x),\ldots,h_{z_n}(x)\bigr) = \min\bigl(f(x),\ldots,f(x)\bigr) = f(x)$ by construction, and for every $y\in K$, there is some $j$, such that $y\in I_{z_j}$, and so $q_x(y) = \min\bigl(h_{z_1}(y),\ldots,h_{z_n}(y)\bigr) \leq h_{z_j}(y) \leq f(y)+\varepsilon$. q.e.d.
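The construction can be made concrete for $K=[0,1]$ and the algebra of polynomials: with $g(y)=y$, the functions $h_z$ are just the secant lines through $(x,f(x))$ and $(z,f(z))$. A small numerical sketch (the choice of $f$, the base point $x$, the tolerance $\varepsilon$, and the grids are arbitrary):

```python
import math

f = lambda y: math.sin(math.pi * y)  # an arbitrary continuous f on K = [0, 1]
x, eps = 0.3, 0.1                    # the distinguished point and the tolerance

def h(z, y):
    # secant line through (x, f(x)) and (z, f(z)); here g(y) = y separates points
    return (f(z) - f(x)) * (y - x) / (z - x) + f(x)

# a finite family of base points z_i (z = x is skipped, where h is not defined)
zs = [i / 200 for i in range(201) if abs(i / 200 - x) > 1e-9]

def q(y):
    # q_x := min of the h_z over the finite family
    return min(h(z, y) for z in zs)

ys = [j / 500 for j in range(501)]
print(abs(q(x) - f(x)))              # 0: every secant passes through (x, f(x))
print(max(q(y) - f(y) for y in ys))  # stays below eps on the grid
```

With a base-point grid this dense, the secant through the $z_i$ closest to $y$ already runs close to $f$ near $y$, so the minimum never exceeds $f+\varepsilon$, while all secants agree with $f$ at $x$.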

Proof of the Stone-Weierstrass-Theorem, real version: For any $x\in K$, choose some $q_x\in\overline{\mathcal A}$ as in the Approximation Lemma (in particular: $q_x(x)=f(x)$). Then, by continuity, pick an open interval $U_x$ around $x$ such that for all $y\in U_x\cap K$,

$\displaystyle q_x(y) \geq f(y)-\frac\varepsilon2.$

By compactness, finitely many of the $U_{x_i}$ suffice to cover $K$; say, $K\subset\bigcup_{i=1}^nU_{x_i}$. Consider

$\displaystyle g:=\max(q_{x_1},\ldots,q_{x_n}),$

which is in $\overline{\mathcal A}$ by the Closure Lemma. For any $z\in K$, we find some $i$ with $z\in U_{x_i}$, and thus

$\displaystyle g(z) = \max\bigl(q_{x_1}(z),\ldots,q_{x_n}(z)\bigr) \geq q_{x_i}(z) \geq f(z)-\frac\varepsilon2.$

By the Approximation Lemma, however,

$\displaystyle g(z) = \max\bigl(q_{x_1}(z),\ldots,q_{x_n}(z)\bigr) \leq \max\bigl(f(z)+\varepsilon,\ldots,f(z)+\varepsilon\bigr) = f(z)+\varepsilon.$

Now, since $g\in\overline{\mathcal A}$, we can pick some $p\in\mathcal A$ with $\left\|g-p\right\|_\infty<\varepsilon$. Then, we have found

$\displaystyle\left\|f-p\right\|_\infty\leq\left\|f-g\right\|_\infty+\left\|g-p\right\|_\infty < \varepsilon+\varepsilon = 2\varepsilon$.

Since $\varepsilon>0$ was arbitrary, the Stone-Weierstrass-Theorem is proved. q.e.d.
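The entire max-min construction can be imitated numerically for $K=[0,1]$ with the secant functions $h_z$ from the Approximation Lemma. A sketch (the function $f$ and the grid sizes are arbitrary choices; the same grid serves for both the $x_i$ and the $z_i$):

```python
import math

f = lambda t: math.sin(2 * t)  # an arbitrary continuous f on K = [0, 1]

def secant(x, z, y):
    # the function h_z from the Approximation Lemma, with g(y) = y
    return (f(z) - f(x)) * (y - x) / (z - x) + f(x)

grid = [i / 100 for i in range(101)]  # serves for both the x_i and the z_i

def q(x, y):
    # q_x = min over the z-family (z = x is skipped: h_z needs z != x)
    return min(secant(x, z, y) for z in grid if abs(z - x) > 1e-9)

def g_approx(y):
    # g = max over the x-family, as in the proof above
    return max(q(x, y) for x in grid)

ys = [j / 200 for j in range(201)]
sup_err = max(abs(g_approx(y) - f(y)) for y in ys)
print(sup_err)  # small sup-error on the evaluation grid
```

The minima press $g$ below $f+\varepsilon$ everywhere, and the maxima lift it above $f-\varepsilon$, exactly as in the two displayed estimates; with grids of this density the observed sup-error is well below $0.1$.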

A slightly different proof is given in Hewitt/Stromberg. The idea is about the same, but their compactness argument works on the image of $f$, w.l.o.g. the interval $[0,1]$, whereas we have used the compactness of the domain $K$. Of course, Hewitt/Stromberg can’t do without the compactness of $K$ either, since they need $f$ to attain its maximum and minimum values. The downside of their approach is that they need to iterate their argument, so their approximating function is actually a series of those max-min-functions that we have used. But there are not too many differences between the proofs, so let’s just stick to the one given above. I actually find the proof given above a little slicker; but I have come to admire the book by Hewitt/Stromberg for its clarity and style. There are several nice things hidden in there to come back to.

For now, we’ll stop. But there’s some more backstory about the Stone-Weierstrass-Theorem to come soon.