Brouwer’s Fixed Point Theorem

Recently, we have concluded the text on the Transformation formula, remarking that it was a tool in an elementary proof of Brouwer’s Fixed Point Theorem. Let’s have a closer look at that.

Brouwer’s Fixed Point Theorem is at the core of many insights in topology and functional analysis. As many other powerful theorems, it can be stated and understood very easily, however the proof is quite deep. In particular, the conclusions that are drawn from it, are considered even deeper. As we shall see, Brouwer’s theorem can be shown in an elementary fashion, where the Transformation Formula, the Inverse Function Theorem and Weierstrass’ Approximation Theorem are the toughest stepping stones; note that we have given a perfectly elementary proof of Weierstrass’ Theorem before. This makes Brouwer’s theorem accessible to undergraduate calculus students (even though, of course, these stepping stones already mean bringing the big guns to the fight). The downside is that the proof, even though elementary, is quite long-ish. The undergraduate student needs to exercise some patience.

Theorem (Brouwer, 1910): Let $K:=\{x\in\mathbb{R}^p\colon \left|x\right|\leq1\}$ be the compact unit ball, and let $f:K\to K$ be continuous. Then $f$ has a fixed point, i.e. there is some $x\in K$ such that $f(x)=x$.

There are many generalizations of the Theorem, considering more complex sets instead of $K$, and taking place in the infinite-dimensional space. We shall get back to that later. First, we shall look at a perfectly trivial and then a slightly less trivial special case.

For $p=1$, the statement asks to find a fixed point for the continuous mapping $f:[0,1]\to[0,1]$. W.l.o.g. we have shrunk the set $K$ to $[0,1]$ instead of $[-1,1]$ to avoid some useless notational difficulty. This is a standard exercise on the intermediate value theorem with the function $g(x):=f(x)-x$. Either, $f(1)=1$ is the fixed point, or else $f(1)<1$, meaning $g(1)<0$ and $g(0)=f(0)\geq0$. As $g$ is continuous, some point needs to be the zero of $g$, meaning $0=g(\xi) = f(\xi)-\xi$ and hence $f(\xi)=\xi$. q.e.d. ($p=1$)

For $p=2$, things are still easy to see, even though a little less trivial. This is an application of homotopy theory (even though one doesn’t need to know much about it). The proof is by contradiction however. We will show an auxiliary statement first: there is no continuous mapping $h:K\to\partial K$, which is the identity on $\partial K$, i.e. $h(x)=x$ for $x\in\partial K$. If there was, we would set

$\displaystyle H(t,s):=h(se^{it}), \qquad t\in[0,2\pi], s\in[0,1].$

$H$ is a homotopy of the constant curve $h(0) = H(t,0)$ to the circle $e^{it} = h(e^{it}) = H(t,1)$. This means, we can continuously transform the constant curve to the circle. This is a contradiction, as the winding number of the constant is $0$, but the winding number of the circle is $1$. There can be no such $h$.

Now, we turn to the proof of the actual statement of Brouwer’s Theorem: If $f$ had no fixed point, we could define a continuous mapping as follows: let $x\in K$, and consider the line through $x$ and $f(x)$ (which is well-defined by assumption). This line crosses $\partial K$ in the point $h(x)$; actually there are two such points, we shall use the one that is closer to $x$ itself. Apparently, $h(x)=x$ for $x\in\partial K$. By the auxiliary statement, there is no such $h$ and the assumption fails. $f$ must have a fixed point. q.e.d. ($p=2$)

For the general proof, we shall follow the lines of Heuser who has found this elementary fashion in the 1970’s and who made it accessible in his second volume of his book on calculus. It is interesting, that most of the standard literature for undergraduate students shies away from any proof of Brouwer’s theorem. Often, the theorem is stated without proof and then some conclusions and applications are drawn from it. Sometimes, a proof via differential forms is given (such as in Königsberger’s book, where it is somewhat downgraded to an exercise after the proof of Stoke’s Theorem) which I wouldn’t call elementary because of the theory which is needed to be developed first. The same holds for proofs using homology groups and the like (even though this is one of the simplest fashions to prove the auxiliary statement given above – it was done in my topology class, but this is by no means elementary).

A little downside is the non-constructiveness of the proof we are about to give. It is a proof by contradiction and it won’t give any indication on how to find the fixed point. For many applications, even the existence of a fixed point is already a gift (think of Peano’s theorem on the existence of solutions to a differential equation, for instance). On the other hand, there are constructive proofs as well, a fact that is quite in the spirit of Brouwer.

In some way, the basic structure of the following proof is similar to the proof that we gave for the case $p=2$. We will apply the same reasoning that concluded the proof for the special case (after the auxiliary statement), we will just add a little more formality to show that the mapping $g$ is actually continuous and well-defined. The trickier part in higher dimensions is to show the corresponding half from which the contradiction followed. Our auxiliary statement within this previous proof involved the non-existence of a certain continuous mapping, that is called a retraction: for a subset $A$ of a topological space $X$, $f:X\to A$ is called a retraction of $X$ to $A$, if $f(x)=x$ for all $x\in A$. We have found that there is no retraction from $K$ to $\partial K$. As a matter of fact, Brouwer’s Fixed Point Theorem and the non-existence of a retraction are equivalent (we’ll get back to that at the end).

The basic structure of the proof is like this:

• we reduce the problem to polynomials, so we only have to deal with those functions instead of a general continuous $f$;
• we formalize the geometric intuition that came across in the special case $p=2$ (this step is in essence identical to what we did above): basing on the assumption that Brouwer’s Theorem is wrong, we define a mapping quite similar to a retraction of $K$ to $\partial K$;
• we show that this almost-retraction is locally bijective;
• we find, via the Transformation Formula, a contradiction: there can be no retraction and there must be a fixed point.

Steps 3 and 4 are the tricky part. They may be replaced by some other argument that yields a contradiction (homology theory, for instance), but we’ll stick to the elementary parts. Let’s go.

Lemma (The polynomial simplification): It will suffice to show Brouwer’s Fixed Point Theorem for those functions $f:K\to K$, whose components are polynomials on $K$ and which have $f(K)\subset\mathring K$.

Proof: Let $f:K\to K$ continuous, it has the components $f = (f_1,\ldots,f_p)$, each of which has the arguments $x_1,\ldots,x_p$. By Weierstrass’ Approximation Theorem, for any $\varepsilon>0$ there are polynomials $p_k^\varepsilon$ such that $\left|f_k(x)-p_k^\varepsilon(x)\right| < \varepsilon$, $k=1,\ldots,p$, for any $x\in K$. In particular, there are polynomials $\varphi_{k,n}$ such that

$\displaystyle \left|f_k(x)-\varphi_{k,n}(x)\right| < \frac{1}{\sqrt{p}n}\qquad\text{for any }x\in K.$

If we define the function $\varphi_n:=(\varphi_{1,n},\ldots,\varphi_{p,n})$ which maps $K$ to $\mathbb{R}^p$, we get

\displaystyle \begin{aligned} \left|f(x)-\varphi_n(x)\right|^2 &= \sum_{k=1}^p\left|f_k(x)-\varphi_{k,n}(x)\right|^2 \\ &< \frac{p}{pn^2} \\ &= \frac1{n^2},\qquad\text{for any }x\in K \end{aligned}

and in particular $\varphi_n\to f$ uniformly in $K$.

Besides,

$\displaystyle \left|\varphi_n(x)\right|\leq\left|\varphi_n(x)-f(x)\right| + \left|f(x)\right| < \frac1n + \left|f(x)\right| \leq \frac1n + 1 =:\alpha_n.$

This allows us to set

$\displaystyle \psi_n(x) = \frac{\varphi_n(x)}{\alpha_n}$.

This function also converges uniformly to $f$, as for any $x\in K$,

\displaystyle \begin{aligned} \left|\psi_n(x)-f(x)\right| &= \left|\frac{\varphi_n(x)}{\alpha_n} - f(x)\right| \\ &= \frac1{\left|\alpha_n\right|}\left|\varphi_n(x)-\alpha_nf(x)\right|\\ &\leq \frac1{\left|\alpha_n\right|}\left|\varphi_n(x)-f(x)\right| + \frac1{\left|\alpha_n\right|}\left|f(x)-\alpha_nf(x)\right|\\ &< \frac1{\left|\alpha_n\right|}\frac1n + \frac1{\left|\alpha_n\right|}\left|f(x)\right|\left|1-\alpha_n\right|\\ &< (1+\delta)\frac1n + \frac{\delta}{1+\delta}\left|f(x)\right|\\ &< \varepsilon \qquad\text{for }n\gg0. \end{aligned}

Finally, for $x\in K$, by construction, $\left|\varphi_n(x)\right|\leq\alpha_n$, and so $\left|\psi_n(x)\right| = \frac{\left|\varphi_n(x)\right|}{\alpha_n} < 1$, which means that $\psi_n:K\to\mathring K$.

The point of this lemma is to state that if we had shown Brouwer’s Fixed Point Theorem for every such function $\psi_n:K\to\mathring K$, whose components are polynomials, we had proved it for the general continuous function $f$. This can be seen as follows:

As we suppose Brouwer’s Theorem was true for the $\psi_n$, there would be a sequence $(x_n)\subset K$ with $\psi_n(x_n) = x_n$. As $K$ is (sequentially) compact, there is a convergent subsequence $(x_{n_j})\subset(x_n)$, and $\lim_jx_{n_j} = x_0\in K$. For sufficiently large $j$, we see

$\displaystyle \left|\psi_{n_j}(x_{n_j})-f(x_0)\right| \leq\left|\psi_{n_j}(x_{n_j})-f(x_{n_j})\right| + \left|f(x_{n_j})-f(x_0)\right| < \frac\varepsilon2 + \frac\varepsilon2.$

The first bound follows from the fact that $\psi_{n_j}\to f$ uniformly, the second bound is the continuity of $f$ itself, with the fact that $x_{n_j}\to x_0$. In particular,

$\displaystyle x_0 = \lim_{j} x_{n_j} = \lim_{j} \psi_{n_j}(x_{n_j}) = f(x_0).$

So, $f$ has the fixed point $x_0$, which proves Brouwer’s Theorem.

In effect, it suffices to deal with functions like the $\psi_n$ for the rest of this text. q.e.d.

Slogan (The geometric intuition): If Brouwer’s Fixed Point Theorem is wrong, then there is “almost” a retraction of $K$ to $\partial K$.

Or, rephrased as a proper lemma:

Lemma: For $f$ being polynomially simplified as in the previous lemma, assuming $x\neq f(x)$ for any $x\in K$, we can construct a continuously differentiable function $g_t:K\to K$, $t\in[0,1]$, with $g_t(x)=x$ for $x\in\partial K$. This function is given via

$\displaystyle g_t(x) =x + t\lambda(x)\bigl(x-f(x)\bigr),$

$\displaystyle \lambda(x)=\frac{-x\cdot\bigl(x-f(x)\bigr)+\sqrt{\left(x\cdot\bigl(x-f(x)\bigr)\right)^2+\bigl(1-\left|x\right|^2\bigr)\left|x-f(x)\right|^2}}{\left|x-f(x)\right|^2}.$

The mapping $t\mapsto g_t$ is the direct line from $x$ to the boundary of $\partial K$, which also passes through $f(x)$. $\lambda(x)$ is the parameter in the straight line that defines the intersection with $\partial K$.

Proof: As we suppose, Brouwer’s Fixed Point Theorem is wrong the continuous function $\left|x-f(x)\right|$ is positive for any $x\in K$. Because of continuity, for every $y\in \partial K$, there is some $\varepsilon = \varepsilon(y) > 0$, such that still $\left|x-f(x)\right|>0$ in the neighborhood $U_{\varepsilon(y)}(y)$.

Here, we have been in technical need of a continuation of $f$ beyond $K$. As $f$ is only defined on $K$ itself, we might take $f(x):=f\bigl(\frac{x}{\left|x\right|}\bigr)$ for $\left|x\right|>1$. We still have $\left|f(x)\right| < 1$ and $f(x)\neq x$, which means that we don’t get contradictions to our assumptions on $f$. Let’s not dwell on this for longer than necessary.

On the compact set $\partial K$, finitely many of the neighborhoods $U_{\varepsilon(y)}(y)$ will suffice to cover $\partial K$. One of them will have a minimal radius. We shall set $\delta = \min_y\varepsilon(y) +1$, to find: there is an open set $U = U_\delta(0)\supset K$ with $\left|x-f(x)\right| >0$ for all $x\in U$.

Let us define for any $x\in U$

$\displaystyle d(x):=\frac{\left(x\cdot\bigl(x-f(x)\bigr)\right)^2+\bigl(1-\left|x\right|\bigr)^2\left|x-f(x)\right|^2}{\left|x-f(x)\right|^4}$.

It is well-defined by assumption. We distinguish three cases:

$a)$ $\left|x\right|<1$: Then $1-\left|x\right|^2>0$ and the numerator of $d(x)$ is positive.

$b)$ $\left|x\right|=1$: Then the numerator of $d(x)$ is

$\displaystyle \left(x\cdot\bigl(x-f(x)\bigr)\right)^2 = \bigl(x\cdot x - x\cdot f(x)\bigr)^2 = \bigl(\left|x\right|^2-x\cdot f(x)\bigr)^2 = \bigl(1-x\cdot f(x)\bigr)^2,$

where by Cauchy-Schwarz and by assumption on $f$, we get

$\displaystyle x\cdot f(x) \leq \left|x\right|\left|f(x)\right| = \left|f(x)\right| < 1\qquad (\spadesuit).$

In particular, the numerator of $d(x)$ is strictly positive.

$c)$ $\left|x\right|>1$: This case is not relevant for what’s to come.

We have seen that $d(x)>0$ for all $\left|x\right|\leq 1$. Since $d$ is continuous, a compactness argument similar to the one above shows that there is some $V = V_{\delta'}(0)\supset K$ with $d(x)>0$ for all $x\in V$. If we pick $\delta'=\delta$ if $\delta$ is smaller, we find: $d$ is positive and well-defined on $V$.

The reason why we have looked at $d$ is not clear yet. Let us grasp at some geometry first. Let $x\in V$ and $\Gamma_x = \left\{x+\lambda\bigl(x-f(x)\bigr)\colon\lambda\in\mathbb{R}\right\}$ the straight line through $x$ and $f(x)$. If we look for the intersection of $\Gamma_x$ with $\partial K$, we solve the equation

$\displaystyle\left|x+\lambda\bigl(x-f(x)\bigr)\right| = 1.$

The intersection “closer to” $x$ is denoted by some $\lambda>0$.

This equation comes down to

\displaystyle \begin{aligned} && \left(x+\lambda\bigl(x-f(x)\bigr)\right) \cdot \left(x+\lambda\bigl(x-f(x)\bigr)\right) &=1 \\ &\iff& \left|x\right|^2 + 2\lambda x\cdot\bigl(x-f(x)\bigr) + \lambda^2\left|x-f(x)\right|^2 &=1\\ &\iff& \lambda^2\left|x-f(x)\right|^2 + 2\lambda x\cdot\bigl(x-f(x)\bigr) &= 1-\left|x\right|^2\\ &\iff& \left(\lambda+\frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\right)^2 &= \frac{1-\left|x\right|^2}{\left|x-f(x)\right|^2} + \left(\frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\right)^2 \\ &\iff& \left(\lambda+\frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\right)^2 &= \frac{(1-\left|x\right|)^2\left|x-f(x)\right|^2+\left(x\cdot\bigl(x-f(x)\bigr)\right)^2}{\left|x-f(x)\right|^4} \\ &\iff& \left(\lambda+\frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\right)^2 &= d(x). \end{aligned}

As $x\in V$, $d(x)>0$, and hence there are two real solutions to the last displayed equation. Let $\lambda(x)$ be the larger one (to get the intersection with $\partial K$ closer to $x$), then we find

\displaystyle \begin{aligned} \lambda(x) &= \sqrt{d(x)} - \frac{x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^2}\\ &= \frac{-x\cdot\bigl(x-f(x)\bigr)+\sqrt{\left(x\cdot\bigl(x-f(x)\bigr)\right)^2+\bigl(1-\left|x\right|^2\bigr)\left|x-f(x)\right|^2}}{\left|x-f(x)\right|^2}. \end{aligned}

By construction,

$\displaystyle \left|x+\lambda(x)\bigl(x-f(x)\bigr)\right| = 1,\qquad\text{for all }x\in V.\qquad(\clubsuit)$

Let us define

$\displaystyle g_t(x) = x+t\lambda(x)\bigl(x-f(x)\bigr),\qquad t\in[0,1],~x\in V.$

This is (at least) a continuously differentiable function, as we simplified $f$ to be a polynomial and the denominator in $\lambda(x)$ is bounded away from $0$. Trivially and by construction, $g_0(x)=x$ and $\left|g_1(x)\right| = 1$ for all $x\in V$.

For $\left|x\right|<1$ and $t<1$, we have

\displaystyle \begin{aligned} \left|x+t\lambda(x)\bigl(x-f(x)\bigr)\right| &\stackrel{\hphantom{(\clubsuit)}}{=} \left|\bigl(t+(1-t)\bigr)x + t\lambda(x)\bigl(x-f(x)\bigr)\right|\\ &\stackrel{\hphantom{(\clubsuit)}}{=}\left|t\left(x+\lambda(x)\bigl(x-f(x)\bigr)\right)+(1-t)x\right|\\ &\stackrel{\hphantom{(\clubsuit)}}{\leq} t\left|x+\lambda(x)\bigl(x-f(x)\bigr)\right|+(1-t)\left|x\right|\\ &\stackrel{(\clubsuit)}{=} t+(1-t)\left|x\right|\\ &\stackrel{\hphantom{(\clubsuit)}}{<} t+(1-t) = 1\qquad (\heartsuit). \end{aligned}

Hence, $\left|g_t(x)\right|<1$ for $\left|x\right|<1$ and $t\in[0,1)$. This means $g_t(\mathring K)\subset\mathring K$ for $t<1$.

For $\left|x\right|=1$, we find (notice that $x\cdot\bigl(x-f(x)\bigr)>0$ for $\left|x\right|=1$, by $(\spadesuit)$).

\displaystyle \begin{aligned} \lambda(x) &= \frac{-x\cdot\bigl(x-f(x)\bigr)+\sqrt{\left(x\cdot\bigl(x-f(x)\bigr)\right)^2}}{\left|x-f(x)\right|^4} \\ &= \frac{-x\cdot\bigl(x-f(x)\bigr)+x\cdot\bigl(x-f(x)\bigr)}{\left|x-f(x)\right|^4} = 0. \end{aligned}

This is geometrically entirely obvious, since $\lambda(x)$ denotes the distance of $x$ to the intersection with $\partial K$; if $x\in\partial K$, this distance is apparently $0$.

We have seen that $g_t(x)=x$ for $\left|x\right|=1$ for any $t\in[0,1]$. Hence, $g_t(\partial K)=\partial K$ for all $t$. $g_t$ is almost a retraction, $g_1$ actually is a retraction. q.e.d.

Note how tricky the general formality gets, compared to the more compact and descriptive proof that we gave in the special case $p=2$. The arguments of the lemma and in the special case are identical.

Lemma (The bijection): Let $\hat K$ be a closed ball around $0$, $K\subset\hat K\subset V$. The function $g_t$ is a bijection on $\hat K$, for $t\geq0$ sufficiently small.

Proof: We first show that $g_t$ is injective. Let us define $h(x):=\lambda(x)\bigl(x-f(x)\bigr)$, for reasons of legibility. As we saw above, $h$ is (at least) continuously differentiable. We have

$\displaystyle g_t(x) = x+th(x),\qquad g_t'(x)=\mathrm{Id}+th'(x).$

As $\hat K$ is compact, $h'$ is bounded by $\left|h'(x)\right|\leq C$, say. By enlarging $C$ if necessary, we can take $C\geq1$. Now let $x,y\in\hat K$ with $g_t(x)=g_t(y)$. That means $x+th(x)=y+th(y)$ and so, by the mean value theorem,

$\displaystyle \left|x-y\right| = t\left|h(x)-h(y)\right|\leq tC\left|x-y\right|.$

By setting $\varepsilon:=\frac1C$ and taking $t\in[0,\varepsilon)$, we get $\left|x-y\right| = 0$. $g_t$ is injective for $t<\varepsilon$.

Our arguments also proved $\left|th'(x)\right| < 1$. Let us briefly look at the convergent Neumann series $\sum_{k=0}^\infty\bigl(th'(x)\bigr)^k$, having the limit $s$, say. We find

$\displaystyle sth'(x) = \sum_{k=0}^\infty\bigl(th'(x)\bigr)^{k+1} = s-\mathrm{Id},$

which tells us

$\displaystyle \mathrm{Id} = s-s\cdot th'(x) = s\bigl(\mathrm{Id}-th'(x)\bigr).$

In particular, $g_t'(x) = \mathrm{Id}-th'(x)$ is invertible, with the inverse $s$. Therefore, $\det g_t'(x)\neq0$. Since this determinant is a continuous function of $t$, and $\det g_0'(x) = \det\mathrm{Id} = 1$, we have found

$\displaystyle \det g_t'(x) > 0 \text{ for any }t\in[0,\varepsilon),~x\in\hat K$.

Now, let us show that $g_t$ is surjective. As $\det g_t'(x)$ never vanishes on $\hat K$, $g_t$ is an open mapping (by an argument involving the inverse function theorem; $g_t$ can be inverted locally in any point, hence no point can be a boundary point of the image). This means that $g_t(\mathring K)$ is open.

Let $z\in K$ with $z\notin g_t(\mathring K)$; this makes $z$ the test case for non-surjectivity. Let $y\in g_t(\mathring K)$; there is some such $y$ due to $(\heartsuit)$. The straight line between $y$ and $z$ is

$\displaystyle \overline{yz}:=\left\{(1-\lambda)y+\lambda z\colon \lambda\in[0,1]\right\}.$

As $g_t$ is continuous, there must be some point $v\in\partial g_t(\mathring K)\cap\overline{yz}$; we have to leave the set $g_t(\mathring K)$ somewhere. Let us walk the line until we do, and then set

$\displaystyle v=(1-\lambda_0)y+\lambda_0z,\qquad\text{with }\lambda_0=\sup\left\{\lambda\in[0,1]\colon\overline{y;(1-\lambda)y+\lambda z}\subset g_t(\mathring K)\right\}.$

Now, continuous images of compact sets remain compact: $g_t(K)$ is compact and hence closed. Therefore, we can conclude

$\displaystyle g_t(\mathring K)\subset g_t(K)\quad\implies\quad \overline{g_t(\mathring K)}\subset g_t(K)\quad\implies\quad v\in\overline{g_t(\mathring K)}\subset g_t(K).$

This means that there is some $u\in K$ such that $v=g_t(u)$. As $g_t(\mathring K)$ is open, $u\in\partial K$ (since otherwise, $v\notin\partial g_t(\mathring K)$ which contradicts the construction). Therefore, $\left|u\right|=1$, and since $g_t$ is almost a retraction, $g_t(u)=u$. Now,

$\displaystyle v=g_t(u) = u \quad\implies\quad v\in\partial K.$

But by construction, $v$ is a point between $z\in K$ and $y\in g_t(\mathring K)$; however, $y\notin\partial K$, since $g_t(\mathring K)$ is open. Due to the convexity of $K$, we have no choice but $z\in\partial K$, and by retraction again, $g_t(z)=z$. In particular, $z\in g_t(\partial K)$.

We have shown that if $z\notin g_t(\mathring K)$, then $z\in g_t(\partial K)$. In particular, $z\in g_t(K)$ for any $z\in K$. $g_t$ is surjective. q.e.d.

Lemma (The Integral Application): The real-valued function

$\displaystyle V(t)=\int_K\det g_t'(x)dx$

is a polynomial and satisfies $V(1)>0$.

Proof: We have already seen in the previous lemma that $\det g_t'(x)>0$ on $x\in\mathring{\hat K}$ for $t<\varepsilon$. This fact allows us to apply the transformation formula to the integral:

$\displaystyle V(t) = \int_{g_t(K)}1dx.$

As $g_t$ is surjective, provided $t$ is this small, $g_t(K) = K$, and therefore

$\displaystyle V(t) = \int_K1dx = \mu(K).$

In particular, this no longer depends on $t$, which implies $V(t)>0$ for any $t<\varepsilon$.

By the Leibniz representation of the determinant, $\det g_t'(x)$ is a polynomial in $t$, and therefore, so is $V(t)$. The identity theorem shows that $V$ is constant altogether: in particular $V(1)=V(0)>0$. q.e.d.

Now we can readily conclude the proof of Brouwer’s Fixed Point Theorem, and we do it in a rather unexpected way. After the construction of $g_t$, we had found $\left|g_1(x)\right|=1$ for all $x\in V$. Let us write this in its components and take a partial derivative ($j=1,\ldots,p$)

\displaystyle \begin{aligned} &&1 &= \sum_{k=1}^p\bigl(g_{1,k}(x)\bigr)^2\\ &\implies& 0 &= \frac\partial{\partial x_j}\sum_{k=1}^p\bigl(g_{1,k}(x)\bigr)^2 = \sum_{k=1}^p2\frac{\partial g_{1,k}(x)}{\partial x_j}g_{1,k}(x) \end{aligned}

This last line is a homogeneous system of linear equations, that we might also write like this

$\displaystyle \begin{pmatrix}\frac{\partial g_{1,1}(x)}{\partial x_1}&\cdots &\frac{\partial g_{1,p}(x)}{\partial x_1}\\ \ldots&&\ldots\\ \frac{\partial g_{1,1}(x)}{\partial x_p}&\cdots&\frac{\partial g_{1,p}(x)}{\partial x_p}\end{pmatrix} \begin{pmatrix}\xi_1\\\ldots\\\xi_p\end{pmatrix} = 0,$

and our computation has shown that the vector $\bigl(g_{1,1}(x),\ldots,g_{1,p}(x)\bigr)$ is a solution. But the vector $0$ is a solution as well. These solutions are different because of $\left|g_1(x)\right| = 1$. If a system of linear equations has two different solutions, it must be singular (it is not injective), and the determinant of the linear system vanishes:

$\displaystyle 0 = \det \begin{pmatrix}\frac{\partial g_{1,1}(x)}{\partial x_1}&\cdots &\frac{\partial g_{1,p}(x)}{\partial x_1}\\ \ldots&&\ldots\\ \frac{\partial g_{1,1}(x)}{\partial x_p}&\cdots&\frac{\partial g_{1,p}(x)}{\partial x_p}\end{pmatrix} = \det g_1'(x).$

This means

$\displaystyle 0 = \int_K\det g_1'(x)dx = V(1) > 0$.

A contradiction, which stems from the basic assumption that Brouwer’s Fixed Point Theorem were wrong. The Theorem is thus proved. q.e.d.

Let us make some concluding remarks. Our proof made vivid use of the fact that if there is a retraction, Brouwer’s Theorem must be wrong (this is where we got our contradiction in the end: the retraction cannot exist). The proof may also be started the other way round. If we had proved Brouwer’s Theorem without reference to retractions (this is how Elstrodt does it), you can conclude that there is no retraction from $K$ to $\partial K$ as follows: if there was a retraction $g:K\to\partial K$, we could consider the mapping $-g$. It is, in particular, a mapping of $K$ to itself, but it does not have any fixed point – a contradiction to Brouwer’s Theorem.

Brouwer’s Theorem, as we have stated it here, is not yet ready to drink. For many applications, the set $K$ is too much of a restriction. It turns out, however, that the hardest work has been done. Some little approximation argument (which in the end amounts to continuous projections) allows to formulate, for instance:

• Let $C\subset\mathbb{R}^p$ be convex, compact and $C\neq\emptyset$. Let $f:C\to C$ be continuous. Then $f$ has a fixed point.
• Let $E$ be a normed space, $K\subset E$ convex and $\emptyset\neq C\subset K$ compact. Let $f:K\to C$ be continuous. Then $f$ has a fixed point.
• Let $E$ be a normed space, $K\subset E$ convex and $K\neq\emptyset$. Let $f:K\to K$ be continuous. Let either $K$ be compact or $K$ bounded and $f(K)$ relatively compact. Then $f$ has a fixed point.

The last two statements are called Schauder’s Fixed Point Theorems, which may often be applied in functional analysis, or are famously used for proofs of Peano’s Theorem in differential equations. But at the core of all of them is Brouwer’s Theorem. This seems like a good place to end.

What does the determinant have to do with transformed measures?

Let us consider transformations of the space $\mathbb{R}^p$. How does Lebesgue measure change by this transformation? And how do integrals change? The general case is answered by Jacobi’s formula for integration by substitution. We will start out slowly and only look at how the measure of sets is transformed by linear mappings.

It is folklore in the basic courses on linear algebra, when the determinant of a matrix is introduced, to convey the notion of size of the parallelogram spanned by the column vectors of the matrix. The following theorem shows why this folklore is true; this of course is based on the axiomatic description of the determinant which encodes the notion of size already. But coming from pure axiomatic reasoning, we can connect the axioms of determinant theory to their actual meaning in measure theory.

First, remember the definition of the pushforward measure. Let $X$ and $Y$ be measurable spaces, and $f:X\to Y$ a measurable mapping (i.e. it maps Borel-measurable sets to Borel-measurable sets; we shall not deal with finer details of measure theory here). Let $\mu$ be a measure on $X$. Then we define a measure on $Y$ in the natural – what would $\mu$ do? – way:

$\displaystyle \mu_f(A) := \mu\bigl(f^{-1}(A)\bigr).$

In what follows, $X = Y = \mathbb{R}^p$ and $\mu = \lambda$ the Lebesgue measure.

Theorem (Transformation of Measure): Let $f$ be a bijective linear mapping and let $A\subset\mathbb{R}^p$ be a measurable set. Then, the pushforward measure satisfies:

$\displaystyle \lambda_f(A) = \left|\det f\right|^{-1}\lambda(A)\qquad\text{for any Borel set }A\in\mathcal{B}^p.$

Lemma (The Translation-Lemma): Lebesgue measure is invariant under translations.

Proof: Let $c,d\in\mathbb{R}^p$ with $c\leq d$ component-wise. Let $t_a$ be the shift by the vector $a\in\mathbb{R}^p$, i.e. $t_a(v) = v+a$ and $t_a(A) = \{x+a\in\mathbb{R}^p\colon x\in A\}$. Then,

$\displaystyle t_a^{-1}\bigl((c,d]\bigr) = (c-a,d-a],$

where the interval is meant as the cartesian product of the component-intervals. For the Lebesgue-measure, we get

$\displaystyle \lambda_{t_a}\bigl((c,d]\bigr) = \lambda\bigl((c-a,d-a]\bigr) = \prod_{i=1}^p\bigl((d-a)-(c-a)\bigr) = \prod_{i=1}^p(d-c) = \lambda\bigl((c,d]\bigr).$

The measures $\lambda$, $\lambda_{t_a}$ hence agree on the rectangles (open on their left-hand sides), i.e. on a semi-ring generating the $\sigma$-algebra $\mathcal{B}^p$. With the usual arguments (which might as well involve $\cap$-stable Dynkin-systems, for instance), we find that the measures agree on the whole $\mathcal{B}^p$.

q.e.d.

Lemma (The constant-multiple-Lemma): Let $\mu$ be a translation-invariant measure on $\mathcal{B}^p$, and $\alpha:=\mu\bigl([0,1]^p\bigr) < \infty$. Then $\mu(A) = \alpha\lambda(A)$ for any $A\in\mathcal(B)^p$.

Note that the Lemma only holds for finite measures on $[0,1]^p$. For instance, the counting measure is translation-invariant, but it is not a multiple of Lebesgue measure.

Proof: We divide the set $(0,1]^p$ via a rectangular grid of sidelengths $\frac1{n_i}$, $i=1,\ldots,p$:

$\displaystyle (0,1]^p = \bigcup_{\stackrel{k_j=0,\ldots,n_j-1}{j=1,\ldots,p}}\left(\times_{i=1}^p\left(0,\frac1{n_i}\right] + \left(\frac{k_1}{n_1},\ldots,\frac{k_p}{n_p}\right)\right).$

On the right-hand side there are $\prod_{i=1}^pn_i$ sets which have the same measure (by translation-invariance). Hence,

$\displaystyle \mu\bigl((0,1]^p\bigr) = \mu\left(\bigcup\cdots\right) = n_1\cdots n_p \cdot \mu\left(\times_{i=1}^p \left(0,\frac1{n_i}\right]\right).$

Here, we distinguish three cases.

Case 1: $\mu\bigl((0,1]^p\bigr) = 1$. Then,

$\displaystyle\mu\left(\times_{i=1}^p (0,\frac1{n_i}]\right) = \prod_{i=1}^p\frac1{n_i} = \lambda\left(\times_{i=1}^p (0,\frac1{n_i}]\right).$

By choosing appropriate grids and further translations, we get that $\mu(I) = \lambda(I)$ for any rectangle $I$ with rational bounds. Via the usual arguments, $\mu=\lambda$ on the whole of $\mathcal{B}^p$.

Case 2: $\mu\bigl((0,1]^p\bigr) \neq 1$ and $>0$. By assumption, however, this measure is finite. Setting $\gamma = \mu\bigl((0,1]^p\bigr)$, we can look at the measure $\gamma^{-1}\mu$, which of course has $\gamma^{-1}\mu\bigl((0,1]^p\bigr) = 1$. By Case 1, $\gamma^{-1}\mu = \lambda$.

Case 3: $\mu\bigl((0,1]^p\bigr) = 0$. Then, using translation invariance again,

$\displaystyle \mu(\mathbb{R}^p) = \mu\bigl(\bigcup_{z\in\mathbb{Z}^p}((0,1]^p+z)\bigr) = \sum_{z\in\mathbb{Z}^p}\mu\bigl((0,1]^p\bigr) = 0.$

Again, we get $\mu(A) = 0$ for all $A\in\mathcal{B}^p$.

That means, in all cases, $\mu$ is equal to a constant multiple of $\lambda$, the constant being the measure of $(0,1]^p$. That is not quite what we intended, as we wish the constant multiple to be the measure of the compact set $[0,1]^p$.

Remember our setting $\alpha:=\mu\bigl([0,1]^p\bigr)$ and $\gamma := \mu\bigl((0,1]^p\bigr)$. Let $A\in\mathcal{B}^p$. We distinguish another two cases:

Case $a)$ $\alpha = 0$. By monotony, $\gamma = 0$ and Case 3 applies: $\mu(A) = 0 = \alpha\lambda(A)$.

Case $b)$ $\alpha > 0$. By monotony and translation-invariance,

$\displaystyle \alpha \leq \mu\bigl((-1,1]^p\bigr) = 2^p\gamma,$

meaning $\gamma\geq\frac{\alpha}{2^p}$. Therefore, as $\alpha>0$, we get $\gamma>0$, and by Case 1, $\frac1\gamma\mu(A) = \lambda(A)$. In particular,

$\displaystyle \frac\alpha\gamma = \frac1\gamma\mu\bigl([0,1]^p\bigr) = \lambda\bigl([0,1]^p\bigr) = 1,$

and so, $\alpha = \gamma$, meaning $\mu(A) = \gamma\lambda(A) = \alpha\lambda(A)$.

q.e.d.

Proof (of the Theorem on Transformation of Measure). We will first show that the measure $\lambda_f$ is invariant under translations.

We find, using the Translation-Lemma in $(\ast)$, and the linearity of $f$ before that,

\displaystyle \begin{aligned} \lambda_{t_a\circ f}(A) = \lambda_f(A-a) &\stackrel{\hphantom{(\ast)}}{=} \lambda\bigl(f^{-1}(A-a)\bigr) \\ &\stackrel{\hphantom{(\ast)}}{=} \lambda\bigl(f^{-1}(A) - f^{-1}(a)\bigr) \\ &\stackrel{(\ast)}{=} \lambda\bigl(f^{-1}(A)\bigr) \\ &\stackrel{\hphantom{(\ast)}}{=} \lambda_f(A), \end{aligned}

which means that $\lambda_f$ is indeed invariant under translations.

As $[0,1]^p$ is compact, so is $f^{-1}\bigl([0,1]^p\bigr)$ – remember that continuous images of compact sets are compact (here, the continuous mapping is $f^{-1}$). In particular, $f^{-1}\bigl([0,1]^p\bigr)$ is bounded, and thus has finite Lebesgue measure.

We set $c(f) := \lambda_f\bigl([0,1]^p)\bigr)$. By the constant-multiple-Lemma, $\lambda_f$ is a multiple of Lebesgue measure: we must have

$\displaystyle \lambda_f(A) = c(f)\lambda(A)\text{ for all }A\in\mathcal{B}^p.\qquad (\spadesuit)$

We only have left to prove that $c(f) = \left|\det f\right|^{-1}$. To do this, there may be two directions to follow. We first give the way that is laid out in Elstrodt’s book (which we are basically following in this whole post). Later, we shall give the more folklore-way of concluding this proof.

We consider more and more general fashions of the invertible linear mapping $f$.

Step 1: Let $f$ be orthogonal. Then, for the unit ball $B_1(0)$,

$\displaystyle c(f)\lambda\bigl(B_1(0)\bigr) \stackrel{(\spadesuit)}{=} \lambda_f\bigl(B_1(0)\bigr) = \lambda\left(f^{-1}\bigl(B_1(0)\bigr)\right) = \lambda\bigl(B_1(0)\bigr).$

This means, that $c(f) = 1 = \left|\det(f)\right|$.

This step shows for the first time how the properties of a determinant encode the notion of size already: we have only used the basic lemmas on orthogonal matrices (they leave distances unchanged and hence the ball $B_1(0)$ doesn’t transform; besides, their inverse is their adjoint) and on determinants (they don’t react to orthogonal matrices because of their multiplicative property and because they don’t care for adjoints).

Step 2: Let $f$ have a representation as a diagonal matrix (using the standard basis of $\mathbb{R}^p$). Let us assume w.l.o.g. that $f = \mathrm{diag}(d_1,\ldots,d_p)$ with $d_i>0$. The case of $d_i<0$ is only notationally cumbersome. We get

$\displaystyle c(f)\lambda_f\bigl([0,1]^p\bigr) = \lambda\left(f^{-1}\bigl([0,1]^p\bigr)\right) = \lambda\bigl(\times_{i=1}^p[0,d_i^{-1}]\bigr) = \prod_{i=1}^pd_i^{-1} = \left|\det f\right|^{-1}.$

Again, the basic lemmas on determinants already make use of the notion of size without actually saying so. Here, it is the computation of the determinant by multiplication of the diagonal.

Step 3: Let $f$ be linear and invertible, and let $f^\ast$ be its adjoint. Then $f^\ast f$ is non-negative definite (since for $x\in\mathbb{R}^p$, $x^\ast(f^\ast f)x = (fx)^\ast(fx) = \left\|fx\right\|^2\geq0$). By the Principal Axis Theorem, there is some orthogonal matrix $v$ and some diagonal matrix with non-negative entries $d$ with $f^\ast f = vd^2v^\ast$. As $f$ was invertible, no entry of $d$ may vanish here (since then, its determinant would vanish and in particular, $f$ would no longer be invertible). Now, we set

$\displaystyle w:=d^{-1}v^\ast f^\ast,$

which is orthogonal because of

$\displaystyle ww^\ast = d^{-1} v^\ast f^\ast fv d^{-1} = d^{-1}v^\ast (vd^2v^\ast)v d^{-1} = d^{-1}d^2d^{-1}v^\ast v v^\ast v = \mathrm{id}.$

As $f = w^\ast dv$, we see from Step 1

\displaystyle \begin{aligned} c(f) = \lambda_f\bigl([0,1]^p\bigr) &= \lambda\left(v^{-1}d^{-1}w\bigl([0,1]^p\bigr)\right) \\ &= \lambda\left(d^{-1}w\bigl([0,1]^p\bigr)\right)\\ &= \left|\det d\right|^{-1}\lambda\left(w\bigl([0,1]^p\bigr)\right) \\ &= \left|\det d\right|^{-1}\lambda\bigl([0,1]^p\bigr) \\ &= \left|\det f\right|^{-1}\lambda\bigl([0,1]^p\bigr) \\ &= \left|\det f\right|^{-1}, \end{aligned}

by the multiplicative property of determinants again ($\det f = \det d$).

q.e.d.(Theorem)

As an encore, we show another way to conclude in the Theorem, once all the Lemmas are shown and applied. This is the more folklore way alluded to in the proof, making use of the fact that any invertible matrix is the product of elementary matrices (and, of course, making use of the multiplicative property of determinants). Hence, we only consider those.

Because Step 2 of the proof already dealt with diagonal matrices, we only have to look at shear-matrices like $E_{ij}(r) := \bigl(\delta_{kl}+r\delta_{ik}\delta_{jl}\bigr)_{k,l=1,\ldots,p}$. They are the identity matrix with the (off-diagonal) entry $r$ in row $i$ and column $j$. One readily finds $\bigl(E_{ij}(r)\bigr)^{-1} = E_{ij}(-r)$, and $\det E_{ij}(r) = 1$. Any vector $v\in[0,1]^p$ is mapped to

$\displaystyle E_{ij}(v_1,\ldots,v_i,\ldots, v_p)^t = (v_1,\ldots,v_i+rv_j,\ldots,v_p)^t.$

This gives

\displaystyle \begin{aligned}\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) &= \lambda\left(E_{ij}(-r)\bigl([0,1]^p\bigr)\right) \\ &= \lambda\left(\left\{x\in\mathbb{R}^p\colon x=(v_1,\ldots,v_i-rv_j,\ldots,v_p), v_k\in[0,1]\right\}\right). \end{aligned}

This is a parallelogram that may be covered by $n$ rectangles as follows: we fix the dimension $i$ and one other dimension to set a rectangle of height $\frac1n$, width $1+\frac rn$ (all other dimension-widths = 1; see the image for an illustration). Implicitly, we have demanded that $p\geq2$ here; but $p=1$ is uninteresting for the proof, as there are too few invertible linear mappings in $\mathbb{R}^1$.

By monotony, this yields

$\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) \leq n\frac1n\left(1+\frac{r}{n}\right) = 1+\frac rn\xrightarrow{n\to\infty}1.$

On the other hand, this parallelogram itself covers the rectangles of width $1-\frac rn$, and a similar computation shows that in the limit $\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr)\geq1$.

In particular: $\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) = 1 = \left|\det E_{ij}(r)\right|^{-1}$.

q.e.d. (Theorem encore)

Proving the multidimensional transformation formula for integration by substitution is considerably more difficult than in one dimension, where it basically amounts to reading the chain rule reversedly. Let us state the formula here first:

Theorem (The Transformation Formula, Jacobi): Let $U,V\subset \mathbb{R}^p$ be open sets and let $\Phi:U\to V$ be a $\mathcal{C}^1$-diffeomorphism (i.e. $\Phi^{-1}$ exists and both $\Phi$ and $\Phi^{-1}$ are $\mathcal{C}^1$-functions). Let $f:V\to\mathbb{R}$ be measurable. Then, $f\circ\Phi:U\to\mathbb{R}$ is measurable and

$\displaystyle\int_V f(t)dt = \int_U f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds.$

At the core of the proof is the Theorem on Transformation of Measure that we have proved above. The idea is to approximate $\Phi$ by linear mappings, which locally transform the Lebesgue measure underlying the integral and yield the determinant in each point as correction factor. The technical difficulty is to show that this approximation does no harm for the evaluation of the integral.

We will need a lemma first, which carries most of the weight of the proof.

The Preparatory Lemma: Let $U,V\subset \mathbb{R}^p$ be open sets and let $\Phi:U\to V$ be a $\mathcal{C}^1$-diffeomorphism. If $X\subset U$ is a Borel set, then so is $\Phi(X)\subset V$, and

$\displaystyle \lambda\bigl(\Phi(X)\bigr)\leq\int_X\left|\det\Phi'(s)\right|ds.$

Proof: Without loss of generality, we can assume that $\Phi$, $\Phi'$ and $(\Phi')^{-1}$ are defined on a compact set $K\supset U$. We consider, for instance, the sets

$\displaystyle U_k:=\left\{ x\in U\colon \left|x\right|\frac1k\right\}.$

The $U_k$ are open and bounded, $\overline U_k$ is hence compact, and there is a chain $U_k\subset\overline U_k\subset U_{k+1}\subset\cdots$ for all $k$, with $U=\bigcup_kU_k$. To each $U_k$ there is, hence, a compact superset on which $\Phi$, $\Phi'$ and $(\Phi')^{-1}$ are defined. Now, if we can prove the statement of the Preparatory Lemma on $X_k := X\cap U_k$, it will also be true on $X=\lim_kX_k$ by the monotone convergence theorem.

As we can consider all relevant functions to be defined on compact sets, and as they are continuous (and even more) by assumption, they are readily found to be uniformly continuous and bounded.

It is obvious that $\Phi(X)$ will be a Borel set, as $\Phi^{-1}$ is continuous.

Let us prove that the Preparatory Lemma holds for rectangles $I$ with rational endpoints, being contained in $U$.

There is some $r>0$ such that for any $a\in I$, $B_r(a)\subset U$. By continuity, there is a finite constant $M$ with

$\displaystyle M:=\sup_{t\in I}\left\|\bigl(\Phi'(t)\bigr)^{-1}\right\|,$

and by uniform continuity, $r$ may be chosen small enough such that, for any $\varepsilon>0$, even

$\displaystyle \sup_{x\in B_r(a)}\left\|\Phi'(x)-\Phi'(a)\right\|\leq\frac{\varepsilon}{M\sqrt p} \text{ for every }a\in I.$

With this $r$, we may now sub-divide our rectangle $I$ into disjoint cubes $I_k$ of side-length $d$ such that $d<\frac{r}{\sqrt p}$. In what follows, we shall sometimes need to consider the closure $\overline I_k$ for some of the estimates, but we shall not make the proper distinction for reasons of legibility.

For any given $b\in I_k$, every other point $c$ of $I_k$ may at most have distance $d$ in each of its components, which ensures

$\displaystyle\left\|b-c\right\|^2 \leq \sum_{i=1}^pd^2 = pd^2 < r^2.$

This, in turn, means, $I_k\subset B_r(b)$ (and $B_r(b)\subset U$ has been clear because of the construction of $r$).

Now, in every of the cubes $I_k$, we choose the point $a_k\in I_k$ with

$\displaystyle\left|\det\Phi'(a_k)\right| = \min_{t\in I_k}\left|\det\Phi'(t)\right|,$

and we define the linear mapping

$\displaystyle \Phi_k:=\Phi'(a_k)$.

Remember that for convex sets $A$, differentiable mappings $h:A\to\mathbb{R}^p$, and points $x,y\in A$, the mean value theorem shows

$\displaystyle \left\|h(x)-h(y)\right\|\leq\left\|x-y\right\|\sup_{\lambda\in[0,1]}\left\|h'\bigl(x+\lambda(y-x)\bigr)\right\|.$

Let $a\in I_k$ be a given point in one of the cubes. We apply the mean value theorem to the mapping $h(x):=\Phi(x)-\Phi_k(x)$, which is certainly differentiable, to $y:=a_k$, and to the convex set $A:=B_r(a)$:

\displaystyle \begin{aligned} \left\|h(x)-h(y)\right\|&\leq\left\|x-y\right\|\sup_{\lambda\in[0,1]}\left\|h'\bigl(x+\lambda(y-x)\bigr)\right\|\\ \left\|\Phi(x)-\Phi_k(x)-\Phi(a_k)+\Phi_k(a_k)\right\| & \leq \left\|x-a_k\right\|\sup_{\lambda\in[0,1]}\left\|\Phi'\bigl(x+\lambda(x-a_k)\bigr)-\Phi'(a_k)\right\|\\ \left\|\Phi(x)-\Phi(a_k)-\Phi_k(x-a_k)\right\| &< \left\|x-a_k\right\| \frac{\varepsilon}{M\sqrt p}\qquad (\clubsuit). \end{aligned}

Note that as $a_k\in I_k\subset B_r(a)$, $x+\lambda(x-a_k)\in B_r(a)$ by convexity, and hence the upper estimate of uniform continuity is applicable. Note beyond that, that $\Phi_k$ is the linear mapping $\Phi'(a_k)$ and the derivative of a linear mapping is the linear mapping itself.

Now, $\left\|x-a_k\right\|< d\sqrt p$, as both points are contained in $I_k$, and hence $(\clubsuit)$ shows

\displaystyle \begin{aligned} \Phi(I_k) &\subset \Phi(a_k)+\Phi_k(I_k-a_k)+B_{\frac{\varepsilon}{M\sqrt p}d\sqrt p}(0) \\ &\subset \Phi(a_k)+\Phi_k(I_k-a_k)+B_{\frac{d\varepsilon}{M}}(0). \end{aligned}

By continuity (and hence boundedness) of $\Phi'$, we also have

$\displaystyle \left\|(\Phi_k)^{-1}(x)\right\|\leq\left\|\bigl(\Phi'(a_k)\bigr)^{-1}\right\|\left\|x\right\|\leq M \left\|x\right\|$,

which means $B_{\frac{d\varepsilon}{M}}(0) = \Phi_k\left(\Phi_k^{-1}\bigl(B_{\frac{d\varepsilon}{M}}(0)\bigr)\right) \subset \Phi_k\bigl(B_{d\varepsilon}(0)\bigr)$.

Hence:

$\displaystyle \Phi(I_k) \subset \Phi(a_k) + \Phi_k\bigl(I_k-a_k+B_{d\varepsilon}(0)\bigr).$

Why all this work? We want to bound the measure of the set $\Phi(I_k)$, and we can get it now: the shift $\Phi(a_k)$ is unimportant by translation invariance. And the set $I_k-a_k+B_{d\varepsilon}(0)$ is contained in a cube of side-length $d+2d\varepsilon$. As promised, we have approximated the mapping $\Phi$ by a linear mapping $\Phi_k$ on a small set, and the transformed set has become only slightly bigger. By the Theorem on Transformation of Measure, this shows

\displaystyle \begin{aligned} \lambda\bigl(\Phi(I_k)\bigr) &\leq \lambda\left(\Phi_k\bigl(I_k-a_k+Blat_{d\varepsilon}(0)\bigr)\right) \\ &=\left|\det\Phi_k\right|\lambda\bigl(I_k-a_k+B_{d\varepsilon}(0)\bigr)\\ &\leq \left|\det\Phi_k\right|d^p(1+2\varepsilon)^p \\ &= \left|\det\Phi_k\right|(1+2\varepsilon)^p\lambda(I_k). \end{aligned}

Summing over all the cubes $I_k$ of which the rectangle $I$ was comprised, (remember that $\Phi$ is a diffeomorphism and disjoint sets are kept disjoint; besides, $a_k$ has been chosen to be the point in $I_k$ of smallest determinant for $\Phi'$)

\displaystyle \begin{aligned} \lambda\bigl(\Phi(I)\bigr) &\leq (1+2\varepsilon)^p\sum_{k=1}^n\left|\det \Phi_k\right|\lambda(I_k) \\ &= (1+2\varepsilon)^p\sum_{k=1}^n\left|\det \Phi'(a_k)\right|\lambda(I_k)\\ &= (1+2\varepsilon)^p\sum_{k=1}^n\int_{I_k}\left|\det\Phi'(a_k)\right|ds\\ &\leq (1+2\varepsilon)^p\int_I\left|\det\Phi'(s)\right|ds. \end{aligned}

Taking $\varepsilon\to0$ yields to smaller subdivisions $I_k$ and in the limit to the conclusion. The Preparatory Lemma holds for rectangles.

Now, let $X\subset U$ be any Borel set, and let $\varepsilon>0$. We cover $X$ by disjoint (rational) rectangles $R_k\subset U$, such that $\lambda\bigl(\bigcup R_k \setminus X\bigr)<\varepsilon$. Then,

\displaystyle \begin{aligned} \lambda\bigl(\Phi(X)\bigr) &\leq \sum_{k=1}^\infty \lambda\bigl(\Phi(R_k)\bigr)\\ &\leq\sum_{k=1}^\infty\int_{R_k}\left|\det \Phi'(s)\right|ds\\ &= \int_{\bigcup R_k}\left| \det\Phi'(s)\right| ds\\ &= \int_X\left| \det\Phi'(s)\right| ds + \int_{\bigcup R_k\setminus X}\left|\det\Phi'(s)\right|ds\\ &\leq \int_X\left| \det\Phi'(s)\right| ds + M\lambda\left(\bigcup R_k\setminus X\right)\\ &\leq \int_X\left| \det\Phi'(s)\right| ds + M\varepsilon. \end{aligned}

If we let $\varepsilon\to0$, we see $\lambda\bigl(\Phi(X)\bigr)\leq\int_X\bigl|\det\Phi'(s)\bigr|ds$.

q.e.d. (The Preparatory Lemma)

We didn’t use the full generality that may be possible here: we already focused ourselves on the Borel sets, instead of the larger class of Lebesgue-measurable sets. We shall skip the technical details that are linked to this topic, and switch immediately to the

Proof of Jacobi’s Transformation Formula: We can focus on non-negative functions $f$ without loss of generality (take the positive and the negative part separately, if needed). By the Preparatory Lemma, we already have

\displaystyle\begin{aligned} \int_{\Phi(U)}\mathbf{1}_{\Phi(X)}(s)ds &= \int_{V}\mathbf{1}_{\Phi(X)}(s)ds\\ &= \int_{\Phi(X)}ds\\ &= \lambda\bigl(\Phi(X)\bigr)\\ &\leq \int_X\left|\det\Phi'(s)\right|ds\\ &= \int_U\mathbf{1}_X(s)\left|\det\Phi'(s)\right|ds\\ &= \int_U\mathbf{1}_{\Phi(X)}\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds, \end{aligned}

which proves the inequality

$\displaystyle \int_{\Phi(U)}f(t)dt \leq \int_U f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds,$

for indicator functions $f = \mathbf{1}_{\Phi(X)}$. By usual arguments (linearity of the integral, monotone convergence), this also holds for any measurable function $f$. To prove the Transformation Formula completely, we apply this inequality to the transformation $\Phi^{-1}$ and the function $g(s):=f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|$:

\displaystyle \begin{aligned} \int_Uf\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds &= \int_{\Phi^{-1}(V)}g(t)dt\\ &\leq \int_Vg\bigl(\Phi^{-1}(t)\bigr)\left|\det(\Phi^{-1})'(t)\right|dt\\ &=\int_{\Phi(U)}f\Bigl(\Phi\bigl(\Phi^{-1}(t)\bigr)\Bigr)\left|\det\Phi'\bigl(\Phi^{-1}(t)\bigr)\right|\left|\det(\Phi^{-1})'(t)\right|dt\\ &=\int_{\Phi(U)}f(t)dt, \end{aligned}

since the chain rule yields $\Phi'\bigl(\Phi^{-1}\bigr)(\Phi^{-1})' = \bigl(\Phi(\Phi^{-1})\bigr)' = \mathrm{id}$. This means that the reverse inequality also holds. The Theorem is proved.

q.e.d. (Theorem)

There may be other, yet more intricate proofs of this Theorem. We shall not give any other of them here, but the rather mysterious looking way in which the determinant pops up in the transformation formula is not the only way to look at it. There is a proof by induction, given in Heuser’s book, where the determinant just appears from the inductive step. However, there is little geometric intuition in this proof, and it is by no means easier than what we did above (as it make heavy use of the theorem on implicit functions). Similar things may be said about the rather functional analytic proof in Königsberger’s book (who concludes the transformation formula by step functions converging in the $L^1$-norm, he found the determinant pretty much in the same way that we did).

Let us harvest a little of the hard work we did on the Transformation Formula. The most common example is the integral of the standard normal distribution, which amounts to the evaluation of

$\displaystyle \int_{-\infty}^\infty e^{-\frac12x^2}dx.$

This can happen via the transformation to polar coordinates:

$\Phi:(0,\infty)\times(0,2\pi)\to\mathbb{R}^2,\qquad (r,\varphi)\mapsto (r\cos \varphi, r\sin\varphi).$

For this transformation, which is surjective on all of $\mathbb{R}^2$ except for a set of measure $0$, we find

$\Phi'(r,\varphi) = \begin{pmatrix}\cos\varphi&-r\sin\varphi\\\sin\varphi&\hphantom{-}r\cos\varphi\end{pmatrix},\qquad \det\Phi'(r,\varphi) = r.$

From the Transformation Formula we now get

\displaystyle \begin{aligned} \left(\int_{-\infty}^\infty e^{-\frac12x^2}dx\right)^2 &= \int_{\mathbb{R}^2}\exp\left(-\frac12x^2-\frac12y^2\right)dxdy\\ &= \int_{\Phi\left((0,\infty)\times(0,2\pi)\right)}\exp\left(-\frac12x^2-\frac12y^2\right)dxdy\\ &= \int_{(0,\infty)\times(0,2\pi)}\exp\left(-\frac12r^2\cos^2(\varphi)-\frac12r^2\sin^2(\varphi)\right)\left|\det\Phi'(r,\varphi)\right|drd\varphi\\ &= \int_{(0,\infty)\times(0,2\pi)}\exp\left(-\frac12r^2\right)rdrd\varphi\\ &= \int_0^\infty\exp\left(-\frac12r^2\right)rdr\int_0^{2\pi}d\varphi \\ &= 2\pi \left[-\exp\left(-\frac12r^2\right)\right]_0^\infty\\ &= 2\pi \left(1-0\right)\\ &= 2\pi. \end{aligned}

In particular, $\int_{-\infty}^\infty e^{-\frac12x^2}dx=\sqrt{2\pi}$. One of the very basic results in probability theory.

Another little gem that follows from the Transformation Formula are the Fresnel integrals

$\displaystyle \int_{0}^\infty \cos(x^2)dx = \int_{0}^\infty\sin(x^2)dx = \sqrt{\frac{\pi}{8}}.$

They follow from the same basic trick given above for the standard normal density, but as other methods for deriving this result involve even trickier uses of similarly hard techniques (the Residue Theorem, for instance, as given in Remmert’s book), we shall give the proof of this here:

Consider

$\displaystyle F(t)=\int_{0}^\infty e^{-tx^2}\cos(x^2)dx\qquad\text{and}\qquad\int_{0}^\infty e^{-tx^2}\sin(x^2)dx.$

Then, the trigonometric identity $\cos a+b = \cos a \cos b - \sin a\sin b$ tells us

\displaystyle \begin{aligned} \bigl(F(t)\bigr)^2 - \big(G(t)\bigr)^2 &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\cos(x^2)\cos(y^2)dxdy - \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\sin(x^2)\sin(y^2)dxdy\\ &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\bigl(\cos(x^2+y^2)+\sin(x^2)\sin(y^2) - \sin(x^2)\sin(y^2)\bigr)dxdy\\ &= \int_0^\infty\int_0^{\frac\pi2}e^{-tr^2}\cos(r^2)rd\varphi dr\\ &= \frac\pi2 \int_0^\infty e^{-tu}\cos u\frac12du. \end{aligned}

This integral can be evaluated by parts to show

$\displaystyle \int_0^\infty e^{-tu}\cos udu\left(1+\frac1{t^2}\right) = \frac1t,$

which means

$\displaystyle \bigl(F(t)\bigr)^2 - \bigl(G(t)\bigr)^2 = \frac\pi4\int_0^\infty\int_0^\infty e^{-tu}\cos udu = \frac\pi4\frac t{t^2+1}.$

Then we consider the product $F(t)G(t)$ and use the identity $\sin(a+b) = \cos a\sin b + \cos b\sin a$, as well as the symmetry of the integrand and integration by parts, to get

\displaystyle \begin{aligned} F(t)G(t) &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\bigl(\cos(x^2)\sin(y^2)\bigr)dxdy\\ &=2\int_0^\infty\int_0^ye^{-t(x^2+y^2)}\sin(x^2+y^2)dxdy\\ &=2\int_0^\infty\int_0^{\frac\pi8}e^{-tr^2}\sin(r^2)rd\varphi dr\\ &=\frac\pi4\int_0^\infty e^{-tr^2}\sin(r^2)r dr\\ &=\frac\pi4\int_0^\infty e^{-tu}\sin u\frac12du\\ &=\frac\pi8\frac1{1+t^2}. \end{aligned}

We thus find by the dominated convergence theorem

\displaystyle \begin{aligned} \left(\int_0^\infty\cos x^2dx\right)^2-\left(\int_0^\infty\sin x^2dx\right)^2 &= \left(\int_0^\infty\lim_{t\downarrow0}e^{-tx}\cos x^2dx\right)^2-\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx}\sin x^2dx\right)^2 \\ &=\lim_{t\downarrow0}\left(\bigl(F(t)\bigr)^2-\bigl(G(t)\bigr)^2\right)\\ &=\lim_{t\downarrow0}\frac\pi4\frac{t}{t^2+1}\\ &=0, \end{aligned}

and

\displaystyle \begin{aligned} \left(\int_0^\infty\cos x^2dx\right)^2 &= \left(\int_0^\infty\cos x^2dx\right)\left(\int_0^\infty\sin x^2dx\right)\\ &=\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx}\cos x^2dx\right)\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx}\sin x^2dx\right)\\ &=\lim_{t\downarrow0}F(t)G(t)\\ &=\lim_{t\downarrow0}\frac\pi8\frac1{1+t^2}\\ &=\frac\pi8. \end{aligned}

One can easily find the bound that both integrals must be positive and from the first computation, we get

$\int_0^\infty\cos x^2dx = \int_0^\infty\sin x^2dx,$

from the second computation follows that the integrals have value $\sqrt{\frac\pi8}$.

q.e.d. (Fresnel integrals)

Even Brouwer’s Fixed Point Theorem may be concluded from the Transformation Formula (amongst a bunch of other theorems, none of which is actually as deep as this one though). This is worthy of a seperate text, mind you.