# What does the determinant have to do with transformed measures?

Let us consider transformations of the space $\mathbb{R}^p$. How does Lebesgue measure change under such a transformation? And how do integrals change? The general case is answered by Jacobi’s formula for integration by substitution. We will start out slowly and only look at how the measure of sets is transformed under linear mappings.

It is folklore in basic courses on linear algebra, when the determinant of a matrix is introduced, to convey the intuition that it measures the size of the parallelepiped spanned by the column vectors of the matrix. The following theorem shows why this folklore is true; this, of course, rests on the axiomatic description of the determinant, which encodes the notion of size already. But coming from purely axiomatic reasoning, we can connect the axioms of determinant theory to their actual meaning in measure theory.

First, remember the definition of the pushforward measure. Let $X$ and $Y$ be measurable spaces, and $f:X\to Y$ a measurable mapping (i.e. preimages of measurable sets under $f$ are measurable; we shall not deal with the finer details of measure theory here). Let $\mu$ be a measure on $X$. Then we define a measure on $Y$ in the natural – what would $\mu$ do? – way:

$\displaystyle \mu_f(A) := \mu\bigl(f^{-1}(A)\bigr).$
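
For intuition, this definition can be sketched in code for finite discrete measures (the function name and the example map are our own, purely illustrative):

```python
# A minimal sketch of the pushforward for finite discrete measures:
# masses are transported along f, so that mu_f(A) = mu(f^{-1}(A)).
from collections import defaultdict

def pushforward(mu, f):
    """mu is a dict point -> mass; return the pushforward measure under f."""
    mu_f = defaultdict(float)
    for x, mass in mu.items():
        mu_f[f(x)] += mass
    return dict(mu_f)

# Example: mass at 0, 1, 2; the map x -> x % 2 collapses 0 and 2.
mu = {0: 0.5, 1: 0.3, 2: 0.2}
mu_f = pushforward(mu, lambda x: x % 2)
print(mu_f)   # → {0: 0.7, 1: 0.3}
```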

In what follows, $X = Y = \mathbb{R}^p$ and $\mu = \lambda$ the Lebesgue measure.

Theorem (Transformation of Measure): Let $f:\mathbb{R}^p\to\mathbb{R}^p$ be a bijective linear mapping. Then, the pushforward measure satisfies:

$\displaystyle \lambda_f(A) = \left|\det f\right|^{-1}\lambda(A)\qquad\text{for any Borel set }A\in\mathcal{B}^p.$
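
Before proving this, here is a hedged numeric sanity check of the claim (the matrix, bounding box and sample size are our own choices, not from the text): a Monte Carlo estimate of $\lambda_f\bigl([0,1]^2\bigr) = \lambda\bigl(f^{-1}([0,1]^2)\bigr)$ for a concrete $f$, compared against $\left|\det f\right|^{-1}$.

```python
# Monte Carlo estimate of lambda(f^{-1}([0,1]^2)) for f = [[2,1],[0,3]],
# which the theorem predicts to be 1/|det f| = 1/6.
import random

random.seed(0)

def f(x1, x2):
    return (2.0*x1 + 1.0*x2, 3.0*x2)      # det f = 6

# One checks that f^{-1}([0,1]^2) lies inside [-0.2,0.6] x [-0.1,0.4].
box_area = 0.8 * 0.5
n, hits = 200_000, 0
for _ in range(n):
    x1 = random.uniform(-0.2, 0.6)
    x2 = random.uniform(-0.1, 0.4)
    y1, y2 = f(x1, x2)
    if 0 <= y1 <= 1 and 0 <= y2 <= 1:
        hits += 1

estimate = box_area * hits / n            # ~ lambda(f^{-1}([0,1]^2)) ~ 1/6
print(estimate, 1/6)
```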

Lemma (The Translation-Lemma): Lebesgue measure is invariant under translations.

Proof: Let $c,d\in\mathbb{R}^p$ with $c\leq d$ component-wise. Let $t_a$ be the shift by the vector $a\in\mathbb{R}^p$, i.e. $t_a(v) = v+a$ and $t_a(A) = \{x+a\in\mathbb{R}^p\colon x\in A\}$. Then,

$\displaystyle t_a^{-1}\bigl((c,d]\bigr) = (c-a,d-a],$

where the interval is meant as the cartesian product of the component-intervals. For the Lebesgue-measure, we get

$\displaystyle \lambda_{t_a}\bigl((c,d]\bigr) = \lambda\bigl((c-a,d-a]\bigr) = \prod_{i=1}^p\bigl((d_i-a_i)-(c_i-a_i)\bigr) = \prod_{i=1}^p(d_i-c_i) = \lambda\bigl((c,d]\bigr).$

The measures $\lambda$, $\lambda_{t_a}$ hence agree on the rectangles (open on their left-hand sides), i.e. on a semi-ring generating the $\sigma$-algebra $\mathcal{B}^p$. With the usual arguments (which may involve $\cap$-stable Dynkin systems, for instance), we find that the measures agree on the whole of $\mathcal{B}^p$.

q.e.d.

Lemma (The constant-multiple-Lemma): Let $\mu$ be a translation-invariant measure on $\mathcal{B}^p$, and $\alpha:=\mu\bigl([0,1]^p\bigr) < \infty$. Then $\mu(A) = \alpha\lambda(A)$ for any $A\in\mathcal{B}^p$.

Note that the finiteness assumption $\mu\bigl([0,1]^p\bigr)<\infty$ is essential. For instance, the counting measure is translation-invariant, but it is not a multiple of Lebesgue measure.

Proof: We divide the set $(0,1]^p$ via a rectangular grid of side lengths $\frac1{n_i}$, $i=1,\ldots,p$:

$\displaystyle (0,1]^p = \bigcup_{\stackrel{k_j=0,\ldots,n_j-1}{j=1,\ldots,p}}\left(\times_{i=1}^p\left(0,\frac1{n_i}\right] + \left(\frac{k_1}{n_1},\ldots,\frac{k_p}{n_p}\right)\right).$

On the right-hand side there are $\prod_{i=1}^pn_i$ sets which have the same measure (by translation-invariance). Hence,

$\displaystyle \mu\bigl((0,1]^p\bigr) = \mu\left(\bigcup\cdots\right) = n_1\cdots n_p \cdot \mu\left(\times_{i=1}^p \left(0,\frac1{n_i}\right]\right).$

Here, we distinguish three cases.

Case 1: $\mu\bigl((0,1]^p\bigr) = 1$. Then,

$\displaystyle\mu\left(\times_{i=1}^p (0,\frac1{n_i}]\right) = \prod_{i=1}^p\frac1{n_i} = \lambda\left(\times_{i=1}^p (0,\frac1{n_i}]\right).$

By choosing appropriate grids and further translations, we get that $\mu(I) = \lambda(I)$ for any rectangle $I$ with rational bounds. Via the usual arguments, $\mu=\lambda$ on the whole of $\mathcal{B}^p$.

Case 2: $\mu\bigl((0,1]^p\bigr) \neq 1$ and $\mu\bigl((0,1]^p\bigr)>0$. By monotonicity, $\mu\bigl((0,1]^p\bigr)\leq\mu\bigl([0,1]^p\bigr)<\infty$. Setting $\gamma = \mu\bigl((0,1]^p\bigr)$, we can look at the measure $\gamma^{-1}\mu$, which of course has $\gamma^{-1}\mu\bigl((0,1]^p\bigr) = 1$. By Case 1, $\gamma^{-1}\mu = \lambda$.

Case 3: $\mu\bigl((0,1]^p\bigr) = 0$. Then, using translation invariance again,

$\displaystyle \mu(\mathbb{R}^p) = \mu\bigl(\bigcup_{z\in\mathbb{Z}^p}((0,1]^p+z)\bigr) = \sum_{z\in\mathbb{Z}^p}\mu\bigl((0,1]^p\bigr) = 0.$

Again, we get $\mu(A) = 0$ for all $A\in\mathcal{B}^p$.

That means, in all cases, $\mu$ is equal to a constant multiple of $\lambda$, the constant being the measure of $(0,1]^p$. That is not quite what we intended, as we wish the constant multiple to be the measure of the compact set $[0,1]^p$.

Remember our setting $\alpha:=\mu\bigl([0,1]^p\bigr)$ and $\gamma := \mu\bigl((0,1]^p\bigr)$. Let $A\in\mathcal{B}^p$. We distinguish another two cases:

Case $a)$ $\alpha = 0$. By monotonicity, $\gamma = 0$, and Case 3 applies: $\mu(A) = 0 = \alpha\lambda(A)$.

Case $b)$ $\alpha > 0$. By monotonicity and translation-invariance,

$\displaystyle \alpha \leq \mu\bigl((-1,1]^p\bigr) = 2^p\gamma,$

meaning $\gamma\geq\frac{\alpha}{2^p}$. Therefore, as $\alpha>0$, we get $\gamma>0$, and by Case 1, $\frac1\gamma\mu(A) = \lambda(A)$. In particular,

$\displaystyle \frac\alpha\gamma = \frac1\gamma\mu\bigl([0,1]^p\bigr) = \lambda\bigl([0,1]^p\bigr) = 1,$

and so, $\alpha = \gamma$, meaning $\mu(A) = \gamma\lambda(A) = \alpha\lambda(A)$.

q.e.d.

Proof (of the Theorem on Transformation of Measure). We will first show that the measure $\lambda_f$ is invariant under translations.

We find, using the Translation-Lemma in $(\ast)$, and the linearity of $f$ before that,

\displaystyle \begin{aligned} \lambda_{t_a\circ f}(A) = \lambda_f(A-a) &\stackrel{\hphantom{(\ast)}}{=} \lambda\bigl(f^{-1}(A-a)\bigr) \\ &\stackrel{\hphantom{(\ast)}}{=} \lambda\bigl(f^{-1}(A) - f^{-1}(a)\bigr) \\ &\stackrel{(\ast)}{=} \lambda\bigl(f^{-1}(A)\bigr) \\ &\stackrel{\hphantom{(\ast)}}{=} \lambda_f(A), \end{aligned}

which means that $\lambda_f$ is indeed invariant under translations.

As $[0,1]^p$ is compact, so is $f^{-1}\bigl([0,1]^p\bigr)$ – remember that continuous images of compact sets are compact (here, the continuous mapping is $f^{-1}$). In particular, $f^{-1}\bigl([0,1]^p\bigr)$ is bounded, and thus has finite Lebesgue measure.

We set $c(f) := \lambda_f\bigl([0,1]^p\bigr)$. By the constant-multiple-Lemma, $\lambda_f$ is a multiple of Lebesgue measure: we must have

$\displaystyle \lambda_f(A) = c(f)\lambda(A)\text{ for all }A\in\mathcal{B}^p.\qquad (\spadesuit)$

It remains to prove that $c(f) = \left|\det f\right|^{-1}$. To do this, there are two directions one may follow. We first give the way that is laid out in Elstrodt’s book (which we are basically following in this whole post). Later, we shall give the more folklore way of concluding this proof.

We consider increasingly general forms of the invertible linear mapping $f$.

Step 1: Let $f$ be orthogonal. Then, for the unit ball $B_1(0)$,

$\displaystyle c(f)\lambda\bigl(B_1(0)\bigr) \stackrel{(\spadesuit)}{=} \lambda_f\bigl(B_1(0)\bigr) = \lambda\left(f^{-1}\bigl(B_1(0)\bigr)\right) = \lambda\bigl(B_1(0)\bigr).$

This means that $c(f) = 1 = \left|\det f\right|^{-1}$.

This step shows for the first time how the properties of the determinant already encode the notion of size: we have only used the basic lemmas on orthogonal matrices (they preserve distances, hence leave the ball $B_1(0)$ unchanged; besides, their inverse is their adjoint) and on determinants (orthogonal matrices have determinant $\pm1$, by the multiplicative property and invariance under taking adjoints).

Step 2: Let $f$ have a representation as a diagonal matrix (using the standard basis of $\mathbb{R}^p$). Let us assume w.l.o.g. that $f = \mathrm{diag}(d_1,\ldots,d_p)$ with $d_i>0$. The case of $d_i<0$ is only notationally cumbersome. We get

$\displaystyle c(f) = \lambda_f\bigl([0,1]^p\bigr) = \lambda\left(f^{-1}\bigl([0,1]^p\bigr)\right) = \lambda\bigl(\times_{i=1}^p[0,d_i^{-1}]\bigr) = \prod_{i=1}^pd_i^{-1} = \left|\det f\right|^{-1}.$

Again, the basic lemmas on determinants already make use of the notion of size without actually saying so. Here, it is the computation of the determinant by multiplication of the diagonal.

Step 3: Let $f$ be linear and invertible, and let $f^\ast$ be its adjoint. Then $f^\ast f$ is non-negative definite (since for $x\in\mathbb{R}^p$, $x^\ast(f^\ast f)x = (fx)^\ast(fx) = \left\|fx\right\|^2\geq0$). By the Principal Axis Theorem, there is some orthogonal matrix $v$ and some diagonal matrix with non-negative entries $d$ with $f^\ast f = vd^2v^\ast$. As $f$ was invertible, no entry of $d$ may vanish here (since then, its determinant would vanish and in particular, $f$ would no longer be invertible). Now, we set

$\displaystyle w:=d^{-1}v^\ast f^\ast,$

which is orthogonal because of

$\displaystyle ww^\ast = d^{-1} v^\ast f^\ast fv d^{-1} = d^{-1}v^\ast (vd^2v^\ast)v d^{-1} = d^{-1}d^2d^{-1} = \mathrm{id}.$

As $f = w^\ast dv^\ast$ (take adjoints in the definition of $w$), we have $f^{-1} = vd^{-1}w$, and we see from Step 1

\displaystyle \begin{aligned} c(f) = \lambda_f\bigl([0,1]^p\bigr) &= \lambda\left(f^{-1}\bigl([0,1]^p\bigr)\right) \\ &= \lambda\left(vd^{-1}w\bigl([0,1]^p\bigr)\right) \\ &= \lambda\left(d^{-1}w\bigl([0,1]^p\bigr)\right)\\ &= \left|\det d\right|^{-1}\lambda\left(w\bigl([0,1]^p\bigr)\right) \\ &= \left|\det d\right|^{-1}\lambda\bigl([0,1]^p\bigr) \\ &= \left|\det f\right|^{-1}, \end{aligned}

by the multiplicative property of determinants again ($\left|\det f\right| = \det d$).

q.e.d.(Theorem)
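
Step 3 can be illustrated numerically: the diagonal matrix $d$ in the decomposition contains the singular values of $f$, and their product equals $\left|\det f\right|$. A minimal pure-Python check for a $2\times2$ example of our own choosing:

```python
# For f = [[2,1],[0,3]]: the eigenvalues of f* f are the squared singular
# values, and the product of the singular values equals |det f| = 6.
import math

f = [[2.0, 1.0], [0.0, 3.0]]
det_f = f[0][0]*f[1][1] - f[0][1]*f[1][0]

# g = f* f is symmetric 2x2; solve its characteristic equation directly.
g = [[f[0][0]**2 + f[1][0]**2, f[0][0]*f[0][1] + f[1][0]*f[1][1]],
     [f[0][0]*f[0][1] + f[1][0]*f[1][1], f[0][1]**2 + f[1][1]**2]]
tr = g[0][0] + g[1][1]
det_g = g[0][0]*g[1][1] - g[0][1]*g[1][0]
disc = math.sqrt(tr*tr - 4*det_g)
eig1, eig2 = (tr + disc)/2, (tr - disc)/2

sing_product = math.sqrt(eig1) * math.sqrt(eig2)
print(sing_product, abs(det_f))   # both 6.0 (up to rounding)
```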

As an encore, we show another way to conclude the Theorem, once all the Lemmas are shown and applied. This is the more folklore way alluded to in the proof, making use of the fact that any invertible matrix is a product of elementary matrices (and, of course, of the multiplicative property of determinants). Hence, it suffices to consider elementary matrices.

Because Step 2 of the proof already dealt with diagonal matrices, we only have to look at shear-matrices like $E_{ij}(r) := \bigl(\delta_{kl}+r\delta_{ik}\delta_{jl}\bigr)_{k,l=1,\ldots,p}$. They are the identity matrix with the (off-diagonal) entry $r$ in row $i$ and column $j$. One readily finds $\bigl(E_{ij}(r)\bigr)^{-1} = E_{ij}(-r)$, and $\det E_{ij}(r) = 1$. Any vector $v\in[0,1]^p$ is mapped to

$\displaystyle E_{ij}(r)(v_1,\ldots,v_i,\ldots, v_p)^t = (v_1,\ldots,v_i+rv_j,\ldots,v_p)^t.$

This gives

\displaystyle \begin{aligned}\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) &= \lambda\left(E_{ij}(-r)\bigl([0,1]^p\bigr)\right) \\ &= \lambda\left(\left\{x\in\mathbb{R}^p\colon x=(v_1,\ldots,v_i-rv_j,\ldots,v_p), v_k\in[0,1]\right\}\right). \end{aligned}

This is a parallelepiped that may be covered by $n$ rectangles as follows: we fix the dimensions $i$ and $j$ and cover the sheared square in the $(i,j)$-plane by rectangles of height $\frac1n$ and width $1+\frac rn$ (all other dimension-widths $=1$); we assume $r>0$ here, otherwise replace $r$ by $\left|r\right|$. Implicitly, we have demanded that $p\geq2$; but $p=1$ is uninteresting for the proof, as the invertible linear mappings of $\mathbb{R}^1$ are already covered by Step 2.

By monotony, this yields

$\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) \leq n\frac1n\left(1+\frac{r}{n}\right) = 1+\frac rn\xrightarrow{n\to\infty}1.$

On the other hand, this parallelogram itself covers the rectangles of width $1-\frac rn$, and a similar computation shows that in the limit $\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr)\geq1$.

In particular: $\lambda_{E_{ij}(r)}\bigl([0,1]^p\bigr) = 1 = \left|\det E_{ij}(r)\right|^{-1}$.

q.e.d. (Theorem encore)
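
The covering argument can also be checked directly: for a $2\times2$ shear, the image of the unit square is a parallelogram whose area, computed exactly by the shoelace formula, is $1$. A small sketch (the value of $r$ is our own, arbitrary choice):

```python
# Image of the unit square under the shear E_{12}(r): a parallelogram of
# area 1 = |det E_{12}(r)|^{-1}, computed via the shoelace formula.
r = 0.75
E = [[1.0, r], [0.0, 1.0]]                 # E_{12}(r)
det_E = E[0][0]*E[1][1] - E[0][1]*E[1][0]

def apply(m, v):
    return (m[0][0]*v[0] + m[0][1]*v[1], m[1][0]*v[0] + m[1][1]*v[1])

# Corners of the unit square in cyclic order, and their images under E.
corners = [(0, 0), (1, 0), (1, 1), (0, 1)]
img = [apply(E, v) for v in corners]

# Shoelace formula for the area of the image polygon.
area = 0.5 * abs(sum(img[i][0]*img[(i+1) % 4][1] - img[(i+1) % 4][0]*img[i][1]
                     for i in range(4)))
print(det_E, area)   # 1.0 and 1.0
```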

Proving the multidimensional transformation formula for integration by substitution is considerably more difficult than in one dimension, where it basically amounts to reading the chain rule in reverse. Let us state the formula here first:

Theorem (The Transformation Formula, Jacobi): Let $U,V\subset \mathbb{R}^p$ be open sets and let $\Phi:U\to V$ be a $\mathcal{C}^1$-diffeomorphism (i.e. $\Phi^{-1}$ exists and both $\Phi$ and $\Phi^{-1}$ are $\mathcal{C}^1$-functions). Let $f:V\to\mathbb{R}$ be measurable and non-negative (or integrable). Then, $f\circ\Phi:U\to\mathbb{R}$ is measurable and

$\displaystyle\int_V f(t)dt = \int_U f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds.$

At the core of the proof is the Theorem on Transformation of Measure that we have proved above. The idea is to approximate $\Phi$ by linear mappings, which locally transform the Lebesgue measure underlying the integral and yield the determinant in each point as correction factor. The technical difficulty is to show that this approximation does no harm for the evaluation of the integral.
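
As a hedged numeric illustration (with a diffeomorphism of our own choosing, not from the text), both sides of the formula can be compared by simple midpoint-rule quadrature: take $\Phi(s_1,s_2)=(s_1^2,s_2)$ on $U=(1,2)\times(0,1)$, so that $V=(1,4)\times(0,1)$, $\det\Phi'(s)=2s_1$, and $f(t_1,t_2)=t_1$; both integrals equal $7.5$.

```python
# Midpoint-rule check of int_V f(t) dt = int_U f(Phi(s)) |det Phi'(s)| ds
# for Phi(s1, s2) = (s1^2, s2), f(t1, t2) = t1. Exact value: 7.5.
def midpoint_2d(g, x0, x1, y0, y1, n=400):
    hx, hy = (x1 - x0)/n, (y1 - y0)/n
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += g(x0 + (i + 0.5)*hx, y0 + (j + 0.5)*hy)
    return total * hx * hy

lhs = midpoint_2d(lambda t1, t2: t1, 1, 4, 0, 1)               # int_V f
rhs = midpoint_2d(lambda s1, s2: (s1**2) * 2*s1, 1, 2, 0, 1)   # int_U f(Phi)|det Phi'|
print(lhs, rhs)   # both ~ 7.5
```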

We will need a lemma first, which carries most of the weight of the proof.

The Preparatory Lemma: Let $U,V\subset \mathbb{R}^p$ be open sets and let $\Phi:U\to V$ be a $\mathcal{C}^1$-diffeomorphism. If $X\subset U$ is a Borel set, then so is $\Phi(X)\subset V$, and

$\displaystyle \lambda\bigl(\Phi(X)\bigr)\leq\int_X\left|\det\Phi'(s)\right|ds.$

Proof: Without loss of generality, we can assume that $\Phi$, $\Phi'$ and $(\Phi')^{-1}$ are defined on a compact set $K\supset U$. We consider, for instance, the sets

$\displaystyle U_k:=\left\{ x\in U\colon \left|x\right|<k\ \text{and}\ \mathrm{dist}\bigl(x,U^c\bigr)>\frac1k\right\}.$

The $U_k$ are open and bounded, $\overline U_k$ is hence compact, and there is a chain $U_k\subset\overline U_k\subset U_{k+1}\subset\cdots$ for all $k$, with $U=\bigcup_kU_k$. To each $U_k$ there is, hence, a compact superset on which $\Phi$, $\Phi'$ and $(\Phi')^{-1}$ are defined. Now, if we can prove the statement of the Preparatory Lemma on $X_k := X\cap U_k$, it will also be true on $X=\lim_kX_k$ by the monotone convergence theorem.

As we can consider all relevant functions to be defined on compact sets, and as they are continuous (and even more) by assumption, they are readily found to be uniformly continuous and bounded.

It is obvious that $\Phi(X)$ will be a Borel set, as $\Phi^{-1}$ is continuous.

Let us prove that the Preparatory Lemma holds for rectangles $I$ with rational endpoints, being contained in $U$.

There is some $r>0$ such that for any $a\in I$, $B_r(a)\subset U$. By continuity and compactness, the constant

$\displaystyle M:=\sup_{t\in I}\left\|\bigl(\Phi'(t)\bigr)^{-1}\right\|$

is finite, and, given $\varepsilon>0$, uniform continuity allows us to choose $r$ small enough such that even

$\displaystyle \sup_{x\in B_r(a)}\left\|\Phi'(x)-\Phi'(a)\right\|\leq\frac{\varepsilon}{M\sqrt p} \text{ for every }a\in I.$

With this $r$, we may now sub-divide our rectangle $I$ into disjoint cubes $I_k$ of side-length $d$ such that $d<\frac{r}{\sqrt p}$. In what follows, we shall sometimes need to consider the closure $\overline I_k$ for some of the estimates, but we shall not make the proper distinction for reasons of legibility.

For any given $b\in I_k$, every other point $c$ of $I_k$ may at most have distance $d$ in each of its components, which ensures

$\displaystyle\left\|b-c\right\|^2 \leq \sum_{i=1}^pd^2 = pd^2 < r^2.$

This, in turn, means $I_k\subset B_r(b)$ (and $B_r(b)\subset U$ holds by the construction of $r$).

Now, in each of the cubes $I_k$, we choose the point $a_k\in I_k$ with

$\displaystyle\left|\det\Phi'(a_k)\right| = \min_{t\in I_k}\left|\det\Phi'(t)\right|,$

and we define the linear mapping

$\displaystyle \Phi_k:=\Phi'(a_k)$.

Remember that for convex sets $A$, differentiable mappings $h:A\to\mathbb{R}^p$, and points $x,y\in A$, the mean value theorem shows

$\displaystyle \left\|h(x)-h(y)\right\|\leq\left\|x-y\right\|\sup_{\lambda\in[0,1]}\left\|h'\bigl(x+\lambda(y-x)\bigr)\right\|.$

Let $a\in I_k$ be a given point in one of the cubes. We apply the mean value theorem to the mapping $h(x):=\Phi(x)-\Phi_k(x)$, which is certainly differentiable, to $y:=a_k$, and to the convex set $A:=B_r(a)$:

\displaystyle \begin{aligned} \left\|h(x)-h(y)\right\|&\leq\left\|x-y\right\|\sup_{\lambda\in[0,1]}\left\|h'\bigl(x+\lambda(y-x)\bigr)\right\|\\ \left\|\Phi(x)-\Phi_k(x)-\Phi(a_k)+\Phi_k(a_k)\right\| & \leq \left\|x-a_k\right\|\sup_{\lambda\in[0,1]}\left\|\Phi'\bigl(x+\lambda(a_k-x)\bigr)-\Phi'(a_k)\right\|\\ \left\|\Phi(x)-\Phi(a_k)-\Phi_k(x-a_k)\right\| &\leq \left\|x-a_k\right\| \frac{\varepsilon}{M\sqrt p}\qquad (\clubsuit). \end{aligned}

Note that as $a_k\in I_k\subset B_r(a)$, we have $x+\lambda(a_k-x)\in B_r(a)$ by convexity, and hence the estimate from uniform continuity is applicable. Note also that $\Phi_k$ is the linear mapping $\Phi'(a_k)$, and the derivative of a linear mapping is the linear mapping itself.

Now, $\left\|x-a_k\right\|< d\sqrt p$, as both points are contained in $I_k$, and hence $(\clubsuit)$ shows

\displaystyle \begin{aligned} \Phi(I_k) &\subset \Phi(a_k)+\Phi_k(I_k-a_k)+B_{\frac{\varepsilon}{M\sqrt p}d\sqrt p}(0) \\ &\subset \Phi(a_k)+\Phi_k(I_k-a_k)+B_{\frac{d\varepsilon}{M}}(0). \end{aligned}

By the definition of $M$ (note that $a_k\in I$), we also have

$\displaystyle \left\|(\Phi_k)^{-1}(x)\right\|\leq\left\|\bigl(\Phi'(a_k)\bigr)^{-1}\right\|\left\|x\right\|\leq M \left\|x\right\|$,

which means $B_{\frac{d\varepsilon}{M}}(0) = \Phi_k\left(\Phi_k^{-1}\bigl(B_{\frac{d\varepsilon}{M}}(0)\bigr)\right) \subset \Phi_k\bigl(B_{d\varepsilon}(0)\bigr)$.

Hence:

$\displaystyle \Phi(I_k) \subset \Phi(a_k) + \Phi_k\bigl(I_k-a_k+B_{d\varepsilon}(0)\bigr).$

Why all this work? We want to bound the measure of the set $\Phi(I_k)$, and we can get it now: the shift $\Phi(a_k)$ is unimportant by translation invariance. And the set $I_k-a_k+B_{d\varepsilon}(0)$ is contained in a cube of side-length $d+2d\varepsilon$. As promised, we have approximated the mapping $\Phi$ by a linear mapping $\Phi_k$ on a small set, and the transformed set has become only slightly bigger. By the Theorem on Transformation of Measure, this shows

\displaystyle \begin{aligned} \lambda\bigl(\Phi(I_k)\bigr) &\leq \lambda\left(\Phi_k\bigl(I_k-a_k+B_{d\varepsilon}(0)\bigr)\right) \\ &=\left|\det\Phi_k\right|\lambda\bigl(I_k-a_k+B_{d\varepsilon}(0)\bigr)\\ &\leq \left|\det\Phi_k\right|d^p(1+2\varepsilon)^p \\ &= \left|\det\Phi_k\right|(1+2\varepsilon)^p\lambda(I_k). \end{aligned}

Summing over all the cubes $I_k$ into which the rectangle $I$ was divided (remember that $\Phi$ is a diffeomorphism, so disjoint sets are kept disjoint; besides, $a_k$ has been chosen as the point of $I_k$ where $\left|\det\Phi'\right|$ is smallest), we find

\displaystyle \begin{aligned} \lambda\bigl(\Phi(I)\bigr) &\leq (1+2\varepsilon)^p\sum_{k=1}^n\left|\det \Phi_k\right|\lambda(I_k) \\ &= (1+2\varepsilon)^p\sum_{k=1}^n\left|\det \Phi'(a_k)\right|\lambda(I_k)\\ &= (1+2\varepsilon)^p\sum_{k=1}^n\int_{I_k}\left|\det\Phi'(a_k)\right|ds\\ &\leq (1+2\varepsilon)^p\int_I\left|\det\Phi'(s)\right|ds. \end{aligned}

Taking $\varepsilon\to0$ yields ever finer subdivisions $I_k$ and, in the limit, the conclusion. The Preparatory Lemma holds for rectangles.

Now, let $X\subset U$ be any Borel set, and let $\varepsilon>0$. We cover $X$ by disjoint (rational) rectangles $R_k\subset U$ such that $\lambda\bigl(\bigcup R_k \setminus X\bigr)<\varepsilon$. Then, with $M$ now denoting a bound for $\left|\det\Phi'\right|$ (finite by the compactness considerations above),

\displaystyle \begin{aligned} \lambda\bigl(\Phi(X)\bigr) &\leq \sum_{k=1}^\infty \lambda\bigl(\Phi(R_k)\bigr)\\ &\leq\sum_{k=1}^\infty\int_{R_k}\left|\det \Phi'(s)\right|ds\\ &= \int_{\bigcup R_k}\left| \det\Phi'(s)\right| ds\\ &= \int_X\left| \det\Phi'(s)\right| ds + \int_{\bigcup R_k\setminus X}\left|\det\Phi'(s)\right|ds\\ &\leq \int_X\left| \det\Phi'(s)\right| ds + M\lambda\left(\bigcup R_k\setminus X\right)\\ &\leq \int_X\left| \det\Phi'(s)\right| ds + M\varepsilon. \end{aligned}

If we let $\varepsilon\to0$, we see $\lambda\bigl(\Phi(X)\bigr)\leq\int_X\bigl|\det\Phi'(s)\bigr|ds$.

q.e.d. (The Preparatory Lemma)

We didn’t use the full generality that may be possible here: we restricted ourselves to the Borel sets, instead of the larger class of Lebesgue-measurable sets. We shall skip the technical details that are linked to this topic, and switch immediately to the

Proof of Jacobi’s Transformation Formula: We can focus on non-negative functions $f$ without loss of generality (take the positive and the negative part separately, if needed). By the Preparatory Lemma, we already have

\displaystyle\begin{aligned} \int_{\Phi(U)}\mathbf{1}_{\Phi(X)}(s)ds &= \int_{V}\mathbf{1}_{\Phi(X)}(s)ds\\ &= \int_{\Phi(X)}ds\\ &= \lambda\bigl(\Phi(X)\bigr)\\ &\leq \int_X\left|\det\Phi'(s)\right|ds\\ &= \int_U\mathbf{1}_X(s)\left|\det\Phi'(s)\right|ds\\ &= \int_U\mathbf{1}_{\Phi(X)}\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds, \end{aligned}

which proves the inequality

$\displaystyle \int_{\Phi(U)}f(t)dt \leq \int_U f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds,$

for indicator functions $f = \mathbf{1}_{\Phi(X)}$. By the usual arguments (linearity of the integral, monotone convergence), this also holds for any non-negative measurable function $f$. To prove the Transformation Formula completely, we apply this inequality to the transformation $\Phi^{-1}$ and the function $g(s):=f\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|$:

\displaystyle \begin{aligned} \int_Uf\bigl(\Phi(s)\bigr)\left|\det\Phi'(s)\right|ds &= \int_{\Phi^{-1}(V)}g(t)dt\\ &\leq \int_Vg\bigl(\Phi^{-1}(t)\bigr)\left|\det(\Phi^{-1})'(t)\right|dt\\ &=\int_{\Phi(U)}f\Bigl(\Phi\bigl(\Phi^{-1}(t)\bigr)\Bigr)\left|\det\Phi'\bigl(\Phi^{-1}(t)\bigr)\right|\left|\det(\Phi^{-1})'(t)\right|dt\\ &=\int_{\Phi(U)}f(t)dt, \end{aligned}

since the chain rule yields $\Phi'\bigl(\Phi^{-1}(t)\bigr)\,(\Phi^{-1})'(t) = \bigl(\Phi\circ\Phi^{-1}\bigr)'(t) = \mathrm{id}$, so the two determinants multiply to $1$. This means that the reverse inequality also holds. The Theorem is proved.

q.e.d. (Theorem)

There may be other, yet more intricate, proofs of this Theorem. We shall not give any of them here, but the rather mysterious-looking way in which the determinant pops up in the transformation formula is not the only way to look at it. There is a proof by induction, given in Heuser’s book, where the determinant just appears from the inductive step. However, there is little geometric intuition in this proof, and it is by no means easier than what we did above (as it makes heavy use of the theorem on implicit functions). Similar things may be said about the rather functional-analytic proof in Königsberger’s book (who concludes the transformation formula via step functions converging in the $L^1$-norm; the determinant is found in pretty much the same way as we did).

Let us harvest a little of the hard work we did on the Transformation Formula. The most common example is the integral of the standard normal distribution, which amounts to the evaluation of

$\displaystyle \int_{-\infty}^\infty e^{-\frac12x^2}dx.$

This can happen via the transformation to polar coordinates:

$\Phi:(0,\infty)\times(0,2\pi)\to\mathbb{R}^2,\qquad (r,\varphi)\mapsto (r\cos \varphi, r\sin\varphi).$

For this transformation, which is a bijection onto $\mathbb{R}^2$ minus a set of measure $0$, we find

$\Phi'(r,\varphi) = \begin{pmatrix}\cos\varphi&-r\sin\varphi\\\sin\varphi&\hphantom{-}r\cos\varphi\end{pmatrix},\qquad \det\Phi'(r,\varphi) = r.$
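
As a quick sanity check of this Jacobian (our own sketch, not part of the derivation), one can approximate the partial derivatives of $\Phi$ by central finite differences and compare the resulting determinant with $r$:

```python
# Finite-difference check that det Phi'(r, phi) = r for the polar map
# Phi(r, phi) = (r cos phi, r sin phi), at an arbitrary sample point.
import math

def phi(r, p):
    return (r*math.cos(p), r*math.sin(p))

def jacobian_det(r, p, h=1e-6):
    # central differences for the four partial derivatives
    dx_dr = (phi(r+h, p)[0] - phi(r-h, p)[0]) / (2*h)
    dx_dp = (phi(r, p+h)[0] - phi(r, p-h)[0]) / (2*h)
    dy_dr = (phi(r+h, p)[1] - phi(r-h, p)[1]) / (2*h)
    dy_dp = (phi(r, p+h)[1] - phi(r, p-h)[1]) / (2*h)
    return dx_dr*dy_dp - dx_dp*dy_dr

r0, p0 = 1.7, 0.9
print(jacobian_det(r0, p0), r0)   # both ~ 1.7
```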

From the Transformation Formula we now get

\displaystyle \begin{aligned} \left(\int_{-\infty}^\infty e^{-\frac12x^2}dx\right)^2 &= \int_{\mathbb{R}^2}\exp\left(-\frac12x^2-\frac12y^2\right)dxdy\\ &= \int_{\Phi\left((0,\infty)\times(0,2\pi)\right)}\exp\left(-\frac12x^2-\frac12y^2\right)dxdy\\ &= \int_{(0,\infty)\times(0,2\pi)}\exp\left(-\frac12r^2\cos^2(\varphi)-\frac12r^2\sin^2(\varphi)\right)\left|\det\Phi'(r,\varphi)\right|drd\varphi\\ &= \int_{(0,\infty)\times(0,2\pi)}\exp\left(-\frac12r^2\right)rdrd\varphi\\ &= \int_0^\infty\exp\left(-\frac12r^2\right)rdr\int_0^{2\pi}d\varphi \\ &= 2\pi \left[-\exp\left(-\frac12r^2\right)\right]_0^\infty\\ &= 2\pi \left(1-0\right)\\ &= 2\pi. \end{aligned}

In particular, $\int_{-\infty}^\infty e^{-\frac12x^2}dx=\sqrt{2\pi}$. One of the very basic results in probability theory.
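
The value $\sqrt{2\pi}\approx2.5066$ is easy to confirm numerically; a minimal midpoint-rule sketch (truncation point and step count are our own choices; the tails beyond $\left|x\right|=12$ are negligible):

```python
# Midpoint-rule approximation of int e^{-x^2/2} dx over [-12, 12],
# compared with the exact value sqrt(2*pi).
import math

n, lo, hi = 200_000, -12.0, 12.0
h = (hi - lo) / n
integral = h * sum(math.exp(-0.5*(lo + (i + 0.5)*h)**2) for i in range(n))
print(integral, math.sqrt(2*math.pi))   # both ~ 2.5066
```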

Another little gem that follows from the Transformation Formula are the Fresnel integrals

$\displaystyle \int_{0}^\infty \cos(x^2)dx = \int_{0}^\infty\sin(x^2)dx = \sqrt{\frac{\pi}{8}}.$

They follow from the same basic trick given above for the standard normal density, but as other methods for deriving this result involve even trickier uses of similarly hard techniques (the Residue Theorem, for instance, as given in Remmert’s book), we shall give the proof of this here:

Consider

$\displaystyle F(t)=\int_{0}^\infty e^{-tx^2}\cos(x^2)dx\qquad\text{and}\qquad G(t)=\int_{0}^\infty e^{-tx^2}\sin(x^2)dx.$

Then, the trigonometric identity $\cos(a+b) = \cos a \cos b - \sin a\sin b$ tells us

\displaystyle \begin{aligned} \bigl(F(t)\bigr)^2 - \bigl(G(t)\bigr)^2 &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\cos(x^2)\cos(y^2)dxdy - \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\sin(x^2)\sin(y^2)dxdy\\ &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\cos(x^2+y^2)dxdy\\ &= \int_0^\infty\int_0^{\frac\pi2}e^{-tr^2}\cos(r^2)r\,d\varphi\, dr\\ &= \frac\pi2\cdot\frac12 \int_0^\infty e^{-tu}\cos u\,du, \end{aligned}

using polar coordinates and the substitution $u=r^2$ in the last two steps.

This integral can be evaluated by parts to show

$\displaystyle \left(1+\frac1{t^2}\right)\int_0^\infty e^{-tu}\cos u\,du = \frac1t,$

which means

$\displaystyle \bigl(F(t)\bigr)^2 - \bigl(G(t)\bigr)^2 = \frac\pi4\int_0^\infty e^{-tu}\cos u\,du = \frac\pi4\frac t{t^2+1}.$

Then we consider the product $F(t)G(t)$ and use the identity $\sin(a+b) = \cos a\sin b + \cos b\sin a$, as well as the symmetry of the integrand and integration by parts, to get

\displaystyle \begin{aligned} F(t)G(t) &= \int_0^\infty\int_0^\infty e^{-t(x^2+y^2)}\cos(x^2)\sin(y^2)dxdy\\ &=\int_0^\infty\int_0^ye^{-t(x^2+y^2)}\sin(x^2+y^2)dxdy\\ &=\int_0^\infty\int_{\frac\pi4}^{\frac\pi2}e^{-tr^2}\sin(r^2)rd\varphi dr\\ &=\frac\pi4\int_0^\infty e^{-tr^2}\sin(r^2)r dr\\ &=\frac\pi4\int_0^\infty e^{-tu}\sin u\frac12du\\ &=\frac\pi8\frac1{1+t^2}. \end{aligned}

We thus find by the dominated convergence theorem

\displaystyle \begin{aligned} \left(\int_0^\infty\cos x^2dx\right)^2-\left(\int_0^\infty\sin x^2dx\right)^2 &= \left(\int_0^\infty\lim_{t\downarrow0}e^{-tx^2}\cos x^2dx\right)^2-\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx^2}\sin x^2dx\right)^2 \\ &=\lim_{t\downarrow0}\left(\bigl(F(t)\bigr)^2-\bigl(G(t)\bigr)^2\right)\\ &=\lim_{t\downarrow0}\frac\pi4\frac{t}{t^2+1}\\ &=0, \end{aligned}

and

\displaystyle \begin{aligned} \left(\int_0^\infty\cos x^2dx\right)^2 &= \left(\int_0^\infty\cos x^2dx\right)\left(\int_0^\infty\sin x^2dx\right)\\ &=\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx^2}\cos x^2dx\right)\left(\int_0^\infty\lim_{t\downarrow0}e^{-tx^2}\sin x^2dx\right)\\ &=\lim_{t\downarrow0}F(t)G(t)\\ &=\lim_{t\downarrow0}\frac\pi8\frac1{1+t^2}\\ &=\frac\pi8. \end{aligned}

Both integrals are easily seen to be positive; hence, from the first computation we get

$\int_0^\infty\cos x^2dx = \int_0^\infty\sin x^2dx,$

and from the second computation it follows that both integrals have the value $\sqrt{\frac\pi8}$.

q.e.d. (Fresnel integrals)
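
The two identities $\bigl(F(t)\bigr)^2-\bigl(G(t)\bigr)^2 = \frac\pi4\frac t{t^2+1}$ and $F(t)G(t) = \frac\pi8\frac1{1+t^2}$ can be cross-checked numerically at, say, $t=1$ (quadrature parameters are our own choices; the integrands decay like $e^{-x^2}$, so truncating at $x=8$ is harmless):

```python
# At t = 1: F(1)^2 - G(1)^2 should equal pi/8, and F(1)*G(1) should
# equal pi/16, by the two identities derived above.
import math

def quad(fn, n=200_000, hi=8.0):
    # midpoint rule on [0, hi]
    h = hi / n
    return h * sum(fn((i + 0.5)*h) for i in range(n))

f1 = quad(lambda x: math.exp(-x*x) * math.cos(x*x))   # F(1)
g1 = quad(lambda x: math.exp(-x*x) * math.sin(x*x))   # G(1)

print(f1*f1 - g1*g1, math.pi/8)    # both ~ 0.3927
print(f1*g1, math.pi/16)           # both ~ 0.1963
```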

Even Brouwer’s Fixed Point Theorem may be concluded from the Transformation Formula (amongst a bunch of other theorems, none of which is actually as deep as this one, though). This is worthy of a separate text, mind you.