# On approximating functions – Part Two

Last time, we proved the Stone-Weierstrass Theorem, which shows how to find approximations to any continuous function on a compact interval using only a certain algebra of “easy” functions – such as polynomials.

Now that the hard, abstract work is done, we should note that there are other, less abstract ways to prove Weierstrass’ Approximation Theorem. I am unaware of a less abstract method for the full Stone-Weierstrass Theorem; in particular, the proof given above is non-constructive. Most of its steps could be carried out explicitly, but the way we used compactness is not constructive in general – how do you choose the finite subcover, when you only know that one exists?

Let us therefore look at Bernstein Polynomials, which are associated with a given function $f$ on $[0,1]$:

$\displaystyle f_n(x) = \sum_{k=0}^n\binom{n}{k}f\left(\frac kn\right)x^k(1-x)^{n-k}.$
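Before proving anything about them, here is a small numerical sketch (Python; my own illustration with a hypothetical test function, not part of the argument) of how these polynomials approach a given $f$:

```python
from math import comb

def bernstein(f, n):
    """Return the n-th Bernstein polynomial of f on [0, 1] as a callable."""
    def fn(x):
        return sum(comb(n, k) * f(k / n) * x**k * (1 - x) ** (n - k)
                   for k in range(n + 1))
    return fn

def sup_error(f, g, grid_size=1000):
    """Approximate the uniform distance between f and g on a grid in [0, 1]."""
    return max(abs(f(x) - g(x))
               for x in (i / grid_size for i in range(grid_size + 1)))

f = lambda x: abs(x - 0.5)   # continuous on [0, 1], but not smooth
errors = [sup_error(f, bernstein(f, n)) for n in (10, 40, 160)]
print(errors)                # the uniform error shrinks as n grows
```

The convergence is slow near the kink at $x=\frac12$ – Bernstein polynomials are not an efficient approximation scheme, but they are an explicit one.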

One can prove:

Proposition: Let $f:[0,1]\to\mathbb{R}$ be continuous. Then the Bernstein Polynomials converge uniformly to $f$.

Proof: Since $f$ is continuous on a compact interval, it is also uniformly continuous. Let $\varepsilon>0$. Then, there is some $\delta>0$ with $\left|f(x)-f(y)\right|<\varepsilon$ if $\left|x-y\right|<\delta$. Besides, for any $x\in[0,1]$,

$\displaystyle \left|f(x)-f_n(x)\right| \leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right|x^k(1-x)^{n-k},$

where we have used the standard fact that

$\displaystyle\sum_{k=0}^n\binom nk x^k(1-x)^{n-k} = \bigl(x+(1-x)\bigr)^n = 1,$

which will (if used again and together with uniform continuity) yield

$\displaystyle \left|f(x)-f_n(x)\right| \leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right|x^k(1-x)^{n-k}\mathbf{1}_{\left|x-\frac kn\right|\geq\delta} + \varepsilon.$

So, we have split our problem into two parts: in the first part, approximating $f$ well is easy because an interpolation point of $f_n$ is close by. In the second part, the interpolation point is far away, meaning:

$\displaystyle \left|x-\frac kn\right|\geq\delta\quad\iff\quad 1 \leq \frac{\bigl(x-\frac kn\bigr)^2}{\delta^2}.$

On top of that, being continuous on a compact set, $f$ is bounded by some constant $M$, say. This gives:

\begin{aligned} \left|f(x)-f_n(x)\right| &\leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right| x^k(1-x)^{n-k}\frac{\bigl(x-\frac kn\bigr)^2}{\delta^2} + \varepsilon \\ &\leq \frac{2M}{\delta^2}\sum_{k=0}^n\binom{n}{k}\left(x-\frac kn\right)^2 x^k(1-x)^{n-k} + \varepsilon. \end{aligned}

As far as ideas are concerned, we’re done. All that is left to do is to shove terms and binomial coefficients from left to right and to find out that they behave properly. We won’t do that here, as it’s not particularly hard. One will find:

$\displaystyle\left |f(x)-f_n(x)\right| \leq \frac{2M}{\delta^2}\left(x^2-2x^2+x^2+\frac{x(1-x)}{n}\right) + \varepsilon\leq \frac{2M}{\delta^2}\frac1{4n}+\varepsilon < 2\varepsilon,$

if $n$ is chosen sufficiently large. q.e.d.
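The shoving of terms we skipped amounts to the identity $\sum_{k=0}^n\binom nk\bigl(x-\frac kn\bigr)^2x^k(1-x)^{n-k}=\frac{x(1-x)}n$, which is at most $\frac1{4n}$. A quick numerical check of this claim (a Python sketch, my own illustration):

```python
from math import comb

def weighted_second_moment(n, x):
    """sum_k C(n,k) * (x - k/n)^2 * x^k * (1-x)^(n-k), term by term."""
    return sum(comb(n, k) * (x - k / n) ** 2 * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))

# compare against the closed form x(1-x)/n on a few sample points
for n in (5, 20, 100):
    for x in (0.1, 0.5, 0.9):
        assert abs(weighted_second_moment(n, x) - x * (1 - x) / n) < 1e-12
print("identity verified on all sample points")
```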

There is a variant of the proof which avoids some of the messier calculations that we didn’t pull through above. Well, this variant doesn’t avoid the calculations so much as shove them off to another concept: the Weak Law of Large Numbers. We rely on Klenke’s book on Probability Theory for this (but the idea is due to Bernstein).

Alternative Proof: Let $x\in[0,1]$, and let $X_1,\ldots,X_n$ be independent random variables with a Bernoulli distribution with parameter $x$ (the random experiment of throwing a coin whose chance of landing on heads is $x$). Then, the sum $S_n = \sum_{i=1}^nX_i$ has a binomial distribution $B_{n,x}$. Therefore,

$\displaystyle E\left[f\left(\frac{S_n}n\right)\right] = \sum_{k=0}^nf\left(\frac kn\right)P[S_n=k] = f_n(x).$
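This identity $E[f(\frac{S_n}n)] = f_n(x)$ can be sanity-checked by simulation; a seeded Monte Carlo sketch in Python (the test function and parameters are my own picks):

```python
import random
from math import comb

random.seed(1)

def bernstein_value(f, n, x):
    """f_n(x) = sum_k C(n,k) f(k/n) x^k (1-x)^(n-k)."""
    return sum(comb(n, k) * f(k / n) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1))

def monte_carlo(f, n, x, samples=100_000):
    """Estimate E[f(S_n/n)] with S_n a sum of n Bernoulli(x) coin flips."""
    total = 0.0
    for _ in range(samples):
        s = sum(1 for _ in range(n) if random.random() < x)
        total += f(s / n)
    return total / samples

f = lambda t: t * t
n, x = 10, 0.3
est, exact = monte_carlo(f, n, x), bernstein_value(f, n, x)
print(est, exact)   # the two agree up to Monte Carlo noise
```

For $f(t)=t^2$ the exact value is $x^2+\frac{x(1-x)}n$, which is precisely the second-moment computation skipped in the first proof.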

We find, again with $M = \left\|f\right\|_\infty$,

$\displaystyle \left|f\left(\frac{S_n}n\right) - f(x)\right| \leq \varepsilon + 2M \mathbf{1}_{|\frac{S_n}n-x|\geq\delta}$

and hence

$\displaystyle \left|f_n(x)-f(x)\right| \leq E\left[\left|f\left(\frac{S_n}n\right)-f(x)\right|\right] \leq \varepsilon + 2MP\left[\left|\frac{S_n}{n}-x\right|\geq\delta\right].$

By the Chebyshev Inequality (which is just shy of the Weak Law of Large Numbers itself), since $E[\frac{S_n}n] = x$ and $\mathrm{Var}[\frac{S_n}n] = \frac1{n^2}nx(1-x)\leq\frac1{4n}$, we have

$\displaystyle P\left(\left|\frac{S_n}n-x\right|\geq\delta \right) = P\left(\left|\frac{S_n}n-E\left[\frac{S_n}n\right]\right|\geq\delta\right) \leq \mathrm{Var}\left[\frac{S_n}n\right]\delta^{-2} = \frac{1}{4n\delta^2}.$

Hence,

$\displaystyle \left|f_n(x)-f(x)\right| \leq \varepsilon + \frac{2M}{4\delta^2n} < 2\varepsilon,$

if $n$ is taken large enough. q.e.d.
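The Chebyshev step can even be checked exactly, since the tail probability is a finite binomial sum; a sketch (the parameters are my own choice):

```python
from math import comb

def tail_prob(n, x, delta):
    """P(|S_n/n - x| >= delta) for S_n ~ Binomial(n, x), as an exact finite sum."""
    return sum(comb(n, k) * x**k * (1 - x) ** (n - k)
               for k in range(n + 1) if abs(k / n - x) >= delta)

n, x, delta = 50, 0.3, 0.1
p, bound = tail_prob(n, x, delta), 1 / (4 * n * delta**2)
print(p <= bound)   # the exact tail never exceeds the Chebyshev bound
```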

These were two more direct ways of proving the classical Weierstrass Approximation Theorem. This is the route taken in Werner’s book on Functional Analysis and in Körner’s treatment of Fourier Series. On the pro side, the proof is direct and simple. The downside, however, is that you don’t get the stronger theorems given at the start of this text; it doesn’t even say anything about trigonometric polynomials.

But there’s a slight hope: in the non-probabilistic approach, we actually only needed that the Bernstein Polynomials deal well with the monomials $1$, $x$ and $x^2$ (that’s the part we avoided, since there is some computation to be done). Maybe one can generalize this – and one can: this is Korovkin’s approach.

Theorem (Korovkin): Let $T_n:\mathcal{C}_{\mathbb{R}}(K)\to\mathcal{C}_{\mathbb{R}}(K)$ be a sequence of positive linear operators, and let $Q\subset \mathcal{C}_{\mathbb{R}}(K)$ be a test set that satisfies certain conditions on positivity and on its zero set.

If for any $q\in Q$, we have $\lim_{n\to\infty}\left\|T_nq-q\right\|_\infty=0$, then we also have $\lim_{n\to\infty}\left\|T_nf-f\right\|_\infty=0$ for all $f\in\mathcal{C}_{\mathbb{R}}(K)$.

Remember that an operator is just a function on a function space (plug a function in, get a function out). Linearity has the usual meaning, and positivity means $T_nf\geq0$ whenever $f\geq0$ on $K$.

Korovkin tells us that we only need to approximate a few well-chosen functions with a cleverly chosen operator to make the classical Weierstrass Theorem work. This is what the Bernstein Polynomials achieve (they are a sequence of positive linear operators using only finitely many values of a function) with the test set $Q=\{1, x, x^2\}$. You can also prove the theorem on trigonometric polynomials with $T_n$ chosen as convolution with the Fejér kernel and the test set $Q=\{1, \cos, \sin\}$. We won’t delve into this here, especially not into the precise conditions on the test set $Q$; let’s just say that such conditions exist, and that in the given examples the test sets can’t be made smaller. The detailed proof requires some computation, and you can actually learn many things about Fourier series from it (so do look into it; for instance in Werner’s book on Functional Analysis, or in Hämmerlin’s book on Numerics – however, to place a quote by Gauss here: most readers may well wish in some places for greater clarity in the presentation).
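For the trigonometric case, the effect on the test set is easy to observe numerically. The Fejér mean $\sigma_nf$ (the average of the first $n+1$ Fourier partial sums) damps the $j$-th Fourier coefficient by the triangular weight $1-\frac{|j|}{n+1}$, so for the test function $\cos$ one gets $\sigma_n\cos=\frac n{n+1}\cos$, hence $\|\sigma_n\cos-\cos\|_\infty=\frac1{n+1}\to0$. A sketch in Python (the discretization is my own choice, not from the text):

```python
import cmath
import math

N = 512                                    # uniform grid on [0, 2*pi)
theta = [2 * math.pi * m / N for m in range(N)]

def fourier_coeff(values, j):
    """j-th Fourier coefficient of a function sampled on the uniform grid."""
    return sum(v * cmath.exp(-1j * j * t) for v, t in zip(values, theta)) / N

def fejer_mean(values, n):
    """Fejér mean: Fourier coefficients damped by the weights 1 - |j|/(n+1)."""
    coeffs = {j: fourier_coeff(values, j) for j in range(-n, n + 1)}
    return [sum((1 - abs(j) / (n + 1)) * coeffs[j] * cmath.exp(1j * j * t)
                for j in range(-n, n + 1)).real
            for t in theta]

cos_values = [math.cos(t) for t in theta]
n = 9
err = max(abs(s - c) for s, c in zip(fejer_mean(cos_values, n), cos_values))
print(err)   # 1/(n+1) = 0.1, up to round-off
```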

We won’t prove it here, because for the main applications there is not much to gain on top of what we already have. But the approach has a certain elegance to it, which is especially appreciated by functional analysts. I am not aware of generalizations of its applicability, and I’d be very interested to hear about them.

As a final alternate proof of the full real version of the Stone-Weierstrass Theorem (not just Weierstrass’ Theorem), we shall look at an elementary approach suggested by Brosowski and Deutsch in 1981 (see their original paper). There is no dealing with convergent series, operators and the like; the proof hinges only on the definitions of compactness and continuity and on three easily proved facts: closed subsets of compact spaces are compact, continuous positive functions attain a positive minimum on compact sets, and the Bernoulli inequality $(1+h)^n\geq 1+nh$ for any $n\in\mathbb{N}$ and $h\geq-1$. We will give two lemmas before we are able to prove the real version of Stone-Weierstrass.

The proof will actually show a little more, since we need not suppose that $K\subset\mathbb{R}$. In fact, Stone-Weierstrass holds for real-valued functions on any compact topological space (we just didn’t make a fuss about that earlier on – most applications deal with functions on $\mathbb{R}$, as a matter of fact).

The pointwise nearly indicator Lemma: Let $x_0\in K$ and let $U$ be a neighborhood of $x_0$. Then, for any $\varepsilon>0$ there is some neighborhood $V(x_0)$ of $x_0$ with $V\subset U$ and some function $g\in\mathcal{A}$ such that

• $0\leq g(x)\leq 1$ for $x\in K$,
• $g(x)<\varepsilon$ for $x\in V(x_0)$,
• $g(x)>1-\varepsilon$ for $x\in K\setminus U$.

The lemma got its name because $g$ serves as an indicator function (or characteristic function for the non-probabilists out there) for the set $K\setminus U$. Of course, the algebra $\mathcal A$ is not rich enough for a proper indicator function, since that wouldn’t be continuous. But the algebra approximates this indicator of $K\setminus U$, and we can guarantee the function $g$ to nearly vanish in some neighborhood of the given point $x_0$. We’ll improve on that soon.

Proof: First of all, we need to find a function that takes its values only in the interval $[0,1]$ and which already has a controllably small value at $x_0$. What easier small value to pick than $0$?

For any $y\in K\setminus U$, by separation, there is some function $g_y\in\mathcal{A}$ with $g_y(y)\neq g_y(x_0)$. Besides, the function $h_y(x) = g_y(x)-g_y(x_0)$ is in $\mathcal{A}$ and of course $h_y(y)\neq h_y(x_0) = 0$. Even the function $p_y(x) = \frac1{\left\|h_y^2\right\|_\infty}h_y^2(x)$ is in $\mathcal{A}$, we have $p_y(x_0) = 0$, $p_y(y)>0$ and $0\leq p_y\leq 1$.

That was the first important step: $p_y$ vanishes at $x_0$ but not at some given point outside of $U$. Now, we need to push the “doesn’t vanish” part to make our function take values near $1$, uniformly outside of $U$.

Consider the set $W(y)=\{x\in K : p_y(x)>0\}$. This is a neighborhood of $y$, hence its name. We can look at any of those $W(y)$, whenever $y\in K\setminus U$. Now, $K\setminus U$ is compact (it’s a closed subset of a compact space), and hence finitely many $W(y)$ will suffice to cover $K\setminus U$, let’s say $K\setminus U \subset\bigcup_{k=1}^m W(y_k)$. The function $p(x) = \frac1m\sum_{k=1}^mp_{y_k}(x)$ is in $\mathcal{A}$; we still have $0\leq p\leq1$, $p(x_0)=0$ and $p(x)>0$ for any $x\in K\setminus U$.

By compactness, $p$ attains a positive minimum on the compact set $K\setminus U$; let’s call it $\delta$. For this $\delta\in(0,1]$, we look at the set $V=\{x\in K : p(x)<\frac\delta2\}$. Since $p\geq\delta$ on $K\setminus U$, we have $V\subset U$, and $V$ is a neighborhood of $x_0$.

Let us look at the smallest integer $k$ with $k\delta>1$, that is, $k = \left\lfloor \frac1\delta\right\rfloor+1$. Thus, $k-1\leq\frac1\delta$ and $k\leq\frac{1+\delta}{\delta} \leq \frac2\delta$. In total, $1< k\delta \leq 2$. We can therefore, for every $n\in\mathbb{N}$, define the functions $q_n(x) = \bigl(1-p^n(x)\bigr)^{k^n}$, which, of course, are in $\mathcal{A}$, satisfy $0\leq q_n\leq 1$ and have $q_n(x_0)=1$.

Let $x\in V$. Then, $kp(x)< \frac{k\delta}2\leq 1$, and Bernoulli’s inequality (applied with exponent $k^n$ and $h:= -p^n(x)\geq -1$) shows $q_n(x) \geq 1-k^np^n(x)\geq 1-\left(\frac{k\delta}2\right)^n$. Therefore, $\lim_{n\to\infty}q_n(x) = 1$, and this limit is uniform in $x\in V$.

Let $x\in K\setminus U$. Then, $kp(x)\geq k\delta> 1$, and Bernoulli’s inequality shows

\begin{aligned} q_n(x) &= \frac{1}{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n} k^np^n(x) \\ &\leq \frac1{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n}\bigl(1+k^np^n(x)\bigr)\\ &\leq \frac{1}{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n}\bigl(1+p^n(x)\bigr)^{k^n} \\ &= \frac{1}{k^np^n(x)}\bigl(1-p^{2n}(x)\bigr)^{k^n} \\ &\leq \frac{1}{k^n\delta^n}, \end{aligned}

and this converges to $0$ uniformly.

In particular, for any $\varepsilon>0$, we can find some $n$ large enough such that $q_n\leq\varepsilon$ on $K\setminus U$ and $q_n\geq 1-\varepsilon$ on $V$. Now, to conclude the proof, pick $V(x_0) = V$ (remember that $V$ was a neighborhood of $x_0$ by construction) and $g(x) = 1-q_n(x)$. q.e.d.
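To see the construction in action, here is a toy instance (entirely my own choice, not from the paper): $K=[0,1]$, $x_0=0$, $U=[0,\frac12)$ and $p(x)=x$. Then $\delta=\frac12$, the smallest integer with $k\delta>1$ is $k=3$, and $q_n(x)=(1-x^n)^{3^n}$:

```python
def q(n, x, k=3):
    """q_n(x) = (1 - p(x)^n)^(k^n), with the toy choice p(x) = x on [0, 1]."""
    return (1 - x**n) ** (k**n)

n = 12
grid = [i / 1000 for i in range(1001)]
# on V = {p < delta/2} = [0, 1/4), q_n is pushed towards 1 ...
near_one = min(q(n, x) for x in grid if x <= 0.25)
# ... while outside U, i.e. on [1/2, 1], it is pushed towards 0
near_zero = max(q(n, x) for x in grid if x >= 0.5)
print(near_one, near_zero)
```

The huge exponent $k^n$ is exactly what flips the behavior: where $p$ is small, $(1-p^n)^{k^n}$ stays near $1$; where $kp>1$, it collapses to $0$.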

This one felt like quite a chore. It was a little messy and technical, but note the tools involved: no operator theory, no convergence of series – just high-school calculus and the very basics of the theory of compact sets. That seems like a value in and of itself.

The setwise nearly indicator Lemma: Let $A,B\subset K$ be disjoint and closed. Let $\varepsilon\in(0,1)$. Then, there is some $g\in\mathcal{A}$ with

• $0\leq g(x)\leq 1$ for $x\in K$,
• $g(x)<\varepsilon$ for $x\in A$,
• $g(x)>1-\varepsilon$ for $x\in B$.

Obviously, this lemma improves on the first one: we now have nearly an indicator function of $B$ which nearly vanishes on $A$. In this respect, it resembles Urysohn’s Lemma from elementary topology – but Urysohn isn’t restricted to the algebra $\mathcal A$ and can hence do even a little better. Never mind.

Proof: This is going to be a lot quicker: you know the drill. Let $\varepsilon>0$. Let $U=K\setminus B$ and let $y\in A$. We can choose neighborhoods $V(y)$ as in the previous Lemma. Of course, since $A$ is compact, finitely many of those $V(y)$ suffice to cover $A$, let’s say $A\subset\bigcup_{k=1}^m V(y_k)$. For each of the $V(y_k)$, there is a function $g_k\in\mathcal{A}$ such that $0\leq g_k\leq 1$, $g_k(x)>1-\frac\varepsilon m$ for all $x\in K\setminus U = B$ and $g_k(x)<\frac\varepsilon m$ for all $x\in V(y_k)$.

The function $g(x) = \prod_{k=1}^mg_k(x)$ will now do the trick: Of course, we have $0\leq g\leq1$.

Let $x\in A\subset\bigcup_{k=1}^mV(y_k)$, in particular $x\in V(y_{n_0})$, say. Then

$\displaystyle g(x) = \prod_{k=1}^mg_k(x)= g_{n_0}(x)\prod_{k=1, k\neq n_0}^mg_k(x) \leq \frac\varepsilon m\prod_{k=1, k\neq n_0}^m1 \leq\varepsilon.$

Now, let $x\in B$. Then Bernoulli tells us that $g(x)=\prod_{k=1}^mg_k(x)\geq\left(1-\frac\varepsilon m\right)^m \geq 1-\varepsilon$. Note that Bernoulli requires $-\frac\varepsilon m\geq-1$, i.e. $\varepsilon\leq m$ – which holds automatically here, since $\varepsilon<1\leq m$ by the assumption in the statement.

That’s all we needed to show. q.e.d.

Note that the notion of compactness has been used slightly differently from the very first proof we gave for Stone-Weierstrass. There, we had the Approximation Lemma, which used continuity and neighborhoods on which we could approximate $f$ adequately. Here, we have decomposed the underlying set $K$ using the level sets of $f$. Each time, compactness tells us that we can make do with finitely many sets – but the nature of those sets is ever so slightly different. The math, however, is not.

Now, let’s harvest the ideas:

Proof of the Stone-Weierstrass, real version:

Let $f\in\mathcal{C}_{\mathbb{R}}(K)$ and let $\varepsilon>0$. Without loss of generality, we may assume that $f\geq 0$, since we might as well replace it by $f+\left\|f\right\|_\infty$. Besides, $\varepsilon <\frac13$ is not much of a restriction.

Let $n\in\mathbb{N}$ be such that $(n-1)\varepsilon>\left\|f\right\|_\infty$. Let, for $j=0,\ldots,n$, $A_j$ and $B_j$ be the sets

$\displaystyle A_j = \left\{x\in K : f(x)\leq \left(j-\frac13\right)\varepsilon\right\}$

$\displaystyle B_j = \left\{x\in K : f(x)\geq \left(j+\frac13\right)\varepsilon\right\}.$

Obviously, those sets are disjoint for fixed $j$, and the $A_j$ are increasing to $A_n = K$, while the $B_j$ are decreasing to $B_n = \emptyset$. The setwise nearly indicator Lemma then says that there is some $g_j\in\mathcal{A}$ with $0\leq g_j\leq 1$, $g_j(x) < \frac\varepsilon n$ for $x\in A_j$ and $g_j(x) > 1-\frac{\varepsilon}{n}$ for $x\in B_j$.

That’s all well and good – but how will this help us to prove Stone-Weierstrass? Imagine the space $K$ decomposed like an onion into the $A_j$. The functions $g_j$ will have very small values on the $A_j$, but large values outside, in the $B_j$. We will combine these small and large values properly, to have similar values as the function $f$ in every ring of this onion. Note that the values of $f$ are closely attached to the $g_j$ – by the very definition of $A_j$ and $B_j$.

Now consider the function $g(x) = \varepsilon\sum_{j=0}^n g_j(x)$, which of course is in $\mathcal{A}$. Let $x\in K$. Since $A_0=\emptyset$ (recall $f\geq0$) and $A_n=K$, there is some $j\geq 1$ with $x\in A_j\setminus A_{j-1}$, and so

$\displaystyle \left(j-\frac43\right)\varepsilon < f(x) \leq \left(j-\frac13\right)\varepsilon$,

and of course $g_i(x) < \frac\varepsilon n$ for all $i\geq j$, since $A_j\subset A_i$. Besides, our fixed $x\in K$ lies in $B_i$ for all $i\leq j-2$, because then $f(x)>\left(j-\frac43\right)\varepsilon\geq\left(i+\frac13\right)\varepsilon$; therefore, $g_i(x)> 1-\frac\varepsilon n$ for those $i$. All of this tells us (remember that $\varepsilon<\frac13$)

\begin{aligned} g(x) = \varepsilon\sum_{k=0}^{j-1}g_k(x) + \varepsilon\sum_{k=j}^ng_k(x)&\leq j\varepsilon + \varepsilon(n-j+1)\frac{\varepsilon}{n} \\ &\leq j\varepsilon + \varepsilon^2 < j\varepsilon + \frac\varepsilon3 = \left(j+\frac13\right)\varepsilon. \end{aligned}

Now, if $j\geq2$,

\begin{aligned} g(x) \geq \varepsilon\sum_{k=0}^{j-2}g_k(x) &\geq (j-1)\varepsilon \left(1-\frac\varepsilon n\right) \\ &= (j-1)\varepsilon - \left(\frac{j-1}{n}\right)\varepsilon^2 > (j-1)\varepsilon - \varepsilon^2 > \left(j-\frac43\right)\varepsilon. \end{aligned}

Note that, for $j = 1$, the bound $g(x) > \bigl(j-\frac43\bigr)\varepsilon$ holds trivially, since $g\geq0$ while $\bigl(1-\frac43\bigr)\varepsilon<0$.

To conclude, we wish to show something about $\left|f(x)-g(x)\right|$. Let us consider the case $f(x)\geq g(x)$ first:

$\displaystyle\left|f(x)-g(x)\right| = f(x)-g(x) \leq \left(j-\frac13\right)\varepsilon - \left(j-\frac43\right)\varepsilon = \varepsilon.$

The other case is $f(x) < g(x)$:

$\displaystyle\left|f(x)-g(x)\right| = g(x) - f(x) \leq \left(j+\frac13\right)\varepsilon - \left(j-\frac43\right)\varepsilon = \frac53\varepsilon \leq 2\varepsilon.$

This shows that, in every case, $f$ and $g$ differ by at most $2\varepsilon$, uniformly on $K$. q.e.d.

To conclude, let us mention a little question related to the classical Weierstrass Approximation Theorem. If we have all monomials $x^n$ at our disposal, we can uniformly approximate any continuous function; but what if some monomials are missing? With how few of them can the Approximation Theorem still hold? The surprisingly neat answer is given by Müntz’ Theorem (sometimes called the Müntz-Szász Theorem, for the sake of more signs on top of the letters):

Theorem (Müntz): Let $\lambda_k$ be a strictly increasing sequence of natural numbers with $\lambda_1>0$. Then, we have

$\displaystyle \overline{\mathrm{span}\{1, x^{\lambda_1},x^{\lambda_2},\ldots\}} = \mathcal{C}_{\mathbb{R}}([0,1]) \qquad\iff\qquad \sum_{k=1}^\infty\lambda_k^{-1}=\infty.$

The span is meant as the linear space generated by the given monomials, i.e. the polynomials that can be built using only those monomials. Hence, the Weierstrass Theorem can still hold even if infinitely many monomials are forbidden. However, you will always need the monomial $x^0=1$, since otherwise your approximating functions would all satisfy $f(0)=0$.

The proof of that theorem is not particularly easy, especially the part where you show that the series must diverge if every function can be approximated. For the other direction there is a very elementary proof, which we shall give here because of its inherent beauty:

Proof of Müntz’ Theorem, $\Leftarrow$-part:

Let $\sum_{k=1}^\infty\lambda_k^{-1}=\infty$.

Let $q\in\mathbb{N}$ be such that $q\notin\{\lambda_1,\lambda_2,\ldots\}$. Without loss of generality, we assume that $q$ is smaller than any of the $\lambda_k$: just take away those $\lambda_k$ which are smaller than $q$ and re-enumerate the $\lambda_k$ starting from $k=1$; the condition $\sum_{k=1}^\infty\lambda_k^{-1}=\infty$ will then still hold.

We will show that $x^q\in\overline{\mathrm{span}\{x^{\lambda_1},x^{\lambda_2},\ldots\}}$. This will imply that all monomials $x^n$ are in the closure of that span, and if we have that, Weierstrass tells us that we can approximate any continuous function.

Note that we don’t need the constant function $1$ to approximate $x^q$, since for our purposes the relevant functions all have $f(0)=0$.

Now consider the function

$\displaystyle Q_n(x) := x^q-\sum_{k=1}^na_{k,n}x^{\lambda_k}$

with certain $a_{k,n}\in\mathbb{R}$. First of all, $Q_0(x) = x^q$, and with a suitable choice of the $a_{k,n}$, we get the recursion

$\displaystyle Q_n(x) = (\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt.$

Let’s prove this by induction:

\begin{aligned} (\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt &= (\lambda_n-q)x^{\lambda_n}\int_x^1\left(t^q-\sum_{k=1}^{n-1}a_{k,n-1}t^{\lambda_k}\right)t^{-1-\lambda_n}dt\\ &=(\lambda_n-q)x^{\lambda_n}\int_x^1 t^{q-1-\lambda_n}-\sum_{k=1}^{n-1}a_{k,n-1}t^{\lambda_k-1-\lambda_n}dt\\ &=(\lambda_n-q)x^{\lambda_n}\left(\left[\frac{t^{q-\lambda_n}}{q-\lambda_n}\right]_x^1 - \sum_{k=1}^{n-1}a_{k,n-1}\left[\frac{t^{\lambda_k-\lambda_n}}{\lambda_k-\lambda_n}\right]_x^1\right)\\ &=(\lambda_n-q)x^{\lambda_n}\left(\frac{1-x^{q-\lambda_n}}{q-\lambda_n} - \sum_{k=1}^{n-1}\frac{a_{k,n-1}}{\lambda_k-\lambda_n}\bigl(1-x^{\lambda_k-\lambda_n}\bigr)\right)\\ &=x^{\lambda_n}\bigl(x^{q-\lambda_n}-1\bigr)-(\lambda_n-q)\sum_{k=1}^{n-1}\frac{a_{k,n-1}}{\lambda_k-\lambda_n}\bigl(x^{\lambda_n}-x^{\lambda_k}\bigr)\\ &=x^q-x^{\lambda_n}-(\lambda_n-q)\sum_{k=1}^{n-1}\frac{a_{k,n-1}}{\lambda_k-\lambda_n}\bigl(x^{\lambda_n}-x^{\lambda_k}\bigr)\\ &=: x^q - \sum_{k=1}^na_{k,n}x^{\lambda_k}. \end{aligned}

Obviously, one could try to define the $a_{k,n}$ explicitly or to give some closed-form expression; but that doesn’t seem useful for the problem at hand.
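Still, reading off coefficients from the recursion gives $a_{k,n} = -\frac{\lambda_n-q}{\lambda_k-\lambda_n}\,a_{k,n-1}$ for $k<n$ and $a_{n,n} = 1+(\lambda_n-q)\sum_{k<n}\frac{a_{k,n-1}}{\lambda_k-\lambda_n}$, which is enough for a numerical experiment. A Python sketch with my own toy choices $q=1$ and $\lambda_k=k+1$, for which the norm bound derived below becomes $\prod_{k\leq n}\bigl(1-\frac1{\lambda_k}\bigr)=\frac1{n+1}$:

```python
def muntz_coeffs(lams, q, n):
    """Coefficients a_{k,n} with Q_n(x) = x^q - sum_{k<=n} a_{k,n} x^{lams[k-1]}."""
    a = []                                   # a[k-1] holds a_{k,m} for the current m
    for m in range(1, n + 1):
        lam = lams[m - 1]
        s = sum(a[k] / (lams[k] - lam) for k in range(m - 1))
        a = [-(lam - q) * a[k] / (lams[k] - lam) for k in range(m - 1)]
        a.append(1 + (lam - q) * s)
    return a

def sup_norm_Q(lams, q, n, grid=2000):
    """Sup norm of Q_n on [0, 1], approximated on a uniform grid."""
    a = muntz_coeffs(lams, q, n)
    return max(abs(x**q - sum(c * x ** lams[k] for k, c in enumerate(a)))
               for x in (i / grid for i in range(grid + 1)))

q, lams = 1, list(range(2, 22))              # lambda_k = k + 1 for k = 1, 2, ...
norms = [sup_norm_Q(lams, q, n) for n in (1, 5, 10)]
print(norms)   # each entry sits below the corresponding bound 1/(n+1)
```

For instance, $Q_1(x)=x-x^2$ and $Q_2(x)=x-2x^2+x^3=x(1-x)^2$, with sup norms $\frac14$ and $\frac4{27}$, below the bounds $\frac12$ and $\frac13$.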

Now, we have $\left\|Q_0\right\|_\infty = 1$, and

\begin{aligned} \left\|Q_n\right\|_\infty &= \left\|(\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt\right\|_\infty \\ &\leq \left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\left\|x^{\lambda_n}\int_x^1t^{-1-\lambda_n}dt\right\|_\infty\\ &=\left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\left\|\frac{x^{\lambda_n}}{\lambda_n}\bigl(x^{-\lambda_n}-1\bigr)\right\|_\infty\\ &=\frac{\left|\lambda_n-q\right|}{\lambda_n}\left\|Q_{n-1}\right\|_\infty\left\|1-x^{\lambda_n}\right\|_\infty \\ &=\left|1-\frac q{\lambda_n}\right|\left\|Q_{n-1}\right\|_\infty. \end{aligned}

This shows

$\displaystyle \left\|Q_n\right\|_\infty\leq\prod_{k=1}^n\left|1-\frac q{\lambda_k}\right|,$

and hence

$\displaystyle \lim_{n\to\infty}\left\|Q_n\right\|_\infty \leq \prod_{k=1}^\infty\left|1-\frac q{\lambda_k}\right|$.

But the infinite product is $0$: we have $0<\frac q{\lambda_k}<1$ by assumption, and the basic inequality $1-x\leq e^{-x}$ for $x>0$ yields

$\displaystyle \prod_{k=1}^\infty\left|1-\frac q{\lambda_k}\right| \leq \prod_{k=1}^\infty\exp\left(-\frac q{\lambda_k}\right) = \exp\left(-\sum_{k=1}^\infty \frac q{\lambda_k}\right) = 0.$

Hence, $\left\|Q_n\right\|_\infty\to0$ as $n\to\infty$, and so $x^q$ is uniformly approximated by polynomials using only the allowed monomials $x^{\lambda_k}$. This concludes the proof. q.e.d.
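The dichotomy driving this argument is easy to see numerically: with the toy choice $\lambda_k=k+1$ (divergent sum of reciprocals) the partial products collapse to $0$, while with $\lambda_k=k^2+1$ (convergent sum) they stabilize at a positive value. A sketch (both sequences are my own picks):

```python
def partial_product(lams, q, n):
    """prod_{k <= n} |1 - q/lambda_k| -- the bound on ||Q_n|| from above."""
    p = 1.0
    for lam in lams[:n]:
        p *= abs(1 - q / lam)
    return p

dense = [k + 1 for k in range(1, 2001)]       # sum of 1/lambda_k diverges
sparse = [k * k + 1 for k in range(1, 2001)]  # sum of 1/lambda_k converges
print(partial_product(dense, 1, 2000), partial_product(sparse, 1, 2000))
```

For the dense sequence the product telescopes to $\frac1{2001}$; for the sparse one it stays bounded away from $0$, so the norm bound on $Q_n$ stops being useful.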

At first, I was excited because I thought one could reverse this idea to show the other direction of Müntz’ Theorem as well. However, it doesn’t work, because of the second-to-last inequality given above: if the product is positive, that doesn’t say anything about how well (or how badly) $x^q$ is approximated. The proof of the reverse direction is harder, and I am only aware of deeper ideas from Fourier Analysis (that’s historically the first proof) or of the use of the Hahn-Banach Theorem (as in Rudin’s book).

But the elementary idea given here is actually very neat, even though it doesn’t, to the best of my knowledge, solve any real-world problems. A very pretty academic finger-exercise. And this should be a good place to stop.