On approximating functions – Part Two

Last time, we proved the Stone-Weierstrass Theorem on how to find approximations to any continuous function on a compact interval using only a certain algebra of “easy” functions – such as polynomials.

Now that the hard, abstract work is done, we shall note that there are other, less abstract ways to prove Weierstrass’ Approximation Theorem. I am unaware of a less abstract method for the full Stone-Weierstrass Theorem; in particular, the method given above is non-constructive. You could make most of its steps explicit, but the way we used compactness is not constructive in general – how would you choose the finite subcover, when you only know that one must exist?

Let us therefore look at Bernstein Polynomials, which are associated with a given function f on [0,1]:

\displaystyle f_n(x) = \sum_{k=0}^n\binom{n}{k}f\left(\frac kn\right)x^k(1-x)^{n-k}.
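These are easy to compute directly. Here is a quick numerical sketch (the choice of the kink function f(x)=|x-1/2| and of the evaluation grid is purely illustrative):

```python
from math import comb

def bernstein(f, n, x):
    """Evaluate the n-th Bernstein polynomial of f at x in [0,1]."""
    return sum(comb(n, k) * f(k / n) * x**k * (1 - x)**(n - k)
               for k in range(n + 1))

# Illustration with f(x) = |x - 1/2|, continuous but not differentiable at 1/2
f = lambda x: abs(x - 0.5)
grid = [i / 200 for i in range(201)]

def sup_error(n):
    """Sup-norm distance between f and f_n, approximated on the grid."""
    return max(abs(f(x) - bernstein(f, n, x)) for x in grid)
```

The sup-norm error shrinks as n grows, in line with the proposition below – though for a function with a kink it does so rather slowly.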

One can prove:


Proposition: Let f:[0,1]\to\mathbb{R} be continuous. Then the Bernstein polynomials f_n converge uniformly to f.


Proof: Since f is continuous on a compact interval, it is also uniformly continuous. Let \varepsilon>0. Then, there is some \delta>0 with \left|f(x)-f(y)\right|<\varepsilon if \left|x-y\right|<\delta. Besides, for any x\in[0,1],

\displaystyle \left|f(x)-f_n(x)\right| \leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right|x^k(1-x)^{n-k},

where we have used the standard fact that

\displaystyle\sum_{k=0}^n\binom nk x^k(1-x)^{n-k} = \bigl(x+(1-x)\bigr)^n = 1,

which will (if used again and together with uniform continuity) yield

\displaystyle \left|f(x)-f_n(x)\right| \leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right|x^k(1-x)^{n-k}\mathbf{1}_{\left|x-\frac kn\right|\geq\delta} + \varepsilon.

So, we have split our problem into two parts: where an interpolation point \frac kn is close to x, approximation is easy – uniform continuity already gave us the \varepsilon term. Where the interpolation point is far away, we have:

\displaystyle \left|x-\frac kn\right|\geq\delta\quad\iff\quad 1 \leq \frac{\bigl(x-\frac kn\bigr)^2}{\delta^2}.

On top of that, by continuity, f will be bounded by a constant M, say. This gives:

\begin{aligned}    \left|f(x)-f_n(x)\right| &\leq \sum_{k=0}^n\binom{n}{k}\left|f(x)-f\left(\frac kn\right)\right| x^k(1-x)^{n-k}\frac{\bigl(x-\frac kn\bigr)^2}{\delta^2} + \varepsilon \\    &\leq \frac{2M}{\delta^2}\sum_{k=0}^n\binom{n}{k}\left(x-\frac kn\right)^2 x^k(1-x)^{n-k} + \varepsilon.    \end{aligned}

As far as ideas are concerned, we’re done. All that is left is to shove terms and binomial coefficients from left to right and check that they behave properly. We won’t carry that out here, as it’s not particularly hard. One finds:

\displaystyle\left |f(x)-f_n(x)\right| \leq \frac{2M}{\delta^2}\left(x^2-2x^2+x^2+\frac{x(1-x)}{n}\right) + \varepsilon\leq \frac{2M}{\delta^2}\frac1{4n}+\varepsilon < 2\varepsilon,

if n is chosen sufficiently large. q.e.d.
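The computation we skipped boils down to the identity \sum_{k=0}^n\binom nk\left(x-\frac kn\right)^2x^k(1-x)^{n-k}=\frac{x(1-x)}n. It can be verified exactly with rational arithmetic; the particular values of n and the sample points below are arbitrary:

```python
from fractions import Fraction
from math import comb

def second_central_moment(n, x):
    """Exact value of sum_k C(n,k) * (x - k/n)**2 * x**k * (1-x)**(n-k),
    computed with rational arithmetic; x should be a Fraction."""
    return sum(comb(n, k) * (x - Fraction(k, n))**2 * x**k * (1 - x)**(n - k)
               for k in range(n + 1))
```

For every n and rational x tried, this agrees exactly with x(1-x)/n, which is the closed form used in the estimate above.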


There is a variant of the proof which avoids some of these messier calculations that we didn’t pull through above. Well, this variant doesn’t avoid the calculations, it rather shoves them away to another concept: the Weak Law of Large Numbers. We rely on Klenke’s book on Probability for this (but the idea is due to Bernstein).


Alternative Proof: Let x\in[0,1], and let X_1,\ldots,X_n be independent random variables with a Bernoulli distribution with parameter x (the random experiment of throwing a coin whose chance of landing on heads is x). Then, the sum S_n = \sum_{i=1}^nX_i has a binomial distribution B_{n,x}. Therefore,

\displaystyle E\left[f\left(\frac{S_n}n\right)\right] = \sum_{k=0}^nf\left(\frac kn\right)P[S_n=k] = f_n(x).

We find, again with M = \left\|f\right\|_\infty,

\displaystyle \left|f\left(\frac{S_n}n\right) - f(x)\right| \leq \varepsilon + 2M \mathbf{1}_{|\frac{S_n}n-x|\geq\delta}

and hence

\displaystyle \left|f_n(x)-f(x)\right| \leq E\left[\left|f\left(\frac{S_n}n\right)-f(x)\right|\right] \leq \varepsilon + 2MP\left[\left|\frac{S_n}{n}-x\right|\geq\delta\right].

By the Chebyshev Inequality (which is just shy of being the Weak Law of Large Numbers itself), since E[\frac{S_n}n] = x and \mathrm{Var}[\frac{S_n}n] = \frac1{n^2}nx(1-x) = \frac{x(1-x)}n\leq\frac1{4n}, we have

\displaystyle P\left(\left|\frac{S_n}n-x\right|\geq\delta \right) = P\left(\left|\frac{S_n}n-E\left[\frac{S_n}n\right]\right|\geq\delta\right) \leq \mathrm{Var}\left[\frac{S_n}n\right]\delta^{-2} = \frac{1}{4n\delta^2}.


Hence,

\displaystyle \left|f_n(x)-f(x)\right| \leq \varepsilon + \frac{2M}{4\delta^2n} < 2\varepsilon,

if n is taken large enough. q.e.d.
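The Chebyshev bound used above can be compared directly with the exact binomial tail; the parameters n, x and δ below are arbitrary illustrative choices:

```python
from math import comb

def binom_tail(n, x, delta):
    """Exact P(|S_n/n - x| >= delta) for S_n ~ Bin(n, x)."""
    return sum(comb(n, k) * x**k * (1 - x)**(n - k)
               for k in range(n + 1) if abs(k / n - x) >= delta)

n, x, delta = 50, 0.3, 0.1
chebyshev_bound = x * (1 - x) / (n * delta**2)  # = Var[S_n/n] / delta^2
```

The exact tail is always at most the Chebyshev bound, and typically quite a bit smaller.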


These have been two more direct ways of proving the classical Weierstrass Approximation Theorem. This is the route that Werner’s book on Functional Analysis takes, as does Körner’s treatment of Fourier series. The upside is that the proof is direct and simple. The downside is that you won’t get the stronger theorems given at the start of this text; it doesn’t even say anything about trigonometric polynomials.

But, there’s a slight hope: actually, in the non-probabilistic approach, we only needed to prove that Bernstein Polynomials can deal well with the monomials 1, x and x^2 (that’s the part that we have avoided, since there is some computation to be done). Maybe one can generalize this one – and one can: this is Korovkin’s approach.


Theorem (Korovkin): Let T_n:\mathcal{C}_{\mathbb{R}}(K)\to\mathcal{C}_{\mathbb{R}}(K) be a sequence of positive linear operators, and let Q\subset \mathcal{C}_{\mathbb{R}}(K) be a test set that satisfies certain conditions on positivity and on its zero set.

If for any q\in Q, we have \lim_{n\to\infty}\left\|T_nq-q\right\|_\infty=0, then we also have \lim_{n\to\infty}\left\|T_nf-f\right\|_\infty=0 for all f\in\mathcal{C}_{\mathbb{R}}(K).


Remember that an operator is just a function on a function space (plug a function in, get a function out). It is linear in the usual sense of that term, and positive means that T_nf\geq0 whenever f\geq0 on all of K.

Korovkin tells us that you only need to approximate certain test functions well with a cleverly chosen sequence of operators to make the classical Weierstrass Theorem work. This is what the Bernstein polynomials achieve (they are a sequence of operators which only use finitely many values of a function), with the test set Q=\{1, x, x^2\}. You can also prove the theorem on trigonometric polynomials with the operators T_n given by convolution with the Fejér kernel and the test set Q=\{1, \cos, \sin\}. We don’t delve into the details here, especially not into the conditions on the test set Q; let’s just say that such conditions exist and that in the given examples the test sets can’t be made smaller. The detailed proof requires some computation, and you can actually learn many things about Fourier series from it (so go into it; for instance in the books by Werner on Functional Analysis, or in Hämmerlin’s book on Numerics – however, to place a quote by Gauss in here: most readers may well wish in some places for greater clarity in the presentation).

We won’t prove the theorem here because, for the main applications, there is not much to gain on top of what we already have. But the approach has a certain elegance to it, one that is especially appreciated by functional analysts. I am not aware of generalizations of its applicability; I’d be very interested to hear about them.


As a final alternate proof of the full real version of the Stone-Weierstrass Theorem (not just Weierstrass’ Theorem), we shall look at an elementary approach suggested by Brosowski and Deutsch in 1981 (see the original paper here). There is no dealing with convergent series, operators and the like; the proof only hinges on the definitions of compactness and continuity and three easily proved facts: closed subsets of compact spaces are compact, continuous positive functions attain a positive minimum on compact sets, and the Bernoulli inequality (1+h)^n\geq 1+nh for any n\in\mathbb{N} and h\geq-1. We will give two lemmas before we are able to prove the real version of Stone-Weierstrass.

The proof will actually show a little bit more, since we need not suppose that K\subset\mathbb{R}. In fact, Stone-Weierstrass holds for any real-valued function on a compact topological space (it’s just that we didn’t make a fuss about that earlier on – most applications deal with functions on \mathbb{R}, as a matter of fact).


The pointwise nearly indicator Lemma: Let x_0\in K and let U be a neighborhood of x_0. Then, for any \varepsilon>0 there is some neighborhood V(x_0) of x_0 with V\subset U and some function g\in\mathcal{A} such that

  • 0\leq g(x)\leq 1 for x\in K,
  • g(x)<\varepsilon for x\in V(x_0),
  • g(x)>1-\varepsilon for x\in K\setminus U.


The lemma got its name because g serves as an indicator function (or characteristic function for the non-probabilists out there) for the set K\setminus U. Of course, the algebra \mathcal A is not rich enough for a proper indicator function, since that wouldn’t be continuous. But the algebra approximates this indicator of K\setminus U, and we can guarantee the function g to nearly vanish in some neighborhood of the given point x_0. We’ll improve on that soon.


Proof: First of all, we need to find a function that takes values only in the interval [0,1] and already has a controllably small value at x_0. What easier way than to pick that small value to be 0?

For any y\in K\setminus U, by separation, there is some function g_y\in\mathcal{A} with g_y(y)\neq g_y(x_0). Besides, the function h_y(x) = g_y(x)-g_y(x_0) is in \mathcal{A} and of course h_y(y)\neq h_y(x_0) = 0. Also the function p_y(x) = \frac1{\left\|h_y^2\right\|_\infty}h_y^2(x) is in \mathcal{A}; we have p_y(x_0) = 0, p_y(y)>0 and 0\leq p_y\leq 1.

That was the first important step: p_y vanishes at x_0 and doesn’t vanish at some point outside of U. Now, we need to push the “doesn’t vanish” part so that our function takes values near 1, uniformly outside of U.

Consider the set W(y)=\{x\in K : p_y(x)>0\}. This is a neighborhood of y, hence its name. We can look at any of those W(y), whenever y\in K\setminus U. Now, K\setminus U is compact (it’s a closed subset of a compact space), and hence finitely many W(y) will suffice to cover K\setminus U, let’s say K\setminus U \subset\bigcup_{k=1}^m W(y_k). The function p(x) = \frac1m\sum_{k=1}^mp_{y_k}(x) is in \mathcal{A}; we still have 0\leq p\leq1, p(x_0)=0 and p(x)>0 for any x\in K\setminus U.

Because of compactness, p attains a positive minimum on the compact set K\setminus U; let’s call it \delta. Then, for this \delta\in(0,1], we look at the set V=\{x\in K : p(x)<\frac\delta2\}. Obviously, V\subset U and V is a neighborhood of x_0.

Let us look at the integer k = \left\lfloor\frac1\delta\right\rfloor+1, the smallest integer exceeding \frac1\delta. Thus, k-1\leq\frac1\delta<k, and hence 1< k\delta = (k-1)\delta+\delta \leq 1+\delta \leq 2. We can therefore, for every n\in\mathbb{N}, define the functions q_n(x) = \bigl(1-p^n(x)\bigr)^{k^n}, which, of course, are in \mathcal{A}, satisfy 0\leq q_n\leq 1 and have q_n(x_0)=1.

Let x\in V. Then, kp(x)< \frac{k\delta}2\leq 1 and Bernoulli’s inequality (used with n := k^n and h:= -p^n(x)\geq -1) shows q_n(x) \geq 1-k^np^n(x)\geq 1-\left(\frac{k\delta}2\right)^n. Therefore, \lim_{n\to\infty}q_n(x) = 1, and this limit is uniform in x\in V.

Let x\in K\setminus U. Then, kp(x)\geq k\delta> 1 and Bernoulli’s inequality shows

\begin{aligned}    q_n(x) &= \frac{1}{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n} k^np^n(x) \\    &\leq \frac1{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n}\bigl(1+k^np^n(x)\bigr)\\    &\leq \frac{1}{k^np^n(x)}\bigl(1-p^n(x)\bigr)^{k^n}\bigl(1+p^n(x)\bigr)^{k^n} \\    &= \frac{1}{k^np^n(x)}\bigl(1-p^{2n}(x)\bigr)^{k^n} \\    &\leq \frac{1}{k^n\delta^n},    \end{aligned}

and this converges to 0 uniformly.

In particular, for any \varepsilon>0, we can find some n large enough such that q_n\leq\varepsilon on K\setminus U and q_n\geq 1-\varepsilon on V. Now, to conclude the proof, pick V(x_0) = V (remember that V was a neighborhood of x_0 by construction) and g(x) = 1-q_n(x). q.e.d.
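The two limits can be watched numerically. Taking, purely for illustration, p(x)=x with δ=0.3 and k=4 (so that 1 < kδ = 1.2 ≤ 2), the functions q_n(x) = (1-x^n)^{k^n} flatten toward 1 below δ/2 = 0.15 and crash to 0 beyond δ:

```python
def q(n, x, k=4):
    """q_n(x) = (1 - x**n)**(k**n), the peeling function from the lemma,
    for the illustrative choice p(x) = x on [0,1]."""
    return (1.0 - x**n)**(k**n)
```

Already for moderate n the function is essentially an indicator: q(10, 0.10) is very close to 1, while q(10, 0.35) is numerically indistinguishable from 0.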


This one felt quite chore-ish. It was a little messy and technical, but note the necessary tools: no operator theory, no convergence of series, just stuff from high school calculus and the very basics of the theory of compact sets. That would seem like a value in and of itself.


The setwise nearly indicator Lemma: Let A,B\subset K disjoint and closed. Let \varepsilon\in(0,1). Then, there is some g\in\mathcal{A} with

  • 0\leq g(x)\leq 1 for x\in K,
  • g(x)<\varepsilon for x\in A,
  • g(x)>1-\varepsilon for x\in B.


Obviously, this lemma improves on the first one, we have nearly an indicator function on B which nearly vanishes on A. In this respect, it resembles Urysohn’s Lemma from elementary topology, but Urysohn isn’t restricted to the algebra \mathcal A and can hence do even a little better. Never mind.


Proof: This is going to be a lot quicker: you know the drill. Let \varepsilon>0. Let U=K\setminus B and let y\in A. We can choose neighborhoods V(y) as in the previous Lemma. Of course, since A is compact (a closed subset of the compact space K), finitely many of those V(y) suffice to cover A, let’s say A\subset\bigcup_{k=1}^m V(y_k). For each of the V(y_k), there is a function g_k\in\mathcal{A} such that 0\leq g_k\leq 1, g_k(x)>1-\frac\varepsilon m for all x\in K\setminus U = B and g_k(x)<\frac\varepsilon m for all x\in V(y_k).

The function g(x) = \prod_{k=1}^mg_k(x) will now do the trick: Of course, we have 0\leq g\leq1.

Let x\in A\subset\bigcup_{k=1}^mV(y_k), in particular x\in V(y_{n_0}), say. Then

\displaystyle g(x) = \prod_{k=1}^mg_k(x)= g_{n_0}(x)\prod_{k=1, k\neq n_0}^mg_k(x) \leq \frac\varepsilon m\prod_{k=1, k\neq n_0}^m1 \leq\varepsilon.

Now, let x\in B. Then Bernoulli tells us that g(x)=\prod_{k=1}^mg_k(x)\geq\left(1-\frac\varepsilon m\right)^m \geq 1-\varepsilon. Note that Bernoulli requires -\frac\varepsilon m\geq-1, i.e. \varepsilon\leq m – not a hard restriction anyway, and it is guaranteed by the assumption \varepsilon\in(0,1) in the assertion.

That’s all we needed to show. q.e.d.


Note that the notion of compactness has been used slightly differently here than in the very first proof we gave for Stone-Weierstrass. There, we had the Approximation Lemma, which used continuity and neighborhoods on which we could approximate f adequately. Here, we decompose the underlying set K using the level sets of f. In both cases, compactness tells us that finitely many sets suffice – but the nature of those sets is ever so slightly different. The math, however, is not.

Now, let’s harvest the ideas:


Proof of the Stone-Weierstrass, real version:

Let f\in\mathcal{C}_{\mathbb{R}}(K) and let \varepsilon>0. Without loss of generality, we may assume that f\geq 0, since we might as well replace it by f+\left\|f\right\|_\infty. Besides, \varepsilon <\frac13 is not much of a restriction.

Let n\in\mathbb{N} be such that (n-1)\varepsilon>\left\|f\right\|_\infty. Let, for j=0,\ldots,n, A_j and B_j be the sets

\displaystyle A_j = \left\{x\in K : f(x)\leq \left(j-\frac13\right)\varepsilon\right\}

\displaystyle B_j = \left\{x\in K : f(x)\geq \left(j+\frac13\right)\varepsilon\right\}.

Obviously, those sets are disjoint for fixed j, and the A_j are increasing to A_n = K, while the B_j are decreasing to B_n = \emptyset. The setwise nearly indicator Lemma then says that there is some g_j\in\mathcal{A} with 0\leq g_j\leq 1, g_j(x) < \frac\varepsilon n for x\in A_j and g_j(x) > 1-\frac{\varepsilon}{n} for x\in B_j.

That’s all well and good – but how will this help us to prove Stone-Weierstrass? Imagine the space K decomposed like an onion into the A_j. The functions g_j will have very small values on the A_j, but large values outside, in the B_j. We will combine these small and large values properly, to have similar values as the function f in every ring of this onion. Note that the values of f are closely attached to the g_j – by the very definition of A_j and B_j.
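To make the onion picture concrete, here is a small discretized sketch (the sample function, \varepsilon and the grid are all arbitrary choices) that builds the sets A_j, B_j on a grid and checks the claimed structure:

```python
# Discretized onion decomposition for a sample non-negative f on [0,1].
f = lambda x: 4.0 * x * (1.0 - x)           # values in [0, 1]
grid = [i / 400 for i in range(401)]
eps = 0.2
n = 8                                        # (n-1)*eps = 1.4 > max f = 1

# A_j = {f <= (j - 1/3) eps},  B_j = {f >= (j + 1/3) eps}, restricted to the grid
A = [{x for x in grid if f(x) <= (j - 1/3) * eps} for j in range(n + 1)]
B = [{x for x in grid if f(x) >= (j + 1/3) * eps} for j in range(n + 1)]
```

One can then verify that the A_j increase up to the whole space, the B_j decrease down to the empty set, and A_j and B_j are disjoint for each fixed j – exactly the structure the proof exploits.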

Now consider the function g(x) = \varepsilon\sum_{k=0}^n g_k(x), which of course is in \mathcal{A}. Let x\in K; since A_0=\emptyset (recall that f\geq0), there is some j\geq 1 with x\in A_j\setminus A_{j-1}, and so

\displaystyle \left(j-\frac43\right)\varepsilon < f(x) \leq \left(j-\frac13\right)\varepsilon,

and of course g_i(x) < \frac\varepsilon n for all i\geq j, since A_j\subset A_i. Besides, our fixed x\in K lies in B_i for all i\leq j-2, and therefore g_i(x)\geq 1-\frac\varepsilon n for those i. All of this tells us (remember that \varepsilon<\frac13)

\begin{aligned}    g(x) = \varepsilon\sum_{k=0}^{j-1}g_k(x) + \varepsilon\sum_{k=j}^ng_k(x)&\leq j\varepsilon + \varepsilon(n-j+1)\frac{\varepsilon}{n} \\    &\leq j\varepsilon + \varepsilon^2 < j\varepsilon + \frac\varepsilon3 = \left(j+\frac13\right)\varepsilon.    \end{aligned}

Now, if j\geq2,

\begin{aligned}    g(x) \geq \varepsilon\sum_{k=0}^{j-2}g_k(x) &\geq (j-1)\varepsilon \left(1-\frac\varepsilon n\right) \\    &= (j-1)\varepsilon - \left(\frac{j-1}{n}\right)\varepsilon^2 > (j-1)\varepsilon - \varepsilon^2 > \left(j-\frac43\right)\varepsilon.    \end{aligned}

Note that, for j = 1, the bound g(x) > \bigl(j-\frac43\bigr)\varepsilon holds trivially, since its right-hand side is negative while g\geq0.

To conclude, we wish to show something about \left|f(x)-g(x)\right|. Let us consider the case f(x)\geq g(x) first:

\displaystyle\left|f(x)-g(x)\right| = f(x)-g(x) \leq \left(j-\frac13\right)\varepsilon - \left(j-\frac43\right)\varepsilon = \varepsilon.

The other case is f(x) < g(x):

\displaystyle\left|f(x)-g(x)\right| = g(x) - f(x) \leq \left(j+\frac13\right)\varepsilon - \left(j-\frac43\right)\varepsilon = \frac53\varepsilon < 2\varepsilon.

This shows that in every case f and g differ by at most 2\varepsilon, uniformly on K. q.e.d.


To conclude, let us mention a little question related to the classical Weierstrass Approximation Theorem. If we have all monomials x^n at our disposal, we can uniformly approximate any continuous function; but what if some monomials are missing? With how few of them can the Approximation Theorem still hold? The surprisingly neat answer is given in Müntz’ Theorem (sometimes called the Müntz-Szász Theorem, for the sake of more signs on top of the letters):


Theorem (Müntz): Let \lambda_k be a strictly increasing sequence of natural numbers with \lambda_1>0. Then, we have

\displaystyle \overline{\mathrm{span}\{1, x^{\lambda_1},x^{\lambda_2},\ldots\}} = \mathcal{C}_{\mathbb{R}}([0,1]) \qquad\iff\qquad \sum_{k=1}^\infty\lambda_k^{-1}=\infty.


The span is meant as the linear space generated by the given monomials, i.e. the polynomials that can be constructed using only those monomials. Hence, the Weierstrass Theorem can still hold even if infinitely many monomials are forbidden. However, you will always need the monomial x^0=1, since otherwise your approximating functions would all satisfy f(0)=0.

The proof of the theorem is not particularly easy, especially the part showing that the series must diverge if every function can be approximated. For the other direction, there is a very elementary proof that we shall give here because of its inherent beauty:

Proof of Müntz’ Theorem, \Leftarrow-part:

Let \sum_{k=1}^\infty\lambda_k^{-1}=\infty.

Let q\in\mathbb{N} be such that q\notin\{\lambda_1,\lambda_2,\ldots\}. Without loss of generality, we assume that q is smaller than any of the \lambda_k: just take away those \lambda_k which are smaller than q and re-enumerate the \lambda_k starting from k=1; the condition \sum_{k=1}^\infty\lambda_k^{-1}=\infty will then still hold.

We will show that x^q\in\overline{\mathrm{span}\{x^{\lambda_1},x^{\lambda_2},\ldots\}}. This will imply that all monomials x^n are in the closure of that span, and if we have that, Weierstrass tells us that we can approximate any continuous function.

Note that we don’t need the constant function 1 to approximate x^q, since for our purposes the relevant functions all have f(0)=0.

Now consider the function

\displaystyle Q_n(x) := x^q-\sum_{k=1}^na_{k,n}x^{\lambda_k}

with certain a_{k,n}\in\mathbb{R}. First of all, Q_0(x) = x^q, and for an appropriate choice of the a_{k,n}, we find the recursion

\displaystyle Q_n(x) = (\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt.

Let’s prove this by induction:

\begin{aligned}    (\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt &= (\lambda_n-q)x^{\lambda_n}\int_x^1\left(t^q-\sum_{k=1}^{n-1}a_{k,n-1}t^{\lambda_k}\right)t^{-1-\lambda_n}dt\\    &=(\lambda_n-q)x^{\lambda_n}\int_x^1 t^{q-1-\lambda_n}-\sum_{k=1}^{n-1}a_{k,n-1}t^{\lambda_k-1-\lambda_n}dt\\    &=(\lambda_n-q)x^{\lambda_n}\left(\left[\frac1{q-\lambda_n}t^{q-\lambda_n}\right]_x^1\right. \\    &\hphantom{=}\left.- \sum_{k=1}^{n-1}a_{k,n-1}\left[\frac1{\lambda_k-\lambda_n}t^{\lambda_k-\lambda_n}\right]_x^1\right)\\    &=(\lambda_n-q)x^{\lambda_n}\left(\frac1{q-\lambda_n}(1-x^{q-\lambda_n}) \vphantom{\sum_{k=1}^n}\right.\\    &\hphantom{=}-\left.\sum_{k=1}^{n-1}a_{k,n-1}\frac1{\lambda_k-\lambda_n}(1-x^{\lambda_k-\lambda_n})\right)\\    &=x^{\lambda_n}(x^{q-\lambda_n}-1)-(\lambda_n-q)\sum_{k=1}^{n-1}\frac{a_{k,n-1}}{\lambda_k-\lambda_n}\left(x^{\lambda_n}-x^{\lambda_k}\right)\\    &=x^q-x^{\lambda_n}-(\lambda_n-q)\sum_{k=1}^{n-1}\frac{a_{k,n-1}}{\lambda_k-\lambda_n}\left(x^{\lambda_n}-x^{\lambda_k}\right)\\    &=: x^q - \sum_{k=1}^na_{k,n}x^{\lambda_k}.    \end{aligned}

Obviously, one could try to define the a_{k,n} appropriately or to give some closed formula expression; but it doesn’t seem to be useful for the problem at hand.
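Still, the recursion itself is easy to run with exact rational coefficients. Below, Q_n is represented as a map from exponents to coefficients, and one integration step is applied at a time; the choices q = 1 and \lambda_k = 2k (even exponents, with divergent \sum 1/\lambda_k) are purely illustrative:

```python
from fractions import Fraction

def muntz_step(Q, lam, q):
    """One step Q_n = (lam - q) x^lam * int_x^1 Q_{n-1}(t) t^(-1-lam) dt.
    Q is a dict {exponent: coefficient}; each term c * t^e integrates to
    (lam - q) * c / (e - lam) * (x^lam - x^e)."""
    R = {}
    for e, c in Q.items():
        fac = Fraction(lam - q) * c / (e - lam)
        R[lam] = R.get(lam, Fraction(0)) + fac
        R[e] = R.get(e, Fraction(0)) - fac
    return R

def sup_norm(Q, points=201):
    """Sup of |Q| on a grid in [0,1] (a lower bound for the true sup norm)."""
    return max(abs(sum(float(c) * x**e for e, c in Q.items()))
               for x in (i / (points - 1) for i in range(points)))

q = 1
lambdas = [2, 4, 6, 8]           # lambda_k = 2k; sum of reciprocals diverges
Q = {q: Fraction(1)}             # Q_0(x) = x^q
norms, bound = [], 1.0
for lam in lambdas:
    Q = muntz_step(Q, lam, q)
    bound *= abs(1 - q / lam)    # the product bound derived below
    norms.append(sup_norm(Q))
```

One step gives Q_1 = x - x^2, and the sup norms stay below the running product \prod_k|1-q/\lambda_k|, shrinking as more exponents are used.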

Now, we have \left\|Q_0\right\|_\infty = 1, and, pointwise for every x\in[0,1],

\begin{aligned}    \left|Q_n(x)\right| &= \left|(\lambda_n-q)x^{\lambda_n}\int_x^1Q_{n-1}(t)t^{-1-\lambda_n}dt\right| \\    &\leq \left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\, x^{\lambda_n}\int_x^1t^{-1-\lambda_n}dt\\    &=\left|\lambda_n-q\right|\left\|Q_{n-1}\right\|_\infty\, x^{\lambda_n}\,\frac1{\lambda_n}\left(x^{-\lambda_n}-1\right)\\    &=\frac{\left|\lambda_n-q\right|}{\lambda_n}\left\|Q_{n-1}\right\|_\infty\left(1-x^{\lambda_n}\right) \\    &\leq\left|1-\frac q{\lambda_n}\right|\left\|Q_{n-1}\right\|_\infty.    \end{aligned}

(Note that we must keep the factor x^{\lambda_n} paired with the integral: the integral alone is unbounded as x\to0, so splitting off its sup norm would not work.)

This shows

\displaystyle \left\|Q_n\right\|_\infty\leq\prod_{k=1}^n\left|1-\frac q{\lambda_k}\right|,

and hence

\displaystyle \lim_{n\to\infty}\left\|Q_n\right\|_\infty \leq \prod_{k=1}^\infty\left|1-\frac q{\lambda_k}\right|.

But the infinite product is 0: we have 0<\frac q{\lambda_k}<1 by assumption, so the basic inequality 1-x\leq e^{-x} for x>0 gives

\displaystyle \prod_{k=1}^\infty\left|1-\frac q{\lambda_k}\right| \leq \prod_{k=1}^\infty\exp\left(-\frac q{\lambda_k}\right) = \exp\left(-\sum_{k=1}^\infty \frac q{\lambda_k}\right) = 0.

Hence, Q_n vanishes uniformly as n\to\infty, and so x^q is approximated uniformly by polynomials using only the allowed monomials x^{\lambda_k}. This concludes the proof. q.e.d.
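The dichotomy between divergent and convergent \sum\lambda_k^{-1} is visible already in the partial products \prod_k\left|1-\frac q{\lambda_k}\right|. With q=1: for \lambda_k=2k the product drifts to 0, while for \lambda_k=(k+1)^2 it telescopes to the positive limit \frac12 (both choices are just illustrations):

```python
def partial_product(lambdas, q=1):
    """Partial product of |1 - q/lambda_k| over the given exponents."""
    p = 1.0
    for lam in lambdas:
        p *= abs(1.0 - q / lam)
    return p

# Divergent sum of reciprocals: product tends to 0
p_div = partial_product([2 * k for k in range(1, 201)])
# Convergent sum of reciprocals: product stays bounded away from 0
p_conv = partial_product([(k + 1)**2 for k in range(1, 201)])
```

In the second case the factors (1 - 1/m^2) = (m-1)(m+1)/m^2 telescope, which is why the limit \frac12 is exact.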


At first, I was excited because I thought one could reverse this idea to show the other direction of Müntz’ Theorem as well. However, it doesn’t work because of the second-to-last inequality given above: If the product is finite, that doesn’t say anything about how well (or how badly) x^q is approximated. The proof of the reverse direction is harder and I am only aware of deeper ideas from Fourier Analysis (that’s historically the first proof) or using the Hahn-Banach Theorem (as in Rudin’s book).

But the elementary idea given here is actually very neat, even though it doesn’t, to the best of my knowledge, solve any real-world problems. A very pretty academic finger-exercise. And this should be a good place to stop.