How to find the Median efficiently

In the past few months I have spent some time on the in-depth analysis of Quicksort. Even the average-case analysis as it’s laid out in Knuth’s book The Art of Computer Programming took me a little while to digest, and now I have looked at a very closely related algorithm.

Consider the problem of finding the t-th largest element in a set of n distinct real numbers. We shall use the ideas of the Quicksort algorithm to solve this task, and ask the question: how many comparisons are needed on the average? The corresponding algorithm is sometimes called “Quickfind” or “Quickselect”.

This problem is taken from an exercise in Knuth’s book (Sect. 5.2.2, Ex. 32). The exercises are rated according to the effort needed to solve them: an exercise rated “00” is immediately clear, while “50” is a research problem, and the interpolation takes place on a logarithmic scale, meaning that exercises between “20” and “30” are challenging to me, but can be done with a little effort. The exercise solved here is rated “40”, and rightly so, as we shall see. Knuth’s published solution is quite brief: the basic idea of the recursion is stated, followed by the final result. For details, Knuth cites his own paper “Mathematical Analysis of Algorithms” from the Proceedings of IFIP Congress 71 (1972), reprinted in his “Selected Papers on Analysis of Algorithms”. In what follows, we will rely heavily on this paper, even though there is a slight mismatch in how Quickfind is programmed there.

First, let’s have a brief look at the Quicksort algorithm, designed by Hoare in the early 1960s as a sorting algorithm. You pick a certain element of the set of real numbers and call it the pivot. You scan the set from left to right looking for elements greater than the pivot, then you scan from right to left looking for elements less than the pivot. When you find two such elements, you swap them and continue scanning. Once every element has been seen, you stop and put the pivot at the point where the scans have crossed. Thus, you have placed the pivot in its correct position and partitioned the set into two parts: one containing only elements less than the pivot, the other only elements greater than the pivot. Then you continue in the new, smaller parts of the set. In the end, you have sorted the entire set (quite efficiently in most cases).

In our present task, we can save some of the trouble of Quicksort. We don’t want to sort the entire set, we’re only looking for the t-th largest element. So, after the partitioning phase, we have to look at the position where the pivot has come up. If the pivot-position is smaller than t, we’ll have to look at the right-hand side of our set; if it is larger than t, we’ll look at the left-hand side. The other side is unimportant for the task at hand. That saves, heuristically, half the effort of Quicksort.
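The selection procedure just described can be sketched in a few lines of Python. This is only a hedged sketch, not the program analyzed here: it uses a simple one-pass split and a random pivot instead of Hoare’s two-scan partition, so the comparison counts differ slightly, but the recursive structure (recurse only into the side containing the sought position) is the same.

```python
import random

def quickselect(a, t):
    """Return the t-th largest (1-indexed) of the distinct numbers in a."""
    pivot = random.choice(a)
    larger = [x for x in a if x > pivot]    # elements beating the pivot
    smaller = [x for x in a if x < pivot]
    if t <= len(larger):                    # answer lies among the larger ones
        return quickselect(larger, t)
    if t == len(larger) + 1:                # the pivot itself has rank t
        return pivot
    return quickselect(smaller, t - len(larger) - 1)  # look among the smaller

print(quickselect([3, 1, 4, 1.5, 9, 2, 6], 3))  # 4
```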

It is well known that Quicksort takes O(n\log n) operations to sort a set of n elements. Quickfind, however, does better: it’s O(n) on average. A shallow argument goes like this: assuming that you can split the set in halves each time, there will be roughly n+\frac12n+\frac14n+\frac18n+\cdots=n\frac1{1-\frac12}=2n operations. Of course, the set will not split this neatly every time, and you cannot simply assume k operations on a set of k elements (that’s exactly what we want to prove, mind you). So, we’ll have to look a little deeper right away.
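The halving heuristic is easy to illustrate numerically (the concrete numbers are for illustration only):

```python
# If every partitioning pass halves the subfile, the total work on n
# elements is n + n/2 + n/4 + ... < 2n.
n = 1024
total = sum(n // 2**k for k in range(11))  # 11 terms: 1024 down to 1
print(total)  # 2047, just below 2n = 2048
```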

Before we start, we note that in special cases one can obviously do without Quickfind: if you look for the largest element (the case t=1), a single scan tracking the running maximum will do, without the overhead of Quickfind. More generally, t such scans find the t-th largest element, and for small t this beats Quickfind; there will be a break-even point for your engine beyond which Quickfind is superior. Besides, note the symmetry: whatever we say about maxima applies to minima in exactly the same way.
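For comparison, here is a sketch of that naive repeated-scan approach (the function name is my own): for t=1 it degenerates to a single pass, and in general it costs on the order of tn comparisons.

```python
def tth_largest_by_scanning(a, t):
    # t passes over the data, each discarding the current maximum:
    # roughly t * n comparisons, so only worthwhile for small t.
    a = list(a)              # work on a copy
    for _ in range(t - 1):
        a.remove(max(a))
    return max(a)

print(tth_largest_by_scanning([3, 1, 4, 1.5, 9, 2, 6], 2))  # 6
```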

Let us denote by C_{nt} the average number of comparisons needed to find the t-th largest element in a set of n real numbers. Obviously, we need 1\leq t\leq n, and C_{11} = 0. In what follows, we will prove:

\displaystyle C_{nt} = 2(n+1)H_n-2(n-t+2)H_{n-t+1}-2(t+1)H_t+2n+\frac{10}3-\frac13\delta_{t1}-\frac13\delta_{nt}-\frac23\delta_{n1}.

Here, H_n=\sum_{i=1}^n\frac 1i is the n-th harmonic number, and \delta_{ij} is Kronecker’s symbol: \delta_{ij}=1\iff i=j.

Recall that H_n = \log n+\gamma+\frac1{2n}+O(n^{-2}), where \gamma is the Euler–Mascheroni constant. In the end, we shall have a particular look at the case t\approx\frac n2 and consider a Taylor expansion of the closed-form expression.
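This approximation is quickly checked numerically (a throwaway sketch, with γ hard-coded):

```python
from math import log

gamma = 0.5772156649015329   # Euler–Mascheroni constant

def H(n):
    # n-th harmonic number, H_n = 1 + 1/2 + ... + 1/n
    return sum(1 / i for i in range(1, n + 1))

n = 1000
print(H(n))
print(log(n) + gamma + 1 / (2 * n))  # agrees with H(n) to about seven decimals
```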

Let us start with a recursive expression for C_{nt}. During the first partitioning phase, you have to compare every element of the set against the pivot (that makes n-1 comparisons, because you won’t compare the pivot to itself), and when the scans cross each other, you inevitably scan two elements twice. Hence, this first partitioning costs n+1 comparisons. Now the pivot is in its proper position, and three cases are possible:

  1. the pivot is in position t, so we are done with no more comparisons necessary.
  2. the pivot is in a position k<t, say. So, we have to continue with the elements in positions k+1,\ldots,n. There are n-k of them, and we will look for the (t-k)-th largest in this subset. There will be C_{n-k,t-k} further comparisons.
  3. the pivot is in a position k>t, say. So, we have to continue with the elements in positions 1,\ldots,k-1. There are k-1 of them, and we will look for the t-th largest in this subset. There will be C_{k-1,t} further comparisons.

Beforehand, each of the n real numbers had an equal chance of being the pivot, so the recursion reads as follows:

\displaystyle C_{nt} = n+1+\frac1n\sum_{k=1}^{t-1}C_{n-k,t-k}+\frac1n\sum_{k=t+1}^nC_{k-1,t}.
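The recursion translates directly into code. A memoized sketch (the function name is mine); it reproduces, e.g., C_{21}=3 and C_{32}=6:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def C(n, t):
    # Average comparison count: direct transcription of the recursion above.
    if n == 1:
        return 0.0                                          # C_{11} = 0
    left = sum(C(n - k, t - k) for k in range(1, t))        # pivot left of t
    right = sum(C(k - 1, t) for k in range(t + 1, n + 1))   # pivot right of t
    return n + 1 + (left + right) / n

print(C(2, 1), C(3, 2))  # 3.0 6.0
```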

As a warm-up, we start with the special case of t=1. As we have already seen, C_{11}=0. Now, let n>1:

\begin{array}{rcl}  (n+1)C_{n+1,1}-nC_{n,1} &=& \displaystyle (n+1)(n+2)+\sum_{k=2}^{n+1}C_{k-1,1}-n(n+1)-\sum_{k=2}^{n}C_{k-1,1}\\  &=& 2(n+1)+C_{n,1}, \vphantom{\sum_{i=1}^n}  \end{array}

which tells us

C_{n+1,1}-C_{n1} = 2.

Keeping in mind that this recursion only works for n>1 (to avoid the eerie case of vacuous sums and the expression C_{0,1} in the recursion above), we still need the base case C_{21} = 2+1=3, and so, by induction, we arrive at

C_{n1} = 2n-1-\delta_{n1}.

For reasons of symmetry, we can look at the case t=n by considering the case t=1:

C_{nn} = 2n-1-\delta_{n1}.

Now, let’s look at the recursion formula for C_{nt}. By divine intuition, we have a look at:

\begin{array}{rcl}  \displaystyle (n+1)C_{n+1,t+1}&-&nC_{n,t+1}-nC_{nt}+(n-1)C_{n-1,t} \\  &=& \displaystyle (n+1)(n+2)+\sum_{k=1}^{t}C_{n+1-k,t+1-k}+\sum_{k=t+2}^{n+1}C_{k-1,t+1}\\&&- \displaystyle n(n+1)-\sum_{k=1}^{t}C_{n-k,t+1-k}-\sum_{k=t+2}^nC_{k-1,t+1}\\&&- \displaystyle n(n+1)-\sum_{k=1}^{t-1}C_{n-k,t-k}-\sum_{k=t+1}^nC_{k-1,t}\\&&+ \displaystyle (n-1)n+\sum_{k=1}^{t-1}C_{n-1-k,t-k}+\sum_{k=t+1}^{n-1}C_{k-1,t}\\  &=& \displaystyle 2+\sum_{k=1}^{t}C_{n+1-k,t+1-k}-\sum_{l=2}^{t}C_{n-l+1,t-l+1}\\&& \displaystyle -\sum_{k=1}^{t}C_{n-k,t+1-k}+\sum_{l=2}^{t}C_{n-l,t-l+1} + \displaystyle C_{n,t+1}-C_{n-1,t} \vphantom{\sum_{i=1}^n} \\  &=& \displaystyle 2+C_{n,t}-C_{n-1,t}+C_{n,t+1}-C_{n-1,t}. \vphantom{\sum_{i=1}^n}  \end{array}

That leads to:

(n+1)C_{n+1,t+1}-(n+1)C_{n,t+1}-(n+1)C_{nt}+(n+1)C_{n-1,t}=2,

and to

\displaystyle C_{n+1,t+1}-C_{nt} = \frac{2}{n+1}+C_{n,t+1}-C_{n-1,t}.

Note that we can only write this down if n>t, since C_{n-1,t} won’t make sense otherwise. The recursion applies to any n>2, because the basic reasoning behind the recursion is only possible for these n (C_{11} is always a very special case, as we have already seen above).

Let us ignore additive constants for now, and particularly let us ignore the (quite nasty) base cases. The recursion will tell us how to go on from these base cases, so they only serve as an additive offset to the closed-form expression that we want to achieve. We iterate:

\begin{array}{rcl}  \displaystyle C_{n+1,t+1}-C_{nt} &=& \displaystyle \frac{2}{n+1}+C_{n,t+1}-C_{n-1,t}\vphantom{\sum_{i=1}^n}\\  &=& \displaystyle \frac{2}{n+1}+\frac{2}{n}+C_{n-1,t+1}-C_{n-2,t} \vphantom{\sum_{i=1}^n} \\  &=& \displaystyle \sum_{k=0}^{n-1-t}\frac{2}{n+1-k}+C_{t+1,t+1}-C_{tt}\\  &=& \displaystyle 2(H_{n+1}-H_{t+1})+2(t+1)-1-(2t-1). \vphantom{\sum_{i=1}^n}  \end{array}

That means

C_{nt}-C_{n-1,t-1}=2(H_n-H_t)+2.

Let us now look at

\begin{array}{rcl}  C_{nt}-C_{n-t+1,1} &=& \displaystyle \sum_{k=0}^{t-2}(C_{n-k,t-k}-C_{n-k-1,t-k-1}) \vphantom{\sum_{i=1}^n} \\ &=& \displaystyle \sum_{k=0}^{t-2}\left(2H_{n-k}-2H_{t-k}+2\right) \vphantom{\sum_{i=1}^n} \\ &=& \displaystyle 2\sum_{k=n-t+2}^{n}H_k-2\sum_{k=2}^tH_k+2(t-1) \vphantom{\sum_{i=1}^n} \\ &=& \displaystyle 2\sum_{k=1}^nH_k-2\sum_{k=1}^{n-t+1}H_k-2\sum_{k=1}^tH_k+2+2t-2 \vphantom{\sum_{i=1}^n} \\ &=& \displaystyle 2(n+1)H_n-2n-2(n-t+2)H_{n-t+1}+2(n-t+1) \vphantom{\sum_{i=1}^n} \\ && -2(t+1)H_t+2t+2t \vphantom{\sum_{i=1}^n} \\ &=& \displaystyle 2(n+1)H_n-2(n-t+2)H_{n-t+1}-2(t+1)H_t+2+2t. \vphantom{\sum_{i=1}^n}  \end{array}

Here, we have used the useful relation \sum_{k=0}^nH_k=(n+1)H_n-n, which can be proved without much effort by an interchange of summations.
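The identity itself is quickly confirmed numerically (a throwaway floating-point sketch):

```python
def H(n):
    # n-th harmonic number; H_0 is the empty sum, i.e. 0
    return sum(1 / i for i in range(1, n + 1))

n = 20
lhs = sum(H(k) for k in range(0, n + 1))   # sum of H_0, ..., H_n
rhs = (n + 1) * H(n) - n
print(abs(lhs - rhs) < 1e-9)  # True
```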

Then we get

\begin{array}{rcl}  C_{nt}&=& \displaystyle 2(n+1)H_n-2(n-t+2)H_{n-t+1}-2(t+1)H_t+2+2t+2(n-t+1)-1 \vphantom{\sum_{i=1}^n} \\  &=& \displaystyle 2(n+1)H_n-2(n-t+2)H_{n-t+1}-2(t+1)H_t+2n+3. \vphantom{\sum_{i=1}^n}  \end{array}

Be aware that this is not yet the truth: we have not been concerned with base cases. Let us therefore compute several special cases of C_{nt} to find the correct offset for the closed-form expression.

Let’s have a look at several particular cases to determine the additive constant that we still need to deal with the connection of recursion and base case.

By definition:

C_{32} = \displaystyle 3+1+\frac13(C_{21}+C_{22}) = 4+1+1=6,

while by the recursion formula:

C_{32} = \displaystyle 2\cdot4\cdot H_3-2(3-2+2)H_{3-2+1}-2\cdot 3 H_2+2\cdot 3+3 = \frac{88}{6}-9-9+6+3 = \frac{17}{3}.

Hence, we have to add 6-\frac{17}{3}=\frac13 to our solution to the recursion for any of the “standard” values for n and t. The boundary cases t=1 and t=n are different, and so is n=1.

For n>t=1, we already know C_{n1}=2n-1, which should equal

2(n+1)H_n-2(n-1+2)H_{n-1+1}-2\cdot 2H_1+2n+3 = -4+2n+3 = 2n-1.

Hence, in this case, no additive constant is necessary. For t=n, the same holds because of symmetry.

Finally, for n=t=1, we already know C_{11}=0. On the other hand,

2(1+1)H_1-2(1-1+2)H_{1-1+1}-2\cdot(1+1)H_1+2\cdot 1+3=-4+5=1,

thus we need the constant -1 in this case.

We shall write this down from the perspective of the “standard” values, yielding a constant of 3+\frac13=\frac{10}3. The other cases are considered the special ones and are done away with using Kronecker’s delta. Altogether, we have found what we claimed at the beginning:

\displaystyle C_{nt}=2(n+1)H_n-2(n-t+2)H_{n-t+1}-2(t+1)H_t+2n+\frac{10}{3}-\frac13\delta_{t1}-\frac13\delta_{tn}-\frac23\delta_{n1}.

This can in retrospect easily be verified by numerical computations. Note the symmetry of this expression under the substitution t\mapsto n-t+1: it is unimportant whether you are looking for the t-th largest or the t-th smallest number in your set.
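Such a verification might look as follows: using exact rational arithmetic, the recursion and the closed form can be compared term for term (all function names are my own):

```python
from fractions import Fraction
from functools import lru_cache

def H(n):
    # harmonic numbers as exact rationals
    return sum(Fraction(1, i) for i in range(1, n + 1))

@lru_cache(maxsize=None)
def C_rec(n, t):
    # the recursion from the first partitioning phase
    if n == 1:
        return Fraction(0)
    s = sum(C_rec(n - k, t - k) for k in range(1, t)) \
      + sum(C_rec(k - 1, t) for k in range(t + 1, n + 1))
    return n + 1 + Fraction(s, n)

def C_closed(n, t):
    # the closed form, with Kronecker deltas for the boundary cases
    d = lambda i, j: int(i == j)
    return (2 * (n + 1) * H(n) - 2 * (n - t + 2) * H(n - t + 1)
            - 2 * (t + 1) * H(t) + 2 * n + Fraction(10, 3)
            - Fraction(d(t, 1) + d(n, t), 3) - Fraction(2 * d(n, 1), 3))

assert all(C_rec(n, t) == C_closed(n, t)
           for n in range(1, 31) for t in range(1, n + 1))
print("closed form matches the recursion for all n up to 30")
```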

In some way, I’m tempted to adopt a notion of Pisier (Lecture Notes in Mathematics 770, p. 312–327) and end a rough proof with a heartfelt “Oof!” I do acknowledge, though, that Pisier’s proof is way more complicated than what I did here…

The intuitive heuristics at the beginning told us that the Quickfind algorithm should take linear time in n on the average. In our closed-form expression, we see terms of the form n H_n, which is approximately n\log n; but when you take a closer look at the Taylor expansion of the whole expression, the n\log n terms cancel. In particular, any t you may choose will take linear time in n. The naive algorithm of making t passes over your set and kicking out the maximum each time takes O(tn) comparisons; this may be superior for small t, but if you are interested in the median, say, it yields O(n^2) comparisons. Now, what have we found out for the important special case of finding the median of n real numbers?

We are interested in the case n=2t-1, the median of an odd number of elements, for sufficiently large n, and find

\begin{array}{rcl}  C_{2t-1,t} &=& \displaystyle 4tH_{2t-1}-4tH_t-4H_t+4t+\frac43\vphantom{\sum_{i=0}^n}\\  &=& \displaystyle 4t\left(\log(2t-1)+\gamma+\frac1{4t-2}\right)-4t\left(\log t+\gamma+\frac1{2t}\right) \vphantom{\sum_{i=0}^n} \\&&\displaystyle-4\left(\log t+\gamma+\frac1{2t}\right)+4t+\frac43+O(t^{-1}) \vphantom{\sum_{i=0}^n} \\  &=& \displaystyle 4t\left(\log(2t)+\log\left(1-\frac1{2t}\right)+\gamma+\frac1{4t-2}\right)-4t\log t-4\gamma t-2 \vphantom{\sum_{i=0}^n} \\&&\displaystyle -4\log t-4\gamma+4t+\frac43+O(t^{-1}) \vphantom{\sum_{i=0}^n} \\  &=& \displaystyle 4t\log 2+4t\log t+4t\log\left(1-\frac1{2t}\right)+4\gamma t+\frac{4t-2+2}{4t-2} \vphantom{\sum_{i=0}^n} \\ &&- \displaystyle 4t\log t-4\gamma t-2-4\log t-4\gamma+4t+\frac43+O(t^{-1}) \vphantom{\sum_{i=0}^n} \\ &=& \displaystyle 4t\log 2+4t\log\left(1-\frac1{2t}\right)+1+\frac1{2t-1}-2-4\log t-4\gamma+4t+\frac43+O(t^{-1}) \vphantom{\sum_{i=0}^n} \\  &=& \displaystyle 4t\log 2-4t\frac1{2t}+\frac13-4\log t-4\gamma+4t+O(t^{-1}) \vphantom{\sum_{i=0}^n} \\  &=& \displaystyle (4+4\log 2)t-4\log t-4\gamma-\frac53+O(t^{-1}) \vphantom{\sum_{i=0}^n} \\  &=& \displaystyle (4+4\log 2)\frac{n-1}2-4\log(\frac{n-1}2)-4\gamma-\frac 53+O(n^{-1}) \vphantom{\sum_{i=0}^n} \\  &=& \displaystyle (2+2\log 2)n-(2+2\log 2)-4\log(n-1)-4\log 2-4\gamma-\frac 53+O(n^{-1}) \vphantom{\sum_{i=0}^n} \\  &\approx& \displaystyle 3.39 n-3.39-4\log(n-1)-2.77-2.31-1.67+O(n^{-1}) \vphantom{\sum_{i=0}^n} \\  &\approx& \displaystyle 3.39n + O(\log n). \vphantom{\sum_{i=0}^n}    \end{array}
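A quick sanity check of the leading constant (a floating-point sketch, restricted to the standard case 1 < t < n where no delta term applies):

```python
from math import log

def H(n):
    return sum(1 / i for i in range(1, n + 1))

def C(n, t):
    # closed form for the standard case 1 < t < n
    return (2 * (n + 1) * H(n) - 2 * (n - t + 2) * H(n - t + 1)
            - 2 * (t + 1) * H(t) + 2 * n + 10 / 3)

for t in (10, 100, 1000, 10000):
    n = 2 * t - 1
    print(n, C(n, t) / n)   # ratios creep up towards 2 + 2*log(2) ≈ 3.3863
```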

It seems hard to think of an algorithm for picking the median of a set that is more efficient on the average than this one: even the constant is rather small. Note, however, that the worst-case analysis yields a much larger number of necessary comparisons, and we have said nothing about higher moments (which indicate the variation around the average case). Here, the closely related “median of medians” algorithm can be used to guarantee linear time even in the worst case, and one can also select the pivot in Quickselect at random to make bad cases unlikely. But anyway, this seems like a good place to stop.
