An analysis of Quicksort

As I have mentioned already, I am presently curious about sorting algorithms and about Quicksort in particular. This is just some academic curiosity – whenever I actually need to sort some data, I have ready-to-use commands in any of my favourite programming languages. But I wanted to develop a deeper understanding of what lies beneath the surface of the standard command “sort”. Often, the help pages will tell you that there is some implementation of Quicksort or Mergesort or Radixsort down there, sometimes enriched with Shellsort for special cases. Many times, there is no information available at all. But then, what is the performance of a given sorting algorithm? What are its limitations? And how can you decide which algorithm is best for what you want to achieve?

Hence my interest in the detailed analysis. Knuth’s book allows for a very thorough understanding, but it took me several weeks to get to the bottom of the analysis of Quicksort: partly because I only spent the odd off-hour on it, partly because of the nature of the analysis itself – it is not particularly hard, but pretty lengthy.

On the internet, you will find numerous explanations, implementations and worked examples for Quicksort. There is little I could add to that. But just to lay a basis for the rest of this post: let there be N elements (real numbers x_1,\ldots,x_N, say) that are to be sorted. The basic idea of Quicksort is to pick one element of the data and put it in its proper place, moving all smaller elements to one side and all larger elements to the other side. Then you have two subsets of the original data which you can sort independently of one another. This didn’t seem particularly intuitive when I first heard about it – why should that be any faster than Insertion Sort or Shellsort? Let’s formalize the algorithm a little more:

  • Choose a fixed number M. If you wish to sort N\leq M elements, use Insertion Sort.
  • If there are more than M elements at hand, start a new stage:
    • take the first element x_1 of your set as your pivot K
    • scan through the file from left to right (starting with x_2) and stop at the first element x_i you find that is greater than or equal to K
    • scan through the file from right to left (starting with x_N) and stop at the first element x_j you find that is less than or equal to K
    • if i<j then your pointers have not crossed – for the sorted set, this means x_i belongs to the right-hand side of the pivot K and x_j belongs to the left-hand side. So, swap x_i and x_j
    • if i\geq j, no more swaps need to be done: all elements smaller than K are on the left-hand side, all elements larger than K are on the right-hand side. Thus, the position of x_j is the final place for K – so swap x_j and K. (Note that at this moment, j must point at the final position of K, since the smaller elements have already been separated, x_j is the rightmost one of those, and K is still at position 1 – swapping x_i and K would therefore put K to a position that is too far to the right.)
  • Now the set of data has been divided, with K at its proper position, and we have two subsets, each containing only the unordered elements smaller than K or larger than K, respectively. Take each of these sets and repeat the algorithm from its start. (A small R sketch of the partitioning stage follows right below.)
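Here is that sketch. The function name partition and its interface (returning the rearranged vector together with the final pivot position) are made up for this illustration only; the iterative implementation discussed below keeps everything inside one routine.

```r
# One partitioning stage: partition x[lo..hi] around the pivot K = x[lo].
# Returns the rearranged vector and the final position of the pivot.
partition <- function(x, lo, hi) {
  K <- x[lo]
  i <- lo
  j <- hi + 1
  repeat {
    # scan from the left for the first element >= K
    repeat { i <- i + 1; if (i > hi || x[i] >= K) break }
    # scan from the right for the first element <= K
    repeat { j <- j - 1; if (x[j] <= K) break }
    if (i >= j) break                      # pointers have crossed
    tmp <- x[i]; x[i] <- x[j]; x[j] <- tmp # swap the two misplaced elements
  }
  x[lo] <- x[j]; x[j] <- K                 # move the pivot to its final place
  list(x = x, pivot = j)
}

partition(c(3, 8, 1, 9, 2, 7), 1, 6)       # the pivot 3 ends up in position 3
```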

The way that we have put down the algorithm here underlines its nature as a divide-and-conquer algorithm. There are several well-known ways of treating this as an exercise in elegant, recursive programming. For the analysis, Knuth proceeds slightly differently in chapter 5.2.2 of his book. On the one hand, the source code becomes more complicated; on the other hand, the program becomes more efficient. For instance, calling the program recursively generates a lot of overhead during run-time (it quickly swamped the RAM on my frail notebook). Besides, an iterative implementation makes it easier to analyse what actually happens and where the costs lie during run-time.

In Knuth’s book, the algorithm is written in his special-purpose language MIX. He uses many jump commands to emulate the if-statements and for-loops that MIX does not provide explicitly. I have adapted his implementation in the statistics language R, where there are no jump commands at all. I am perfectly aware that loops are not the fastest thing to do in R (and R is not really fast anyway – if you want speed, use your favourite dialect of C). There are ways to dress loops up in a vector-matrix disguise, but in this particular case, there was no speed-up from the usual workarounds. Sitting on top of an interpreter, the naive implementation of Quicksort in R cannot compete with R’s built-in sorting command (which is coded in a more machine-friendly way and which is more sophisticated in the nasty details). In tests on several dummy data sets, my naive implementation lost to the “sort” command by a factor of about 300 in run-time. But then again, I never intended to beat a professional implementation – I just wanted to understand what makes this algorithm work.

To keep track of the subsets that still need sorting, we employ a stack. At the beginning of each partitioning stage, we take the top-most element from the stack and partition the subset that is encoded in it. After partitioning, the two resulting subsets are only put on the stack if they still contain more than M elements. When the stack is empty, we’re done. Finally, we use one pass of Insertion Sort to clean up the “small” subsets.

This iterative implementation is obviously equivalent to the recursive implementation that you can find everywhere on the internet. The stack that we program explicitly will be used implicitly by your processor during run-time of the recursion (only with a lot more overhead, because every function call drags all sorts of meta-information along with it).
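To fix ideas, here is a stripped-down, self-contained sketch of this iterative scheme in R. It is only an illustration of the stack mechanism and of the final Insertion Sort pass; the name quicksort_iter and the default M = 10 are made up for this post, and the code is deliberately naive rather than tuned.

```r
# Iterative Quicksort: explicit stack of (lo, hi) ranges; subsets of at most M
# elements are left unsorted and cleaned up by one final Insertion Sort pass.
quicksort_iter <- function(x, M = 10) {
  N <- length(x)
  if (N < 2) return(x)
  stack <- list()
  if (N > M) stack[[1]] <- c(1, N)
  while (length(stack) > 0) {
    rng <- stack[[length(stack)]]          # pull the top-most subset ...
    stack[[length(stack)]] <- NULL         # ... and remove it from the stack
    lo <- rng[1]; hi <- rng[2]
    # one partitioning stage, exactly as sketched above
    K <- x[lo]; i <- lo; j <- hi + 1
    repeat {
      repeat { i <- i + 1; if (i > hi || x[i] >= K) break }
      repeat { j <- j - 1; if (x[j] <= K) break }
      if (i >= j) break
      tmp <- x[i]; x[i] <- x[j]; x[j] <- tmp
    }
    x[lo] <- x[j]; x[j] <- K
    # push the two subsets, but only if they still contain more than M elements
    if (j - lo > M) stack[[length(stack) + 1]] <- c(lo, j - 1)
    if (hi - j > M) stack[[length(stack) + 1]] <- c(j + 1, hi)
  }
  # one pass of Insertion Sort over the whole vector to clean up the small subsets
  for (k in 2:N) {
    v <- x[k]; p <- k - 1
    while (p >= 1 && x[p] > v) { x[p + 1] <- x[p]; p <- p - 1 }
    x[p + 1] <- v
  }
  x
}

set.seed(1)
y <- runif(1000)
identical(quicksort_iter(y), sort(y))      # should be TRUE
```

Note how the two if-statements at the end of each stage implement the rule of only stacking subsets that are still larger than M.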

Now, finally, to the analysis of the run-time. First, a heuristic argument: in each partitioning, the set will be split in half (more or less), and hence there will be about \log N levels of partitioning. Each of these levels comprises several stages, and within each level all N elements will be considered (first in one half, later in the other half – except for the pivot elements of course… but that’s what a heuristic is for). This yields a total run-time of about N\log N.
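Put as a recurrence (still only heuristically), the cost T(N) of sorting N elements is roughly the N comparisons of one partitioning pass plus the cost of the two halves; unfolding this about \log_2 N times, each unfolding contributing roughly N, gives the claimed order:

\displaystyle T(N) \approx N + 2\,T(N/2) \approx N + N + 4\,T(N/4) \approx \ldots \approx N\log_2 N.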

Now, a little deeper. Where do the costs for this algorithm come from?

  • A_N, the number of stages,
  • B_N, the number of swaps of x_i and x_j during partitioning,
  • C_N, the number of comparisons while partitioning,
  • D_N, the number of times that Insertion Sort is invoked,
  • E_N, the number of inversions that Insertion Sort heals.

Knuth analyses another number that he calls S_N, the number of times an element is put on the stack. For the source code in R, this is not necessary: every stage begins with pulling an element from the stack, and every element that comes from the stack must have been put there in advance. Hence, for my R code, A_N=S_N. Knuth, however, can distinguish these numbers because he can employ jump commands that I cannot: he only puts those subsets on the stack that he doesn’t consider immediately. I, on the other hand, put all the subsets on the stack as long as they are big enough – and on many occasions I put them on the stack only to retrieve them immediately afterwards. It turns out that Knuth gains a small amount of time by this, but this is not where the action is.

The analysis of the expectation of these five numbers relies heavily on the recursive structure of the algorithm, be it implemented recursively or not. We will arrive at explicit formulas for these numbers, and I am quite puzzled as to how one would find some of them in the first place. I can prove them alright, but the necessary induction doesn’t tell me how I would have written them down to begin with. Still, I’m glad I arrived there. Besides, the variance of these numbers seems quite out of reach for me (not for full-time researchers of course – the algorithm has been x-rayed pretty thoroughly).

OK. Let’s go.

First, the number of stages A_N. We will prove the theorem:

  • A_N=0 for N\leq M,
  • \displaystyle A_N = 2\frac{N+1}{M+2}-1 for any N> M.

The first assertion is obvious. If N>M, we do a partition on the entire set, and, by basic assumption, each element has the same probability of standing in position 1 and thus of becoming the pivot element. So, with probability \frac 1N the pivot ends up in some position s, and we continue with s-1 elements on the left-hand side and N-s elements on the right-hand side. To sum up, we have a total of A_N = 1+\frac 1N\sum_{s=1}^N(A_{s-1}+A_{N-s}) stages during the whole run-time of the algorithm, in case N>M. This can be rearranged to N A_N = N+2\sum_{s=0}^{N-1} A_s. Then, we get rid of the sum by considering the equation for N+1 and taking the difference: (N+1)A_{N+1} = N+1 - N +NA_N+2A_N = 1+(N+2)A_N. Note that the recursion formula only holds for N>M, so you can’t just do the difference trick with N and N-1 (at least: not without thinking).

What have we got now? There are two cases left to be distinguished:

  • If N = M+1, we have \displaystyle A_N = 1+\frac2N\sum_{s=0}^{N-1}A_s = 1, since all the summands vanish.
  • If N>M+1, we have \displaystyle A_N = \frac{1}{N}+\frac{N+1}{N}A_{N-1}\vphantom{\sum_{s=0}^{N-1}A_s}.

By divine insight, you can now guess the explicit formula from the statement. And using induction a little more carefully than I did when I first tried it, you can easily prove this. The base case N=M+1 is obvious. For N>M+1, we have by inductive assumption:

\begin{array}{rcl} A_N &=& \displaystyle \frac 1N + \frac{N+1}{N}A_{N-1} \vphantom{\sum_{s=0}^{M}} \\ &=& \displaystyle \frac1N+\frac{N+1}{N}(2\frac{N}{M+2}-1) \vphantom{\sum_{s=0}^{M}} \\ &=& \displaystyle \frac1N+2\frac{N+1}{M+2}-\frac{N+1}{N} \vphantom{\sum_{s=0}^{M}} \\ &=& \displaystyle 2\frac{N+1}{M+2}-1 \end{array}.

q.e.d.

So much for the total number of stages: it is linear in the size of the set – note the difference from the number of “levels” of stages in the heuristics given above. Stopping the algorithm earlier (leaving larger unsorted subsets, of size up to M) will decrease the number of stages. So far, so obvious.
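If you don’t trust the divine insight that is about to come up again and again, the closed formula can at least be checked numerically against the recursion. A short R sketch (the cut-off M = 9 and the range N \leq 60 are arbitrary choices for this check):

```r
# Compare the closed formula A_N = 2*(N+1)/(M+2) - 1 with the recursion
# N*A_N = N + 2*sum_{s<N} A_s (valid for N > M), using A_N = 0 for N <= M.
M <- 9
A <- numeric(60)                                # A[N] holds A_N; zero for N <= M
for (N in (M + 1):60) A[N] <- 1 + 2 * sum(A[seq_len(N - 1)]) / N
Ns <- (M + 1):60
max(abs(A[Ns] - (2 * (Ns + 1) / (M + 2) - 1)))  # numerically zero
```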

Now, for the number B_N of swaps during partitioning. We will prove the theorem:

  • B_N=0 for N\leq M,
  • \displaystyle B_N = \frac{N+1}{6}(2H_{N+1}-2H_{M+2}+1-\frac6{M+2})+\frac12\approx \frac{N+1}{3}\log\left(\frac{N+1}{M+1}\right)+\frac{N+1}{6}\left(1-\frac{6}{M+1}\right)+\frac12 for any N> M.

Here, H_N=\sum_{k=1}^N\frac1k is the N-th harmonic number. Basic integral calculus or the Euler–Maclaurin formula will tell you that H_N \approx \log N+\gamma+\frac1{2N}.

Again, the first assertion is obvious. For the other assertion, we can do the same trick as we did for A_N. The twist is that we have to work harder for what happens during one partitioning stage: if the pivot element belongs to position s, we will have, say, t swaps from the left-hand side to the right-hand side during the partitioning, so there’s one addition to B_N for each displaced element here. In this situation, t can take any number from 0 to s-1, and the number of possibilities for each of them is \binom{s-1}{t}\binom{N-s}{t}: choose t displaced elements among the s-1 on the left-hand side, and choose t displaced elements among the N-s on the right-hand side, then swap them accordingly. Since there are \binom{N-1}{s-1} ways in total to choose the elements that go to the left-hand side, we arrive at the recursion

\displaystyle B_N = \frac1N\sum_{s=1}^N\left(\sum_{t=0}^{s-1}\frac{\binom{s-1}{t}\binom{N-s}{t}}{\binom{N-1}{s-1}}t+B_{s-1}+B_{N-s}\right).

From some usual fiddling with binomial coefficients that is laid down in Knuth’s book, chapter 1.2.6, Example 1, we find

\displaystyle \sum_{t=0}^{s-1}\binom{s-1}{t}\binom{N-s}{t}t = \binom{N-2}{s-2}(N-s) = \binom{N-1}{s-1}\frac{s-1}{N-1}(N-s),

and therefore we get

\displaystyle B_N = \frac1N\sum_{s=1}^N\frac{(s-1)(N-s)}{N-1}+\frac2N\sum_{s=0}^{N-1}B_s.

Now, the first summand can be computed by the well-known formulas for sums of integers and sums of squares, so that B_N = \frac{N-2}{6}+\frac 2N\sum_{s=0}^{N-1}B_s. Afterwards, we can do the difference trick to get

\displaystyle B_N = \frac{2N-3}{6N}+\frac{N+1}{N}B_{N-1} for N>M+1.

Again, divine intuition will show you that the stated explicit formula fits the recursion. The proof is by induction, since

\begin{array}{rcl} B_{M+1} &=& \displaystyle \frac{M-1}{6}+\frac{2}{M+1}\sum_{s=0}^{M} B_s \\ &=& \displaystyle \frac{M-1}{6} \vphantom{\sum_{s=0}^{M}}\\ &=& \displaystyle \frac{N-2}{6} \vphantom{\sum_{s=0}^{M}} \\ &=& \displaystyle \frac{N+1}{6}(0+1-\frac{6}{N+1})+\frac12. \vphantom{\sum_{s=0}^{M}} \end{array}

For N>M+1,

\begin{array}{rcl} B_N &=& \displaystyle \frac{2N-3}{6N}+\frac{N+1}{N}\left(\frac{N}{6}\left(2H_{N}-2H_{M+2}+1-\frac6{M+2}\right) + \frac12\right) \vphantom{\sum_{s=0}^{M}} \\ &=& \displaystyle \frac{2N-3}{6N}+\frac{N+1}{6}\left(2H_{N}-2H_{M+2}+1-\frac6{M+2}\right) + \frac{N+1}{2N} \vphantom{\sum_{s=0}^{M}} \\&=& \displaystyle \frac{2N-3+3N+3}{6N} + \frac{N+1}{6}\left(2H_{N}-2H_{M+2}+1-\frac6{M+2}\right) \vphantom{\sum_{s=0}^{M}} \\&=& \displaystyle \frac 56 + \frac{N+1}{6}2H_N + \frac{N+1}{6}\left(-2H_{M+2}+1-\frac6{M+2}\right) \vphantom{\sum_{s=0}^{M}} \\&=& \displaystyle \frac{N+1}{6}\frac{2}{N+1}+\frac{N+1}{6}2H_N+\frac{N+1}{6}\left(-2H_{M+2}+1-\frac6{M+2}\right) +\frac12 \vphantom{\sum_{s=0}^{M}} \\&=& \displaystyle \frac{N+1}{6}(2H_{N+1}-2H_{M+2}+1-\frac6{M+2}) +\frac12\end{array}.

q.e.d.
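As for A_N, the closed formula can be checked numerically against the recursion; again a short R sketch with the arbitrary choices M = 9 and N \leq 60:

```r
# Compare B_N = (N+1)/6*(2*H_{N+1} - 2*H_{M+2} + 1 - 6/(M+2)) + 1/2 with the
# recursion B_N = (N-2)/6 + (2/N)*sum_{s<N} B_s (for N > M; B_N = 0 otherwise).
M <- 9
H <- function(n) sum(1 / seq_len(n))       # harmonic number H_n
B <- numeric(60)
for (N in (M + 1):60) B[N] <- (N - 2) / 6 + 2 * sum(B[seq_len(N - 1)]) / N
Ns <- (M + 1):60
closed <- (Ns + 1) / 6 * (2 * sapply(Ns + 1, H) - 2 * H(M + 2) + 1 - 6 / (M + 2)) + 1 / 2
max(abs(B[Ns] - closed))                   # numerically zero
```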

Next, C_N, the number of comparisons while partitioning. The theorem is

  • C_N=0 for N\leq M,
  • \displaystyle C_N = (N+1)(2H_{N+1}-2H_{M+2}+1)\approx 2(N+1)\log\left(\frac{N+1}{M+2}\right)+N+1 for any N> M.

The analysis is similar to what we did before, so we only note the main steps here. Let’s say the pivot ends up in position s. Then, in a partitioning of N elements, you have to compare the pivot to the elements x_2,\ldots, x_{s+1}, because that’s where the i-pointer will stop (remember that the elements to the right of position s are larger than the pivot). The pivot will also be compared to the elements x_{N},\ldots, x_{s}, because that’s where the j-pointer will stop. That makes (s+1)-1 + N-(s-1) = N+1 comparisons in total for this partitioning, and besides there will be partitions on s-1 and N-s elements after that. So,

\displaystyle C_N = \frac1N\sum_{s=1}^{N} \left((N+1)+C_{s-1}+C_{N-s}\right) = N+1 + \frac2N\sum_{s=0}^{N-1}C_{s}.

Hence, for N>M+1, C_N = 2+\frac{N+1}{N}C_{N-1}, and induction tells you that the assertion follows. q.e.d.
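And once more the numerical sanity check, in the same style as above:

```r
# Compare C_N = (N+1)*(2*H_{N+1} - 2*H_{M+2} + 1) with the recursion
# C_N = N + 1 + (2/N)*sum_{s<N} C_s (for N > M; C_N = 0 otherwise).
M <- 9
H <- function(n) sum(1 / seq_len(n))
Cn <- numeric(60)
for (N in (M + 1):60) Cn[N] <- N + 1 + 2 * sum(Cn[seq_len(N - 1)]) / N
Ns <- (M + 1):60
max(abs(Cn[Ns] - (Ns + 1) * (2 * sapply(Ns + 1, H) - 2 * H(M + 2) + 1)))  # ~ 0
```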

This is the only one of these numbers for which I have understood how one might come up with the explicit formula in the first place – but only because Knuth gives a construction for it, and that construction didn’t translate properly to the other recursions (possibly because I didn’t spend too much time on this particular icing on the cake). As I heard in a fine talk a couple of years ago: “It doesn’t matter if it is remarkable: it is sharp, you can prove it”.

Alright, so the action appears to come from B_N and C_N. Both behave roughly like N \log N, up to some constant fuss and terms of lower order. This is the order that we have already found in the heuristics, and one can prove that a bottleneck like this is hidden in any comparison-based sorting algorithm – you can’t sort by comparisons faster than O(N\log N). But one of the merits of Quicksort is that this bottleneck is pretty wide: the loops that are responsible for the comparisons while partitioning are very fast on most machines (if you put some effort into proper coding) and they are designed to clean up many inversions of the unsorted set simultaneously (unlike plain Insertion Sort or Bubblesort, which look at each inversion individually): the constants involved in the O-notation can thus be kept very low.

In retrospect, one might have guessed that C_N\approx 6B_N. Of course, all the numbers in the relevant data set have to be compared to the pivot – but only in 1 out of 6 cases is a swap necessary: namely if the element at the left pointer is larger than the pivot and the element at the right pointer is smaller. Since there are 3 real numbers involved, there are 6 possible orderings, and only in 1 of those do we have to increase B_N. Note as well that the explicit formulas above combine to the exact identity B_N = \frac16 C_N - \frac12 A_N. Part of the reason is that the swap of the pivot to its proper place does not count towards B_N: every stage ends with such a swap, and the number of stages is exactly A_N.

Finally, let’s turn to the Insertion Sort part that cleans up the little mess that Quicksort has left us. The key idea is again that the pivot ends up at its proper position after partitioning. Even if Quicksort stops immediately after that, the remaining inversions can only lie within the blocks of at most M elements sitting between pivots. Hence, we only have to look at subsets like \{1,\ldots,M\} and build the numbers D_N, E_N out of building blocks from those subsets. For D_N, the theorem is:

  • D_N = N-H_N for N\leq M,
  • \displaystyle D_N = (N+1)(1-\frac{2H_{M+1}}{M+2}) for any N> M.

Remember that D_N counts the number of times that Insertion Sort is invoked at all. This can only happen if an element is smaller than its immediate predecessor: we work through the entire set of N numbers and every time we find one out of order, we let it sink down to its proper place (and thanks to Quicksort, it doesn’t have far to go). By construction, the subset of numbers that we have considered after this insertion procedure is ordered – it may just not yet be complete. In particular, when we consider the next element, we have to invoke Insertion Sort if it is out of order compared to its immediate predecessor. Equivalently, we need Insertion Sort if this element is not larger than all of its predecessors! Hence, we have to look at how many running maxima there are in a given random permutation: Insertion Sort will start if and only if the current element is not a running maximum. The expected number of running maxima in a permutation of N elements has been computed by Knuth in chapter 1.10 of his book, and its value is H_N.

Now, this solves the case N\leq M: D_N = N-H_N. For larger N, we have to remember that there are no inversions across pivot elements, so there is no additional Insertion Sort that is induced by a partitioning stage (only by the lack thereof). So, D_N = \frac 2N\sum_{s=0}^{N-1}D_s. In particular, for N = M+1,

\displaystyle D_N = \frac2{M+1}\sum_{s=0}^{N-1}(s-H_s) = M+2-2H_{M+1}.

By induction, the assertion follows. q.e.d.
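The running-maxima argument also lends itself to a quick Monte Carlo check; a small R sketch for a single subset (the size N = 8 and the number of repetitions are arbitrary):

```r
# In a random permutation of N elements, the expected number of elements that
# are NOT running (left-to-right) maxima – i.e. that trigger an insertion –
# should be N - H_N.
set.seed(1)
N <- 8
H_N <- sum(1 / seq_len(N))
sims <- replicate(2e5, {
  x <- sample(N)
  sum(x < cummax(x))                       # elements below the running maximum
})
c(simulated = mean(sims), exact = N - H_N)
```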

The number E_N is the number of swaps that Insertion Sort has to take care of. The theorem is:

  • \displaystyle E_N = \frac{N^2-N}{4} for N\leq M,
  • \displaystyle E_N = \frac{(N+1)M(M-1)}{6(M+2)} for any N> M.

In a subset of length N\leq M, the average number of swaps is \frac{N^2-N}{4}, which you can prove easily by induction (the largest number N can be placed in position j, say, with (N-1)! possibilities for the other elements, and it will induce N-j swaps there; by induction, the other elements contribute (N-1)!\frac{(N-1)^2-(N-1)}{4} inversions already; sum over j, divide by N!, and enjoy) or which you can look up in chapter 1.10 of Knuth’s book.

By the arguments from above, we end up with

\displaystyle E_N = \frac 2N\sum_{s=0}^{N-1}E_s = \frac{N+1}{N}E_{N-1}.

Induction wraps it all up. q.e.d.
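And the corresponding Monte Carlo check for the building block of E_N (again, N = 8 and the number of repetitions are arbitrary):

```r
# The expected number of inversions in a random permutation of N elements
# (i.e. the moves the final Insertion Sort has to make) should be N*(N-1)/4.
set.seed(1)
N <- 8
inversions <- function(x) {
  idx <- seq_along(x)
  sum(outer(idx, idx, "<") & outer(x, x, ">"))   # pairs i < j with x[i] > x[j]
}
sims <- replicate(2e4, inversions(sample(N)))
c(simulated = mean(sims), exact = N * (N - 1) / 4)
```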

Now, you can have a closer look at your own machine and programming language to find out the weight of each of these numbers for your source code. The total run-time is O(N \log N), but there will be an optimal choice of the constant M. Its effect is hidden in the constants of the O-notation, but it may become noticeable in daily use. Knuth’s implementation ends up with M=9. I don’t have sufficient information to do this properly for the R code discussed above (what is the overhead of each of the commands used – which is costly and which is cheap?). Empirical tests tell me to use some M\approx 10 – but this is splitting hairs when you get down to it: the differences in run-time of my code are negligible for large data sets (and I would advise using the built-in sort command anyway).

Of course, there can still be some nitpicking about the performance of Quicksort as it stands here. The pivot can be randomized or at least chosen somewhat more intelligently (if you already have a sorted set, Quicksort will not split the set at all but only separate the pivot – this is its worst-case scenario, yielding a rather poor O(N^2)). As Knuth put it: Quicksort “likes” an unsorted file. The proper choice of the pivot is obviously crucial for the algorithm, and its impact may still hide some interesting mathematics. Empirically though, on many data sets (even some pathological ones) the standard choice of pivot elements was a little faster in my own R implementation than randomizing or picking certain median values. Many people have done research on this (and on other aspects of the algorithm) in order to fine-tune Quicksort on as many systems and for as many kinds of input data as possible (let me google some example for you).

Yet another issue is stability: Quicksort is not a stable algorithm, so it will not preserve the relative order of elements with equal keys that already stands within the input. Since you will often sort data where such a pre-existing order should be kept (think of sorting by a second key after a first), this shortcoming leads the way to other sorting algorithms.

But this post has already been long enough as it stands. So long.
