Notes #14 --- Sorting

Problems of searching, sorting, merging, and finding order statistics (min, max, median, etc.) are of great importance in computer science.

We have discussed at length how searching is implemented, and (particularly in the contexts of sequential and multikey files) the significance of merging for secondary file manipulation, and implementations for two-way merge:

Sorting is more difficult, because things have to be moved within a file, and files are typically too big to fit in internal memory.

We first review internal sorting algorithms, then look at their applicability or inapplicability for external sorting, and then discuss external sorting techniques.

Internal Sorting Techniques

We can classify sorting techniques according to their underlying structure.

  1. Swap-based sorts begin conceptually with the entire list, and exchange particular pairs of elements (adjacent elements except for Shell sorts) moving toward a more sorted list.
  2. Merge-based sorts develop sorted lists, and then add either another element (insertion sort) or list (merge sorts).
  3. Tree-based sorts store the data, at least conceptually, in a binary tree; there are two different approaches, one based on heaps, and the other based on search trees.
  4. Finally, the other category catches sorts which use additional key-value information, such as radix or bucket sort.

Table #1:  Summary of Sorts by Family, Complexity, and Space Usage

General comments on sorting:

  1. Quicksort tends to be the most efficient technique if the data set is large (and the data is not nearly sorted or reverse-sorted).
  2. Insertion sort tends to be the most efficient when the data set is small, and for lots of other special cases.

 Extensions, general issues, and questions

Requirements for external sorts

An external sort cannot see records not in the current buffer, and cannot swap with records far away without using undesirable amounts of space and horribly complicating the algorithm. We would also prefer an algorithm which would allow us to use buffers constructively. We already know how to merge using buffers, so some sort of merge sort appears reasonable.

On the other hand, each pass through a file will require large time overhead, and we would like to cut down on the number of passes. One way to do this will be to merge more than two files at a time.

To do this efficiently, however, we will have to be able

  1. To have a procedure for sorting the set of current nodes from each buffer.
  2. To take in another node from the buffer when the current node is put into output.
  3. To know when we have reached into a new "run" on a file, since its keys do not necessarily compare in any useful way to the keys we have already seen.

Both (min) heap sort and tournament sort can easily be modified to do this; we will illustrate the concepts with heap sort because it's easier to follow, but tournament sort is probably easier to use in practice.

We also need a means for setting up the initial runs of a file, and distributing them to multiple tapes. There are two standard alternatives:

  1. Identify the runs as for natural merge sort, or
  2. In a first pass, use insertion sort (or some other internal sort) to sort one buffer at a time. As each run is identified or constructed, it is added to one of the tapes.

Summary

  1. Identify or construct initial runs, and "deal them out" to several tapes.
  2. Identify input and output tapes for this run.
  3. Using one buffer for each tape, merge the first runs from each tape, using a modified min heap or tournament sort, and put the new, merged run on an output tape.
  4. Identify the input and output tapes for the next run, and if necessary deal out the runs to the input tapes.

Minor differences in external sorting algorithms come from the number of tapes they merge, and whether they use heap, tournament, or some related sort algorithm. The major difference, however, is the rule for identifying input and output tapes, and the process for identifying the new runs.

Three Algorithms

  1. Merge-and-deal:
  1. Two-phase merge:
  1. Fibonacci merge:

Internal merge

Binary merge is straightforward, and won't be discussed further. The only tricky point is recognizing when a new run begins: Precisely when the next key on the input tape is larger than the key just added to output from that tape.

The internal merge step uses, as mentioned above, a variant of either (min) heap sort or tournament sort. Each record in the heap/tournament tree is tagged with a triple

(key value, tape number, run number)

with the ordering: compare run numbers; if run numbers are equal, compare key numbers. The run number increases as above.

 Min-heap-merge algorithm.

External sorts can now be seen as comprising the following essentially independent steps:

                   identify initial input and output tapes
                   set up initial runs 
                          /* identify natural runs or sort buffers-full */
                   deal out initial runs to input tapes
                   repeat 
                       merge runs to output tape(s)          /* heap or tournament sort */
	               either
	                   deal output runs to input tapes   /* for two-phase merge */
                       or
                           change input and output tapes     
                          /* two-phase or Fibonacci merge */
                   until there is exactly one run

Deal-and-Merge

The key idea in deal-and-merge is that there are a number of input tapes and one output tape.

Two-Phase Merge

The key idea is to use 2n tapes, with n input and n output tapes. Runs are merged from input tapes onto the output tapes cyclically --- the first run of each of the input tapes is merged to the first output tape, and so on. After this pass, the input tapes become the output tapes and vice-versa.

 Fibonacci merge

Complexity of external merge sort:

The space cost of any of these methods is the sum of

  1. The space for input.
  2. The space for output.
  3. The internal space.

In computing time cost, we will ignore overhead, and also ignore the need for floor and ceiling operators. We need to consider the following factors in determining time cost:

  1. The cost of setting up the initial runs.
  2. The cost of a single merge pass.
  3. The number of passes needed.

Cost of two-phase merge

Assume there are N records and 2r tapes, and that the initial step uses an O(n lg n) sort to sort buffers of size k.

  1. The cost of setting up initial runs of uniform length is the cost of sorting each buffer, times the number of buffers:
            N/k * O (k lg k) = O (N lg k)
  1. A single merge pass puts every key through a min heap of r elements. The cost to reheap is O(lg r) per element, or O(N lg r) in total (ignoring the (minimal) O(r) cost to set up each initial heap).
  2. Each pass merges groups of r runs into single runs, so makes R runs into R/r runs. The effect of t passes is to reduce the initial N/k runs to N/(k r^t), and so log_r (N/k) passes are needed to reduce to a single run.

Thus the total cost is

             O (N lg k) + log_r (N/k) * O (N lg r)
                    = O (N lg k + N (log_r (N/k) * lg r))

which by change-of-base formula for logs is:

             O (N lg k + N lg (N/K)) = O (N (lg k + lg (N/k))) = O (N lg N)

by properties of logs. Thus two-phase merge is an optimal sort algorithm.

Cost of Deal-and-Merge and Fibonacci Merge

Table 3:  Comparison of External Merge/Sort Algorithms