Notes #14 --- Sorting
Problems of searching, sorting, merging, and finding order statistics (min, max, median, etc.) are of great importance in computer science.
We have discussed at length how searching is implemented, and (particularly in the contexts of sequential and multikey files) the significance of merging for secondary file manipulation, and implementations for two-way merge:
Sorting is more difficult, because things have to be moved within a file, and files are typically too big to fit in internal memory.
We first review internal sorting algorithms, then look at their applicability or inapplicability for external sorting, and then discuss external sorting techniques.
Internal Sorting Techniques
We can classify sorting techniques according to their underlying structure.
Table #1:
Summary of Sorts by Family, Complexity, and Space UsageGeneral comments on sorting:
Extensions, general issues, and questions
Requirements for external sorts
An external sort cannot see records not in the current buffer, and cannot swap with records far away without using undesirable amounts of space and horribly complicating the algorithm. We would also prefer an algorithm which would allow us to use buffers constructively. We already know how to merge using buffers, so some sort of merge sort appears reasonable.
On the other hand, each pass through a file will require large time overhead, and we would like to cut down on the number of passes. One way to do this will be to merge more than two files at a time.
To do this efficiently, however, we will have to be able
Both (min) heap sort and tournament sort can easily be modified to do this; we will illustrate the concepts with heap sort because it's easier to follow, but tournament sort is probably easier to use in practice.
We also need a means for setting up the initial runs of a file, and distributing them to multiple tapes. There are two standard alternatives:
Summary
Minor differences in external sorting algorithms come from the number of tapes they merge, and whether they use heap, tournament, or some related sort algorithm. The major difference, however, is the rule for identifying input and output tapes, and the process for identifying the new runs.
Three Algorithms
Internal merge
Binary merge is straightforward, and won't be discussed further. The only tricky point is recognizing when a new run begins: Precisely when the next key on the input tape is larger than the key just added to output from that tape.
The internal merge step uses, as mentioned above, a variant of either (min) heap sort or tournament sort. Each record in the heap/tournament tree is tagged with a triple
(key value, tape number, run number)
with the ordering: compare run numbers; if run numbers are equal, compare key numbers. The run number increases as above.
Min-heap-merge algorithm.
External sorts can now be seen as comprising the following essentially independent steps:
identify initial input and output tapes
set up initial runs
/* identify natural runs or sort buffers-full */
deal out initial runs to input tapes
repeat
merge runs to output tape(s) /* heap or tournament sort */
either
deal output runs to input tapes /* for two-phase merge */
or
change input and output tapes
/* two-phase or Fibonacci merge */
until there is exactly one run
Deal-and-Merge
The key idea in deal-and-merge is that there are a number of input tapes and one output tape.
Two-Phase Merge
The key idea is to use 2n tapes, with n input and n output tapes. Runs are merged from input tapes onto the output tapes cyclically --- the first run of each of the input tapes is merged to the first output tape, and so on. After this pass, the input tapes become the output tapes and vice-versa.
Fibonacci merge
Complexity of external merge sort:
The space cost of any of these methods is the sum of
In computing time cost, we will ignore overhead, and also ignore the need for floor and ceiling operators. We need to consider the following factors in determining time cost:
Cost of two-phase merge
Assume there are N records and 2r tapes, and that the initial step uses an O(n lg n) sort to sort buffers of size k.
N/k * O (k lg k) = O (N lg k)
Thus the total cost is
O (N lg k) + log_r (N/k) * O (N lg r)
= O (N lg k + N (log_r (N/k) * lg r))
which by change-of-base formula for logs is:
O (N lg k + N lg (N/K)) = O (N (lg k + lg (N/k))) = O (N lg N)
by properties of logs. Thus two-phase merge is an optimal sort algorithm.
Cost of Deal-and-Merge and Fibonacci Merge
Table 3:
Comparison of External Merge/Sort Algorithms