If “too large to fit in memory” means that there are too many items to fit into memory at one time, the data can be sorted in groups that will fit into memory, and then the resulting files can be merged. A sort such as a radix sort can also be used as an external sort, by making each bucket in the sort a file.
Even the quick sort can be an external sort. The data can be partitioned by writing it to two smaller files. When the partitions are small enough to fit, they are sorted in memory and concatenated to form the sorted file.
The example given below is an external sort. It sorts data in groups of 10,000 strings and writes them to files, which are then merged. If you compare this listing to the listing of the merge sort, you will notice many similarities.
Any of the four sort programs introduced so far in this chapter can be used as the in-memory sort algorithm. The functions myfgets() and myfputs() simply handle inserting and removing the newline (‘\n’) characters at the ends of lines. The openFile() function handles error conditions during the opening of files, and fileName() generates temporary filenames.
The function split() reads in up to 10,000 lines from the input file on lines 69–74, sorts them in memory on line 76, and writes them to a temporary file on lines 77–80. The function merge() takes two files that are already sorted and merges them into a third file in exactly the same way that the merge() routine in the last example merged two lists. The function mergePairs() goes through all the temporary files and calls merge() to combine pairs of files into single files, just as mergePairs() in the last example combines lists. Finally, main() invokes split() on the original file, then calls mergePairs() until all the files are combined into one big file. It then replaces the original unsorted file with the new, sorted file.
No comments:
Post a Comment