Saturday 28 January 2012

How can I sort things that are too large to bring into memory?

A sorting program that sorts items that are on secondary storage (disk or tape) rather than primary storage (memory) is called an external sort. Exactly how to sort large data depends on what is meant by “too large to fit in memory.” If the items to be sorted are themselves too large to fit in memory (such as images), but there aren’t many items, you can keep in memory only the sort key and a value indicating the data’s location on disk. After the key/value pairs are sorted, the data is rearranged on disk into the correct order.

If “too large to fit in memory” means that there are too many items to fit into memory at one time, the data can be sorted in groups that will fit into memory, and then the resulting files can be merged. A sort such as a radix sort can also be used as an external sort, by making each bucket in the sort a file.

Even the quick sort can be an external sort. The data can be partitioned by writing it to two smaller files. When the partitions are small enough to fit, they are sorted in memory and concatenated to form the sorted file.

The example given below is an external sort. It sorts data in groups of 10,000 strings and writes them to files, which are then merged. If you compare this listing to the listing of the merge sort, you will notice many similarities.

Any of the four sort programs introduced so far in this chapter can be used as the in-memory sort algorithm. The functions myfgets() and myfputs() simply handle inserting and removing the newline (‘\n’) characters at the ends of lines. The openFile() function handles error conditions during the opening of files, and fileName() generates temporary filenames.

The function split() reads in up to 10,000 lines from the input file on lines 69–74, sorts them in memory on line 76, and writes them to a temporary file on lines 77–80. The function merge() takes two files that are already sorted and merges them into a third file in exactly the same way that the merge() routine in the last example merged two lists. The function mergePairs() goes through all the temporary files and calls merge() to combine pairs of files into single files, just as mergePairs() in the last example combines lists. Finally, main() invokes split() on the original file, then calls mergePairs() until all the files are combined into one big file. It then replaces the original unsorted file with the new, sorted file.

An example of an external sorting algorithm.


No comments:

Post a Comment