Thursday, April 02, 2009

Sorting A Million Integers

Have you ever heard the popular programming interview question, "Show me how to sort one million 32 bit integers in 2MB of RAM?"

I've read about it, but have never actually been asked it in an interview. I think the correct response is that one million 32 bit integers would not fit in 2MB of RAM. But if you assume the problem implies you have an unlimited amount disk space, then the problem is actually solvable. I recently read a good article going over the solution to this problem, so I thought I'd share the links.

The article (and its follow up) assume you know about heaps and heapsort. In case you don't, a heap is a tree-like data structure and heapsort is a very fast sorting algorithm. You can read about them more on wikipedia, Heap here and Heapsort here. If you're unfamiliar with heapsort, I'd strongly suggest you read about Sorting Algorithms in general. Although you'll probably never need to program one for work, it's really nice topic to know about and will probably help you learn to judge your own algorithms better. By the way, here is a java implementation of Heapsort based on the pseudocode from its wikipedia article.
  1. public class Heap {  
  2.   
  3.   public static int[] swap( int[] a, int first, int last ) {  
  4.     int tmp = a[first];  
  5.     a[first] = a[last];  
  6.     a[last] = tmp;  
  7.     return a;  
  8.   }  
  9.   
  10.   public static int[] heapSort( int[] a, int count ) {  
  11.     a = heapify(a, count);  
  12.   
  13.     int end = count - 1;  
  14.     while( end > 0 ) {  
  15.       //swap the root (maximum value) of the heap with the last element of the heap  
  16.       a = swap( a, 0, end );  
  17.   
  18.       //decrease the size of the heap by one so that the previous  
  19.       //max value will stay in its proper placement  
  20.       end--;  
  21.   
  22.       //put the heap back in max-heap order  
  23.       a = siftDown(a, 0, end);  
  24.     }  
  25.   
  26.     return a;  
  27.   }  
  28.   
  29.   public static int[] heapify(int[]a, int count ) {  
  30.     //Start is a's index of the last parent node  
  31.     int start = (count-2)/2;  
  32.   
  33.     while( start >= 0 ) {  
  34.       //Sift down the node at index start to the proper place such that  
  35.       //all nodes below the start index are in heap order  
  36.       a = siftDown(a, start, count-1);  
  37.       start--;  
  38.     }  
  39.   
  40.     return a;  
  41.   }  
  42.   
  43.   public static int[] siftDown(int[] a, int start, int end) {  
  44.     int root = start;  
  45.   
  46.     //While the root has at least one child...  
  47.     while( root *2 + 1 <= end ) {  
  48.       //root*2 points to the left child  
  49.       int child = root * 2 + 1;  
  50.   
  51.       //If the child has a sibling and the child's value is less than its sibling's...  
  52.       if( child + 1 <= end && a[child] < a[child+1] ) {  
  53.         //Then point to the right child instead  
  54.         child = child+1;  
  55.       }  
  56.   
  57.       //If out of max heap order...  
  58.       if( a[root] < a[child] ) {  
  59.         //Then repeat to continue sifting down the child now  
  60.         a = swap(a, root, child);  
  61.         root = child;  
  62.       } else {  
  63.         return a;  
  64.       }  
  65.     }  
  66.   
  67.     return a;  
  68.   }  
  69. }  
With the Heapsort explanation out of the way, here is the article. It's very clear and well written by Guido van Rossum, the creator of Python. Here is the link.

In the comments, someone pointed out the link to a faster algorithm, which you can read about here.

Guido van Rossum replied with:

"That's a cool way of doing it, but not for Python, where an int takes 12 bytes and the list takes another 4 bytes per value. Given some other overhead you'd end up having to make a lot of passes over the full input (the same number as would be feasible with my version -- I used 100 but it could likely be less), which would probably be slower. It's interesting that a heap features in both algorithms. I'm not convinced that it's the fastest way to sort though, despite the O(N log N) complexity. Python's built-in sort is wicked."

By the way, if you like reading articles about problem solving using programming, I suggest you also read Solving Every Sudoku Puzzle and How to Write a Spelling Checker, both by Peter Norvig. They're very good articles. Does anyone know about any other very well written problem solving articles?