I've read about it, but have never actually been asked it in an interview. I think the correct response is that one million 32 bit integers would not fit in 2MB of RAM. But if you assume the problem implies you have an unlimited amount disk space, then the problem is actually solvable. I recently read a good article going over the solution to this problem, so I thought I'd share the links.
The article (and its follow up) assume you know about heaps and heapsort. In case you don't, a heap is a tree-like data structure and heapsort is a very fast sorting algorithm. You can read about them more on wikipedia, Heap here and Heapsort here. If you're unfamiliar with heapsort, I'd strongly suggest you read about Sorting Algorithms in general. Although you'll probably never need to program one for work, it's really nice topic to know about and will probably help you learn to judge your own algorithms better. By the way, here is a java implementation of Heapsort based on the pseudocode from its wikipedia article.
- public class Heap {
- public static int[] swap( int[] a, int first, int last ) {
- int tmp = a[first];
- a[first] = a[last];
- a[last] = tmp;
- return a;
- }
- public static int[] heapSort( int[] a, int count ) {
- a = heapify(a, count);
- int end = count - 1;
- while( end > 0 ) {
- //swap the root (maximum value) of the heap with the last element of the heap
- a = swap( a, 0, end );
- //decrease the size of the heap by one so that the previous
- //max value will stay in its proper placement
- end--;
- //put the heap back in max-heap order
- a = siftDown(a, 0, end);
- }
- return a;
- }
- public static int[] heapify(int[]a, int count ) {
- //Start is a's index of the last parent node
- int start = (count-2)/2;
- while( start >= 0 ) {
- //Sift down the node at index start to the proper place such that
- //all nodes below the start index are in heap order
- a = siftDown(a, start, count-1);
- start--;
- }
- return a;
- }
- public static int[] siftDown(int[] a, int start, int end) {
- int root = start;
- //While the root has at least one child...
- while( root *2 + 1 <= end ) {
- //root*2 points to the left child
- int child = root * 2 + 1;
- //If the child has a sibling and the child's value is less than its sibling's...
- if( child + 1 <= end && a[child] < a[child+1] ) {
- //Then point to the right child instead
- child = child+1;
- }
- //If out of max heap order...
- if( a[root] < a[child] ) {
- //Then repeat to continue sifting down the child now
- a = swap(a, root, child);
- root = child;
- } else {
- return a;
- }
- }
- return a;
- }
- }
In the comments, someone pointed out the link to a faster algorithm, which you can read about here.
Guido van Rossum replied with:
"That's a cool way of doing it, but not for Python, where an int takes 12 bytes and the list takes another 4 bytes per value. Given some other overhead you'd end up having to make a lot of passes over the full input (the same number as would be feasible with my version -- I used 100 but it could likely be less), which would probably be slower. It's interesting that a heap features in both algorithms. I'm not convinced that it's the fastest way to sort though, despite the O(N log N) complexity. Python's built-in sort is wicked."
By the way, if you like reading articles about problem solving using programming, I suggest you also read Solving Every Sudoku Puzzle and How to Write a Spelling Checker, both by Peter Norvig. They're very good articles. Does anyone know about any other very well written problem solving articles?
1 comment:
Here is one more approach of sort 4GB files with limited RAM using merge sort
http://anandtechblog.blogspot.com/2010/12/how-to-sort-4gb-files-google-question.html
Post a Comment