# How do you quickly sort 84 million unique Integers?

For the last few weeks, I’ve been working on increasing the performance of various areas at work and one of the areas that kept on hitting my profiles has been sorting integers. All of us with a Computer Science degree have sorted integers before and we know how to cost algorithms with Big-O notation, but most of the time, the default sort in Java, C#, and other languages is “good enough” that we don’t really think of it.

When it starts to show up on a profile, though, that is the time to start thinking about it. As with many things, algorithms based on N usually perform well when N is reasonably sized, so you do not always notice the issue until N becomes interesting or the number of times you call the algorithm increases.

Whenever diagnosing performance problems, it is important to characterize whether one or a few big calls to the method account for the entire time sink, or whether it is the sheer number of calls to that method. In either case, a reasonable first question is whether that call is really necessary, or whether it is in the right location. More specifically, if you are not running binary searches or other algorithms or use-cases that require the values to be sorted, why sort them at all? If for no other reason, this forces you to think through how the code is or could be used, and it sometimes leads to a different solution.

In my particular case, there were a few large sorts that were taking all the time. Each of these sorts was sorting approximately 84 million integers via the JDK standard Arrays.sort() or the Colt IntArrayList.sort() method, and taking approximately 10 seconds per sort. Depending on your background, 10 seconds may be an eternity or a dream; suffice it to say that the entire processing required for this component had to be less than 3 seconds, and approximately half of that budget had already been accounted for.

So assuming it is required, how do you sort faster? An obvious approach to server engineers is to multithread this operation, such as sorting chunks of that memory in parallel and then merging them together, or perhaps a multi-threaded version of quick-sort, and I am sure that my team expected me to go hide in a corner for a while and code up one of these approaches...

But I didn’t. You see, there’s something particularly interesting about the integers we were sorting. First, they are all positive, and second, there are no duplicates. So instead of sorting them, can we do something else with them from which we can infer the sorted order, with better memory locality?

Well, one way to do it would be to use a set of bits. Say the largest number in the array is 200 million: you can allocate 200 million bits and then sort the numbers using an algorithm like this:

- for each N in array
  - set bit N
- while more bits
  - find next set bit
  - set array[i] to that bit
  - increment i

Before we go further, you will note that the above algorithm goes through the array twice, which may seem bad. However, there is another thing to keep in mind when assessing this algorithm: memory locality. In the second loop especially, you are effectively iterating sequentially through two arrays that are probably already in the L1 or L2 cache, which can be a big performance improvement on server-grade processors.

The following sample code implements this in Java using the BitSet class. I’ve personally not gone through the BitSet code to see whether it is the optimal implementation for what I am doing with it, but after this change, the performance was in the acceptable range.

```java
import java.util.BitSet;

// Record each value as a set bit, then read the bits back in ascending order.
BitSet bits = new BitSet();
for (int i = 0; i < array.length; ++i) {
    bits.set(array[i]);
}
int lastBit = 0;
int i = 0;
while (true) {
    lastBit = bits.nextSetBit(lastBit);
    if (lastBit == -1) {
        break;  // no more set bits: the array has been fully rewritten in order
    }
    array[i] = lastBit;
    ++lastBit;
    ++i;
}
```

Performance across machines will vary. On my work servers with the latest Xeons, it averages around 800ms per sort, compared to an average of 10 seconds. On my home server, it is a little slower but averages around 9 seconds, versus 23 seconds for the QuickSort algorithm.

Now the above algorithm is not optimal for all sizes. In my tests, the magical point was somewhere around one million elements, where the performance of the default Arrays.sort() and the above crossed over, so we do a quick size check and dispatch either to the standard QuickSort or to the BitSet-based sort.
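That size check can be sketched roughly as below. The class name, the threshold constant, and its exact value are illustrative only; the crossover point (around one million in my tests) will vary by hardware, so measure on your own machines.

```java
import java.util.Arrays;
import java.util.BitSet;

public class HybridSort {
    // Illustrative threshold; the real crossover point depends on the hardware.
    private static final int BITSET_THRESHOLD = 1_000_000;

    // Sorts an array of unique, non-negative ints, picking the strategy by size.
    public static void sort(int[] array) {
        if (array.length < BITSET_THRESHOLD) {
            Arrays.sort(array);   // the JDK's default sort for primitives
        } else {
            bitSetSort(array);    // BitSet-based sort for large inputs
        }
    }

    static void bitSetSort(int[] array) {
        BitSet bits = new BitSet();
        for (int v : array) {
            bits.set(v);
        }
        int i = 0;
        // Walk the set bits in ascending order, writing them back into the array.
        for (int bit = bits.nextSetBit(0); bit != -1; bit = bits.nextSetBit(bit + 1)) {
            array[i++] = bit;
        }
    }
}
```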

Those familiar with the BitSet API may point out that I could gain better performance by telling BitSet up front how many bits I am interested in and allowing it to pre-allocate the memory for them. Unfortunately, I did not have that information available for most areas in the code, and it was not feasible to guess in this area of the code. What was genuinely interesting, though, is that doing a pre-scan of the entire array to find the maximum value did not change the runtime of the algorithm significantly; the TP99 of the two variants differed by only around 100 milliseconds in my tests. This could definitely be an area in which to improve the above algorithm going forward.
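For reference, the pre-scan variant looks like this. It is a sketch, not the code from my actual change: one extra sequential pass finds the maximum value, so the BitSet can size its backing storage once via the `BitSet(int nbits)` constructor instead of growing as bits are set.

```java
import java.util.BitSet;

public class PreSizedBitSort {
    // Sorts unique, non-negative ints; pre-scans for the max so the
    // BitSet is allocated once at its final size.
    public static void sort(int[] array) {
        int max = 0;
        for (int v : array) {
            if (v > max) {
                max = v;
            }
        }
        BitSet bits = new BitSet(max + 1);  // pre-allocated to the largest value
        for (int v : array) {
            bits.set(v);
        }
        int i = 0;
        for (int bit = bits.nextSetBit(0); bit != -1; bit = bits.nextSetBit(bit + 1)) {
            array[i++] = bit;
        }
    }
}
```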

If the above algorithm is still not fast enough for you, there is room for improvement. First, you can look at pre-allocating the BitSet, as discussed above. Second, the second loop of the algorithm is easily parallelizable, and perhaps the BitSet implementation could be simplified for this use-case.

As I mentioned above, the algorithm only works for unique, positive integers. If you have negative numbers, you could use a second BitSet for the absolute values of the negative numbers and iterate it in reverse (applying the sign, obviously), which would also work. Duplicates, however, break the approach entirely; if you have that problem, ask yourself whether you really need the sort at all, or whether you need the duplicates, and if you do, consider a parallel sorting algorithm.
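The negative-number idea sketched above might look like this. This is my own hypothetical extension, not code from the post's original change; it assumes the input contains no `Integer.MIN_VALUE` (whose absolute value overflows an int) and, as before, no duplicates.

```java
import java.util.BitSet;

public class SignedBitSort {
    // Sorts unique ints that may be negative: negatives go into a second
    // BitSet keyed by absolute value, then that set is walked from the
    // highest bit down so that, e.g., -9 comes out before -3.
    // Assumes no Integer.MIN_VALUE in the input.
    public static void sort(int[] array) {
        BitSet positives = new BitSet();
        BitSet negatives = new BitSet();
        for (int v : array) {
            if (v >= 0) {
                positives.set(v);
            } else {
                negatives.set(-v);
            }
        }
        int i = 0;
        // Highest absolute value first => most negative number first.
        for (int bit = negatives.previousSetBit(negatives.length());
             bit >= 0;
             bit = negatives.previousSetBit(bit - 1)) {
            array[i++] = -bit;
        }
        // Then the non-negative values in ascending order.
        for (int bit = positives.nextSetBit(0); bit != -1; bit = positives.nextSetBit(bit + 1)) {
            array[i++] = bit;
        }
    }
}
```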

Now off to another interesting problem. Wanna help?