Jan 18, 2006

java.util.concurrent Memory Leaks...

About two months ago, one of the Java servers my group maintains ran out of memory. Since then, whenever the service has been up for more than a week, we again get the dreaded java.lang.OutOfMemoryError. Each time this happened, we took a histogram of the memory usage (jmap -histo pid). After gathering a few of those and comparing the results to what we expected, two classes seemed to be using a suspicious amount of memory: java.util.concurrent.LinkedBlockingQueue$Node and java.util.concurrent.locks.AbstractQueuedSynchronizer$Node. So I started looking at how these classes were used.

java.util.concurrent.LinkedBlockingQueue was used in several places in the program; however, its usage was fairly straightforward. Besides its Node class usually being the one that used the most memory, the other thing that made this class suspect was that its usage had been added just a week before the problem first occurred. Furthermore, in all of the tests that I ran, the number of Nodes would increase significantly, but when a full garbage collection ran (either automatically or when forced), the number of nodes would drop to the number of entries that I expected. This made me suspect that perhaps the garbage collector was not running when the OutOfMemoryError was thrown, or that there was not enough memory at that point to even run the garbage collector, but alas, I was quickly brought back to my senses.

During this investigation, I googled the class, which turned up some interesting results. On the Concurrency Interest mailing list, an entry from last year suggests a memory leak in LinkedBlockingQueue, and it was confirmed by Doug Lea. I rewrote the program suggested in the message (in the method originallyReportedLeak) to confirm that it was an actual issue, and sure enough, it is a current problem. If you run the program with the JMX arguments, you can attach jconsole and watch the memory grow; no amount of forcing the garbage collector to run will bring it down.
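To make the pattern concrete, here is a minimal sketch of the kind of loop that exercises it. This is not the originallyReportedLeak program itself. The class name, method name, and iteration count are made up for illustration, and it assumes the leak is triggered by timed polls that expire on an empty queue.

```java
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: hammer an empty queue with timed polls that always expire.
public class QueuePollTimeoutSketch {

    // Returns how many polls timed out (all of them, unless the thread is interrupted).
    public static int pollEmptyQueue(int iterations, long timeoutMillis) {
        LinkedBlockingQueue<String> queue = new LinkedBlockingQueue<String>();
        int timedOut = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                // Nothing is ever offered, so poll() returns null once the timeout elapses.
                if (queue.poll(timeoutMillis, TimeUnit.MILLISECONDS) == null) {
                    timedOut++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                break;
            }
        }
        return timedOut;
    }

    public static void main(String[] args) {
        int iterations = args.length > 0 ? Integer.parseInt(args[0]) : 1000;
        System.out.println("timed-out polls: " + pollEmptyQueue(iterations, 1));
    }
}
```

Run it with a large iteration count and watch the heap in jconsole. On a JVM that already has the fix, the heap should stay flat.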

But of course, that was not my problem. The leak in the above program has to do with a timeout occurring. In all but one case, we were not using the methods with a timeout, and in that one case the timeout value was MAX_INT; because of the way we use this particular service, there is no way it could have timed out and leaked. And as I mentioned above, forcing a garbage collection always brought the node count back to normal, so alas, back to the drawing board.

The other class that used outrageous amounts of memory, AbstractQueuedSynchronizer’s Node inner class, was not as easy to trace to where it was used. Using Borland OptimizeIt, however, I found that it was used by java.util.concurrent.CountDownLatch, where the code was calling await() with a one-second timeout, and in this case the timeout was always triggered. Sound familiar? I wrote a program (in the method anotherSimilarLeak) that demonstrates this precise memory leak, and sure enough, it is the same underlying problem as in the LinkedBlockingQueue above.
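The CountDownLatch case can be sketched the same way. Again, this is not the anotherSimilarLeak program itself, and the names are invented. It assumes the same latch is awaited repeatedly and never counted down, so every timed await() expires.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: repeatedly await a latch that nobody ever counts down.
public class LatchAwaitTimeoutSketch {

    // Returns how many awaits timed out (all of them while the latch stays closed).
    public static int awaitUntilTimeout(CountDownLatch latch, int iterations, long timeoutMillis) {
        int timedOut = 0;
        for (int i = 0; i < iterations; i++) {
            try {
                // await() returns false when the timeout elapses before the count reaches zero.
                if (!latch.await(timeoutMillis, TimeUnit.MILLISECONDS)) {
                    timedOut++;
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt and stop
                break;
            }
        }
        return timedOut;
    }

    public static void main(String[] args) {
        CountDownLatch neverReleased = new CountDownLatch(1);
        // The service used a one-second timeout; a shorter one shows the effect faster.
        System.out.println("timed-out awaits: " + awaitUntilTimeout(neverReleased, 1000, 1));
    }
}
```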

In this particular program, we decided that the latch was not necessary for this method, and that replacing it with a sleep was all that was needed. This bug is fixed in Mustang, the next version of Java, so hopefully your project can wait until it is released. Otherwise, you may unfortunately need to roll your own timeout functionality... just make sure to properly document such hacks so that you can remove them in the near future!
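If you do end up rolling your own, one possible shape is a tiny latch built on plain synchronized/wait/notifyAll, which sidesteps AbstractQueuedSynchronizer entirely. This is only a sketch of the idea, not what we shipped. The class name is hypothetical, and for brevity it treats an interrupt like a timeout (restoring the interrupt flag) instead of throwing InterruptedException as the real CountDownLatch does.

```java
// Hypothetical stopgap: a count-down latch on the object's own monitor.
// HACK: remove once the JDK fix is available and use CountDownLatch again.
public class MonitorLatchSketch {
    private int count;

    public MonitorLatchSketch(int count) {
        if (count < 0) {
            throw new IllegalArgumentException("count < 0");
        }
        this.count = count;
    }

    // Decrements the count and wakes all waiters when it reaches zero.
    public synchronized void countDown() {
        if (count > 0 && --count == 0) {
            notifyAll();
        }
    }

    // Returns true if the count reached zero, false on timeout or interrupt.
    public synchronized boolean await(long timeoutMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1000000L;
        while (count > 0) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // timed out
            }
            try {
                // Both arguments are never zero here, so this cannot wait forever.
                wait(remainingNanos / 1000000L, (int) (remainingNanos % 1000000L));
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // simplification: report as not released
                return false;
            }
        }
        return true;
    }
}
```

The loop re-checks the count after every wakeup, so spurious wakeups are harmless. Very large timeouts would overflow the multiplication, which is fine for a documented hack but worth knowing.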
