Concurrent Garbage Collection
25 Jun 2009

I mentioned in passing a few posts ago that cotto sent me a link to a paper about a concurrent garbage collector called VCGC. eternaleye sent me another paper, this one called MCGC, which proposes improvements to the VCGC algorithm while following the same general concept. I read over those two quite enthusiastically, and then moved on to read a paper about G1, the new Java VM garbage collector. There are some similarities between the two approaches, but plenty of differences too. In the end I think a collector somewhere in this family is going to be the GC of choice for Parrot (at least, for birds built on multithreaded systems). Of course, it may be just one of a possibly large stable of collectors, each suited for different needs.
The first paper introduces the "Very Concurrent GC" (VCGC), which is able to use a multithreaded approach without the need for fine-grained thread synchronization. "Balderdash!" I can hear you saying, but have faith: I've read the paper and the algorithm is so beautiful in its simplicity that I can't believe I didn't think of it first. And it's very plausible. Here's the gist of it: each memory allocation round has a color. We allocate memory with color x and operate like normal until we hit a synchronization point. At the synchronization point, we increment x. All memory allocated before the synchronization point now has color x-1, and all memory allocated thereafter will have color x. We continue executing until the next synchronization point, and then we bump x up again. Memory allocated in the first window now has color x-2, memory allocated in the second window now has color x-1, and so on. All memory chunks with color x-2 are swept and reclaimed to the system.
In this system, memory blocks implicitly change color because we change what the color numbers mean at each synchronization point. Without proactive marking, blocks "fall off" the end of the window and get collected by a collection thread. The collection thread has no other job than to iterate over the allocation space and free all memory with color x-2. We call this thread the "Sweeper". Saving chunks from certain doom is the job of the "Marker", the only thread in the system capable of changing the color of a block. The Marker runs a normal GC mark algorithm, starting at the root set and bumping all reachable memory to color x. The point when both the Marker thread and the Sweeper thread have completed their current run is the synchronization point, which is when we increment x (called the "epoch") and restart both threads. It's simple and low-overhead because it doesn't require any moving or compacting, and it only has to twiddle a handful of bits to make everything work. It also appears to scale well to multicore systems and multithreaded programs.
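To make the color/epoch machinery concrete, here is a minimal single-threaded sketch of the idea as described above. Everything is invented for illustration (the `vc_` names, the block layout, and the `reachable` flag standing in for a real trace from the root set); the actual VCGC runs the Marker and Sweeper as separate threads, which this toy does not attempt.

```c
#include <stdlib.h>

typedef struct vc_block {
    int color;            /* epoch in which the block was last allocated or marked */
    int reachable;        /* stand-in for "found by tracing from the root set" */
    struct vc_block *next;
} vc_block;

typedef struct {
    vc_block *blocks;     /* all allocated blocks */
    int epoch;            /* the global epoch counter, "x" in the text */
} vc_heap;

/* Allocation tags the new block with the current epoch's color. */
vc_block *vc_alloc(vc_heap *h) {
    vc_block *b = malloc(sizeof *b);
    b->color = h->epoch;
    b->reachable = 1;
    b->next = h->blocks;
    h->blocks = b;
    return b;
}

/* Marker: bump every reachable block to the current epoch's color. */
void vc_mark(vc_heap *h) {
    for (vc_block *b = h->blocks; b; b = b->next)
        if (b->reachable)
            b->color = h->epoch;
}

/* Sweeper: anything still colored epoch-2 was neither allocated nor
 * marked during the last two windows, so it must be garbage. */
void vc_sweep(vc_heap *h) {
    vc_block **p = &h->blocks;
    while (*p) {
        vc_block *b = *p;
        if (b->color <= h->epoch - 2) {
            *p = b->next;
            free(b);
        } else {
            p = &b->next;
        }
    }
}

/* Synchronization point: both "threads" have finished, so advance the epoch. */
void vc_sync(vc_heap *h) {
    h->epoch++;
}
```

Note how a block that stops being reachable is never touched again by the Marker: its color simply stays put while the epoch advances past it, and two sync points later the Sweeper reclaims it.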
G1 is a very interesting collector that I find conceptually not entirely dissimilar from VCGC, although I'm probably minimizing the differences in my own mind. It takes a lazier approach, picking low-hanging fruit to reduce the need for complete end-to-end GC runs and to make allocation more efficient. G1 divides the heap into regions, and focuses on the regions with the fewest active blocks. Within a region, G1 frees any garbage it finds and copies any live items to a "dense prefix" somewhere else. This frees the entire region for the allocator to use for easy linear allocations.
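A toy sketch of that region strategy, going only on the description above: pick the region with the least live data (the cheapest to evacuate), copy its survivors out, and hand the whole region back for linear allocation. The region layout and all names here are invented for illustration and are nothing like HotSpot's real data structures.

```c
#include <stddef.h>

#define REGION_SLOTS 8

typedef struct {
    int live[REGION_SLOTS]; /* 1 = live object in this slot, 0 = garbage/empty */
    int used;               /* slots handed out so far; reset when reclaimed */
} region;

static int region_live_count(const region *r) {
    int n = 0;
    for (int i = 0; i < REGION_SLOTS; i++)
        n += r->live[i];
    return n;
}

/* Pick the region with the fewest live objects: the cheapest to evacuate. */
int pick_cheapest_region(const region *rs, int nregions) {
    int best = 0;
    for (int i = 1; i < nregions; i++)
        if (region_live_count(&rs[i]) < region_live_count(&rs[best]))
            best = i;
    return best;
}

/* Evacuate survivors into a destination region (the "dense prefix"),
 * after which the whole source region is free for linear allocation. */
void evacuate(region *src, region *dst) {
    for (int i = 0; i < REGION_SLOTS; i++) {
        if (src->live[i] && dst->used < REGION_SLOTS) {
            dst->live[dst->used++] = 1; /* copy survivor to the dense prefix */
            src->live[i] = 0;
        }
    }
    src->used = 0; /* region is now entirely reusable */
}
```

The payoff is the same one the post describes: the allocator never has to hunt through a free list inside a reclaimed region, it just bumps `used`.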
G1 appears to be very heterogeneous in that memory of all sizes is allocated from a single pool. In that sense, a G1-like collector may be suitable for use with our STRING system, which is badly in need of performance tuning. Something like VCGC would probably be more useful for the common case of homogeneous header pools, like our PMC pools and our sized pools, which facilitate very rapid array indexing through the pool.
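For contrast with G1's single heterogeneous pool, here is a sketch of what a homogeneous sized pool buys you. Because every slot is the same size, slot i lives at `base + i * slot_size`, so a sweeper (or anything else) can walk the pool by plain array indexing. The names are illustrative, not Parrot's actual pool API.

```c
#include <stdlib.h>

typedef struct {
    char  *base;      /* one contiguous arena of nslots * slot_size bytes */
    size_t slot_size;
    size_t nslots;
    size_t next_free; /* bump index for the next allocation */
} sized_pool;

int pool_init(sized_pool *p, size_t slot_size, size_t nslots) {
    p->base = calloc(nslots, slot_size);
    p->slot_size = slot_size;
    p->nslots = nslots;
    p->next_free = 0;
    return p->base != NULL;
}

/* O(1) allocation: hand out the next slot in the arena. */
void *pool_alloc(sized_pool *p) {
    if (p->next_free >= p->nslots)
        return NULL; /* arena exhausted */
    return p->base + p->next_free++ * p->slot_size;
}

/* O(1) index-to-address mapping: the "rapid array indexing" property. */
void *pool_slot(const sized_pool *p, size_t i) {
    return p->base + i * p->slot_size;
}
```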
VCGC and its variants suffer from a few worst-case scenarios: situations where data is very long-lived (so the collector wastes time repeatedly examining blocks that are not garbage), and situations where garbage is very short-lived (blocks become garbage quickly but must wait up to two epochs to be swept, which increases memory consumption). The benefit, of course, is the simplicity of the algorithm. Adding complexity, as MCGC does, can reduce these problems.
My plan in the near future, probably after the AIO project, is to start prototyping a new concurrent collector core modeled on VCGC. I will use a simple and direct implementation of it initially, no bells or whistles or fancy-schmance optimizations. Once we have a basic concurrent core installed and working properly, it will be easier to add these kinds of optimizations in an incremental fashion.
And I have plenty of potential optimizations in mind, including some simple tweaks to the basic algorithm that I think will add some time-saving generational semantics. If you have some ideas too, I would love to hear them. I may create a planning page on the wiki soon to start putting ideas together for this.
What's really most important at this point, and pmichaud mentioned this several times at YAPC::NA, is just getting a second core working. We don't even care which core it is; we just need a second one to prove that our architecture is pluggable and to work out any kinks in the API. Once we have a second core in place, it will be that much easier to add a third and a fourth, etc. With all this in mind, we really don't need to be swinging for the fences right now, just looking to add something quick that maybe offers some small performance benefit over the current system.
This entry was originally posted on Blogger and was automatically converted. There may be some broken links and other errors due to the conversion. Please let me know about any serious problems.
Comments
skids
6/26/2009 2:50:29 PM
Concurrent sounds great, and if I hadn't gotten totally distracted last night I would have read more of this material already. I'll have to catch up on it soon.
In the meantime, my general comments: cache-friendliness of GCs should be a high priority. Many of the allocations are aligned (I think? Do we have a policy about what has to be aligned for any data structures? Where?), so that opens the door to out-of-band GC data, or to robbing the low pointer bits -- in addition to the memory arenas, which already provide some neat advantages.
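The bit-robbing trick skids mentions can be sketched briefly: if every GC-managed block is aligned to at least 4 bytes, the bottom two bits of any pointer to one are always zero and can carry GC metadata (a mark flag, say) without pulling the object itself into cache. The macro and helper names here are invented for illustration.

```c
#include <stdint.h>

#define GC_MARK_BIT 0x1u

/* Store a flag in the low bit of a suitably aligned pointer. */
static inline void *ptr_set_mark(void *p) {
    return (void *)((uintptr_t)p | GC_MARK_BIT);
}

static inline int ptr_is_marked(void *p) {
    return (int)((uintptr_t)p & GC_MARK_BIT);
}

/* Strip the tag before dereferencing. */
static inline void *ptr_strip(void *p) {
    return (void *)((uintptr_t)p & ~(uintptr_t)GC_MARK_BIT);
}
```

The usual caveat applies: this only works where the alignment guarantee actually holds, which is exactly why the alignment policy question above matters.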
One thing I don't see much in GC literature is the idea of allowing objects to shrink. They can be destroyed, but some objects, like Hash, could afford to allocate more aggressively if they knew they would get a callback later giving them some time to trim their allocations if they are no longer growing, or to compress to a read-optimized form if they find themselves stagnant.
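That shrink-callback idea might look something like the following sketch: an object registers a hook the GC can invoke during idle periods, giving it a chance to return slack capacity. The struct layout and function names are hypothetical, not any existing Parrot API.

```c
#include <stdlib.h>

typedef struct growable {
    size_t used;
    size_t capacity;
    int   *data;
    /* Hook the GC may call when the object has been stagnant for a while. */
    void (*shrink)(struct growable *self);
} growable;

/* One possible shrink hook: give back over-eagerly allocated capacity. */
void shrink_to_fit(growable *g) {
    if (g->used < g->capacity) {
        size_t keep = g->used ? g->used : 1;
        int *smaller = realloc(g->data, keep * sizeof *g->data);
        if (smaller) {           /* realloc preserves the retained contents */
            g->data = smaller;
            g->capacity = keep;
        }
    }
}

/* What a GC idle pass might do with the registered objects. */
void gc_idle_pass(growable **objs, size_t n) {
    for (size_t i = 0; i < n; i++)
        if (objs[i]->shrink)
            objs[i]->shrink(objs[i]);
}
```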
I'm not so concerned about "unnecessary traversals" if the GC is throttled by a performance measure. GCs do not have to run full bore when there's not a pressing need to free memory. During those times, if we could cap them at a certain percentage of the runtime of the rest of the application, then their runtime would only represent a defined -- perhaps even tunable -- level of overhead that scales with the resource use of the application itself. So you wouldn't have any of this sit-and-spin behavior we see on some poorly coded Java applications.
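The throttling idea above reduces to simple accounting: let the GC take another slice of work only while its accumulated time stays under a fixed percentage of total elapsed time. The following is a toy model of that policy, not a real scheduler; all names are invented.

```c
typedef struct {
    double mutator_time;   /* time the application has run, any consistent unit */
    double gc_time;        /* time the collector has used so far */
    double budget_percent; /* e.g. 10.0 => GC may use up to 10% of total time */
} gc_throttle;

/* May the GC take another slice of work right now? */
int gc_may_run(const gc_throttle *t) {
    double total = t->mutator_time + t->gc_time;
    if (total <= 0.0)
        return 1; /* nothing measured yet; let the GC start */
    return (t->gc_time / total) * 100.0 < t->budget_percent;
}

void account_mutator(gc_throttle *t, double dt) { t->mutator_time += dt; }
void account_gc(gc_throttle *t, double dt)      { t->gc_time      += dt; }
```

With a cap like this, GC overhead scales with the resource use of the application itself, which is exactly the "defined, perhaps even tunable, level of overhead" skids describes.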
Of course, doing so involves having either a portable performance counter (most platforms have one right at hand; it's just a matter of abstracting them) or a rough idea of the resource usage of each function and a way to keep a tally...
Finally, it is of course silly to actually visit a leaf node looking for children. Moreover, visits should be short-circuited without pulling leaf nodes into cache just to check a flag. At the very least, if you have objects you know cannot ever reference others, that information should be out of band -- perhaps another good use of pointer bit-robbing, or of arenas, even in GCs that are not strictly out-of-band.