Programming, Software and Code

Parrot-Data-Structures Benchmarks

When I first conceived of the Parrot-Data-Structures project, I envisioned a place where we could develop performance-optimized PMC types. Part of proving that we've improved performance is providing benchmarks. So, this morning I went through and wrote up some naive benchmarks to compare several of my new PMC types against the venerable ResizablePMCArray. I didn't compare against the FixedPMCArray because it doesn't support push/pop/shift operations, so I wouldn't be able to make direct algorithmic comparisons.

I've only put together one benchmark so far for stacks: we push 1,000,000 items onto the stack and then pop them all back off again in tight loops. This forces the stack to resize up to 1,000,000 items' worth of storage. The times were small and I could have upped it to ten million items or more, but then we start to see more effects from caching pages to disk and lose insight into the application we are testing.

(FixedPMCStack) Time to push 1000000: 0.0456011295318604
(FixedPMCStack) Time to pop 1000000: 0.0357608795166016
(FixedPMCStack) Time to total 1000000: 0.0813620090484619

(ResizablePMCStack) Time to push 1000000: 0.0498058795928955
(ResizablePMCStack) Time to pop 1000000: 0.0467569828033447
(ResizablePMCStack) Time to total 1000000: 0.0965628623962402

(ResizablePMCStack2) Time to push 1000000: 0.0470800399780273
(ResizablePMCStack2) Time to pop 1000000: 0.0364069938659668
(ResizablePMCStack2) Time to total 1000000: 0.0834870338439941

(ResizablePMCArray) Time to push 1000000: 0.0531971454620361
(ResizablePMCArray) Time to pop 1000000: 0.0347628593444824
(ResizablePMCArray) Time to total 1000000: 0.0879600048065186
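
As a rough illustration of the shape of this benchmark (not the actual benchmark code, which is not shown in this post), here is a minimal Python sketch. The function name `bench_stack` is hypothetical, and a plain Python list stands in for a Parrot PMC, so the numbers it prints are not comparable to the ones above.

```python
import time

def bench_stack(make_stack, n):
    """Time n pushes followed by n pops; return (push_s, pop_s, total_s)."""
    stack = make_stack()          # stand-in container, not a Parrot PMC
    start = time.perf_counter()
    for i in range(n):
        stack.append(i)           # push
    mid = time.perf_counter()
    for _ in range(n):
        stack.pop()               # pop from the same end we pushed to
    end = time.perf_counter()
    assert len(stack) == 0        # everything pushed was popped again
    return mid - start, end - mid, end - start

if __name__ == "__main__":
    n = 100_000
    push_s, pop_s, total_s = bench_stack(list, n)
    print(f"(list) Time to push {n}: {push_s}")
    print(f"(list) Time to pop {n}: {pop_s}")
    print(f"(list) Time to total {n}: {total_s}")
```

Timing the push loop and the pop loop separately is what lets allocation cost (mostly paid during push) be told apart from deallocation cost (mostly paid during pop).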

I've shown three of my types as they compare to Parrot's ResizablePMCArray type. FixedPMCStack is fixed-size, so it's allocated up-front and does not need to be resized on each push. ResizablePMCStack is a linked list of mini-array chunks. Each chunk holds 16 pointers, so we can push 16 objects before a new allocation. ResizablePMCStack2 is an alternate stack implementation that I put together today. It uses a flat memory buffer but does geometric reallocations. Starting at 16 objects, every time we run out of space we allocate twice as much (16, 32, 64, etc). Its start size and growth factor can be tuned, though I haven't made any efforts to do that. Finally, there is the ResizablePMCArray, which resizes the buffer on each push.
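
To make the "linked list of mini-array chunks" idea concrete, here is a minimal Python sketch. It is only a model of the layout described above, not the ResizablePMCStack source: the names `ChunkedStack` and `_Chunk` are hypothetical, the `allocations` counter exists purely for illustration, and releasing a chunk as soon as it empties is an assumption.

```python
CHUNK_SIZE = 16  # pointers per chunk, as described in the post

class _Chunk:
    """One fixed-size block of slots, linked to the chunk below it."""
    __slots__ = ("items", "used", "prev")

    def __init__(self, prev):
        self.items = [None] * CHUNK_SIZE
        self.used = 0
        self.prev = prev

class ChunkedStack:
    def __init__(self):
        self.top = None        # chunk currently being filled
        self.size = 0
        self.allocations = 0   # illustration only: counts chunk allocations

    def push(self, value):
        # Allocate a new chunk only when there is none or the top one is full.
        if self.top is None or self.top.used == CHUNK_SIZE:
            self.top = _Chunk(self.top)
            self.allocations += 1
        self.top.items[self.top.used] = value
        self.top.used += 1
        self.size += 1

    def pop(self):
        if self.top is None:
            raise IndexError("pop from empty stack")
        chunk = self.top
        chunk.used -= 1
        value = chunk.items[chunk.used]
        chunk.items[chunk.used] = None
        self.size -= 1
        if chunk.used == 0:
            self.top = chunk.prev   # assumption: drop empty chunks right away
        return value
```

Pushing 40 items into this model allocates three chunks (16 + 16 + 8), and no existing item is ever copied when the stack grows.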

FixedPMCStack obviously wins the competition since it only needs to malloc once and never needs to reallocate or free the memory. At this sample size the benefits are not huge over the other types. ResizablePMCStack2 ekes out a win over ResizablePMCArray for this sample size. However, at smaller samples such as 10,000 objects to push, ResizablePMCArray wins instead. I suspect we could tune the size of allocated chunks in ResizablePMCStack or the starting allocation and growth factor of ResizablePMCStack2 to alter these results and the threshold where the first type becomes less efficient than the latter type. As the data set gets larger, the geometric growth of ResizablePMCStack2 (RPS2) takes over and severely decreases the number of allocations that need to be made, while for the basic ResizablePMCStack (RPS) type, the size and frequency of allocations are constant.
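
The difference in allocation counts can be worked out directly. The sketch below assumes the parameters described above (16-slot chunks for RPS, a 16-slot start and doubling for RPS2) and counts the initial buffer as one allocation; the function names are hypothetical.

```python
def chunked_allocations(n, chunk_size=16):
    """Allocations needed to push n items into fixed-size chunks."""
    return -(-n // chunk_size)        # ceiling division

def doubling_allocations(n, start=16, factor=2):
    """Allocations (initial buffer plus each regrowth) needed to hold n items."""
    capacity, count = start, 1
    while capacity < n:
        capacity *= factor
        count += 1
    return count

for n in (10_000, 1_000_000):
    print(n, chunked_allocations(n), doubling_allocations(n))
# prints:
# 10000 625 11
# 1000000 62500 17
```

Going from 10,000 to 1,000,000 items multiplies the chunk allocations by 100 but adds only six doublings. This counts allocations only; it ignores any copying a reallocation may have to do when the flat buffer grows.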

ResizablePMCArray performs well enough in these benchmarks. It does more allocations than any of my types, but uses a relatively efficient flat memory buffer to hold data, so it's not knocked out of the running entirely.

Now for the FIFO queue types. For these types, I used 100,000 items in a similar loop (push 100,000 items, then shift them all off). I didn't use 1,000,000 like I did in the stack tests above, for a reason I will discuss below.

(FixedPMCQueue) Time to push 100000: 0.00739598274230957
(FixedPMCQueue) Time to shift 100000: 0.00774002075195312
(FixedPMCQueue) Time to total 100000: 0.0151360034942627

(ResizablePMCQueue) Time to push 100000: 0.0121519565582275
(ResizablePMCQueue) Time to shift 100000: 0.00611591339111328
(ResizablePMCQueue) Time to total 100000: 0.0182678699493408

(ResizablePMCArray) Time to push 100000: 0.00558805465698242
(ResizablePMCArray) Time to shift 100000: 5.47745704650879
(ResizablePMCArray) Time to total 100000: 5.48304510116577

FixedPMCQueue uses a pre-allocated memory buffer, which saves a lot of time and makes it the obvious winner in the category. However, since it's set up as a ring buffer, each push/shift operation requires a few extra pointer checks that the other types don't need to go through. This is why the ResizablePMCArray wins among all types in push performance.
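
The "extra pointer checks" are the wrap-around tests that any ring buffer needs. Here is a minimal Python sketch of a fixed-capacity ring-buffer queue; it is a generic illustration rather than the FixedPMCQueue source, and the class name `RingQueue` and its error behaviour are assumptions.

```python
class RingQueue:
    """Fixed-capacity FIFO queue over a pre-allocated buffer."""

    def __init__(self, capacity):
        self.buf = [None] * capacity   # allocated once, never resized
        self.capacity = capacity
        self.head = 0                  # index of the next item to shift
        self.count = 0                 # number of items currently stored

    def push(self, value):
        if self.count == self.capacity:
            raise OverflowError("queue is full")
        tail = self.head + self.count
        if tail >= self.capacity:      # wrap-around check on the write side
            tail -= self.capacity
        self.buf[tail] = value
        self.count += 1

    def shift(self):
        if self.count == 0:
            raise IndexError("shift from empty queue")
        value = self.buf[self.head]
        self.buf[self.head] = None
        self.head += 1
        if self.head == self.capacity:  # wrap-around check on the read side
            self.head = 0
        self.count -= 1
        return value
```

Neither operation allocates memory or moves existing items; the only per-operation overhead is the bounds and wrap-around tests.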

Shift performance is a different issue entirely. FixedPMCQueue again isn't the winner here, but it is close. ResizablePMCQueue does just a little bit better using its efficient linked-list implementation. Even though each linked-list node needs to be free()'d, the implementation of free() in libc is pretty lightweight compared to the cost of malloc(). In fact, judging from the numbers above, free() looks to be about twice as fast as malloc(), all other things being equal.

Shift is where the ResizablePMCArray type loses its composure completely. FixedPMCQueue and ResizablePMCQueue each took between 1.5 and 1.9 hundredths of a second to complete the benchmark. ResizablePMCArray took roughly 300 to 360 times as long, because it needed almost 5.5 seconds just to do 100,000 shift operations. And that's not even the worst of it. A single RPA.shift() is O(n), so draining an array of n items with it is O(n^2). When I tried to bump this benchmark up to one million items, it ran for over 10 minutes with no end in sight before I had to kill it. Both my queue types are linear-time, and bumping up to one million items took only 10 times longer for them. Why is draining a ResizablePMCArray O(n^2)? Because each shift operation requires a memmove, which loops over every remaining item in the array and moves it. This is terrible and one of the reasons why I started the parrot-data-structures project in the first place.
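
The quadratic cost can be seen by counting element moves. The sketch below is a model, not Parrot code: it assumes each shift slides every remaining item down by one slot, which is the behaviour described above, and the function name is hypothetical.

```python
def moves_to_drain_by_shifting(n):
    """Element moves needed to empty an n-item array one shift at a time."""
    moves = 0
    remaining = n
    while remaining > 0:
        remaining -= 1        # the front item is removed...
        moves += remaining    # ...and every item behind it slides down one slot
    return moves              # equals n * (n - 1) // 2

for n in (100_000, 1_000_000):
    print(n, moves_to_drain_by_shifting(n))
# prints:
# 100000 4999950000
# 1000000 499999500000
```

Ten times as many items means about one hundred times as many moves. If the 5.5 seconds measured at 100,000 items scaled the same way, the million-item run would need on the order of nine minutes or more, which is consistent with the run that had to be killed.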

I plan on adding a few more benchmarks. For instance, rapid-fire push/pop benchmarks that don't cause resizes might be interesting to isolate the per-primitive operation cost without considering memory allocation costs. Your average program doesn't need to push one million items onto a stack or queue, of course. And with these benchmarks I'll be able to focus some optimization effort on these types to make them better.

The overall lesson to be learned from this post is this: for stack-like operations, the ResizablePMCArray is a decent, though not perfect, tool. For queue-like operations, on the other hand, ResizablePMCArray is lousy and should be avoided. At least, it should be avoided until we put some optimization effort into making it better.

This entry was originally posted on Blogger and was automatically converted. There may be some broken links and other errors due to the conversion. Please let me know about any serious problems.