Recent adventures in performance testing
I recently gained access to Intel’s Academic Manycore Testing Lab (MTL) to do some performance testing work on our compiler and runtime, Manticore. The environment contains three machines, each with four 8-core processors (32 cores per machine). One is reachable directly over SSH and the others through an easy-to-use job queueing system. Over the two weeks I had access, I didn’t get as much done as I’d have liked, but I did learn quite a bit.
We continue to scale well past 16 cores
We have shown in the past that Manticore scales very well on a variety of benchmarks up to 16 cores. This time, we’ve been able to show that we continue to see speedups all the way to 32 cores! Of course, there’s also bad news. While we have decent speedups at 16 cores and admirable speedups (relative to competitive implementations of the chosen benchmarks) at 24 cores, on some benchmarks our 32-core speedup is only slightly better than our 24-core speedup. We need to do better. More importantly, we need to figure out where the bottleneck is!
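To make the "only slightly better at 32 than at 24" comparison concrete, here is a minimal sketch of the usual speedup and parallel-efficiency arithmetic. The function names are mine and the timings in the usage note are made-up placeholders, not Manticore measurements.

```python
def speedup(t_one_core, t_n_cores):
    """Classic speedup: single-core wall-clock time over n-core wall-clock time."""
    return t_one_core / t_n_cores


def efficiency(t_one_core, t_n_cores, cores):
    """Parallel efficiency: speedup divided by core count (1.0 is ideal)."""
    return speedup(t_one_core, t_n_cores) / cores


def marginal_gain(t_fewer_cores, t_more_cores):
    """How much faster the larger configuration is than the smaller one."""
    return t_fewer_cores / t_more_cores
```

For example, with hypothetical timings of 120 s on one core and 5 s on 32 cores, `speedup(120.0, 5.0)` is 24.0 and `efficiency(120.0, 5.0, 32)` is 0.75. A `marginal_gain` close to 1.0 between the 24-core and 32-core runs is the symptom described above.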
We had a huge GC bug lurking, more than a year after our last one
Our benchmarks, and our compiler’s output in general, have gone through some very heavy stress testing on decently sized manycore machines. However, whenever you double the number of cores, you can expect to find something new. In this case, one of the three machines has SMT enabled, which allows testing on 64 logical cores. And under load… there’s a garbage collection bug. Three days of groveling over the heap, generated assembly, and instrumented binaries enabled me to work around it. One of the downsides of a highly parallel garbage collector and a fairly complicated scheduler, though, is that when there’s a bug, it can take a while to track down the real root cause.
SMT does not appear to be useful for high-computation loads
At least for the sorts of work that most of our benchmarks do, SMT (which doubles the number of logical cores by allowing two hardware threads per physical core) does not offer significant advantages. Of course, this will need more investigation. Right now, each additional thread we spawn needs its own nursery for garbage collection, but we don’t shrink those nurseries so that they all still fit within the L3 cache. That potentially causes memory thrashing that isn’t evident without the additional SMT threads. I also suspect that the poorer speedups at lower processor counts are due to how we place threads: we pack them densely onto logical processors, so both SMT threads of a core fill up before the next core is used. The alternative is to pack densely at the package and core level and use the extra SMT threads last.
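The two ideas in the paragraph above can be sketched as follows. This is an illustration under stated assumptions, not what the Manticore runtime does: the function names, the `reserved_bytes` parameter, and the `(package, core, smt)` slot representation are all hypothetical.

```python
def nursery_size_bytes(l3_bytes, threads_sharing_l3, reserved_bytes=0):
    """Largest per-thread nursery such that every nursery sharing one L3
    cache fits in it at the same time.

    reserved_bytes is a hypothetical allowance for non-nursery data.
    """
    usable = l3_bytes - reserved_bytes
    return usable // threads_sharing_l3


def placement_order(packages, cores_per_package, smt_per_core):
    """Order hardware-thread slots so that SMT siblings are used last.

    Returns (package, core, smt) tuples. All first hardware threads come
    first, filling one package's cores before moving to the next package;
    second SMT threads are handed out only after every physical core is busy.
    """
    order = []
    for smt in range(smt_per_core):
        for pkg in range(packages):
            for core in range(cores_per_package):
                order.append((pkg, core, smt))
    return order
```

On a machine shaped like the ones described here (4 packages, 8 cores each, 2 SMT threads per core), `placement_order(4, 8, 2)` yields 64 slots, and the first 32 all have `smt == 0`. Going from 8 to 16 threads per package halves the result of `nursery_size_bytes`, which is the resizing the current runtime does not yet do.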
Want to learn more? These and other results are the topic of a large journal paper we’re writing on our work building the Manticore runtime system, which was specifically designed to take advantage of manycore machines. Of course, the journal paper will have real numbers, rudimentary statistical analysis, and concrete recommendations. This article is just a post to thank the fantastic folks at Intel for making the Manycore Testing Lab available. I highly recommend it to anyone in an academic setting who doesn’t have an extra $30k to spend on one of these machines. Well, after I finish slamming those machines with work myself, of course!