Morsing's Blog

1 February 2016

Causal Profiling for Go


Last year, I came across this paper which describes a novel profiling algorithm, which I have decided to implement for Go.

Why do we need it?

Causal profiling is a way of measuring your program's performance that works around some of the shortcomings of regular signal-based profiling.

In regular profiling, the profiler tells the kernel to interrupt your program every time a certain number of instructions have been executed. At each interrupt, the profiler takes a sample of which part of the program is running and continues execution. After enough samples have been taken, you gather various statistics on where the program was interrupted. Since the samples happen regular intervals, you can tell from the distribution of them where your program is using CPU time. This is incredibly useful when optimizing, because you can see the places where you might be using more CPU time than you thought you were.

The biggest shortcoming of regular profiling is that there isn't necessarily a correlation between where CPU time is being spent and progress being made. The classic example of a construct that break this assumption is a mutex. A thread in your program could take a mutex and sleep for a long time. Meanwhile, every other thread in your system would be forced to wait until that mutex is released. If the other threads could be doing some useful work, that potential doesn't show up in your regular CPU profiles, since nothing is using the CPU.

While regular profiling can show you how CPU time is being spent, causal profiling can show you where CPU time is not being spent.

How does it work?

Causal profiling is fairly simple. You insert markers into your program that measure how long it takes to perform some operation. Usually that would be around some unit of work, like servicing an HTTP request. You then do an experiment where you take a random point in your program, make it faster and see how that changed the time it took for the program to do the operation. With enough of these experiments, you can see how optimizing various parts of your program can impact performance. If you can get a 30% increase in performance by making something 15% faster, that indicates that there are locks that prevent other threads from doing work.

Well, that's the conceptual model. If we could make a point in your program faster, then we could just do that to every point in the program and have a blazing fast program with no work. In reality, causal profiling can't make a point in the program faster, so it simulates it.

Instead of making that point in the program faster, it makes everything else go slower. To make a thread faster, causal profiling measures when the thread is executing and insert delays into all other threads. To speed up a thread by 50%, it inserts delays equal to 50% of the time spent executing that thread. If you subtract the length of the delays from the time you got from the measurements, you get the final result of how much faster your program would be if that thread is faster.

There are some more nuance to the final algorithm, but that is the core of it. Besides slowing down parts of the program, the profiler also has to do some accounting so that threads unblocked by the instrumented thread doesn't have delays inserted. An actual speedup would have caused the unblocked thread to be executed earlier, so the profiler must not delay it in that case. Same for when a thread has been delayed and then unblocks another thread. Since only the delayed thread is running concurrently with the instrumented thread, adding a delay to the unblocked thread break the speedup illusion.

The handover of delays is the "causal" part of causal profiling. I probably would have gone for a different name, because it is really an implementation detail on how the speedup illusion is performed, rather than something at the core at the algorithm.

The Go implementation

I've implemented a prototype of causal profiling for Go, which you can find in my own little Go playground on Github. The runtime/causalprof package implements the profiling, while `go tool causalprof` makes sense of the output. It's all very buggy at the moment, but I've already run some programs and got results back from it.

An example

One of the weird gotcha's in Go is that the default random source has a mutex around it. This program looks like every request is independent of each other, but the call to rand.Int63n can block other goroutines.

If we run causal profiling on this program, we see that making rand.Int63n 95% faster would make the overall execution 22.3% faster. Regular profiling on this program shows that only 3.74% of the time is spent inside this function. While this is a toy example, it demonstrates just how hard it is to reason about the effects of locking.


Coz, the original implementation of causal profiling worked on a thread level, entirely driven by signals. By sleeping inside a signal handler, the profiler can insert a delay easily. Since Go has a userspace scheduler, sleeping a thread won't work. If you sleep, you prevent that thread from scheduling and because other threads will start executing runnable goroutines, you end up not delaying execution at all.

Instead of working on a thread level, my implementation works on a per-goroutine level. Every time a goroutine is getting scheduled, the profiler checks to see it needs to be delayed. If it does need to be delayed, we sleep the goroutine by using the timer thread and execute it once that timer runs out.

The Coz profiler has to do some dynamic linking tricks to intercept system calls and find out what unblocks what. This is trivial in Go since every unblock needs to be done by calling runtime.ready.

This implementation is not finished by far. The causalprof tool needs polish and probably a way to graph results. Experiments are a set length and probably need to change based on how many units of work are finished and there are probably some bugs in there that I haven't even discovered yet. However, I plan on spending a bit more time on it, and hopefully it will be in a usable state by sometime this year.

And now for something completely different

I am looking for work right now! If you want to work with someone with more Go internals knowledge than is probably necessary, you can let me know by clicking the email link on the sidebar.

Related articles