Morsing's Blog

1 February 2016

Causal Profiling for Go


Last year, I came across this paper, which describes a novel profiling algorithm called causal profiling. I've decided to implement it for Go.

Why do we need it?

Causal profiling is a way of measuring your program's performance that works around some of the shortcomings of regular signal-based profiling.

In regular profiling, the profiler tells the kernel to interrupt your program every time it has used a certain slice of CPU time. At each interrupt, the profiler takes a sample of which part of the program is running, then continues execution. After enough samples have been taken, you can gather various statistics on where the program was interrupted. Since the samples happen at regular intervals, their distribution tells you where your program is using CPU time. This is incredibly useful when optimizing, because you can see the places where you might be using more CPU time than you thought you were.

The biggest shortcoming of regular profiling is that there isn't necessarily a correlation between where CPU time is being spent and where progress is being made. The classic example of a construct that breaks this assumption is a mutex. A thread in your program could take a mutex and sleep for a long time; meanwhile, every other thread waiting on that mutex would be forced to wait until it is released. If the other threads could be doing some useful work, that potential doesn't show up in your regular CPU profiles, since nothing is using the CPU.

While regular profiling can show you how CPU time is being spent, causal profiling can show you where CPU time is not being spent.

How does it work?

Causal profiling is fairly simple. You insert markers into your program that measure how long it takes to perform some operation. Usually that would be around some unit of work, like servicing an HTTP request. You then do an experiment where you take a random point in your program, make it faster and see how that changed the time it took for the program to do the operation. With enough of these experiments, you can see how optimizing various parts of your program can impact performance. If you can get a 30% increase in performance by making something 15% faster, that indicates that there are locks that prevent other threads from doing work.

Well, that's the conceptual model. If we could make a point in your program faster, then we could just do that to every point in the program and have a blazing fast program with no work. In reality, causal profiling can't make a point in the program faster, so it simulates it.

Instead of making that point in the program faster, it makes everything else go slower. To make a thread faster, causal profiling measures when the thread is executing and inserts delays into all other threads. To speed up a thread by 50%, it inserts delays equal to 50% of the time spent executing that thread. If you subtract the length of the delays from the time you got from the measurements, you get the final result: how much faster your program would be if that thread were faster.

There is some more nuance to the final algorithm, but that is the core of it. Besides slowing down parts of the program, the profiler also has to do some accounting so that threads unblocked by the instrumented thread don't have delays inserted. An actual speedup would have caused the unblocked thread to be executed earlier, so the profiler must not delay it in that case. The same goes for when a thread that has been delayed unblocks another thread. Since only the delayed thread is running concurrently with the instrumented thread, adding a delay to the unblocked thread would break the speedup illusion.

The handover of delays is the "causal" part of causal profiling. I probably would have gone for a different name, because it is really an implementation detail of how the speedup illusion is performed, rather than something at the core of the algorithm.

The Go implementation

I've implemented a prototype of causal profiling for Go, which you can find in my own little Go playground on GitHub. The runtime/causalprof package implements the profiling, while `go tool causalprof` makes sense of the output. It's all very buggy at the moment, but I've already run some programs and got results back from it.

An example

One of the weird gotchas in Go is that the default random source has a mutex around it. This program looks like every request is independent of each other, but the call to rand.Int63n can block other goroutines.

If we run causal profiling on this program, we see that making rand.Int63n 95% faster would make the overall execution 22.3% faster. Regular profiling on this program shows that only 3.74% of the time is spent inside this function. While this is a toy example, it demonstrates just how hard it is to reason about the effects of locking.


Coz, the original implementation of causal profiling, worked on a thread level and was driven entirely by signals. By sleeping inside a signal handler, the profiler can insert a delay easily. Since Go has a userspace scheduler, sleeping a thread won't work. If you sleep, you prevent that thread from scheduling, and because other threads will start executing the runnable goroutines, you end up not delaying execution at all.

Instead of working on a thread level, my implementation works on a per-goroutine level. Every time a goroutine is about to be scheduled, the profiler checks to see if it needs to be delayed. If it does, we sleep the goroutine using the timer thread and execute it once the timer runs out.

The Coz profiler has to do some dynamic linking tricks to intercept system calls and find out what unblocks what. This is trivial in Go since every unblock needs to be done by calling runtime.ready.

This implementation is far from finished. The causalprof tool needs polish and probably a way to graph results; experiments currently run for a set length of time and should probably adapt to how many units of work are finished; and there are surely bugs in there that I haven't even discovered yet. However, I plan on spending a bit more time on it, and hopefully it will be in a usable state sometime this year.

And now for something completely different

I am looking for work right now! If you want to work with someone with more Go internals knowledge than is probably necessary, you can let me know by clicking the email link on the sidebar.

9 April 2014

Effective error handling in Go.


One of the things that Go gets a lot of criticism for is how errors are handled. While it might seem daunting to have to explicitly inspect every error, there are steps you can take to defend yourself against erroneous error handling.

Indented flow is for errors.

When writing Go code, prefer the form

f, err := os.Open(path)
if err != nil {
    // handle error
}
// do stuff

over

f, err := os.Open(path)
if err == nil {
    // do stuff
}
// handle error

This way, the error-free case will read as a straight line down the page.

Define your errors

One of the first steps to knowing how to handle an error is knowing what the error is. If your package can somehow cause an error, your users could be interested in knowing that you caused it. To do this, you just need to implement the error interface, which can be something as simple as this:

type Error string

func (e Error) Error() string { return string(e) }

Users of your package can now tell if your package caused an error by doing a type assertion:

result, err := yourpackage.Foo()
if yerr, ok := err.(yourpackage.Error); ok {
    // use yerr to handle error
}

This can also be used as a way to expose structured error information to your users.

type ParseError struct {
    File    *File
    Message string
}

func (pe *ParseError) Error() string {
    // format the error string here, using pe.File and pe.Message
}

func ParseFiles(files []*File) error {
    for _, f := range files {
        err := f.parse()
        if err != nil {
            return &ParseError{
                File:    f,
                Message: err.Error(),
            }
        }
    }
    return nil
}
This way, your users can now tell which exact file failed to parse.

You should be careful about wrapping errors though. When you wrap an error, information can be lost.

var c net.Conn
f, err := DownloadFile(c, path)
switch e := err.(type) {
default:
    // this will get executed if err == nil
case net.Error:
    // close connection, not valid anymore
    return e
case error:
    // if err is non-nil
    return err
}
// do other things.

If you wrap net.Error, this code will not see that it was the network which failed and reuse the invalid connection.

A good rule of thumb is that if your package uses an outside interface, don't wrap errors generated by calls to them. Your user might care more about their errors than yours.

Errors as state.

Sometimes you might want to hold on to an error, either because you can delay reporting it or because you know you'll report it again soon.

A good example of the first case is the bufio package. When a bufio.Reader encounters an error, it will hold on to the error until the buffer has been emptied. Only then will it report it.

A good example of the second case is go/loader. When called with parameters that cause it to error, it will hold on to the error since it is likely that it will be called again with the same parameters.

Use functions to avoid repetition

If you have a piece of error handling that is repeated, you can make a function out of it.

func handleError(c net.Conn, err error) {
    // repeated error handling
}

func DoStuff(c net.Conn) error {
    f, err := downloadFile(c, path)
    if err != nil {
        handleError(c, err)
        return err
    }

    f, err = doOtherThing(c)
    if err != nil {
        handleError(c, err)
        return err
    }
    // use f
}
An alternative way of writing this is

func handleError(c net.Conn, err error) {
    if err == nil {
        return
    }
    // repeated error handling
}

func DoStuff(c net.Conn) (err error) {
    defer func() { handleError(c, err) }()

    f, err := downloadFile(c, path)
    if err != nil {
        return err
    }

    f, err = doOtherThing(c)
    if err != nil {
        return err
    }
    // use f
}

That's all.

That's all there really is to it.

By Daniel Morsing

8 April 2014

Machine code and garbage collection

So you're building a language.

If you're constructing a programming language, chances are that you want to use machine code somewhere in your implementation. It might be because you want to use C modules, it might be that you're using a JIT, or you might just want to compile ahead-of-time to a binary.

If your language is garbage-collected, there are a number of things you must consider. This is an attempt to mention just some of them.

Some background

But first, let's take a brief look at how garbage collectors work, or more specifically, how one type of garbage collector works. You have your heap with a collection of objects. These objects can have references to other objects. A tracing garbage collector stops the running program at some point during its execution. It then goes through the already known set of objects, usually global variables and objects on the stack. It marks these objects as reachable, finds all the references to other objects within them, and marks those as reachable as well. This is called the mark phase.

Once it has run out of objects to scan, it will take all objects that it didn't reach and collect them. This is called the sweep phase.

The fact that a garbage collector only collects objects it didn't mark as reachable has some serious implications. It means that if you didn't recognize a reference somewhere, you will end up collecting items that are still live.

The problem

The big problem with machine code and garbage collection is that machine code does not know what a reference is. To the machine code, memory is just a large array of numbers; it has no idea what's a reference and what's not. Since we'll be collecting all objects that we didn't reach, missing a reference means potentially freeing memory that isn't ready to be freed. We'll need to find some way of getting around this problem.

One strategy is to treat all memory as references. This is what's called a conservative garbage collector. By doing this, you can be sure that no matter what, you will not miss a reference.

The downside with this tactic is that you can have spurious references. Somewhere in your data, you could have a string of bytes which looks like a pointer. The garbage collector would then keep the memory around, even though it could be freed. This is how the Boehm garbage collector for C works.

Even if you implement this scheme, there are still things to watch out for. Since compilers have become very good at optimizing and keeping often-referenced variables in registers, you need to make sure that you're looking at the registers as well. This problem manifested itself as a bug in Ruby modules, where memory would be freed that was only referenced by a register.

Another strategy is to disable garbage collection while your native code is running. Once you're back in your interpreter, the native code would have left you in a state that your interpreter can understand and you can start your garbage collection. If you're implementing a dynamically typed language, you will have type information at hand anyway and you can use this to your advantage by just looking up where the references are.

The disadvantage is that your garbage collection is delayed until a return from your native code. This is fine for something like JITed code where you'll eventually return into your interpreter, but implementing event loops in C is a problem since you never return. It is also a problem if you're running more than one thread in your implementation: since any one thread executing machine code can stall your collection, multiple threads increase the chance that you'll have to wait even longer before you can garbage collect.

Yet another strategy is to build type information for your machine code that the garbage collector can use. This usually manifests itself as two data structures. One is a bitmap of the stack frame, showing the garbage collector where the pointers on the stack are for any given instruction. The other is bytecode telling the garbage collector where to find references. Using this combination of data structures, the garbage collector is able to precisely figure out where the references are.

Having this data means that you'll have to calculate it somehow. If you're compiling ahead-of-time, this shouldn't be a big problem. You just make sure that you emit the data when handling your types. However, if you're trying to integrate C modules, the normal C compilers will not help you. You'll either have to use conservative garbage collection or implement your own C compiler, just to get this information.

What's next

So far, I've only just scratched the surface. There are many more things to consider like how to make sure that all your threads can be stopped for collection, interrupting your machine code only when the heap is in a consistent state and how to make your compiler generate machine code that won't lose references.

I am by no means an expert on this subject. You shouldn't use any of this advice to actually build a garbage collector. But I do have an appreciation of how much work goes into building one and after this, I hope you do too.

By Daniel Morsing

8 September 2013

The Go netpoller


I'm bored again or I have something more important to do, so it's time for another blog post about the Go runtime. This time I'm gonna take a look at how Go handles network I/O.


In Go, all I/O is blocking. The Go ecosystem is built around the idea that you write against a blocking interface and then handle concurrency through goroutines and channels rather than callbacks and futures. An example is the HTTP server in the "net/http" package. Whenever it accepts a connection, it will create a new goroutine to handle all the requests that will happen on that connection. This construct means that the request handler can be written in a very straightforward manner. First do this, then do that. Unfortunately, using the blocking I/O provided by the operating system isn't suitable for constructing our own blocking I/O interface.

In my previous post about the Go runtime, I covered how the Go scheduler handles syscalls. To handle a blocking syscall, we need a thread that can be blocked inside the operating system. If we were to build our blocking I/O on top of the OS' blocking I/O, we'd spawn a new thread for every client stuck in a syscall. This becomes really expensive once you have 10,000 client threads, all stuck in a syscall waiting for their I/O operation to succeed.

Go gets around this problem by using the asynchronous interfaces that the OS provides, but blocking the goroutines that are performing I/O.

The netpoller

The part that converts asynchronous I/O into blocking I/O is called the netpoller. It sits in its own thread, receiving events from goroutines wishing to do network I/O. The netpoller uses whichever interface the OS provides to do polling of network sockets. On Linux, it uses epoll; on the BSDs and Darwin, it uses kqueue; and on Windows it uses I/O completion ports. These interfaces all have in common that they provide user space a way to efficiently poll for the status of network I/O.

Whenever you open or accept a connection in Go, the file descriptor that backs it is set to non-blocking mode. This means that if you try to do I/O on it and the file descriptor isn't ready, it will return an error code saying so. Whenever a goroutine tries to read or write to a connection, the networking code will do the operation until it receives such an error, then call into the netpoller, telling it to notify the goroutine when it is ready to perform I/O again. The goroutine is then scheduled out of the thread it's running on and another goroutine is run in its place.

When the netpoller receives notification from the OS that it can perform I/O on a file descriptor, it will look through its internal data structure, see if there are any goroutines that are blocked on that file and notify them if there are any. The goroutine can then retry the I/O operation that caused it to block and succeed in doing so.

If this is sounding a lot like using the old select and poll Unix system calls to do I/O, it's because it is. But instead of looking up a function pointer and a struct containing a bunch of state variables, the netpoller looks up a goroutine that can be scheduled in. This frees you from managing all that state, rechecking whether you received enough data on the last go around and juggling function pointers like you would do with traditional Unix networking I/O.

30 June 2013

The Go scheduler


One of the big features for Go 1.1 is the new scheduler, contributed by Dmitry Vyukov. The new scheduler has given a dramatic increase in performance for parallel Go programs and with nothing better to do, I figured I'd write something about it.

Most of what's written in this blog post is already described in the original design doc. It's a fairly comprehensive document, but pretty technical.

All you need to know about the new scheduler is in that design document but this post has pictures, so it's clearly superior.

What does the Go runtime need with a scheduler?

But before we look at the new scheduler, we need to understand why it's needed. Why create a userspace scheduler when the operating system can schedule threads for you?

The POSIX thread API is very much a logical extension to the existing Unix process model and as such, threads get a lot of the same controls as processes. Threads have their own signal mask, can be assigned CPU affinity, can be put into cgroups and can be queried for which resources they use. All these controls add overhead for features that are simply not needed for how Go programs use goroutines and they quickly add up when you have 100,000 threads in your program.

Another problem is that the OS can't make informed scheduling decisions, based on the Go model. For example, the Go garbage collector requires that all threads are stopped when running a collection and that memory must be in a consistent state. This involves waiting for running threads to reach a point where we know that the memory is consistent.

When you have many threads scheduled out at random points, chances are that you're going to have to wait for a lot of them to reach a consistent state. The Go scheduler can make the decision of only scheduling at points where it knows that memory is consistent. This means that when we stop for garbage collection, we only have to wait for the threads that are being actively run on a CPU core.

Our Cast of Characters

There are 3 usual models for threading. One is N:1 where several userspace threads are run on one OS thread. This has the advantage of being very quick to context switch but cannot take advantage of multi-core systems. Another is 1:1 where one thread of execution matches one OS thread. It takes advantage of all of the cores on the machine, but context switching is slow because it has to trap through the OS.

Go tries to get the best of both worlds by using a M:N scheduler. It schedules an arbitrary number of goroutines onto an arbitrary number of OS threads. You get quick context switches and you take advantage of all the cores in your system. The main disadvantage of this approach is the complexity it adds to the scheduler.

To accomplish the task of scheduling, the Go scheduler uses 3 main entities:

The triangle represents an OS thread. It's the thread of execution managed by the OS and works pretty much like your standard POSIX thread. In the runtime code, it's called M for machine.

The circle represents a goroutine. It includes the stack, the instruction pointer and other information important for scheduling goroutines, like any channel it might be blocked on. In the runtime code, it's called a G.

The rectangle represents a context for scheduling. You can look at it as a localized version of the scheduler which runs Go code on a single thread. It's the important part that lets us go from a N:1 scheduler to a M:N scheduler. In the runtime code, it's called P for processor. More on this part in a bit.

Here we see 2 threads (M), each holding a context (P), each running a goroutine (G). In order to run goroutines, a thread must hold a context.

The number of contexts is set on startup to the value of the GOMAXPROCS environment variable or through the runtime function GOMAXPROCS(). Normally this doesn't change during execution of your program. The fact that the number of contexts is fixed means that only GOMAXPROCS threads are running Go code at any point. We can use that to tune the invocation of the Go process to the individual computer, such that a 4-core PC runs Go code on 4 threads.

The greyed out goroutines are not running, but ready to be scheduled. They're arranged in lists called runqueues. Goroutines are added to the end of a runqueue whenever a goroutine executes a go statement. Once a context has run a goroutine until a scheduling point, it pops a goroutine off its runqueue, sets stack and instruction pointer and begins running the goroutine.

To bring down mutex contention, each context has its own local runqueue. A previous version of the Go scheduler only had a global runqueue with a mutex protecting it. Threads were often blocked waiting for the mutex to be unlocked. This got really bad when you had 32-core machines that you wanted to squeeze as much performance out of as possible.

The scheduler keeps on scheduling in this steady state as long as all contexts have goroutines to run. However, there are a couple of scenarios that can change that.

Who you gonna (sys)call?

You might wonder now, why have contexts at all? Can't we just put the runqueues on the threads and get rid of contexts? Not really. The reason we have contexts is so that we can hand them off to other threads if the running thread needs to block for some reason.

An example of when we need to block, is when we call into a syscall. Since a thread cannot both be executing code and be blocked on a syscall, we need to hand off the context so it can keep scheduling.

Here we see a thread giving up its context so that another thread can run it. The scheduler makes sure there are enough threads to run all contexts. M1 in the illustration above might be created just for the purpose of handling this syscall or it could come from a thread cache. The syscalling thread will hold on to the goroutine that made the syscall since it's technically still executing, albeit blocked in the OS.

When the syscall returns, the thread must try and get a context in order to run the returning goroutine. The normal mode of operation is to steal a context from one of the other threads. If it can't steal one, it will put the goroutine on a global runqueue, put itself on the thread cache and go to sleep.

The global runqueue is a runqueue that contexts pull from when they run out of their local runqueue. Contexts also periodically check the global runqueue for goroutines. Otherwise the goroutines on global runqueue could end up never running because of starvation.

This handling of syscalls is why Go programs run with multiple threads, even when GOMAXPROCS is 1. The runtime uses goroutines that call syscalls, leaving threads behind.

Stealing work

Another way the steady state of the system can change is when a context runs out of goroutines to schedule. This can happen if the amount of work on the contexts' runqueues is unbalanced, causing a context to exhaust its runqueue while there is still work to be done in the system. To keep running Go code, a context can take goroutines out of the global runqueue, but if there are none in it, it'll have to get them from somewhere else.

That somewhere is the other contexts. When a context runs out, it will try to steal about half of the runqueue from another context. This makes sure there is always work to do on each of the contexts, which in turn makes sure that all threads are working at their maximum capacity.

Where to go?

There are many more details to the scheduler, like cgo threads, the LockOSThread() function and integration with the network poller. These are outside the scope of this post, but still merit study. I might write about these later. There are certainly plenty of interesting constructions to be found in the Go runtime library.

By Daniel Morsing
