Morsing's Blog

20 September 2017

A Causal Profiling update


Back in February last year, I wrote about my attempts at porting the Causal Profiling algorithm to the Go runtime.

Since then, development has stalled, but I did find some time recently to update it to the newest Go version. I realized during the rebase that the patchset could be simplified greatly and give far more accurate results. In this blog post, I'm going to go into some of the details and counterintuitive findings from this simplification.

For the rest of this post, I assume some familiarity with Causal Profiling and the Go scheduler, which I also wrote about some time back.

What went wrong?

Causal Profiling works by performing experiments. A line is chosen and we start a profiler that samples when that line is being executed. When a signal is triggered, the algorithm selectively delays execution of all other threads in the system. The end effect is that we virtually speed up the chosen line. During this experiment, we also measure some code that we've instrumented to find out whether it has sped up. This would usually be something like an important HTTP request that we want to optimize. Once the experiment is finished, we take the delays added by the profiler and subtract them from the instrumented measurements, and that's our result.

This leaves us in the position of choosing what constitutes a thread. The Go runtime has 3 things that are sufficiently thread-like: Gs, Ms and Ps. Gs are goroutines, with a stack and instruction pointer. They're the thing representing the Go code that you've written. Ms are operating system threads. They're tasked with executing the Gs. For various reasons, an M might not be executing Go code. The most common one is being in a syscall. Which leaves us with Ps. Ps are the scheduling context. As Ms move into syscalls, Ps are handed off between them to make sure that there are always GOMAXPROCS Ms executing Go code at any given time.

Choosing the M as the Causal Profiling thread is difficult. They tend to not be executing at all times and dealing with the syscall machinery makes it an inaccurate fit.

Choosing the P means that threads are always busy, removing a large part of the bookkeeping needed for Causal Profiling. It would seem that it's the perfect fit, but it does have some disadvantages. During the time that the other Ps are executing their delays, the sped-up P would be free to steal work off them. The execution would then not be delayed, which in turn would mean that there'd be a speed-up, even though the G causing the delay and the Gs executed subsequently wouldn't be causally connected.

It breaks down even further when you consider the case where there's only one P. A speed-up being applied then means that no delays are being inserted, turning the algorithm into a regular profiler.

To avoid these cases of implicit causal connections I ended up with the G as the thread. Since Gs are only ever unblocked by other Gs, it ended up simplifying the causal connections implementation significantly.

The part where it didn't work so well was the implementation of delays. Sleeping goroutines are implemented with a single timer process that sits on a thread by itself and enqueues goroutines to the scheduler when their sleeps have been executed. This allows the runtime to only have one sleeping OS thread, but thousands of sleeping goroutines, significantly reducing the amount of resources needed for handling timers.

For Causal Profiling, every time a given goroutine was scheduled, I'd check if it needed to be delayed and put it into the queue of the timer process. This led to lots of lock contention on the internal state of the timer process. Additionally, since it involved 2 roundtrips into the scheduler for every goroutine execution, the overhead there was increased. This led to inaccurate delays and made the results noisy and susceptible to misinterpretation.

A realization

The noisy measurements were a large part of why I mostly abandoned the project. I've been rebasing patches so that they apply cleanly on new Go releases, but otherwise, I've found little time to do any development.

During the latest rebase I had a realization. If you're using Ps as the thread for the purposes of Causal Profiling, the implicit causal connections are actually a feature. They end up modeling the effects that would happen if the currently executing thread actually did get sped up.

Consider the 1 P scenario I described above. There is only one thread ever doing any work on executing goroutines and it can only execute one goroutine at a time. In such a scenario, speeding up any given execution correlates with a 1-to-1 speed-up of the final program. The causal connection being inherited by goroutines running on that P is exactly the end result we want.

This extends to the multiple P scenario. Say you have one P that has just had a speed-up applied to it and another one executing its delay. If the sped-up P finishes its current task and then work steals from the other P, that reduces the queuing delay that the stolen task would experience. That lines up perfectly with what would happen if an actual speed-up happened.

The implementation

Moving the thread concept onto Ps raises an issue. Causal Profiling requires that threads that are unblocked by other threads are credited with the delays executed by the unblocking thread. However, Ps never directly interact. Gs interact with other Gs and Ps are just sitting in a loop, finding work to do and executing it.

The solution to this issue is to treat Gs like tokens. Every time a G is executed on a P, we tag it with the delay count of that P. If it then gets scheduled onto another P, that P inherits the delay count the G carries.
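A minimal sketch of the token idea, not the actual runtime patch. The P and G types, the run method and the nanosecond counters are all hypothetical, and it assumes that "inheriting" means the receiving P takes the larger of its own count and the count carried by the G.

```go
package main

import "fmt"

// P is a stand-in for a scheduling context. delay is the total delay (in
// nanoseconds) it has executed or been credited with so far.
type P struct {
	id    int
	delay int64
}

// G is a stand-in for a goroutine. delay is the token it carries between Ps.
type G struct {
	delay int64
}

// run models a P picking up a G and running it for a while.
// Assumption: inheriting means taking the larger of the two counts.
func (p *P) run(g *G, executed int64) {
	if g.delay > p.delay {
		p.delay = g.delay // credit this P with delays the G already "paid"
	}
	p.delay += executed // delays this P executed while g was on it
	g.delay = p.delay   // tag the G with the P's current count
}

func main() {
	p0, p1 := &P{id: 0}, &P{id: 1}
	g := &G{}
	p0.run(g, 300) // p0 executes 300ns of delay while running g
	p1.run(g, 0)   // g moves to p1, which is credited with those 300ns
	fmt.Println(p0.delay, p1.delay, g.delay)
}
```

Because the count travels with the G, the Ps never need to talk to each other directly.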

Delays no longer involve the Go scheduler. Just like in the original Causal Profiling paper, we now execute the delays inside the signal handler. The OS scheduler is now handling these delays, which it does much more precisely. Since there are only going to be GOMAXPROCS Ps executing delays at any given time, the volume of sleeping threads is also kept to a minimum.


Based on my tiny experiments so far, the results are way less noisy than before and can actually be used to guide optimization.

The other thing that I've noticed is that Ps are way more causally connected than Gs. Because of this, delays tend to be handed off more between threads, meaning less delay overall. The end result is that speed-ups have a bigger effect. This made me suspicious at first, because I didn't want to fall into the trap of evaluating a profiling algorithm more favorably because it gives a bigger effect. However, thinking through the queuing delays and the execution model, I've convinced myself that the bigger effect is just it being more accurate.

If you want to play with Causal Profiling and aren't afraid of applying patches to the core Go runtime you're running, you can find it on my github. If you need some help getting started, feel free to reach out to me on the email address on the sidebar.

Aside: If you'd like to work with someone who does these kinds of experiments, I'm currently available for hire. Have a look at my CV and if you find it interesting, you can reach me on the email on the sidebar.

28 August 2017

Let's assess Kubernetes


Being on the Go conference circuit, I see about 3 different talks a year about new ways to boot Kubernetes. While I have a vague understanding of what it is, I don't have any practical experience or deeper knowledge about it. Having some spare time on my hands, I decided I'd try it out by porting a simple application to use it and write down my first impressions.

I'm not hoping to write a "Getting started with Kubernetes" post, because the official docs do a way better job of this than I could imagine ever doing. In general, the docs are really good. In particular, I found the concepts section really helpful when trying to grasp the system. Well done docs writers! 👍

As for the application, I have a custom-written blog server that you're reading this page on right now. It used to run on a small Linode instance that I manually operated. It might seem like extreme overkill to use a cluster management stack to deploy a single small app like this and frankly, it is. However, I found it to be a good way of getting hands on experience with the system. At the time of publication, this blog is now running on a single-node Google Container Engine cluster.

Pod lifecycles

Kubernetes' claim to fame is its scheduling. For each deployment that Kubernetes manages, it schedules sets of containers (known as pods) onto machines to be run (known as nodes). During rollout and scaling, pods get killed and created in order to satisfy the replica requirements. While the scheduling ensures better utilization of resources, I feel like the bigger impact that Kubernetes has is the environment that the pods run in. Out of the box, it provides image management, internal DNS and rollout automation. This makes me think that it's worth running the system with single pods scheduled onto single nodes, effectively disabling the scheduler.

One thing that the lifecycle doesn't seem to handle is caches that need to be kept hot. While it's generally a bad idea to have these caches, they do show up in clusters in the wild. If you have a memcache container, running along with a server of some kind, the only way to upgrade the server is to kill the memcache along with it. There's a mechanism for handling stateful pods, but that requires dumping the entire state onto disk and then reading it back when the pod is rescheduled. If you're finding yourself in a situation where this is necessary, then waiting for the restore from disk isn't high up on the list of things you want to do.


The networking setup for pod-to-pod communication is really nice. On Google Cloud Platform, each node gets a /24 subnet in the private range and each pod then gets its own private IP. On any given pod, you can then use whatever port range you want for your services. This sort of isolation avoids situations where multiple applications are all trying to bind to the same port. If you want to drop your http server at port 80, you can do so, without having to worry about other http servers. Most applications know to avoid 80, but there are a lot of things that open debug servers on 8080 and I have had systems fail because of this.

Another benefit of this namespacing is that you can interact with various pieces of software that might not be set up for running on non-standard ports. Running DNS on any port that isn't 53 is really damn hard, because nothing will query on it.

This kind of isolation is based on each node having its own subnet. If your cloud provider only gives you a single IP per node, you will have to install some kind of overlay network and in my experience, they tend to not work that well.

While pod-to-pod networking is great, you still have to figure out what the IP addresses are in order to communicate between them. Kubernetes has a concept of a "service" for this. By default, each service gets a single internal cluster IP. You can discover this IP via internal DNS and then connect to it. If you have multiple pods that satisfy the same service, there is automatic load balancing between them which is handled by the communicating node.

I'm not sure what I think about the single cluster IP yet. Most applications are really bad at handling multiple DNS results and will select the first result encountered, leading to uneven resource usage. The cluster IP saves them from this issue. However, this setup falls for the classic trap of conflating service discovery with liveness. Once you get into larger cluster sizes, you start seeing more partitions that aren't symmetrical. The node hosting a pod might be able to perform its health checks and report them back to the Kubernetes master, but not be reachable from other nodes. When this happens, load balancing will still try to reach the faulty node and since there is only one IP, you don't have a fallback. You can try terminating the connection and trying again in the hopes that it will load balance the TCP connection onto another node that is reachable, but that isn't quite optimal.

My advice is that if you're deploying on Kubernetes, you should add a health check based on how many requests have been served on a given pod since the last health check. If it's lower than a certain amount, mark the pod as unhealthy. This way, the TCP load balancing should quickly evict the pod, even if it's reachable from the Kubernetes master. There are ways of configuring services to give you direct pod IPs, but unless you need network identity, I don't think it's necessary.
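A minimal sketch of such a traffic-based health check, using only the standard library. The trafficCheck type, its threshold and the /healthz path are hypothetical, and the sketch assumes the probe is called at a fixed interval so that "since the last health check" is a meaningful window.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// trafficCheck counts requests served since the last health probe.
type trafficCheck struct {
	served int64 // accessed atomically
	min    int64 // hypothetical threshold, to be tuned per service
}

// record notes that one request was served.
func (t *trafficCheck) record() { atomic.AddInt64(&t.served, 1) }

// count wraps a handler so that every request it serves is recorded.
func (t *trafficCheck) count(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		t.record()
		next.ServeHTTP(w, r)
	})
}

// healthy reports whether enough requests arrived since the previous call
// and resets the counter for the next interval.
func (t *trafficCheck) healthy() bool {
	return atomic.SwapInt64(&t.served, 0) >= t.min
}

// ServeHTTP is the probe endpoint: 503 when too little traffic was seen.
func (t *trafficCheck) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if !t.healthy() {
		http.Error(w, "too few requests served", http.StatusServiceUnavailable)
		return
	}
	fmt.Fprintln(w, "ok")
}

func main() {
	check := &trafficCheck{min: 1}
	app := check.count(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "hello")
	}))

	// No traffic yet, so the probe reports unhealthy.
	rec := httptest.NewRecorder()
	check.ServeHTTP(rec, httptest.NewRequest("GET", "/healthz", nil))
	fmt.Println(rec.Code)

	// After one served request, the next probe passes.
	app.ServeHTTP(httptest.NewRecorder(), httptest.NewRequest("GET", "/", nil))
	rec = httptest.NewRecorder()
	check.ServeHTTP(rec, httptest.NewRequest("GET", "/healthz", nil))
	fmt.Println(rec.Code)
}
```

In a real server the two handlers would be registered on a mux and served with http.ListenAndServe; httptest is used here only so the sketch runs without opening a port.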

Since pods have private IPs that aren't routable on the public internet, you need some sort of translation from the pod local IP to a routable IP when accessing things from outside the cluster. I'm usually not a fan of this type of NAT, so this looks like a clear cut case for IPv6. Just give every node a publicly routable subnet and the issue goes away. Unfortunately, Kubernetes doesn't support IPv6.

In practice, NAT isn't that big of an issue. For traffic that comes from outside your cluster, Kubernetes pushes you hard to use services that interact with cloud platform load balancers that provide a single IP. Since a lot of contributors to Kubernetes worked on orchestration at Google, it's not a surprise that they'd design around a Maglev model.

Sadly, I've yet to figure out a way to expose a service to the outside world in a high-availability way in an environment that doesn't have such a load balancer. You can instruct Kubernetes to route any traffic that reaches the cluster on an external IP, but the IP doesn't get taken into account when scheduling pods onto nodes. If the pod gets scheduled away from a node that has that IP routed to it, then you end up with the node having to do NAT, which is extra work for that node. Another issue is that there isn't a concept of port collisions on the external IP layer. If you have a controller that updates external IPs (and whatever DNS service you're using) according to which nodes they land on, you could potentially have 2 pods that both want port 80 traffic, but with nothing to distinguish the 2 IPs from each other.

For now, I'm going to live with the cloud load balancer. It makes my life easier and I don't expect to be running Kubernetes outside a cloud environment any time soon. If you do know a way to do this, I'd love to know how. My email can be found in the link on the side.

More cloud magic

Managing persistent volumes is another place where cloud magic comes into play. The supported ways to get a disk that a pod can access skew heavily towards cloud providers, and while you can create a volume that is just an attached disk on a node, it is an alpha feature and not ready for production yet. It'd be interesting to see if you could bootstrap an NFS service running on Kubernetes and then have Kubernetes use it for handing out persistent volumes, but since I'm only running a tiny blog server, I think that's a task for another day.

Cloud magic might seem like a dealbreaker for some people, but the cluster computing environment is so heavily based on cloud services now that you're going to have a hard time avoiding it. I see a lot of people avoiding the more magical parts of the cloud to prevent vendor lock-in, but they underestimate the cost of developing solutions in-house and the amount of implicit assumptions that come with running on a cloud platform in the first place. Kubernetes provides a consistent interface on this cloud magic, so while you might be relying on it, at least it's technically standardized.


I found my tiny excursion on Kubernetes quite enjoyable. Obviously, this is a toy problem, so any issues that might surface from large-scale use aren't apparent to me. I'm also running this on a cloud platform, so the backend administration might as well be made of bees and I wouldn't notice. As a user of the system however, I'm starting to realize why so many people want new ways to boot it.

As a side note, I'm currently looking for work. If you need a Go compiler/runtime engineer who's managed to know distributed systems through osmosis, hit me up on the sidebar. If you're curious about my skills, you can check out my CV.

1 February 2016

Causal Profiling for Go


Last year, I came across this paper which describes a novel profiling algorithm, which I have decided to implement for Go.

Why do we need it?

Causal profiling is a way of measuring your program's performance that works around some of the shortcomings of regular signal-based profiling.

In regular profiling, the profiler tells the kernel to interrupt your program every time a certain number of instructions have been executed. At each interrupt, the profiler takes a sample of which part of the program is running and continues execution. After enough samples have been taken, you gather various statistics on where the program was interrupted. Since the samples happen at regular intervals, you can tell from the distribution of them where your program is using CPU time. This is incredibly useful when optimizing, because you can see the places where you might be using more CPU time than you thought you were.

The biggest shortcoming of regular profiling is that there isn't necessarily a correlation between where CPU time is being spent and progress being made. The classic example of a construct that breaks this assumption is a mutex. A thread in your program could take a mutex and sleep for a long time. Meanwhile, every other thread in your system would be forced to wait until that mutex is released. If the other threads could be doing some useful work, that potential doesn't show up in your regular CPU profiles, since nothing is using the CPU.

While regular profiling can show you how CPU time is being spent, causal profiling can show you where CPU time is not being spent.

How does it work?

Causal profiling is fairly simple. You insert markers into your program that measure how long it takes to perform some operation. Usually that would be around some unit of work, like servicing an HTTP request. You then do an experiment where you take a random point in your program, make it faster and see how that changed the time it took for the program to do the operation. With enough of these experiments, you can see how optimizing various parts of your program can impact performance. If you can get a 30% increase in performance by making something 15% faster, that indicates that there are locks that prevent other threads from doing work.

Well, that's the conceptual model. If we could make a point in your program faster, then we could just do that to every point in the program and have a blazing fast program with no work. In reality, causal profiling can't make a point in the program faster, so it simulates it.

Instead of making that point in the program faster, it makes everything else go slower. To make a thread faster, causal profiling measures when the thread is executing and inserts delays into all other threads. To speed up a thread by 50%, it inserts delays equal to 50% of the time spent executing that thread. If you subtract the length of the delays from the time you got from the measurements, you get the final result of how much faster your program would be if that thread were faster.
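The arithmetic can be written down as a small worked example. This is a sketch only: delayFor and virtualTime are hypothetical helper names, and the durations are made-up numbers rather than measurements.

```go
package main

import (
	"fmt"
	"time"
)

// delayFor returns the delay inserted into every other thread after the
// chosen thread ran for `ran`, given a virtual speed-up in percent.
func delayFor(ran time.Duration, speedupPct int) time.Duration {
	return ran * time.Duration(speedupPct) / 100
}

// virtualTime subtracts the inserted delays from the measured time, giving
// the time the operation would have taken had the speed-up been real.
func virtualTime(measured, totalDelay time.Duration) time.Duration {
	return measured - totalDelay
}

func main() {
	ran := 40 * time.Millisecond       // time spent in the chosen thread
	measured := 110 * time.Millisecond // instrumented operation, delays included

	d := delayFor(ran, 50) // a 50% speed-up means delays of half the run time
	fmt.Println(d, virtualTime(measured, d))
}
```

With these numbers the delay is 20ms, so the operation is credited with 90ms instead of the 110ms that was actually measured.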

There is some more nuance to the final algorithm, but that is the core of it. Besides slowing down parts of the program, the profiler also has to do some accounting so that threads unblocked by the instrumented thread don't have delays inserted. An actual speedup would have caused the unblocked thread to be executed earlier, so the profiler must not delay it in that case. The same goes for when a thread has been delayed and then unblocks another thread. Since only the delayed thread is running concurrently with the instrumented thread, adding a delay to the unblocked thread breaks the speedup illusion.

The handover of delays is the "causal" part of causal profiling. I probably would have gone for a different name, because it is really an implementation detail on how the speedup illusion is performed, rather than something at the core of the algorithm.

The Go implementation

I've implemented a prototype of causal profiling for Go, which you can find in my own little Go playground on Github. The runtime/causalprof package implements the profiling, while `go tool causalprof` makes sense of the output. It's all very buggy at the moment, but I've already run some programs and got results back from it.

An example

One of the weird gotchas in Go is that the default random source has a mutex around it. This program looks like every request is independent of each other, but the call to rand.Int63n can block other goroutines.
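As a stand-in for such a program (a hypothetical sketch, not the original listing), the following shows the shape of the problem. lockedRand, newLockedRand, handle and serve are invented names, and lockedRand mimics a mutex-guarded shared source explicitly rather than relying on what any particular Go release does internally.

```go
package main

import (
	"fmt"
	"math/rand"
	"sync"
)

// lockedRand mimics a shared random source guarded by a mutex.
type lockedRand struct {
	mu  sync.Mutex
	rng *rand.Rand
}

func newLockedRand(seed int64) *lockedRand {
	return &lockedRand{rng: rand.New(rand.NewSource(seed))}
}

// Int63n serializes all callers on the mutex.
func (l *lockedRand) Int63n(n int64) int64 {
	l.mu.Lock()
	defer l.mu.Unlock()
	return l.rng.Int63n(n)
}

// handle looks independent per request, but every call goes through the
// same mutex inside src.
func handle(src *lockedRand) int64 {
	return src.Int63n(1000)
}

// serve runs one goroutine per request and collects the results.
func serve(src *lockedRand, requests int) []int64 {
	out := make([]int64, requests)
	var wg sync.WaitGroup
	for i := range out {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			out[i] = handle(src)
		}(i)
	}
	wg.Wait()
	return out
}

func main() {
	fmt.Println(len(serve(newLockedRand(1), 8)))
}
```

Nothing in handle hints at shared state, which is why the contention is easy to miss when reading the code.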

If we run causal profiling on this program, we see that making rand.Int63n 95% faster would make the overall execution 22.3% faster. Regular profiling on this program shows that only 3.74% of the time is spent inside this function. While this is a toy example, it demonstrates just how hard it is to reason about the effects of locking.


Coz, the original implementation of causal profiling, worked on a thread level, entirely driven by signals. By sleeping inside a signal handler, the profiler can insert a delay easily. Since Go has a userspace scheduler, sleeping a thread won't work. If you sleep, you prevent that thread from scheduling and because other threads will start executing runnable goroutines, you end up not delaying execution at all.

Instead of working on a thread level, my implementation works on a per-goroutine level. Every time a goroutine is getting scheduled, the profiler checks to see if it needs to be delayed. If it does need to be delayed, we sleep the goroutine by using the timer thread and execute it once that timer runs out.

The Coz profiler has to do some dynamic linking tricks to intercept system calls and find out what unblocks what. This is trivial in Go since every unblock needs to be done by calling runtime.ready.

This implementation is far from finished. The causalprof tool needs polish and probably a way to graph results. Experiments are a set length and probably need to change based on how many units of work are finished. There are probably also some bugs in there that I haven't even discovered yet. However, I plan on spending a bit more time on it, and hopefully it will be in a usable state sometime this year.

And now for something completely different

I am looking for work right now! If you want to work with someone with more Go internals knowledge than is probably necessary, you can let me know by clicking the email link on the sidebar.

9 April 2014

Effective error handling in Go.


One of the things that Go gets a lot of criticism for is how errors are handled. While it might seem daunting to have to explicitly inspect every error, there are steps you can take to defend yourself against erroneous error handling.

Indented flow is for errors.

When writing Go code, prefer the form

f, err := os.Open(path)
if err != nil {
    // handle error
}
// do stuff

over

f, err := os.Open(path)
if err == nil {
    // do stuff
}
// handle error

This way, the error-free case will read as a straight line down the page.

Define your errors

One of the first steps to knowing how to handle an error is knowing what the error is. If your package can somehow cause an error, your users could be interested in knowing that you caused it. To do this, you just need to implement the error interface, which can be something as simple as this:

type Error string

func (e Error) Error() string { return string(e) }

Users of your package can now tell if your package caused an error by doing a type assertion

result, err := yourpackage.Foo()
if ype, ok := err.(yourpackage.Error); ok {
    // use ype to handle error
}

This can also be used as a way to expose structured error information to your users.

type ParseError struct {
    File *File
    Err  string // a field cannot share the name of the Error method
}

func (pe *ParseError) Error() string {
    // format error string here
    return pe.Err
}

func ParseFiles(files []*File) error {
    for _, f := range files {
        err := f.parse()
        if err != nil {
            return &ParseError{
                File: f,
                Err:  err.Error(),
            }
        }
    }
    return nil
}

This way, your users can now tell which exact file failed to parse.

You should be careful about wrapping errors though. When you wrap an error, information can be lost.

var c net.Conn
f, err := DownloadFile(c, path)
switch e := err.(type) {
case nil:
    // this will get executed if err == nil
case net.Error:
    // close connection, not valid anymore
    c.Close()
    return e
case error:
    // if err is non-nil
    return err
}
// do other things.

If you wrap net.Error, this code will not see that it was the network which failed and will reuse the invalid connection.
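The loss of information can be shown with a small self-contained sketch. timeoutErr and wrapped are hypothetical types made up for the illustration; the only real API involved is the net.Error interface.

```go
package main

import (
	"fmt"
	"net"
)

// timeoutErr is a hypothetical error that satisfies net.Error.
type timeoutErr struct{}

func (timeoutErr) Error() string   { return "i/o timeout" }
func (timeoutErr) Timeout() bool   { return true }
func (timeoutErr) Temporary() bool { return true }

// wrapped keeps only the message of the error it wraps, the way a package's
// own error type might.
type wrapped struct{ msg string }

func (w wrapped) Error() string { return w.msg }

// isNetErr is the check a caller would do before deciding to drop a connection.
func isNetErr(err error) bool {
	_, ok := err.(net.Error)
	return ok
}

func main() {
	var orig error = timeoutErr{}
	w := wrapped{"download failed: " + orig.Error()}

	// The original error is recognised as a network error; the wrapper is not.
	fmt.Println(isNetErr(orig), isNetErr(w))
}
```

The message survives the wrapping, but the type that the caller was switching on does not.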

A good rule of thumb is that if your package uses an outside interface, don't wrap errors generated by calls to them. Your user might care more about their errors than yours.

Errors as state.

Sometimes you might want to hold on to an error, either because you can delay reporting it or because you know you'll report it again soon.

A good example of the first case is the bufio package. When a bufio.Reader encounters an error, it will hold on to the error until the buffer has been emptied. Only then will it report it.
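A minimal sketch of keeping an error as state. It is not how bufio is implemented; stickyWriter and failAfter are hypothetical types, and the sketch shows the simpler write-side variant where the first error is stored and reported later.

```go
package main

import (
	"errors"
	"fmt"
	"io"
)

// stickyWriter remembers the first write error and turns later writes into
// no-ops, so the caller can check the error once at the end.
type stickyWriter struct {
	w   io.Writer
	err error
}

func (s *stickyWriter) write(p []byte) {
	if s.err != nil {
		return // already failed, keep the original error
	}
	_, s.err = s.w.Write(p)
}

var errDisk = errors.New("disk full")

// failAfter is a hypothetical writer that fails after n successful writes.
type failAfter struct{ n int }

func (f *failAfter) Write(p []byte) (int, error) {
	if f.n == 0 {
		return 0, errDisk
	}
	f.n--
	return len(p), nil
}

func main() {
	sw := &stickyWriter{w: &failAfter{n: 1}}
	sw.write([]byte("a")) // succeeds
	sw.write([]byte("b")) // fails, error is stored
	sw.write([]byte("c")) // skipped
	fmt.Println(sw.err)   // reported once, after the fact
}
```

The calling code stays a straight line of writes, with a single error check at the point where reporting it makes sense.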

A good example of the second case is go/loader. When called with parameters that cause it to error, it will hold on to the error since it is likely that it will be called again with the same parameters.

Use functions to avoid repetition

If you have a piece of error handling that is repeated, you can make a function out of it.

func handleError(c net.Conn, err error) {
    // repeated error handling
}

func DoStuff(c net.Conn) error {
    f, err := downloadFile(c, path)
    if err != nil {
        handleError(c, err)
        return err
    }
    f, err = doOtherThing(c)
    if err != nil {
        handleError(c, err)
        return err
    }
    return nil
}

An alternative way of writing this is

func handleError(c net.Conn, err error) {
    if err == nil {
        return
    }
    // repeated error handling
}

// err is a named result so that the deferred call sees the returned error.
func DoStuff(c net.Conn) (err error) {
    defer func() { handleError(c, err) }()
    f, err := downloadFile(c, path)
    if err != nil {
        return err
    }
    f, err = doOtherThing(c)
    if err != nil {
        return err
    }
    return nil
}
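For the deferred variant to observe the error, the function's error has to be a named result, since the deferred closure runs after the return value has been set. A runnable sketch of that pattern follows; step, doStuff and the handled slice are hypothetical and exist only to make the behaviour visible.

```go
package main

import (
	"errors"
	"fmt"
)

// handled records what handleError saw, for illustration only.
var handled []error

func handleError(err error) {
	if err == nil {
		return
	}
	// repeated error handling would go here
	handled = append(handled, err)
}

// step is a hypothetical operation that fails when fail is true.
func step(name string, fail bool) error {
	if fail {
		return errors.New(name + " failed")
	}
	return nil
}

// doStuff uses a named result so the deferred closure sees the final error.
func doStuff(failSecond bool) (err error) {
	defer func() { handleError(err) }()
	if err = step("first", false); err != nil {
		return err
	}
	if err = step("second", failSecond); err != nil {
		return err
	}
	return nil
}

func main() {
	fmt.Println(doStuff(true), len(handled))
	fmt.Println(doStuff(false), len(handled))
}
```

The handler runs on every return path, and the nil check inside it keeps the success path free of side effects.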

That's all.

That's all there really is to it.

By Daniel Morsing

8 April 2014

Machine code and garbage collection

So you're building a language.

If you're constructing a programming language, chances are that you want to use machine code somewhere in your implementation. It might be because you want to use C modules, it might be that you're using a JIT, or you might just want to compile ahead-of-time to a binary.

If your language is garbage-collected, there are a number of things you must consider. This is an attempt to mention just some of them.

Some background

But first, let's take a brief look at how garbage collectors work, or more specifically, how one type of garbage collector works. You have your heap with a collection of objects. These objects can have references to other objects. A tracing garbage collector stops the running program at some point during its execution. It will then go through the already known set of objects, usually global variables and objects on the stack. It'll mark these objects as reachable, find all the references to other objects within them and mark them as reachable as well. This is called the mark phase.

Once it has run out of objects to scan, it will take all objects that it didn't reach and collect them. This is called the sweep phase.
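The two phases can be sketched over a toy heap. This is a model for illustration, not how any production collector is written: object, mark, sweep and collect are hypothetical names, and the "heap" is just a slice of Go values.

```go
package main

import "fmt"

// object is a toy heap object with outgoing references and a mark bit.
type object struct {
	name   string
	refs   []*object
	marked bool
}

// mark flags obj and everything reachable from it (the mark phase).
func mark(obj *object) {
	if obj == nil || obj.marked {
		return
	}
	obj.marked = true
	for _, r := range obj.refs {
		mark(r)
	}
}

// sweep keeps the marked objects and clears their marks for the next cycle.
// Everything else is dropped from the heap (the sweep phase).
func sweep(heap []*object) (live []*object) {
	for _, o := range heap {
		if o.marked {
			o.marked = false
			live = append(live, o)
		}
	}
	return live
}

// collect runs one full cycle starting from the given roots.
func collect(heap, roots []*object) []*object {
	for _, r := range roots {
		mark(r)
	}
	return sweep(heap)
}

func main() {
	a, b, c := &object{name: "a"}, &object{name: "b"}, &object{name: "c"}
	a.refs = []*object{b}
	c.refs = []*object{c} // a cycle that no root reaches
	for _, o := range collect([]*object{a, b, c}, []*object{a}) {
		fmt.Println(o.name)
	}
}
```

Note that c refers to itself and is still collected: reachability from the roots is what counts, not whether something points at the object.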

The fact that a garbage collector only collects objects it didn't mark as reachable has some serious implications. It means that if you didn't recognize a reference somewhere, you will end up collecting items that are still live.

The problem

The big problem with machine code and garbage collection is that machine code does not know what a reference is. To the machine code, memory is just a large array of numbers and it has no idea what's a reference and what's not. Since we'll be collecting all objects that we didn't reach, missing a reference means potentially freeing memory that isn't ready to be freed. We'll need to find some way of getting around this problem.

One strategy is to treat all memory as references. This is what's called a conservative garbage collector. By doing this, we can be sure that no matter what, we will not miss a reference.

The downside with this tactic is that you can have spurious references. Somewhere in your data, you could have a string of bytes which looks like a pointer. The garbage collector would then keep the memory around, even though it could be freed. This is how the Boehm garbage collector for C works.
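A toy model of a conservative scan makes the spurious reference visible. The addresses, the heap map and scanConservatively are all made up for the illustration; a real collector works on raw memory rather than on a Go map.

```go
package main

import "fmt"

// scanConservatively treats every word as a possible pointer. heap maps the
// (made-up) address of each object to its name; any word that equals one of
// those addresses keeps that object alive.
func scanConservatively(words []uint64, heap map[uint64]string) []string {
	var kept []string
	seen := map[uint64]bool{}
	for _, w := range words {
		if name, ok := heap[w]; ok && !seen[w] {
			seen[w] = true
			kept = append(kept, name)
		}
	}
	return kept
}

func main() {
	heap := map[uint64]string{0x1000: "config", 0x2000: "buffer"}
	// 0x1000 is a real reference. 0x2000 is just an integer that happens to
	// look like an address, yet the collector cannot tell the difference.
	words := []uint64{42, 0x1000, 0x2000}
	fmt.Println(scanConservatively(words, heap))
}
```

Both objects are kept, even though only one of them is actually referenced.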

Even if you implement this scheme, there are still things to watch out for. Since compilers have become very good at optimizing and keeping often referenced variables in registers, you need to make sure that you're looking at the registers as well. This problem manifested itself as a bug in Ruby modules, where it would free memory that was only referenced by register.

Another strategy is to disable garbage collection while your native code is running. Once you're back in your interpreter, the native code would have left you in a state that your interpreter can understand and you can start your garbage collection. If you're implementing a dynamically typed language, you will have type information at hand anyway and you can use this to your advantage by just looking up where the references are.

The disadvantage is that your garbage collection is delayed until a return from your native code. This is fine for something like JITed code where you'll eventually return into your interpreter, but implementing event loops in C is a problem since you never return. This is also a problem if you're running more than one thread in your implementation. Since any one thread executing machine code can stall your collection, multiple threads increase the chance that you'll have to wait even longer before you can garbage collect.

Yet another strategy is to build type information for your machine code that the garbage collector can use. This usually manifests itself as 2 data structures. One is a bitmap of the stack frame, showing the garbage collector where pointers on the stack are for any given instruction. The other is bytecode telling the garbage collector where to find references. Using this combination of data structures, the garbage collector is able to precisely figure out where the references are.
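The stack bitmap half of this can be sketched in a few lines. frameMap and pointerSlots are hypothetical names, and the encoding (bit i set means slot i holds a pointer) is an assumption chosen for the example rather than the layout of any particular runtime.

```go
package main

import "fmt"

// frameMap is a hypothetical per-call-site bitmap: bit i set means that
// stack slot i holds a pointer at that instruction.
type frameMap uint64

// pointerSlots returns the indices of the slots the collector must scan.
func pointerSlots(m frameMap, nslots int) []int {
	var out []int
	for i := 0; i < nslots; i++ {
		if m&(1<<uint(i)) != 0 {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	// Slots 0 and 2 hold pointers. Slot 1 holds a plain integer, so the
	// collector skips it no matter what value it contains.
	fmt.Println(pointerSlots(0x5, 3))
}
```

Unlike the conservative approach, an integer that happens to look like an address is never followed, because the bitmap says it is not a pointer.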

Having this data means that you'll have to calculate it somehow. If you're compiling ahead-of-time, this shouldn't be a big problem. You just make sure that you emit the data when handling your types. However, if you're trying to integrate C modules, the normal C compilers will not help you. You'll either have to use conservative garbage collection or implement your own C compiler, just to get this information.

What's next

So far, I've only just scratched the surface. There are many more things to consider like how to make sure that all your threads can be stopped for collection, interrupting your machine code only when the heap is in a consistent state and how to make your compiler generate machine code that won't lose references.

I am by no means an expert on this subject. You shouldn't use any of this advice to actually build a garbage collector. But I do have an appreciation of how much work goes into building one and after this, I hope you do too.

By Daniel Morsing
