Morsing's Blog

11 June 2019

Using Causal Profiling to Optimize the Go HTTP/2 Server


If you've been keeping up with this blog, you might be familiar with Causal profiling, a profiling method that aims to bridge the gap between spending cycles on something and that something actually helping improve performance. I've ported this profiling method to Go and figured I'd turn it loose on a real piece of software, The HTTP/2 implementation in the standard library.


HTTP/2 is a new version of the old HTTP/1 protocol which we know and begrudgingly tolerate. It takes a single connection and multiplexes requests onto it, reducing connection establishing overhead. The Go implementation uses one goroutine per request and a couple more per connection to handle asynchronous communications, having them all coordinate to determine who writes to the connection when.

This structure is a perfect fit for causal profiling. If there is something implicitly blocking progress for a request, it should pop up red hot on the causal profiler, while it might not on the conventional one.

Experiment setup

To grab measurements, I set up an synthetic benchmark with an HTTP/2 server and client. The server takes the headers and body from the Google home page and just writes it for every request it sees. The client asks for the root document, using the client headers from firefox. The client limits itself to 10 requests running concurrently. This number was chosen arbitrarily, but should be enough to keep the CPU saturated.

Causal profiling requires us to instrument the program. We do this by setting Progress markers, which measure the time between 2 points in the code. The HTTP/2 server uses a function called runHandler that runs the HTTP handler in a goroutine. Because we want to measure scheduler latency as well as handler runtime, we set the start of the progress marker before we spawn the goroutine. The end marker is put after the handler has written all its data to the wire.

To get a baseline, let's grab a traditional CPU profile from the server, yielding this profiling graph:

OK, that's mostly what you'd expect from a large well-optimized program, a wide callgraph with not that many obvious places to sink our effort into. The big red box is the syscall leaf, a function we're unlikely to optimize.

The text gives us a few more places to look, but nothing substantial

(pprof) top
Showing nodes accounting for 40.32s, 49.44% of 81.55s total
Dropped 453 nodes (cum <= 0.41s)
Showing top 10 nodes out of 186
      flat  flat%   sum%        cum   cum%
    18.09s 22.18% 22.18%     18.84s 23.10%  syscall.Syscall
     4.69s  5.75% 27.93%      4.69s  5.75%  crypto/aes.gcmAesEnc
     3.88s  4.76% 32.69%      3.88s  4.76%  runtime.futex
     3.49s  4.28% 36.97%      3.49s  4.28%  runtime.epollwait
     2.10s  2.58% 39.55%      6.28s  7.70%  runtime.selectgo
     2.02s  2.48% 42.02%      2.02s  2.48%  runtime.memmove
     1.84s  2.26% 44.28%      2.13s  2.61%  runtime.step
     1.69s  2.07% 46.35%      3.97s  4.87%  runtime.pcvalue
     1.26s  1.55% 47.90%      1.39s  1.70%  runtime.lock
     1.26s  1.55% 49.44%      1.26s  1.55%  runtime.usleep

Mostly runtime and cryptography functions. Let's set aside cryptography, since it will already have been optimized a lot.

Causal Profiling to the rescue

Before we get to the profiling results from causal profiling, it'd be good to have a refresher on how it works. When causal profiling is enabled, a series of experiments are performed. An experiment starts by picking a call-site and some amount of virtual speed-up to apply. Whenever that call-site is executed (which we detect with the profiling infrastructure), we slow down every other executing thread by the virtual speed-up amount.

This would seem counter-intuitive, but since we know how much we slowed down a program when we take our measurement from the Progress marker, we can undo the effect, which gives us the time it would have taken if the chosen call-site was sped up. For a more in-depth look at causal profiling, I suggest you read my other causal profiling posts and the original paper.

The end result is that a causal profile looks like a collection of call-sites, which were sped up by some amount and as a result changed the runtime between Progress markers. For the HTTP/2 server, a call-site might look like:

0x4401ec /home/daniel/go/src/runtime/select.go:73
  0%    2550294ns
 20%    2605900ns    +2.18%    0.122%
 35%    2532253ns    -0.707%    0.368%
 40%    2673712ns    +4.84%    0.419%
 75%    2722614ns    +6.76%    0.886%
 95%    2685311ns    +5.29%    0.74%

For this example, we're looking at an unlock call within the select runtime code. We have the amount that we virtually sped up this call-site, the time it took, the percent difference from the baseline. This call-site doesn't show us much potential for speed-up. It actually shows that if we speed up the select code, we might end up making our program slower.

The fourth column is a bit tricky. It is the percentage of samples that were detected within this call-site, scaled by the virtual speed-up. Roughly, it shows us the amount of speed-up we should expect if we were seeing this as a traditional profile.

Now for a more interesting call-site:

0x4478aa /home/daniel/go/src/runtime/stack.go:881
  0%    2650250ns
  5%    2659303ns    +0.342%    0.84%
 15%    2526251ns    -4.68%    1.97%
 45%    2434132ns    -8.15%    6.65%
 50%    2587378ns    -2.37%    8.12%
 55%    2405998ns    -9.22%    8.31%
 70%    2394923ns    -9.63%    10.1%
 85%    2501800ns    -5.6%    11.7%

This call-site is within the stack growth code and shows that speeding it up might give us some decent results. The fourth column shows that we're essentially running this code as a critical section. With this in mind, let's look back at the traditional profile we collected, now focused on the stack growth code.

(pprof) top -cum newstack
Active filters:
Showing nodes accounting for 1.44s, 1.77% of 81.55s total
Dropped 36 nodes (cum <= 0.41s)
Showing top 10 nodes out of 65
      flat  flat%   sum%        cum   cum%
     0.10s  0.12%  0.12%      8.47s 10.39%  runtime.newstack
     0.09s  0.11%  0.23%      8.25s 10.12%  runtime.copystack
     0.80s  0.98%  1.21%      7.17s  8.79%  runtime.gentraceback
         0     0%  1.21%      6.38s  7.82%  net/http.(*http2serverConn).writeFrameAsync
         0     0%  1.21%      4.32s  5.30%  crypto/tls.(*Conn).Write
         0     0%  1.21%      4.32s  5.30%  crypto/tls.(*Conn).writeRecordLocked
         0     0%  1.21%      4.32s  5.30%  crypto/tls.(*halfConn).encrypt
     0.45s  0.55%  1.77%      4.23s  5.19%  runtime.adjustframe
         0     0%  1.77%      3.90s  4.78%  bufio.(*Writer).Write
         0     0%  1.77%      3.90s  4.78%  net/http.(*http2Framer).WriteData

This shows us that newstack is being called from writeFrameAsync. It is called in the goroutine that's spawned every time the HTTP/2 server wants to send a frame to the client. Only one writeFrameAsync can run at any given time and if a handler is trying to send more frames down the wire, it will be blocked until writeFrameAsync returns.

Because writeFrameAsync goes through several logical layers, it uses a lot of stack and stack growth is inevitable.

Speeding up the HTTP/2 server by 28.2% because I felt like it

If stack growth is slowing us down, then we should find some way of avoiding it. writeFrameAsync gets called on a new goroutine every time, so we pay the price of the stack growth on every frame write.

Instead, if we reuse the goroutine, we only pay the price of stack growth once and every subsequent write will reuse the now bigger stack. I made this modification to server and the causal profiling baseline turned from 2.650ms to 1.901ms, a reduction of 28.2%.

A bit caveat here is that HTTP/2 servers usually don't run full speed on localhost. I suspect if this was hooked up to the internet, the gains would be a lot less because the stack growth CPU time would get hidden in the network latency.


It's early days for causal profiling, but I think this small example does show off the potential it holds. If you want to play with causal profiling, you can check out my branch of the Go project with causal profiling added. You can also suggest other benchmarks for me to look at and we can see what comes from it.

P.S. I am currently between jobs and looking for work. If you'd be interested in working with low-level knowledge of Go internals and distributed systems engineering, check out my CV and send me an email on [email protected].

9 October 2018

The anatomy of a silly hack


I recently deployed a small hack on my personal website. In short it enables this:

[daniel@morslaptop ~]$ dig LOC

; <<>> DiG 9.11.4-P1-RedHat-9.11.4-2.P1.fc27 <<>> LOC
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 62528
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

; EDNS: version: 0, flags:; udp: 512
;        IN    LOC

;; ANSWER SECTION:    4    IN    LOC    43 46 46.715 S 145 49 45.069 W 417946.18m 1m 1m 1m

;; Query time: 42 msec
;; WHEN: Tue Oct 09 12:04:02 BST 2018
;; MSG SIZE  rcvd: 75

I am tracking the location of the International Space station via DNS. This might raise some questions like "Why does DNS have the ability to represent location?", "how the hell are you finding the location of the ISS?" and "Why would you willingly run DNS infrastructure in the Year of Our Lord 2018?".

DNS LOC records

DNS LOC records are mostly a historical oddity that lets you specify a location on earth. Back in the infancy of computer networking, when everybody wasn't sold on the whole TCP/IP thing yet, ferrying email between computers was handled through UUCP. UUCP worked by periodically talking to nearby systems and exchanging any email that needed to be forwarded.

To send an email, you'd have to look up a path from your location to the computer system of the recipient. You'd then address the email to host!foo!bar!dest_host!recipient. To figure out the path, map files were created that included basic information like the administrative contact, the phone number to call when your email was delayed and in some cases, the location of the server.

Meanwhile, the DNS system was being developed. DNS was meant to replace the HOSTS.TXT file used in TCP/IP networks that mapped user-readable names to IP addresses. To establish feature parity with old UUCP maps, an RFC was written that let you point out where your server is on a map. These LOC records are 16 byte binary representations of latitude, longitude, height above sea level and precision information

I don't know why they included the altitude of the server in the format. I guess people were going to use it to say whether a server was in the basement or on the third floor. For our purposes, the important thing is the range it allows. With a maximum of 42849 kilometers, this easily allows us to represent the location of low-earth orbiting objects.

The international Space Station is one of those objects, so how do we find it?

A brief introduction to the world of orbital mechanics

Tracking satellites in orbit is hard. It usually involves people at NORAD with radars. The good news is that satellites follow the laws of physics and the folks at NORAD are a kindly bunch, who lets us look over their notes.

Every day, they publish new information about the objects in the earths orbit that they track. This data is known as a two-line element set (TLE). The ISS is one of those objects and given a data set, we can predict where the ISS will be sometime in the future.

An aside: TLEs include the year a satellite was launched in its data set and it is specified to have 2 digits. In the leadup to Y2K, there was discussions about changing the TLE format such that it would be able to handle satellites launched after 1999. However, since no artificial satellites existed before 1957, it was decided that any year before 57 would be in the 21st century and any year after would be in the 20th century, solving the problem once and for all!

For the purposes of modeling a satellite orbiting earth, we consider earth as a point mass, with a reference coordinate system where the Z axis points positive through the true north pole and X and Y sits in a plane through the equator. This is known as an earth-centered inertial (ECI) coordinate system and does not rotate with the earth. The model for calculating the future orbit is known as the SGP4/SDP4 algorithms and take into account a whole slew of things, like gravitational effects, atmospheric drag and the shape of the earth. Given the TLE data, we know the orbit and can calculate a future position in the ECI.

The ECI makes orbital calculations easier, because we don't have to handle the rotation of the earth, but we want latitude, longitude and altitude (LLA) for our DNS positioning and latitude is defined by an angular offset to the international meridian. Since we can't stop the world turning, we're forced to do More Maths. Luckily, the angular offset between the reference plane and the meridian can be defined by where the earth is in its rotation around itself and where the earth is in its rotation around the sun. If that all remains the same, then we're all set!

So, we have the position of the ISS, calculated from a TLE in the ECI, converted to LLA and shoved into DNS. Now that we've run out of 3 letter acronyms, let's see how we manage to actually get this thing deployed.

Deploying this thing

Despite my enthusiasm for this project, I do not want to develop a DNS server. The protocol has become a vast, complex mess and I want to offload the effort as much as possible.

PowerDNS provides a good authoritative DNS server that can use remote backends via HTTP. These backends can use a JSON format and I wrote a small HTTP server in Go that responds with custom LOC records that follow the ISS, calculated from a TLE.

Since the SGP4 model is just a model, we have to periodically match it with the physical data to track the ISS accurately. Additionally, the ISS may use its boosters to push it into a higher orbit as it counteracts its orbital decay. Because of this, we fetch new TLEs every day from CelesTrak to make sure that we're following the ISS properly.

The PowerDNS server and my backend HTTP server are deployed as 2 containers in a pod, running on Google Kubernetes Engine. Since my website is managed by Cloudflare, we add an NS record for pointing at and an A record for to delegate the responsibility for the subdomain to my new DNS server.

You can try it out by running dig LOC in your terminal. You may have to change nameservers if the one on your router doesn't handle DNS esoterica properly.

Sadly, there are a couple of things that I'd like to have added to this hack that weren't possible. GKE doesn't allow you to have a load balanced service that listens on UDP and TCP both on the same IP address which breaks certain DNS operations. Since the LOC record is fairly small, we never actually hit a truncated DNS record and the lack of TCP is mostly a nuisance.

Cloudflare doesn't allow you to specify an A record and NS record for the same domain. Additionally, they don't allow you to proxy wildcard domains unless you are an enterprise customer. Because of this, I couldn't add a small HTTP server to that redirected to an explanation and a graphical view like the ISS tracker.


All in all, I'm pretty happy with how this turned out and I learned a bunch about orbital mechanics.

As you've probably guessed by me blogging again: I am looking for work. If you want to work with someone who has knowledge of DNS, Go internals and knows where the HTTP bodies are buried, have a look at my CV and send me an email on [email protected].

18 March 2018

TCP is an underspecified two-node consensus algorithm and what that means for your proxies


I recently found myself dealing with TCP load balancing for a project and I've come to think that generic TCP proxies can't be implemented without substantial pain and how it makes TLS terminating proxies a bad idea.

That might seem like a pretty bold statement right out of the gate, so let's drill down a bit.

TCP is not a stream of bytes

When people talk about TCP, it's easy to fall into the trap of thinking of it as a connection, with a bi-directional stream of bytes. That is the abstraction that TCP provides, but it's not what TCP is. TCP is an agreement between 2 nodes to run a simple consensus algorithm. The data that is agreed on is (roughly) how much of what I have sent have you seen and how much have I seen of what you've sent. Since there are only 2 nodes, the algorithm is much simpler than what you would see in Raft or Paxos, but like a lot of consensus algorithms, it's based on nodes agreeing on what the current highest number is.

Throughout this post, I'll be using "connection" as a shorthand for this agreement, but keep in mind that we're talking about 2 nodes communicating over a lossy connection, not a property of the network itself.

Besides the streams being sent, there's another important bit of information: the state of the connection itself. Annoyingly, some of this information is not transmitted over the network. The state of the connection is based largely on heuristics of the individual TCP implementations and to make matters worse, we allow programs to change this behavior depending on the application protocol. If you have a box in the middle of the network that is able to read the entirety of a TCP session, it would not be able to guess at what the state of a TCP connection is. It would end up in the position of having to guess at what is meant by a certain series of TCP/IP packets.

So, what impact does this have for TCP proxies? Let's set up a simple hypothetical with 3 nodes, a client, a proxy and a server. Whenever the client establishes a TCP connection to the proxy, it in turn establishes a TCP connection to the server. Whatever the client sends to the proxy gets forwarded to the server and whatever the server sends to the proxy gets forwarded to the client.

On the application layer, The client sends a request that takes the server a long time to reply to. Since the client expects that the request will take a long time, it enables TCP keepalive to periodically inform the server that it is still alive and able to receive the response.

Since the proxy will be the recipient of the keepalive packets, the server will not see them. It might think that the client has gone away, stop processing the request and close the TCP connection. We can have the proxy guess at what timeout might be appropriate, but those values are very protocol-specific and we end up either having the proxy terminate still viable sessions or taking up resources on the proxy machine.

Having a proxy in the middle shields the server from the specifics of the clients TCP behavior, even in cases where it would want to know it. The usual case where people might want that information is the IP address and there are ways of having it be transmitted, but TCP is a large spec. There are a myriad of features in TCP like TCP Fast Open or alternative congestion control or the client might be using archaic features and the proxy will either negate their advantages, or outright break the connection.

This serves as an example of the end-to-end principle in action and we don't have to stray to far from TCP to see more examples of it. On the IP layer, datagrams can be split into multiple parts for when the underlying physical transport cannot support a packet of a given size. The idea was that IP datagrams would be split and then recombined by the routers in the middle of the network when the physical layer would support a packet of that size again. This turned out to be disastrous in practice. Hosts would often get partial datagrams that would never be able to recombine and they would also have no way to tell the host on the other end that a packet was lost (the packet acknowledgement is in the TCP layer). Because of this issue and many more, we have largely scrapped the idea of IP fragmentation and came up with better solutions.

What can be done about it?

If you're building an application that does use TCP, you need to be prepared for the possibility that your application will end up being proxied through a host that might not particularly care for whatever TCP tricks you're doing or what the state of the protocol is at any given moment. You can guard against these issues by constructing your protocol in a resilient manner. Note that while these safeguards will help with proxies, they're in general a good idea, since they will also guard against lower-level network issues on the IP layer.

Application-level pings

Since you can't rely on the proxy to pass through the behavior of the TCP connection, techniques like keepalive packets can no longer be used to ensure liveness of a connection. Instead, you'll have to implement a ping on the application level. Since a proxy must pass through the data if it is to be useful in any way, these pings will have to poke through the proxy.

End-of-file is not the end of things

A lot of TCP proxies will turn connection errors into a clean termination. If you're using the closing of the TCP connection as a way to signal no more data (looking at you, HTTP 1.0), you cannot determine whether you have read the entire response, or there might be more data if the operation was retried.

If you're in the position of having to implement a proxy, for load balancing or inspection reasons, there are a couple of things you can do to make it less invasive.

Implement the protocol

The only way a TCP proxy might know when it is safe to terminate a connection, is when it knows the protocol state. A great example of this is an HTTP load balancer. It can see if there is an outstanding request and keep the connection open. More importantly, if the proxy needs to go down for maintenance, it can terminate connections cleanly and let the client re-establish.

Be a NAT

If the mismatch between a proxy's TCP implementation and the server's TCP implementation is an issue, another solution is for the proxy to not even implement TCP at all. Instead of acting as a TCP proxy, act like a smart IP forwarder. Since the proxy is not constructing packets, only modifying and forwarding them, the mismatch in implementation goes away. Examples of software that does this are Linux Virtual Server or Google's Maglev.

TLS proxies

A special case of the TCP proxy is the TLS proxy. They take a TCP connection containing a TLS session and turn them into unencrypted TCP within a trusted network. These proxies are useful because they offload the responsibilities of implementing complex cryptography code away from the application servers. Additionally, they are useful for key management, since your backends no longer have to have the keys stored on the servers.

But since they're a TCP proxy, they have the same issues as any other TCP proxy. Additionally, they cannot function as a NAT, since they have to respond with their own data to negotiate the TLS handshake.

So, we're stuck in a dilemma. Either use a TLS proxy, gain the ease of cryptographic deployment and lose nuances in TCP handling, or push TLS termination into the servers, creating key management issues and increased responsibility for cryptography code.


I have not yet figured out what can be done to solve the specific TLS issue. Using HSMs or something like CloudFlare's Keyless SSL and terminating at the edges might help with key management, but it still pushes large burden onto your application servers. Since the edges have to terminate, you lose the ability to route based on the information inside the TLS session. SNI would usually allow you to route to different backends and have key management be in one location, but that is no longer possible. This poses a large problem for cloud providers who want to run their TLS terminators as multi-tenant machines.

I think the issue also serves to illustrate a layering violation in the design of TLS itself. While TLS uses the TCP protocol to simplify its key negotiation handshake, the reliable delivery of TCP is completely orthogonal to the goals of privacy and integrity that TLS provides. If we pull the cryptography into TCP, we might be able to simplify things further. If a given TCP segment was always guaranteed to contain enough data to decrypt and authenticate it, then a pass-through NAT-like TLS proxy becomes trivial, without having to do significant connection tracking or creating acknowledgement packets on the end servers behalf. QUIC is an example of this kind of construction, although they chose to encrypt the transport layer for different reasons related to middleboxes.

For now, I'm going to grit my teeth, implement the protocols fully in my TLS terminators and hope that the mismatch never gets too bad, but there is an interesting issue to be solved here. If you meet me in person, buy me a beer and I'll tell you all about my plans to put a TLS terminator inside a hypervisor.

20 September 2017

A Causal Profiling update


Back in February last year, I wrote about my attempts at porting the Causal Profiling algorithm to the Go runtime.

Since then, development has stalled, but I did find some time recently to update it to the newest Go version. I realized during the rebase that the patchset could be simplified greatly and give far more accurate results. In this blog post, I'm going to go into some of the details and counterintuitive findings from this simplification.

For the rest of this post, I assume some familiarity with Causal Profiling and the Go scheduler, which I also wrote about some time back.

What went wrong?

Causal Pofiling works by performing experiments. A line is chosen and we start a profiler that samples when that line is being executed. When a signal is triggered, the algorithm selectively delays execution of all other threads in the system. The end effect is that we virtually speed up the chosen line. During this experiment, we also perform a measurement on some code that we've instrumented to find if it's sped up. This would usually be something like an important HTTP request that we want to optimize. Once the experiment is finished, we take the delays added by the profiler and subtract it from the instrumented measurements and that's our result.

This leaves us in the position of choosing what constitutes a thread. The Go runtime has 3 things that are sufficiently thread-like. Gs, Ms and Ps. Gs are goroutines, with a stack and instruction pointer. They're the thing representing the Go code that you've written. Ms are operating system threads. They're tasked with executing the Gs. For various reasons, an M might not be executing Go code. The most common one is being in a syscall. Which leaves us with Ps. Ps are the scheduling context. As Ms move into syscalls, Ps are handed off between them to make sure that there's always GOMAXPROCS Ms executing Go code at any given time.

Choosing the M as the Causal Profiling thread is difficult. They tend to not be executing at all times and dealing with the syscall machinery makes it an inaccurate fit.

Choosing the P means that threads are always busy, removing a large part of the bookkeeping needed for Causal Profiling. It would seem that it's the perfect fit, but it does have some disadvantages. During the time that the other Ps are executing their delays, the sped-up P would be free to steal work off them. The execution would then not be delayed, which in turn would mean that there'd be a speed-up, even though the G causing the delay and the Gs executed subsequently wouldn't be causally connected.

It breaks down even further when you consider the case where there's only one P. A speed-up being applied then means that no delays are being inserted, turning the algorithm into a regular profiler.

To avoid these cases of implicit causal connections I ended up with the G as the thread. Since Gs are only ever unblocked by other Gs, it ended up simplifying the causal connections implementation significantly.

The part where it didn't work so well was the implementation of delays. Sleeping goroutines are implemented with a single timer process that sits on a thread by itself and enqueues goroutines to the scheduler when their sleeps have been executed. This allows the runtime to only have one sleeping OS thread, but thousands of sleeping goroutines, significantly reducing the amount of resources needed for handling timers.

For Causal Profiling, every time a given goroutine was scheduled, I'd check if it needed to be delayed and put it into the queue of the timer process. This led to lots of lock contention on internal state of the timer process. Additionally, since it involved 2 roundtrips into the scheduler for every goroutine execution, the overhead there was increased. This let to inaccurate delays and made the results noisy and susceptible to misinterpretation.

A realization

The noisy measurements were a large part of why I mostly abandoned the project. I've been rebasing patches so that they apply cleanly on new Go releases, but otherwise, I've found little time to do any development.

During the latest rebase I had a realization. If you're using Ps as the thread for the purposes of Causal Profiling, the implicit causal connections are actually a feature. They end up modeling the effects that would happen if the currently executing thread actually did get sped up.

Consider the 1 P scenario I described above. There is only one thread ever doing any work on executing goroutines and it can only execute one goroutine at a time. In such a scenario, speeding up any given execution correlates with a 1-to-1 speed-up of the final program. The causal connection being inherited by goroutines running on that P is exactly the end result we want.

This extends to the multiple P scenario. Say you have one P that has just had a speed-up applied to it and another one executing its delay. If the sped-up P finishes its current task and then work steals from the other P, that reduces the queuing delay that the stolen task would experience. That lines up perfectly with what would happen if an actual speed-up happened.

The implementation

With moving the thread concept onto Ps, there's an issue. Causal Profiling requires that threads that are unblocked by other threads are credited with the delays executed by unblocking thread. However, Ps never directly interact. Gs interact with other Gs and Ps are just sitting in a loop, finding work to do and executing it.

The solution to this issue is to treat Gs like tokens. Every time a G is executed on a P, we tag it with the delay count of that P. If it then gets scheduled onto another P, we will inherit the value of that delay.

Delays no longer involve the Go scheduler. Just like in the original Causal Profiling paper, we now execute the delays inside the signal handler. The OS scheduler is now handling these delays, which it does much more precisely. Since there are only going to be GOMAXPROCS Ps executing delays at any given time, the volume of sleeping threads is also kept to a minimum.


Based on my tiny experiments so far, the results are way less noisy than before and can actually be used to guide optimization.

The other thing that I've noticed is that Ps are way more causally connected than Gs. Because of this, delays tend to be handed off more between threads, meaning less delay overall. The end result is that speed-ups have a bigger effect. This made me suspicious at first, because I didn't want to fall into the trap of evaluating a profiling algorithm more favorably because it gives a bigger effect. However, thinking through the queuing delays and the execution model, I've convinced myself that the bigger effect is just it being more accurate.

If you want to play with Causal Profiling and aren't afraid of applying patches to the core Go runtime you're running, you can find it on my github. If you need some help getting started, feel free to reach out to me on the email address on the sidebar.

Aside: If you'd like to work with someone who does these kinds of experiments, I'm currently available for hire. Have a look at my CV and if you find it interesting, You can reach me on the email on the sidebar.

28 August 2017

Let's assess Kubernetes


Being on the Go conference circuit, I see about 3 different talks a year about new ways to boot Kubernetes. While I have a vague understanding of what it is, I don't have any practical experience or deeper knowledge about it. Having some spare time on my hands, I decided I'd try it out by porting a simple application to use it and write down my first impressions.

I'm not hoping to write a "Getting started with Kubernetes" post, because the official docs do a way better job of this than I could imagine ever doing. In general, the docs are really good. In particular, I found the concepts section really helpful when trying to grasp the system. Well done docs writers! 👍

As for the application, I have a custom-written blog server that you're reading this page on right now. It used to run on a small Linode instance that I manually operated. It might seem like extreme overkill to use a cluster management stack to deploy a single small app like this and frankly, it is. However, I found it to be a good way of getting hands on experience with the system. At the time of publication, this blog is now running on a single-node Google Container Engine cluster.

Pod lifecycles

Kubernetes' claim to fame is its scheduling. For each deployment that Kubernetes manages, it schedules sets of containers (known as pods) onto machines to be run (known as nodes). During rollout and scaling, pods get killed and created in order to satisfy the replica requirements. While the scheduling ensures better utilization of resources, I feel like the bigger impact that Kubernetes has is the environment that the pods run in. Out of the box, it provides image management, internal DNS and rollout automation. This makes me think that it's worth running the system with single pods scheduled onto single nodes, effectively disabling the scheduler.

One thing that the lifecycle doesn't seem to handle is caches that need to be kept hot. While it's generally a bad idea to have these caches, they do show up in clusters in the wild. If you have a memcache container, running along with a server of some kind, the only way to upgrade the server is to kill the memcache along with it. There's a mechanism for handling stateful pods, but that requires dumping the entire state onto disk and then reading it back when the pod is rescheduled. If you're finding yourself in a situation where this is necessary, then waiting for the restore from disk isn't high up on the list of things you want to do.


The networking setup for pod-to-pod communication is really nice. On Google Cloud Platform, each node gets a /24 subnet in the private range and each pod then gets its own private IP. On any given pod, you can then use whatever port range you want for your services. This sort of isolation avoids situations where multiple applications are all trying to bind to the same port. If you want to drop your http server at port 80, you can do so, without having to worry about other http servers. Most applications know to avoid 80, but there are a lot of things that open debug servers on 8080 and I have had systems fail because of this.

Another benefit of this namespacing is that you can interact with various pieces of software that might not be set up for running on non-standard ports. Running DNS on any port that isn't 53 is really damn hard, because nothing will query on it.

This kind of isolation is based on each node having its own subnet. If your cloud provider only gives you a single IP per node, you will have to install some kind of overlay network and in my experience, they tend to not work that well.

While pod-to-pod networking is great, you still have to figure out what the IP addresses are in order to communicate between them. Kubernetes has a concept of a "service" for this. By default, each service gets a single internal cluster IP. You can discover this IP via internal DNS and then connect to it. If you have multiple pods that satisfy the same service, there is automatic load balancing between them which is handled by the communicating node.

I'm not sure what I think about the single cluster IP yet. Most applications are really bad at handling multiple DNS results and will select the first result encountered, leading to uneven resource usage. The cluster IP saves them from this issue. However, this setup falls for the classic trap of conflating service discovery with liveness. Once you get into larger cluster sizes, you start seeing more partitions that aren't symmetrical. The node hosting a pod might be able to perform its health checks and report them back the Kubernetes master, but not be reachable from other nodes. When this happens, load balancing will still try to reach the faulty node and since there is only one IP, you don't have a fallback. You can try terminating the connection and trying again in the hopes that it will load balance the TCP connection onto another node that is reachable, but that isn't quite optimal.

My advice is that if you're deploying on Kubernetes, you should add a health check based on how many requests have been served on a given pod since the last health check. If it's lower than a certain amount, mark the pod as unhealthy. This way, the TCP load balancing should quickly evict the pod, even if it's reachable from the Kubernetes master. There are ways of configuring services to give you direct pod IPs, but unless you need network identity, I don't think it's necessary.

Since pods have private IPs that aren't routable on the public internet, you need some sort of translation from the pod local IP to a routable IP when accessing things from outside the cluster. I'm usually not a fan of this type of NAT, so this looks like a clear cut case for IPv6. Just give every node a publicly routable subnet and the issue goes away. Unfortunately, Kubernetes doesn't support IPv6.

In practice, NAT isn't that big of an issue. For traffic that comes from outside your cluster, Kubernetes pushes you hard to use services that interact with cloud platform load balancers that provide a single IP. Since a lot of contributors to Kubernetes worked on orchestration at Google, it's not a surprise that they'd design around a Maglev model.

Sadly, I've yet to figure out a way to expose a service to the outside world in a high-availability way in an environment that doesn't have a such a load balancer. You can instruct Kubernetes to route any traffic that reaches the cluster on an external IP, but the IP don't get taken into account when scheduling pods onto nodes. If the pod gets scheduled away from a node that has that IP routed to it, then you end up with the node having to do NAT, which is extra work for that node. Another other issue is that there isn't a concept of port collisions on the external IP layer. If you have a controller that updated external IPs (and whatever DNS service you're using) according to which nodes they land on, you could potentially have 2 pods that both want port 80 traffic, but with nothing to distinguish the 2 IPs from each other.

For now, I'm going to live with the cloud load balancer. It makes my life easier and I don't expect to be running Kubernetes outside a cloud environment any time soon. If you do know a way to do this, I'd love to know how. My email can be found in the link on the side.

More cloud Magic

Managing persistent volumes is another place where cloud magic comes into play. The supported ways to get a disk that a pod can access skews heavily towards cloud providers and while you can create a volume that is just an attached disk on a node, it is an alpha feature and not ready for production yet. It'd be interesting to see if you could bootstrap an NFS service running on Kubernetes and then have Kubernetes use it for handing out persistent volumes, but since I'm only running a tiny blog server, I think that's a task for another day.

Cloud magic might seem like a dealbreaker for some people, but the cluster computing environment is so heavily based on cloud services now that you're going to have a hard time avoiding it. I see a lot of people avoiding the more magical parts of the cloud to prevent vendor lock-in, but they underestimate the cost of developing solutions in-house and the amount of implicit assumptions that come with running on a cloud platform in the first place. Kubernetes provides a consistent interface on this cloud magic, so while you might be relying on it, at least it's technically standardized.


I found my tiny excursion on Kubernetes quite enjoyable. Obviously, this is a toy problem, so any issues that might surface from large-scale use isn't apparent to me. I'm also running this on a cloud platform, so the backend administration might as well be made of bees and I wouldn't notice. As a user of the system however, I'm starting to realize why so many people want new ways to boot it.

As a side note, I'm currently looking for work. If you need a Go compiler/runtime engineer who's managed to know distributed systems through osmosis, hit me up on the sidebar. If you're curious about my skills, you can check out my CV.

See the index for more articles.