28 August 2017

Let's assess Kubernetes

Introduction

Being on the Go conference circuit, I see about 3 different talks a year about new ways to boot Kubernetes. While I have a vague understanding of what it is, I don't have any practical experience or deeper knowledge about it. Having some spare time on my hands, I decided I'd try it out by porting a simple application to use it and write down my first impressions.

I'm not hoping to write a "Getting started with Kubernetes" post, because the official docs do a way better job of this than I could imagine ever doing. In general, the docs are really good. In particular, I found the concepts section really helpful when trying to grasp the system. Well done docs writers! 👍

As for the application, I have a custom-written blog server that you're reading this page on right now. It used to run on a small Linode instance that I manually operated. It might seem like extreme overkill to use a cluster management stack to deploy a single small app like this and frankly, it is. However, I found it to be a good way of getting hands on experience with the system. At the time of publication, this blog is now running on a single-node Google Container Engine cluster.

Pod lifecycles

Kubernetes' claim to fame is its scheduling. For each deployment that Kubernetes manages, it schedules sets of containers (known as pods) onto machines to be run (known as nodes). During rollout and scaling, pods get killed and created in order to satisfy the replica requirements. While the scheduling ensures better utilization of resources, I feel like the bigger impact that Kubernetes has is the environment that the pods run in. Out of the box, it provides image management, internal DNS and rollout automation. This makes me think that it's worth running the system with single pods scheduled onto single nodes, effectively disabling the scheduler.

One thing that the lifecycle doesn't seem to handle is caches that need to be kept hot. While it's generally a bad idea to have these caches, they do show up in clusters in the wild. If you have a memcache container, running along with a server of some kind, the only way to upgrade the server is to kill the memcache along with it. There's a mechanism for handling stateful pods, but that requires dumping the entire state onto disk and then reading it back when the pod is rescheduled. If you're finding yourself in a situation where this is necessary, then waiting for the restore from disk isn't high up on the list of things you want to do.

Networking

The networking setup for pod-to-pod communication is really nice. On Google Cloud Platform, each node gets a /24 subnet in the 10.0.0.0/8 private range and each pod then gets its own private IP. On any given pod, you can then use whatever port range you want for your services. This sort of isolation avoids situations where multiple applications are all trying to bind to the same port. If you want to drop your http server at port 80, you can do so, without having to worry about other http servers. Most applications know to avoid 80, but there are a lot of things that open debug servers on 8080 and I have had systems fail because of this.

Another benefit of this namespacing is that you can interact with various pieces of software that might not be set up for running on non-standard ports. Running DNS on any port that isn't 53 is really damn hard, because nothing will query on it.

This kind of isolation is based on each node having its own subnet. If your cloud provider only gives you a single IP per node, you will have to install some kind of overlay network and in my experience, they tend to not work that well.

While pod-to-pod networking is great, you still have to figure out what the IP addresses are in order to communicate between them. Kubernetes has a concept of a "service" for this. By default, each service gets a single internal cluster IP. You can discover this IP via internal DNS and then connect to it. If you have multiple pods that satisfy the same service, there is automatic load balancing between them which is handled by the communicating node.

I'm not sure what I think about the single cluster IP yet. Most applications are really bad at handling multiple DNS results and will select the first result encountered, leading to uneven resource usage. The cluster IP saves them from this issue. However, this setup falls for the classic trap of conflating service discovery with liveness. Once you get into larger cluster sizes, you start seeing more partitions that aren't symmetrical. The node hosting a pod might be able to perform its health checks and report them back the Kubernetes master, but not be reachable from other nodes. When this happens, load balancing will still try to reach the faulty node and since there is only one IP, you don't have a fallback. You can try terminating the connection and trying again in the hopes that it will load balance the TCP connection onto another node that is reachable, but that isn't quite optimal.

My advice is that if you're deploying on Kubernetes, you should add a health check based on how many requests have been served on a given pod since the last health check. If it's lower than a certain amount, mark the pod as unhealthy. This way, the TCP load balancing should quickly evict the pod, even if it's reachable from the Kubernetes master. There are ways of configuring services to give you direct pod IPs, but unless you need network identity, I don't think it's necessary.

Since pods have private IPs that aren't routable on the public internet, you need some sort of translation from the pod local IP to a routable IP when accessing things from outside the cluster. I'm usually not a fan of this type of NAT, so this looks like a clear cut case for IPv6. Just give every node a publicly routable subnet and the issue goes away. Unfortunately, Kubernetes doesn't support IPv6.

In practice, NAT isn't that big of an issue. For traffic that comes from outside your cluster, Kubernetes pushes you hard to use services that interact with cloud platform load balancers that provide a single IP. Since a lot of contributors to Kubernetes worked on orchestration at Google, it's not a surprise that they'd design around a Maglev model.

Sadly, I've yet to figure out a way to expose a service to the outside world in a high-availability way in an environment that doesn't have a such a load balancer. You can instruct Kubernetes to route any traffic that reaches the cluster on an external IP, but the IP don't get taken into account when scheduling pods onto nodes. If the pod gets scheduled away from a node that has that IP routed to it, then you end up with the node having to do NAT, which is extra work for that node. Another other issue is that there isn't a concept of port collisions on the external IP layer. If you have a controller that updated external IPs (and whatever DNS service you're using) according to which nodes they land on, you could potentially have 2 pods that both want port 80 traffic, but with nothing to distinguish the 2 IPs from each other.

For now, I'm going to live with the cloud load balancer. It makes my life easier and I don't expect to be running Kubernetes outside a cloud environment any time soon. If you do know a way to do this, I'd love to know how. My email can be found in the link on the side.

More cloud Magic

Managing persistent volumes is another place where cloud magic comes into play. The supported ways to get a disk that a pod can access skews heavily towards cloud providers and while you can create a volume that is just an attached disk on a node, it is an alpha feature and not ready for production yet. It'd be interesting to see if you could bootstrap an NFS service running on Kubernetes and then have Kubernetes use it for handing out persistent volumes, but since I'm only running a tiny blog server, I think that's a task for another day.

Cloud magic might seem like a dealbreaker for some people, but the cluster computing environment is so heavily based on cloud services now that you're going to have a hard time avoiding it. I see a lot of people avoiding the more magical parts of the cloud to prevent vendor lock-in, but they underestimate the cost of developing solutions in-house and the amount of implicit assumptions that come with running on a cloud platform in the first place. Kubernetes provides a consistent interface on this cloud magic, so while you might be relying on it, at least it's technically standardized.

Conclusion

I found my tiny excursion on Kubernetes quite enjoyable. Obviously, this is a toy problem, so any issues that might surface from large-scale use isn't apparent to me. I'm also running this on a cloud platform, so the backend administration might as well be made of bees and I wouldn't notice. As a user of the system however, I'm starting to realize why so many people want new ways to boot it.

As a side note, I'm currently looking for work. If you need a Go compiler/runtime engineer who's managed to know distributed systems through osmosis, hit me up on the sidebar. If you're curious about my skills, you can check out my CV.

Next article

Previous article

Links