Morsing's Blog

18 March 2018

TCP is an underspecified two-node consensus algorithm and what that means for your proxies


I recently found myself dealing with TCP load balancing for a project and I've come to think that generic TCP proxies can't be implemented without substantial pain and how it makes TLS terminating proxies a bad idea.

That might seem like a pretty bold statement right out of the gate, so let's drill down a bit.

TCP is not a stream of bytes

When people talk about TCP, it's easy to fall into the trap of thinking of it as a connection, with a bi-directional stream of bytes. That is the abstraction that TCP provides, but it's not what TCP is. TCP is an agreement between 2 nodes to run a simple consensus algorithm. The data that is agreed on is (roughly) how much of what I have sent have you seen and how much have I seen of what you've sent. Since there are only 2 nodes, the algorithm is much simpler than what you would see in Raft or Paxos, but like a lot of consensus algorithms, it's based on nodes agreeing on what the current highest number is.

Throughout this post, I'll be using "connection" as a shorthand for this agreement, but keep in mind that we're talking about 2 nodes communicating over a lossy connection, not a property of the network itself.

Besides the streams being sent, there's another important bit of information: the state of the connection itself. Annoyingly, some of this information is not transmitted over the network. The state of the connection is based largely on heuristics of the individual TCP implementations and to make matters worse, we allow programs to change this behavior depending on the application protocol. If you have a box in the middle of the network that is able to read the entirety of a TCP session, it would not be able to guess at what the state of a TCP connection is. It would end up in the position of having to guess at what is meant by a certain series of TCP/IP packets.

So, what impact does this have for TCP proxies? Let's set up a simple hypothetical with 3 nodes, a client, a proxy and a server. Whenever the client establishes a TCP connection to the proxy, it in turn establishes a TCP connection to the server. Whatever the client sends to the proxy gets forwarded to the server and whatever the server sends to the proxy gets forwarded to the client.

On the application layer, The client sends a request that takes the server a long time to reply to. Since the client expects that the request will take a long time, it enables TCP keepalive to periodically inform the server that it is still alive and able to receive the response.

Since the proxy will be the recipient of the keepalive packets, the server will not see them. It might think that the client has gone away, stop processing the request and close the TCP connection. We can have the proxy guess at what timeout might be appropriate, but those values are very protocol-specific and we end up either having the proxy terminate still viable sessions or taking up resources on the proxy machine.

Having a proxy in the middle shields the server from the specifics of the clients TCP behavior, even in cases where it would want to know it. The usual case where people might want that information is the IP address and there are ways of having it be transmitted, but TCP is a large spec. There are a myriad of features in TCP like TCP Fast Open or alternative congestion control or the client might be using archaic features and the proxy will either negate their advantages, or outright break the connection.

This serves as an example of the end-to-end principle in action and we don't have to stray to far from TCP to see more examples of it. On the IP layer, datagrams can be split into multiple parts for when the underlying physical transport cannot support a packet of a given size. The idea was that IP datagrams would be split and then recombined by the routers in the middle of the network when the physical layer would support a packet of that size again. This turned out to be disastrous in practice. Hosts would often get partial datagrams that would never be able to recombine and they would also have no way to tell the host on the other end that a packet was lost (the packet acknowledgement is in the TCP layer). Because of this issue and many more, we have largely scrapped the idea of IP fragmentation and came up with better solutions.

What can be done about it?

If you're building an application that does use TCP, you need to be prepared for the possibility that your application will end up being proxied through a host that might not particularly care for whatever TCP tricks you're doing or what the state of the protocol is at any given moment. You can guard against these issues by constructing your protocol in a resilient manner. Note that while these safeguards will help with proxies, they're in general a good idea, since they will also guard against lower-level network issues on the IP layer.

Application-level pings

Since you can't rely on the proxy to pass through the behavior of the TCP connection, techniques like keepalive packets can no longer be used to ensure liveness of a connection. Instead, you'll have to implement a ping on the application level. Since a proxy must pass through the data if it is to be useful in any way, these pings will have to poke through the proxy.

End-of-file is not the end of things

A lot of TCP proxies will turn connection errors into a clean termination. If you're using the closing of the TCP connection as a way to signal no more data (looking at you, HTTP 1.0), you cannot determine whether you have read the entire response, or there might be more data if the operation was retried.

If you're in the position of having to implement a proxy, for load balancing or inspection reasons, there are a couple of things you can do to make it less invasive.

Implement the protocol

The only way a TCP proxy might know when it is safe to terminate a connection, is when it knows the protocol state. A great example of this is an HTTP load balancer. It can see if there is an outstanding request and keep the connection open. More importantly, if the proxy needs to go down for maintenance, it can terminate connections cleanly and let the client re-establish.

Be a NAT

If the mismatch between a proxy's TCP implementation and the server's TCP implementation is an issue, another solution is for the proxy to not even implement TCP at all. Instead of acting as a TCP proxy, act like a smart IP forwarder. Since the proxy is not constructing packets, only modifying and forwarding them, the mismatch in implementation goes away. Examples of software that does this are Linux Virtual Server or Google's Maglev.

TLS proxies

A special case of the TCP proxy is the TLS proxy. They take a TCP connection containing a TLS session and turn them into unencrypted TCP within a trusted network. These proxies are useful because they offload the responsibilities of implementing complex cryptography code away from the application servers. Additionally, they are useful for key management, since your backends no longer have to have the keys stored on the servers.

But since they're a TCP proxy, they have the same issues as any other TCP proxy. Additionally, they cannot function as a NAT, since they have to respond with their own data to negotiate the TLS handshake.

So, we're stuck in a dilemma. Either use a TLS proxy, gain the ease of cryptographic deployment and lose nuances in TCP handling, or push TLS termination into the servers, creating key management issues and increased responsibility for cryptography code.


I have not yet figured out what can be done to solve the specific TLS issue. Using HSMs or something like CloudFlare's Keyless SSL and terminating at the edges might help with key management, but it still pushes large burden onto your application servers. Since the edges have to terminate, you lose the ability to route based on the information inside the TLS session. SNI would usually allow you to route to different backends and have key management be in one location, but that is no longer possible. This poses a large problem for cloud providers who want to run their TLS terminators as multi-tenant machines.

I think the issue also serves to illustrate a layering violation in the design of TLS itself. While TLS uses the TCP protocol to simplify its key negotiation handshake, the reliable delivery of TCP is completely orthogonal to the goals of privacy and integrity that TLS provides. If we pull the cryptography into TCP, we might be able to simplify things further. If a given TCP segment was always guaranteed to contain enough data to decrypt and authenticate it, then a pass-through NAT-like TLS proxy becomes trivial, without having to do significant connection tracking or creating acknowledgement packets on the end servers behalf. QUIC is an example of this kind of construction, although they chose to encrypt the transport layer for different reasons related to middleboxes.

For now, I'm going to grit my teeth, implement the protocols fully in my TLS terminators and hope that the mismatch never gets too bad, but there is an interesting issue to be solved here. If you meet me in person, buy me a beer and I'll tell you all about my plans to put a TLS terminator inside a hypervisor.

Related articles