Eyes Above The Waves

Robert O'Callahan. Christian. Repatriate Kiwi. Hacker.

Monday 13 August 2018

The Parallel Stream Multiplexing Problem

Imagine we have a client and a server. The client wants to create logical connections to the server (think of them as "queries"); the client sends a small amount of data when it opens a connection, then the server sends a sequence of response messages and closes the connection. The responses must be delivered in-order, but the order of responses in different connections is irrelevant. It's important to minimize the start-to-finish latency of connections, and the latency between the server generating a response and the client receiving it. There could be hundreds of connections opened per second and some connections produce thousands of response messages. The server uses many threads; a connection's responses are generated by a specific server thread. The client may be single-threaded or use many threads; in the latter case a connection's responses are received by a specific client thread. What's a good way to implement this when both client and server are running in the same OS instance? What if they're communicating over a network?

This problem seems quite common: the network case closely resembles a Web browser fetching resources from a single server via HTTP. The system I'm currently working on contains an instance of this internally, and communication between the Web front end and the server also looks like this. Yet even though the problem is common, as far as I know it's not obvious or well-known what the best solutions are.

A standard way to handle this would be to multiplex the logical connections into a single transport. In the local case, we could use a pair of OS pipes as the transport, a client-to-server pipe to send requests and a server-to-client pipe to return responses. The client allocates connection IDs and the server attaches connection IDs to response messages. Short connections can be very efficient: a write syscall to open a connection, a write syscall to send a response, maybe another write syscall to send a close message, and corresponding read syscalls. One possible problem is server write contention: multiple threads sending responses must make sure the messages are written atomically. In Linux this happens "for free" if your messages are all smaller than PIPE_BUF (4096), but if they aren't you have to do something more complicated, the simplest being to hold a lock while writing to the pipe, which could become a bottleneck for very parallel servers. There is a similar problem with client read contention, which is mixed up with the question of how you dispatch received responses to the thread reading from a connection.

A better local approach might be for the client to use an AF_UNIX socket to send requests to the server, and with each request message pass a file descriptor for a fresh pipe that the server should use to respond to the client. It requires a few more syscalls but client threads require no user-space synchronization, and server threads require no synchronization after the dispatch of a request to a server thread. A pool of pipes in the client might help.

The network case is harder. A naive approach is to multiplex the logical connections over a TCP stream. This suffers from head-of-line-blocking: a lost packet can cause delivery of all messages to be blocked while the packet is retransmitted, because all messages across all connections must be received in the order they were sent. You can use UDP to avoid that problem, but you need encryption, retransmits, congestion control, etc so you probably want to use QUIC or something similar.

The Web client case is interesting. You can multiplex over a WebSocket much like a TCP stream, with the same disadvantages. You could issue an HTTP request for each logical connection, but this would limit the number of open connections to some unknown maximum, and could have even worse performance than the Websocket if the browser and server don't negotiate QUIC + HTTP2. A good solution might be to multiplex the connections into a RTCDataChannel in non-ordered mode. This is probably quite simple to implement in the client, but fairly complex to implement in the server because the RTCDataChannel protocol is complicated (for good reasons AFAIK).

This multiplexing problem seems quite common, and its solutions interesting. Maybe there are known best practices or libraries for this, but I haven't found them yet.