Friday, 24 August 2018

Long Live The Desktop Computer

Eight years ago I bought a Dell Studio XPS 8100 desktop for a home computer at a moderate price (NZD 3,100). I've just replaced a failing 1TB hard drive with a 500GB SSD, but other than that I've done no upgrades. What's interesting to me is that it's still a perfectly good machine: quad-core i7, 12GB RAM, NVIDIA GPU with 2GB VRAM. Everything I do, this machine could still do well, including software development for work. I guess if I wanted to play the latest AAA game titles or use a 4K monitor on it I'd be unhappy, but I can't think of anything else I'd even consider doing that would be a problem, and even those issues could be addressed by upgrading the video card. If this machine doesn't fail catastrophically I can see us continuing to use it for many more years. (I run Linux on it; the situation might be different if it were Windows.)

This is interesting because up until 2010 I'd been in the habit of upgrading computers at least every five years because they would improve dramatically over that time in ways that mattered to me. That stopped happening. It hasn't entirely stopped for everyone — Mozilla developers are getting new desktops with double-digit numbers of cores to speed up Firefox builds — but I run my heavy-duty workloads in the cloud now, because really big machines aren't efficiently utilized by a single developer. I guess the economics of utilization and colocation will make cloud-based heavy lifting (not necessarily public clouds) increasingly prevalent over time.

One of the implications is that declining desktop sales don't necessarily mean declining desktop usage. I think they must at least partly reflect longer upgrade cycles.

Another implication is that component reliability for desktops is becoming more important. It doesn't really matter if parts wear out after five years, if you're going to replace the whole machine before then anyway. If the expected lifespan of a machine is fifteen years, it's worth buying more reliable parts.

Another implication is that longevity bottlenecks might shift to relatively minor features, like what types of USB ports your machine has. I guess some of this can be alleviated by upgrades and dongles, but it's worth thinking about.

Friday, 17 August 2018

ASAN And LSAN Work In rr

AddressSanitizer has worked in rr for a while. I just found that LeakSanitizer wasn't working and landed a fix for that. This means you can record an ASAN build, and if there's an ASAN error or LSAN finds a leak, you can replay it in rr knowing the exact addresses of the data that leaked — along with the usual rr goodness of reverse execution, watchpoints, etc. Well, hopefully. Report an issue if you find more problems.
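For concreteness, the workflow looks something like the sketch below: a trivially leaky program built with ASAN (which enables LSAN by default on Linux), recorded and replayed under rr. This is an illustrative example rather than anything from the fix itself, and the exact flags depend on your toolchain.

```c
/* leak.c: a deliberately leaky program for exercising LSAN under rr
 * (illustrative sketch only).
 *
 * Build with ASAN; LeakSanitizer runs at exit by default on Linux:
 *   clang -g -fsanitize=address leak.c -o leak
 * Record and replay under rr:
 *   rr record ./leak     # LSAN reports the leak when the recording exits
 *   rr replay            # debug the recording; the leaked addresses match
 */
#include <stdlib.h>
#include <string.h>

int main(void) {
  char *p = malloc(64);   /* never freed: LSAN should report this block */
  memset(p, 'x', 64);
  p = NULL;               /* drop the only reference so it's a direct leak */
  return 0;
}
```

Because the replay is deterministic, the addresses LSAN prints at record time are the same addresses you can set watchpoints on during replay.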

Interestingly, LSAN doesn't work under gdb, but it does work under rr! LSAN uses the ptrace() API to examine threads when it looks for leaks, and it can't ptrace a thread that gdb is already ptracing (the ptrace design deeply relies on there being only one ptracer per thread). rr uses ptrace too, but when one rr tracee thread tries to ptrace another rr tracee thread, rr emulates the ptrace calls so that they work as if rr wasn't present.

Tuesday, 14 August 2018

Diagnosing A Weak Memory Ordering Bug

For the first time in my life I tracked a real bug's root cause to incorrect usage of weak memory orderings. Until now weak memory bugs were something I knew about but had subconsciously felt were only relevant to wizards coding on big iron, partly because until recently I've spent most of my career using desktop x86 machines.

Under heavy load a Pernosco service would assert in Rust's std::thread::Thread::unpark() with the error "inconsistent state in unpark". Inspecting the code led to the disturbing conclusion that the only way to trigger this assertion was memory corruption; the value of self.inner.state should always be between 0 and 2 inclusive, and if so then we shouldn't be able to reach the panic. The problem was nondeterministic but I was able to extract a test workload that reproduced the bug every few minutes. I tried recording it in rr chaos mode but was unable to reproduce it there (which is not surprising in hindsight since rr imposes sequential consistency).

With a custom panic handler I was able to suspend the process in the panic handler and attach gdb to inspect the state. Everything looked fine; in particular the value of self.inner.state was PARKED so we should not have reached the panic. I disassembled unpark() and decided I'd like to see the values of registers in unpark() to try to determine why we took the panic path, in particular the value of self.inner (a pointer) loaded into RCX and the value of self.inner.state loaded into RAX. Calling into the panic handler wiped those registers, so I manually edited the binary to replace the first instruction of the panic handler with UD2 to trigger an immediate core-dump before registers were modified.

The core-dump showed that RCX pointed to some random memory and was not equal to self.inner, even though we had clearly just loaded it from there! The value of state in RAX was loaded correctly via RCX, but was garbage because we were loading from the wrong address. At this point I formed the theory that the issue was a low-level data race, possibly involving relaxed memory orderings — particularly because the call to unpark() came from the Crossbeam implementation of Michael-Scott lock-free queues. I inspected the code and didn't see an obvious memory ordering bug, but I also looked at the commit log for Crossbeam and found that a couple of memory ordering bugs had been fixed a long time ago; we were stuck on version 0.2 while the released version is 0.4. Upgrading Crossbeam indeed fixed our bug.
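I won't reproduce the Crossbeam code, but one common shape of this class of bug is a publication race: a pointer is published with a relaxed store and consumed with a relaxed load, so the consumer can reach memory whose initializing writes it is not guaranteed to see. Here's a hedged sketch of that pattern in C11 atomics (my illustration, not the actual bug); on x86 it will almost never misbehave, but on weakly-ordered hardware it can.

```c
/* relaxed_publish.c: sketch of a relaxed-ordering publication race.
 * Build: cc -std=c11 -pthread relaxed_publish.c
 */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>

struct node { int state; };

static _Atomic(struct node *) shared;

static void *publisher(void *arg) {
  (void)arg;
  struct node *n = malloc(sizeof *n);
  n->state = 1;                                   /* plain (non-atomic) store */
  /* BUG: a relaxed store publishes the pointer without ordering the
   * n->state write before it; it should be memory_order_release. */
  atomic_store_explicit(&shared, n, memory_order_relaxed);
  return NULL;
}

static void *consumer(void *arg) {
  (void)arg;
  struct node *n;
  while (!(n = atomic_load_explicit(&shared, memory_order_relaxed)))
    ;                                             /* spin until published */
  /* Without release/acquire pairing there is no happens-before edge, so
   * this read may observe uninitialized memory on weakly-ordered CPUs. */
  printf("state = %d\n", n->state);
  return NULL;
}

int main(void) {
  pthread_t p, c;
  pthread_create(&c, NULL, consumer, NULL);
  pthread_create(&p, NULL, publisher, NULL);
  pthread_join(p, NULL);
  pthread_join(c, NULL);
  return 0;
}
```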

Observation #1: stick to sequential consistency unless you really need the performance edge of weaker orderings.

Observation #2: stick to sequential consistency unless you are really, really smart and have really, really smart people checking your work.

Observation #3: it would be really great to have user-friendly tools to verify the correctness of unsafe, weak-memory-dependent code like Crossbeam's.

Observation #4: we need a better way of detecting when dependent crates have known subtle correctness bugs like this (security bugs too). It would be cool if the crates.io registry knew about deprecated crate versions and cargo build warned about them.

Monday, 13 August 2018

The Parallel Stream Multiplexing Problem

Imagine we have a client and a server. The client wants to create logical connections to the server (think of them as "queries"); the client sends a small amount of data when it opens a connection, then the server sends a sequence of response messages and closes the connection. The responses must be delivered in-order, but the order of responses in different connections is irrelevant. It's important to minimize the start-to-finish latency of connections, and the latency between the server generating a response and the client receiving it. There could be hundreds of connections opened per second and some connections produce thousands of response messages. The server uses many threads; a connection's responses are generated by a specific server thread. The client may be single-threaded or use many threads; in the latter case a connection's responses are received by a specific client thread. What's a good way to implement this when both client and server are running in the same OS instance? What if they're communicating over a network?

This problem seems quite common: the network case closely resembles a Web browser fetching resources from a single server via HTTP. The system I'm currently working on contains an instance of this internally, and communication between the Web front end and the server also looks like this. Yet even though the problem is common, as far as I know it's not obvious or well-known what the best solutions are.

A standard way to handle this would be to multiplex the logical connections into a single transport. In the local case, we could use a pair of OS pipes as the transport, a client-to-server pipe to send requests and a server-to-client pipe to return responses. The client allocates connection IDs and the server attaches connection IDs to response messages. Short connections can be very efficient: a write syscall to open a connection, a write syscall to send a response, maybe another write syscall to send a close message, and corresponding read syscalls. One possible problem is server write contention: multiple threads sending responses must make sure the messages are written atomically. In Linux this happens "for free" if your messages are all smaller than PIPE_BUF (4096), but if they aren't you have to do something more complicated, the simplest being to hold a lock while writing to the pipe, which could become a bottleneck for very parallel servers. There is a similar problem with client read contention, which is mixed up with the question of how you dispatch received responses to the thread reading from a connection.
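To make the framing and the write-contention issue concrete, here's an illustrative sketch (my own, not from any particular implementation): each response is prefixed with a connection ID and a length, and a mutex keeps frames from different server threads from interleaving. A single write() of at most PIPE_BUF bytes is atomic on Linux, so packing a small frame into one buffer and one write() could skip the lock; this sketch uses two writes, so it always takes it.

```c
/* mux_send.c: sketch of multiplexing responses from many server threads
 * onto a single response pipe. Frame format:
 *   [u32 connection id][u32 payload length][payload bytes]
 */
#include <pthread.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

static pthread_mutex_t write_lock = PTHREAD_MUTEX_INITIALIZER;

int send_response(int pipe_fd, uint32_t conn_id,
                  const void *payload, uint32_t len) {
  uint8_t header[8];
  memcpy(header, &conn_id, sizeof conn_id);
  memcpy(header + 4, &len, sizeof len);

  /* Hold the lock so the header and payload written by different server
   * threads never interleave on the shared pipe. */
  pthread_mutex_lock(&write_lock);
  ssize_t r1 = write(pipe_fd, header, sizeof header);
  ssize_t r2 = (r1 == (ssize_t)sizeof header) ? write(pipe_fd, payload, len) : -1;
  pthread_mutex_unlock(&write_lock);

  return (r1 == (ssize_t)sizeof header && r2 == (ssize_t)len) ? 0 : -1;
}

int main(void) {
  int fds[2];
  if (pipe(fds) < 0) return 1;
  const char msg[] = "hello";
  if (send_response(fds[1], 42, msg, sizeof msg) != 0) return 1;
  uint8_t buf[8 + sizeof msg];
  read(fds[0], buf, sizeof buf);   /* a real client would parse the frame */
  return 0;
}
```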

A better local approach might be for the client to use an AF_UNIX socket to send requests to the server, passing with each request message a file descriptor for a fresh pipe that the server should use to respond to the client. It requires a few more syscalls, but client threads need no user-space synchronization, and server threads need none after a request has been dispatched to a server thread. A pool of pipes in the client might help.
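The fd-passing step is the standard SCM_RIGHTS dance; here's a sketch of the client side (illustrative boilerplate, not code from this system): create a pipe per request, send the write end to the server alongside the request bytes, keep the read end, and read that connection's responses from it.

```c
/* send_request_with_pipe.c: sketch of passing a per-request response pipe
 * to the server over an AF_UNIX socket using SCM_RIGHTS. */
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Send `req`/`req_len` plus the write end of a fresh pipe; return the read
 * end, from which the caller will read this connection's responses. */
int send_request_with_pipe(int unix_sock, const void *req, size_t req_len) {
  int pipefd[2];
  if (pipe(pipefd) < 0)
    return -1;

  struct iovec iov = { .iov_base = (void *)req, .iov_len = req_len };
  char cbuf[CMSG_SPACE(sizeof(int))];
  memset(cbuf, 0, sizeof cbuf);

  struct msghdr msg = {0};
  msg.msg_iov = &iov;
  msg.msg_iovlen = 1;
  msg.msg_control = cbuf;
  msg.msg_controllen = sizeof cbuf;

  struct cmsghdr *cmsg = CMSG_FIRSTHDR(&msg);
  cmsg->cmsg_level = SOL_SOCKET;
  cmsg->cmsg_type = SCM_RIGHTS;
  cmsg->cmsg_len = CMSG_LEN(sizeof(int));
  memcpy(CMSG_DATA(cmsg), &pipefd[1], sizeof(int));   /* pass the write end */

  if (sendmsg(unix_sock, &msg, 0) < 0) {
    close(pipefd[0]);
    close(pipefd[1]);
    return -1;
  }
  close(pipefd[1]);   /* the server now owns its copy of the write end */
  return pipefd[0];   /* the client reads responses from here */
}
```

The server side does the matching recvmsg(), pulls the pipe fd out of the SCM_RIGHTS control message, and hands it to whichever thread handles that request.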

The network case is harder. A naive approach is to multiplex the logical connections over a TCP stream. This suffers from head-of-line blocking: a lost packet can delay delivery of all messages while it is retransmitted, because all messages across all connections must be received in the order they were sent. You can use UDP to avoid that problem, but then you need encryption, retransmits, congestion control, etc., so you probably want to use QUIC or something similar.

The Web client case is interesting. You can multiplex over a WebSocket much like a TCP stream, with the same disadvantages. You could issue an HTTP request for each logical connection, but this would limit the number of open connections to some unknown maximum, and could have even worse performance than the WebSocket if the browser and server don't negotiate QUIC + HTTP2. A good solution might be to multiplex the connections into an RTCDataChannel in non-ordered mode. This is probably quite simple to implement in the client, but fairly complex to implement in the server because the RTCDataChannel protocol is complicated (for good reasons AFAIK).

This multiplexing problem seems quite common, and its solutions interesting. Maybe there are known best practices or libraries for this, but I haven't found them yet.