Friday, 27 March 2020

What If C++ Abandoned Backward Compatibility?

Some C++ luminaries have submitted an intriguing paper to the C++ standards committee. The paper presents an ambitious vision to evolve C++ in the direction of safety and simplicity. To achieve this, the authors believe it is worthwhile to give up backwards source and binary compatibility, and focus on reducing the cost of migration (e.g. by investing in tool support), while accepting that the cost of each migration will be nonzero. They're also willing to give up the standard linking model and require whole-toolchain upgrades for each new version of C++.

I think this paper reveals a split in the C++ community. The proposal makes very good sense for organizations like Google with large legacy C++ codebases that they intend to keep investing in heavily for a long time. (I would include Mozilla in that set of organizations.) The long-term gains from improving C++ incompatibly will eventually outweigh the ongoing migration costs, especially because those organizations are already adept at large-scale systematic changes to their code (e.g. thanks to a gargantuan monorepo, massive-scale static and dynamic checking, and risk-mitigating deployment systems). Lots of existing C++ software really needs those safety improvements.

I think it also makes sense for C++ developers whose projects are relatively short-lived, e.g. some games. They don't need to worry about migration costs and will reap the benefits of C++ improvement.

For mature, long-lived projects that are poorly resourced, such as rr, it doesn't feel like a good fit. I don't foresee adding a lot of new code to rr, so we won't reap much benefit from improvements in C++. On the other hand, it would hurt to pay an ongoing migration tax. (Of course rr already needs ongoing maintenance due to kernel changes etc, but every extra bit hurts.)

I wonder what compatibility properties toolchains would have if this proposal carries the day. I suspect the intent is that the latest version of a compiler implements only the latest version of C++, but it's not clear. An aggressive policy like that would increase the pain for projects like rr (and, I guess, every other C++ project packaged by Linux distros), because we'd be dragged along more relentlessly.

It'll be interesting to see how this goes. I would not be completely surprised if it ends with a fork in the language.

Wednesday, 11 March 2020

Debugging Gdb Using rr: Ptrace Emulation

Someone tried using rr to debug gdb and reported an rr issue because it didn't work. With some effort I was able to fix a couple of bugs and get it working for simple cases. Using improved debuggers to improve debuggers feels good!

The main problem when running gdb under rr is the need to emulate ptrace. We had the same problem when we wanted to debug rr replay under rr. In Linux a process can only have a single ptracer. rr needs to ptrace all the processes it's recording — in this case gdb and the process(es) it's debugging. Gdb needs to ptrace the process(es) it's debugging, but they can't be ptraced by both gdb and rr. rr circumvents the problem by emulating ptrace: gdb doesn't really ptrace its debuggees, as far as the kernel is concerned, but instead rr emulates gdb's ptrace calls. (I think in principle any ptrace user, e.g. gdb or strace, could support nested ptracing in this way, although it's a lot of work so I'm not surprised they don't.)
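
Here's a minimal standalone sketch of the single-ptracer rule (just an illustration of the kernel behaviour, not rr code): once a task has a ptracer, any further PTRACE_ATTACH to it fails with EPERM.

/* The child volunteers for tracing via PTRACE_TRACEME, making its parent
   the ptracer; the parent's subsequent PTRACE_ATTACH is then refused
   because the task already has a ptracer. */
#include <errno.h>
#include <signal.h>
#include <stdio.h>
#include <string.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  pid_t child = fork();
  if (child == 0) {
    ptrace(PTRACE_TRACEME, 0, 0, 0); /* parent is now this task's ptracer */
    raise(SIGSTOP);                  /* stop so the parent can observe us */
    _exit(0);
  }
  waitpid(child, 0, 0);              /* wait for the stop */
  /* The task is already being traced, so another attach is refused. */
  if (ptrace(PTRACE_ATTACH, child, 0, 0) == -1)
    printf("PTRACE_ATTACH failed: %s\n", strerror(errno)); /* EPERM */
  kill(child, SIGKILL);
  waitpid(child, 0, 0);
  return 0;
}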

Most of the ptrace machinery that gdb needs already worked in rr, and we have quite a few ptrace tests to prove it. All I had to do to get gdb working for simple cases was fix a couple of corner-case bugs. rr has to synthesize the SIGCHLD signals sent to the emulated ptracer (i.e. gdb); those signals weren't interacting properly with sigsuspend. For some reason gdb spawns a ptraced process, then kills it with SIGKILL and waits for it to exit. rr has to emulate that wait, because in Linux the regular "wait" syscalls can only wait for a non-child process if the waiter is ptracing it, and under rr gdb is not really the ptracer, so the native wait doesn't work. We already had logic for that, but it wasn't working for process exits triggered by signals, so I had to rework it, which was actually pretty hard (see the rr issue for the horrible details).
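
As a tiny standalone illustration of that restriction (not rr or gdb code): waitpid only succeeds for your own children or for tasks you are ptracing, so waiting on an unrelated process fails with ECHILD.

#include <errno.h>
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

int main() {
  /* PID 1 stands in for "some process that is neither our child nor our
     ptrace tracee"; the kernel refuses to let us wait for it. */
  if (waitpid(1, NULL, WNOHANG) == -1)
    printf("waitpid(1) failed: %s\n", strerror(errno)); /* ECHILD */
  return 0;
}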

After I got gdb working I discovered it loads symbols very slowly under rr. Every time gdb demangles a symbol, it installs (and later removes) a SIGSEGV handler to catch crashes in the demangler. This is very sad and does not inspire trust in the quality of the demangling code, especially if some of those crashes involve potentially corrupting memory writes. It is slow under rr because rr's default syscall handling path makes cheap syscalls like rt_sigaction a lot more expensive. We have the "syscall buffering" fast path for the most frequent syscalls, but supporting rt_sigaction on that path would be rather complicated, and I don't think it's worth doing at the moment, given that you can work around the problem with maint set catch-demangler-crashes off. I suspect that (especially with KPTI) doing 2-3 syscalls per demangled symbol (sigprocmask is also called) hurts gdb performance even without rr, so ideally someone would fix that: fix the demangling code (possibly rewriting it in a safe language like Rust), batch symbol demangling so the handler isn't installed and removed thousands of times, or move demangling into a child process and talk to it asynchronously over IPC, which would be safer too!
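
To make that cost pattern concrete, here's a rough standalone sketch (not gdb's actual code; demangle_one and on_segv are hypothetical stand-ins) contrasting per-symbol handler churn with installing the handler once around a whole batch:

#include <signal.h>
#include <string.h>

static void on_segv(int sig) { (void)sig; /* recover from a demangler crash somehow */ }

/* Hypothetical stand-in for the real demangler. */
static void demangle_one(const char* sym) { (void)sym; }

/* Per-symbol handler churn: 2 rt_sigaction syscalls per symbol. */
static void demangle_guarded(const char* sym) {
  struct sigaction sa, old;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = on_segv;
  sigaction(SIGSEGV, &sa, &old);  /* install */
  demangle_one(sym);
  sigaction(SIGSEGV, &old, NULL); /* restore */
}

/* Batched alternative: 2 rt_sigaction syscalls for the whole symbol table. */
static void demangle_all(const char* const* syms, int n) {
  struct sigaction sa, old;
  memset(&sa, 0, sizeof(sa));
  sa.sa_handler = on_segv;
  sigaction(SIGSEGV, &sa, &old);
  for (int i = 0; i < n; i++) demangle_one(syms[i]);
  sigaction(SIGSEGV, &old, NULL);
}

int main() {
  const char* syms[] = { "_Z3foov", "_Z3barv" };
  demangle_guarded(syms[0]); /* two rt_sigaction calls for one symbol */
  demangle_all(syms, 2);     /* two rt_sigaction calls for the whole batch */
  return 0;
}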