Friday, 30 June 2017

Patch On Linux Kernel Stable Branches Breaks rr

A change in 4.12-rc5 breaks rr. We're trying to get it fixed before 4.12 is released, and I think that will be OK. Unfortunately that change has already been backported to the 3.18.57, 4.4.72, 4.9.32 and 4.11.5 stable releases :-( (all released on June 14, and presumably arriving in distros a bit later). Obviously we'll try to get the 4.12 fix backported to those branches too, but that will take a little while.

The symptoms are that long, complex replays fail with "overshot target ticks=... by N", where N is generally a pretty large number (> 1000). If you look in the trace file, the value of N will usually be similar to the difference between the target ticks and the previous ticks value for that task; in other words, we tried to stop after N ticks but actually stopped after about 2*N ticks. Unfortunately, rr's own tests don't seem to be affected, so they didn't catch the regression.

I'm not sure if there's a reasonable workaround we can use in rr, or if there is one, whether it's worth effort to deploy. That may depend on how the conversation with upstream goes.

Friday, 23 June 2017

Rising Tolerance For Static Analysis False Positives?

When I was a young graduate student working on static analysis tools, conventional wisdom was that a static analysis tool needed to have a low false-positive rate or no-one would use it. Minimizing false-positives was a large chunk of the effort in every static analysis project.

It seems that times have changed and there are now communities of developers willing to tolerate high false positive rates, at least in some domains. For example:

"It will also miss really obvious bugs apparently at random, and flag non-bugs equally randomly. The price of the tool is astronomical, but it's still worthwhile if it catches bugs that human developers miss."
Indeed, I've noticed people in various projects ploughing through reams of false positives in case there are any real issues.

I'm not sure what changed here. Perhaps people have just become more appreciative of the value of subtle bugs caught by static analysis tools. Maybe it's a good time to be a developer of such tools.

Monday, 19 June 2017

Lazy Religion Tropes In Mass Media

I'm enjoying season two of Orphan Black. I hope Tatiana Maslany got paid a separate salary for each role, because she deserves every penny. Unfortunately the show falls down where so many others do: every religiously-identified person is either a chump or a psychopath. Disappointing, but expected; with few exceptions, the Christians portrayed in mass media are like none I've ever known.

I've just finished China Miéville's The Last Days Of New Paris. It's a bit over-Miévillian for me: one hundred and fifty pages of rampant imagination, but too many clever linguistic obscurities for maximal reading enjoyment. It hosts a very crisp example of the "hell exists but heaven does not" trope. Miéville frames it explicitly:

Everyone knew it helped to have a priest perform certain absurdities of his trade if it was demons you had to fight. "Why?" Thibaut asked Élise when they left again. "Why do you think it does work? It's not as if any of this stuff is true."
(Miéville's Parisian demons are explicitly of Hell.) Isn't this ... unfair? Yet it's a very common trope. Likewise I remain a huge fan of Buffy The Vampire Slayer, but it feels like a raw deal that the heroes wield crucifixes but the few ostensibly "religious" characters are villains.

I know, I know; the only way to fix this is to write my own novels and hit TV show. I'll get onto it.

Thursday, 15 June 2017

Is The x86 Architecture Sustainable?

My earlier post pointed to one strange little corner of the x86 instruction set. There is much, much more where that came from. My latest copy of Intel's "Combined Volumes" Software Developer’s Manual runs to 4,684 pages, and it doesn't include new features they've announced such as CET. There's a good paper here describing some of the new features and the increasing complexity they bring. It also touches on the "feature interaction problem"; i.e., the complexity of the ISA grows superlinearly as Intel adds more features, because each new feature can potentially interact with every existing one. One aspect it doesn't touch on is how many legacy compatibility features x86 processors carry, which continue to interact with new features. One of the key risks for Intel and its ecosystem is that many of the new features may end up being little-used — just like many of the legacy compatibility features — so their complexity ends up being "dead weight".

Is there a problem here for anyone other than Intel? Yes. Unnecessarily complex hardware costs us all in the end because those CPUs won't be as efficient, cheap, reliable or secure as they otherwise could be — especially as we seem to be hitting a process wall so that we can't rely on increasing transistor density to bail us out. Another issue is that complex ISAs make ISA-dependent tools — assemblers, disassemblers, debuggers, binary instrumentation tools, and even hypervisors and kernels — more difficult to build. Super-complex ISAs encourage software to depend on obscure features unnecessarily, which builds a competitive moat for Intel but makes competition at the CPU level more difficult.

The obvious technically-appealing approach — starting over with a clean-sheet architecture — by itself doesn't solve these problems in the long run (or the short run!). We have to acknowledge that architectures will inevitably grow features in response to market demand (or perceived value) and that some of those features will turn into legacy baggage. The question is, can we adjust the ecosystem to make it possible to drop legacy baggage, and is it worth doing so?

I'm not an ARM expert, but the phone and embedded markets already seem to have moved in this direction. Phones use a plethora of ARM derivatives with different architectural features (though in spite of that, ARM has accumulated its own share of weirdnesses!). Can we do it in the cloud/desktop/laptop spaces? I don't see why not. Even software that reaches down to a relatively low level, like a Web browser (with a complex tiered JIT, etc.), doesn't rely on many complex legacy architectural features. Porting Firefox to a brand-new processor architecture is a lot of work — e.g., implementing new JIT backends, FFI glue, and translating a pile of handwritten assembler (e.g. in codecs) to the new architecture. However, porting Firefox to an x86 or ARM variant minus unnecessary legacy baggage would be a lot easier. Such a variant might need kernel and bootloader changes, but that's certainly doable for Linux derivatives.

So, I think there's room in the market for architectures that drop legacy features and offer less-than-total binary compatibility with a popular architecture, whether or not they're based on an existing one. Perhaps the forthcoming ARM Snapdragon Windows laptops are a step in that direction.

Update Many commenters on Hacker News suggested that x86 complexity is a non-issue because decoding legacy instructions to uops requires insignificant die area. But this really misses the point: x86 complexity is not just about legacy instructions that can be decoded down to microcode. It's about architectural features like SGX, CET, MPX, TSX and VT, plus legacy features like segment registers, 286 call gates, virtual-8086 mode, and so on. It's about the processor state required to support those features, how those features all interact with each other, and how they increase the complexity of context switching, OS support, and so on.

New "rr pack" Command

I think there's huge potential to use rr for debugging cloud services. Apparently right now interactive debugging is mostly not used in the cloud, which makes sense — it's hard to identify the right process to debug, much less connect to it, and even if you could, stopping it for interactive analysis would likely interfere too much with your distributed system. However, with rr you could record any number of process executions without breaking your system, identify the failed runs after the fact, and debug them at your leisure.

Unfortunately there are a couple of problems making that difficult right now. One is that the largest cloud providers don't support the hardware performance counter rr needs. I'm excited to hear that Amazon has recently enabled some HW performance counters on dedicated hosts — hopefully they can be persuaded to add the retired-conditional-branch counter to their whitelist (and someone can fix the Xen PMU virtualization bug that breaks rr). Another problem is that rr's traces aren't easy to move from one machine to another. I've started addressing this problem by implementing a new rr command, rr pack.
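For the curious, here's a minimal sketch of how a tracer opens that counter on Linux via perf_event_open. The raw event encoding 0x5101c4 for BR_INST_RETIRED.CONDITIONAL is an assumption that holds for many recent Intel microarchitectures; rr selects the correct encoding per microarchitecture at startup, so treat this as an illustration rather than rr's actual code.

```c
// Sketch: open the retired-conditional-branches ("ticks") counter with
// perf_event_open. The raw config 0x5101c4 (BR_INST_RETIRED.CONDITIONAL,
// user-mode only) is an assumed encoding for recent Intel CPUs.
#include <linux/perf_event.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

static int open_ticks_counter(pid_t tid) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_RAW;
  attr.config = 0x5101c4;  // BR_INST_RETIRED.CONDITIONAL (assumed encoding)
  attr.exclude_kernel = 1; // count user-space branches only
  attr.exclude_guest = 1;
  // There is no glibc wrapper for perf_event_open, so call it directly.
  return syscall(SYS_perf_event_open, &attr, tid, /*cpu=*/-1,
                 /*group_fd=*/-1, /*flags=*/0);
}

int main(void) {
  int fd = open_ticks_counter(0); // 0 == this task
  if (fd < 0) {
    perror("perf_event_open"); // fails on VMs that don't expose the PMU
    return 1;
  }
  for (volatile int i = 0; i < 1000000; ++i) {
  } // generate some conditional branches
  uint64_t count;
  read(fd, &count, sizeof(count)); // default read_format: just the u64 count
  printf("retired conditional branches: %llu\n", (unsigned long long)count);
  return 0;
}
```

On a cloud VM, this is the call that fails (or counts nothing) when the hypervisor doesn't virtualize or whitelist the PMU.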

rr traces currently have two problems. One is that, to keep recording overhead low, on filesystems that don't support "reflink" file copies we sometimes hardlink files into the trace instead of copying them, and for system libraries we can't hardlink we just record a reference and assume the files won't change. This means traces are not fully self-contained in the latter case, and in the former case the recording can be invalidated if the files change. The other problem is that every time an mmap occurs we clone/link a new file into the trace, even if a previous mmap mapped the same file, because we have no fast way of telling whether the file has changed. This means traces appear to contain large numbers of large files, but many of those files are duplicates.

rr pack fixes both of those problems. You run it on a trace directory in place. It deduplicates trace files by computing a cryptographic hash (BLAKE2b, 256 bits) of each file and keeping only one file for any given hash. It identifies files needed by the trace that live outside the trace directory (hardlink targets and referenced system libraries) and copies them into the trace directory. It rewrites trace records (the mmaps file) to refer to the new files, so the trace format hasn't changed. You should then be able to copy the resulting trace around, and modify any files outside the trace, without breaking it. I tried pretty hard to ensure that an interrupted rr pack leaves the trace intact (using fsync and atomic rename); of course, an interrupted rr pack may not have fully packed the trace, so the operation should be repeated. Successful rr pack commands are idempotent.
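To make the two key mechanisms concrete, here's a minimal sketch in C, assuming libsodium for the hash (libsodium's crypto_generichash is BLAKE2b). The function names are hypothetical; rr's real implementation differs in detail.

```c
// Sketch of the two mechanisms described above: content-hashing a file with
// BLAKE2b-256, and atomically replacing a trace file so an interrupted pack
// can't corrupt the trace. Illustration only, not rr's actual code.
#include <fcntl.h>
#include <sodium.h>
#include <stdio.h>
#include <unistd.h>

// Hash a file's contents; writes 32 bytes (256 bits) into `hash`.
static int hash_file(const char* path, unsigned char hash[32]) {
  FILE* f = fopen(path, "rb");
  if (!f) return -1;
  crypto_generichash_state st;
  crypto_generichash_init(&st, NULL, 0, 32);
  unsigned char buf[65536];
  size_t n;
  while ((n = fread(buf, 1, sizeof(buf), f)) > 0) {
    crypto_generichash_update(&st, buf, n);
  }
  fclose(f);
  crypto_generichash_final(&st, hash, 32);
  return 0;
}

// Atomically replace `dest` with the data in `tmp`: fsync the temp file so
// its contents are durable, then rename over the destination. rename() is
// atomic, so a crash leaves either the old file or the new one, never a mix.
static int atomic_replace(const char* tmp, const char* dest) {
  int fd = open(tmp, O_RDONLY);
  if (fd < 0) return -1;
  if (fsync(fd) < 0) { close(fd); return -1; }
  close(fd);
  return rename(tmp, dest);
}
```

Deduplication then reduces to keeping a map from hash to canonical file name, and linking any later file with the same hash to the canonical copy.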

We haven't really experimented with trace portability yet so I can't say how easy it will be to just zip up a trace directory and replay it on a different computer. We know that currently replaying on a machine with different CPUID values is likely to fail, but we have a solution in the works for that — Kyle's patches to add ARCH_SET_CPUID to control "CPUID faulting" are in Linux kernel 4.12 and will let rr record and replay CPUID values.
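Here's a rough sketch of what that new kernel interface enables, assuming a 4.12+ kernel on Intel hardware with CPUID-faulting support. The signal handler's "emulation" is a stand-in for looking up recorded values; rr does this interception from the tracer side.

```c
// Sketch of CPUID faulting on Linux 4.12+ (x86-64). After
// arch_prctl(ARCH_SET_CPUID, 0), executing CPUID raises SIGSEGV instead of
// running, so recorded values can be substituted. Illustration only.
#define _GNU_SOURCE
#include <asm/prctl.h>
#include <signal.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <ucontext.h>
#include <unistd.h>

static void cpuid_trap(int sig, siginfo_t* si, void* ctx) {
  (void)sig; (void)si;
  ucontext_t* uc = ctx;
  // A real replayer would supply the recorded CPUID results here.
  uc->uc_mcontext.gregs[REG_RAX] = 0;
  uc->uc_mcontext.gregs[REG_RBX] = 0;
  uc->uc_mcontext.gregs[REG_RCX] = 0;
  uc->uc_mcontext.gregs[REG_RDX] = 0;
  uc->uc_mcontext.gregs[REG_RIP] += 2; // skip the 2-byte CPUID opcode (0F A2)
}

int main(void) {
  struct sigaction sa;
  sa.sa_sigaction = cpuid_trap;
  sa.sa_flags = SA_SIGINFO;
  sigemptyset(&sa.sa_mask);
  sigaction(SIGSEGV, &sa, NULL);
  // Turn CPUID faulting on; fails on older kernels or unsupported hardware.
  if (syscall(SYS_arch_prctl, ARCH_SET_CPUID, 0) < 0) {
    perror("ARCH_SET_CPUID");
    return 1;
  }
  unsigned int a, b, c, d;
  __asm__ volatile("cpuid" : "=a"(a), "=b"(b), "=c"(c), "=d"(d) : "a"(0));
  printf("intercepted CPUID leaf 0 -> %u %u %u %u\n", a, b, c, d);
  return 0;
}
```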

How I Found A 20-Year-Old Linux Kernel Bug

Recently I improved rr's tests to check that the kernel doesn't write more memory than we expect during system calls. We place syscall output buffers at the ends of pages, so that the end of the buffer is immediately followed by an unmapped page. If the kernel reads or writes past the end of the buffer, the system call fails with EFAULT (or something worse happens).
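The placement trick is simple; here's a minimal sketch (the helper name is hypothetical, not rr's actual test code):

```c
// Sketch of the buffer-placement trick described above: return a buffer
// whose last byte is the last byte of an accessible page, with the
// following page PROT_NONE, so any kernel access past the end of the
// buffer makes the syscall fail with EFAULT.
#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

static void* alloc_at_page_end(size_t size) {
  long page = sysconf(_SC_PAGESIZE);
  // Map two pages, then make the second one inaccessible.
  char* p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
  if (p == MAP_FAILED || mprotect(p + page, page, PROT_NONE) != 0) {
    return NULL;
  }
  return p + page - size; // buffer ends exactly at the protection boundary
}
```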

This found a few bugs in rr's assumptions about how much memory the kernel reads and writes. Interestingly, it also exposed a very old kernel bug: certain wireless ioctls are supposed to take a 32-byte memory parameter, but the kernel actually fails with EFAULT if fewer than 40 bytes are available. (The extra bytes are not modified.) I tried to be a good citizen by reporting the bug, and I'm pleased to say that it's actually being fixed!
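The post above doesn't name the exact ioctls, so the following repro sketch is my inference: SIOCGIWNAME is a wireless-extensions ioctl taking a 32-byte struct iwreq, and on 64-bit struct ifreq happens to be 40 bytes, which would be consistent with the kernel copying 8 bytes too many. It reuses alloc_at_page_end from the sketch above, and "wlan0" is a hypothetical interface name.

```c
// Hedged repro sketch for the bug described above, on an unfixed kernel.
#include <errno.h>
#include <linux/wireless.h> // struct iwreq (32 bytes on x86-64), SIOCGIWNAME
#include <stdio.h>
#include <string.h>
#include <sys/ioctl.h>
#include <sys/socket.h>

void* alloc_at_page_end(size_t size); // from the previous sketch

int main(void) {
  // 32-byte iwreq placed so its last byte is the last mapped byte.
  struct iwreq* req = alloc_at_page_end(sizeof(struct iwreq));
  int sock = socket(AF_INET, SOCK_DGRAM, 0);
  strcpy(req->ifr_name, "wlan0"); // hypothetical wireless interface
  if (ioctl(sock, SIOCGIWNAME, req) < 0 && errno == EFAULT) {
    // All 32 bytes of the iwreq are mapped, so EFAULT here means the
    // kernel tried to access more than the documented 32 bytes.
    printf("kernel accessed past the 32-byte iwreq\n");
  }
  return 0;
}
```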

The bug was apparently introduced in Linux 2.1.15, released December 12, 1996. It's interesting that it wasn't found and fixed until now. I guess not many programs use these ioctls, and those that do probably use buffers that are always followed by at least eight more bytes of accessible memory, e.g. any buffer on the stack. Then again, if you wrote a program that allocated those buffers using malloc, I guess once in a while it would fail, when your allocator happened to land one at the end of a page.

This class of bugs, "small overrunning read that doesn't get used", could also be a problem in user-space. I don't recall seeing other bugs in this class, though.

Friday, 9 June 2017

Another Case Of Obscure CPU Nondeterminism

One of "big bets" of rr is that commodity CPUs running user-space code really are deterministic in practice, or at least the nondeterminism can be efficiently detected and controlled, without restorting to pervasive binary instrumentation. We're mostly winning that bet on x86 (but not ARM), but I'm uncomfortable knowing that otherwise-benign architectural changes could break us at any time. (One reason I'm keen to increase rr's user base is make system designers care about not breaking us.)

I recently discovered the obscure XGETBV instruction. This is a problematic instruction for rr because it returns the contents of internal system registers that we can't guarantee will be unchanged from run to run. (It's a bit weird that the contents of these registers are allowed to leak to userspace, but anyway...) Fortunately it's barely used in practice; the only use I've seen of it so far is by the dynamic loader ld.so, to read the contents of the XINUSE register. This obscure register tracks, for each task, whether certain CPU components are known to be in their default states. For example, if your x86-64 process has never used any (legacy floating-point) x87 instructions, bit 0 of XINUSE should be clear.

That should be OK for rr, since whether certain kinds of instructions have executed should be the same between recording and replay. Unfortunately I have observed cases where the has-used-x87 bit gets unexpectedly flipped on. By singlestepping the tracee it's pretty clear the bit is being set in code that has nothing to do with x87, and the point where it is set, and whether it happens at all, varies from run to run. (I have no way to tell whether the bit is being set by hardware or the kernel.) This is disturbing, because it means that XGETBV is effectively a nondeterministic instruction.

Fortunately, the ld.so users are just testing whether AVX has been used, so they mask off the x87 bit almost immediately, eliminating the nondeterminism from the program state before it can do any damage (well, unless you were super unlucky and took a context switch between the XGETBV and the mask operation). I haven't seen any bits other than the x87 bit being unexpectedly set.
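For reference, here's a minimal sketch of the probe ld.so performs. XGETBV with ECX=1 returns XCR0 ANDed with the XINUSE bitmap; real code should first check CPUID for XGETBV1 support (executing it without support raises a fault), a check I've omitted here for brevity.

```c
// Sketch of reading the XINUSE bitmap via XGETBV with ECX=1. On CPUs
// supporting XGETBV1, this returns XCR0 & XINUSE; check CPUID for that
// support first in real code (omitted here).
#include <stdio.h>

static unsigned long long xgetbv(unsigned int index) {
  unsigned int eax, edx;
  __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(index));
  return ((unsigned long long)edx << 32) | eax;
}

int main(void) {
  unsigned long long xinuse = xgetbv(1);
  // Bit 0 is the x87 state component: the bit observed flipping
  // nondeterministically from run to run.
  printf("x87 in use: %llu\n", xinuse & 1);
  return 0;
}
```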

Fortunately it seems we can work around the problem completely in rr by setting the x87-in-use bit immediately after every exec. If the bit is always set, it doesn't matter if something intermittently sets it again. Another day, another bullet dodged!
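One way to set that bit, and presumably roughly what rr does (this is a sketch under that assumption, not rr's actual code), is to execute a trivial x87 instruction:

```c
// Sketch: execute a do-nothing x87 sequence so XINUSE bit 0 (has-used-x87)
// is set deterministically. Loading 1.0 and popping it leaves no visible
// change to program state, but marks the x87 state component as in use.
static void force_x87_in_use(void) {
  __asm__ volatile("fld1\n\tfstp %%st(0)" : : : "st");
}
```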

Thursday, 1 June 2017

WebAssembly: Mozilla Won

Mozilla staff are being very diplomatic and restrained by allowing WebAssembly to be portrayed as a compromise between the approaches of asm.js and PNaCl. They have good reasons for being so, but I can be a bit less restrained. asm.js and PNaCl represented quite different visions for how C/C++ code should be supported on the Web, and I think WebAssembly is a big victory for asm.js and Mozilla's vision.

Considered as a Web platform feature, PNaCl had three major problems from the beginning, two of which it inherited from NaCl. By design, in NaCl and PNaCl, application code was highly isolated from the world of Javascript and HTML, running in its own process and behaving much like an opaque plugin with very limited interactions with its host page (since any interactions would have to happen over IPC). This led to a much bigger problem: to interact with the outside world, (P)NaCl code needed some kind of platform API, and Google decided to create an entirely new one — "Pepper" — which necessarily duplicated a lot of the functionality of standard Web platform APIs. To make things even worse, neither Pepper nor PNaCl's LLVM-based bytecode had a proper specification, let alone one with multi-vendor buy-in.

Therefore any non-Chrome-based browser wishing to implement PNaCl would have had to reverse-engineer a Chromium-bug-compatible Pepper spec and reimplement it, or more likely import a large amount of Chromium code to implement Pepper/NaCl. Likewise they'd have had to import a large amount of LLVM code for the PNaCl compiler. Both imports would have to stay in sync with whatever Google did. This would mean lots more code bloat, maintenance and spec work, and more work for Web developers too, not to mention being a severe blow to Web standards. Mozilla people (including me) explained the unacceptability of all this to relevant Google people early on, but to no avail.

Mozilla responded with the asm.js project to show that one could achieve similar goals with minimal extensions to the standards-based Web platform. asm.js sets up a Javascript typed array to represent memory and compiles C/C++ to Javascript code operating on that array. It avoids those big PNaCl issues: asm.js code can interact with JS and the DOM easily, since it shares the JS virtual machine; it specifies no new platform APIs, instead relying on the (already standardized and interoperably implemented) Web platform APIs; and very little new specification work had to be done, since it mostly relies on already-specified JS semantics.

In these key areas WebAssembly follows asm.js, not PNaCl. WebAssembly applications or components interact freely with JS, and AFAIK in all browsers WebAssembly is implemented as part of the JS VM. WebAssembly defines no new platform APIs other than some APIs for loading and linking WebAssembly code, relying on standards-based Web APIs for everything else. WebAssembly differs from asm.js by defining a bytecode format with some new operations JS doesn't have, so some spec work was required (and has been done!). Like asm.js, WebAssembly application call-stacks are maintained by the JS VM, outside the memory addressable by the application, which reduces the exploitability of application bugs. (Though, again like asm.js and unlike PNaCl, the compiler is trusted.)

I'm not belittling the contributions of Google and others to WebAssembly. They have done a lot of good work, and ditching PNaCl was an admirable decision for the good of the Web — thank you! Often in these contests proclaiming a "winner" is unimportant or even counterproductive, and it's true that in a sense Web developers are the real winners. I'm calling it out here because I think the good and essential work that Mozilla continues to do to improve the standards-based Web platform is too often overlooked. I also want credit to go to Mozilla's amazing asm.js/WebAssembly team, who went up against Google with far fewer resources but a better approach, and won.