Wednesday, 29 June 2016

Nexus 5X vs Wettest June Hour In Auckland's History

I had lunch with a friend in Newmarket today. I walked there from home, but it was raining so I half-ran there holding my phone and keys under my jacket. Unfortunately when I got there I was only holding my keys :-(. I ran around a bit looking for the phone, couldn't find it, and decided lunch with my friend was more important.

So later I walked home, keeping an eye out for my phone in vain --- it was really pouring the whole time; activated the Android Device Manager locator; drove to where it said the phone was; looked around (still raining) but couldn't find it and couldn't activate the ring (due to not having a phone!); drove home and activated the "call me when you find this phone" screen. Not long after that, a miracle happened: a kid called me on the phone. Not only had he found it and was willing to wait on the street for fifteen minutes while I drove back to pick it up, but the phone still worked despite having been out in the rain for the wettest June hour in Auckland's history. Seriously, on my way home there were torrents of water in the gutters and whole streets were flooded. Congratulations to LG and Google, thanks to God the phone wasn't simply washed away, and thanks to the Grammar boys who found it and waited for me. (The latter I was able to thank directly by giving them my copy of China Mieville's wonderful The City And The City.)

Relearning Debugging With rr

As I've mentioned before, once you have a practical reverse-execution debugger like rr, you need to learn new debugging strategies to exploit its power, and that takes time. (Almost all of your old debugging strategies still work --- they're just wasting your time!) A good example presented itself this morning. A new rr user wanted to stop at a location in JIT-generated code, and modified the JIT compiler to emit an int3 breakpoint instruction at the desired location --- because that's what you'd do with a regular debugger. But with rr there's no need: you can just run past the generation of the code, determine the address of your generated instruction after the fact (by inserting a logging statement at the point where you would have triggered generation of int3, if you must), set a hardware execution breakpoint at that address, and reverse-execute until that location is reached.
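Concretely, the tail of that workflow looks something like this at rr's gdb prompt (the address here is hypothetical --- it's whatever your logging statement reported during recording):

```
(rr) continue                   # run forward past the generation of the JIT code
(rr) hbreak *0x7f6b2c001a40     # hardware execution breakpoint at the generated
                                # instruction's address, taken from your log output
(rr) reverse-continue           # execute backwards until that address is reached
```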

One of the best reasons I've heard for not using rr was given by Jeff: "I don't want to forget how to debug on other platforms".

Tuesday, 28 June 2016

Handling Read-Only Shared Memory Usage In rr

One of rr's main limitations is that it can't handle memory being shared between recorded processes and not-recorded processes, because writes by a not-recorded process can't be recorded and replayed at the right moment. This mostly hasn't been a problem in practice. On Linux desktops, the most common cases where this occurs (X, pulseaudio) can be disabled via configuration (and rr does this automatically). However, there is another common case --- dconf --- that isn't easily disabled via configuration. When applications read dconf settings for the first time, the dconf daemon hands them a shared-memory buffer containing a one-byte flag. When those settings change, the flag is set. Whenever an application reads cached dconf settings, it checks this flag to see if it should refetch settings from the daemon. This is very efficient but it causes rr replay to diverge if dconf settings change during recording, because we don't replay that flag change.

Fortunately I've been able to extend rr to handle this. When an application maps the dconf memory, we map that memory into the rr supervisor process as well, and then replace the application mapping with a "shadow mapping" that's only shared between rr and the application. Then rr periodically checks to see whether the dconf memory has changed; if it has, we copy the changes to the shadow mapping and record that we did so. Essentially we inject rr into the communication from dconf to the application, and forward memory updates in a controlled manner. This seems to work well.

That "periodic check" is performed every time the recorded process completes a traced system call. That means we'll forward memory updates immediately when any blocking system call completes, which is generally what you'd want. If an application busy-waits on an update, we'll never forward it and the application will deadlock, but that could easily be fixed by also checking for updates on a timeout. I'd really like to have a kernel API that lets a process be notified when some other process has modified a chunk of shared memory, but that doesn't seem to exist yet!

This technique would probably work for other cases where shared memory is used as a one-way channel from not-recorded to recorded processes. Using shared memory as a one-way channel from recorded to not-recorded processes already works, trivially. So this leaves shared memory that's read and written from both sides of the recording boundary as the remaining hard (unsupported) case.

Monday, 27 June 2016

Dear Ubuntu, Please Fix Your Debuginfo Packaging

Fedora has a delightfully simple approach to supplying debuginfo for its packages: you run sudo dnf debuginfo-install <original-package-name> and everything you need to debug that code is installed (including sources). It also installs debuginfo for dependent packages.

Ubuntu does not. With Ubuntu, you first have to figure out the name of the debuginfo package to install. This is not easy. Once you've figured that out, you install the package and find that it's mostly useless because it doesn't contain sources. So you try to install sources using sudo apt source ..., but that fails because "You must put some 'source' URIs in your sources.list". Then you look at /etc/apt/sources.list and discover that all the necessary deb-src URLs are there, but commented out by default. (Why? Because otherwise it would make things too easy?) Then you uncomment those URLs and try again, but get the same error. Then you wail and gnash your teeth a bit and figure out you need to run sudo apt-get update before retrying. Finally you've got the source package installed; then you run rr replay (or gdb if you must) and discover that it's not finding those sources. Then, unless you're more stubborn than I am (or just have more time to burn), you give up in despair.

Seriously Ubuntu (or Debian), just copy Red Hat here. Please.

Friday, 24 June 2016

Democracy Is Impressive

Two very surprising things happened in the last few months: US Republicans nominated Donald Trump as their candidate for President, and Britons voted to leave the EU. These events surprised me because they were fiercely opposed by the wealthy, politically powerful elites that are normally portrayed as running the world. I think it's amazing and inspiring that popular uprisings were able to overcome those forces. That fact should be celebrated, and recalled to refute defeatist rhetoric.

Unfortunately I can't celebrate the actual decisions themselves. I think that in the former case the people made a disastrous mistake, and in the latter case that remains to be seen. But the possibility that people will misuse their power is the price of democracy and all human freedom. I think it's a price worth paying.

Thursday, 23 June 2016

PlayCanvas Is Impressive

I've been experimenting on my children with different ways to introduce them to programming. We've tried Stencyl, Scratch, JS/HTML, Python, and Codecademy with varying degrees of success. It's difficult because, unlike when I learned to program 30 years ago, it's hard to quickly get results that compare favourably with the vast universe of apps and content they've already been exposed to. Frameworks and engines face a tradeoff between power, flexibility and ease of use; if a system is too simple, it's impossible to do what you want and you may not learn "real programming", but if it's too complex, it may just be too hard to do what you want or you won't get results quickly.

Recently I discovered PlayCanvas and so far it looks like the best approach I've seen. It's a Web-based 3D engine containing the ammo.js (Bullet) physics engine, a WebGL renderer, a WYSIWYG editor, and a lot more. It does a lot of things right:

  • Building in a physics engine, renderer and visual editor gives a very high level of abstraction that lets people get impressive results quickly while still being easy to understand (unlike, say, providing an API with 5000 interfaces, one of which does what you want). Stencyl does this, but the other environments I mentioned don't. But Stencyl is only 2D; supporting 3D adds significant power without, apparently, increasing the user burden all that much.
  • Being Web-based is great. There's no installation step, it works on all platforms (I guess), and the docs, tutorials, assets, forkable projects, editor and deployed content are all together on the Web. I suspect having the development platform be the same as the deployment platform helps. (The Stencyl editor is Java but its deployed games are not, so WYS is not always WYG.)
  • Performance is good. The development environment works well on a mid-range Chromebook. Deployed games work on a new-ish Android phone.
  • So far the implementation seems robust. This is really important; system quirks and bugs make learning a lot harder, because novices can't distinguish their own bugs from system bugs.
  • The edit-compile-run cycle is reasonably quick, at least for small projects. Slow edit-compile-run cycles are especially bad for novices who'll be making a lot of mistakes.
  • PlayCanvas is programmable via a JS component model. You write JS components that get imported into the editor and are then attached to scene-graph entities. Components can have typed parameters that appear in the editor, so it's pretty easy to create components reusable by non-programmers. However, for many behaviors (e.g. autonomously-moving objects) you probably need to write code --- which is a good thing. It's a bit harder than Scratch/Stencyl but since you're using JS you have more power and develop more reusable skills, and cargo-culting and tweaking scripts works well. You actually have access to the DOM if you want, although mostly you'd stick to the PlayCanvas APIs. It looks like you could ultimately do almost anything you want, e.g. add multiplayer support and voice chat via WebRTC.
  • PlayCanvas has WebVR support though I haven't tried it.
  • It's developed on github and MIT licensed so if something's broken or missing, someone can step in and fix it.

So far I'm very impressed and my child is getting into it.

Friday, 17 June 2016

Managing Vast, Sparse Memory On Linux

For a problem I'm working on, an efficient solution is to use a 32GB array, most of whose entries are unused. Fortunately this is no problem at all on 64-bit Linux using a couple of simple tricks.

The first trick is to allocate the array using the little-known (to me) MAP_NORESERVE option:

p = mmap(nullptr, 32ULL * 1024 * 1024 * 1024, PROT_READ | PROT_WRITE,
         MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
(I thought this was the default on Linux, so I'm not sure why it's needed, but it is with 4.5.5-201.fc23.x86_64 at least. Oddly enough, if I omit that flag and just do two 16GB mmaps, that works too.) Now you can write to that region and only the pages you write to will actually be allocated. Good times.

Now, once in a while I want to clear that memory or just some subsections of it. memset(array, 0, size) is out of the question since it would take a very long time and would also probably cause all those pages to be allocated, killing my process if I'm lucky and the entire system if I'm less lucky.

Fortunately we have madvise(array, size, MADV_DONTNEED)! For a MAP_ANONYMOUS mapping like ours, this delightful system call simply frees all the pages in the range and instructs the kernel to use fresh zero pages next time the memory is read or written. Not only is it theoretically efficient, it's fast in practice. I can touch 10,000 scattered pages in my 32GB array and then zero the entire array with one MADV_DONTNEED in 12ms total.

It would be nice if tricks like these worked for pages with values other than just zeroes. For example, Firefox sometimes needs large allocations filled with 0xFF000000 for graphics buffers, or special bit patterns representing "poisoned" memory for security purposes. I think you could implement something like that using userfaultfd or mapping a file from a custom FUSE filesystem, but they would probably be significantly slower.

Wednesday, 15 June 2016

Nastiness Works

One thing I experienced many times at Mozilla was users pressuring developers with nastiness --- ranging from subtle digs to vitriolic abuse, trying to make you feel guilty and/or trigger an "I'll show you! (by fixing the bug)" response. I know it happens in most open-source projects; I've been guilty of using it myself.

I particularly dislike this tactic because it works on me. It really makes me want to fix bugs. But I also know one shouldn't reward bad behavior, so I feel bad fixing those bugs. Maybe the best I can do is call out the bad behavior, fix the bug, and avoid letting that same person use that tactic again.

Perhaps you're wondering "what's wrong with that tactic if it gets bugs fixed?" Development resources are finite so every bug or feature is competing with others. When you use nastiness to manipulate developers into favouring your bug, you're not improving quality generally, you're stealing attention away from other issues whose proponents didn't stoop to that tactic and making developers a little bit miserable in the process. In fact by undermining rational triage you're probably making quality worse overall.

Monday, 13 June 2016

"Safe C++ Subset" Is Vapourware

In almost every discussion of Rust vs C++, someone makes a comment like:

the subset of C++14 that most people will want to use and the guidelines for its safe use are already well on their way to being defined ... By following the guidelines, which can be verified statically at compile time, the same kind of safeties provided by Rust can be had from C++, and with less annotation effort.
This promise is vapourware. In fact, it's classic vapourware in the sense of "wildly optimistic claim about a future product made to counter a competitor". (Herb Sutter says in comments that this wasn't designed with the goal of "countering a competitor" so I'll take him at his word (though it's used that way by others). Sorry Herb!)

(FWIW the claim quoted above is actually an overstatement of the goals of the C++ Core Guidelines to which it refers, which say "our design is a simpler feature focused on eliminating leaks and dangling only"; Rust provides important additional safety properties such as data-race freedom. But even just the memory safety claim is vapourware.)

To satisfy this claim, we need to see a complete set of statically checkable rules and a plausible argument that a program adhering to these rules cannot exhibit memory safety bugs. Notably, languages that offer memory safety are not just claiming you can write safe programs in the language, nor that there is a static checker that finds most memory safety bugs; they are claiming that code written in that language (or the safe subset thereof) cannot exhibit memory safety bugs.

AFAIK the closest to this C++ gets is the Core Guidelines Lifetimes I and II document, last updated December 2015. It contains only an "informal overview and rationale"; it refers to "Section III, analysis rules (forthcoming this winter)", which apparently has not yet come forth. (I'm pretty sure they didn't mean the New Zealand winter.) The informal overview shows a heavy dependence on alias analysis, which does not inspire confidence because alias analysis is always fragile. The overview leaves open critical questions about even trivial examples. Consider:

unique_ptr<int> p;
void foo(const int& v) {
  p = nullptr;
  cout << v;
}
void bar() {
  p = make_unique<int>(7);
  foo(*p);
}
Obviously this program is unsafe and must be forbidden, but what rule would reject it? The document says
  • In the function body, by default a Pointer parameter param is assumed to be valid for the duration of the function call and not depend on any other parameter, so at the start of the function lset(param) = param (its own lifetime) only.
  • At a call site, by default passing a Pointer to a function requires that the argument’s lset not include anything that could be invalidated by the function.
Clearly the body of foo is OK by those rules. For the call to foo from bar, it depends on what is meant by "anything that could be invalidated by the function". Does that include anything reachable via global variables? Because if it does, then you can't pass anything reachable from a global variable to any function by reference, which is crippling. But if it doesn't, then what rejects this code?

Update Herb points out that example 7.1 covers a similar situation with raw pointers. That example indicates that anything reachable through a global variable cannot be passed to a function by raw pointer or reference. That still seems like a crippling limitation to me. You can't, for example, copy-construct anything (indirectly) reachable through a global variable:

unique_ptr<Foo> p;
void bar() {
  p = make_unique<Foo>(...);
  Foo xyz(*p); // Forbidden!
}

This is not one rogue example that is easily addressed. This example cuts to the heart of the problem, which is that understanding aliasing in the face of functions with potentially unbounded side effects is notoriously difficult. I myself wrote a PhD thesis on the subject, one among hundreds, if not thousands. Designing your language and its libraries from the ground up to deal with these issues has been shown to work, in Rust at least, but I'm deeply skeptical it can be bolted onto C++.

Q&A

Aren't clang and MSVC already shipping previews of this safe subset? They're implementing static checking rules that no doubt will catch many bugs, which is great. They're nowhere near demonstrating they can catch every memory safety bug.

Aren't you always vulnerable to bugs in the compiler, foreign code, or mistakes in the safety proofs, so you can never reach 100% safety anyway? Yes, but it is important to reduce the amount of trusted code to the minimum. There are ways to use machine-checked proofs to verify that compilation and proof steps do not introduce safety bugs.

Won't you look stupid when Section III is released? Occupational hazard, but that leads me to one more point: even if and when a statically checked, plausibly safe subset is produced, it will take significant experience working with that subset to determine whether it's viable. A subset that rejects core C++ features such as references, or otherwise excludes most existing C++ code, will not be very compelling (as acknowledged in the Lifetimes document: "Our goal is that the false positive rate should be kept at under 10% on average over a large body of code").

Sunday, 12 June 2016

Mt Pirongia

At the end of April we did an overnight tramp up Mt Pirongia. Mt Pirongia is an old volcano southwest of Hamilton, not far from Kawhia on the west coast of the North Island. Its summit is 959m above sea level, most of the mountain is covered in thick bush, and there's a nice new DoC hut near the summit. It's only about two hours drive from Auckland so it's one of the closest hut tramps; I'd been thinking about doing it for years but had heard it was "hard core", and so put it off until my children were older.

On the Saturday we had pretty good weather the whole way. Up to about 700m it was pretty easy going but the central part of the mountain comprises several independent peaks, so there are numerous very steep up-and-down sections, some more climbing than walking, some with chains to help. It's tiring but there's not too much of it. It took us about five hours to get all the way to the hut. The views were spectacular; you can see out to the ocean to the west, and across the Waikato and King Country to the east. Supposedly on a clear day you can see Mt Taranaki and the central volcanic plateau, but it was too hazy for us.

The new hut is good, though it doesn't have the solar-powered LED lighting that many other new huts have. It was lovely to watch the evening descend. Because it was a one-night-only tramp, we brought better food than normal --- steaks, puddings, and bacon and eggs for breakfast. My children introduced the rest of our party to Bang! and everyone had a great time.

On Sunday we were in cloud most of the way down, but it was still lovely and took us about four hours. Overall it was, as I'd heard before, a little "hard core" and probably not the best first-time overnight tramp (the Pinnacles in Coromandel is still the best first-time overnight hut tramp from Auckland, IMHO), but well worth doing.

Whanganui River Journey

Back in April I did the Whanganui River Journey with some friends and family. This is branded as one of New Zealand's "Great Walks", but it's actually five days of canoeing from Taumarunui in the central North Island to Pipiriki, covering 145km. We stayed at campsites for three nights and at the John Coull Hut along the way.

I hadn't done a river trip like this before but I really enjoyed it. Like pretty much everyone else on the river (apart from a few jetboats) we were using open-top Canadian canoes. We were able to carry more gear than we would have while tramping. It's less tiring overall but my arms seldom get such a workout. The scenery is quite different; no vistas from mountain-tops, but seeing gorges and river valleys from the river has its own charm. The river gives you regular challenges as you shoot rapids and risk capsizing. Three of our four canoes had a capsize during the trip, which is actually no big deal, and in retrospect was an experience worth having :-). I discovered that there's a big difference between "container waterproof enough to keep rain out" and "container waterproof while submerged in a fast-flowing river".

Along the way we stopped for a walk to the "Bridge To Nowhere", a big concrete bridge across a valley in the middle of the bush. A road was built to it to support settlement in the area, but was too difficult to maintain so eventually was abandoned and the settlers moved out. Apparently exactly one car ever drove over the bridge. It was unfortunate for the settlers, many of whom were World War I veterans, but Whanganui National Park is a real treasure now.

Our canoe rental company (Blazing Paddles) was good. They picked us up at Pipiriki and drove us and our gear back to base at Taumarunui. That road trip took a couple of hours and was a lot of fun; I talked extensively to the driver, whose family has been in the central North Island area (one of my favourite places) for a hundred years. There are also some spectacular views of the volcanic plateau and the mighty three volcanoes Tongariro, Ngauruhoe and Ruapehu.

The photos here were taken by my friends after my own camera got wet on the first day...

Friday, 10 June 2016

Some Dynamic Measurements Of Firefox On x86-64

This follows up on my previous measurements of static properties of Firefox code on x86-64 with some measurements of dynamic properties obtained by instrumenting code. These are mostly for my own amusement but intuitions about how programs behave at the machine level, grounded in data, have sometimes been unexpectedly useful.

Dynamic properties are highly workload-dependent. Media codecs are more SSE/AVX intensive than regular code so if you do nothing but watch videos you'd expect qualitatively different results than if you just load Web pages. I used a mixed workload that starts Firefox (multi-process enabled, optimized build), loads the NZ Herald, scrolls to the bottom, loads an article with a video, plays the video for several seconds, then quits. It ran for about 30 seconds under rr and executes about 60 billion instructions.

I repeated my register usage analysis, this time weighted by dynamic execution count and taking into account implicit register usage such as push using rsp. The results differ significantly depending on whether you count the consecutive iterations of a repeated string instruction (e.g. rep movsb) as a single instruction execution or as one instruction execution per iteration, so I show both. Unlike the static graphs, these results are for all instructions executed anywhere in the process(es), including JITted code, not just libxul.

  • As expected, registers involved in string instructions get a big boost when you count string instruction repetitions individually. About 7 billion of the 64 billion instruction executions "with string repetitions" are iterations of string instructions. (In practice Intel CPUs can optimize these to execute 64 iterations at a time, under favourable conditions.)
  • As expected, sp is very frequently used once you consider its implicit uses.
  • String instructions aside, the dynamic results don't look very different from the static results. Registers R8 to R11 look a bit more used in this graph, which may be because they tend to be allocated in highly optimized leaf functions, which are more likely to be hot code.

  • The surprising thing about the results for SSE/AVX registers is that they still don't look very different to the static results. Even the bottom 8 registers still aren't frequently used compared to most general-purpose registers, even though I deliberately tried to exercise codec code.
  • I wonder why R5 is the least used bottom-8 register by a significant margin. Maybe these results are dominated by a few hot loops that by chance don't use that register much.

I was also interested in exploring the distribution of instruction execution frequencies:

A dot at position x, y on this graph means that fraction y of all instructions executed at least once is executed at most x times. So, we can see that about 19% of all instructions executed are executed only once. About 42% of instructions are executed at most 10 times. About 85% of instructions are executed at most 1000 times. These results treat consecutive iterations of a string instruction as a single execution. (It's hard to precisely define what it means for an instruction to "be the same" in the presence of dynamic loading and JITted code. I'm assuming that every execution of an instruction at a particular address in a particular address space is an execution of "the same instruction".)

Interestingly, the five most frequently executed instructions are executed about 160M times. Those instructions are for this line, which is simply filling a large buffer with 0xff000000. gcc is generating quite slow code:

132e7b2: cmp    %rax,%rdx
132e7b5: je     132e7d1 
132e7b7: movl   $0xff000000,(%r9,%rax,4)
132e7bf: inc    %rax
132e7c2: jmp    132e7b2 
That's five instructions executed for every four bytes written. This could be done a lot faster in a variety of different ways --- rep stosd or rep stosq would probably get the fast-string optimization, but SSE/AVX might be faster.

Are Dynamic Control-Flow Integrity Schemes Worth Deploying?

Most exploits against C/C++ code today rely on hijacking CPU-level control flow to execute the attacker's code. Researchers have developed schemes to defeat such attacks based on the idea of control flow integrity: characterize a program's "valid control flow", and prevent deviations from valid control flow at run time. There are lots of CFI schemes, employing combinations of static and dynamic techniques. Some of them don't even call themselves CFI, but I don't have a better term for the general definition I'm using here. Phrased in this general way, it includes control-transfer instrumentation (CCFIR etc), pointer obfuscation, shadow stacks, and even DEP and ASLR.

Vendors of C/C++ software need to consider whether to deploy CFI (and if so, which scheme). It's a cost/benefit analysis. The possible benefit is that many bugs may become significantly more difficult --- or even impossible --- to exploit. The costs are complexity and run-time overhead.

A key question when evaluating the benefit is, how difficult will it be for CFI-aware attackers to craft exploits that bypass CFI? That has two sub-questions: how often is it possible to weaponize a memory-safety bug that's exploited via control-flow hijacking today, with an exploit that is permitted by the CFI scheme? And, crucially, will it be possible to package such exploitation techniques so that weaponizing common C/C++ bugs into CFI-proof exploits becomes cheap? A very interesting paper at Oakland this year, and related work by other authors, suggests that the answer to the first sub-question is "very often" and the answer to the second sub-question is "don't bet against it".

Coincidentally, Intel has just unveiled a proposal to add some CFI features to their CPUs. It's a combination of shadow stacks with dynamic checking that the targets of indirect jumps/calls are explicitly marked as valid indirect destinations. Unlike some more precise CFI schemes, you only get one-bit target identification; a given program point is a valid destination for all indirect transfers or none.

So will CFI be worth deploying? It's hard to say. If you're offered a turnkey solution that "just works" with negligible cost, there may be no reason not to use it. However, complexity has a cost, and we've seen that sometimes complex security measures can even backfire. The tail end of Intel's document is rather terrifying; it tries to enumerate the interactions of their CFI feature with all the various execution modes that Intel currently supports, and leaves me with the impression that they're generally heading over the complexity event horizon.

Personally I'm skeptical that CFI will retain value over the long term. The Oakland DOP paper is compelling, and I think we generally have lots of evidence that once an attacker has a memory safety bug to work on, betting against the attacker's ingenuity is a loser's game. In an arms race between dynamic CFI (and its logical extension to dynamic data-flow integrity) and attackers, attackers will probably win, not least because every time you raise the CFI bar you'll pay with increased complexity and overhead. I suggest that if you do deploy CFI, you should do so in a way that lets you pull it out if the cost-benefit equation changes. Baking it into the CPU does not have that property...

One solution, of course, is to reduce the usage of C/C++ by writing code in a language whose static invariants are strong enough to give you CFI, and much stronger forms of integrity, "for free". Thanks to Rust, the old objections that memory-safe languages were slow, tied to run-time support and cost you control over resources don't apply anymore. Let's do it.

Monday, 6 June 2016

How To Track Down Divergence Bugs In rr

Brendan Dolan-Gavitt asked me to write about how I debug rr itself. That's an interesting topic because it's a challenging problem; as I wrote a while ago, you sometimes feel you're in debugging hell so that others can go to heaven.

Brendan was talking about divergence bugs, i.e. bugs where replaying a recorded execution goes "off the rails" because there's some source of nondeterminism that differs between recording and replay. These can be further divided into recording bugs, where nondeterminism wasn't recorded correctly, and replay bugs, where there's enough information in the trace to replay but we replayed incorrectly.

rr saves the values of all general-purpose registers, the program counter, and the conditional branch counter, at most trace events, i.e. signals, context switches, and system calls that don't take the syscall-buffering fast path. We typically detect divergence bugs when replay reaches the execution point of the next event but the replay register/counter state doesn't match the recorded state. This includes cases where a different system call is performed, and most cases where behavior changes during replay produce writes to stdout/stderr that differ from those during recording. Here's an example:

[ERROR /home/kfischer/rr-vanilla/src/Registers.cc:303:maybe_print_reg_mismatch() errno: SUCCESS] rdx 0 != 0xff (replaying vs. recorded)
[FATAL /home/kfischer/rr-vanilla/src/Registers.cc:423:compare_register_files() errno: SUCCESS]
 (task 24036 (rec:24003) at time 1132365)
 -> Assertion `!bail_error || match' failed to hold. Fatal register mismatch (ticks/rec:4473519809/4473519809)
Launch gdb with
  gdb /home/kfischer/.local/share/rr/latest-trace/mmap_hardlink_23_wine64
and attach to the rr debug server with:
  target remote :24036
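
Conceptually, the check rr performs at each event amounts to comparing the replayed register file and tick count against the recorded ones. A minimal sketch (the function and register names here are invented for illustration, not rr's actual code):

```python
# Hypothetical sketch of rr's per-event consistency check (names invented).
def compare_register_files(recorded: dict, replayed: dict) -> list:
    """Return a list of mismatch descriptions, empty if the states match."""
    mismatches = []
    for reg, rec_value in recorded.items():
        rep_value = replayed.get(reg)
        if rep_value != rec_value:
            mismatches.append(
                f"{reg} {rep_value:#x} != {rec_value:#x} (replaying vs. recorded)")
    return mismatches

# A divergence like the rdx mismatch above would surface as:
recorded = {"rip": 0x7f0000001000, "rdx": 0xff, "ticks": 4473519809}
replayed = {"rip": 0x7f0000001000, "rdx": 0x0, "ticks": 4473519809}
print(compare_register_files(recorded, replayed))
```

The useful property is that the ticks (conditional-branch) counter is compared alongside the registers, so a control-flow divergence is caught even when the registers happen to agree.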

In rr, most divergences are detected pretty soon after they are caused. In practice almost any change in control flow causes the conditional-branch counter to diverge. So the first thing I do is look for clues that something unusual happened just before the divergence was detected. One way to do that is to look at the trace leading up to the current point. If a rare system call or signal happened just before we diverged, it's worth looking closely at our handling of that! Another way is to use the emergency debugger mentioned above. This provides a gdbserver interface so you can inspect the state of the diverged process with gdb. Looking at the call stack can show that the process was doing something unusual that needs to be checked, especially if you have application source. Surprisingly often, with this data, I'm able to make some inspired guesses and write a standalone testcase that reproduces the problem, and from there it's usually easy.

When we have a divergence in one or a few register values, such as the issue linked above, that almost always means control flow did not diverge and we can make progress by reasoning backwards to figure out where that bad value must have come from. As you'd hope, rr itself is very useful for this; you can replay right up to the divergence point and reverse-stepi to see where the bad value is set. If reverse-execution is busted for some reason, we can tell rr to singlestep over some region of the trace and dump all register values at each step --- slow, but usually gives you the information you need. These techniques don't work if control flow diverged.
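
A session chasing a single bad register value might look like this (the commands are real rr/gdb commands, but the session and the address are made up for illustration):

```
$ rr replay                 # replay the recorded trace under gdb
(rr) break *0x7f0000001000  # stop near where the bad value was observed
(rr) continue
(rr) reverse-stepi          # step backwards, instruction by instruction,
(rr) info registers rdx     # watching for where the bad value was set
```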

For harder bugs, if we're going to make any progress we generally need to be able to reproduce the bug by re-recording, instead of just working off a single recorded trace. Re-recording with the system-call buffering optimization disabled is very helpful; it's a lot slower, but we check registers/counters at every single system call and catch divergences earlier. If the bug disappears entirely, we can safely bet it's in the syscall-buffering logic itself, and we can easily narrow down which syscall wrapper is responsible. A similar but more aggressive approach, which I've used in some desperate situations, is to set the scheduler to context-switch every small number of conditional branches, and to suppress noop-reschedule elision, so the trace records the register state very frequently --- effective, but often impractically slow.

Another tool in our toolbox is memory checksumming. To catch bugs where a memory location diverges and we don't detect the divergence until much later, rr has command-line options to periodically checksum memory regions and include those checksums in the trace. During replay we verify the checksums to see whether memory has diverged. When you're close to the target, you can also dump actual memory contents and compare them. I don't use this facility very much, but when you need it, you really need it.
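
The idea is simple enough to sketch. Everything below is invented for illustration (rr's real implementation works over the process's mapped regions, not a dict), but it shows how checksums localize a divergence to the first event where a region's digest differs:

```python
import hashlib

# Hypothetical sketch of periodic memory checksumming (names invented).
def checksum_regions(regions: dict) -> dict:
    """Map each memory region's name to a digest of its contents."""
    return {name: hashlib.sha1(data).hexdigest() for name, data in regions.items()}

# During recording, store the checksums in the trace...
recorded = checksum_regions({"[heap]": b"\x00" * 4096, "[stack]": b"\x7f" * 4096})
# ...and during replay, recompute and compare to localize the divergence.
replayed = checksum_regions({"[heap]": b"\x00" * 4095 + b"\x01", "[stack]": b"\x7f" * 4096})
diverged = [name for name in recorded if recorded[name] != replayed[name]]
print(diverged)  # the regions whose contents no longer match the recording
```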

For bugs where we know we've recorded the correct data but replay goes wrong, we have the ability to record rr replay under rr (and, obviously, replay it), so we can actually use reverse-execution debugging to debug rr's replay code. I haven't used this a lot yet, since I got it working relatively recently, and sometimes it falls over ("and then you have two problems"), but when it works you get the feeling of lightning shooting out of your fingertips.

One general rule that I learned at Mozilla is to be very aggressive about using assertions to check invariants. This helps detect bugs closer to the root cause. Generally rr has assertions enabled all the time; the performance impact is not usually an issue since rr's efficiency (apart from the syscallbuf code) mostly comes down to switching from the application to rr as infrequently as possible. If we're running rr code, we're already losing.

Some of the hardest divergence bugs we've had in rr have been misbehavior in the hardware conditional-branch counter, either induced by OS/VM issues or by actual hardware bugs. These are agony. Definitely study Intel's errata sheets, read kernel source, and test carefully across different CPU/VM configurations.

I hope this helps! I'm always glad to answer questions about rr :-).

The Diving Bell And Twitter

Trying to conduct an intelligent conversation on Twitter makes me feel the tiniest bit like one of those people who are paralyzed and only able to communicate by blinking. Voluntarily restricting my communication to 140-character packets is absurd, and so are "part 1/N" hacks; fragmentation and reassembly are for network stacks, not people. On top of that, apparently a PhD in computer science is not enough to understand Twitter's conversation model.

I'd much rather communicate by blogging and blog comments, email, or even HN/Reddit. Unfortunately RSS is dead and a lot of people only read my blog posts if I announce them on Twitter. The risk is that people respond on Twitter and I get pulled into a conversation there like an arm caught in a woodchipper.

So, I will make every effort to resist replying to threads on Twitter. Please leave comments in my blog or send me email instead of tweeting.

Saturday, 4 June 2016

Research Needed: A Meta (Dis) Assembler

I'm working on a binary rewriting engine. A core component of such engines is a disassembler to parse machine code and extract the necessary information about each instruction: its length, operands, and effects. For x86-64 this is challenging because the architecture has a vast number of instructions, the list keeps growing, and the semantics and encodings are complicated.

I'm standing on the shoulders of giants by using the DynamoRio standalone disassembler/assembler library. It's wonderful, but any general-purpose library for this must wrestle with a difficult tradeoff between generality and performance/footprint. I want to know for each instruction whether it belongs to one of a few narrow classes (e.g. direct control transfer, indirect control transfer, conditional branch, system call), but I don't otherwise care what the opcode is. I want to know for each instruction whether it reads or writes R8-R11, and I want to know whether it reads or writes a RIP-relative memory location, but I don't otherwise care what the operands are. I want to know whether it reads or writes certain flags, and unlike other DynamoRio clients, I want to distinguish "undefined" flag effects from flag writes. So when building a disassembler to serve multiple clients (which you do want to do, because disassemblers are expensive to build and maintain), you have to produce a lot of information that any given client isn't going to use. This is a problem because disassembly is a critical component of the overhead of a high-performance binary rewriting engine.

DR mitigates this problem somewhat by giving you options to control how much information is produced, and by producing some information lazily. But I'd guess that doesn't get close to the performance of a disassembler optimized for my particular client. For example, DR contains large tables of opcodes and operand information for a vast number of SSE/AVX instructions, most of which I don't care about. An optimized disassembler for my client could use much smaller tables.

So here's a goal: design a disassembler framework that can generate disassemblers highly optimized for specific clients.

Here's how I think this could work:

  • Design a rich AST for instructions containing all the information any client wants. A disassembler is a function from a string of bytes plus some state bits (e.g. CPU mode) to an Instruction object in this AST.
  • Create a domain-specific purely-functional language to express such functions. Make it look as much like vendor CPU documentation as possible. Make it as easy as possible to add new instructions. (DR is not bad, but not quite as easy to extend as it could be.)
  • Actually write an x86-64 disassembler in this language.
  • For each client, write a function from Instruction objects to some client-specific data type containing just the information the client needs. Write this function in a tightly constrained DSL that focuses on simple projections --- dropping certain fields, mapping constants onto smaller sets of values, etc.
  • The disassembler you want is then the composition of the base disassembler with the per-client projection. Build an optimizing compiler for these composed functions.
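
As a toy illustration of the projection step --- everything here is invented, not DR's API or any real AST --- a client like mine might map a rich Instruction object down to a coarse class plus a couple of booleans, discarding opcode and operand detail the client never looks at:

```python
from dataclasses import dataclass

# Invented, drastically simplified stand-ins for the rich AST and the
# client-specific projection.
@dataclass
class Instruction:
    opcode: str
    regs_read: frozenset
    regs_written: frozenset
    rip_relative: bool

CONTROL_TRANSFER = {"jmp": "direct", "call": "direct", "jmp*": "indirect",
                    "jcc": "conditional", "syscall": "syscall"}
SCRATCH = {"r8", "r9", "r10", "r11"}

def project(insn: Instruction) -> tuple:
    """Keep only what the rewriting client cares about; drop everything else."""
    return (CONTROL_TRANSFER.get(insn.opcode, "other"),
            bool((insn.regs_read | insn.regs_written) & SCRATCH),
            insn.rip_relative)

# An AVX add and an indirect jump collapse to tiny client-specific records:
add = Instruction("vaddps", frozenset({"ymm0", "ymm1"}), frozenset({"ymm0"}), False)
jmp = Instruction("jmp*", frozenset({"r11"}), frozenset(), False)
print(project(add), project(jmp))
```

The point of the optimizing compiler is that, given this projection, the generated disassembler never needs the tables that distinguish vaddps from the hundreds of other "other"-class instructions in the first place.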

I think this is tractable because the domain is very constrained. The input to the composed function is a string of a relatively small (bounded) number of bits. The x86-64 instruction set is baroque but it does have exploitable structure. Going crazy with solver-powered program synthesis should be very feasible.

Like any new disassembler, you'd want to test by generating output in standard text formats and comparing to existing disassemblers.

For a stretch goal, try generating an assembler from the disassembler description. There are actually two kinds of assembly needed by binary rewriting engines. They need to tweak disassembled instructions (e.g. replace one operand with another) and emit the results. They also need to assemble from scratch new instructions with a limited set of specific opcodes and operands. In the latter case, a function call to assemble an instruction can usually be reduced to a very short sequence of bit-manipulation and store instructions. For both cases, in principle assembly is just the problem of finding a string of bits which disassembles to an Instruction object satisfying certain constraints. Often multiple encodings are possible, in which case you usually want the shortest instruction, though hints might be needed to select the optimal form. My intuition is that generating efficient assemblers from the disassembler DSL should be feasible.
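
To make "assembly as inverse disassembly" concrete, here's a toy, entirely invented ISA where one instruction has a short and a long encoding, and a naive search over byte strings of increasing length recovers the shortest one. (Brute force only works for a toy; a real generator would invert the decoding tables directly, perhaps with solver help.)

```python
from itertools import product

# Toy disassembler for an invented two-instruction ISA:
#   0x90                 -> nop  (1 byte)
#   0xeb <rel8>          -> jmp  (2 bytes, short form)
#   0xe9 <rel32, little> -> jmp  (5 bytes, long form)
def disassemble(code: bytes):
    if code == b"\x90":
        return ("nop", None)
    if len(code) == 2 and code[:1] == b"\xeb":
        return ("jmp", int.from_bytes(code[1:], "little", signed=True))
    if len(code) == 5 and code[:1] == b"\xe9":
        return ("jmp", int.from_bytes(code[1:], "little", signed=True))
    return None

def assemble(want) -> bytes:
    """Find the shortest byte string whose disassembly satisfies the request."""
    for length in range(1, 6):
        for candidate in product(range(256), repeat=length):
            code = bytes(candidate)
            if disassemble(code) == want:
                return code
    raise ValueError("no encoding found")

print(assemble(("jmp", 4)).hex())  # the 2-byte short form wins: eb04
```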

This seems like the sort of problem that researchers might have tackled before. The closest thing I can find is Ramsey and Fernández's SLED language in the New Jersey Machine-Code Toolkit from the mid-90s, which doesn't consider the composition and optimization problem.

Anyway, this seems like a pretty cool project. There's scope for doing cutting-edge program synthesis work, and the results would be incredibly useful. Unfortunately I don't have time to do it :-).