Tuesday, 13 November 2018

Comparing The Quality Of Debug Information Produced By Clang And Gcc

I've had an intuition that clang produces generally worse debuginfo than gcc for optimized C++ code. It seems that clang builds have more variables "optimized out" — i.e. when stopped inside a function where a variable is in scope, the compiler's generated debuginfo does not describe the value of the variable. This makes debuggers less effective, so I've attempted some quantitative analysis of the issue.

I chose to measure, for each parameter and local variable, the range of instruction bytes within its function over which the debuginfo can produce a value for this variable, and also the range of instruction bytes over which the debuginfo says the variable is in scope (i.e. the number of instruction bytes in the enclosing lexical block or function). I add those up over all variables, and compute the ratio of variable-defined-bytes to variable-in-scope-bytes. The higher this "definition coverage" ratio, the better.
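
Here's a minimal sketch of the computation in Rust; the Variable type and its fields are hypothetical stand-ins for data that the real tool extracts from DWARF location lists and lexical-block ranges.

// Sketch only: sum defined bytes and in-scope bytes over all variables,
// then take the ratio. Higher is better.
struct Variable {
    defined_bytes: u64, // instruction bytes where debuginfo yields a value
    scope_bytes: u64,   // instruction bytes of the enclosing scope
}

fn definition_coverage(vars: &[Variable]) -> f64 {
    let defined: u64 = vars.iter().map(|v| v.defined_bytes).sum();
    let in_scope: u64 = vars.iter().map(|v| v.scope_bytes).sum();
    defined as f64 / in_scope as f64
}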

This metric has some weaknesses:

  • DWARF debuginfo doesn't give us accurate scopes for local variables; the defined-bytes for a variable defined halfway through its lexical scope will be about half of its in-scope-bytes even if the debuginfo is perfect, so the ideal ratio is less than 1 (and unfortunately we can't compute it).
  • In debug builds, and sometimes in optimized builds, compilers may give a single definition for the variable value that applies to the entire scope; this improves our metric even though the results are arguably worse.
  • Sometimes compilers produce debuginfo that is simply incorrect; our metric doesn't account for that.
  • Not all variables and functions are equally interesting for debugging, but this metric weighs them all equally.
  • The metric assumes that the points of interest for a debugger are equally distributed over instruction bytes.

On the other hand, the metric has strengths:

  • It's relatively simple.
  • It focuses on what we care about.
  • It depends only on the debuginfo, not on the generated code or actual program executions.
  • It's robust to constant scaling of code size.
  • We can calculate the metric for any function or variable, which makes it easy to drill down into the results and lets us rank all functions by the quality of their debuginfo.
  • We can compare the quality of debuginfo between different builds of the same binary at function granularity.
  • It's sensitive to optimization decisions such as inlining; that's OK.

I built a debuginfo-quality tool in Rust to calculate this metric for an arbitrary ELF binary containing DWARF debuginfo. I applied it to the main Firefox binary libxul.so built with clang 8 (8.0.0-svn346538-1~exp1+0~20181109191347.1890~1.gbp6afd8e) and gcc 8 (8.2.1 20181105 (Red Hat 8.2.1-5)) using the default Mozilla build settings plus ac_add_options --enable-debug; for both compilers that sets the most relevant options to -g -Os -fno-omit-frame-pointer. I ignored the Rust compilation units in libxul since they use LLVM in both builds.

In our somewhat arbitrary metric, gcc is significantly ahead of clang for both parameters and local variables. "Parameters" includes the parameters of inlined functions. As mentioned above, the ideal ratio for local variables is actually less than 1, which explains at least part of the difference between parameters and local variables here.

gcc uses some debuginfo features that clang doesn't know about yet. An important one is DW_OP_GNU_entry_value (standardized as DW_OP_entry_value in DWARF 5). This defines a variable (usually a parameter) in terms of an expression to be evaluated at the moment the function was entered. A traditional debugger can often evaluate such expressions after entering the function, by inspecting the caller's stack frame; our Pernosco debugger has easy access to all program states, so such expressions are no problem at all. I evaluated the impact of DW_OP_GNU_entry_value and the related DW_OP_GNU_parameter_ref by configuring debuginfo-quality to treat definitions using those features as missing. (I'm assuming that gcc only uses those features when a variable value is not otherwise available.)

DW_OP_GNU_entry_value has a big impact on parameters but almost no impact on local variables. It accounts for the majority, but not all, of gcc's advantage over clang for parameters. DW_OP_GNU_parameter_ref has almost no impact at all. However, in most cases where DW_OP_GNU_entry_value would be useful, users can work around its absence by manually inspecting earlier stack frames, especially when time-travel is available. Therefore implementing DW_OP_GNU_entry_value may not be as high a priority as these numbers would suggest.

Improving the local variable numbers may be more useful. I used debuginfo-quality to compare two binaries (clang-built and gcc-built), computing, for each function, the difference in the function's definition coverage ratios, looking only at local variables and sorting functions according to that difference:

debuginfo-quality --language cpp --functions --only-locals ~/tmp/clang-8-libxul.so ~/tmp/gcc-8-libxul.so
This gives us a list of functions starting with those where clang is generating the worst local variable information compared to gcc (and ending with the reverse). There are a lot of functions where clang failed to generate any variable definitions at all while gcc managed to generate definitions covering the whole function. I wonder if anyone is interested in looking at these functions and figuring out what needs to be fixed in clang.

Designing and implementing this kind of analysis is error-prone. I've made my analysis tool source code available, so feel free to point out any improvements that could be made.

Sunday, 4 November 2018

What Is "Evil" Anyway?

I found this Twitter thread insightful, given its assumptions. I think that, perhaps inadvertently, it highlights the difficulties of honest discussion of evil in a secular context. The author laments:

It is beyond us, today, to conclude that we have enemies whose moral universe is such that loyalty to our own morality requires us to understand it and them as evil.
That is, evil means moral principles (and the people who hold them) which are incompatible with our own. That definition is honest and logical, and I think probably the best one can do under physicalist assumptions. Unfortunately it makes evil entirely subjective; it means other people can accurately describe us and our principles as evil in just the same way as we describe them as evil. All "evil" must be qualified as "evil according to me" or "evil according to you".

This is a major problem because (almost?) nobody actually thinks or talks about evil this way in day to day life, neither explicitly nor implicitly. Instead we think and act as if "evil" is an objective fact independent of the observer. Try tacking on to every expression of moral outrage the caveat "... but I acknowledge that other people have different moral assumptions which are objectively just as valid". It doesn't work.

Christians and many other monotheists avoid this problem by identifying a privileged frame of moral reference: God's. Our moral universe may or may not align with God's, or perhaps we don't care, or we may have trouble determining what God's is, but at least it lets us define evil objectively.

The Twitter thread raises a further issue: when one encounters evil people — people whose moral universe is incompatible with our own — what shall we do? Without a privileged frame of moral reference, one can't honestly seek to show them they are wrong. At best one can use non-rational means, including force, to encourage them to change their assumptions, or if that fails, perhaps they can be suppressed and their evil neutralized. This too is most unsatisfactory.

The Christian worldview is a lot more hopeful. We believe in a standard by which all will be measured. We believe in God's justice for transgressions. We believe in redemption through Jesus for those who fall short (i.e. everyone). We seek to love those who (for now) reject God's moral universe ... a group which sometimes includes ourselves. We see that even those most opposed or indifferent to God's purposes can change. These beliefs are not purely subjective, but grounded in objective truths about what God has done and is doing in the world.

Thursday, 1 November 2018

Comments on "REPT: Reverse Debugging of Failures in Deployed Software"

It's a pretty good paper! In fact, it won a "best paper" award.

The basic idea is to use Intel PT to gather a control-flow trace and keep just the last segment in a memory buffer, and dump that buffer as part of a crash memory dump. Then they perform some inference based on final memory values and the control-flow trace to figure out what the values of registers must have been during execution leading up to the crash. The inference is non-trivial; they have to use an iterative approach and sometimes make optimistic assumptions that later are corrected. With the register values and memory values they infer, they can provide some amount of reverse-execution debugging over the time interval leading up to the crash, with negligible overhead during normal execution. It's all done in the context of Windows and WinDbg.

One qualm I have about this paper's approach is that their optimistic assumptions mean in some cases they report incorrect data values. They're able to show that their values are correct most of the time, but I would hesitate to show users data that has a significant chance of being incorrect. There might be some room for improvement here, e.g. distinguishing known-correct values from maybe-incorrect values or using more sophisticated confidence estimation.

Overall using PT to capture control flow to be saved in crash dumps seems like a really good idea. Everyone should do that. The details of the inference are probably less important, but some kind of inference like the one in this paper seems really useful for crash dump analysis. I think you'll still want full rr-style recording when you can afford it, though. (I think it's possible that one day we'll be able to have full recording with negligible overhead ... a man can dream!)

I have a couple of quibbles. The paper doesn't describe how they handle the x86 EFLAGS register. This is important because if EFLAGS is part of their register state, then even simple instructions like ADD aren't reversible, because they clobber the flags; but if EFLAGS is not part of their register state, then they can't deduce the outputs of instructions like CMOVxx, SETxx, ADC etc. I asked the lead author and they confirmed they have special handling for EFLAGS.

Another quibble I have is that they don't describe how they chose the bugs to evaluate REPT against, so it's difficult to be confident that they didn't cherry-pick bugs where REPT performs well. Unfortunately, this is typical for computer science papers in areas where there is no accepted standard corpus of test subjects. Our field should raise standards by expecting authors to document protocols that avoid the obvious biases — not just test cherry-picking, but also tuning tools to work well on the same tests we then report results for.

Sunday, 28 October 2018

Auckland Half Marathon 2018

This morning I ran the Auckland Half Marathon, for the sixth year in a row, fifth year barefoot. I got my best time ever: official time 1:45:07, net time 1:44:35. I had to push hard, and I've been sore all day! Glad to get under 1:45 net.

Last year I applied climbing tape to the balls of my feet to avoid the skin wearing through. Unfortunately I lost that tape and forgot to get more, so I tried some small patches of duct tape but they came off very early in the race. Nevertheless my feet held up pretty well, perhaps because I've been running more consistently during the year and perhaps because my feet are gradually toughening up year after year. The road was wet today because it rained overnight, but I can't tell whether that is better or worse for me.

Thursday, 25 October 2018

Problems Scaling A Large Multi-Crate Rust Project

We have 85K lines of Rust code implementing the backend of our Pernosco debugger. To impose some modularity constraints and to reduce build times, from the beginning we organized our code as a large set of crates in a single Cargo workspace in a single Gitlab repository. Currently we have 48 crates. This has mostly worked pretty well but as the number of our crates keeps increasing, we have hit some serious scalability problems.

The most fundamental issue is that many crates build one or more executables — e.g. command-line tools to work with data managed by the crate — and most crates also build an executable containing tests (per standard Rust conventions). Each of these executables is statically linked, and since each crate on average depends on many other crates (both our own and third-party), the total size of the executables is growing at roughly the square of the number of crates. The problem is especially acute for debug builds with full debuginfo, which are about five times larger than release builds built with debug=1 (optimized builds with just enough debuginfo for stack traces to include inlined functions). To be concrete, our 85K line project builds 4.2G of executables in a debug build, and 750M of executables in release. There are 20 command-line tools and 81 test executables, of which 50 actually run no tests (the latter are small, only about 5.7M each).

The large size of these executables slows down builds and creates problems for our Gitlab CI as they have to be copied over the network between build and test phases. But I don't know what to do about the problem.

We could limit the number of test executables by moving all our integration tests into a single executable in some super-crate ... though that would slow down incremental builds because that super-crate would need to be rebuilt a lot, and it would be large.

We could limit the number of command-line tools in a similar way — combine all the tool executables into a super-tool crate that uses the "Swiss Army knife" approach, deciding what its behavior should be by examining argv[0]. Again, this would penalize incremental builds.
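
Here's a hedged sketch of that dispatch in Rust; the tool names and handler functions are hypothetical, not our actual tools.

// "Swiss Army knife" main(): one statically linked binary behaves like
// different tools depending on the name it was invoked as (argv[0]).
use std::env;
use std::path::Path;

fn main() {
    let arg0 = env::args().next().unwrap_or_default();
    let tool = Path::new(&arg0)
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("");
    match tool {
        "dump-trace" => dump_trace_main(), // hypothetical tool
        "query-db" => query_db_main(),     // hypothetical tool
        other => eprintln!("unknown tool name: {:?}", other),
    }
}

fn dump_trace_main() { /* ... */ }
fn query_db_main() { /* ... */ }

Installation would then create one hard link or symlink per tool name, all pointing at the single binary.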

Cargo supports a kind of dynamic linking with its dylib option, but I'm not sure how to get that to work. Maybe we could create a super-crate that reexports every single crate in our workspace, attach all tests and binary tools to that crate, and ask Cargo to link that crate as a dynamic library, so that all the tests and tools are linking to that library. This would also hurt incremental builds, but maybe not as much as the above approaches. Then again, I don't know if it would actually work.
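
For what it's worth, a sketch of such a super-crate's lib.rs might look like the following, with made-up crate names; its Cargo.toml would list each workspace crate as a dependency and set [lib] crate-type = ["dylib"], and the tests and tools would then depend on the super-crate instead of on the individual crates.

// Hypothetical super-crate: re-export every workspace crate so a single
// shared library carries all the common code. Crate names are invented.
pub extern crate query_engine;
pub extern crate trace_storage;
pub extern crate server_protocol;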

Another option would be to break up the project into separate independently built subprojects, but that creates a lot of friction.

Another possibility is that we should simply use fewer, bigger crates. This is probably more viable than it was a couple of years ago, when we didn't have incremental rustc compilation.

I wonder if anyone has hit this problem before, and tried the above solutions or come up with any other solutions.

Wednesday, 24 October 2018

Harmful Clickbait Headline About IT Automation

Over the years a number of parents have asked me whether they should steer their children away from IT/programming careers in case those jobs are automated away in the future. I tell them that the nature of programming work will continue to change but programming will be one of the last jobs to be fully automated; if and when computers don't need to be programmed by people, they will be capable of almost anything. (It's not clear to me why some people are led to believe programming is particularly prone to automation; maybe because they see it as faceless?)

Therefore I was annoyed by the headline in the NZ Herald this morning: "Is AI about to unseat our programmers?" Obviously it's deliberate clickbait, and it wasn't chosen by Juha Saarinen, whose article is actually quite good. But I'm sure some of the parents I've talked to, and many others like them whom I'll never know, will have their fears reinforced by this headline, and perhaps some people will be turned away from programming careers to their detriment, and the detriment of New Zealand's tech industry.

Monday, 22 October 2018

The Fine Line Between Being A Good Parent And A Bad Parent

Two incidents will illustrate.

Early 2013: I take my quite-young kids to hike to the summit of Mt Taranaki. The ascent is more grueling than I expected; there's a long scree slope which is two steps forward, one step back. My kids start complaining, then crying. I have to decide whether to turn back or to cajole them onward. There are no safety issues (the weather is perfect and it's still early in the day), but the stakes feel high: if I keep pushing them forward but we eventually fail, I will have made them miserable for no good reason, and no-one likes a parent who bullies their kids. I roll the dice and press on. We make it! After the hike, we all feel it was a great achievement and the kids agree we did the right thing to carry on.

Two weeks ago: I take my kids and a couple of adult international students from our church on an overnight hiking trip to the Coromandel Peninsula. On Friday we hike for four hours to Crosbies Hut and stay there overnight. It's wonderful — we arrive at the hut around sunset in glorious weather, eat a good meal, and the night sky is awesome. The next day I return to our starting point, pick up the car and drive around to Whangaiterenga campsite in Kauaeranga Valley so my kids and our guests can descend into the valley by a different route that crosses Whangaiterenga stream a few times. I had called the visitor's centre on Friday to confirm that that track is open and the stream crossings are easy. My kids are now quite experienced (though our guests aren't) and should be able to easily handle this on their own. I get to the pickup point ahead of schedule, but two hours after I expected them to arrive, they still haven't :-(.

To cut the story short, at that point I get a text message from them and after some communication they eventually walk out five hours late. They were unable to pick up the trail after the first stream crossing (maybe it was washed out), and had to walk downstream for hours, also taking a detour up a hill to get phone reception temporarily. The kids made good decisions and gained a lot of confidence from handling an unexpected situation on their own.

What bothers me is that both of these situations could easily have turned out differently. In neither case would there have been any real harm — the weather in Coromandel was excellent and an unexpected night in the bush would have been perfectly safe given the gear they were carrying (if indeed they weren't found before dark). Nevertheless I can see that my decisions could have looked bad in hindsight. If we make a habit of taking these kinds of small risks — and I think we should! — then not all of them are going to pay off. I think, therefore, we should be forgiving of parents who take reasonable risks even if they go awry.

Tuesday, 2 October 2018

The Costs Of Programming Language Fragmentation

People keep inventing new programming languages. I'm surprised by how many brand-new languages are adopted by more than just their creators, despite the network effects that would seem to discourage such adoption. Good! Innovation and progress in programming languages depend on such adoption. However, let's not forget that fragmentation of programming languages reduces the sum of those beneficial network effects.

One example is library ecosystems. Every new language needs a set of libraries for commonly used functionality. Some of those libraries can be bindings to existing libraries in other languages, but it's common for new languages to trigger reimplementation of, e.g., container data structures, HTTP clients, and random number generators. If the new language did not exist, that effort could have been spent on improving existing libraries or some other useful endeavour.

Another example is community support. Every new language needs an online community (IRC, StackOverflow, etc) for developers to help one another with questions. Fragmenting users across communities makes it harder for people to find answers.

Obviously the effort needed to implement and maintain the languages and runtimes themselves represents a cost, since focusing that effort on a smaller number of languages would normally mean better results.

I understand the appeal of creating new programming languages from scratch; like other green-field development, the lure of freedom from other people's decisions is hard to resist. I understand that people's time is their own to spend. However, I hope people consider carefully the social costs of creating a new programming language especially if it becomes popular, and understand that in some cases creating a popular new language could actually be irresponsible.

Tuesday, 25 September 2018

More Realistic Goals For C++ Lifetimes 1.0

Over two years ago I wrote about the C++ Lifetimes proposal and some of my concerns about it. Just recently, version 1.0 was released with a blog post by Herb Sutter.

Comparing the two versions shows many important changes. The new version is much clearer and more worked-out, but there are also significant material changes. In particular the goal has changed dramatically. Consider the "Goal" section of version 0.9.1.2: (emphasis original)

Goal: Eliminate leaks and dangling for */&/iterators/views/ranges
We want freedom from leaks and dangling – not only for raw pointers and references, but all generalized Pointers such as iterators—while staying true to C++ and being adoptable:
1. We cannot tolerate leaks (failure to free) or dangling (use-after-free). For example, a safe std:: library must prevent dangling uses such as auto& bad = vec[0]; vec.push_back(); bad = 42;.
Version 1.0 doesn't have a "Goal" section, but its introduction says
This paper defines the Lifetime profile of the C++ Core Guidelines. It shows how to efficiently diagnose many common cases of dangling (use-after-free) in C++ code, using only local analysis to report them as deterministic readable errors at compile time.
The new goal is much more modest, I think much more reasonable, and highly desirable! (Partly because "modern C++" has introduced some extremely dangerous new idioms.)

The limited scope of this proposal becomes concrete when you consider its definition of "Owner". An Owner can own at most one type of data and it has to behave much like a container or smart pointer. For example, consider a data structure owning two types of data:

#include <memory>

// X owns two different types of data, so it cannot be an "Owner" in the
// proposal's sense.
class X {
public:
    X() : a(new int(0)), b(new char(0)) {}
    int* get_a() { return &*a; }
    char* get_b() { return &*b; }
private:
    std::unique_ptr<int> a;
    std::unique_ptr<char> b;
};
This structure cannot be an Owner. It is also not an Aggregate (a struct/class with public fields whose fields are treated as separate variables for the purposes of analysis). It has to be a Value. The analysis has no way to refer to data owned by Values; as far as I can tell, there is no way to specify or infer accurate lifetimes for the return values of get_a and get_b, and apparently in this case the analysis defaults to conservative assumptions that do not warn. (The full example linked above has a trivial dangling pointer with no warnings.) I think this is the right approach, given the goal is to catch some common errors involving misuse of pointers, references and standard library features. However, people need to understand that code free of C++ Lifetime warnings can still easily cause memory corruption. (This vindicates the title of my previous blog post to some extent; insofar as C++ Lifetimes was intended to create a safe subset of C++, that promise has not eventuated.)

The new version has much more emphasis on annotation. The old version barely mentioned the existence of a [[lifetime]] annotation; the new version describes it and shows more examples. It's now clear you can use [[lifetime]] to group function parameters into lifetime-equivalence classes, and you can also annotate return values and output parameters.

The new version comes with a partial Clang implementation, available on godbolt.org. Unfortunately that implementation seems to be very incomplete. For example, the following buggy program is accepted without warnings:

int& f(int& a) {
    return a;
}
int& hello() {
    int x = 0;
    return f(x);  // returns a reference to the local x, which dangles
}
It's pretty clear from the spec that this should report a warning, and the corresponding program using pointers does produce a warning. OTOH there are some trivial false positives I don't understand:
int* hello(int*& a) {
    return a;
}
:2:5: warning: returning a dangling Pointer [-Wlifetime]
    return a;
    ^
:1:12: note: it was never initialized here
int* hello(int*& a) {
           ^
The state of this implementation makes it unreliable as a guide to how this proposal will work in practice, IMHO.

Monday, 17 September 2018

The Danger Of GMail's "Smart Replies"

At first I was annoyed by GMail's "Smart Reply" buttons because they represent a temptation to delegate (more) shaping of my human interactions to Google's AI ... a temptation that, for some reason known only to Google, can be disabled in the GMail mobile app but not the desktop Web client. I do not want the words I use to communicate, or the words others use to communicate to me, to be shaped by the suggestions of an algorithm that is most likely opaque even to its masters, let alone a mere consumer like me.

I just realized, though, that they're potentially a lot worse than that. I got an email suggesting I take an action, and the suggested "smart replies" are:

  • Sounds like a good idea.
  • I like that idea.
  • Yes, I agree.
But ... what if I don't agree? Does showing me only positive responses actually prime my brain to make me more likely to agree? Is it possible to tweak the wording of an email to ensure the algorithm produces responses of a particular type? (Probably.) More importantly, did anyone at Google actually consider and study such effects before rolling out this feature? Or did the team just roll out the feature, collect the bonus, and move on? If they did study it, are the results public and what were they? Wouldn't it be wise to require this kind of study and disclosure before subtly interfering with the cognitive processes of hundreds of millions of people?

For now I'm switching back to GMail Classic, and when (I assume) Google forces the new UI on me anyway, the path of least resistance will be to use a Firefox extension to block the Smart Reply buttons (yay Web!). Of course hundreds of millions of people will unwittingly submit to Google's reckless mental meddling.

Tuesday, 4 September 2018

"Crazy Rich Asians"

Pretty good movie. A few observations... (Spoilers!)

I don't know what the ultra-rich really get up to, but for me the most absurd part of the movie was the MJ scene. Eleanor's early hand was trash; there was no way she could have amassed the pungs-and-bamboos winning hand she did, not with Rachel also collecting bamboos.

Maybe I misunderstood everything, but didn't the Astrid-Michael subplot undermine the main plot by proving Eleanor was right all along? Michael and Astrid set aside their different backgrounds and family disapproval to marry (presumably for love), but Michael couldn't cope with the pressure and ruined their marriage ... just like Eleanor fears will happen with Rachel. Main plot: true love wins! Subplot: ... er no it doesn't.

The entire movie screams "FIRST WORLD PROBLEMS". In particular the idea that a man like Michael could not simply be grateful for his situation is marginally plausible but nearly unforgivable.

My source tells me the actors' Cantonese was pretty bad.

I'd watch Michelle Yeoh read the phone book.

Saturday, 1 September 2018

Rangitoto Fog

Visiting Rangitoto is one of my favourite things to do in Auckland. Catch the 9:15am ferry from the downtown terminal, arrive on the island just before 10am, walk up to the top, see the incredible views over Auckland and the Hauraki Gulf, and then back down via the lava caves and easily make the 12:45pm ferry getting you back to the city by 1:30pm. You've experienced a unique 600-year-old island with extraordinary geology, flora and fauna, and had a good walk, in four hours.

Today was extra-special. Very thick fog blanketed the harbour and inner Gulf, and the ferry proceeded very slowly to the island; the trip that normally takes 30 minutes took 75. We passed a number of becalmed yachts that apparently were supposed to be racing, but instead were drifting aimlessly through the fog. It was surreal. Once we finally reached the island and headed inland, we almost immediately left the fog, but the fog left behind spiderwebs sparkling with dew and rocks steaming in the sun. From Rangitoto's summit we could still see large fog banks covering Waiheke Island, Motuihe Island, and much of the inner Gulf. It was wonderful!

Friday, 24 August 2018

Long Live The Desktop Computer

Eight years ago I bought a Dell Studio XPS 8100 desktop for a home computer at a moderate price (NZD 3,100). I've just replaced a failing 1TB hard drive with a 500GB SSD, but other than that I've done no upgrades. What's interesting to me is that it's still a perfectly good machine: quad-core i7, 12GB RAM, NVIDIA GPU with 2GB VRAM. Everything I do, this machine could still do well, including software development for work. I guess if I wanted to play the latest AAA game titles or use a 4K monitor on it, I'd be unhappy, but I can't think of anything else I'd even consider doing that would be a problem, and those could be addressed by upgrading the video card. If this machine doesn't fail catastrophically I can see us continuing to use it for many more years. (I run Linux on it; the situation might be different if it was Windows.)

This is interesting because up until 2010 I'd been in the habit of upgrading computers at least every five years because they would improve dramatically over that time in ways that mattered to me. That stopped happening. It hasn't entirely stopped for everyone — Mozilla developers are getting new desktops with double-digit numbers of cores to speed up Firefox builds — but I run my heavy-duty workloads in the cloud now, because really big machines aren't efficiently utilized by a single developer. I guess the economics of utilization and colocation will make cloud-based heavy lifting (not necessarily public clouds) increasingly prevalent over time.

One of the implications is that declining desktop sales don't necessarily mean declining desktop usage. I think they must at least partly reflect longer upgrade cycles.

Another implication is that component reliability for desktops is becoming more important. It doesn't really matter if parts wear out after five years, if you're going to replace the whole machine before then anyway. If the expected lifespan of a machine is fifteen years, it's worth buying more reliable parts.

Another implication is that longevity bottlenecks might shift to relatively minor features, like what types of USB ports your machine has. I guess some of this can be alleviated by upgrades and dongles but it's worth thinking about.

Friday, 17 August 2018

ASAN And LSAN Work In rr

AddressSanitizer has worked in rr for a while. I just found that LeakSanitizer wasn't working and landed a fix for that. This means you can record an ASAN build and if there's an ASAN error, or LSAN finds a leak, you can replay it in rr knowing the exact addresses of the data that leaked — along with the usual rr goodness of reverse execution, watchpoints, etc. Well, hopefully. Report an issue if you find more problems.

Interestingly, LSAN doesn't work under gdb, but it does work under rr! LSAN uses the ptrace() API to examine threads when it looks for leaks, and it can't ptrace a thread that gdb is already ptracing (the ptrace design deeply relies on there being only one ptracer per thread). rr uses ptrace too, but when one rr tracee thread tries to ptrace another rr tracee thread, rr emulates the ptrace calls so that they work as if rr wasn't present.

Tuesday, 14 August 2018

Diagnosing A Weak Memory Ordering Bug

For the first time in my life I tracked a real bug's root cause to incorrect usage of weak memory orderings. Until now weak memory bugs were something I knew about but had subconsciously felt were only relevant to wizards coding on big iron, partly because until recently I've spent most of my career using desktop x86 machines.

Under heavy load a Pernosco service would assert in Rust's std::thread::Thread::unpark() with the error "inconsistent state in unpark". Inspecting the code led to the disturbing conclusion that the only way to trigger this assertion was memory corruption; the value of self.inner.state should always be between 0 and 2 inclusive, and if so then we shouldn't be able to reach the panic. The problem was nondeterministic but I was able to extract a test workload that reproduced the bug every few minutes. I tried recording it in rr chaos mode but was unable to reproduce it there (which is not surprising in hindsight since rr imposes sequential consistency).

With a custom panic handler I was able to suspend the process in the panic handler and attach gdb to inspect the state. Everything looked fine; in particular the value of self.inner.state was PARKED so we should not have reached the panic. I disassembled unpark() and decided I'd like to see the values of registers in unpark() to try to determine why we took the panic path, in particular the value of self.inner (a pointer) loaded into RCX and the value of self.inner.state loaded into RAX. Calling into the panic handler wiped those registers, so I manually edited the binary to replace the first instruction of the panic handler with UD2 to trigger an immediate core-dump before registers were modified.

The core-dump showed that RCX pointed to some random memory and was not equal to self.inner, even though we had clearly just loaded it from there! The value of state in RAX was loaded correctly via RCX, but was garbage because we were loading from the wrong address. At this point I formed the theory the issue was a low-level data race, possibly involving relaxed memory orderings — particularly because the call to unpark() came from the Crossbeam implementation of Michael-Scott lock-free queues. I inspected the code and didn't see an obvious memory ordering bug, but I also looked at the commit log for Crossbeam and found that a couple of memory ordering bugs had been fixed a long time ago; we were stuck on version 0.2 while the released version is 0.4. Upgrading Crossbeam indeed fixed our bug.
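
To illustrate the class of bug (this is not the Crossbeam code, just a hedged sketch), here's the classic flag-and-data pattern where a relaxed ordering is wrong on weakly ordered hardware:

// Illustrative sketch only. With Relaxed on the READY store and load, a
// reader that observes READY == true is not guaranteed to observe the
// earlier write to DATA on weakly ordered hardware; Release/Acquire (or
// SeqCst) on READY restores that guarantee.
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};

static DATA: AtomicUsize = AtomicUsize::new(0);
static READY: AtomicBool = AtomicBool::new(false);

fn publish() {
    DATA.store(42, Ordering::Relaxed);
    READY.store(true, Ordering::Release); // Relaxed here would be the bug
}

fn consume() -> Option<usize> {
    if READY.load(Ordering::Acquire) {    // Relaxed here would be the bug
        Some(DATA.load(Ordering::Relaxed))
    } else {
        None
    }
}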

Observation #1: stick to sequential consistency unless you really need the performance edge of weaker orderings.

Observation #2: stick to sequential consistency unless you are really, really smart and have really really smart people checking your work.

Observation #3: it would be really great to have user-friendly tools to verify the correctness of unsafe, weak-memory-dependent code like Crossbeam's.

Observation #4: we need a better way of detecting when dependent crates have known subtle correctness bugs like this (security bugs too). It would be cool if the crates.io registry knew about deprecated crate versions and cargo build warned about them.

Monday, 13 August 2018

The Parallel Stream Multiplexing Problem

Imagine we have a client and a server. The client wants to create logical connections to the server (think of them as "queries"); the client sends a small amount of data when it opens a connection, then the server sends a sequence of response messages and closes the connection. The responses must be delivered in-order, but the order of responses in different connections is irrelevant. It's important to minimize the start-to-finish latency of connections, and the latency between the server generating a response and the client receiving it. There could be hundreds of connections opened per second and some connections produce thousands of response messages. The server uses many threads; a connection's responses are generated by a specific server thread. The client may be single-threaded or use many threads; in the latter case a connection's responses are received by a specific client thread. What's a good way to implement this when both client and server are running in the same OS instance? What if they're communicating over a network?

This problem seems quite common: the network case closely resembles a Web browser fetching resources from a single server via HTTP. The system I'm currently working on contains an instance of this internally, and communication between the Web front end and the server also looks like this. Yet even though the problem is common, as far as I know it's not obvious or well-known what the best solutions are.

A standard way to handle this would be to multiplex the logical connections into a single transport. In the local case, we could use a pair of OS pipes as the transport, a client-to-server pipe to send requests and a server-to-client pipe to return responses. The client allocates connection IDs and the server attaches connection IDs to response messages. Short connections can be very efficient: a write syscall to open a connection, a write syscall to send a response, maybe another write syscall to send a close message, and corresponding read syscalls. One possible problem is server write contention: multiple threads sending responses must make sure the messages are written atomically. In Linux this happens "for free" if your messages are all smaller than PIPE_BUF (4096), but if they aren't you have to do something more complicated, the simplest being to hold a lock while writing to the pipe, which could become a bottleneck for very parallel servers. There is a similar problem with client read contention, which is mixed up with the question of how you dispatch received responses to the thread reading from a connection.
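
As a concrete sketch of that framing (the layout of connection ID, then payload length, then payload is my assumption here, not a description of our actual protocol), a server thread could write small responses without taking a lock:

// Sketch: frame = connection ID + payload length + payload. On Linux a
// single write of at most PIPE_BUF (4096) bytes to a pipe is atomic, so
// small frames from multiple server threads don't interleave; larger
// frames would need a lock or chunking.
use std::io::{self, Write};

const PIPE_BUF: usize = 4096;

fn write_response<W: Write>(pipe: &mut W, conn_id: u32, payload: &[u8]) -> io::Result<()> {
    let mut frame = Vec::with_capacity(8 + payload.len());
    frame.extend_from_slice(&conn_id.to_le_bytes());
    frame.extend_from_slice(&(payload.len() as u32).to_le_bytes());
    frame.extend_from_slice(payload);
    assert!(frame.len() <= PIPE_BUF, "large frames need a lock or chunking");
    pipe.write_all(&frame)
}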

A better local approach might be for the client to use an AF_UNIX socket to send requests to the server, and with each request message pass a file descriptor for a fresh pipe that the server should use to respond to the client. It requires a few more syscalls but client threads require no user-space synchronization, and server threads require no synchronization after the dispatch of a request to a server thread. A pool of pipes in the client might help.

The network case is harder. A naive approach is to multiplex the logical connections over a TCP stream. This suffers from head-of-line-blocking: a lost packet can cause delivery of all messages to be blocked while the packet is retransmitted, because all messages across all connections must be received in the order they were sent. You can use UDP to avoid that problem, but you need encryption, retransmits, congestion control, etc so you probably want to use QUIC or something similar.

The Web client case is interesting. You can multiplex over a WebSocket much like a TCP stream, with the same disadvantages. You could issue an HTTP request for each logical connection, but this would limit the number of open connections to some unknown maximum, and could have even worse performance than the Websocket if the browser and server don't negotiate QUIC + HTTP2. A good solution might be to multiplex the connections into a RTCDataChannel in non-ordered mode. This is probably quite simple to implement in the client, but fairly complex to implement in the server because the RTCDataChannel protocol is complicated (for good reasons AFAIK).

This multiplexing problem seems quite common, and its solutions interesting. Maybe there are known best practices or libraries for this, but I haven't found them yet.

Monday, 30 July 2018

Gerv

I'm sad that Gerv is no longer with us, but I'm also glad because I'm confident he is in the presence of Jesus, awaiting the final resurrection.

I never spent very much time with him, but I really appreciated getting together at Mozilla events with Gerv and a small group of other Mozilla Christians to pray every morning. That tradition continues, and long may it do so!

I have always been inspired by the way Gerv and his family lived their lives to the full, to the glory of God, in the face of his long illness. I've had a sheltered life of little contact with sickness and death, but that will probably not last, and I expect in times to come I will treasure Gerv's example.

Wednesday, 11 July 2018

Why Isn't Debugging Treated As A First-Class Activity?

Mark Côté has published a "vision for engineering workflow at Mozilla": part 2, part 3. It sounds really good. These are its points:

  • Checking out the full mozilla-central source is fast
  • Source code and history is easily navigable
  • Installing a development environment is fast and easy
  • Building is fast
  • Reviews are straightforward and streamlined
  • Code is landed automatically
  • Bug handling is easy, fast, and friendly
  • Metrics are comprehensive, discoverable, and understandable
  • Information on “code flow” is clear and discoverable

Consider also Gitlab's advertised features:

  • Regardless of your process, GitLab provides powerful planning tools to keep everyone synchronized.
  • Create, view, and manage code and project data through powerful branching tools.
  • Keep strict quality standards for production code with automatic testing and reporting.
  • Deploy quickly at massive scale with integrated Docker Container Registry.
  • GitLab's integrated CI/CD allows you to ship code quickly, be it on one - or one thousand servers.
  • Configure your applications and infrastructure.
  • Automatically monitor metrics so you know how any change in code impacts your production environment.
  • Security capabilities, integrated into your development lifecycle.

One thing developers spend a lot of time on is completely absent from both of these lists: debugging! Gitlab doesn't even list anything debugging-related in its missing features. Why isn't debugging treated as worthy of attention? I genuinely don't know — I'd like to hear your theories!

One of my theories is that debugging is ignored because people working on these systems aren't aware of anything they could do to improve it. "If there's no solution, there's no problem." With Pernosco we need to raise awareness that progress is possible and therefore debugging does demand investment. Not only is progress possible, but debugging solutions can deeply integrate into the increasingly cloud-based development workflows described above.

Another of my theories is that many developers have abandoned interactive debuggers because they're a very poor fit for many debugging problems (e.g. multiprocess, time-sensitive and remote workloads — especially cloud and mobile applications). Record-and-replay debugging solves most of those problems, but perhaps people who have stopped using a class of tools altogether stop looking for better tools in the class. Perhaps people equate "debugging" with "using an interactive debugger", so when trapped in "add logging, build, deploy, analyze logs" cycles they look for ways to improve those steps, but not for tools to short-circuit the process. Update: This HN comment is a great example of the attitude that if you're not using a debugger, you're not debugging.

Sunday, 24 June 2018

Yosemite: Clouds Rest And Half Dome

On Saturday morning, immediately after the Mozilla All Hands, I went with some friends to Yosemite for an outstanding five-night, five-day hiking-and-camping trip! We hiked from the Cathedral Lakes trailhead all the way down to Yosemite Valley, ascending Clouds Rest and Half Dome along the way. The itinerary:

  • Saturday night: camped at Tuolumne Meadows
  • Sunday: hiked from Cathedral Lakes trailhead past the Cathedral Lakes to Sunrise High Sierra Camp
  • Monday: hiked from Sunrise HSC past the Sunrise Lakes to camp just north of Clouds Rest
  • Tuesday: hiked up and over Clouds Rest and camped just north of the trail leading up to Half Dome
  • Wednesday: left most of our gear in camp, climbed Half Dome, returned to camp, and hiked down to camp in Little Yosemite Valley
  • Thursday: hiked out to Yosemite Valley

Apart from the first day, each day was relatively short in terms of distance, but the first few days were quite strenuous regardless because of the altitude. I've never spent much time above 2500m and I was definitely unusually short of breath. The highest points on the trail were around 3000m, where the air pressure was down to 700 millibars.

The weather was (predictably) good: cold at night the first couple of nights, warmer later, but always warm and sunny during the day.

We saw lots of animals — deer, marmots, chipmunks, woodpeckers, other birds, lizards, and other animals you don't see in New Zealand. Also lots of interesting trees, flowers and other plants.

The mosquitoes at Sunrise HSC were terrible in the morning! My friend said it was the worst he'd ever seen, even having grown up in South Florida.

I've never camped for so many consecutive nights before — in New Zealand we usually stay in huts. I got to use my "squeeze bag" mechanical water filter a lot; it works very well and doesn't have the latency of the chemical purifiers.

Swimming in the Merced River at Little Yosemite Valley after a hot day felt very good!

I thought my fear of heights would kick in climbing the cables to get to the top of Half Dome, but it didn't at all. The real challenge was upper body strength, using my arms to pull myself up the cables — my strength is all in my legs.

Needless to say, Clouds Rest and Half Dome had amazing views and they deserve their iconic status. I'm very thankful to have had the chance to visit them.

My companions on the trip were also great, a mix of old friends and new. Thank you.

Monday, 11 June 2018

Bay Area Visit

I'm on my way to San Francisco for a guest visit to the Mozilla All Hands ... thanks, Mozilla!

After that, I'll be taking a break for a few days and going hiking with some friends. Then I'll be spending a week or so visiting more people to talk about rr and Pernosco. Looking forward to all of it!

Sunday, 10 June 2018

Crypto-Christians In Tech

This is not about cryptocurrencies; for that, watch this. Nor is it about cryptography. It's about the hidden Christians working in tech.

I sometimes get notes from Christian tech people thanking me for being open about my Christian commitment, because they feel that few of their colleagues are. That matches my experience, but it's a combination of factors: most tech people aren't Christians, but more are than you think — they're just not talking about it. Both of these are sad, but I expect the former. The latter is more problematic. I would encourage my brothers and sisters in tech to shine brighter. Here are some concerns I've had — or heard — over the years:

What can I do without being a jerk?
When asked what I did during the weekend, I say I worshiped the Creator. Sometimes I just say I went to church.

Sometimes I write blog posts about Christ. People don't have to read them if they don't want to.

I used to put Christian quotes in my email signature, but I got bashed over that and decided it wasn't worth fighting over. Now my email signatures are obscured. Those who seek, find. I should try emoji.

Sometimes I'm probably a jerk. Sorry!

Won't my career suffer?
It may. People I've worked with (but not closely) have told me they look down on me because I'm a Christian. Surely more have thought so, but not said so. But Jesus is super-clear that we need to take this on the chin and respond with love.

I don't want to be associated with THOSE OTHER Christians.
I know, right? This is a tough one because the easy path is to disavow Christians who embarrass us, but I think that is often a mistake. I could write a whole post about this, but Christians need unity and that sometimes means gritting our teeth and acknowledging our relationship with people who are right about Christ and wrong about everything else.

Another side of this is that if your colleagues only know of THOSE OTHER Christians (or perhaps just those who are particularly thick-skinned or combative), they need you to show them an alternative.

Woah, persecution!
No. Claiming I've ever experienced persecution would embarrass me among my brothers and sisters who really have.

People are generally very good about it, especially in person. People who are jerks about it generally turn out to be jerks to everyone. In the long run it will reduce the number of awkward conversations people have around you about how awful those Christians are, not knowing where you stand. But this is not about our comfort anyway.

What if I screw up and give people a bad impression?
Bad news: you will. Good news: if you were perfect, you might give people the false impression that Christianity is about being a good person (or worse, trying to make other people "good"). But of course it isn't: it's about us recognizing our sin, seeking reconciliation with people and God, and obtaining forgiveness through Christ; not just once, but every day. How can we demonstrate that if we never fail?

Saturday, 26 May 2018

rr 5.2.0 Released

I released rr 5.2.0. This is a minor update that fixes some bugs, improves chaos mode a bit, and adds some features to help with trace portability. If rr's working for you, you probably don't need to upgrade.

Friday, 25 May 2018

Intel CPU Bug Affecting rr Watchpoints

I investigated an rr bug report and discovered an annoying Intel CPU bug that affects rr replay using data watchpoints. It doesn't seem to be hit very often in practice, which is good because I don't know any way to work around it. It turns out that the bug is probably covered by an existing Intel erratum for Skylake and Kaby Lake (and probably later generations, but I'm not sure), which I even blogged about previously! However, the erratum does not mention watchpoints and the bug I've found definitely depends on data watchpoints being set.

I was able to write a stand-alone testcase to characterize the bug. The issue seems to be that if a rep stos (and probably rep movs) instruction writes between 1 and 64 bytes (inclusive), and you have a read or write watchpoint in the range [64, 128) bytes from the start of the writes (i.e., not triggered by the instruction), then one spurious retired conditional branch is (usually) counted. The alignment of the writes does not matter, and it's not related to speculative execution.

If you find rr failing during replay with watchpoints set, and the failures go away if you remove the watchpoints, it could well be this bug. Broadwell and earlier don't seem to have the bug.

A possible workaround would be to disable "fast-string optimization" in the kernel at boot time. I don't think there's any way for users to force this to happen in Linux, currently, but someone could write a kernel patch adding a command-line option for that and send it upstream. It would be great if they did!

Fortunately this sort of bug does not affect Pernosco.

Update: Pernosco

In the two years since I left Mozilla I've been working on a new debugger (with Kyle Huey, most of that time). We call it Pernosco — "know thoroughly". Pernosco takes rr recordings of failing runs, analyzes them in the cloud, and provides a Web interface for developers to debug them. It uses components of rr but the debugger implementation is completely different. Pernosco's data-oriented approach has features and performance that rr's approach cannot match — and nor can any other debugger. We've reached the point where external users have used Pernosco to fix real bugs, and their response has been very enthusiastic. It's exciting!

Developers spend a lot of time figuring out bugs. We think Pernosco will create a lot of value for software development organizations. We intend to capture some of that value by offering Pernosco as a paid cloud service. We have many ideas for making debugging easier and more fun, and we need a sustainable business model to make them happen. It's especially important because debugging is an area that has been chronically under-invested in across the industry.

Saturday, 12 May 2018

rr Chaos Mode Improvements

rr's chaos mode introduces nondeterminism while recording application execution, to try to make intermittent bugs more reproducible. I'm always interested in hearing about bugs that cannot be reproduced under chaos mode, especially if those bugs have been diagnosed. If we can figure out why a bug was not reproducible under chaos mode, we can often extend chaos mode to make it reproducible, and this improves chaos mode for everyone. If you encounter such a bug, please file an rr issue about it.

I just landed one such improvement. To trigger a specific Spidermonkey JS engine bug, some thread X had to do a FUTEX_WAKE to wake up thread Y, then immediately yield to let thread Y run for a while without X running any further. rr chaos mode assigns random priorities to threads and strictly adheres to them, so in some runs it would assign X a low priority and Y a high priority and schedule Y whenever both were runnable. However, rr's syscall buffering optimization means the rr supervisor process is not notified after the FUTEX_WAKE and has no opportunity to interrupt X and schedule Y instead, so we keep running the lower-priority X thread, violating our scheduling policy. (Chaos mode randomizes scheduling intervals so it was possible for X to yield at the right point, but very unlikely because the "window of vulnerability" is very small.) The fix is quite easy: in chaos mode, FUTEX_WAKE should not use the syscall buffering optimization. This adds some overhead, but hopefully not all that much, because every FUTEX_WAKE is normally paired with a FUTEX_WAIT (futex-using code should not issue a FUTEX_WAKE if there are no waiters), and a FUTEX_WAIT yields, which is already an expensive operation.

The same sorts of issues exist for other system calls that can make another higher-priority thread runnable, and I've added some slightly more elaborate fixes for those.

One day I should do a proper evaluation of these techniques and publish them...

Wednesday, 9 May 2018

Research Wishlist: A Filesystem For Efficient Host-Guest File Sharing

Suppose you have a guest virtual machine and you'd like to give it read-only access to a set of (large) files created and managed in the host OS (or possibly a different guest VM). Current options for this are inefficient. You can use a network filesystem like 9pfs or NFS to share the files, but that means every read copies the data from host to guest, which means the data exists in two places in system RAM, and throughput is limited. Latency is also bad in practice. Alternatively you can copy the files into a filesystem on the guest, after which access is efficient, but the copy itself is slow especially if the guest won't read all the data, plus of course you have to manage replication and you need twice the persistent storage.

It seems to me a custom-designed host-guest filesystem could share data pages between host and guest, allowing very efficient I/O with no extra copies or memory footprint, as well as fully supporting mmap(), even for writing (i.e., you can have shared-memory communication between applications in different VMs). I imagine the host kernel would mmap() files into guest physical memory and the guest filesystem driver would use those pages as the data pages for its view of the files, similar to the way persistent memory storage devices work. Non-read/write operations such as open, close, directory reading, etc, could be hypercalls from the guest into the host, which I imagine can be quite fast. With this approach, guest reads and writes would have minimal added latency over host reads and writes: for guest-cached pages, none at all, and for guest-cache misses, just the cost of nested page faults.

Even a read-only version of this would be very useful to us. I'm surprised that it doesn't already exist. Maybe it does and it's just not well known? If it doesn't exist, it seems to me this would be an excellent grad-student OS research project. It doesn't sound very difficult. Stanford's Disco project had something like this 20 years ago.

Update: Kata Containers (aka Intel Clear Containers) describes something pretty close to what I want, using DAX and QEMU's nvdimm virtualization to map some host files into the guest. But it sounds like you can only map a filesystem image file into the guest, rather than accessing host files on a file-by-file basis, and then it wouldn't be possible to update files from the host.

Wednesday, 2 May 2018

Priority Is Overrated

We give scientists great credit for priority, i.e. being the first to discover something important — often far too much, because priority is often a matter of happenstance and does not reflect the ability of the discoverer.

Let me use my own work as an example (neatly projecting false humility while being self-promoting). Probably my most influential paper was the PPoPP '03 paper I wrote with Jong-Deok Choi on hybrid dynamic data race detection — 400+ citations so far. (Incidentally that paper was rejected by PLDI, a good reminder that publication reviews are a bit of a lottery.) The key idea was to combine two different existing dynamic race detection methods, lockset and happens-before, applying the expensive happens-before detection to a "synchronizing" subset of memory accesses. AFAIK we were the first to publish that idea, and it has been used by most practical race detectors since. Our paper is the top hit when I search for "hybrid race detection". Does describing it first imply that Jong and I were especially talented? Absolutely not.

For one thing, a very similar approach combining happens-before checks with lock-based analysis was present in the paper "Detecting Access Anomalies In Programs With Critical Sections" by Dinning and Schonberg in 1991 (which we cited). Our work was new and good, but it wasn't unprecedented.

More importantly, the whole field of dynamic data race detection was picking up steam. Both dynamic happens-before and lockset algorithms were prominent and it was only a matter of time before someone combined them. If we hadn't published our work, others would certainly soon have published something similar. We happened to be in the right place at the right time.

One could argue that "being in the right place at the right time" is a very important skill, and that our paper is evidence Jong and I were good at selecting timely problems that would let us write influential papers. That's a useful skill if your goal is to write papers and you don't particularly care what they're about. It's less useful if you have a specific problem to solve.

This matters when you need to evaluate the ability of researchers. When making hiring or funding decisions with some specific problem in mind, be careful not to overweight "first to discover X" facts as a predictor of future success. Ditto if you're looking for teaching or advice.

Tuesday, 1 May 2018

rr Trace Portability: x87 "Data Pointer" Broken On Skylake

x87 math instructions have an odd feature where the address of the last memory operand of an x87 instruction is saved in an "FPU Data Pointer" register. This value is stored into memory by an FXSAVE, XSAVE or similar instruction. (For some instructions only the low 32 bits are stored.) On a Broadwell machine this works as expected, but on my Skylake laptop the value is zero even immediately after executing a simple FLD instruction. This program shows the bug:

#include <stdio.h>
#include <stdint.h>
static char buf[10] = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 };
/* FXSAVE requires a 16-byte-aligned 512-byte save area. */
static char fxsave_buf[512] __attribute__((aligned(16)));
int main() {
  uint32_t eax, ebx, ecx, edx;
  /* CPUID leaf 7, subleaf 0; CPUID writes all four registers. */
  asm("cpuid" : "=a"(eax), "=b"(ebx), "=c"(ecx), "=d"(edx) : "a"(7), "c"(0));
  printf("cpuid(7,0) ebx=0x%x; FDP_EXCPTN_ONLY=%s\n",
         ebx, (ebx & (1 << 6)) ? "true" : "false");
  /* Load an 80-bit value from buf, then dump x87 state with FXSAVE. */
  asm("fldt (%%rbx)\n"
      "fxsave (%%rcx)\n"
      : : "b"(buf), "c"(fxsave_buf) : "memory");
  /* The low 32 bits of the FPU Data Pointer are at offset 16 of the FXSAVE area. */
  uint32_t fdp = *(uint32_t*)(fxsave_buf + 16);
  printf("fdp = 0x%x, buf addr = %p; %s\n",
         fdp, buf,
         fdp == (uint32_t)(uintptr_t)buf ? "OK" : "ERR");
  return 0;
}

A mysterious barely-documented CPUID feature is related. EAX=7, ECX=0:

EBX Bit 06: FDP_EXCPTN_ONLY. x87 FPU Data Pointer updated only on x87 exceptions if 1.
That bit is 0 on my Skylake machines, but Intel erratum SKL087 explains that Skylake behaves as if this bit were 1. I haven't got newer CPUs to test on, but I suspect the FDP_EXCPTN_ONLY bit is 1 on newer CPUs, i.e. Intel just codified the Skylake bug as expected behaviour.

This is a problem for rr trace portability: when FXSAVE or some XSAVE variant is used (and on x86-64 the glibc dynamic loader always uses one of them), a difference in the FDP between recording and replay can leak into stack memory and then into registers loaded from uninitialized stack locations, after which rr may detect a divergence.

I'm not sure what, if anything, can be done about this. x86-64 applications seldom use x87 instructions, but they do get used. For example, in my Fedora 27 libstdc++, std::__detail::_Prime_rehash_policy::_M_next_bkt(unsigned long) has been compiled to a mix of x87 and SSE instructions.

Sunday, 29 April 2018

CPUID Features, XSAVE, And rr Trace Portability

Last year we made rr use "CPUID faulting" to record and replay the results of CPUID instructions. This has enabled a reasonable level of rr trace portability in practice. However, we've run into some interesting limitations, some of which we've addressed and others that I don't know how to address.

Sometimes the recording CPUs support features that are missing on the CPUs you want to eventually replay on. If your application detects and uses those features, replay won't work. Fortunately, because CPUID faulting lets us control the results of application CPUID invocations, it was easy to extend rr with the ability to mask off CPUID feature bits during recording. I added recording options --disable-cpuid-features, --disable-cpuid-features-ext, and --disable-cpuid-features-xsave to do this. To make it easier to determine which bits to mask off, I also added a new command rr cpufeatures which you can run on a replay machine to print the command line options you should use for recording, to disable features unsupported on the replay machine. This seems to work reasonably well.

Unfortunately there's a portability hazard related to the XSAVE instruction. The size of the memory range written by XSAVE depends on the XSAVE features supported by the CPU and enabled by the OS via the XCR0 register. This can be different on the recording and replay machines and there's nothing rr can do about it! User-space code can (and should!) use CPUID queries to determine what that size is, but overriding what we report doesn't change what the CPU will actually write. Even though applications shouldn't really depend on what's written by XSAVE, as long as XSAVE/XRSTOR pairs actually restore state as expected, in practice the size of XSAVE writes affects uninitialized values on the stack, which leak into registers and cause rr to detect divergence and abort.
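
For what it's worth, user-space code can query those sizes through CPUID leaf 0xD, sub-leaf 0; here's a minimal sketch using the Rust standard library's CPUID intrinsic:

#[cfg(target_arch = "x86_64")]
fn xsave_area_sizes() -> (u32, u32) {
    use std::arch::x86_64::__cpuid_count;
    // CPUID leaf 0xD, sub-leaf 0:
    //   EBX = size in bytes of the XSAVE area for the XCR0 features currently
    //         enabled by the OS
    //   ECX = maximum size needed for all features this CPU supports
    let r = unsafe { __cpuid_count(0xd, 0) };
    (r.ebx, r.ecx)
}

If the EBX value differs between the recording and replay machines, the amount of memory XSAVE writes differs too, which is exactly the hazard described above.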

A possible work-around is to mask off the XSAVE feature bit in CPUID results, so well-behaved user-space programs don't use XSAVE at all. Unfortunately that also effectively disables AVX and other XSAVE-dependent features :-(.

A possible fix is to add a Linux kernel feature to let us set the XCR0 value for specific processes, i.e. disable XSAVE state saving for some XSAVE components. Then during recording we could limit the XSAVE components to those that are supported by the replay machine. Unfortunately, from experience we know that adding a new kernel API upstream is quite difficult, especially one that requires adding code to context switches.

A simpler kernel patch would be to provide a boot command-line option to mask bits out of XCR0 for all processes.

Another possible work-around would be to record in a virtual machine whose XCR0 is specifically configured to match the replay machine's. I haven't tried it but I assume it's possible; VM migration would require this.

For now I plan to just have rr print an error during replay when XSAVE is enabled and the replay machine's XCR0 features are not equal to the recording machine's.

Update I realized that the XSAVEC instruction ("XSAVE compressed") avoids the above issues. State components that are not in use do not affect what is written by XSAVEC. If a state component is actually in use during recording but not supported during replay, replay is already impossible. Therefore applications that stick to XSAVEC and avoid XSAVE will not incur the above problems. The only user-level use of XSAVE I currently know of is in the glibc dynamic loader, which prefers XSAVEC if available; therefore recording (and replaying) on machines supporting XSAVEC should generally work fine regardless of what XSAVE features are enabled via XCR0. The situation is not as bad as I feared.
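
XSAVEC support itself is reported by CPUID leaf 0xD, sub-leaf 1, EAX bit 1; a minimal check (a sketch, x86-64 only) looks like:

#[cfg(target_arch = "x86_64")]
fn has_xsavec() -> bool {
    use std::arch::x86_64::__cpuid_count;
    // CPUID leaf 0xD, sub-leaf 1: EAX bit 1 indicates XSAVEC support.
    let r = unsafe { __cpuid_count(0xd, 1) };
    (r.eax & (1 << 1)) != 0
}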

Saturday, 21 April 2018

Heaphy Track #2

Last week I did the Heaphy Track again with some friends and family. We did the track west to east this time, staying overnight in Westport on Saturday and spending the following nights on the trail in Heaphy Hut, Mckay Hut, and Perry Saddle Hut. The weather forecast was pretty bad from the outset, predicting heavy rain and high winds for Monday, Tuesday and Wednesday. In the end the weather wasn't as bad as forecast, but it did rain a fair bit on those days and we had a major thunderstorm on Sunday night — glad to be in the hut! For some of our group it was their first multi-day tramp, and covering 80km in four days in those conditions wasn't the easiest introduction, but I'm confident everyone had at least a memorable time, and probably a good one :-). We had a bit of everything: beaches, forest, rivers, giant carnivorous snails, lots of wekas, limestone caves (next to Gouland Downs Hut), wetas, rain, sun, views of snow-capped hills. In huts we played lots of Bang, some Citadels and San Juan, talked to interesting people from many countries, and cooked some pretty good food.

Tuesday, 10 April 2018

Payment Express's "Account2Account" Is Bad For Security

Today I discovered that an Australian company called Payment Express has started offering, in addition to credit-card payment processing, a feature called "Account2Account". With this feature, customers enter their online banking credentials into Payment Express' Web site which then performs a payment transaction on the customer's behalf. This is insane and I don't know why banks allow it.

The security FAQ presented to customers (which I can't find a public URL for) emphasizes that Payment Express does not store the customer's credentials or other information. Good, but the problem is that even if customers can completely trust Payment Express (and I don't know why they should; Payment Express' terms and conditions disclaim all liability), any workflow that trains customers to enter their banking credentials into a Web site other than their bank's site makes them vulnerable to phishing attacks. Online banking phishing attacks are ubiquitous. Even worse, Payment Express advertises "Payment page look and feel customisable via an online wizard", which suggests that the appearance of pages presented to customers can vary, making those customers even more vulnerable to phishing. Payment Express doesn't even use an EV certificate.

Maybe banks allow this because they get a bigger cut of the transactions than they do with a credit-card transaction, enough to cover the cost to them of any increase in phishing losses? If so I don't know how they're calculating the cost to our entire social ecosystem of people being trained to enter critical login credentials into random Web sites.

Thursday, 29 March 2018

Speeding Up `dwarfdump` With Rust

Writing a debugger for C++ on Linux, you spend a lot of time examining pretty-printed DWARF debug information using tools like readelf, objdump or dwarfdump. Unfortunately this can be quite slow. For Firefox's libxul.so, dwarfdump's pretty-printed output of just the main .debug_info is 25GiB. The standard objdump and readelf tools take about three minutes to print it to /dev/null.

The Rust gimli crate includes a Rust-implemented version of dwarfdump. Initially it took about eight minutes to dump libxul, although the comparison is unfair because dwarfdump dumps more data than readelf. I decided to try to speed dwarfdump up. TL;DR: I reduced the dump time from 506s to 26s by fixing some simple issues and taking advantage of Rust "fearless parallelism". I think there are interesting opportunities for speeding up many kinds of command-line tools using Rust and parallelism.

Initially gimli dwarfdump was using println! to print every line of output. Every println! call temporarily locks the Rust stdout stream and performs a write system call, even when output is redirected to a file. This is extremely inefficient; it's much more efficient to buffer output so multiple lines are written with each system call. Since dwarfdump used only one thread at this point, it could also take the stdout lock once at the start and hold it permanently. These changes reduced the run time to 215s.
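
The fix is the usual Rust idiom: take the stdout lock once and wrap it in a BufWriter, roughly like this sketch:

use std::io::{self, BufWriter, Write};

fn main() -> io::Result<()> {
    let stdout = io::stdout();
    // Lock stdout once for the lifetime of the program and buffer writes, so
    // each line neither re-locks the stream nor issues its own write system call.
    let mut out = BufWriter::new(stdout.lock());
    for i in 0..1000 {
        writeln!(out, "line {}", i)?;
    }
    out.flush()
}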

I did some profiling to look for hotspots amenable to micro-optimizations. I found that we spent a lot of CPU time in Rust's formatted-string padding code. dwarfdump used right-padding formatting of an empty string ("{:indent$}") to insert a specified number of spaces to indent each line. This is quite slow because the padding implementation writes the padded value to a temporary buffer, measures the length of the resulting string, and then writes the buffered value and the padding bytes. I changed dwarfdump to indent by printing slices from a large string of spaces, reducing the run time to 142s. I also found that where some non-empty string tokens were being right-padded, we could speed things up by emitting the padding as a string slice, taking advantage of the fact that these are static strings whose lengths we know. That reduced the run time to 99s.
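
The indentation trick is simply to keep a static run of spaces and write slices of it; a sketch:

use std::io::{self, Write};

// A static run of spaces; longer indents loop over slices of it.
const SPACES: &str = "                                                                ";

fn write_indent<W: Write>(w: &mut W, mut indent: usize) -> io::Result<()> {
    while indent > 0 {
        let n = indent.min(SPACES.len());
        w.write_all(&SPACES.as_bytes()[..n])?;
        indent -= n;
    }
    Ok(())
}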

There are probably opportunities to improve the performance of Rust formatted text output by combining static knowledge of the format string and the types and other details of the parameters.

I then turned to parallelism. dwarfdump prints the contents of .debug_info by looping over each compilation unit, and this is easy to do in parallel — in Rust, at least. A method that takes an immutable &self reference promises it can be called safely on the same object from multiple threads (provided the type is Sync, as most are), so it's evident from a library's API types whether and how it can be used with multiple threads, and the compiler checks your usage. Better still, since immutability is the default, in practice Rust libraries tend to work well with multithreading, and gimli is no exception.

One important detail is that we want to output the results for compilation units in the correct order: the output for a compilation unit has to be buffered and printed to stdout after the output for all previous compilation units. You don't want to buffer the output of too many compilation units at once; each of those buffers can be tens of megabytes. I created a parallel-output utility function to handle that. It assigns compilation units to a fixed number of worker threads (N) in a strict round-robin order and ensures that a worker thread doesn't start working on a new compilation unit until the results for its previous compilation unit have been printed. Thus at most N output buffers are required. With eight worker threads (my Skylake laptop has four cores and eight hardware threads) this reduced the run time to 26s.
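
Here's a simplified sketch of that kind of utility (not the actual gimli code): units are assigned to workers round-robin, each worker blocks on a rendezvous channel so it holds at most one finished buffer, and the main thread prints the buffers in the same round-robin order.

use std::io::Write;
use std::sync::mpsc::{sync_channel, Receiver};
use std::thread;

// Simplified sketch: process `units` on `workers` threads, printing results in
// input order while holding at most `workers` finished buffers at any time.
fn parallel_output<T: Send + 'static>(units: Vec<T>, workers: usize, process: fn(T) -> Vec<u8>) {
    assert!(workers > 0);
    // Assign units to workers in strict round-robin order.
    let mut chunks: Vec<Vec<T>> = (0..workers).map(|_| Vec::new()).collect();
    for (i, unit) in units.into_iter().enumerate() {
        chunks[i % workers].push(unit);
    }
    let mut receivers: Vec<Receiver<Vec<u8>>> = Vec::new();
    let mut handles = Vec::new();
    for chunk in chunks {
        // Capacity-0 channel: a worker blocks until its previous buffer has
        // been consumed, so each worker holds at most one finished buffer.
        let (tx, rx) = sync_channel::<Vec<u8>>(0);
        receivers.push(rx);
        handles.push(thread::spawn(move || {
            for unit in chunk {
                tx.send(process(unit)).unwrap();
            }
        }));
    }
    // Print buffers in the same round-robin order the units were assigned.
    let stdout = std::io::stdout();
    let mut out = stdout.lock();
    'printing: loop {
        for rx in &receivers {
            match rx.recv() {
                Ok(buf) => out.write_all(&buf).unwrap(),
                Err(_) => break 'printing, // all remaining workers are finished
            }
        }
    }
    for handle in handles {
        handle.join().unwrap();
    }
}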

Even with a single worker thread, run time dropped to 77s; some of the improvement actually came from changing writes from a BufWriter to a Vec<u8>. That may be due to io::Result error propagation and checking. Two threads gave 47s, and four threads gave 30s. Exceeding the physical core count provided negligible returns.

At peak performance gimli dwarfdump produces 1GB of output per second. It's interesting that even for relatively simple pretty-printing and such high data volume, serialized large writes to stdout are not the bottleneck. This suggests that even simple Unixy stdio-pipeline tools might benefit from internal parallelism.

Achieving this performance by improving existing C tools would have been a lot more difficult than with gimli and Rust. There's no way to be sure that the relevant DWARF processing code, e.g. binutils/dwarf.c, is safe for use on multiple threads. (Given the presence of many unadorned static variables, it probably isn't.) Efficiently switching output backends would have been more difficult than in Rust, where it's idiomatic to parameterize output code on the static type of the output backend, e.g. fn dump_info<W: Writer>(w: &mut W, ...).

Tuesday, 20 March 2018

Too Many DWARF Packaging Options

There are too many ways in which DWARF debuginfo can be packaged, and it makes building DWARF-consuming tools a nightmare.

Debuginfo can be packaged into the executable file, or into a separate external debuginfo file. This debuginfo file can be referenced by a build id stored in the .note.gnu.build-id section, or via a filename stored in the .gnu_debuglink section.

Either of those kinds of debuginfo file can use "split DWARF". With "split DWARF", most of the debuginfo is not present in the file itself; instead it is left in the object files that were input to the linker, and the debuginfo binary references those files. Unfortunately there are two flavours of this scheme: the standard DWARF 5 version and the earlier GNU "Fission" variant, and they are different. You can merge the split debuginfo from multiple object files into a single package file, although I don't think that adds to the complexity of debuginfo-consuming tools.

On top of that, there is also "multi-file DWZ". This lets you extract DWARF data that's common to multiple binaries into a single "alternative debuginfo" file, which those binaries reference via a .gnu_debugaltlink section. This has been standardized as "supplementary object files". Fedora is using the GNU variant in its debuginfo packages already.

You can optionally zlib-compress individual ELF sections in any of those files.

These techniques could, in theory, be combined. Fedora uses external debuginfo files with multi-file DWZ. I think you could have a binary that uses both "supplementary object files" and "split DWARF". I think the DWARF spec rules out a "supplementary object file" itself using "split DWARF", or vice versa, but it's a bit vague.

It's unfortunate that "supplementary object files" and "split DWARF" are different.

It's also very unfortunate that many new DWARF features exist in a GNU flavour and a standard flavour. In practice DWARF-consuming tools need to support both, which is extra work. The DWARF community could learn from the painfully-won wisdom of the Web standards community.

I apologise if I got any of the above wrong. It's complicated and confusing, and documentation is scattered or in some cases nonexistent.

Update Turns out that the ELF per-section compression is more complex than I knew. Any section may be compressed with SHF_COMPRESSED, in which case it starts with an Elf32_External_Chdr or Elf64_External_Chdr header. Some Fedora packages use this. However, you can also have compressed sections containing a different sort of header: the ASCII bytes "ZLIB" followed by an 8-byte big-endian uncompressed size.
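
A sketch of recognizing that second kind of header, as described above (the literal bytes "ZLIB", then the uncompressed size as an 8-byte big-endian integer, then the zlib stream):

// Returns the uncompressed size and the zlib-compressed payload, if the
// section uses the legacy "ZLIB" header described above.
fn parse_legacy_zlib_header(section: &[u8]) -> Option<(u64, &[u8])> {
    if section.len() < 12 || &section[..4] != b"ZLIB" {
        return None;
    }
    let mut size = [0u8; 8];
    size.copy_from_slice(&section[4..12]);
    Some((u64::from_be_bytes(size), &section[12..]))
}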

Also, it appears the GNU tools will accept .zdebug variants of every .debug section. The "z" is supposed to indicate compression, but nothing seems to require that .zdebug sections be compressed or .debug sections uncompressed.

Tuesday, 6 March 2018

"Zach": AI Fraud In Christchurch

David Farrier's story is amazing. Go and read it.

I have no doubt whatsoever that this is all a ridiculous fraud. Apart from the implausibility of it all, the purported technical details make no sense. Serious AI outfits use racks of cheap servers, not expensive supercomputers like the Cray XC50. The XC50 is cooled with water, not liquid nitrogen. Training an AI by sending it emails in English would only work if it has already achieved human-level intelligence, in which case a) impressive effort by Albi Whale, purported "boy genius" b) why bother taking the time of medical professionals to train it, when you could be learning much faster from Wikipedia and other resources on the Internet c) why are you applying the pinnacle of mankind's technological achievement to transcribing medical records in Christchurch?

David Whale's emails to Farrier are full of bluster: someone who doesn't know much about computers trying to impress someone whom he thinks also doesn't know much. He's not telling the truth. Robert Seddon-Smith and John Pickering are either active fraudsters or victims.

Most concerning is that it sounds like government health boards are either on the verge of funneling funds to this fraud, or are already doing so. That needs to be stopped.

Friday, 2 March 2018

Tongariro Northern Circuit #2

A couple of weeks ago I went with a friend to do the Tongariro Northern Circuit for the second time in less than a year. The weather wasn't quite as good as last time but we had another great trip.

We drove down to the mountains, stopping at Orakei Korako on the way — a pretty good thermal area. We spent Friday night in Taupo so we could arrive at the first hut, Mangatepopo, in good time on Saturday. We had a swim in Lake Taupo — surprisingly warm. On Saturday morning we drove to Whakapapa with a stop for a two-hour walk around Lake Rotopounamu — lovely and peaceful.

We knew Cyclone Gita was scheduled to hit New Zealand on Tuesday, the planned fourth day of our walk, so I checked at the DoC office at the start of our walk in Whakapapa. They advised us to just go ahead, that if necessary on the fourth day we could walk out to the Desert Road in 1.5 hours instead of returning to Whakapapa over the exposed saddle between Ruapehu and Tongariro.

The first day to Mangatepopo Hut was a bit rainy and the track is in poor condition ... perhaps the worst one-day section of track in the entire Great Walks system. Nevertheless the area is full of pretty wildflowers this time of year, and we got to Mangatepopo in good time, just after 3pm. Having just two in our group — instead of ten on my last tramp! — meant we talked to a lot more people, most of whom seemed to be Americans for some reason. We had a great time, made popcorn and shared it out, taught San Juan to a few people, and got to know some trampers we'd see a lot of for the rest of the walk.

Sunday's walk across Tongariro to Oturere Hut was busy as expected, as we shared the track with hundreds of people doing the Tongariro Crossing that day. Nevertheless the landscape never ceases to amaze. Oturere is quite cramped (we heard it's scheduled for an upgrade in 2019) but we had another great time: more San Juan, more popcorn, and finally a clear view of Ngauruhoe.

On Monday morning at Oturere I overheard one man propounding "religious people think they have all the answers ... must be comforting (to be so ignorant)". That's totally contrary to my experience. Who contemplates the mystery of the Trinity, or their own sin, for example, and comes away thinking they have all the answers? I resisted interjecting.

The forecast for Tuesday was that with Cyclone Gita approaching we'd have 100+km/hour winds and heavy rain. That would be a bad combination for a four-hour walk in very exposed terrain between Ruapehu and Ngauruhoe. We decided to accelerate plans and complete Tuesday's leg on Monday. So, we had a two-and-a-half hour walk to Waihohonu hut — still the best hut in New Zealand AFAIK — and after a half-hour break carried on all the way back to Whakapapa (five and a half hours) with short side trips to Lower Tama Lake and Taranaki Falls. Rain was forecast for the afternoon but we only had a light sprinkle before we arrived at Whakapapa around 5pm. We drove all the way back to Auckland that night without difficulty. All in all, another excellent tramp.

Sunday, 21 January 2018

Neal Stephenson's "Seveneves" (Mild Spoilers)

There's much discussion of orbital mechanics, disguised as a story. The rest isn't as good.

OK, actually I rather enjoyed it, but only because I'm a sucker for apocalyptic fiction and hard-ish science, and I gave immense credit for the chutzpah of his opening sentence, in which the moon explodes for no reason.

I found his treatment of religion more annoying than usual for sci-fi. His atheist wish-fulfillment fantasy "then everyone realized there's no God" is par for the course. Projecting thousands of years of human development without belief in God recurring, and with no other apparent solution to the meaning of life, is sloppy but also usual. What really grates is the ending, which reveals that — surprise! — people do care about having a supernatural purpose and, oddly, a powerful cabal has found one but they're keeping it secret. It reminded me of Contact where after relentlessly bashing religious rubes, at the very end Sagan reveals that the universe has been designed by, if not God, something seriously God-like. I find their lack of faith in lack of faith disturbing.

Wednesday, 17 January 2018

Long-Term Consequences Of Spectre And Its Mitigations

The dust is settling on the initial wave of responses to Spectre and Meltdown. Meltdown was relatively simple to deal with; we can consider it fixed. Spectre is much more difficult and has far-reaching consequences for the software ecosystem.

The community is treating Spectre as two different issues, "variant 1" involving code speculatively executed after a conditional branch, and "variant 2" involving code speculatively executed via an indirect branch whose predicted destination is attacker-controlled. I wish these had better names, but c'est la vie.

Spectre variant 1 mitigations

Proposals for mitigating variant 1 have emerged from Webkit, the Linux kernel, and Microsoft. The former two propose similar ideas: masking array indices so that even speculative array loads can't load out-of-bounds. MSVC takes a different approach, introducing LFENCE instructions to block speculative execution when the load address appears to be guarded by a range check. Unfortunately Microsoft says

It is important to note that there are limits to the analysis that MSVC and compilers in general can perform when attempting to identify instances of variant 1. As such, there is no guarantee that all possible instances of variant 1 will be instrumented under /Qspectre.
This seems to be a great weakness, as developers won't know whether this mitigation is actually effective on their code.

The Webkit and Linux kernel approaches have the virtue of being predictable, but at the cost of requiring manual code changes. The fundamental problem is that in C/C++ the compiler generally does not know with certainty the array length associated with an array lookup, thus the masking code must be introduced manually. Webkit goes further and adds protection against speculative loads guarded by dynamic type checks, but again this must be done manually in many cases since C/C++ have no built-in tagged union type.
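
To illustrate the masking idea, here's a sketch in Rust, in the spirit of the Linux kernel's array_index_nospec() rather than code taken from either project; a real mitigation also has to ensure the compiler doesn't turn the mask back into a conditional branch.

// Branchless index mask: all ones when index < len, all zeros otherwise,
// assuming len >= 1 and len < 2^63. Modeled on the Linux kernel's generic
// array_index_mask_nospec().
fn index_mask_nospec(index: usize, len: usize) -> usize {
    let i = index as u64;
    let l = len as u64;
    // The top bit of (i | (l - 1 - i)) is clear exactly when 0 <= i <= l - 1.
    let mask = !(i | l.wrapping_sub(1).wrapping_sub(i));
    ((mask as i64) >> 63) as usize // broadcast the sign bit
}

fn lookup(table: &[u8], untrusted_index: usize) -> Option<u8> {
    if untrusted_index < table.len() {
        // Even if this branch is mispredicted and the load below executes
        // speculatively with an out-of-bounds index, the mask forces the
        // access to land inside `table` (at index 0).
        Some(table[untrusted_index & index_mask_nospec(untrusted_index, table.len())])
    } else {
        None
    }
}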

I think "safe" languages like Rust should generalize the idea behind Webkit's mitigations: require that speculatively executed code adhere to the memory safety constraints imposed by the type system. This would make Spectre variant 1 a lot harder to exploit. It would subsume every variant 1 mitigation I've seen so far, and could be automatic for safe code. Unsafe Rust code would need to be updated.

Having said that, there could be variant-1 attacks that don't circumvent the type system, that none of these mitigations would block. Consider a browser running JS code:

let x = bigArray[iframeElem.contentWindow.someProperty];
Conceivably that could get compiled to some mix of JIT code and C++ that does
  if (iframeElemOrigin == selfDocumentOrigin) {
    index = ... get someProperty ...
    x = bigArray[index];
  } else {
    ... error ...
  }
The speculatively executed code violates no type system invariants, but could leak the value of the property across origins. This example suggests that complete protection against Spectre variant 1 will require draconian mitigations, either pervasive and expensive code instrumentation or deep (and probably error-prone) analysis.

Spectre variant 2 mitigations

There are two approaches here. One is microcode and silicon changes to CPUs to enable flushing and/or disabling of indirect branch predictors. The other is "retpolines" — replace indirect branches with an instruction sequence that doesn't trigger the indirect branch predictor. (More precisely, that doesn't use the BTB; the RSB prediction is used instead, but its prediction is directed to a safe destination address.) Apparently the Linux community is advising all compilers and assembly writers to avoid all indirect branches on Intel even in user-space. This means, for example, that we should update rr's handwritten assembly to avoid indirect branches. On the other hand, Microsoft is not giving such advice and apparently is not planning to introduce retpoline support in MSVC. I don't know why this difference is occurring, but it seems like a problem.

Assuming the Linux community advice is followed, things get even more complicated. Future CPUs can be secure against variant 2 without requiring retpolines. We will want to avoid retpolines on those CPUs for performance reasons. Also, Intel's future CET control-flow-integrity hardware will not work with retpolines, so we'll want to turn retpolines off for security! So software will need to determine at run-time whether retpolines should be used. JITs and handwritten assembly will need to add code to do that. This is going to be a burden on lots of software developers for a very long time.

Security/performance tradeoffs

There is now a significant performance penalty for running untrusted code. If you know for sure there is no malicious code running in your (virtual) machine you can turn off these mitigations and get significant performance wins. This wasn't really true before. (Unikernels reaped some performance benefits but created too many other problems to be generally useful.) Inventorying the entire collection of software running in your VM to verify that it's all trusted may be difficult in practice and reduces defense-in-depth ... but no doubt people will be tempted to do it.

We could see increased interest in source-based distributions like Gentoo. Recompiling your software stack to include just the mitigations that you need could bring performance benefits.

Javascript implications

The isolation boundary between Javascript and a browser content process' native code is not visible to the CPU, which makes hardware mitigations difficult to use for JS — and any other system running code in the same process with different levels of trust. It's hard to say what the immediate implications of this are, but I think it makes "one site per process" policies in browsers more appealing in the long term, at least as an option to deploy in case some future difficult-to-mitigate vulnerability hits. Right now browsers are trying to keep the problem manageable by making it difficult for JS to extract information from the timing channel (by limiting timer resolution and disabling features like SharedArrayBuffer that can be used to implement high-resolution timers), but this unfortunately limits the power of Web applications compared to native applications. For example, as long as it lasts we can't run idiomatic parallel Rust code in browsers via WebAssembly :-(. Also I suspect in the medium term attackers will find other ways to read the timing channel that will be less feasible to disable.

I think it would be a grave mistake to simply give up on mixing code with different trust labels in the same address space. Apart from having to redesign a lot of software, that would set a hard lower bound on the cost of transitioning between trust zones. It would be much better if hardware mitigations can be designed to be usable within a single address space.

Other attacks

Perhaps the biggest question is whether we are seeing just the start of a flood of serious attacks based on Spectre-like ideas. I think it's entirely possible, and if so, then dealing with those attacks piecemeal as they surface is going to be incredibly expensive and painful. There is even a possibility that the cost of mitigations will compound as mitigations interfere with one another and fewer and fewer people are capable of understanding what's going on. Therefore I hope and pray that people in positions of power — CPU vendors, big software vendors, etc — work together to come up with comprehensive, preventative fixes that simply rule out these classes of attacks, and don't let themselves be entirely consumed by demands for immediate responses to zero-day vulnerabilities. I applaud the sentiment of RISC-V's statement to this end, self-serving as it is.