Monday, 20 November 2017

Tararua Southern Crossing

I spent three days last week hiking the Tararua Southern Crossing in the Tararua Ranges north of Wellington with one of my children, and had a great time!

On the first day we took the train from Wellington city up the coast to Waikanae on the Kapiti Coast, then were driven east into the ranges, to Otaki Forks where the track starts. It took about five hours to walk to Kime Hut, at about 1400m — an altitude gain of about 1200m. From there we watched a spectacular sunset over the northern end of the South Island, which was wholly visible from the Marlborough Sounds in the west to the Kaikoura Ranges in the east. Kime has no heating and has a reputation for being cold, but it was upgraded in 2014 and although there was a hard frost outside, we were fine sleeping in our clothes inside our sleeping bags.

The next day we hiked over "the tops" — semi-alpine terrain above the tree line. This area is notoriously windy and potentially dangerous in bad weather, but the weather was fine and the wind tolerable. The DoC office had warned us not to take the route if the wind forecast for Powell Hut further north was 50km/h or higher, and I was able to download a fresh forecast in the morning from near Kime Hut which forecast 35km/h. In some places, notably the south side of Mount Hector, the wind was a little unnerving, but it was never a real problem. In places the track is narrow with steep drops on both sides, and although I'm generally not good with heights, I was fine just keeping my eyes on the track; wielding a couple of hiking poles helped. We were cautious and it took us about five hours to reach that day's destination, Alpha Hut. Along the way we had spectacular views from the Kapiti Coast in the west, around to the Hutt Valley, Wellington, and snow-capped peaks in the South Island to the south, and over to the Wairarapa valley in the east. It was a wonderful day.

The final day was a long walk through the bush along the Marchant Ridge down to rural Kaitoke. It took us about eight hours of walking, after which we had a taxi take us to Upper Hutt and then a train back to Wellington. It's a bit of a slog, but there are more good views along the way and it's generally fun. I tripped on a root near the end and banged my knee, which is still sore a few days later, but so it goes.

I'd never been tramping in the Tararuas before, and from the visitor books it looked like most trampers in this area are locals. I'm glad I got a chance to try it. We were lucky that the weather was excellent; it's often wet and/or foggy. Although the area between Kime and Alpha is classed as a "route" and most of the rest is "tramping track" grade, we found almost all the track well-defined and in good condition, though a lot of stepping up and down rocks and tree roots is required!

After the tramp we met up with the rest of the family for a couple of days in Wellington, in particular to visit the World War I exhibitions at Te Papa museum and the Dominion Museum. They are excellent, though they required a significant amount of additional walking, which probably didn't help my knee.

Sunday, 29 October 2017

Auckland Half Marathon 2017

Not as good a time as last year, but OK. I seemed to run out of steam a bit towards the end this year; I'm not sure why. Maybe I pushed too hard too early, or just had a bad day. I should try running more fast 10Ks in the lead-up next year.

I followed Alfredo's advice to reduce the wear on my bare feet by sticking some small pieces of climbing tape to the balls of my feet. It worked really well. The tape almost entirely wore off during the race, and my feet didn't feel any different while I was running, but the skin didn't wear away. In experiments on training runs I discovered that strips of tape parallel to the side of the foot work better than strips across the foot because in the latter case, the flexing of the foot seems to pull the tape off.

Thursday, 19 October 2017

Microsoft's Chrome Exploitation And The Limitations Of Control Flow Integrity

Microsoft published an interesting blog post about exploiting a V8 bug to achieve arbitrary code execution in a Chrome content sandbox. They rightly point out that even if you don't escape the sandbox, you can still break important Web security properties (e.g., assuming the process is allowed to host content from more than one origin, you can break same-origin restrictions). However, the message we're supposed to take away from this article is that Microsoft's CFI would prevent similar bugs in Edge from having the same impact. I think that message is basically wrong.

The problem is, once you've achieved arbitrary memory read/write from Javascript, it's very likely you can break those Web security properties without running arbitrary machine code, without using ROP, and without violating CFI at all. For example if you want to violate same-origin restrictions, your JS code could find the location in memory where the origin of the current document is stored and rewrite it to be a different origin. In practice it would be quite a lot more complicated than that, but the basic idea should work, and once you've implemented the technique it could be used to exploit any arbitrary read/write bug. It might even be easier to write some exploits this way than using traditional arbitrary code execution; JS is a more convenient programming language than ROP gadgets.

The underlying technical problem is that once you've achieved arbitrary read/write you can almost completely violate data-flow integrity within the process. As I recently wrote, DFI is extremely important and (unlike CFI) it's probably impossible to dynamically enforce with low overhead in the presence of arbitrary read/write, with any reasonable granularity.

I think there's also an underlying cultural problem here, which is that traditionally "Remote Code Execution" — of unconstrained machine code — has been the gold standard for a working exploit, which is why techniques to prevent that, like CFI, have attracted so much attention. But Javascript (or some other interpreter, or even some Turing-complete interpreter-like behavior) armed with an arbitrary memory read/write primitive is just as bad in a lot of cases.

Sunday, 15 October 2017

"Slow To Become Angry"

James 1:19-20

My dear brothers and sisters, take note of this: Everyone should be quick to listen, slow to speak and slow to become angry, because human anger does not produce the righteousness that God desires.

Online media exploit the intoxicating effects of righteous anger to generate "engagement". But even — or especially — when the targets of one's anger richly deserve it, becoming a person whose life is characterized by anger is contrary to God's will.

Something to remember when I'm tempted to click on yet another link about some villain getting what he deserves.

Monday, 9 October 2017

Type Safety And Data Flow Integrity

We talk a lot about memory safety assuming everyone knows what it is, but I think that can be confusing and sell short the benefits of safety in modern programming languages. It's probably better to talk about "type safety". This can be formalized in various ways, but intuitively a language's type system proposes constraints on what is allowed to happen at run-time — constraints that programmers assume when reasoning about their programs; type-safe code actually obeys those constraints. This includes classic memory safety features such as avoidance of buffer overflows: writing past the end of an array has effects on the data after the array that the type system does not allow for. But type safety also means, for example, that (in most languages) a field of an object cannot be read or written except through pointers/references created by explicit access to that field. With this loose definition, type safety of a piece of code can be achieved in different ways. The compiler might enforce it, or you might prove the required properties mechanically or by hand, or you might just test it until you've fixed all the bugs.
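
To make the array-bounds point concrete, here's a tiny sketch in Rust (one example of a language whose safe subset is type-safe in this sense): an out-of-range index is rejected at run time instead of silently scribbling over whatever follows the array.

    fn main() {
        let mut a = [0u8; 4];
        let i = 7; // imagine this index came from untrusted input
        // Safe Rust gives you no way to write a[7] and corrupt adjacent data:
        // a[i] = 42 would panic, and get_mut() lets you handle the failure.
        match a.get_mut(i) {
            Some(slot) => *slot = 42,
            None => println!("index {} is out of bounds, write rejected", i),
        }
    }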

One implication of this is that type-safe code provides data-flow integrity. A type system provides intuitive constraints on how data can flow from one part of the program to another. For example, if your code has private fields that the language only lets you access through a limited set of methods, then at run time it's true that all accesses to those fields are by those methods (or due to unsafe code).
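
As a minimal illustration of that guarantee, here's a sketch in Rust (any language with enforced visibility would do): the only code that can modify the private field is the set of methods in its module, so reasoning about how the field changes only has to consider those methods.

    mod account {
        pub struct Account {
            balance: i64, // private: not reachable from outside this module
        }
        impl Account {
            pub fn new() -> Account { Account { balance: 0 } }
            pub fn deposit(&mut self, amount: i64) {
                assert!(amount >= 0);
                self.balance += amount;
            }
            pub fn balance(&self) -> i64 { self.balance }
        }
    }

    fn main() {
        let mut a = account::Account::new();
        a.deposit(10);
        // a.balance = -1; // rejected by the compiler: the field is private
        println!("{}", a.balance());
    }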

Type-safe code also provides control-flow integrity, because any reasonable type system also suggests fine-grained constraints on control flow.

Data-flow integrity is very important. Most information-disclosure bugs (e.g. Heartbleed) violate data-flow integrity, but usually don't violate control-flow integrity. "Wild write" bugs are a very powerful primitive for attackers because they allow massive violation of data-flow integrity; most security-relevant decisions can be compromised if you can corrupt their inputs.

A lot of work has been done to enforce CFI for C/C++ using dynamic checks with reasonably low overhead. That's good and important work. But attackers will move to attacking DFI, and that's going to be a lot harder to solve for C/C++. For example the checking performed by ASAN is only a subset of what would be required to enforce the C++ type system, and ASAN's overhead is already too high. You would never choose C/C++ for performance reasons if you had to run under ASAN. (I guess you could reduce ASAN's overhead if you dropped all the support for debugging, but it would still be too high.)

Note 1: people often say "even type safe programs still have correctness bugs, so you're just solving one class of bugs which is not a big deal" (or, "... so you should just use C and prove everything correct"). This underestimates the power of type safety with a reasonably rich type system. Having fine-grained CFI and DFI, and generally being able to trust the assumptions the type system suggests to you, are essential for sound reasoning about programs. Then you can leverage the type system to build abstractions that let you check more properties; e.g. you can enforce separation between trusted and untrusted data by giving untrusted user input different types and access methods to trusted data. The more your code is type-safe, the stronger is your confidence in those properties.
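
Here's a sketch of that pattern, with made-up names: untrusted input gets its own type, and the only way to turn it into data the trusted code will accept is through a sanitizing constructor, so the type system itself tracks the trust boundary.

    pub struct UntrustedInput(pub String);

    pub struct TrustedQuery(String); // can only be built via sanitize()

    impl TrustedQuery {
        pub fn sanitize(input: &UntrustedInput) -> Option<TrustedQuery> {
            // Hypothetical policy: accept only simple alphanumeric text.
            if input.0.chars().all(|c| c.is_ascii_alphanumeric() || c == ' ') {
                Some(TrustedQuery(input.0.clone()))
            } else {
                None
            }
        }
    }

    // Security-sensitive code demands a TrustedQuery, so unsanitized data
    // cannot reach it without passing through sanitize().
    pub fn run_query(q: &TrustedQuery) {
        println!("running: {}", q.0);
    }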

Note 2: C/C++ could be considered "type safe" just because the specification says any program executing undefined behavior gets no behavioral constraints whatsoever. However, in practice, programmers reasoning about C/C++ code must (and do) assume the constraint "no undefined behavior occurs"; type-safe C/C++ code must ensure this.

Note 3: the presence of unsafe code within a hardware-enforced protection domain can undermine the properties of type-safe code within the same domain, but minimizing the amount of such unsafe code is still worthwhile, because it reduces your attack surface.

Legacy Code Strikes Again

This blog post describes using binary diffing to find security-relevant bugs in Windows 7 that were fixed in Windows 10. It's an interesting example of the problems you can get into if you don't fix bugs across all your product versions at about the same time.

CVE-2017-8685 is particularly sad. The system call NtGdiEngCreatePalette can leak uninitialized values from kernel memory to user-space; these leaks are quite serious security issues. This is basically the old GDI CreatePalette function. It's a system call because in Windows NT 4.0 (1996) Microsoft moved the GDI subsystem into the kernel for performance reasons. That may have made sense at the time, but at that time GDI was already a mess of complicated legacy code, so this was a large increase in attack surface that's been hurting security and stability for Microsoft users ever since.

What's especially sad about this function is that CreatePalette is only useful for palette-based displays, which became obsolete in the 1990s, around the time NT 4.0 came out. It's been about 20 years since this API was useful for anything other than compatibility with even older software ... or kernel memory disclosure!

Sunday, 8 October 2017

Thoughts On Microsoft's Time-Travel Debugger

I'm excited that Microsoft's TTD is finally available to the public. Congratulations to the team! The video is well worth watching. I haven't used TTD myself yet since I don't have a Windows system at hand, but I've talked to Mozilla developers who've tried it on Firefox.

The most important and obvious difference between TTD and rr is that TTD is for Windows and rr is for Linux (though a few crazy people have had success debugging Windows applications in Wine under rr).

TTD supports recording of multiple threads in parallel, while rr is limited to a single core. On the other hand, per-thread recording overhead seems to be much higher in TTD than in rr. It's hard to make a direct comparison, but a simple "start Firefox, display mozilla.org, shut down" test run on similar hardware takes about 250 seconds under TTD and 26 seconds under rr. This is not surprising given TTD relies on pervasive binary instrumentation and rr was designed not to. This means recording extremely parallel workloads might be faster under TTD, but for many workloads rr recording will be faster. Starting up a large application really stresses binary translation frameworks, so it's a bit of a worst-case scenario for TTD — though a common one for developers. TTD's multicore recording might be better at reproducing certain kinds of concurrency bugs, though rr's chaos mode helps mitigate that problem — and lower recording overhead means you can churn through test iterations faster.

Therefore for Firefox-like workloads, on Linux, I still think rr's recording approach is superior. Note that when the technology behind TTD was first developed the hardware and OS features needed to support an rr-like approach did not exist.

TTD's ability to attach to arbitrary processes and start recording sounds great and would mitigate some of the slow-recording problem. This would be nice to have with rr, but hard to implement. (Currently we require reserving a couple of pages at specific addresses that might not be available when attaching to an arbitrary process.)

Some of the performance overhead of TTD comes from it copying all loaded libraries into the trace file, to ensure traces are portable across machines. rr doesn't do that by default; instead you have to run rr pack to make traces self-contained. I still like our approach, especially in scenarios where you repeatedly re-record a testcase until it fails.

The video mentions that TTD supports shared memory and async I/O and suggests rr doesn't. It can be confusing, but to clarify: rr supports shared memory as long as you record all the processes that are using the shared memory; for example Firefox and Chromium communicate with subprocesses using shared memory and work fine under rr. Async I/O is pretty rare in Linux; where it has come up so far (V4L2) we have been able to handle it.

Supporting unlimited data breakpoints is a nice touch. I assume that's done using their binary instrumentation.

TTD's replay looks fast in the demo videos but they mention that it can be slower than live debugging. They have an offline index build step, though it's not clear to me yet what exactly those indexes contain. It would be interesting to compare TTD and rr replay speed, especially for reverse execution.

The TTD trace querying tools look cool. A lot more can be done in this area.

rr+gdb supports running application functions at debug time (e.g. to dump data structures), while TTD does not. This feature is very important to some rr users, so it might be worthwhile for the TTD people to look at.

Friday, 6 October 2017

Building On Rock, Not Sand

This quote is telling:

Billions of devices run dnsmasq, and it had been through multiple security audits before now. Simon had done the best job possible, I think. He got beat. No human and no amount of budget would have found these problems before now, and now we face the worldwide costs, yet again, of something ubiquitous now, vulnerable.

Some of this is quite accurate. Human beings can't write safe C code. Bug-finding tools and security audits catch some problems but miss a lot of others. But on the other hand, this message and its followup betray mistaken assumptions. There are languages running on commodity hardware that provide much better security properties than C. In particular, all three remote code execution vulnerabilities would have been prevented by Rust, Go or even Java. Those languages would have also made the other bugs much more unlikely. Contrary to the quote, given a finite "amount of budget", dnsmasq could have been Rewritten In Rust and these problems avoided.

I understand that for legacy code like dnsmasq, even that amount of budget might not be available. My sincere hope is that people will at least stop choosing C for new projects. At this point, doing so is professional negligence.

What about C++? In my circle I seldom see enthusiasm for C, yet there is still great enthusiasm for C++, which inherits C's core security weaknesses. Are the C++ projects of today going to be the vulnerability-ridden legacy codebases of tomorrow? (In some cases, e.g. browsers, they already are...) C++ proponents seem to believe that C++ libraries and analysis tools, including efforts such as C++ Core Guidelines: Lifetimes, plus mitigations such as control-flow integrity, will be "good enough". Personally, I'm pessimistic. C++ is a fantastically complex language and that complexity is growing steadily. Much more effort is going into increasing its complexity than addressing safety issues. It's now nearly two years since the Lifetimes document had any sort of update, and at CppCon 2017 just one of 99 talks focused on improving C++ safety.

Those of us building code to last owe it to the world to build on rock, not sand. C is sand. C++ is better, but it's far from a solid foundation.

Microsoft Using Chromium On Android Is Bad For The Web

Microsoft is releasing "Edge for Android" and it uses Chromium. That is bad for the Web.

It's bad because engine diversity is really essential for the open Web. Having some users, even a relatively small number, using the Edge engine on Android would have been a good step. Going with Chromium increases Web developer expectations that all browsers on Android are — or even should be — Chromium. The less thoughtful sort of developer (i.e., pretty much everyone) will say "Microsoft takes this path, so why doesn't Mozilla too, so we can have the instant gratification of compatibility thanks to a single engine?" The slow accumulation of unfixable bugs due to de facto standardization will not register until the platform has thoroughly rotted; the only escape being alternative single-vendor platforms where developers are even more beholden to the vendor.

Sure, it would have been quite a lot of work to port Edge to Android, but Microsoft has the resources, and porting a browser engine isn't a research problem. If Microsoft would rather save resources than promote their own browser engine, perhaps they'll be switching to Chromium on Windows next. Of course that would be even worse for the Web, but it's not hard to believe Microsoft has stopped caring about that, to the extent they ever did.

(Of course Edge uses Webkit on iOS, and that's also bad, but it's Apple's ongoing decision to force browsers to use the least secure engine, so nothing new there.)

Sunday, 24 September 2017

Complaining About Twitter Again

I already complained about Twitter, but it's time to do it again.

People used to complain that Powerpoint teaches people to think in bullet points. Twitter is similar or worse. There's no room for references, context, or nuance. It's almost tailor-made to tempt people into quoting you out of context.

I craft precise (and concise!) messages and then butcher them into illegible ambiguity to fit into the character limit. It reminds me of my dismay when IBM's lawyers turned my beautiful technical papers into monstrous pustules of legal verbiage.

I complained about fragmentation last year, but the "1/N" thread trend is now completely out of control.

Don't get me started about the ridiculous "I'll just post a JPEG of the text I couldn't fit into the tweet". Every single one of them announces a failure of the system.

Twitter's issues wouldn't be so bad if so many of them weren't intentional design choices.

I try to not get involved in conversations on Twitter, but it's amazingly difficult when people I care about talk about subjects where I feel I can contribute.

Twitter doesn't have the clout of Facebook or Google; perhaps that's why it escapes the opprobrium heaped on other big Internet companies. I think it's a cancer on the body politic.

Sunday, 17 September 2017

Dreaming The Singularity

This is an actual dream I had last night. I'm not sure of the order of some of the events, but it otherwise felt clear.

I walked around my neighbourhood and suddenly felt a unique sensation in my head — a sort of overflowing energy that quickly increased in intensity until I was overwhelmed and fell down.

I came to, wondering what had happened; the answer came subconsciously, like remembering: You have changed. You have new abilities.

Focusing my attention internally or externally seemed to generate data and explanations from outside myself. I sensed something in my mind like apps, but I seemed unable to use them.

I asked who changed me. Artificial entities.

How advanced are they? They hold themselves back, waiting for people to follow.

How was I changed? Nanotech.

What about my family? They will follow.

Am I still a Christian? Your beliefs remain in place.

Expect more changes, ever more quickly, becoming a continuous stream.

I was filled with fear at not knowing what I would become, and woke up still filled with it. Oddly, though, in the dream I never considered whether the guiding entities were benign. In retrospect that seems suspicious...

In fact the entire dream was odd considering I don't really believe in the Singularity.

Saturday, 16 September 2017

Facebook's "Explaining React's License" Doesn't

This post "Explaining React's license" confuses me. In particular:

As our business has become successful, we've become a larger target for meritless patent litigation. This type of litigation can be extremely costly in terms of both resources and attention. It would have been easy for us to stop contributing to open source, or to do what some other large companies do and only release software that isn't used in our most successful products, but we decided to take a different approach.

This seems to claim that a) contributing to open source makes Facebook a larger target for meritless patent litigation and (later) b) the explicit patent license compensates for this somehow. The post does not spell out a rationale for a). Is it that exposing valuable source code makes it easier for people to identify potential infringements of their (meritless?) patents? If so, I'm curious what evidence exists for that. And how does b) work? We know countersuits don't work against so-called Non-Practicing Entities — that's the point of NPEs. So is the goal only to deter meritless patent litigation by Practicing Entities, hoping that they'll use Facebook code and depend on a Facebook patent license? If that's the approach, how does applying the patent license to projects like zstd, where it appears Facebook doesn't have any patents, help? Just by increasing the general fear in a Practicing Entity that somewhere they use Facebook patents?

For a post claiming to be an explanation, it seems to leave a lot unexplained.

Thursday, 14 September 2017

Some Opinions On The History Of Web Audio

People complain that Web Audio provides implementations of numerous canned processing features, but they very often don't do exactly what you want, and working around those limitations by writing your own audio processing code in JS is difficult or impossible.

This was an obvious pitfall from the moment the Web Audio API was proposed by Chris Rogers (at Google, at that time). I personally fought pretty hard in the Audio WG for an API that would be based on JS audio processing (with allowance for popular effects to be replaced with browser-implemented modules). I invested enough to write a draft spec for my alternative and implement a lot of that spec in Firefox, including Worker-based JS sample processing.

My efforts went nowhere for several reasons. My views on making JS sample manipulation a priority were not shared by the Audio WG. Here's my very first response to Chris Rogers' reveal of the Web Audio draft; you can read the resulting discussion there. The main arguments against prioritizing JS sample processing were that JS sample manipulation would be too slow, and JS GC (or other non-realtime behaviour) would make audio too glitchy. Furthermore, audio professionals like Chris Rogers assured me they had identified a set of primitives that would suffice for most use cases. Since most of the Audio WG were audio professionals and I wasn't, I didn't have much defense against "audio professionals say..." arguments.

The Web Audio API proceeded mostly unchanged because there wasn't anyone other than me trying to make significant changes. After an initial burst of interest Apple's WG participation declined dramatically, perhaps because they were getting Chris Rogers' Webkit implementation "for free" and had nothing to gain from further discussion. I begged Microsoft people to get involved but they never did; in this and other areas they were (are?) apparently content for Mozilla and Google to spend energy to thrash out a decent spec that they later implement.

However, the main reason that Web Audio was eventually standardized without major changes is because Google and Apple shipped it long before the spec was done. They shipped it with a "webkit" prefix, but they evangelized it to developers who of course started using it, and so pretty soon Mozilla had to cave.

Ironically, soon after Web Audio won, the "extensible Web" became a hot buzzword. Web Audio had a TAG review at which it was clear that it was pretty much the antithesis of the "extensible Web", but by then it was too late to do anything about it.

What could I have done better? I probably should have reduced the scope of my spec proposal to exclude MediaStream/HTMLMediaElement integration. But I don't think that, or anything else I can think of, would have changed the outcome.

Monday, 11 September 2017

Sonny The Prophet

Yesterday, Sunday, as people mingled after our church service, one of our church members brought a visitor to me, introduced me to him, and quickly disappeared. Our visitor introduced himself as Sonny, "a prophet of God".

I have leadership responsibilities so the reflexive urge to palm him off on someone else was not an option. Besides, I'm not a committed cessationist so the possibility of encountering a real prophet exists, however remote...

Sonny explained that he visits different churches every week and God had instructed him to visit our church yesterday so he could bring us the word of God. Sonny noticed that we advertise a table tennis club at our church on Sunday afternoons (105 Vincent St, 3pm, all welcome!) and advised me that this is contrary to God's command to keep the Sabbath holy. I asked him why he thought so, so he opened his Bible at Exodus 20:8. I suggested that that text prohibits work, not games. He switched to Isaiah 58:13, but I continued to express doubt that that excludes table tennis (... never mind the questions of how Israelite Sabbath law and practice apply to a Gentile congregation in the new covenant).

Remaining calm, Sonny proceeded to tell me he was having a vision in which he saw maggots falling on me. At this point I thought it fair to ask Sonny, just as calmly, how he wished to confirm that he was a true prophet, not a false prophet. Unfortunately his only answer was to repeat his testimony about himself, and I was unable to investigate further because I genuinely had to leave for lunch.

I'd just taught a Sunday school class about our duty to love other Christians (John 13:34-35), our enemies (Matthew 5:43-48), and everyone (Mark 12:28), and here God presented me with someone who was possibly all three, so I did not feel able to get worked up about the situation. On the other hand I don't know how to show love to Sonny, other than praying for him, given I probably won't see him again.

Thursday, 7 September 2017

rr 5.0 Released

I've released rr 5.0. Kyle convinced me that trace stability and portability were worth a major version bump.

Release notes:

  • Introduction of Cap'n Proto to stabilize the trace format. Recordings created in this rr release should be replayable in any future rr release. This is a plan, not a promise, since we don't know what might happen in the future, but I'm hopeful.
  • New rr pack command makes recordings self-contained.
  • Packed recordings from one machine can be replayed on a different machine by trapping CPUID instructions when supported on the replay machine. We don't have much experience with this yet but so far so good.
  • Brotli compression for smaller traces and lower recording overhead.
  • Improvements to the rr command-line arguments to ease debugger/IDE integration. rr replay now accepts a -- argument; all following arguments are passed to the debugger verbatim. Also, the bare rr command is now smarter about choosing a default subcommand; if the following argument is a directory, the default subcommand is replay, otherwise it is record.
  • Performance improvements, especially for pathological cases with lots of switching to and from the rr supervisor process.
  • Syscall support expanded.
  • Many bugs fixed.

Enjoy!

Friday, 1 September 2017

rr Trace Portability

We want to be able to record an rr trace on one machine but copy it to another machine for replay. For example, you might record a failing test on one machine and copy the trace to a developer's machine for debugging. Or, you might record a failure locally and upload the trace to some cloud service for analysis. In short: on rr master, this works!

It turned out there were only two big issues to solve. We needed a way to make traces fully self-contained, because for efficiency we don't always copy all needed files into the trace during recording. rr pack addressed that. rr pack also compacts the trace by eliminating duplicate copies of the same file. Switching to brotli also reduced trace size, as did using Cap'n Proto for trace data.

The other big issue was handling CPUID instructions. We needed a way to ensure that during replay CPUID instructions returned the same results as they did during recording — they generally won't if you switch machines. Modern Intel hardware supports "CPUID faulting", i.e. you can configure the CPU to trap every time a CPUID instruction occurs. Linux didn't expose this capability to user-space, so last year Kyle Huey did the hard work of adding a Linux system-call API to expose it: the ARCH_GET/SET_CPUID subfeature of arch_prctl. It works very much like the existing PR_GET/SET_TSC, which give control over the faulting of RDTSC/RDTSCP instructions. Getting the feature into the upstream kernel was a bit of an ordeal, but that's a story for another day. It finally shipped in the 4.12 kernel.

When CPUID faulting is available, rr recording stores the results of all CPUID instructions in the trace, and rr replay intercepts all CPUID instructions and takes their results from the trace. With this in place, we're able to move traces from one machine/distro/kernel to another and replay them successfully.

We also support situations where CPUID faulting is not available on the recording machine but is on the replay machine. At the start of recording we save all available CPUID data (there are only a relatively small number of possible CPUID "leaves"), and then rr replay traps CPUID instructions and emulates them using the stored data.
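
Conceptually (this is a sketch, not rr's implementation), the stored data is just a table keyed by CPUID leaf and subleaf, which replay consults whenever it traps a CPUID instruction:

    use std::collections::HashMap;

    // The four output registers CPUID produces for a given leaf/subleaf.
    #[derive(Clone, Copy)]
    struct CpuidResult { eax: u32, ebx: u32, ecx: u32, edx: u32 }

    // Saved at the start of recording; consulted during replay.
    struct CpuidRecords(HashMap<(u32, u32), CpuidResult>);

    impl CpuidRecords {
        // Called when replay traps a CPUID instruction: return the recorded
        // result for this leaf/subleaf instead of executing the instruction.
        fn emulate(&self, leaf: u32, subleaf: u32) -> Option<CpuidResult> {
            self.0.get(&(leaf, subleaf)).copied()
        }
    }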

Caveat: the user is responsible for ensuring the destination machine supports all instructions and other CPU features used by the recorded program. At some point we could add an rr feature to mask the CPUID values reported during recording so you can limit the CPU features a recorded program uses. (We actually already do this internally so that applications running under rr believe that RTM transactional memory and RDRAND, which rr can't handle, are not available.)

CPUID faulting is supported on most modern Intel CPUs, at least on Ivy Bridge and its successor Core architectures. Kyle also added support to upstream Xen and KVM to virtualize it, and even emulate it regardless of whether the underlying hardware supports it. However, VM guests running on older Xen or KVM hypervisors, or on other hypervisors, probably can't use it. And as mentioned, you will need a Linux 4.12 kernel or later.

Tuesday, 29 August 2017

Fedora/Ubuntu Kernels Work With rr Again

Ubuntu finally released a kernel update (4.4.0-93) that fixes the regression that broke rr. It took a month after the regression was fixed upstream. The slow update cycle has been frustrating, and it's a bit worrying: I hope security fixes are treated with more urgency!

Fedora updated F25 last week (updating to a 4.12 kernel) and F26 was fixed earlier, so I think we're mostly out of the woods on this issue. I will prepare an rr 4.6.0 release.

Sunday, 27 August 2017

Igloos Are Hard

This weekend I had a bit of an adventure. My friend wanted to try making an igloo using igloo-making tools he had obtained, so we drove down to Mount Ruapehu early Saturday morning (four hours) and caught a shuttle up to the bottom of the Whakapapa ski area, then hiked through the snow for a couple of hours to reach his preferred site near the NZ Alpine Club's Ruapehu Hut, near Knoll Ridge, at an altitude of around 2040m. It's a wonderful spot, completely exposed on a rock ridge with great views of the surrounding area.

For various reasons, including congestion further down the mountain and a visit to the Knoll Ridge Cafe about 20 minutes walk from our site, we didn't start working on the igloo until about 4:30pm. A couple of hours of hard work later, we had completed at most one-third of the igloo and it was dark, getting colder, and I, for one, was exhausted. So we decided to abandon completing our igloo and just sleep out in the open in our bivvy bags, inside the ring we'd built. (A bivvy bag is a kind of tiny tent that's waterproof and insulating, just big enough to sleep in.) We boiled some water, had some hot drinks and rehydrated food, and went to bed around 8pm.

During the night light rain fell for a while, and later of course it got really cold. I was wearing a lot of clothing inside my sleeping bag and I still wasn't quite comfortably warm, and it was hard to find a comfortable sleeping position, and the bivvy bag was (by design, I think) quite stuffy, making breathing a bit more laboured than I'm used to. I did not sleep very well. The wind stayed light, which helped, but whatever transpired we would never have been in any actual danger, since the Alpine Club's hut was only about 30 metres away and the occupants would presumably have let us in if we'd begged them to save our lives.

Due to the rain and freezing temperatures overnight, when we got up around 7am our bivvy bags and the gear not inside them were all covered in ice. It took me a while to just open the zipper on my bag because of the ice — I pondered how embarrassing it would be to be trapped inside my bag! Then I realized that although I had taken into my bivvy bag everything I thought needed to be kept dry, I hadn't thought through the consequences of my other gear freezing. For example my waterproof-but-frozen camera stopped working (but fortunately after I thawed and recharged the battery, it works again). Frozen and ice-encrusted bags, shoelaces, straps, and zippers are no fun to manipulate with numb fingers!

Nevertheless it was a beautiful morning on the mountain and it wasn't too difficult to pack everything up and move out. Getting down the mountain and driving home was relatively uneventful.

This was my first time using crampons and an ice-axe. I didn't have much trouble but the conditions were not very difficult. Just as in non-icy conditions, I'm tentative going down steep slopes but hopefully I can get better at that. If I go hiking in the snow again I would definitely plan to sleep in a hut or a tent, or in an emergency a bivvy bag, not in a snow cave or igloo that require a lot of work to construct after reaching one's destination.

Tuesday, 22 August 2017

Epsom Electorate Town Hall Meeting

New Zealand's general election is in a month. Tonight I went along to a town-hall meeting with the candidates standing in the Epsom electorate where I live. I thought all the candidates were good, except perhaps New Zealand First's candidate who seemed a bit green (but he was only in his twenties). The candidates were eloquent, witty, mostly respectful, mostly made reasonable proposals to fix problems, all showed a grasp of facts and figures, and all seemed fit to serve as a Member of Parliament.

Epsom has complex electoral dynamics. It is reputedly one of the most right-wing electorates in New Zealand, so the two small parties to the right of National (the big centre-right party) focus most of their energy on winning the Epsom electorate vote; under NZ's MMP system, winning an electorate entitles a party to receive seats proportional to its party vote even if that party vote is less than 5%, in which case it would normally receive no seats. National traditionally games the system a little bit by encouraging their voters to give their party vote to National but their electorate vote to the ACT candidate, on the expectation that this helps National because the otherwise "wasted" ACT party votes will put ACT MPs in Parliament who will align with National. Therefore tonight we had candidates from National (Paul Goldsmith), Labour (David Parker), the Greens (Barry Coates), and New Zealand First (Julian Paul), and also the ACT and Conservative party leaders (David Seymour and Leighton Baker) standing as candidates. Nationwide polling has the left-wing and right-wing coalitions reasonably close at this stage, with the Greens only just short of the 5% threshold and New Zealand First over it, so all but the Conservative party have a realistic chance of being part of a governing coalition.

It's sad there are no women candidates in Epsom this time around. David Parker claimed half of Labour's candidates are women.

On the issues, the candidates mostly said what you'd expect. People seemed to agree about problems — wealth inequality, housing, transport, education, environment (especially water quality), youth suicide — except that Paul Goldsmith obviously had to paint a more optimistic picture than the others. They often (not always) disagreed about the best way to solve them. I was surprised to learn that ACT supports a carbon tax.

In several cases candidates obviously did quick research on their phones to gather data before their turn to respond to a question, or to follow up on a previous answer. That was cool.

I wanted to ask why New Zealand should sign the "TPP as is, minus USA" deal as National is proposing, given the intellectual property concessions that are mainly there for the benefit of US companies, which could only make sense for us if we got something in return from the USA ... but I didn't get a chance to ask it :-(.

Given Epsom's reputation as a haven for right-wing parties, it was interesting that when David Seymour described Labour as more left-wing than it has been for years, a lot of people cheered.

Overall I'd say the Green and Conservative candidates impressed me the most, partly because I had lower expectations for them. Leighton Baker has some interesting ideas I hadn't heard before, like offering trade-oriented high school streams, which I think sounds great except it won't fly because people overvalue university degrees. Barry Coates came across as informed and capable, so I wonder why the Greens are wasting him in Epsom. David Parker came across as a bit over-snarky but I think he made what was for me the most compelling argument of the night: that New Zealand's tax structure favours property investment over business investment and Labour will do a better job than National of fixing this.

I think New Zealanders should be pretty proud of the quality of our political system and politicians.

Monday, 14 August 2017

Public Service Announcement: "localhost" Is Not Necessarily Local

Today I learned that there exist systems (presumably misconfigured, but I'm not sure in what way) where the hostname "localhost" does not resolve locally but is sent to some remote DNS server, and then in some cases the DNS server returns a remote address (e.g. a server providing landing pages stuffed with ads).

This was breaking rr, since rr tells gdb to use (e.g.) the command "target extended-remote :1234", and apparently gdb resolves "localhost" to get the address to connect to. I've fixed rr to pass "127.0.0.1" as an explicit local address, but who knows what other software is broken in such a configuration — possibly in insecure ways?
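
If you want to check whether a machine has this problem, a quick Rust sketch like the following (using the system resolver via the standard library) will show where "localhost" actually resolves:

    use std::net::ToSocketAddrs;

    fn main() {
        match "localhost:1234".to_socket_addrs() {
            Ok(addrs) => {
                for addr in addrs {
                    // On a sanely configured system you should only see
                    // loopback addresses here (127.0.0.1 or ::1).
                    println!("{} (loopback: {})", addr.ip(), addr.ip().is_loopback());
                }
            }
            Err(e) => eprintln!("failed to resolve localhost: {}", e),
        }
    }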

Sunday, 13 August 2017

When Virtue Fails

This quote, popularly (but incorrectly) attributed to Marcus Aurelius, proposes indifference to religion:

Live a good life. If there are gods and they are just, then they will not care how devout you have been, but will welcome you based on the virtues you have lived by. If there are gods, but unjust, then you should not want to worship them. If there are no gods, then you will be gone, but ... will have lived a noble life that will live on in the memories of your loved ones.

But this raises the question — what if you fail to live a truly good life, in the eyes of the god(s)?

The gospel — literally, the good news — about Jesus is that indeed we all fall short, but God sent Jesus into the world to take the punishment that we deserve, and through him we can receive forgiveness.

There is, of course, a catch. We have to accept that forgiveness, and that requires taking Jesus seriously. Thus pseudo-Aurelian indifference breaks down.

Monday, 7 August 2017

Stabilizing The rr Trace Format With Cap’n Proto

In the past we've modified the rr trace format quite frequently, and there has been no backward or forward compatibility. In particular most of the time when people update rr — and definitely when updating between releases — all their existing traces become unreplayable. This is a problem for rr-based services, so over the last few weeks I've been fixing it.

Prior to stabilization I made all the trace format updates that were obviously already desirable. I extended the event counter to 64 bits since a pathological testcase could overflow 2^31 events in less than a day. I simplified the event types to eliminate some unnecessary or redundant events. I switched the compression algorithm from zlib to brotli.

Of course it's not realistic to expect that the trace format is now perfect and won't ever need to be updated again. We need an extensible format so that future versions of rr can add to it and still be able to read older traces. Enter Cap’n Proto! Cap’n Proto lets us write a schema describing types for our trace records and then update that schema over time in constrained ways. Cap’n Proto generates code to read and write records and guarantees that data using older versions of the schema is readable by implementations using newer versions. (It also has guarantees in the other direction, but we're not planning to rely on them.)

This has all landed now, so the next rr release should be the last one to break compatibility with old traces. I say should, because something could still go wrong!

One issue that wasn't obvious to me when I started writing the schema is that rr can't use Cap’n Proto's Text type — because that requires text be valid UTF-8, and most of rr's strings are data like Linux pathnames which are not guaranteed to be valid UTF-8. For those I had to use the Data type instead (an array of bytes).

Another interesting issue involves choosing between signed and unsigned integers. For example a file descriptor can't be negative, but Unix file descriptors are given type int in kernel APIs ... so should the schema declare them signed or not? I made them signed, on the grounds that we can then check while reading traces that the values are non-negative, and when using the file descriptor we don't have to worry about the value overflowing as we coerce it to an int.

I wrote a microbenchmark to evaluate the performance impact of this change. It performs 500K trivial (non-buffered) system calls, producing 1M events (an 'entry' and 'exit' event per system call). My initial Cap’n Proto implementation (using "packed messages") slowed rr recording down from 12 to 14 seconds. After some profiling and small optimizations, it slows rr recording down from 9.5 to 10.5 seconds — most of the optimizations benefited both configurations. I don't think this overhead will have any practical impact: any workload with such a high frequency of non-buffered system calls is already performing very poorly under rr (the non-rr time for this test is only about 20 milliseconds), and if it occurred in practice we'd buffer the relevant system calls.
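
For what it's worth, the workload is roughly the following shape (a sketch, not the actual harness; which system calls rr leaves unbuffered varies, so getppid here is just an assumption for illustration):

    fn main() {
        for _ in 0..500_000 {
            unsafe {
                // A cheap, side-effect-free system call; each one produces an
                // 'entry' and an 'exit' event in the trace when unbuffered.
                libc::syscall(libc::SYS_getppid);
            }
        }
    }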

One surprising datum is that using Cap’n Proto made the event data significantly smaller — from 7.0MB to 5.0MB (both after compression with brotli-5). I do not have an explanation for this.

Another happy side effect of this change is that it's now a bit easier to read rr traces from other languages supported by Cap’n Proto.

Monday, 31 July 2017

Selecting A Compression Algorithm For rr

rr's traces are large. Memory-mapped files account for a lot of that, but the most efficient way to handle them is "zero copy" file cloning so we don't want to compress them during recording. Most of the rest is recordings of data copied into tracee address spaces by the kernel, plus snapshots of registers, and these data are extremely compressible — often containing long runs of zeroes, for example. For a long time rr has used zlib to compress the non-mapped-file trace data, and zlib's 'deflate' algorithm often achieves compression of 8x or more.

Of course zlib is pretty old and significantly better algorithms exist, so now seems like a good time to reevaluate that decision. I used the Squash framework to compare zlib to two contenders, brotli and zstd, on actual rr trace data: a single, fairly short Firefox run, 828MB uncompressed, treated as independent 1MB chunks because that's what rr does. Here are the results:

I've omitted compression levels that took more than 20 seconds to compress the data. Currently rr uses zlib level 6, which takes just over 12 seconds to compress the data. Data compression occurs in parallel with the rest of recording, and uses multiple cores when it needs to, so in practice is seldom a performance bottleneck.

On this data, both brotli and zstd beat zlib by a significant margin, so we're definitely leaving some performance on the table by sticking with zlib. In particular, given the same time budget, zstd can shave 14% off the size of the non-mapped-file trace data, and brotli can shave off 17%. Alternatively, for the same trace size we could use much less CPU time — zstd level 1 compresses slightly better than zlib level 6, at 10x the speed!

For rr I think brotli level 5 is an obvious choice. For some reason there's a sudden improvement in compression at level 5, where it passes zstd and reaches roughly its optimal compression given a reasonable time budget. At level 5 we're shaving 17% off the current size and also taking 32% off the CPU time.

Apart from the performance, brotli also has a better licensing story. zstd has Facebook's standard patent license, which terminates if you sue Facebook for any patent infringement, and some organisations aren't comfortable with that. Apparently people have done patent searches and haven't found any Facebook patents covering zstd, but that is not wholly reassuring (while also being mystifying — if they're not applying for relevant patents, why not make that clear?). On the other hand, Google has made a clear commitment to license brotli royalty-free with no such conditions. Of course there could be third-party patents, but if they became a problem it would be easy to switch rr's algorithm (especially compared to the trouble they would cause for Web sites and browsers!).

Of course there are lots of other compression algorithms I could evaluate, but I guess if there are any further gains to be had, they would be very minor.

Update: Unfortunately Ubuntu doesn't have a brotli library package. (Fedora does.) So, using brotli would mean everyone building rr on Ubuntu has to build brotli themselves first, or we vendor brotli into rr (or we do something truly insane like have rr pull and build brotli at build time if necessary). None of these approaches are appealing :-(. I guess there's also "rewrite rr in Rust so we can use cargo to have reasonable dependency management", which is appealing but not currently practical.

I'm leaning towards vendoring brotli into rr.

Saturday, 29 July 2017

Upstream Stable Kernels Work With rr Again

Greg K-H has released stable Linux kernels 3.18.63, 4.4.79, 4.9.40, and 4.12.4, containing a (backout) fix for the regression that broke rr. 4.13-rc2 also contains the fix. 4.11 was just declared end-of-life so it will not ever be officially fixed.

Obviously distros still have to produce kernel updates containing the fix, so we're not quite out of the woods yet, but that should be soon.

I'm holding off doing the rr 4.6.0 release until distro updates that work with rr have been out for a little while. To the (limited) extent possible I'd like to avoid people trying rr while it doesn't work on their kernel.

Thursday, 27 July 2017

Let's Never Create An Ad-Hoc Text Format Again

Recently I needed code to store a small amount of data in a file. Instinctively I started doing what I've always done before, which is create a trivial custom text-based format using stdio or C++ streams. But at that moment I had an epiphany: since I was using Rust, it would actually be more convenient to use the serde library. I put the data in a custom struct (EpochManifest), added #[derive(Serialize, Deserialize)] to EpochManifest, and then just had to write:

    let f = File::create(manifest_path).expect("Can't create manifest");
    serde_json::to_writer_pretty(f, &manifest).unwrap();
and
    let f = File::open(&p).expect(&format!("opening {:?}", &p));
    let manifest: EpochManifest = serde_json::from_reader(f).unwrap();

This is more convenient than hand-writing even the most trivial text (un)parser. It's almost guaranteed to be correct. It's more robust and maintainable. If I decided to give up on human readability in exchange for smaller size and faster I/O, it would only take a couple of changed lines to switch to bincode's compact binary encoding. It prevents the classic trap where the stored data grows in complexity and an originally simple ad-hoc text format evolves into a baroque monstrosity.
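
For completeness, here's roughly what the surrounding definitions look like; the EpochManifest fields here are made up for illustration, and this assumes serde (with the derive feature) and serde_json as dependencies.

    use std::fs::File;
    use serde::{Deserialize, Serialize};

    #[derive(Serialize, Deserialize)]
    struct EpochManifest {
        epoch: u64,               // hypothetical fields,
        trace_files: Vec<String>, // for illustration only
    }

    fn main() {
        let manifest = EpochManifest { epoch: 1, trace_files: vec!["trace0".into()] };
        let f = File::create("manifest.json").expect("Can't create manifest");
        serde_json::to_writer_pretty(f, &manifest).unwrap();

        let f = File::open("manifest.json").expect("opening manifest.json");
        let roundtrip: EpochManifest = serde_json::from_reader(f).unwrap();
        assert_eq!(roundtrip.epoch, manifest.epoch);
    }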

There are libraries to do this sort of thing in C/C++ but I've never used them, perhaps because importing a foreign library and managing that dependency is a significant amount of work in C/C++, whereas cargo makes it trivial in Rust. Perhaps that's why the ISO C++ wiki page on serialization provides lots of gory details about how to implement serialization rather than just telling you to use a library.

As long as I get to keep using Rust I should never create an ad-hoc text format again.

Monday, 17 July 2017

Confession Of A C/C++ Programmer

I've been programming in C and C++ for over 25 years. I have a PhD in Computer Science from a top-ranked program, and I was a Distinguished Engineer at Mozilla where for over ten years my main job was developing and reviewing C++ code. I cannot consistently write safe C/C++ code. I'm not ashamed of that; I don't know anyone else who can. I've heard maybe Daniel J. Bernstein can, but I'm convinced that, even at the elite level, such people are few and far between.

I see a lot of people assert that safety issues (leading to exploitable bugs) with C and C++ only afflict "incompetent" or "mediocre" programmers, and one need only hire "skilled" programmers (such as, presumably, the asserters) and the problems go away. I suspect such assertions are examples of the Dunning-Kruger effect, since I have never heard them made by someone I know to be a highly skilled programmer.

I imagine that many developers successfully create C/C++ programs that work for a given task, and no-one ever fuzzes or otherwise tries to find exploitable bugs in those programs, so those developers naturally assume their programs are robust and free of exploitable bugs, creating false optimism about their own abilities. Maybe it would be useful to have an online coding exercise where you are given some apparently-simple task, you write a C/C++ program to solve it, and then your solution is rigorously fuzzed for exploitable bugs. If any such bugs are found then you are demoted to the rank of "incompetent C/C++ programmer".

Saturday, 15 July 2017

Usenix ATC 2017

During the last few days I attended the Usenix ATC 2017 conference. This is a fairly eclectic but mostly systems-focused conference, largely academic research but with a smattering of other sorts of projects.

On Thursday I presented my talk about rr. I only had twenty minutes, and my usual rr talk is more like an hour, so I cut a lot and talked fast, but apparently it came across reasonably well. There were some good questions, and Brendan Dolan-Gavitt was kind enough to slip in a mention that rr has saved his colleague a month of work. They apparently have a pretty good rr-based workflow for diagnosing divergence bugs in their PANDA QEMU-based record and replay system. A number of people approached me before and after the talk to discuss rr and its relationship to their projects.

One particularly relevant project presented at the conference was H3, a record-and-replay project following on from previous work by the same group. They do efficient multicore record and replay by replaying threads on all available cores, gathering observations of control flow using Intel's Processor Trace, and then formulating the problem of matching reads from shared memory with their associated writes as a constraint system which they then solve using Z3. One nice thing about this approach is that they can come up with replay behaviors involving weaker memory models than sequential consistency. They get good results on small programs but the scalability of their approach to really large applications is still unproven. I think this line of research has potential, because there are all sorts of ways to improve it: gathering more kinds of observations (especially data values), being more selective about which observations to gather, or introducing periodic stop-the-world synchronization to simplify the constraint sets. It might also be possible to combine this technique with MMU-based page ownership approaches, so that for pages that see little sharing (mostly accessed by one thread at a time) no constraint solving is required, but constraint solving is used for pages that are intensively shared. Partly because of my discussions with this group, I've gradually become more optimistic about the prospects for multicore record-and-replay on commodity hardware, though there's a lot more work to be done.

It's hard for me to make really accurate judgements about the bulk of the research presented, because most of it was not in areas I know a lot about, but it seemed to me that like most academic conferences there were too many papers solving artificial problems that probably don't matter. Also like most academic conferences, there were no negative results — apart from, as usual, the introductions of papers that described the shortcomings of previous research being addressed in the new papers. This needs to change.

I met some new people that I was hoping to meet, but also caught up with some old friends and acquaintances — Angela Demke Brown, Bianca Schroeder, Jim Larus, Carl Waldspurger and others. It was a good time.

An Inflection Point In The Evolution Of Programming Languages

Two recent Rust-related papers are very exciting.

RustBelt formalizes (a simplified version of) Rust's semantics and type system and provides a soundness proof that accounts for unsafe code. This is a very important step towards confidence in the soundness of safe Rust, and towards understanding what it means for unsafe code to be valid — and building tools to check that.

This systems paper is about exploiting Rust's remarkably strong control of aliasing to solve a few different OS-related problems.

It's not often you see a language genuinely attractive to the systems research community (and programmers in general, as the Rust community shows) being placed on a firm theoretical foundation. (It's pretty common to see programming languages being advocated to the systems community by their creators, but this is not that.) Whatever Rust's future may be, it is setting a benchmark against which future systems programming languages should be measured. Key Rust features — memory safety, data-race freedom, unique ownership, and minimal space/time overhead, justified by theory — should from now on be considered table stakes.

Thursday, 6 July 2017

Bay Area Progress Report

I'm in the middle of a three-week work trip to the USA.

Monday last week I met with a couple of the people behind the Julia programming language, who have been using and contributing to rr. We had good discussions and it's good to put faces to names.

Monday night through to Friday night I was honoured to be a guest at Mozilla's all-hands meeting in San Francisco. I had a really great time catching up with a lot of old friends. I was pleased to find more Mozilla developers using rr than I had known about or expected; they're mostly very happy with the tool and some of them have been using esoteric features like chaos mode with success. We had a talk about rr and I demoed some of the new stuff Kyle and I have been working on, and talked about which directions might be most useful to Mozilla.

Saturday through Monday I went camping in Yosemite National Park with some friends. We camped in the valley on Saturday night, then on Sunday hiked down from Tioga Road (near the Lukens Lake trailhead) along Yosemite Creek to north of Yosemite Falls and camped there overnight. The next day we went up Eagle Peak for a great view over Yosemite Valley, then hiked down past the Falls back to the valley. It's a beautiful place and rather unlike any New Zealand tramping I've done — hot, dry, various sorts of unusual animals, and ever-present reminders about bears. There were a huge number of people in the valley for the holiday weekend!

Tuesday was a bit more relaxed. It being the 4th of July, I spent the afternoon with friends playing games — Bang and Lords of Waterdeep, two of my favourites.

Today I visited Brendan and his team at Brave to talk about rr. On Friday I'll give a talk at Stanford. On Monday I'll be up in Seattle giving a talk at the University of Washington, then on Tuesday I'll be visiting Amazon to talk about the prospects for rr in the cloud. Then on Wednesday through Friday I'll be attending Usenix ATC in Santa Clara and giving yet another rr talk! On Saturday I'll finally go home.

I really enjoy talking to people about my work, and learning more about people's debugging needs, and I really really enjoy spending time with my American friends, but I do miss my family a lot.

Friday, 30 June 2017

Patch On Linux Kernel Stable Branches Breaks rr

A change in 4.12rc5 breaks rr. We're trying to get it fixed before 4.12 is released, and I think that will be OK. Unfortunately that change has already been backported to 3.18.57, 4.4.72, 4.9.32 and 4.11.5 :-( (all released on June 14, and I guess arriving in distros a bit later). Obviously we'll try to get the 4.12 fix also backported to those branches, but that will take a little while.

The symptoms are that long, complex replays fail with "overshot target ticks=... by N", where N is generally a pretty large number (> 1000). If you look in the trace file, the value N will usually be similar to the difference between the target ticks and the previous ticks value for that task --- i.e. we tried to stop after N ticks but we actually stopped after about N*2 ticks. Unfortunately rr's own tests don't seem to be affected, so they don't catch the problem.

I'm not sure if there's a reasonable workaround we can use in rr, or if there is one, whether it's worth effort to deploy. That may depend on how the conversation with upstream goes.

Friday, 23 June 2017

Rising Tolerance For Static Analysis False Positives?

When I was a young graduate student working on static analysis tools, conventional wisdom was that a static analysis tool needed to have a low false-positive rate or no-one would use it. Minimizing false positives was a large chunk of the effort in every static analysis project.

It seems that times have changed and there are now communities of developers willing to tolerate high false positive rates, at least in some domains. For example:

It will also miss really obvious bugs apparently at random, and flag non-bugs equally randomly. The price of the tool is astronomical, but it's still worthwhile if it catches bugs that human developers miss.
Indeed, I've noticed people in various projects ploughing through reams of false positives in case there are any real issues.

I'm not sure what changed here. Perhaps people have just become more appreciative of the value of subtle bugs caught by static analysis tools. Maybe it's a good time to be a developer of such tools.

Monday, 19 June 2017

Lazy Religion Tropes In Mass Media

I'm enjoying season two of Orphan Black. I hope Tatiana Maslany got paid a separate salary for each role, because she deserves every penny. Unfortunately the show falls down where so many others do: every religiously-identified person is either a chump or a psychopath. Disappointing, but expected; with few exceptions, the Christians portrayed in mass media are like none I've ever known.

I've just finished China Miéville's The Last Days Of New Paris. It's a bit over-Miévillian for me: one hundred and fifty pages of rampant imagination, but too many clever linguistic obscurities for maximal reading enjoyment. It hosts a very crisp example of the "hell exists but heaven does not" trope. Miéville frames it explicitly:

Everyone knew it helped to have a priest perform certain absurdities of his trade if it was demons you had to fight. "Why?" Thibaut asked Élise when they left again. "Why do you think it does work? It's not as if any of this stuff is true."
(Miéville's Parisian demons are explicitly of Hell.) Isn't this ... unfair? Yet it's a very common trope. Likewise I remain a huge fan of Buffy The Vampire Slayer, but it feels like a raw deal that the heroes wield crucifixes but the few ostensibly "religious" characters are villains.

I know, I know; the only way to fix this is to write my own novels and hit TV show. I'll get onto it.

Thursday, 15 June 2017

Is The x86 Architecture Sustainable?

My earlier post pointed to one strange little corner of the x86 instruction set. There is much, much more where that came from. My latest copy of Intel's "Combined Volumes" Software Developer’s Manual runs to 4,684 pages and it doesn't include new features they've announced such as CET. There's a good paper here describing some of the new features and the increasing complexity they bring. It also touches on the "feature interaction problem"; i.e., the complexity of the ISA grows superlinearly as Intel adds more features. One aspect it doesn't touch on is how many legacy compatibility features x86 processors carry, which continue to interact with new features. One of the key risks for Intel and its ecosystem is that many of the new features may end up being little-used — just like many of the legacy compatibility features — so their complexity ends up being "dead weight".

Is there a problem here for anyone other than Intel? Yes. Unnecessarily complex hardware costs us all in the end because those CPUs won't be as efficient, cheap, reliable or secure as they otherwise could be — especially as we seem to be hitting a process wall so that we can't rely on increasing transistor density to bail us out. Another issue is that complex ISAs make ISA-dependent tools — assemblers, disassemblers, debuggers, binary instrumentation tools, and even hypervisors and kernels — more difficult to build. Super-complex ISAs encourage software to depend on obscure features unnecessarily, which builds a competitive moat for Intel but makes competition at the CPU level more difficult.

The obvious technically-appealing approach — starting over with a clean-sheet architecture — by itself doesn't solve these problems in the long run (or the short run!). We have to acknowledge that architectures will inevitably grow features in response to market demand (or perceived value) and that some of those features will turn into legacy baggage. The question is, can we adjust the ecosystem to make it possible to drop legacy baggage, and is it worth doing so?

I'm not an ARM expert but the phone and embedded markets already seem to have moved in this direction. Phones use a plethora of ARM derivatives with different architectural features (though in spite of that, ARM has accumulated its own share of weirdnesses!). Can we do it in the cloud/desktop/laptop spaces? I don't see why not. Even software that reaches down to a relatively low level like a Web browser (with a complex tiered JIT, etc) doesn't rely on many complex legacy architectural features. Porting Firefox to a brand new processor architecture is a lot of work — e.g., implementing new JIT backends, FFI glue, and translating a pile of handwritten assembler (e.g. in codecs) to the new architecture. However, porting Firefox to an x86 or ARM variant minus unnecessary legacy baggage would be a lot easier. Your variant may need kernel and bootloader changes but that's certainly doable for Linux derivatives.

So, I think there's room in the market for architectures which may or may not be based on existing architectures but lack legacy features and offer less-than-total binary compatibility with a popular architecture. Perhaps the forthcoming ARM Snapdragon Windows laptops are a step in that direction.

Update: Many commenters on Hacker News suggested that x86 complexity is a non-issue because decoding legacy instructions to uops requires insignificant die area. But this really misses the point: x86 complexity is not just about legacy instructions which can be decoded down to some microcode. It's about architectural features like SGX, CET, MPX, TSX, VT — plus the legacy stuff like segment registers and 286 call gates and virtual 8086 mode and so on and so on. It's about the processor state required to support those features, how those features all interact with each other, and how they increase the complexity of context switching, OS support, and so on.

New "rr pack" Command

I think there's huge potential to use rr for debugging cloud services. Apparently right now interactive debugging is mostly not used in the cloud, which makes sense — it's hard to identify the right process to debug, much less connect to it, and even if you could, stopping it for interactive analysis would likely interfere too much with your distributed system. However, with rr you could record any number of process executions without breaking your system, identify the failed runs after the fact, and debug them at your leisure.

Unfortunately there are a couple of problems making that difficult right now. One is that the largest cloud providers don't support the hardware performance counter rr needs. I'm excited to hear that Amazon has recently enabled some HW performance counters on dedicated hosts — hopefully they can be persuaded to add the retired-conditional-branch counter to their whitelist (and someone can fix the Xen PMU virtualization bug that breaks rr). Another problem is that rr's traces aren't easy to move from one machine to another. I've started addressing this problem by implementing a new rr command, rr pack.

There are two problems with rr traces. One is that on filesystems that do not support "reflink" file copies, to keep recording overhead low we sometimes hardlink files into the trace, or for system libraries we just assume they won't change even if we can't hardlink them. This means traces are not fully self-contained in the latter case, and in the former case the recording can be invalidated if the files change. The other problem is that every time an mmap occurs we clone/link a new file into the trace, even if a previous mmap mapped the same file, because we have no fast way of telling if the file has changed or not. This means traces appear to contain large numbers of large files but many of those files are duplicates.

rr pack fixes both of those problems. You run it on a trace directory in-place. It deduplicates trace files by computing a cryptographic hash (BLAKE2b, 256 bits) of each file and keeping only one file for any given hash. It identifies files the trace needs that live outside the trace directory (or are hardlinks to files outside it) and copies them into the trace directory. It rewrites trace records (the mmaps file) to refer to the new files, so the trace format hasn't changed. You should be able to copy around the resulting trace, and modify any files outside the trace, without breaking it. I tried pretty hard to ensure that interrupted rr pack commands leave the trace intact (using fsync and atomic rename); of course, an interrupted rr pack may not fully pack the trace so the operation should be repeated. Successful rr pack commands are idempotent.
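The deduplication step is conceptually simple. Here's a rough sketch of the idea in C (not rr's actual code, and with a trivial FNV-1a hash standing in for BLAKE2b): hash each file's contents, and when two files hash the same, keep one copy and point the duplicate name at it, here via a hardlink.

    #include <stdint.h>
    #include <stdio.h>
    #include <unistd.h>

    /* Stand-in for BLAKE2b: 64-bit FNV-1a over the whole file. Returns 0 on error.
       (A real implementation needs a cryptographic hash so that equal hashes
       really do mean equal contents.) */
    static uint64_t hash_file(const char *path) {
      FILE *f = fopen(path, "rb");
      if (!f) return 0;
      uint64_t h = 1469598103934665603ULL;
      int c;
      while ((c = fgetc(f)) != EOF) {
        h ^= (uint64_t)c;
        h *= 1099511628211ULL;
      }
      fclose(f);
      return h;
    }

    /* If 'dup' has the same contents as 'keep', replace it with a hardlink to
       'keep'. rr pack additionally uses fsync and atomic rename so an interrupted
       run leaves the trace intact; this sketch skips that. */
    static int dedup_pair(const char *keep, const char *dup) {
      uint64_t hk = hash_file(keep), hd = hash_file(dup);
      if (hk == 0 || hk != hd) return -1;
      if (unlink(dup) != 0) return -1;
      return link(keep, dup);
    }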

We haven't really experimented with trace portability yet so I can't say how easy it will be to just zip up a trace directory and replay it on a different computer. We know that currently replaying on a machine with different CPUID values is likely to fail, but we have a solution in the works for that — Kyle's patches to add ARCH_SET_CPUID to control "CPUID faulting" are in Linux kernel 4.12 and will let rr record and replay CPUID values.

How I Found A 20-Year-Old Linux Kernel Bug

Recently I improved the rr tests to test that the kernel doesn't write more memory than we expect during system calls. We place syscall output buffers at the ends of pages so that the end of the buffer is immediately followed by an unmapped page. If the kernel reads or writes too much memory then the system call will fail with EFAULT (or something worse will happen).
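For the curious, the placement trick looks roughly like this (a simplified sketch, not the actual rr test code): map two pages, make the second one inaccessible, and hand out a buffer that ends exactly at the boundary.

    #include <assert.h>
    #include <stddef.h>
    #include <sys/mman.h>
    #include <unistd.h>

    /* Return a buffer of 'size' bytes whose last byte is the last byte of a
       mapped page. The following page is PROT_NONE, so if the kernel writes
       (or reads) even one byte past the end of the buffer, the system call
       fails with EFAULT instead of silently touching other memory. */
    static void *end_of_page_buffer(size_t size) {
      size_t page = (size_t)sysconf(_SC_PAGESIZE);
      assert(size <= page);
      char *p = mmap(NULL, 2 * page, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
      if (p == MAP_FAILED) return NULL;
      mprotect(p + page, page, PROT_NONE);  /* guard page */
      return p + page - size;
    }

Pass the result as the syscall's output buffer and any overrun becomes an immediate, visible failure.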

This found a few bugs in rr's assumptions about how much memory the kernel reads and writes. Interestingly, it also exposed a very old kernel bug: certain wireless ioctls are supposed to take a 32-byte memory parameter but the kernel actually fails with EFAULT if less than 40 bytes are available. (The extra bytes are not modified.) I tried to be a good citizen by reporting the bug and I'm pleased to say that it's actually being fixed!

The bug was apparently introduced in Linux 2.1.15, released December 12, 1996. It's interesting that it wasn't found and fixed until now. I guess not many programs use these ioctls, and those that do probably use buffers that are always followed by at least eight more bytes of data, e.g. any buffer on the stack. Then again, if you wrote a program that allocated those buffers using "malloc", I guess once in a while it would fail when your allocator happened to land one at the end of a page.

This class of bugs --- "small overrunning read that doesn't get used" --- could also be a problem in user-space. I don't recall seeing other bugs in this class, though.

Friday, 9 June 2017

Another Case Of Obscure CPU Nondeterminism

One of "big bets" of rr is that commodity CPUs running user-space code really are deterministic in practice, or at least the nondeterminism can be efficiently detected and controlled, without restorting to pervasive binary instrumentation. We're mostly winning that bet on x86 (but not ARM), but I'm uncomfortable knowing that otherwise-benign architectural changes could break us at any time. (One reason I'm keen to increase rr's user base is make system designers care about not breaking us.)

I recently discovered the obscure XGETBV instruction. This is a problematic instruction for rr because it returns the contents of internal system registers that we can't guarantee will be unchanged from run to run. (It's a bit weird that the contents of these registers are allowed to leak to userspace, but anyway...) Fortunately it's barely used in practice; the only use I've seen of it so far is by the dynamic loader ld.so, to return the contents of the XINUSE register. This obscure register tracks, for each task, whether certain CPU components are known to be in their default states. For example if your x86-64 process has never used any (legacy floating-point) x87 instructions, bit 0 of XINUSE should be clear. That should be OK for rr, since whether certain kinds of instructions have executed should be the same between recording and replay. Unfortunately I have observed cases where the has-used-x87 bit gets unexpectedly flipped on; by singlestepping the tracee it's pretty clear the bit is being set in code that has nothing to do with x87, and the point where it is set, or if it happens at all, varies from run to run. (I have no way to tell whether the bit is being set by hardware or the kernel.) This is disturbing because it means that XGETBV is effectively a nondeterministic instruction. Fortunately, the ld.so users are just testing to see whether AVX has been used, so they mask off the x87 bit almost immediately, eliminating the nondeterminism from the program state before it can do any damage (well, unless you were super unlucky and took a context switch between XGETBV and the mask operation). I haven't seen any bits other than the x87 bit being unexpectedly set.
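For reference, here's roughly what the check looks like; a sketch using GCC/clang inline assembly, and assuming a CPU that reports support for XGETBV with ECX=1 (on older CPUs that form of the instruction faults):

    #include <stdint.h>
    #include <stdio.h>

    /* XGETBV with ECX=1 returns XCR0 ANDed with XINUSE: a bitmap of which state
       components have left their initial configuration. Bit 0 = x87, bit 1 =
       SSE (XMM), bit 2 = AVX (YMM). */
    static uint64_t read_xinuse(void) {
      uint32_t lo, hi;
      __asm__ volatile("xgetbv" : "=a"(lo), "=d"(hi) : "c"(1));
      return ((uint64_t)hi << 32) | lo;
    }

    int main(void) {
      uint64_t x = read_xinuse();
      /* ld.so-style check: has AVX state been touched? Masking off bit 0 is
         what (mostly) hides the x87 nondeterminism described above. */
      printf("XINUSE = %#llx, AVX in use: %d\n",
             (unsigned long long)x, (int)((x >> 2) & 1));
      return 0;
    }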

Fortunately it seems we can work around the problem completely in rr by setting the x87-in-use bit immediately after every exec. If the bit is always set, it doesn't matter if something intermittently sets it again. Another day, another bullet dodged!

Thursday, 1 June 2017

WebAssembly: Mozilla Won

Mozilla staff are being very diplomatic and restrained by allowing WebAssembly to be portrayed as a compromise between the approaches of asm.js and PNaCl. They have good reasons for being so, but I can be a bit less restrained. asm.js and PNaCl represented quite different visions for how C/C++ code should be supported on the Web, and I think WebAssembly is a big victory for asm.js and Mozilla's vision.

Considered as a Web platform feature, PNaCl had three major problems from the beginning, two of which it inherited from NaCl. By design, in NaCl and PNaCl, application code was highly isolated from the world of Javascript and HTML, running in its own process and behaving much like an opaque plugin with very limited interactions with its host page (since any interactions would have to happen over IPC). This led to a much bigger problem, which is that to interact with the outside world (P)NaCl code needed some kind of platform API, and Google decided to create an entirely brand-new platform API — "Pepper" — which necessarily duplicated a lot of the functionality of standard Web platform APIs. To make things even worse, neither Pepper nor PNaCl's LLVM-based bytecode had a proper specification, let alone one with multi-vendor buy-in.

Therefore any non-Chrome-based browser wishing to implement PNaCl would have had to reverse-engineer a Chromium-bug-compatible Pepper spec and reimplement it, or more likely import a large amount of Chromium code to implement Pepper/NaCl. Likewise they'd have had to import a large amount of LLVM code for the PNaCl compiler. Both imports would have to stay in sync with whatever Google did. This would mean lots more code bloat, maintenance and spec work, and more work for Web developers too, not to mention being a severe blow to Web standards. Mozilla people (including me) explained the unacceptability of all this to relevant Google people early on, but to no avail.

Mozilla responded with the asm.js project to show that one could achieve similar goals with minimal extensions to the standard-based Web platform. asm.js sets up a Javascript array to represent memory and compiles C/C++ to Javascript code operating on the array. It avoids those big PNaCl issues: asm.js code can interact with JS and the DOM easily since it shares the JS virtual machine; it specifies no new platform APIs, instead relying on the (already standardized and interoperably implemented) Web platform APIs; and very little new specification work had to be done, since it mostly relies on already-specified JS semantics.

In these key areas WebAssembly follows asm.js, not PNaCl. WebAssembly applications or components interact freely with JS and AFAIK in all browsers WebAssembly is implemented as part of the JS VM. WebAssembly defines no new platform APIs other than some APIs for loading and linking WebAssembly code, relying on standards-based Web APIs for everything else. WebAssembly differs from asm.js by defining a bytecode format with some new operations JS doesn't have, so some spec work was required (and has been done!). Like asm.js, WebAssembly application call-stacks are maintained by the JS VM, outside the memory addressable by the application, which reduces the exploitability of application bugs. (Though, again like asm.js and unlike PNaCl, the compiler is trusted.)

I'm not belittling the contributions of Google and others to WebAssembly. They have done a lot of good work, and ditching PNaCl was an admirable decision for the good of the Web — thank you! Often in these contests proclaiming a "winner" is unimportant or even counterproductive, and it's true that in a sense Web developers are the real winners. I'm calling it out here because I think the good and essential work that Mozilla continues to do to improve the standards-based Web platform is too often overlooked. I also want credit to go to Mozilla's amazing asm.js/WebAssembly team, who went up against Google with far fewer resources but a better approach, and won.

Tuesday, 30 May 2017

Should Debuggers Report Idempotent Writes?

gdb's data watchpoints are designed to only trigger when a written value actually changes; i.e., writes to memory that leave the value unchanged are deliberately ignored. (rr has to follow gdb's behavior.) I don't know why gdb does this, but it may be for historical reasons: it's impractical for a normal debugger to observe all possible writes to memory (e.g. those performed by the kernel), so a traditional strategy for implementing watchpoints is simply to re-read the watched location frequently and report changes, and obviously this can't detect idempotent updates.

Over the last year Kyle Huey and I have been building a brand-new record-and-replay-based debugger. One of the most intellectually challenging aspects of this project is sorting through the decisions made by existing debuggers, distinguishing those which are truly desirable from those made under the duress of implementation constraints that no longer apply. Handling of idempotent writes is one such decision. During rr replay we can observe all memory writes, so reporting all idempotent writes is perfectly feasible.

There are some clear cases where idempotent writes should not be ignored. For example, consider a series of printfs repeatedly overwriting the contents of an output buffer. If, at the end of the series, the user wants to understand how the final value of some byte in the buffer was obtained, we should report the last printf to write to that location, even if the byte value it happened to write was the same as that written to the location by some previous printf. Reporting the previous printf would be hopelessly confusing --- though this is, in fact, what rr+gdb do.
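A trivial illustration, assuming a gdb watchpoint on buf[0]:

    #include <string.h>

    char buf[16];

    int main(void) {
      strcpy(buf, "hello");  /* watchpoint fires: buf[0] changes from 0 to 'h' */
      strcpy(buf, "help!");  /* no report for buf[0]: it is rewritten with 'h' */
      return 0;
    }

Ask where the final value of buf[0] came from and you get pointed at the first strcpy, even though the second one was the last write to that byte.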

On the other hand, I'm cautious. There may be some good reason to ignore idempotent writes that we just haven't figured out yet. It could possibly involve optimization — an optimizing compiler is free to introduce spurious writes if it knows they'll be idempotent — but I haven't seen or thought of a plausible example where this would be a problem.

Thursday, 25 May 2017

A Couple Of Papers About Commodity Multicore Record And Replay, And A Possible Way Forward

Efficient record-and-replay of parallel programs with potential data races on commodity multicore hardware is a holy-grail research problem. I think that the best papers in this area so far, though they're a bit old, are SMP-Revirt and DoublePlay.

DoublePlay is a really interesting approach but I'm skeptical it would work well in practice. Without going into too many details, for parallel applications that scale well the minimum overhead it can obtain is 100%. Obtaining that overhead relies on breaking application execution into epochs that are reasonably long, but externally-visible output must be suppressed within each epoch, which is going to be difficult or impossible to do for many applications. For example, for the Apache Web server they buffer all network output in the kernel during an epoch, but this could cause problematic side effects for a more complicated application.

Therefore I think the SMP-Revirt approach is more robust and more likely to lead to a practical system. This is a fairly simple idea: apply the well-known concurrent-read, exclusive-write (CREW) model at the page level, using hardware page protections to trigger ownership changes. If you imagine a herd of single-threaded processes all sharing a memory block, then each page is mapped read-only in all processes (unowned), or read-write in one process (the owner) and no-access in all the others. When a process writes to a page it doesn't own, a trap occurs and it takes ownership; when it reads from a page that has a different owner, a trap occurs and the page becomes unowned. SMP-Revirt worked at the hypervisor level; dthreads does something similar at user level.
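To make the mechanics concrete, here's a toy single-process sketch of the write side of that transition, using mprotect and a SIGSEGV handler. It's nothing like a real implementation (no logging, no other owners, no cross-CPU coordination); it just shows the take-ownership-on-fault move.

    #define _GNU_SOURCE
    #include <signal.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    static char *region;                  /* one tracked page */
    static size_t page_size;
    static volatile sig_atomic_t owned;   /* 0 = unowned (read-only), 1 = owned (read-write) */

    static void segv_handler(int sig, siginfo_t *info, void *ctx) {
      (void)sig; (void)ctx;
      char *addr = (char *)info->si_addr;
      if (addr < region || addr >= region + page_size)
        _exit(1);                         /* a genuine crash we don't handle */
      /* A write hit an unowned page: take ownership by upgrading to read-write.
         A real recorder would log a timestamp and revoke other owners' mappings here. */
      mprotect(region, page_size, PROT_READ | PROT_WRITE);
      owned = 1;
    }

    int main(void) {
      page_size = (size_t)sysconf(_SC_PAGESIZE);
      region = mmap(NULL, page_size, PROT_READ,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

      struct sigaction sa;
      memset(&sa, 0, sizeof(sa));
      sigemptyset(&sa.sa_mask);
      sa.sa_sigaction = segv_handler;
      sa.sa_flags = SA_SIGINFO;
      sigaction(SIGSEGV, &sa, NULL);

      printf("read while unowned: %d\n", region[0]);  /* reads never fault */
      region[0] = 42;                                 /* faults; handler takes ownership */
      printf("write succeeded, owned=%d, value=%d\n", (int)owned, region[0]);
      return 0;
    }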

A problem with SMP-Revirt is that when there is a lot of read/write sharing, the overhead of these transitions becomes very high. In the paper, for three of their six benchmarks recording the benchmark with four-core parallelism is slower than just recording on a single core rr-style!

I think it would be quite interesting for someone to implement a page-CREW approach in the Linux kernel. One of the problems dthreads has is that it forces all threads to be separate processes so they can have distinct memory mappings. Another problem is that mprotect on individual pages creates lots of kernel VMAs. Both of these add overhead. Instead, working in the kernel you could keep the normal process/thread setup and avoid having to create a new VMA when a page assignment changes, storing ownership values (CPU index) in per-page data. Then a page state transition from unowned to owned requires a page fault to the kernel, a change to the ownership value, and a synchronous inter-processor-interrupt to CPUs with a TLB mapping for the page; a transition from owned to unowned (or a different owner) requires just an IPI to the previous owner CPU. These IPIs (and their replies) would carry per-CPU timestamp values that get logged (to per-CPU logs) to ensure the order of ownership changes can be deterministically replayed.

Then some interesting questions are: how well does this perform over a range of workloads --- not just the standard parallel benchmarks but applications like Web browsers, memcached, JVMs, etc? Can we detect on-the-fly that some cluster of threads is experiencing severe overhead and degrade gracefully so we're at least no worse than running single-core? Are there other optimizations that we could do? (E.g. the unowned concurrent-read state could be bad for pages that are frequently written, because you have to send an IPI to all CPUs instead of just one when transitioning from unowned to owned; you could avoid that by forcing such pages to always be exclusively owned. When a CPU takes exclusive ownership you might want to block other CPUs from taking ownership away for a short period of time. Etc...) I think one could do quite a bit better than SMP-Revirt --- partly because kernel-level sharing is no longer an issue, since that's no longer being recorded and replayed.

To be clear: I don't want to work on this. I'm a lot more interested in the applications of record-and-replay than the machinery that does it. I want someone else to do it for us :-).

Friday, 19 May 2017

rr Usenix Paper And Technical Report

Our paper about rr has been accepted to appear at the Usenix Annual Technical Conference. Thanks to Dawn for suggesting that conference, and to the ATC program committee for accepting it :-). I'm scheduled to present it on July 13. The paper is similar to the version we previously uploaded to arXiv.

Some of the reviewers requested more material: additional technical details, explanations of some of our design choices compared to alternatives, and reflection on our "real world" experience with rr. There wasn't space for that within the conference page limits, so our shepherd suggested publishing a tech-report version of the paper with the extra content. I've done that and uploaded "Engineering Record And Replay For Deployability: Extended Technical Report". I hope some people find it interesting and useful.

Thursday, 11 May 2017

Obscurity Inhibits Persuasion

I'm a huge fan of China Miéville's work. "The City & The City" is one of my favourite books of all time. Particular features of his writing are his delightful linguistic obscurity and inventiveness, even though they tax the mind. Few writers could use the word "squiddity" in earnest and get away with it!

However, when one's goal is to communicate and persuade rather than entertain, simpler language serves better. Miéville's essay "The Limits Of Utopia" would be more comprehensible to a broader audience if he had restrained his literary style. (For example, the words "coagulate", "imbricated", "malefic" and "peonage" all have simpler substitutes that would serve just as well.) A subject of this importance deserves the clearest and broadest communication; Miéville's rejection of such discipline suggests that he values his own or his readers' literary gratification more highly than the influence he could exert through his essay.

It's not just Miéville, of course. All sorts of communities --- including scientists, activists, and religionists --- use specialized vocabulary, sometimes for brevity and precision, but often to identify as in-group. When they try to reach a broad audience, they'd be better off to be simple and clear --- especially when addressing people hostile to the in-group. Yet I'm amazed how often people fail to do this --- I guess due to habit, but sometimes perhaps due to insincerity.

Saturday, 6 May 2017

Perceptions Of Violent Crime

I spent last week as a parent helper at my child's school camp at Tui Ridge near Rotorua. I found it exciting and challenging in various ways, not least just having to exercise social skills non-stop for five days solid. I spent a fair bit of time socializing with the other parents. One conversation came around to the topic of violent crime, and some of the other parents talked about how violent crime is getting worse and there seemed to be a murder every day in Auckland. I ventured to suggest that violent crime, especially murder, is declining here (and elsewhere), to general skepticism. Unfortunately I didn't have data memorized and I drowned my phone recently, so I left it at that.

On returning home it didn't take long to confirm my point. For 2014 (the latest year for which I can find data) the police reported 66 "homicide and related offenses" nationally. That is the lowest number since that dataset began (1994), when the numbers were much higher. (I don't know why numbers for 2015 and 2016 aren't available in that dataset, but I guess bureaucratic inefficiency. There appears to be more recent data at policedata.nz, but it's behind a Flash app so I can't be bothered.) The idea that "a murder occurs every day in Auckland" was off by an order of magnitude, in 2014 at least, and there seems no reason to believe things have gotten worse.

Wikipedia has a good discussion of how in New Zealand (and other Western countries) people tend to believe violent crime is increasing when in fact it has been generally decreasing. I've read all sorts of theories and surveys of theories about the causes of this decrease, from "increased abortion" to "declining lead levels", but it appears no-one really knows; there are probably multiple causes. It's certainly welcome!

Sunday, 30 April 2017

Call Out China For Their Treatment Of NK Escapees

Nothing really new here in the stories of North Korea's prison camps, but I think one important aspect is consistently underemphasized:

Several weeks later, guards and inmates were ordered to gather as the pair, their bodies battered from torture, were dragged back behind the barbed wire. They had been caught in China and returned to the repressive regime.

"The two brothers were beheaded in front of everyone," said Lim. "They called everyone to watch as a warning not to flee. The other prisoners then had to throw stones at them."

Rounding up North Korean escapees and sending them back for torture and execution is barbaric and indefensible and should be embarrassing to China and Chinese people. The escapees want to reach South Korea; surely NGOs could take them there at no greater cost or inconvenience to China than the current policy.

It seems to me human rights organizations, news organizations and political leaders should be making this point at every opportunity ... but it rarely seems to come up. I don't know why.

Sunday, 23 April 2017

One Does Simply Walk Into Mordor

... on the Tongariro Northern Circuit. This central North Island "Great Walk" is a loop circumnavigating Mount Ngauruhoe (aka "Mount Doom") and much of the Tongariro volcanic complex. The terrain varies but is mostly a mixture of lava fields and somewhat barren alpine plains scoured by old pyroclastic flows (most recently by the "super-colossal" Taupo eruption of roughly 1,800 years ago). It doesn't take much imagination, or sophisticated marketing, to see it as (a peaceful and beautiful version of) Mordor.

We completed the Circuit yesterday, after four days of hiking. It's certainly possible to do it in three or even two days, but we took it easy and had time to do all the side tracks: the Tongariro summit track on day two (Mangatepopo Hut to Oturere Hut), the Ohinepango springs near Waihohonu on day three, and the Upper Tama Lake track on day four (Waihohonu Hut to Whakapapa Village).

Thank God, we had perfect weather. We also had a great time lounging around at the huts talking to other trampers of all ages, and even playing a variety of games. Overall it was another wonderful "Great Walk" experience. The only improvement would have been if one of the volcanoes was erupting, but you can't have everything!

The new Waihohonu Hut was impressive. It's probably the best DoC hut I've yet seen: multiple outdoor decks with picnic tables, well insulated, a very spacious common area with wood-burning stove, eight gas cooking stoves, solar-powered LED lighting, and even solar water heating!

After destroying a couple of cameras by getting them wet while tramping (only cheap point-and-shoots), I bought a waterproof camera. On this trip I took some underwater photos and they turned out quite well!