Thursday, 25 May 2017

A Couple Of Papers About Commodity Multicore Record And Replay, And A Possible Way Forward

Efficient record-and-replay of parallel programs with potential data races on commodity multicore hardware is a holy-grail research problem. I think that the best papers in this area so far, though they're a bit old, are SMP-Revirt and DoublePlay.

DoublePlay is a really interesting approach but I'm skeptical it would work well in practice. Without going into too many details, for parallel applications that scale well the minimum overhead it can obtain is 100%. Obtaining that overhead relies on breaking application execution into epochs that are reasonably long, but externally-visible output must be suppressed within each epoch, which is going to be difficult or impossible to do for many applications. For example, for the Apache Web server they buffer all network output in the kernel during an epoch, but this could cause problematic side effects for a more complicated application.

Therefore I think the SMP-Revirt approach is more robust and more likely to lead to a practical system. This is a fairly simple idea: apply the well-known concurrent-read, exclusive-write (CREW) model at the page level, using hardware page protections to trigger ownership changes. If you imagine a herd of single-threaded processes all sharing a memory block, then each page is mapped read-only in all processes (unowned), or read-write in one process (the owner) and no-access in all the others. When a process writes to a page it doesn't own, a trap occurs and it takes ownership; when it reads from a page that has a different owner, a trap occurs and the page becomes unowned. SMP-Revirt worked at the hypervisor level; dthreads does something similar at user level.

A problem with SMP-Revirt is that when there is a lot of read/write sharing, the overhead of these transitions becomes very high. In the paper, for three of their six benchmarks recording the benchmark with four-core parallelism is slower than just recording on a single core rr-style!

I think it would be quite interesting for someone to implement a page-CREW approach in the Linux kernel. One of the problems dthreads has is that it forces all threads to be separate processes so they can have distinct memory mappings. Another problem is that mprotect on individual pages creates lots of kernel VMAs. Both of these add overhead. Instead, working in the kernel you could keep the normal process/thread setup and avoid having to create a new VMA when a page assignment changes, storing ownership values (CPU index) in per-page data. Then a page state transition from unowned to owned requires a page fault to the kernel, a change to the ownership value, and a synchronous inter-processor-interrupt to CPUs with a TLB mapping for the page; a transition from owned to unowned (or a different owner) requires just an IPI to the previous owner CPU. These IPIs (and their replies) would carry per-CPU timestamp values that get logged (to per-CPU logs) to ensure the order of ownership changes can be deterministically replayed.

Then some interesting questions are: how well does this perform over a range of workloads --- not just the standard parallel benchmarks but applications like Web browsers, memcached, JVMs, etc? Can we detect on-the-fly that some cluster of threads is experiencing severe overhead and degrade gracefully so we're at least no worse than running single-core? Are there other optimizations that we could do? (E.g. the unowned concurrent-read state could be bad for pages that are frequently written, because you have to send an IPI to all CPUs instead of just one when transitioning from unowned to owned; you could avoid that by forcing such pages to always be exclusively owned. When a CPU takes exclusive ownership you might want to block other CPUs from taking ownership away for a short period of time. Etc...) I think one could do quite a bit better than SMP-Revirt --- partly because kernel-level sharing is no longer an issue, since that's no longer being recorded and replayed.

To be clear: I don't want to work on this. I'm a lot more interested in the applications of record-and-replay than the machinery that does it. I want someone else to do it for us :-).

Friday, 19 May 2017

rr Usenix Paper And Technical Report

Our paper about rr has been accepted to appear at the Usenix Annual Technical Conference. Thanks to Dawn for suggesting that conference, and to the ATC program committee for accepting it :-). I'm scheduled to present it on July 13. The paper is similar to the version we previously uploaded to arXiv.

Some of the reviewers requested more material: additional technical details, explanations of some of our design choices compared to alternatives, and reflection on our "real world" experience with rr. There wasn't space for that within the conference page limits, so our shepherd suggested publishing a tech-report version of the paper with the extra content. I've done that and uploaded "Engineering Record And Replay For Deployability: Extended Technical Report". I hope some people find it interesting and useful.

Thursday, 11 May 2017

Obscurity Inhibits Persuasion

I'm a huge fan of China Miéville's work. "The City & The City" is one of my favourite books of all time. Particular features of his writing are his delightful linguistic obscurity and inventiveness, even though they tax the mind. Few writers could use the word "squiddity" in earnest and get away with it!

However, when one's goal is to communicate and persuade instead of entertain, simpler language would serve better. Miéville's essay "The Limits Of Utopia" would be more comprehensible to a broader audience if he had restrained his literary style. (For example, the words "coagulate", "imbricated", "malefic" and "peonage" all have simpler substitutes that would serve just as well.) A subject of this importance deserves the clearest and broadest communication; Miéville rejecting such discipline suggests that he values the his or his readers' literary gratification more highly than the influence he could exert through his essay.

It's not just Miéville, of course. All sorts of communities --- including scientists, activists, and religionists --- use specialized vocabulary, sometimes for brevity and precision, but often to identify as in-group. When they try to reach a broad audience, they'd be better off to be simple and clear --- especially when addressing people hostile to the in-group. Yet I'm amazed how often people fail to do this --- I guess due to habit, but sometimes perhaps due to insincerity.

Saturday, 6 May 2017

Perceptions Of Violent Crime

I spent last week as a parent helper at my child's school camp at Tui Ridge near Rotorua. I found it exciting and challenging in various ways, not least just having to exercise social skills non-stop for five days solid. I spent a fair bit of time socializing with the other parents. One conversation came around to the topic of violent crime, and some of the other parents talked about how violent crime is getting worse and there seemed to be a murder every day in Auckland. I ventured to suggest that violent crime, especially murder, is declining here (and elsewhere), to general skepticism. Unfortunately I didn't have data memorized and I drowned my phone recently, so I left it at that.

On returning home it didn't take long to confirm my point. For 2014 (the latest year for which I can find data) the police reported 66 "homicide and related offenses" nationally. That is the lowest number since that dataset began (1994), when the numbers were much higher. (I don't know why numbers for 2015 and 2016 aren't available in that dataset, but I guess bureaucratic inefficiency. There appears to be more recent data at policedata.nz, but it's behind a Flash app so I can't be bothered.) The idea that "a murder occurs every day in Auckland" was off by an order of magnitude, in 2014 at least, and there seems no reason to believe things have gotten worse.

Wikipedia has a good discussion of how in New Zealand (and other Western countries) people tend to believe violent crime is increasing when in fact it has been generally decreasing. I've read all sorts of theories and surveys of theories about the causes of this decrease, from "increased abortion" to "declining lead levels", but it appears no-one really knows; there are probably multiple causes. It's certainly welcome!

Sunday, 30 April 2017

Call Out China For Their Treatment Of NK Escapees

Nothing really new here in the stories of North Korea's prison camps, but I think one important aspect is consistently underemphasized:

Several weeks later, guards and inmates were ordered to gather as the pair, their bodies battered from torture, were dragged back behind the barbed wire. They had been caught in China and returned to the repressive regime.

"The two brothers were beheaded in front of everyone," said Lim. "They called everyone to watch as a warning not to flee. The other prisoners then had to throw stones at them."

Rounding up North Korean escapees and sending them back for torture and execution is barbaric and indefensible and should be embarrassing to China and Chinese people. The escapees want to reach South Korea; surely NGOs could take them there at no greater cost or inconvenience to China than the current policy.

It seems to me human rights organizations, news organizations and political leaders should be making this point at every opportunity ... but it rarely seems to come up. I don't know why.

Sunday, 23 April 2017

One Does Simply Walk Into Mordor

... on the Tongariro Northern Circuit. This central North Island "Great Walk" is a loop circumnavigating Mount Ngauruhoe (aka "Mount Doom") and much of the Tongariro volcanic complex. The terrain varies but is mostly a mixture of lava fields and somewhat barren alpine plains scoured by old pyroclastic flows (most recently by the "super-colossal" ~1800-year-ago Taupo eruption). It doesn't take much imagination, or sophisticated marketing, to see it as (a peaceful and beautiful version of) Mordor.

We completed the Circuit yesterday, after four days of hiking. It's certainly possible to do it in three or even two days, but we took it easy and had time to do all the side tracks: the Tongariro summit track on day two (Mangatepopo Hut to Oturere Hut), the Ohinepango springs near Waihohonu on day three, and the Upper Tama Lake track on day four (Waihohonu Hut to Whakapapa Village).

Thank God, we had perfect weather. We also had a great time lounging around at the huts talking to other trampers of all ages, and even playing a variety of games. Overall it was another wonderful "Great Walk" experience. The only improvement would have been if one of the volcanoes was erupting, but you can't have everything!

The new Waihohonu Hut was impressive. It's probably the best DoC hut I've yet seen: multiple outdoor decks with picnic tables, well insulated, a very spacious common area with wood-burning stove, eight gas cooking stoves, solar-powered LED lighting, and even solar water heating!

After destroying a couple of cameras by getting them wet while tramping (only cheap point-and-shoots), I bought a waterproof camera. On this trip I took some underwater photos and they turned out quite well!

Wednesday, 5 April 2017

Rust Optimizations That C++ Can't Do (Version 2)

My last post was wrong in the details, so let me show a better example. Rust code:

pub extern fn foo(v: &u64, callback: fn()) -> u64 {
  let mut sum = 0;
  for _ in 0..100 {
    sum += *v;
    callback();
  }
  sum
}
C++ version here.

The inner loop of the Rust version turns into:

.LBB0_1:
        inc     ebx
        call    r14
        add     r12, r15
        cmp     ebx, 100
        jl      .LBB0_1
sum is in r12 and the evaluation of *v has been hoisted out of the loop with the result in r15. In the compiled C++ code the value has not been hoisted:
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        add     rbx, qword ptr [r15]
        call    r14
        dec     ebp
        jne     .LBB0_1

Of course the performance difference in this case would be negligible, but if *v was replaced with a more complicated expression, the difference would increase. (No, I don't know why the Rust code counts up instead of counting down like C++ to get slightly better code; someone should fix that.)

Here it's really clear that the semantics of Rust are making this optimization possible. In C++ v could be a reference to a global variable which is modified by the callback function, in which case hoisting the load would be incorrect.

Several people pointed out that with LTO/WPO a C++ compiler might be able to determine that such cases don't happen, in this case that no function passed into callback directly or indirectly modifies v. That's what I meant in my previous post by "a C++ compiler could do some analysis to show that can't happen" ... but as I continued, "those analyses don't scale". The larger and more complicated your code gets, the more likely it is that static analysis will fail to prove the invariants you care about. Sometimes, due to dynamic linking or run-time code generation, relevant code just isn't available to be analyzed. In other cases the analysis algorithms aren't sophisticated enough to prove what you care about (often because no such algorithm is known), or adequate algorithms exist but they were too expensive to deploy. Perhaps the worst cases for sophisticated global analysis are when the property you care about isn't even true, because you accidentally violated it in some part of the program, but you'll never know, because the only effect of that accident is that performance got a little bit worse somewhere else. (E.g. with this example if you accidentally change one of the callback functions to indirectly modify v, your hypothetical WPO compiler does a slightly worse job and you won't know why.)

What we need are language-level invariants that apply globally and support aggressive optimization using only local analysis. C++ has some, but Rust's are stronger. We also need those invariants to be checked and enforced, not just "it's undefined behavior if you do this"; (safe) Rust wins again on that.