Friday, 22 April 2016

Android's Update System Is Ridiculous

My ancient Nexus S phone got wet and died on the Whanganui River. I was planning to get a new phone anyway so I could run a supported OS, so I went out and bought a Nexus 5X. It's the latest, greatest, pure Google experience, so it should be awesome, right?

Well then... after going through the first-time-use (FTU) setup and starting to use the phone, the first thing that happens is that it asks me if I want to upgrade to Android 6.0.1. Sure, that's why I bought this phone, no problem. The update downloads, installs, the phone restarts, some applications get "optimized", and eventually we're back in business. This takes about five minutes. So far, so OK.

What happens next is mind-boggling. After restart, the phone offers me another OS update, "January 2016". Download, install, restart, applications optimized, five minutes later we're back. Then it happens again for, you guessed it, "February 2016". Then "March 2016". Then "April 2016". So I've wasted about half an hour babysitting my phone while it crawls through this process.

This is stupid. Surely the geniuses at Google can see that it would be a vastly better first-run experience to deliver all the updates together, so you only have to go through a single update-restart cycle. What are they thinking?

Friday, 15 April 2016

Leveraging Modern Filesystems In rr

During recording, rr needs to make sure that the data read by all mmap and read calls is saved to the trace so it can be reproduced during replay. For I/O-intensive applications (e.g. file copying) this can be expensive in time, because we have to copy the data into the trace file and compress it, and in space, because the data is duplicated.

Modern filesystems such as Btrfs and XFS have features that make this much better: the ability to clone files and parts of files. These cloned files or blocks have copy-on-write semantics, i.e. the underlying storage is shared until one of the copies is written to. In common cases no future writes happen, or they happen later when copying isn't a performance bottleneck. So I've extended rr to support these features.

mmap was pretty easy to handle. We already supported hardlinking mmapped files into the trace directory as a kind of hacky/incomplete copy-on-write approximation, so I just extended that code to try BTRFS_IOC_CLONE first. Works great.
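For illustration, at the syscall level the whole-file clone is a single ioctl. Here's a minimal sketch (assuming Linux with the Btrfs headers; clone_file_into_trace and the fallback behavior are my own invention, not rr's actual code):

#include <fcntl.h>
#include <linux/btrfs.h>   /* BTRFS_IOC_CLONE */
#include <sys/ioctl.h>
#include <unistd.h>

/* Try to clone src_path into the trace as dest_path so the two files
   share blocks copy-on-write; fall back to the old hard-link trick. */
static int clone_file_into_trace(const char* src_path, const char* dest_path) {
  int src = open(src_path, O_RDONLY);
  if (src < 0) return -1;
  int dest = open(dest_path, O_WRONLY | O_CREAT | O_EXCL, 0600);
  if (dest < 0) { close(src); return -1; }
  /* The third ioctl argument is the source file descriptor. */
  int ret = ioctl(dest, BTRFS_IOC_CLONE, src);
  close(dest);
  close(src);
  if (ret < 0) {
    unlink(dest_path);
    ret = link(src_path, dest_path); /* hard-link approximation */
  }
  return ret;
}

Unlike a hard link, the clone gives the trace its own inode, so later writes to the original file can't corrupt the trace.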

read was harder to handle. We extended syscall buffering to give each thread its own "cloned-data" trace file; every time a large block-aligned read occurs, we first try to clone those blocks from the source file descriptor into that trace file. If that works, we then read the data for real but don't save it to the syscall buffer. During replay, we read the data from the cloned-data trace file instead of the original file descriptor. The details are a bit tricky because we have to execute the same code during recording as during replay.
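The per-block clone is a BTRFS_IOC_CLONE_RANGE ioctl. Here's a minimal sketch of the idea (my own simplification, not rr's actual code; offsets and length must be block-aligned):

#include <linux/btrfs.h>   /* BTRFS_IOC_CLONE_RANGE */
#include <sys/ioctl.h>
#include <sys/types.h>

/* Clone len bytes starting at src_off in src_fd into the thread's
   cloned-data trace file at dest_off. On success the blocks are
   shared copy-on-write, so no data is actually copied. */
static int clone_blocks(int src_fd, off_t src_off,
                        int dest_fd, off_t dest_off, size_t len) {
  struct btrfs_ioctl_clone_range_args args;
  args.src_fd = src_fd;
  args.src_offset = src_off;
  args.src_length = len;
  args.dest_offset = dest_off;
  return ioctl(dest_fd, BTRFS_IOC_CLONE_RANGE, &args);
}

If the ioctl fails (e.g. the two file descriptors are on different filesystems), we can just fall back to recording the read data normally.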

This approach introduces a race: if some process writes to the input file between the tracee cloning the blocks and actually reading from the file, the data read during replay will not match what was read during recording. I think this is not a big deal. In Linux, such read/write races can already result in the reader seeing an arbitrary mix of old and new data, so a racing write is almost certainly a severe bug in the application, and I suspect such bugs are relatively rare. Personally I've never seen one. We could eliminate the race by reading from the cloned-data file instead of the original input file, but that performs very poorly because it defeats the kernel's readahead optimizations.

Naturally this optimization only works if you have the right sort of filesystem and the trace and input files are on the same filesystem. So I'm formatting each of my machines with a single large btrfs partition.

Here are some results doing "cp -a" of an 827MB Samba repository, on my laptop with SSD:

The space savings are even better:

The cloned btrfs data is not actually using any space until the contents of the original Samba directory are modified. And of course, if you make multiple recordings they continue to share space even after the original files are modified. Note that some data is still being recorded (and compressed) normally, since for simplicity the cloning optimization doesn't apply to small reads or reads that are cut short by end-of-file.

Thursday, 7 April 2016

Skylake Erratum Affecting rr

I've encountered a Skylake bug that affects rr. Turns out it's (vaguely) documented:

The performance monitoring events INST_RETIRED (Event C0H; any Umask value) and BR_INST_RETIRED (Event C4H; any Umask value) count instructions retired and branches retired, respectively. However, due to this erratum, these events may overcount in certain conditions when:
  • Executing VMASKMOV* instructions with at least one masked vector element
  • Executing REP MOVS or REP STOS with Fast Strings enabled (IA32_MISC_ENABLES MSR (1A0H), bit 0 set)
  • An MPX #BR exception occurred on BNDLDX/BNDSTX instructions and the BR_INST_RETIRED (Event C4H; Umask is 00H or 04H) is used.

Fortunately, in my tests, this overcount only happens when you configure more than one hardware performance counter for the same task, and rr only needs one ("conditional branches retired"). Well, except for one rr test: the new record_replay. In that test, rr records rr replaying a simple program, and the recorder process and the replayer process each configure a counter to count conditional branches in that program.

Because they're counting the same event, we can avoid using two hardware counters by having the recorder emulate the replayer's perf_event_open and related syscalls, returning results using its own underlying conditional-branch counter. This is a bit annoying to have to do, and it adds overhead to rr-in-rr scenarios, but I think it's worth it to keep the record_replay test working on Skylake. So this is on master.
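For context, the one counter rr needs can be opened with a single perf_event_open call, roughly like this sketch (the raw encoding 0x01c4, i.e. event C4H with umask 01H for retired conditional branches, is my assumption for recent Intel CPUs; rr's real setup is more involved):

#include <linux/perf_event.h>
#include <string.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Open a counter for retired conditional branches on one task. There
   is no glibc wrapper for perf_event_open, so call syscall() directly. */
static int open_cond_branch_counter(pid_t tid) {
  struct perf_event_attr attr;
  memset(&attr, 0, sizeof(attr));
  attr.size = sizeof(attr);
  attr.type = PERF_TYPE_RAW;
  attr.config = 0x01c4;    /* event C4H, umask 01H (assumed encoding) */
  attr.exclude_kernel = 1; /* count user-space execution only */
  attr.disabled = 1;       /* enable later with PERF_EVENT_IOC_ENABLE */
  return syscall(SYS_perf_event_open, &attr, tid, -1, -1, 0);
}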

On the bright side, I'm glad to see that miscounted performance events warrant errata. That means people other than rr users care about event counts being correct, which increases the likelihood they'll continue to be usable for rr. Maybe one day instructions-retired will be usable enough that we can dispense with the workarounds for not having it!

Sunday, 3 April 2016

GNOME High-DPI Issues

Most sources advise using gnome-tweak-tool to set "Window Scaling" to 2 (or whatever). This alone doesn't work adequately for me; when I make my external 4K monitor primary, gnome-settings-daemon decides to set Xft.dpi to 96 because it thinks my monitor DPI is too low for scaling, even though I have set window scaling to 2. Arguably this is a gnome-settings-daemon bug. The result is that most apps look fine but some, including gnome-shell, care about Xft.dpi and don't scale. I fixed the problem by manually setting

gsettings set org.gnome.desktop.interface scaling-factor 2
and now everything's working reasonably well. Some apps still don't support proper scaling (e.g. Eclipse), but at least they're usable where they weren't before. Still, on the Web we really did this right: make all apps work at high DPI out of the box, scale up individual bitmaps when necessary, render at high resolution automatically when we can (e.g. text), and let DPI-aware apps opt into higher-resolution bitmaps.

My Dell XPS 15 with 4K screen, and an external Dell 4K monitor, look sweet. The main remaining problem is that you can't drive a 4K monitor at 60Hz over this laptop's HDMI port, and the laptop doesn't have DisplayPort built in; I need a Thunderbolt 3-to-DisplayPort adapter, which is supposed to exist but which you can hardly find anywhere yet. Downside of bleeding-edge tech.

Google Photos does the Right Thing by scaling images on the server to the specific window size. So you can upload original-size photos and it'll automatically pull down a massive image when you're viewing fullscreen on a 4K monitor. Now I can really tell the difference between my camera and a good camera.

Enabling touch events in Firefox and launching it with MOZ_USE_XINPUT2=1 gives me touch-event support on my laptop (both for Web apps and for async scrolling); nice.

Forcing UI scale factors to be an integer is a mistake. In Gecko we can scale Web content by non-integer amounts and it almost never fails (or at least the same failure would have occurred with an integer scale), so I'm pretty sure there's no reason regular desktop toolkits can't do the same.

Friday, 1 April 2016

Using rr To Debug rr

Working on rr has often been frustrating because although rr is immensely useful for understanding bugs, it often has complex bugs of its own and you can't use rr to debug them. You sometimes feel you're in debugging hell so that others can go to heaven.

I realized it might not be too hard to get rr to record and replay rr doing a replay. rr's behavior during replay is much simpler than during recording, partly because most kernel state does not exist during replay (file descriptors aren't really open, etc). Achieving this could be valuable because most of the work I plan to do on rr involves doing more interesting things during replay. So I've spent about a week working on that --- and it works! On master you can run:

rr record ~/rr/obj/bin/simple
rr record rr replay -a ~/.local/share/rr/simple-0
rr replay -a
The final line is rr replaying rr replaying simple. I've added a test for this.

It's about 2,000 lines of new code, much of which is tests exercising the new ptrace features I had to implement. Fleshing out support for ptrace (including PTRACE_SYSCALL, PTRACE_SYSEMU, PTRACE_SINGLESTEP etc) was the majority of the work. (A process can't have two ptracers at the same time, so the outer rr must ptrace all the processes and the inner rr's use of ptrace has to be completely emulated by the outer rr.) Along the way I found and fixed some interesting existing rr bugs. In some cases I simplified rr replay to avoid use of ptrace features that it didn't really need (e.g. PTRACE_O_TRACEEXEC).
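To give a flavor of what "completely emulated" means, here's a minimal sketch of the PTRACE_SYSEMU pattern (assuming x86-64; the emulation logic is a hypothetical stub, not rr's actual code). With PTRACE_SYSEMU, the tracee stops at each syscall entry and the kernel never executes the syscall, so the tracer must supply the result itself:

#include <errno.h>
#include <sys/ptrace.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

/* Hypothetical stub: compute the result the tracee should observe. */
static long emulate_one_syscall(long sysno) {
  if (sysno == SYS_getpid) return 4242; /* fake pid, for illustration */
  return -ENOSYS;
}

static void emulate_syscalls(pid_t tracee) {
  int status;
  for (;;) {
    /* Resume the tracee; it stops just before each syscall, which the
       kernel skips rather than executing. */
    if (ptrace(PTRACE_SYSEMU, tracee, 0, 0) < 0) return;
    waitpid(tracee, &status, 0);
    if (WIFEXITED(status)) return;
    struct user_regs_struct regs;
    ptrace(PTRACE_GETREGS, tracee, 0, &regs);
    /* Write the emulated return value into the tracee's rax. */
    regs.rax = emulate_one_syscall(regs.orig_rax);
    ptrace(PTRACE_SETREGS, tracee, 0, &regs);
  }
}

The outer rr has to do this kind of interception for the inner rr's entire ptrace vocabulary (PTRACE_SYSCALL, PTRACE_SINGLESTEP and friends), since the kernel only allows one real ptracer per process.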

Debugging rr running under rr can be mind-bending. It's easy to get confused about which rr you're looking at, and it's hard to examine the behavior of the inner rr when you can't trust the outer rr. I hope I don't have to do much more of it!

As you'd expect, adding a level of record and replay makes things much slower.

As you'd expect, I tried rr recording rr replaying rr replaying simple. As you'd expect, it didn't work first time. It's probably a fixable bug but I have no desire to go deeper down the rabbit-hole at the moment.

An obvious question is whether rr could record rr recording, as well as rr replaying. In principle I think that could work, but it would not be very useful. If rr recording of X fails, then rr recording rr recording X is likely to hit the same fault in the outer rr (because it must record both the inner rr and X), and be much more difficult to debug! To debug rr recording really well, we'd need a record-and-replay system built on an entirely different technology, e.g. Simics.