Monday, 20 April 2015

Another VMWare Hypervisor Bug

Single-stepping through instructions in VMWare (6.0.4 build-2249910 in my case) with a 32-bit x86 guest doesn't trigger hardware watchpoints.

Steps to reproduce:

  1. Configure a VMWare virtual machine (6.0.4 build-2249910 in my case) booting 32-bit Linux (Ubuntu 14.04 in my case).
  2. Compile this program with gcc -g -O0 and run it in gdb:
    int main(int argc, char** argv) {
      char buf[100];
      buf[0] = 99;
      return buf[0];
    }
  3. In gdb, do
    1. break main
    2. run
    3. watchpoint -l buf[0]
    4. stepi until main returns
  4. This should trigger the watchpoint. It doesn't :-(.

Doing the same thing in a KVM virtual machine works as expected.

Sigh.

Thursday, 2 April 2015

Reverse Execution And Signals

gdb's reverse execution interface interacts with signals in counter-intuitive ways. If you're using rr and gdb reverse execution to debug situations involving signals, e.g. a SIGSEGV, read on...

Consider the following program test.c:

int main(int argc, char **argv) {
  __asm__ __volatile__("jmp 0x42");
}

We can debug this with rr as follows:

[roc@eternity test]$ rr ./test
rr: Saving the execution of `/home/roc/tmp/test' to trace directory `/home/roc/.rr/test-6'.
[rr.170] Warning: task 14677 (process 14677) dying from fatal signal SIGSEGV.
[roc@eternity test]$ rr replay
GNU gdb (GDB) 7.9
...
0x00002aaaaaaaf6f6 in _dl_start () from /lib64/ld-linux-x86-64.so.2
(gdb) cont
Continuing.
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000042 in ?? ()
(gdb) where
#0  0x0000000000000042 in ?? ()
#1  0x0000000000000000 in ?? ()

At this point you get that awful sinking feeling... But wait!

(gdb) reverse-stepi
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000042 in ?? ()
(gdb) reverse-stepi
main (argc=1, argv=0x7fffffffdf38) at /home/roc/tmp/test.c:28
28   __asm__ __volatile__("jmp 0x42");

Hurrah!

The obvious questions are: why did we have to reverse-stepi twice to get back to the jmp, and why did the first reverse-stepi trigger SIGSEGV again?

If you singlestep forwards through the program using gdb normally, starting at the jmp instruction, you actually see two separate events:

(gdb) run
Starting program: /home/roc/tmp/test 
Breakpoint 1, main (argc=1, argv=0x7fffffffdfd8) at /home/roc/tmp/test.c:28
28   __asm__ __volatile__("jmp 0x42");
(gdb) stepi
0x0000000000000042 in ?? ()
(gdb) stepi
Program received signal SIGSEGV, Segmentation fault.
0x0000000000000042 in ?? ()
What's happening is that the first stepi runs the jmp and arrives at the invalid location. The second stepi tries to run the instruction at that location and triggers SIGSEGV instead. Now, in the original session we first ran all the way past the SIGSEGV. Then our first reverse-stepi makes us reverse-execute one step: in this case, we reverse-execute the triggering of the SIGSEGV. (In gdb, reverse-executing the triggering of a signal prints the signal just as forward-executing does.) The next reverse-stepi reverse-executes the actual jump.

It feels a little weird, though it makes some amount of sense. It's even weirder when you reverse-singlestep through the execution of a signal handler, but it still all makes sense. I've been pleasantly surprised by gdb's robustness at handling that sort of thing. I've been unpleasantly unsurprised by the number of rr bugs I've had to iron out to make this work properly at scale for Gecko debugging!

Monday, 30 March 2015

Eclipse + Gecko = Win

With Eclipse 4.4.1 CDT and the in-tree Eclipse project builder (./mach build-backend -b CppEclipse), the Eclipse C++ tools work really well on Gecko. Features I really enjoy:

  • Ctrl-click to navigate to definitions/declarations
  • Ctrl-T to popup the superclasses/subclasses of a class, or the overridden/overriding implementations of a method
  • Shift-ctrl-G to find all uses of a declaration (not 100% reliable, but almost always good enough)
  • Instant coloring of syntax errors as you type (useless messages, but still worth having)
  • Instant coloring of unknown identifier and type errors as you type; not 100% reliable, but good enough that most of my compiler errors are caught before doing a build.
  • Really good autocomplete. E.g. given
    nsTArray<nsRefPtr<Foo>> array;
    for (auto& v : array) {
      v->P
    Eclipse will autocomplete methods of Foo starting with P ... i.e., it handles "auto", C++ for-range loops, nsTArray and nsRefPtr operator overloading.
  • Shift-ctrl-R: automated renaming of identifiers. Again, not 100% reliable but a massive time saver nonetheless.

Thanks to Jonathan Watt and Benoit Girard for the mach support and other Eclipse work over the years!

I assume other IDEs can do these things too, but if you're not using a tool at least this powerful, you might be leaving some productivity on the table.

With Eclipse, rr, and a unified hg repo, hacking Gecko has never felt so good :-).

Thursday, 26 March 2015

Paper Titles

A few tips on computer science paper titles:

Titles of the form Catchy Project Name: What Our Project Is About are stilted. Show some imagination.

Titles of the form Towards [Some Goal We Totally Failed To Reach] are an obvious attempt to dress up failure as success. Don't do that.

Do write bold papers about negative results. Call your paper [Our Idea] Doesn't Work (And Here's Why) and I'll be excited to read it.

[Goal] Is Harder Than You Think would also get my attention.

If your paper title contains the word Aristotelian, I will never read your work again and skip the conference too --- but you get points for chutzpah.

Note: following this advice may harm your career. Consider a career where you don't have to publish or perish.

Tuesday, 17 March 2015

Auckland University rr Talk Next Week

I'm scheduled to give a talk at Auckland University next Wednesday on rr.

25 March 2015

12 - 1pm

Venue: 303S - 561, City Campus

Host: Department of Computer Science, University of Auckland

Note: Informal discussion and light refreshments 1 - 2 pm

Speaker: Dr Robert O'Callahan

Abstract: Mozilla's browser developers find debugging expensive and frustrating, especially when bugs are non-deterministic. Researchers have proposed to expedite debugging by recording, replaying and analyzing program executions, and in theory such techniques are well-understood, but they have not yet been widely adopted. Mozilla Research aims to understand and bridge this adoption gap by building a record-and-replay-based debugger that Mozilla's developers actually want to use. This talk will describe some barriers to adoption and how we have addressed them in the design and implementation of 'rr': a lightweight tool which can record unmodified Firefox binaries with less than 1.3x run-time overhead, exactly replay those executions under the control of gdb, and emulate reverse-execution --- using only standard Linux kernel APIs on stock hardware. 'rr' has been used to debug many real Firefox bugs. Furthermore it provides low-overhead recording, replaying, checkpointing and reverse-execution of Linux processes in an open-source tool, opening up many interesting avenues for future work. I will also discuss some of the implications for hardware and software design as record-and-replay becomes more popular.

Bio: Robert O'Callahan is a Distinguished Engineer at Mozilla Corporation, focusing on the development of Web standards and their implementation in Firefox, with a particular focus on CSS, graphics and media APIs. He has a side interest in research on software development, and debugging in particular. 'rr' is the first research tool he ever built that he actually wants to use.

Saturday, 14 March 2015

The Problems Of Significance Testing (aka What's Wrong With Computer Science)

This article is an excellent, easy to read, easy to understand explanation of why the significance testing approaches used in many scientific disciplines are bogus. Almost everything he says applies just as well to computer science as to the author's field (medicine). I'm pleased to see the author call out the catastrophic bias of the publishing system against negative results --- something I've been ranting about for decades.

Thursday, 5 March 2015

Debugging Gecko With Reverse Execution

Over the last month or so I've added reverse-execution support to rr's gdb interface. This enables gdb commands such as reverse-continue, reverse-next, reverse-finish and reverse-step. These work with breakpoints and watchpoints, so you can do things like

Breakpoint 1, nsCanvasFrame::BuildDisplayList (this=0x2aaadd7dbeb0, aBuilder=0x7fffffffaaa0, aDirtyRect=..., aLists=...)
    at /home/roc/mozilla-inbound/layout/generic/nsCanvasFrame.cpp:460
460   if (GetPrevInFlow()) {
(gdb) watch -l mRect.width
Hardware watchpoint 2: -location mRect.width
(gdb) reverse-cont
Continuing.
Hardware watchpoint 2: -location mRect.width

Old value = 12000
New value = 11220
0x00002aaab100c0fd in nsIFrame::SetRect (this=0x2aaadd7dbeb0, aRect=...)
    at /home/roc/mozilla-inbound/layout/base/../generic/nsIFrame.h:718
718       mRect = aRect;
(Here the "New value" is actually the value before this statement was executed, since we just executed backwards past it.)

Since debugging is about tracing effects to causes, and effects happen after their causes, reverse execution is a big deal. I've just started using it to debug Gecko, and I'm enjoying it immensely! I find myself having to unlearn a lot of my usual debugging tactics; much of what I've learned about debugging up until now is really just workarounds for not having had reverse execution.

An example from today: working on bug 1082249, I was in a function nsDisplayTableItem::ComputeInvalidationRegion with a geometry variable pointing to an nsDisplayBackgroundGeometry when it should have been a nsDisplayTableItemGeometry. At first I considered various indirect ways to figure out where the nsDisplayBackgroundGeometry came from, but now there's an easy direct way: set a breakpoint on the nsDisplayBackgroundGeometry, make it conditional on this == the current value of geometry, and reverse-continue to see exactly where that specific object was created (in a subclass of nsDisplayTableItem that I hadn't thought of).

This work is checked into rr master. It's not well tested yet, so a new official rr release is not imminent, but I'd appreciate people trying it out and reporting issues. At least I have been able to use it to get work done today :-).