Tuesday, 12 November 2019

The Power Of Collaborative Debugging

An under-appreciated problem with existing debuggers is that they lack first-class support for collaboration. In large projects a debugging session will often cross module boundaries into code the developer doesn't understand, making collaboration extremely valuable, but developers can only collaborate using generic methods such as screen-sharing, physical co-location, copy-and-paste of terminal transcripts, etc. We recognized the importance of collaboration, and Pernosco's cloud architecture makes such features relatively easy to implement, so we built some in.

The most important feature is just that any user can start a debugging session for any recorded execution, given the correct URL (modulo authorization). We increase the power of URL sharing by encoding in the URL the current moment and stack frame, so you can copy your current URL, paste it into chat or email, and whoever clicks on it will jump directly to the moment you were looking at.

The Pernosco notebook takes collaboration to another level. Whenever you take a navigation action in the Pernosco UI, we tentatively record the destination moment in the notebook with a snippet describing how you got there, which you can persist just by clicking on it. You can annotate these snippets with arbitrary text, and clicking on a snippet returns you to that moment. Many developers already record their progress by taking notes during debugging sessions (I remember using Borland Sidekick for this when I was a kid!); the Pernosco notebook makes this much more convenient. Our users find the notebook great for mitigating the "help, I'm lost in a vast information space" problem that affects Pernosco users as well as users of traditional debuggers (sometimes more so in Pernosco, because it enables higher velocity through that space). Of course the notebook persists indefinitely and is shared between all users and sessions for the same recording, so you have a permanent record of what you discovered that your colleagues can also explore and add to.

Our users are discovering that these features unlock new workflows. A developer can explore a bug, recording what they've learned in the code they understand, then upon reaching unknown code forward the debugging session to a more knowledgeable developer for further investigation — or perhaps just to quickly confirm a hypothesis. We find that, perhaps unexpectedly, Pernosco can be most effective at saving the time of your most senior developers because it's so much easier to leverage the debugging work already done by other developers.

Collaboration via Pernosco is beneficial not just to developers but to anyone who can reproduce bugs and upload them to Pernosco. Our users are discovering that if you want a developer to look into a bug you care about, submitting it to Pernosco yourself and sending them a link makes it much more likely they will oblige — if it only takes a minute or two to start poking around, why not?

Extending this idea, Pernosco makes it convenient to separate reproducing a bug from debugging a bug. It's no problem to have QA staff reproduce bugs and submit them to Pernosco, then hand Pernosco URLs to developers for diagnosis. Developers can stop wasting their time trying to replicate the "steps to reproduce" (and often failing!), and each group can focus on what they're good at. I think this could be transformative for many organizations.

Thursday, 7 November 2019

Omniscient Printf Debugging In Pernosco

Pernosco supports querying for the execution of specific functions and the execution of specific source lines. These resemble setting breakpoints on functions or source lines in a traditional debugger. Traditional debuggers usually let you filter breakpoints using condition expressions, and it's natural and useful to extend that to Pernosco's execution views, so we did. In traditional debuggers you can get the debugger to print the values of specified expressions when a breakpoint is hit, and that would also be useful in Pernosco, so we added that too.
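
To make that concrete, here's a small C++ sketch; the function, the values, and the expression syntax in the comments are invented for illustration, not taken from the actual Pernosco UI.

#include <cstddef>
#include <cstdint>

// A hypothetical function whose executions we might query.
std::size_t append_byte(std::uint8_t* buf, std::size_t len, std::size_t cap,
                        std::uint8_t value) {
    if (len >= cap) {
        // Querying executions of the "return len;" line below and adding a
        // condition expression like "cap > 0" filters out the uninteresting
        // zero-capacity calls; print-expressions like "len" and "cap" show
        // their values at every matching execution across the whole recording.
        return len;
    }
    buf[len] = value;
    return len + 1;
}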

These features strongly benefit from Pernosco's omniscient database, because we can evaluate expressions at different points in time — potentially in parallel — by consulting the database instead of having to reexecute the program.

These features are relatively new and we don't have much user experience with them yet, but I'm excited about them because while they're simple and easily understood, they open the door to "query-based debugging" strategies and endless possibilities for enhancing the debugger with richer query features.

Another reason I'm excited is that together they let you apply "printf-debugging" strategies in Pernosco: click on a source line, and add some print-expressions and optionally a condition-expression to the "line executions" view. I believe that in most cases where people are used to using printf-debugging, Pernosco enables much more direct approaches and ultimately people should let go of those old habits. However, in some situations some quick logging may still be the fastest way to figure out what's going on, and people will take time to learn new strategies, so Pernosco is great for printf-debugging: no rebuilding, and not even any reexecution, just instant(ish) results.
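
As a rough sketch of what that looks like (again, the code, the expressions and the condition syntax are invented for illustration):

#include <cstddef>
#include <cstdio>

struct Message { int id; std::size_t size; };   // hypothetical type
void dispatch(const Message&) {}                // hypothetical callee (stub)

void handle_message(const Message& msg) {
    // The printf-debugging habit: add temporary logging, rebuild, re-run:
    //   std::fprintf(stderr, "handle_message id=%d size=%zu\n", msg.id, msg.size);
    // The Pernosco equivalent: open the executions view for this line and add
    // print-expressions like "msg.id" and "msg.size" (plus an optional
    // condition like "msg.size == 0"); the values for every execution come
    // straight out of the existing recording, with no rebuild and no re-run.
    dispatch(msg);
}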

Monday, 4 November 2019

The BBC's "War Of The Worlds"

Very light spoilers ahead.

I had hopes for this show. I liked the book (OK, when I read it >30 years ago). I love sci-fi. I like historical fiction. I was hoping for Downton Abbey meets Independence Day. Unfortunately I think this show was, as we say in NZ, "a bit average".

I really liked the characters reacting semi-realistically to terror and horror. It always bothers me that in fiction normal people plunge into traumatic circumstances, scream a bit, then get over it in time for the next scene. This War Of The Worlds takes time to show characters freaking out, resting and consoling one another, but not quite getting it all back together. Overall I thought the acting was well done.

I think the pacing and editing were poor. Some parts were slow, but other parts (especially in the first half) lurch from scene to scene so quickly it feels like important scenes were cut. It was hard to work out what was going on geographically.

Some aspects seemed pointlessly complicated or confusing, e.g. the spinning ball weapon.

Call me old-fashioned, but when a man abandons his wife I am not, by default, sympathetic to him, so I spent most of the show thinking our male protagonist is kind of a bad guy, when I'm clearly supposed to be siding with him against closed-minded society. I even felt a bit vindicated when towards the end his lover Amy wonders if they did the right thing. At least for a change the Christian-esque character was only a fool, not a psychopath, so thank God for small mercies.

I guess I'm still waiting for the perfect period War Of The Worlds adaptation.

Saturday, 2 November 2019

Explaining Dataflow In Pernosco

Tracing dataflow backwards in time is an rr superpower. rr users find it incredibly useful to set hardware data watchpoints on memory locations of interest and reverse-continue to find where those values were changed. Pernosco takes this superpower up a level with its dataflow pane (see the demo there).

From the user's point of view, it's pretty simple: you click on a value and Pernosco shows you where it came from. However, there is more going on here than meets the eye. Often you find that the last modification to memory is not what you're looking for: the value was computed somewhere and then copied, perhaps many times, until it reached the memory location you're inspecting. This is especially true in move-heavy Rust and C++ code. Pernosco detects copying through memory and registers and follows dataflow backwards through them, producing an explanation comprising multiple steps, any of which the user can inspect just by clicking on them. Thanks to omniscience, this is all very fast. (Jeff Muizelaar implemented something similar by scripting gdb and rr, which partially inspired us. Our infrastructure is a lot more powerful than what he had to work with.)
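
Here's a contrived C++ illustration of the kind of copy chain this deals with; the code and the comments are purely illustrative, not output from an actual explanation:

#include <string>
#include <utility>
#include <vector>

struct Item { std::string name; };

std::string make_name(int n) {
    // The value originates here (immediates plus string concatenation)...
    return "widget-" + std::to_string(n);
}

int main() {
    std::string name = make_name(42);     // ...then gets moved around:
    Item item{std::move(name)};           // one move,
    std::vector<Item> items;
    items.push_back(std::move(item));     // and another.
    // Inspecting items[0].name at the end, the most recent write is just the
    // final move; the dataflow explanation keeps following copies backwards
    // until it reaches the point in make_name() where the value was computed.
    return 0;
}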

Pernosco explanations terminate when you reach a point where a value was derived from something other than a CPU copy: e.g. an immediate value, I/O, or arithmetic. There's no particular reason why we need to stop there! For example, there is obviously scope to extend these explanations through arithmetic, to explore more general dataflow DAGs, though intelligible visualization would become more difficult.

Pernosco's dataflow explanations differ from what you get with gdb and rr in an interesting way: gdb deliberately ignores idempotent writes, i.e. writes to memory that do not actually change the value. We thought hard about this and decided that Pernosco should not ignore them. Consider a trivial example:

x = 0;
y = 0;
x = y;
If you set a watchpoint on x at the end and reverse-continue, gdb+rr will break on x = 0. We think this is generally not what you want, so a Pernosco explanation for x at the end will show x = y and y = 0. I don't know why gdb behaves this way, but I suspect it's because gdb watchpoints are sometimes implemented by evaluating the watched expression over time and noting when the value changes; since that can't detect idempotent writes, perhaps hardware watchpoints were made to ignore idempotent writes for consistency.

An interesting observation about our dataflow explanations is that although the semantics are actually quite subtle, even potentially confusing once you dig into them (there are additional subtleties I haven't gone into here!), users don't seem to complain about that. I'm optimistic that the abstraction we provide matches user intuitions closely enough that they skate over the complexity — which I think would be a pretty good result.

(One of my favourite moments with rr was when a Mozilla developer called a Firefox function during an rr replay and it printed what they expected it to print. They were about to move on, but then did a double-take, saying "Errrr ... what just happened?" Features that users take for granted but are actually mind-boggling are the best features.)

Thursday, 31 October 2019

Improving Debugging Workflow With Pernosco

One of the key challenges for debuggers is that the traditional interactive debugging workflow — starting your program under the debugger (or attaching to it once it's running) and pausing it to inspect its state — doesn't work well for a lot of people anymore. That workflow isn't convenient when the application doesn't normally run locally, e.g. because testing more often happens in CI, or on a phone, or the code you care about runs as part of a big distributed system. It also falls down when pausing the debuggee breaks the system. As software has increasingly moved to the cloud and mobile platforms, this has become a bigger deal, and it's no wonder the use of interactive debugging has waned. "Remote debugging" helps a bit, but it tends to be painful, and although it can bridge gaps between machines, it doesn't bridge gaps in time.

We've published a couple of documents on how Pernosco tackles this, in particular how Pernosco integrates with CI and how Pernosco supports uploads from developers and QA (manual and automatic). A big part of the solution is just record-and-replay (with rr in our case). Being able to record execution on one machine, without stopping the application, and replay execution on another machine at another time, enables a lot of new workflows that mitigate the above problems. However Pernosco goes further in some important ways.

One issue is that just being able to replay execution isn't enough; we also want a good debugging experience during the replay. This means we need to capture compiled debuginfo, source code and other relevant information that aren't strictly necessary for the replay. In many cases that data isn't even available at the recording site, but it might be available somewhere (e.g. a symbol server or build artifact archive) for us to get later. So our debugging infrastructure has to support collecting information at the recording site, harvesting it from various sources later, and actually using it during the debugging session. This is not at all trivial, and Pernosco has a lot of code to handle this sort of thing, some of which needs to be customized for specific customers. For example, Pernosco identifies Firefox binaries built by Mozilla CI and knows how to locate the relevant symbols and sources from Mozilla's archives. For developer- and QA-submitted recordings, Pernosco examines the trace to locate relevant debuginfo and source code and upload them. For source code hosted in well-known public repositories (e.g. mozilla-central or GitHub), we minimize overhead by uploading only local changes and having our debugger client fetch the public changes from the public repository at debugging time.

Note that rr on its own provides trace portability, but debugging ported traces is tricky. With rr pack and rr record --disable-cpuid-features, it is generally possible to create rr recordings that can be replayed on other machines. However, when you replay with gdb, locating symbols and source files is problematic when the replay machine's filesystem does not exactly match the recording machine's. For example, when gdb sees the shared-library loader load /home/roc/libfoo.so, that file might not be present at that location on the replay machine (or worse, it might be a different version), so gdb won't load the right symbols. You can try to work around this by populating a "sysroot" directory with the relevant files, copied and renamed from the trace, but figuring out which trace files need to go where is hard (because, e.g., it depends on the symlinks present on the recording machine, which rr doesn't capture in the recording, and it's not even clear how it could).

Another important feature for enabling new workflows is just having a cloud-based Web client. We want to minimize the barrier to getting into a debugging session, and it's hard to think of an easier way than publishing a link which the user clicks on to enter a specific debugging session — no installation, no configuration. Those links can be published wherever you already notify users about test failures.

One thing I'm really excited about is that Pernosco enables splitting failure reproduction from debugging. Traditionally, developers had to reproduce a bug locally when they wanted to use an interactive debugger on it. Pernosco lets you delegate the reproduction step to other people (or automation). For example, when QA staff find a bug, instead of writing down the steps to reproduce to send to a developer (and inevitably having a back-and-forth discussion about exactly what's required to reproduce the bug, etc), QA can upload a recording to Pernosco and pass the link to the developer. This saves time and money — especially when QA staff are cheaper and/or more scalable than your developer team.

Friday, 25 October 2019

Auckland Half Marathon 2019

I ran the Auckland Half Marathon last Sunday. My time was 1:46:51, a little slower than last year. I didn't quite have the mental endurance I guess. As always I hope I'll do better next year, though I am getting older...

As usual, I ran barefoot. This year I used climbing tape again, which definitely makes it a bit easier on my feet at this speed.

People ask why I run barefoot. I didn't start running until I was about 35, and I started running on a beach where I'm usually barefoot. After I got used to that, it never felt right to run in shoes ... they're just weight on my feet. To be honest, I also like being a bit eccentric. I've never had any serious issues with injuries so at this point, I don't feel like changing anything. I did try wearing Vibram 5-Finger foot-gloves for my first half-marathon, but my feet got all sweaty and were still sore at the end so they don't seem to help me much. Sticking small patches of climbing tape to the hardest-wearing parts of my feet (the balls, and my second toes) is plenty of protection when running fast. When running slower (20K training runs around 2:10) I don't even need that.

Pernosco Demo Video

Over the last few years we have kept our work on the Pernosco debugger mostly under wraps, but finally it's time to show the world what we've been working on! So, without further ado, here's an introductory demo video showing Pernosco debugging a real-life bug:

This demo is based on a great gdb tutorial created by Brendan Gregg. If you read his blog post, you'll get more background and be able to compare Pernosco to the gdb experience.

Pernosco makes developers more productive by providing scalable omniscient debugging — accelerating existing debugging strategies and enabling entirely new strategies and features — and by integrating that experience into cloud-based workflows. The latter includes capturing test failures occurring in CI so developers can jump into a debugging session with one click on a URL, separating failure reproduction from debugging so QA staff can record test failures and send debugger URLs to developers, and letting developers collaborate on debugging sessions.

Over the next few weeks we plan to say a lot more about Pernosco and how it benefits software developers, including a detailed breakdown of its approach and features. To see those updates, follow @_pernosco_ or me on Twitter. We're opening up now because we feel ready to serve more customers and we're keen to talk to people who think they might benefit from Pernosco; if that's you, get in touch. (Full disclosure: Pernosco uses rr, so for now we're limited to x86-64 Linux and statically compiled languages like C/C++/Rust.)