Wednesday 30 October 2019
Improving Debugging Workflow With Pernosco
One of the key challenges for debuggers is that the traditional interactive debugging workflow — running your program interactively and starting it under the debugger or connecting to it once it's running, and pausing it to inspect its state — doesn't work well for a lot of people anymore. That workflow isn't convenient when the application normally doesn't run locally — e.g. because testing more often happens in CI, or on a phone, or the code you care about runs as part of a big distributed system. It also falls down when pausing the debuggee breaks the system. As software has increasingly moved to the cloud and mobile platforms, this has become a bigger deal and it's no wonder use of interactive debugging has waned. "Remote debugging" helps a bit, but it tends to be painful and although it can bridge gaps between machines, it doesn't bridge gaps in time.
We've published a couple of documents on how Pernosco tackles this, in particular how Pernosco integrates with CI and how Pernosco supports uploads from developers and QA (manual and automatic). A big part of the solution is just record-and-replay (with rr in our case). Being able to record execution on one machine, without stopping the application, and replay execution on another machine at another time, enables a lot of new workflows that mitigate the above problems. However, Pernosco goes further in some important ways.
One issue is that just being able to replay execution isn't enough; we also want a good debugging experience during the replay. This means we need to capture compiled debuginfo, source code and other relevant information that isn't strictly necessary for the replay. In many cases that data isn't even available at the recording site, but it might be available somewhere (e.g. a symbol server or build artifact archive) for us to get later. So our debugging infrastructure has to support collecting information at the recording site, harvesting it from various sources later, and actually using it during the debugging session. This is not at all trivial, and Pernosco has a lot of code to handle this sort of thing, some of which needs to be customized for specific customers. For example, Pernosco identifies Firefox binaries built by Mozilla CI and knows how to locate the relevant symbols and sources from Mozilla's archives. For developer and QA-submitted recordings, Pernosco examines the trace to locate relevant debuginfo and source code, and uploads them. For source code hosted in well-known public repositories (e.g. mozilla-central or GitHub), we minimize overhead by uploading only local changes and having our debugger client fetch the public changes from the public repository at debugging time.
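To make the "upload only local changes" idea concrete, here is a minimal sketch of how a client might resolve a source file at debugging time. It is only an illustration, not Pernosco's actual code: the function name `resolve_source`, the `local_overlay` dictionary and the `fetch_public` callback are all hypothetical.

```python
from typing import Callable, Dict


def resolve_source(
    path: str,
    revision: str,
    local_overlay: Dict[str, str],
    fetch_public: Callable[[str, str], str],
) -> str:
    """Return the text of `path` as it was when the recording was made.

    local_overlay: hypothetical map from repository-relative path to the
        contents of files that differed from the public revision and were
        therefore uploaded with the recording.
    fetch_public: hypothetical callback that fetches (revision, path) from
        the public repository.
    """
    # Locally modified files were uploaded, so they take priority.
    if path in local_overlay:
        return local_overlay[path]
    # Everything else is unchanged from the public revision, so it is
    # fetched on demand rather than uploaded.
    return fetch_public(revision, path)
```

Under these assumptions, only the files in `local_overlay` ever need to leave the recording machine; the rest of the tree costs nothing to upload.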
Note that rr on its own provides trace portability, but debugging ported traces is tricky. With rr pack and rr record --disable-cpuid-features, it is generally possible to create rr recordings that can be replayed on other machines. However, when you replay with gdb, locating symbols and source files is problematic when the replay machine's filesystem does not exactly match the recording machine's. For example, when gdb sees the shared-library loader load /home/roc/libfoo.so, that file might not be present at that location on the replay machine (or worse, it might be a different version), so gdb won't load the right symbols. You can try to work around this by populating a "sysroot" directory with the relevant files, copied and renamed from the trace, but figuring out which trace files need to go where is hard (because e.g. it depends on the symlinks present on the recording machine, which rr doesn't capture in the recording, and it's not even clear how you would capture them).
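For illustration, the mechanical half of the sysroot workaround described above might look something like the following sketch. It assumes you have already worked out a mapping from each path on the recording machine to the corresponding file saved in the trace; that mapping is precisely the hard part, and `populate_sysroot` and its arguments are hypothetical names, not part of rr.

```python
import shutil
from pathlib import Path
from typing import Dict, List


def populate_sysroot(sysroot: Path, mapping: Dict[str, Path]) -> List[Path]:
    """Copy trace files into `sysroot` under their recording-machine paths.

    mapping: hypothetical, hand-built map from an absolute path on the
        recording machine (e.g. "/home/roc/libfoo.so") to the copy of that
        file stored in the trace directory.
    """
    placed = []
    for recorded_path, trace_file in mapping.items():
        # Drop the leading "/" so the path is re-rooted under the sysroot.
        target = sysroot / recorded_path.lstrip("/")
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(trace_file, target)
        placed.append(target)
    return placed
```

gdb can then be pointed at the resulting directory with its `set sysroot` command. This sketch does nothing about symlinks on the recording machine, so a library that was loaded through a symlinked path would still need its entry in the mapping worked out by hand.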
Another important feature for enabling new workflows is just having a cloud-based Web client. We want to minimize the barrier to getting into a debugging session, and it's hard to think of an easier way than publishing a link which the user clicks on to enter a specific debugging session — no installation, no configuration. Those links can be published wherever you already notify users about test failures.
One thing I'm really excited about is that Pernosco enables splitting failure reproduction from debugging. Traditionally, developers had to reproduce a bug locally when they wanted to use an interactive debugger to debug it. Pernosco lets you delegate the reproduction step to other people (or automation). For example, when QA staff find a bug, instead of writing down the steps to reproduce to send to a developer (and inevitably having a back-and-forth discussion about exactly what's required to reproduce the bug, etc.), QA can upload a recording to Pernosco and pass the link to the developer. This saves time and money, especially when QA staff are cheaper and/or more scalable than your developer team.