Tuesday, 13 November 2018

Comparing The Quality Of Debug Information Produced By Clang And Gcc

I've had an intuition that clang produces generally worse debuginfo than gcc for optimized C++ code. It seems that clang builds have more variables "optimized out" — i.e. when stopped inside a function where a variable is in scope, the compiler's generated debuginfo does not describe the value of the variable. This makes debuggers less effective, so I've attempted some quantitative analysis of the issue.

I chose to measure, for each parameter and local variable, the number of instruction bytes within its function over which the debuginfo can produce a value for the variable, and also the number of instruction bytes over which the debuginfo says the variable is in scope (i.e. the number of instruction bytes in the enclosing lexical block or function). I add those up over all variables and compute the ratio of variable-defined-bytes to variable-in-scope-bytes. The higher this "definition coverage" ratio, the better.
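To make the metric concrete, here's a minimal Rust sketch of the ratio computation, assuming the per-variable byte counts have already been extracted from the DWARF; the struct and names are illustrative, not my tool's actual API.

struct VariableCoverage {
    // Instruction bytes over which the debuginfo supplies a value for the
    // variable (summed over its location-list ranges).
    defined_bytes: u64,
    // Instruction bytes in the enclosing lexical block or function.
    scope_bytes: u64,
}

// Definition coverage: total defined bytes over total in-scope bytes.
// Higher is better; as discussed below, the ideal value is somewhat less than 1.
fn definition_coverage(vars: &[VariableCoverage]) -> f64 {
    let defined: u64 = vars.iter().map(|v| v.defined_bytes).sum();
    let in_scope: u64 = vars.iter().map(|v| v.scope_bytes).sum();
    if in_scope == 0 {
        return 0.0;
    }
    defined as f64 / in_scope as f64
}

fn main() {
    // Two toy variables: one covered over its whole 100-byte scope, one
    // covered for only 20 of its 80 in-scope bytes.
    let vars = [
        VariableCoverage { defined_bytes: 100, scope_bytes: 100 },
        VariableCoverage { defined_bytes: 20, scope_bytes: 80 },
    ];
    println!("definition coverage = {:.2}", definition_coverage(&vars)); // 0.67
}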

This metric has some weaknesses:

- DWARF debuginfo doesn't give us accurate scopes for local variables; the defined-bytes for a variable defined halfway through its lexical scope will be about half of its in-scope-bytes even if the debuginfo is perfect, so the ideal ratio is less than 1 (and unfortunately we can't compute it).
- In debug builds, and sometimes in optimized builds, compilers may give a single definition for the variable value that applies to the entire scope; this improves our metric even though the results are arguably worse.
- Sometimes compilers produce debuginfo that is simply incorrect; our metric doesn't account for that.
- Not all variables and functions are equally interesting for debugging, but this metric weights them all equally.
- The metric assumes that the points of interest for a debugger are equally distributed over instruction bytes.

On the other hand, the metric is relatively simple. It focuses on what we care about. It depends only on the debuginfo, not on the generated code or actual program executions. It's robust to constant scaling of code size. We can calculate the metric for any function or variable, which makes it easy to drill down into the results and lets us rank all functions by the quality of their debuginfo. We can compare the quality of debuginfo between different builds of the same binary at function granularity. The metric is sensitive to optimization decisions such as inlining; that's OK.

I built a debuginfo-quality tool in Rust to calculate this metric for an arbitrary ELF binary containing DWARF debuginfo. I applied it to the main Firefox binary libxul.so built with clang 8 (8.0.0-svn346538-1~exp1+0~20181109191347.1890~1.gbp6afd8e) and gcc 8 (8.2.1 20181105 (Red Hat 8.2.1-5)) using the default Mozilla build settings plus ac_add_options --enable-debug; for both compilers that sets the most relevant options to -g -Os -fno-omit-frame-pointer. I ignored the Rust compilation units in libxul since they use LLVM in both builds.

In our somewhat arbitrary metric, gcc is significantly ahead of clang for both parameters and local variables. "Parameters" includes the parameters of inlined functions. As mentioned above, the ideal ratio for local variables is actually less than 1, which explains at least part of the difference between parameters and local variables here.

gcc uses some debuginfo features that clang doesn't know about yet. An important one is DW_OP_GNU_entry_value (standardized as DW_OP_entry_value in DWARF 5). This defines a variable (usually a parameter) in terms of an expression to be evaluated at the moment the function was entered. A traditional debugger can often evaluate such expressions after entering the function, by inspecting the caller's stack frame; our Pernosco debugger has easy access to all program states, so such expressions are no problem at all. I evaluated the impact of DW_OP_GNU_entry_value and the related DW_OP_GNU_parameter_ref by configuring debuginfo-quality to treat definitions using those features as missing. (I'm assuming that gcc only uses those features when a variable value is not otherwise available.)
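As a rough illustration of what "treat those definitions as missing" involves, the check boils down to spotting a few opcodes in each location expression. Here's a sketch with hard-coded opcode values from the DWARF 5 spec and the GNU vendor extensions; it's not necessarily how debuginfo-quality decodes expressions (a raw byte scan like this can false-positive on opcode values embedded inside operands).

// DWARF expression opcodes of interest.
const DW_OP_ENTRY_VALUE: u8 = 0xa3; // DW_OP_entry_value (DWARF 5)
const DW_OP_GNU_ENTRY_VALUE: u8 = 0xf3; // DW_OP_GNU_entry_value
const DW_OP_GNU_PARAMETER_REF: u8 = 0xfa; // DW_OP_GNU_parameter_ref

// Returns true if a location expression appears to rely on
// entry-value information.
fn uses_entry_value(expr_bytes: &[u8]) -> bool {
    expr_bytes.iter().any(|&op| {
        op == DW_OP_ENTRY_VALUE || op == DW_OP_GNU_ENTRY_VALUE || op == DW_OP_GNU_PARAMETER_REF
    })
}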

DW_OP_GNU_entry_value has a big impact on parameters but almost no impact on local variables. It accounts for the majority, but not all, of gcc's advantage over clang for parameters. DW_OP_GNU_parameter_ref has almost no impact at all. However, in most cases where DW_OP_GNU_entry_value would be useful, users can work around its absence by manually inspecting earlier stack frames, especially when time-travel is available. Therefore implementing DW_OP_GNU_entry_value may not be as high a priority as these numbers would suggest.

Improving the local variable numbers may be more useful. I used debuginfo-quality to compare two binaries (clang-built and gcc-built), computing, for each function, the difference between its definition coverage ratios in the two builds, looking only at local variables, and sorting functions by that difference:

debuginfo-quality --language cpp --functions --only-locals ~/tmp/clang-8-libxul.so ~/tmp/gcc-8-libxul.so
This gives us a list of functions starting with those where clang is generating the worst local variable information compared to gcc (and ending with the reverse). There are a lot of functions where clang failed to generate any variable definitions at all while gcc managed to generate definitions covering the whole function. I wonder if anyone is interested in looking at these functions and figuring out what needs to be fixed in clang.

Designing and implementing this kind of analysis is error-prone. I've made my analysis tool source code available, so feel free to point out any improvements that could be made.

Update: Helpful people on Twitter pointed me to some other excellent work in this area. Dexter is another tool for measuring debuginfo quality; it's much more thorough than my tool, but less scalable, and it depends on a particular program execution. I think it complements my work nicely. It has led to ongoing work to improve LLVM debuginfo. There is also CheckDebugify infrastructure in LLVM to detect loss of debuginfo, which is also driving improvements. Alexandre Oliva has an excellent writeup of what gcc does to preserve debuginfo through optimization passes.

Update #2: It turns out llvm-dwarfdump has a --statistics option which measures something very similar to what I'm measuring. One difference is that if a variable has any definitions at all, llvm-dwarfdump treats the program point where it's first defined as the start of its scope. That's an assumption I didn't want to make. There is a graph of this metric over the last 5.5 years of clang, using clang 3.4 as a benchmark. It shows that things got really bad a couple of years ago but have since been improving.

Sunday, 4 November 2018

What Is "Evil" Anyway?

I found this Twitter thread insightful, given its assumptions. I think that, perhaps inadvertently, it highlights the difficulties of honest discussion of evil in a secular context. The author laments:

It is beyond us, today, to conclude that we have enemies whose moral universe is such that loyalty to our own morality requires us to understand it and them as evil.
That is, evil means moral principles (and the people who hold them) which are incompatible with our own. That definition is honest and logical, and I think probably the best one can do under physicalist assumptions. Unfortunately it makes evil entirely subjective; it means other people can accurately describe us and our principles as evil in just the same way as we describe them as evil. All "evil" must be qualified as "evil according to me" or "evil according to you".

This is a major problem because (almost?) nobody actually thinks or talks about evil this way in day to day life, neither explicitly nor implicitly. Instead we think and act as if "evil" is an objective fact independent of the observer. Try tacking on to every expression of moral outrage the caveat "... but I acknowledge that other people have different moral assumptions which are objectively just as valid". It doesn't work.

Christians and many other monotheists avoid this problem by identifying a privileged frame of moral reference: God's. Our moral universe may or may not align with God's, or perhaps we don't care, or we may have trouble determining what God's is, but at least it lets us define evil objectively.

The Twitter thread raises a further issue: when one encounters evil people — people whose moral universe is incompatible with our own — what shall we do? Without a privileged frame of moral reference, one can't honestly seek to show them they are wrong. At best one can use non-rational means, including force, to encourage them to change their assumptions, or if that fails, perhaps they can be suppressed and their evil neutralized. This too is most unsatisfactory.

The Christian worldview is a lot more hopeful. We believe in a standard by which all will be measured. We believe in God's justice for transgressions. We believe in redemption through Jesus for those who fall short (i.e. everyone). We seek to love those who (for now) reject God's moral universe ... a group which sometimes includes ourselves. We see that even those most opposed or indifferent to God's purposes can change. These beliefs are not purely subjective, but grounded in objective truths about what God has done and is doing in the world.

Thursday, 1 November 2018

Comments on "REPT: Reverse Debugging of Failures in Deployed Software"

It's a pretty good paper! In fact, it won a "best paper" award.

The basic idea is to use Intel PT to gather a control-flow trace and keep just the last segment in a memory buffer, and dump that buffer as part of a crash memory dump. Then they perform some inference based on final memory values and the control-flow trace to figure out what the values of registers must have been during execution leading up to the crash. The inference is non-trivial; they have to use an iterative approach and sometimes make optimistic assumptions that later are corrected. With the register values and memory values they infer, they can provide some amount of reverse-execution debugging over the time interval leading up to the crash, with negligible overhead during normal execution. It's all done in the context of Windows and WinDbg.
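To give a flavor of what that kind of inference involves, here's a highly simplified Rust sketch: a toy instruction set, registers only, no memory, and a single backward pass with no iterative refinement. It's my reading of the general idea, not REPT's actual algorithm.

// A toy instruction set, just to illustrate the flavor of the analysis.
#[derive(Clone, Copy)]
enum Insn {
    AddImm { dst: usize, imm: u64 }, // reg[dst] += imm
    MovImm { dst: usize, imm: u64 }, // reg[dst] = imm (old value destroyed)
}

// Given the register state *after* the traced instructions and the
// control-flow trace itself, recover register states at earlier points
// where possible. None means "unknown".
fn infer_backwards(final_regs: Vec<Option<u64>>, trace: &[Insn]) -> Vec<Vec<Option<u64>>> {
    let mut regs = final_regs;
    let mut states = vec![regs.clone()];
    for insn in trace.iter().rev() {
        match *insn {
            // An ADD with a known output is reversible: old = new - imm.
            Insn::AddImm { dst, imm } => {
                regs[dst] = regs[dst].map(|v| v.wrapping_sub(imm));
            }
            // A MOV overwrote dst, so its previous value can't be
            // recovered by backward analysis alone.
            Insn::MovImm { dst, .. } => {
                regs[dst] = None;
            }
        }
        states.push(regs.clone());
    }
    states.reverse(); // states[i] = registers just before trace[i]
    states
}

fn main() {
    // Final state: reg0 = 7, reg1 unknown. Trace: reg0 = 5; reg0 += 2.
    let trace = [Insn::MovImm { dst: 0, imm: 5 }, Insn::AddImm { dst: 0, imm: 2 }];
    let states = infer_backwards(vec![Some(7), None], &trace);
    assert_eq!(states[1][0], Some(5)); // before the ADD, reg0 was 5
    assert_eq!(states[0][0], None);    // before the MOV, reg0 is unknown
}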

One qualm I have about this paper's approach is that their optimistic assumptions mean in some cases they report incorrect data values. They're able to show that their values are correct most of the time, but I would hesitate to show users data that has a significant chance of being incorrect. There might be some room for improvement here, e.g. distinguishing known-correct values from maybe-incorrect values or using more sophisticated confidence estimation.

Overall using PT to capture control flow to be saved in crash dumps seems like a really good idea. Everyone should do that. The details of the inference are probably less important, but some kind of inference like in this paper seems really useful for crash dump analysis. I think you'll still want full rr-style recording when you can afford it, though. (I think it's possible that one day we'll be able to have full recording with negligible overhead ... a man can dream!)

I have a couple of quibbles. The paper doesn't describe how they handle the x86 EFLAGS register. This is important because if EFLAGS is part of their register state, then even simple instructions like ADD aren't reversible, because they clobber the flags; but if EFLAGS is not part of their register state, then they can't deduce the outputs of instructions like CMOVxx, SETxx, ADC etc. I asked the lead author and they confirmed they have special handling for EFLAGS.
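A tiny illustration of the dilemma, in the same toy style as the sketch above (this is not REPT's representation, just the reasoning):

struct State {
    rax: Option<u64>,
    rbx: Option<u64>,
    zf: Option<bool>, // zero flag; None = unknown
}

// Reversing `add rax, rbx`: rax is recoverable from the post-state, but
// the ADD overwrote EFLAGS, so the *previous* flags become unknown.
fn reverse_add(after: &State) -> State {
    State {
        rax: match (after.rax, after.rbx) {
            (Some(a), Some(b)) => Some(a.wrapping_sub(b)),
            _ => None,
        },
        rbx: after.rbx,
        zf: None,
    }
}

// Forward `cmovne rax, rbx`: without knowing ZF we can't tell whether rax
// was overwritten, so its output value is unknown.
fn forward_cmovne(before: &State) -> State {
    State {
        rax: match before.zf {
            Some(false) => before.rbx, // condition held: rax := rbx
            Some(true) => before.rax,  // condition failed: rax unchanged
            None => None,
        },
        rbx: before.rbx,
        zf: before.zf,
    }
}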

Another quibble I have is that they don't describe how they chose the bugs to evaluate REPT against, so it's difficult to be confident that they didn't cherry-pick bugs where REPT performs well. Unfortunately, this is typical for computer science papers in areas where there is no accepted standard corpus of test subjects. Our field should raise standards by expecting authors to document protocols that avoid the obvious biases — not just test cherry-picking, but also tuning tools to work well on the same tests we then report results for.

Sunday, 28 October 2018

Auckland Half Marathon 2018

This morning I ran the Auckland Half Marathon, for the sixth year in a row, fifth year barefoot. I got my best time ever: official time 1:45:07, net time 1:44:35. I had to push hard, and I've been sore all day! Glad to get under 1:45 net.

Last year I applied climbing tape to the balls of my feet to avoid the skin wearing through. Unfortunately I lost that tape and forgot to get more, so I tried some small patches of duct tape but they came off very early in the race. Nevertheless my feet held up pretty well, perhaps because I've been running more consistently during the year and perhaps because my feet are gradually toughening up year after year. The road was wet today because it rained overnight, but I can't tell whether that is better or worse for me.

Thursday, 25 October 2018

Problems Scaling A Large Multi-Crate Rust Project

We have 85K lines of Rust code implementing the backend of our Pernosco debugger. To impose some modularity constraints and to reduce build times, from the beginning we organized our code as a large set of crates in a single Cargo workspace in a single Gitlab repository. Currently we have 48 crates. This has mostly worked pretty well but as the number of our crates keeps increasing, we have hit some serious scalability problems.

The most fundamental issue is that many crates build one or more executables — e.g. command-line tools to work with data managed by the crate — and most crates also build an executable containing tests (per standard Rust conventions). Each of these executables is statically linked, and since each crate on average depends on many other crates (both our own and third-party), the total size of the executables is growing at roughly the square of the number of crates. The problem is especially acute for debug builds with full debuginfo, which are about five times larger than release builds built with debug=1 (optimized builds with just enough debuginfo for stack traces to include inlined functions). To be concrete, our 85K line project builds 4.2G of executables in a debug build, and 750M of executables in release. There are 20 command-line tools and 81 test executables, of which 50 actually run no tests (the latter are small, only about 5.7M each).

The large size of these executables slows down builds and creates problems for our Gitlab CI as they have to be copied over the network between build and test phases. But I don't know what to do about the problem.

We could limit the number of test executables by moving all our integration tests into a single executable in some super-crate ... though that would slow down incremental builds because that super-crate would need to be rebuilt a lot, and it would be large.

We could limit the number of command-line tools in a similar way — combine all the tool executables into a super-tool crate that uses the "Swiss Army knife" approach, deciding what its behavior should be by examining argv[0]. Again, this would penalize incremental builds.
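The dispatch itself is straightforward; here's a minimal sketch (the tool names are made up), assuming the single binary is installed under multiple names via symlinks or hard links:

use std::env;
use std::path::Path;

fn main() {
    // Dispatch on the name this binary was invoked as.
    let argv0 = env::args().next().unwrap_or_default();
    let tool = Path::new(&argv0)
        .file_name()
        .and_then(|n| n.to_str())
        .unwrap_or("")
        .to_string();
    match tool.as_str() {
        "dump-trace" => dump_trace(),
        "index-db" => index_db(),
        other => eprintln!("unknown tool name: {}", other),
    }
}

// Each former standalone tool becomes a function (or module) here.
fn dump_trace() { /* ... */ }
fn index_db() { /* ... */ }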

Cargo supports a kind of dynamic linking with its dylib option, but I'm not sure how to get that to work. Maybe we could create a super-crate that reexports every single crate in our workspace, attach all tests and binary tools to that crate, and ask Cargo to link that crate as a dynamic library, so that all the tests and tools are linking to that library. This would also hurt incremental builds, but maybe not as much as the above approaches. Then again, I don't know if it would actually work.
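Concretely, I imagine an umbrella crate whose Cargo.toml declares every workspace crate as a dependency and sets crate-type = ["dylib"] on its library target, with a lib.rs that just re-exports everything; something like the sketch below (crate names invented), with every test and tool then depending only on this one crate. I haven't verified that Cargo and rustc will actually link the executables against the dylib rather than duplicating the code.

// src/lib.rs of a hypothetical umbrella crate built as a dylib.
// Each `pub use` re-exports one of our workspace crates so that the
// command-line tools and test executables can depend on this single
// dynamic library instead of statically linking everything.
pub use backend_core;
pub use trace_index;
pub use query_engine;
// ... one line per workspace crate.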

Another option would be to break up the project into separate independently built subprojects, but that creates a lot of friction.

Another possibility is that we should simply use fewer, bigger crates. This is probably more viable than it was a couple of years ago, when we didn't have incremental rustc compilation.

I wonder if anyone has hit this problem before, and tried the above solutions or come up with any other solutions.

Wednesday, 24 October 2018

Harmful Clickbait Headline About IT Automation

Over the years a number of parents have asked me whether they should steer their children away from IT/programming careers in case those jobs are automated away in the future. I tell them that the nature of programming work will continue to change but programming will be one of the last jobs to be fully automated; if and when computers don't need to be programmed by people, they will be capable of almost anything. (It's not clear to me why some people are led to believe programming is particularly prone to automation; maybe because they see it as faceless?)

Therefore I was annoyed by the headline in the NZ Herald this morning: "Is AI about to unseat our programmers?" Obviously it's deliberate clickbait, and it wasn't chosen by Juha Saarinen, whose article is actually quite good. But I'm sure some of the parents I've talked to, and many others like them whom I'll never know, will have their fears reinforced by this headline, and perhaps some people will be turned away from programming careers to their detriment, and the detriment of New Zealand's tech industry.

Monday, 22 October 2018

The Fine Line Between Being A Good Parent And A Bad Parent

Two incidents will illustrate.

Early 2013: I take my quite-young kids to hike to the summit of Mt Taranaki. The ascent is more grueling than I expected; there's a long scree slope which is two steps forward, one step back. My kids start complaining, then crying. I have to decide whether to turn back or to cajole them onward. There are no safety issues (the weather is perfect and it's still early in the day), but the stakes feel high: if I keep pushing them forward but we eventually fail, I will have made them miserable for no good reason, and no-one likes a parent who bullies their kids. I roll the dice and press on. We make it! After the hike, we all feel it was a great achievement and the kids agree we did the right thing to carry on.

Two weeks ago: I take my kids and a couple of adult international students from our church on an overnight hiking trip to the Coromandel Peninsula. On Friday we hike for four hours to Crosbies Hut and stay there overnight. It's wonderful — we arrive at the hut around sunset in glorious weather, eat a good meal, and the night sky is awesome. The next day I return to our starting point, pick up the car and drive around to Whangaiterenga campsite in Kauaeranga Valley so my kids and our guests can descend into the valley by a different route that crosses Whangaiterenga stream a few times. I had called the visitor's centre on Friday to confirm that that track is open and the stream crossings are easy. My kids are now quite experienced (though our guests aren't) and should be able to easily handle this on their own. I get to the pickup point ahead of schedule, but two hours after I expected them to arrive, they still haven't :-(.

To cut the story short, at that point I get a text message from them and after some communication they eventually walk out five hours late. They were unable to pick up the trail after the first stream crossing (maybe it was washed out), and had to walk downstream for hours, also taking a detour up a hill to get phone reception temporarily. The kids made good decisions and gained a lot of confidence from handling an unexpected situation on their own.

What bothers me is that both of these situations could easily have turned out differently. In neither case would there have been any real harm — the weather in Coromandel was excellent and an unexpected night in the bush would have been perfectly safe given the gear they were carrying (if indeed they weren't found before dark). Nevertheless I can see that my decisions could have looked bad in hindsight. If we make a habit of taking these kinds of small risks — and I think we should! — then not all of them are going to pay off. I think, therefore, we should be forgiving of parents who take reasonable risks even if they go awry.