Tuesday, 13 November 2018

Comparing The Quality Of Debug Information Produced By Clang And Gcc

I've had an intuition that clang produces generally worse debuginfo than gcc for optimized C++ code. It seems that clang builds have more variables "optimized out" — i.e. when stopped inside a function where a variable is in scope, the compiler's generated debuginfo does not describe the value of the variable. This makes debuggers less effective, so I've attempted some qualitative analysis of the issue.

I chose to measure, for each parameter and local variable, the range of instruction bytes within its function over which the debuginfo can produce a value for this variable, and also the range of instruction bytes over which the debuginfo says the variable is in scope (i.e. the number of instruction bytes in the enclosing lexical block or function). I add those up over all variables, and compute the ratio of variable-defined-bytes to variable-in-scope-bytes. The higher this "definition coverage" ratio, the better.

This metric has some weaknesses. DWARF debuginfo doesn't give us accurate scopes for local variables; the defined-bytes for a variable defined halfway through its lexical scope will be about half of its in-scope-bytes, even if the debuginfo is perfect, so the ideal ratio is less than 1 (and unfortunately we can't compute it). In debug builds, and sometimes in optimized builds, compilers may give a single definition for the variable value that applies to the entire scope; this improves our metric even though the results are arguably worse. Sometimes compilers produce debuginfo that is simply incorrect; our metric doesn't account for that. Not all variables and functions are equally interesting for debugging, but this metric weighs them all equally. The metric assumes that the points of interest for a debugger are equally distributed over instruction bytes. On the other hand, the metric is relatively simple. It focuses on what we care about. It depends only on the debuginfo, not on the generated code or actual program executions. It's robust to constant scaling of code size. We can calculate the metric for any function or variable, which makes it easy to drill down into the results and lets us rank all functions by the quality of their debuginfo. We can compare the quality of debuginfo between different builds of the same binary at function granularity. The metric is sensitive to optimization decisions such as inlining; that's OK.

I built a debuginfo-quality tool in Rust to calculate this metric for an arbitrary ELF binary containing DWARF debuginfo. I applied it to the main Firefox binary libxul.so built with clang 8 (8.0.0-svn346538-1~exp1+0~20181109191347.1890~1.gbp6afd8e) and gcc 8 (8.2.1 20181105 (Red Hat 8.2.1-5)) using the default Mozilla build settings plus ac_add_options --enable-debug; for both compilers that sets the most relevant options to -g -Os -fno-omit-frame-pointer. I ignored the Rust compilation units in libxul since they use LLVM in both builds.

In our somewhat arbitrary metric, gcc is significantly ahead of clang for both parameters and local variables. "Parameters" includes the parameters of inlined functions. As mentioned above, the ideal ratio for local variables is actually less than 1, which explains at least part of the difference between parameters and local variables here.

gcc uses some debuginfo features that clang doesn't know about yet. An important one is DW_OP_GNU_entry_value (standardized as DW_OP_entry_value in DWARF 5). This defines a variable (usually a parameter) in terms of an expression to be evaluated at the moment the function was entered. A traditional debugger can often evaluate such expressions after entering the function, by inspecting the caller's stack frame; our Pernosco debugger has easy access to all program states, so such expressions are no problem at all. I evaluated the impact of DW_OP_GNU_entry_value and the related DW_OP_GNU_parameter_ref by configuring debuginfo-quality to treat definitions using those features as missing. (I'm assuming that gcc only uses those features when a variable value is not otherwise available.)

DW_OP_GNU_entry_value has a big impact on parameters but almost no impact on local variables. It accounts for the majority, but not all, of gcc's advantage over clang for parameters. DW_OP_GNU_parameter_ref has almost no impact at all. However, in most cases where DW_OP_GNU_entry_value would be useful, users can work around its absence by manually inspecting earlier stack frames, especially when time-travel is available. Therefore implementing DW_OP_GNU_entry_value may not be as high a priority as these numbers would suggest.

Improving the local variable numbers may be more useful. I used debuginfo-quality to compare two binaries (clang-built and gcc-built), computing, for each function, the difference in the function's definition coverage ratios, looking only at local variables and sorting functions according to that difference:

debuginfo-quality --language cpp --functions --only-locals ~/tmp/clang-8-libxul.so ~/tmp/gcc-8-libxul.so
This gives us a list of functions starting with those where clang is generating the worst local variable information compared to gcc (and ending with the reverse). There are a lot of functions where clang failed to generate any variable definitions at all while gcc managed to generate definitions covering the whole function. I wonder if anyone is interested in looking at these functions and figuring out what needs to be fixed in clang.

Designing and implementing this kind of analysis is error-prone. I've made my analysis tool source code available, so feel free to point out any improvements that could be made.

Update Helpful people on Twitter pointed me to some excellent other work in this area. Dexter is another tool for measuring debuginfo quality; it's much more thorough than my tool, but less scalable and depends on a particular program execution. I think it complements my work nicely. It has led to ongoing work to improve LLVM debuginfo. There is also CheckDebugify infrastructure in LLVM to detect loss of debuginfo, which is also driving improvements. Alexandre Oliva has an excellent writeup of what gcc does to preserve debuginfo through optimization passes.

Update #2 Turns out llvm-dwarfdump has a --statistics option which measures something very similar to what I'm measuring. One difference is that if a variable has any definitions at all, llvm-dwarfdump treats the program point where it's first defined as the start of its scope. That's an assumption I didn't want to make. There is a graph of this metric over the last 5.5 years of clang, using clang 3.4 as a benchmark. It shows that things got really bad a couple of years ago but have since been improving.

Sunday, 4 November 2018

What Is "Evil" Anyway?

I found this Twitter thread insightful, given its assumptions. I think that, perhaps inadvertently, it highlights the difficulties of honest discussion of evil in a secular context. The author laments:

It is beyond us, today, to conclude that we have enemies whose moral universe is such that loyalty to our own morality requires us to understand it and them as evil.
That is, evil means moral principles (and the people who hold them) which are incompatible with our own. That definition is honest and logical, and I think probably the best one can do under physicalist assumptions. Unfortunately it makes evil entirely subjective; it means other people can accurately describe us and our principles as evil in just the same way as we describe them as evil. All "evil" must be qualified as "evil according to me" or "evil according to you".

This is a major problem because (almost?) nobody actually thinks or talks about evil this way in day to day life, neither explicitly nor implicitly. Instead we think and act as if "evil" is an objective fact independent of the observer. Try tacking on to every expression of moral outrage the caveat "... but I acknowledge that other people have different moral assumptions which are objectively just as valid". It doesn't work.

Christians and many other monotheists avoid this problem by identifying a privileged frame of moral reference: God's. Our moral universe may or may not align with God's, or perhaps we don't care, or we may have trouble determining what God's is, but at least it lets us define evil objectively.

The Twitter thread raises a further issue: when one encounters evil people — people whose moral universe is incompatible with our own — what shall we do? Without a privileged frame of moral reference, one can't honestly seek to show them they are wrong. At best one can use non-rational means, including force, to encourage them to change their assumptions, or if that fails, perhaps they can be suppressed and their evil neutralized. This too is most unsatisfactory.

The Christian worldview is a lot more hopeful. We believe in a standard by which all will be measured. We believe in God's justice for transgressions. We believe in redemption through Jesus for those who fall short (i.e. everyone). We seek to love those who (for now) reject God's moral universe ... a group which sometimes includes ourselves. We see that even those most opposed or indifferent to God's purposes can change. These beliefs are not purely subjective, but grounded in objective truths about what God has done and is doing in the world.

Thursday, 1 November 2018

Comments on "REPT: Reverse Debugging of Failures in Deployed Software"

It's a pretty good paper! In fact, it won a "best paper" award.

The basic idea is to use Intel PT to gather a control-flow trace and keep just the last segment in a memory buffer, and dump that buffer as part of a crash memory dump. Then they perform some inference based on final memory values and the control-flow trace to figure out what the values of registers must have been during execution leading up to the crash. The inference is non-trivial; they have to use an iterative approach and sometimes make optimistic assumptions that later are corrected. With the register values and memory values they infer, they can provide some amount of reverse-execution debugging over the time interval leading up to the crash, with negligible overhead during normal execution. It's all done in the context of Windows and WinDbg.

One qualm I have about this paper's approach is that their optimistic assumptions mean in some cases they report incorrect data values. They're able to show that their values are correct most of the time, but I would hesitate to show users data that has a significant chance of being incorrect. There might be some room for improvement here, e.g. distinguishing known-correct values from maybe-incorrect values or using more sophisticated confidence estimation.

Overall using PT to capture control flow to be saved in crash dumps seems like a really good idea. Everyone should do that. The details of the interference are probably less important, but some kind of inference like in this paper seems really useful for crash dump analysis. I think you'll still want full rr-style recording when you can afford it, though. (I think it's possible that one day we'll be able to have full recording with negligible overhead ... a man can dream!)

I have a couple of quibbles. The paper doesn't describe how they handle the x86 EFLAGS register. This is important because if EFLAGS is part of their register state, then even simple instructions like ADD aren't reversible, because they reset flags; but if ELFAGS is not part of their register state, then they can't deduce the outputs for instructions like CMOVxx, SETxx, ADC etc. I asked the lead author and they confirmed they have special handling for EFLAGS.

Another quibble I have is that they don't describe how they chose the bugs to evaluate REPT against, therefore it's difficult to be confident that they didn't cherry-pick bugs where REPT performs well. Unfortunately, this is typical for computer science papers in areas where there is no accepted standard corpus of test subjects. Our field should raise standards by expecting authors to document protocols that avoid the obvious biases — not just test cherry-picking, but also tuning tools to work well on the same tests we then report results for.