Wednesday, 28 November 2018

Capitalism, Competition And Microsoft Antitrust Action

Kevin Williamson writes an ode to the benefits of competition and capitalism, one of his themes being the changing fortunes of Apple and Microsoft over the last two decades. I'm mostly sympathetic, but in a hurry to decry "government intervention in and regulation of the part of our economy that is, at the moment, working best", he forgets or neglects to mention the antitrust actions brought by the US government against Microsoft in the mid-to-late 1990s. Without those actions, there is a high chance things could have turned out very differently for Apple. At the very least, we do not know what would have happened without those actions, and no-one should use the Apple/Microsoft rivalry as an example of glorious laissez-faire capitalism that negates the arguments of those calling for antitrust action today.

Would Microsoft have invested $150M to save Apple in 1997 if they hadn't been under antitrust pressure since 1992? In 1994 Microsoft settled with the Department of Justice, agreeing to refrain from tying the sale of other Microsoft products to the sale of Windows. It is reasonable to assume that the demise of Apple, Microsoft's only significant competitor in desktop computer operating systems, would have increased the antitrust scrutiny on Microsoft. At that point Microsoft's market cap was $150B vs Apple's $2B, so $150M seems like a cheap and low-risk investment by Gates to keep the US government off his back. I do not know of any other rational justification for that investment. Without it, Apple would very likely have gone bankrupt.

In a world where the United States v. Microsoft Corporation (2001) antitrust lawsuit didn't happen, would the iPhone have been as successful? In 1999 I was so concerned about the potential domination of Microsoft over the World Wide Web that I started making volunteer contributions to (what became) Firefox (which drew me into working for Mozilla until 2016). At that time Microsoft was crushing Netscape with superior engineering, lowering the price of the browser to zero, bundling IE with Windows and other hardball tactics that had conquered all previous would-be Microsoft competitors. With total domination of the browser market, Microsoft would be able to take control of Web standards and lead Web developers to rely on Microsoft-only features like ActiveX (or later Avalon/WPF), making it practically impossible for anyone but Microsoft to create a browser that could view the bulk of the Web. Web browsing was an important feature for the first release of the iPhone in 2007; indeed for the first year, before the App Store launched, it was the only way to do anything on the phone other than use the built-in apps. We'll never know how successful the iPhone would have been without a viable Web browser, but it might have changed the competitive landscape significantly. Thankfully Mozilla managed to turn the tide to prevent Microsoft's total browser domination. As a participant in that battle, I'm convinced that the 2001 antitrust lawsuit played a big part in restraining Microsoft's worst behavior, creating space (along with Microsoft blunders) for Firefox to compete successfully during a narrow window of opportunity when creating a viable alternative browser was still possible. (It's also interesting to consider what Microsoft could have done to Google with complete browser domination and no antitrust concerns.)

We can't be sure what the no-antitrust world would have been like, but those who argue that Apple/Microsoft shows antitrust action was not needed bear the burden of showing that their counterfactual world is compelling.

Sunday, 25 November 2018

Raglan

We spent a couple of days in Raglan celebrating our wedding anniversary. It's a small coastal town a couple of hours drive south of Auckland, famous for surfing, quiet at this time of year though I hear it gets very busy in the summer holidays. We had a relaxing time, exploring the town a little bit and driving down the coast to Te Toto Gorge to climb Mt Karioi. That's a smaller version of the nearby Mt Pirongia — the summit area comprises many steep volcanic hillocks and ridges, all covered in dense bush. The track is rough, and in one place there are chains to help climb up and down. After two and a half hours we got as far as the lookout for some fabulous views over Raglan and the coast further north, and decided not to bother with the extra hour to the true summit. The views over the ocean on the way up were quite spectacular. On the way up we could see the snowy peaks of Mt Taranaki and, from the lookout, Mt Ruapehu — each about 170km away.

Tuesday, 13 November 2018

Comparing The Quality Of Debug Information Produced By Clang And Gcc

I've had an intuition that clang produces generally worse debuginfo than gcc for optimized C++ code. It seems that clang builds have more variables "optimized out" — i.e. when stopped inside a function where a variable is in scope, the compiler's generated debuginfo does not describe the value of the variable. This makes debuggers less effective, so I've attempted some qualitative analysis of the issue.

I chose to measure, for each parameter and local variable, the range of instruction bytes within its function over which the debuginfo can produce a value for this variable, and also the range of instruction bytes over which the debuginfo says the variable is in scope (i.e. the number of instruction bytes in the enclosing lexical block or function). I add those up over all variables, and compute the ratio of variable-defined-bytes to variable-in-scope-bytes. The higher this "definition coverage" ratio, the better.

This metric has some weaknesses. DWARF debuginfo doesn't give us accurate scopes for local variables; the defined-bytes for a variable defined halfway through its lexical scope will be about half of its in-scope-bytes, even if the debuginfo is perfect, so the ideal ratio is less than 1 (and unfortunately we can't compute it). In debug builds, and sometimes in optimized builds, compilers may give a single definition for the variable value that applies to the entire scope; this improves our metric even though the results are arguably worse. Sometimes compilers produce debuginfo that is simply incorrect; our metric doesn't account for that. Not all variables and functions are equally interesting for debugging, but this metric weighs them all equally. The metric assumes that the points of interest for a debugger are equally distributed over instruction bytes. On the other hand, the metric is relatively simple. It focuses on what we care about. It depends only on the debuginfo, not on the generated code or actual program executions. It's robust to constant scaling of code size. We can calculate the metric for any function or variable, which makes it easy to drill down into the results and lets us rank all functions by the quality of their debuginfo. We can compare the quality of debuginfo between different builds of the same binary at function granularity. The metric is sensitive to optimization decisions such as inlining; that's OK.

I built a debuginfo-quality tool in Rust to calculate this metric for an arbitrary ELF binary containing DWARF debuginfo. I applied it to the main Firefox binary libxul.so built with clang 8 (8.0.0-svn346538-1~exp1+0~20181109191347.1890~1.gbp6afd8e) and gcc 8 (8.2.1 20181105 (Red Hat 8.2.1-5)) using the default Mozilla build settings plus ac_add_options --enable-debug; for both compilers that sets the most relevant options to -g -Os -fno-omit-frame-pointer. I ignored the Rust compilation units in libxul since they use LLVM in both builds.

In our somewhat arbitrary metric, gcc is significantly ahead of clang for both parameters and local variables. "Parameters" includes the parameters of inlined functions. As mentioned above, the ideal ratio for local variables is actually less than 1, which explains at least part of the difference between parameters and local variables here.

gcc uses some debuginfo features that clang doesn't know about yet. An important one is DW_OP_GNU_entry_value (standardized as DW_OP_entry_value in DWARF 5). This defines a variable (usually a parameter) in terms of an expression to be evaluated at the moment the function was entered. A traditional debugger can often evaluate such expressions after entering the function, by inspecting the caller's stack frame; our Pernosco debugger has easy access to all program states, so such expressions are no problem at all. I evaluated the impact of DW_OP_GNU_entry_value and the related DW_OP_GNU_parameter_ref by configuring debuginfo-quality to treat definitions using those features as missing. (I'm assuming that gcc only uses those features when a variable value is not otherwise available.)

DW_OP_GNU_entry_value has a big impact on parameters but almost no impact on local variables. It accounts for the majority, but not all, of gcc's advantage over clang for parameters. DW_OP_GNU_parameter_ref has almost no impact at all. However, in most cases where DW_OP_GNU_entry_value would be useful, users can work around its absence by manually inspecting earlier stack frames, especially when time-travel is available. Therefore implementing DW_OP_GNU_entry_value may not be as high a priority as these numbers would suggest.

Improving the local variable numbers may be more useful. I used debuginfo-quality to compare two binaries (clang-built and gcc-built), computing, for each function, the difference in the function's definition coverage ratios, looking only at local variables and sorting functions according to that difference:

debuginfo-quality --language cpp --functions --only-locals ~/tmp/clang-8-libxul.so ~/tmp/gcc-8-libxul.so
This gives us a list of functions starting with those where clang is generating the worst local variable information compared to gcc (and ending with the reverse). There are a lot of functions where clang failed to generate any variable definitions at all while gcc managed to generate definitions covering the whole function. I wonder if anyone is interested in looking at these functions and figuring out what needs to be fixed in clang.

Designing and implementing this kind of analysis is error-prone. I've made my analysis tool source code available, so feel free to point out any improvements that could be made.

Update Helpful people on Twitter pointed me to some excellent other work in this area. Dexter is another tool for measuring debuginfo quality; it's much more thorough than my tool, but less scalable and depends on a particular program execution. I think it complements my work nicely. It has led to ongoing work to improve LLVM debuginfo. There is also CheckDebugify infrastructure in LLVM to detect loss of debuginfo, which is also driving improvements. Alexandre Oliva has an excellent writeup of what gcc does to preserve debuginfo through optimization passes.

Update #2 Turns out llvm-dwarfdump has a --statistics option which measures something very similar to what I'm measuring. One difference is that if a variable has any definitions at all, llvm-dwarfdump treats the program point where it's first defined as the start of its scope. That's an assumption I didn't want to make. There is a graph of this metric over the last 5.5 years of clang, using clang 3.4 as a benchmark. It shows that things got really bad a couple of years ago but have since been improving.

Sunday, 4 November 2018

What Is "Evil" Anyway?

I found this Twitter thread insightful, given its assumptions. I think that, perhaps inadvertently, it highlights the difficulties of honest discussion of evil in a secular context. The author laments:

It is beyond us, today, to conclude that we have enemies whose moral universe is such that loyalty to our own morality requires us to understand it and them as evil.
That is, evil means moral principles (and the people who hold them) which are incompatible with our own. That definition is honest and logical, and I think probably the best one can do under physicalist assumptions. Unfortunately it makes evil entirely subjective; it means other people can accurately describe us and our principles as evil in just the same way as we describe them as evil. All "evil" must be qualified as "evil according to me" or "evil according to you".

This is a major problem because (almost?) nobody actually thinks or talks about evil this way in day to day life, neither explicitly nor implicitly. Instead we think and act as if "evil" is an objective fact independent of the observer. Try tacking on to every expression of moral outrage the caveat "... but I acknowledge that other people have different moral assumptions which are objectively just as valid". It doesn't work.

Christians and many other monotheists avoid this problem by identifying a privileged frame of moral reference: God's. Our moral universe may or may not align with God's, or perhaps we don't care, or we may have trouble determining what God's is, but at least it lets us define evil objectively.

The Twitter thread raises a further issue: when one encounters evil people — people whose moral universe is incompatible with our own — what shall we do? Without a privileged frame of moral reference, one can't honestly seek to show them they are wrong. At best one can use non-rational means, including force, to encourage them to change their assumptions, or if that fails, perhaps they can be suppressed and their evil neutralized. This too is most unsatisfactory.

The Christian worldview is a lot more hopeful. We believe in a standard by which all will be measured. We believe in God's justice for transgressions. We believe in redemption through Jesus for those who fall short (i.e. everyone). We seek to love those who (for now) reject God's moral universe ... a group which sometimes includes ourselves. We see that even those most opposed or indifferent to God's purposes can change. These beliefs are not purely subjective, but grounded in objective truths about what God has done and is doing in the world.

Thursday, 1 November 2018

Comments on "REPT: Reverse Debugging of Failures in Deployed Software"

It's a pretty good paper! In fact, it won a "best paper" award.

The basic idea is to use Intel PT to gather a control-flow trace and keep just the last segment in a memory buffer, and dump that buffer as part of a crash memory dump. Then they perform some inference based on final memory values and the control-flow trace to figure out what the values of registers must have been during execution leading up to the crash. The inference is non-trivial; they have to use an iterative approach and sometimes make optimistic assumptions that later are corrected. With the register values and memory values they infer, they can provide some amount of reverse-execution debugging over the time interval leading up to the crash, with negligible overhead during normal execution. It's all done in the context of Windows and WinDbg.

One qualm I have about this paper's approach is that their optimistic assumptions mean in some cases they report incorrect data values. They're able to show that their values are correct most of the time, but I would hesitate to show users data that has a significant chance of being incorrect. There might be some room for improvement here, e.g. distinguishing known-correct values from maybe-incorrect values or using more sophisticated confidence estimation.

Overall using PT to capture control flow to be saved in crash dumps seems like a really good idea. Everyone should do that. The details of the interference are probably less important, but some kind of inference like in this paper seems really useful for crash dump analysis. I think you'll still want full rr-style recording when you can afford it, though. (I think it's possible that one day we'll be able to have full recording with negligible overhead ... a man can dream!)

I have a couple of quibbles. The paper doesn't describe how they handle the x86 EFLAGS register. This is important because if EFLAGS is part of their register state, then even simple instructions like ADD aren't reversible, because they reset flags; but if ELFAGS is not part of their register state, then they can't deduce the outputs for instructions like CMOVxx, SETxx, ADC etc. I asked the lead author and they confirmed they have special handling for EFLAGS.

Another quibble I have is that they don't describe how they chose the bugs to evaluate REPT against, therefore it's difficult to be confident that they didn't cherry-pick bugs where REPT performs well. Unfortunately, this is typical for computer science papers in areas where there is no accepted standard corpus of test subjects. Our field should raise standards by expecting authors to document protocols that avoid the obvious biases — not just test cherry-picking, but also tuning tools to work well on the same tests we then report results for.

Sunday, 28 October 2018

Auckland Half Marathon 2018

This morning I ran the Auckland Half Marathon, for the sixth year in a row, fifth year barefoot. I got my best time ever: official time 1:45:07, net time 1:44:35. I had to push hard, and I've been sore all day! Glad to get under 1:45 net.

Last year I applied climbing tape to the balls of my feet to avoid the skin wearing through. Unfortunately I lost that tape and forgot to get more, so I tried some small patches of duct tape but they came off very early in the race. Nevertheless my feet held up pretty well, perhaps because I've been running more consistently during the year and perhaps because my feet are gradually toughening up year after year. The road was wet today because it rained overnight, but I can't tell whether that is better or worse for me.

Thursday, 25 October 2018

Problems Scaling A Large Multi-Crate Rust Project

We have 85K lines of Rust code implementing the backend of our Pernosco debugger. To impose some modularity constraints and to reduce build times, from the beginning we organized our code as a large set of crates in a single Cargo workspace in a single Gitlab repository. Currently we have 48 crates. This has mostly worked pretty well but as the number of our crates keeps increasing, we have hit some serious scalability problems.

The most fundamental issue is that many crates build one or more executables — e.g. command-line tools to work with data managed by the crate — and most crates also build an executable containing tests (per standard Rust conventions). Each of these executables is statically linked, and since each crate on average depends on many other crates (both our own and third-party), the total size of the executables is growing at roughly the square of the number of crates. The problem is especially acute for debug builds with full debuginfo, which are about five times larger than release builds built with debug=1 (optimized builds with just enough debuginfo for stack traces to include inlined functions). To be concrete, our 85K line project builds 4.2G of executables in a debug build, and 750M of executables in release. There are 20 command-line tools and 81 test executables, of which 50 actually run no tests (the latter are small, only about 5.7M each).

The large size of these executables slows down builds and creates problems for our Gitlab CI as they have to be copied over the network between build and test phases. But I don't know what to do about the problem.

We could limit the number of test executables by moving all our integration tests into a single executable in some super-crate ... though that would slow down incremental builds because that super-crate would need to be rebuilt a lot, and it would be large.

We could limit the number of command-line tools in a similar way — combine all the tool executables a super-tool crate that uses the "Swiss Army knife" approach, deciding what its behavior should be by examining argv[0]. Again, this would penalize incremental builds.

Cargo supports a kind of dynamic linking with its dylib option, but I'm not sure how to get that to work. Maybe we could create a super-crate that reexports every single crate in our workspace, attach all tests and binary tools to that crate, and ask Cargo to link that crate as a dynamic library, so that all the tests and tools are linking to that library. This would also hurt incremental builds, but maybe not as much as the above approaches. Then again, I don't know if it would actually work.

Another option would be to break up the project into separate independently built subprojects, but that creates a lot of friction.

Another possibility is that we should simply use fewer, bigger crates. This is probably more viable than it was a couple of years ago, when we didn't have incremental rustc compilation.

I wonder if anyone has hit this problem before, and tried the above solutions or come up with any other solutions.