Thursday, 3 December 2020

Exploiting Precognition In Binary Instrumentation Of rr Replays

This post is part of a series about the rr remix instrumentation engine that powers the Pernosco omniscient debugger.

When rr replays a recording, it constructs processes that have identical memory and register contents to the recorded processes. It replays execution of the threads in those processes in steps; each step runs CPU code until some specific program state is reached (e.g., the next system call, or until registers and the retired-conditional-branch counter match some recorded values). The most efficient way to implement binary instrumentation of this code is to inject the instrumentation engine into each replay process so the engine and its generated code share the same address space as the application code and data. Then, instead of rr replay using PTRACE_CONT to directly execute application code, it uses PTRACE_CONT to enter the instrumentation engine, which is then responsible for executing the application code with instrumentation. When the engine detects that the instrumented code has reached the desired stopping point for that replay step, it returns control to rr replay.

A key invariant is that the remix engine produces the same effects as native execution on application memory and registers at the end of each replay step. This ensures that rr replay continues to produce memory and register states that match those during recording. It also means we can switch between instrumented execution and native execution at any time between replay steps, e.g. we can replay up to a certain point using regular rr replay and then turn on instrumentation. This is useful for debugging the instrumentation engine and for applying instrumentation-based analysis to a subinterval of an rr recording. In particular Pernosco uses this to parallelize analysis by running multiple replays at once, each one instrumenting a different time interval in the recording.

When injecting the instrumentation engine into each replay process we need to allocate a contiguous range of virtual memory that will never be used by the application. Fortunately, because this is an rr replay, we can see into the future. We can quickly scan the recording, identify a sufficiently large range of memory that will never be used in any replay process, and place the engine and its data there in all replay processes from the beginning.

rr replay needs to count the number of retired conditional branches so that we can deliver asynchronous interrupts at the right time during program execution. Effectively, the RCB counter is part of the state that we instruct the engine to stop at. To avoid having to stop and start a hardware performance counter around the instrumentation's own conditional branches, the engine disregards hardware counters and instead adds instructions to count the conditional branches explicitly.

Single-Exit Fragments

Like other instrumentation engines, remix processes a group of instructions at a time, translating each group of application instructions into instrumented instructions in a hidden code buffer; we call these groups "fragments". This lets us apply optimizations across instruction boundaries within a fragment. For example, the shortest instruction to increment our conditional branch counter is a single inc instruction, but this instruction modifies the CPU's arithmetic flags, which could disrupt the application. However, it is safe to modify flags if we can guarantee that the inc will always be followed by an application instruction that overwrites those flags without reading them. Conditional branches are often preceded by such instructions. Therefore, for example, consider the following application instructions:

    cmp r12,[rsp-8]
    jz label
Because cmp overwrites the arithmetic flags, we can translate this code to
    inc [remix_rcb_counter]
    cmp r12,[rsp-8]
    jz translated_label

Correctly applying this kind of optimization in a binary instrumentation engine is more difficult than it looks, because of unexpected early exits from fragments. In this case, a problem would arise if rsp-8 is not a valid address so that instruction triggers a segmentation fault. We would fault after incrementing remix_rcb_counter, counting a conditional branch that may never happen. Even worse, we will have corrupted the application flag values; the cmp instruction we were counting on to cover that up has not executed, and may never execute! (Keep in mind that segfaults don't have to be fatal...) Normally, the possibility of these unexpected exits limits the optimizations an engine can use, and/or requires elaborate recovery machinery to mitigate — machinery that adds overhead to the just-in-time binary instrumentation process.

During remix execution, however, rr replay indicates whether execution will stop at a segmentation fault or not. If it will stop at a fault, the fault state will be the goal state for that execution step, and remix will insert code before the faulting instruction to stop execution when the right state has been reached — it will never execute a faulting instruction. In remix there are no early exits from fragments; once a fragment has been entered, it is guaranteed to run to completion. We can apply code motion within a fragment at will as long as dataflow dependencies are respected. Binary instrumentation of rr replays is a much easier problem than regular just-in-time binary instrumentation and this lets remix achieve lower overhead with a simpler implementation.

Tuesday, 1 December 2020

rr remix: Efficient Replay-Only Binary Instrumentation

The Pernosco omniscient debugger analyzes user-submitted rr recordings to extract all program states into a database, so that during debugging sessions it can efficiently recreate the program state at any point in time. This analysis requires intensive binary instrumentation of rr replays (e.g. to observe all memory writes performed by the application). To integrate binary instrumentation with rr replay we had to create a new binary instrumentation framework, which we call "rr remix". (Record, replay, remix, ...)

Our main goals for remix were:

  • Instrument the replay of any rr recording (including applications with self-modifying code)
  • Support executing arbitrary instrumentation code
  • Minimise space and time overhead, especially on huge applications (e.g. Web browsers)
  • Be efficient on code without compiler optimizations, as well as optimized code (the former being common when debugging)
  • Be simple so we could build and maintain it with minimal effort
It turned out we hit an extra goal we didn't start with:
  • Support replaying rr recordings in hardware/VMs without access to hardware performance counters or CPUID faulting, and with incompatible XSAVE layouts

Traditionally, tool performance has been mostly evaluated on small, compiler-optimized benchmark applications. Therefore our goals led us to make rather different design decisions compared to other well-known binary instrumentation tools such as Pin, Valgrind, and DynamoRIO. Also, doing binary instrumentation during rr replay imposes some extra requirements on the instrumentation engine, but also gives us very valuable knowledge of the future that we can exploit to simplify the design of the instrumentation engine and improve its performance.

To illustrate the performance of remix we will show the performance of clang++ compiling a simple C++ program. This resembles running a single test of a large application, typical behaviour for Pernosco users.

We compare up-to-date DynamoRio, Valgrind and remix all using "null tool" instrumentation, i.e. rewriting the code but not actually adding any unnecessary instrumentation. These are geometric means of five runs, real time. On optimized clang++, remix of an rr replay beats both DynamoRio and Valgrind instrumenting a normal run.

On non-optimized code, remix beats both DynamoRio and Valgrind again, but the overhead ratios of all the instrumentation systems (especially DR/Valgrind) are lower. This is a pretty good result for remix because it is much simpler than DR and Valgrind; in particular it doesn't perform any inlining, while the others do. This hurts more on non-optimized code, which has a lot more function calls that compilers would normally inline. (Valgrind deliberately optimizes for ease of tool writing and portability over performance; I'm including it for completeness and because people are familiar with it.)

Unfortunately we aren't in a position to open-source remix at this time.

We plan to follow up with some more posts documenting interesting design decisions in remix and how they contribute to these results. Probable topics:

  • The basic remix architecture and how it integrates into rr
  • Fixing regular rr's limitations on trace portability and target hardware
  • Leveraging knowledge of the future to improve the efficiency of binary rewriting
  • The mystery of efficient branch-and-link instructions on x86-64
  • Optimizing non-optimized code: leveraging hardware return address prediction in binary instrumentation
  • Optimizing non-optimized code: dataflow analysis

PS, remember this is all in service of the Pernosco omniscient debuggertry it out!

Tuesday, 24 November 2020

DOM Recording For Web Application Demos

To show off the power of our Pernosco debugger, we wanted many short demo videos of the application interface. Regular videos are relatively heavyweight and lossy; we wanted something more like Asciinema, but for our Web application, not just a terminal. So we created DOMRec, a DOM recorder.

The approach is surprisingly straightforward. To record a demo, we inject the DOMRec script into our application. DOMRec captures the current DOM state and uses a DOM Mutation Observer to record all DOM state changes with associated timestamps. To replay the demo, we create an iframe, fill it with the recorded initial DOM state, then replay the DOM state changes over time. Replay inserts links to the original stylesheets but doesn't load or execute the original scripts. (Of course it couldn't be quite that simple ... see below.)

The resulting demos are compact (most of ours are less than 100KB gzipped), work on almost any browser/device, and are pixel-perfect at any zoom level. DOMRec supports single-frame screenshots when dynamism is not required. We can't use browser HTML5 video controls but we implement play/pause/fullscreen buttons.

Capturing DOM state isn't quite enough because some rendering-relevant state isn't in the DOM. DOMRec logs mouse movement and click events and during replay, displays a fake mouse cursor and click effect. DOMRec tracks focus changes, and during replay sets "fake focus" classes on relevant DOM elements; our application style sheets check those classes in addition to the real :focus and :focus-within pseudoclasses. When necessary DOMRec creates a fake caret. DOMRec captures scroll offsets and makes a best-effort attempt to match scroll positions during recording and replay. Canvas drawing operations don't change the DOM and don't trigger events, so our application manually triggers a "didDrawCanvas" DOM event which DOMRec listens for. Sometimes the Pernosco client needs to trigger a style flush to get CSS transitions to work properly, which also needs to happen during replay, so we fire a special notification event for that too. It's a bit ad-hoc — we implemented just enough for needs, and no more — but it does at least handle Microsoft's Monaco editor correctly, which is pretty complicated.

DOMRec can record demos executed manually, but in practice our demos are scripted using Selenium so that we can rebuild the demo movie when our application interface has changed.

This kind of thing has certainly been built many times — see Inspectlet and its competitors — so when I first looked into this problem a few years ago I assumed I would find an open-source off-the-shelf solution. Unfortunately I couldn't find one that looked easy to consume. Now we're releasing DOMRec with an MIT license. To tell the truth, we're not trying hard to make DOMRec easy to consume, either; as noted above, some applications need to be tweaked to work with it, and this blog post is about all you're going to get in terms of documentation. Still, it's only 1200 lines of standalone Javascript so people may find it easier than starting from scratch.

Friday, 20 November 2020

Debugging With Screenshots In Pernosco

When debugging graphical applications it can be helpful to see what the application had on screen at a given point in time. A while back we added this feature to Pernosco.

This is nontrivial because in most record-and-replay debuggers the state of the display (e.g., the framebuffer) is not explicitly recorded. In rr for example, a typical application displays content by sending data to an X11 server, but the X11 server is not part of the recording.

Pernosco analyzes the data sent to the X11 server and reconstructs the updates to window state. Currently it only works for simple bitmap copies, but that's enough for Firefox, Chrome and many other modern applications, because the more complicated X drawing primitives aren't suitable for those applications and they do their complex drawing internally.

Pernosco doesn't just display the screenshots, it helps you debug with them. As shown in the demo, clicking on a screenshot shows a pixel-level zoomed-in view which lets you see the exact channel values in each pixel. Clicking on two screenshots highlights the pixels in them that are different. We know where the image data came from in memory, so when you click on a pixel we can trace the dataflow leading to that pixel and show you the exact moment(s) the pixel value was computed or copied. (These image debugging tools are generic and also available for debugging Firefox test failures.) Try it yourself!

Try Pernosco on your own code today!

Saturday, 14 November 2020

rr Repository Moved To Independent Organisation

For a long time, rr has not been a Mozilla project in practice, so we have worked with Mozilla to move it to an independent Github organization. The repository is now at https://github.com/rr-debugger/rr. Update your git remotes!

This gives us a bit more operational flexibility for the future because we don't need Mozilla to assist in making certain kinds of Github changes.

There have been no changes in intellectual property ownership. rr contributions made by Mozilla employees and contractors remain copyrighted by Mozilla. I will always be extremely grateful for the investment Mozilla made to create rr!

For now, the owners of the rr-debugger organisation will be me (Robert O'Callahan), Kyle Huey, and Keno Fischer (of Julia fame, who has been a prolific contributor to rr).

Tuesday, 10 November 2020

Pernosco Now Available For Individual Developers

We're pleased to announce that Pernosco is now available to individual developers!

Each Github account gets five free submissions. After that you can subscribe for 20 USD per month for 5 submissions per month or 30 USD for 5 "carry-over" submissions that don't expire. Workflow: make a recording with rr, then submit the recording to Pernosco with the pernosco-submit tool and credentials obtained from the Pernosco account page.

Pernosco supports x86-64 ELF binaries with DWARF debuginfo — e.g. C/C++/Rust — and V8 JS (in some configurations, e.g. Node, Chromium).

Experience how much better debugging can be, and contact us with any questions!

Update We've added a new "volume subscription plan" for individual users: 50 submissions per month for $50. This should make it easier to not think about rationing.

Sunday, 1 November 2020

Auckland Half Marathon 2020

I did the Auckland Half Marathon again this year. There were 4600 runners in the Half Marathon this year, a little down on the 5200 last year, but not bad given the COVID pandemic happening elsewhere and our borders being closed.

I did around 1:48. I pushed myself pretty hard this year and unlike some previous years I don't think I could realistically have pushed harder, but still didn't hit my personal best — and I'm a bit more sore than in previous years too. Some of that is no doubt just me getting older, but it may be partly due to residual tiredness having done the Pouakai Circuit last weekend. Also, it might be overall faster for me to walk rather than run up the Harbour Bridge, though it feels more satisfying to have run the whole way.

Anyway, despite my sore legs and feet, I am pleased I did my best.

Monday, 26 October 2020

Pouakai Circuit 2020

Last year's Pouakai Circuit tramp was good but I wanted to try again, hoping for better weather, and a friend had a day off and wanted to go on a short trip near Auckland, so I organised a group to walk the Pouakai Circuit again, Octobert 23-25 (Friday-Sunday).

Pouakai Hut is only 16 bunks and very popular, so I was worried it might be very full on Friday night (the day before Labour Weekend), so I decided to spend Friday night at Matekawa Hut, then walk around the mountain to Holly Hut for Saturday night, then either finish the circuit on Sunday or take the shorter track directly out to the visitor's centre. In the end we did finish the circuit, so we actually did "Pouakai Circuit plus Matekawa Hut".

On Friday we drove down from Auckland and had only a short walk to Matekawa Hut. The weather was a little bit drizzly but we had good views over the plains to the north and east of Mt Taranaki — including faint views of Mt Tongariro, Ngarahoe and Ruapehu —. Later in the evening the cloud over the mountain cleared and we got a good view of Mt Taranaki itself. We were the only people in the hut apart from one guy who walked up in the dark and arrived after we all were in bed!

On Saturday we got started reasonably early, leaving the hut around 8:20am even after cooking a decent breakfast. We walked up to Tahurangi Lodge, up "The Puffer", then west around the flank of the mountain to Holly Hut. Unfortunately the weather wasn't great, cloudy and drizzly and a bit windy too. Nevertheless the walk around the mountain was interesting, with some good views of waterfalls, lava cliffs and canyons in the mist. We got to Holly Hut in time for lunch and spent the rest of the afternoon watching people file in. As expected Holly Hut got pretty full; I think just about all the 32 bunks were taken, and there were several tents in the adjacent clearing. The weather improved a bit in the afternoon and we went out to the side track to Bell Falls, which is rather nice.

Yesterday we got going a bit earlier, around 8am. We crossed Ahukawakawa swamp and scooted up to Pouakai Hut in good time, taking only about 1.5 hours instead of the signposted 2.5 hours. The weather in the Pouakai tops was similar to the previous day — foggy, drizzly and a bit windy — so unfortunately we again didn't see much. Trampers at Pouakai Hut told us the hut had been incredibly full on Saturday night, more than 30 people in the 16 bunk hut, with people sleeping on floors, under bunks, etc. We pushed on to complete the circuit, walking over Henry Peak then down into the bush and out at the North Egmont road, with only a brief stop at Kaiauai Shelter for lunch, exiting around 2:30pm. The weather was overall significantly wetter than last year and the bush track yesterday afternoon was quite wet and muddy.

Overall the trip was a good workout and a lot of fun. We played Bang a fair bit, ate some good food and everyone said they enjoyed it despite the weather and the mud. I'm more determined than ever to go again when we can finally get some good weather and the great views that are possible!

Sunday, 11 October 2020

The Parable Of The Two Bus Drivers

Two bus drivers were driving along a dangerous mountain road. The first bus driver drove very fast in a hurry to get to town. Many times their bus scraped the guardrails and nearly plunged off the road, but the driver was lucky and each time managed to get the bus under control.

The second bus driver also wanted to get to town quickly, but was more afraid of plunging off the road, so drove a little slower. Their bus scraped the guardrails only a few times.

The second bus reached the town a few minutes after the first bus. Many passengers on the second bus complained: "if only you had driven faster, we could have arrived in town at the same time as the first bus. Next time we will choose the first bus driver."

Monday, 21 September 2020

New Zealand's Long Term COVID19 Strategy

People in traditional and social media regularly describe New Zealand's "elimination" strategy as "unrealistic" or "unsustainable", for example Greg Foran today. It's usually not clear what they mean or what specifically they think the New Zealand government should do differently, so it's hard to support their views, but I do think Jacinda Ardern (or Judith Collins if she leads the government after election) needs to articulate one or two of the most likely paths she expects New Zealand to take through 2021. Understandably they don't want to promise something they might not be able to deliver, but I would like to see them fill the vacuum of uncertainty with at least a tentative proposal.

Skimming media and various expert opinions, it seems very probable that at least one safe and reasonably effective vaccine will be available to us some time in 2021 — probable enough that we should bet on it. It seems reasonable to expect to vaccinate everyone in New Zealand who's willing — not an especially onerous step up from our annual flu vaccination campaign — by late 2021. After that we could return to pre-COVID border controls — even if we still expected COVID19 to arrive, some people to catch it, and some to die, the cost of reopening the border then would be much lower then than it is today. If we make that our tentative plan, I think there's a pretty strong argument for keeping NZ COVID19-eliminated as much of the time as possible until then.

On the other hand if that plan is unreasonable or unrealistic, I would like to hear expert views on what a better plan looks like! I mean an actual actionable plan, not something hopelessly vague like "we need to learn to live with the virus".

Sunday, 30 August 2020

Surprising Words In Luke 1:16-17

This is another interesting, perhaps slightly weird, verse that is often read and glossed over. Luke writes of a prophecy delivered to Zechariah about his yet-to-be born son, John the Baptist:

"And he will turn many of the children of Israel to the Lord their God, and he will go before him in the spirit and power of Elijah, to turn the hearts of the fathers to the children, and the disobedient to the wisdom of the just, to make ready for the Lord a people prepared."
Turn people back to God, return the disobedient to wisdom, prepare people to receive the Messiah ... these are unsurprising prophetic priorities. But turn the hearts of fathers to their children? Why was that a priority for God in that era?

Understanding this depends on knowing that it's actually a quote from Malachi 4:6:

"He [returning Elijah] will turn the hearts of the parents to their children, and the hearts of the children to their parents; or else I will come and strike the land with total destruction."
Quoting that here makes sense because John the Baptist is later identified as acting in the spirit of Elijah, fulfilling this prophecy. Thus John will not just turn the hearts of fathers to their children, but also those of children to their fathers, i.e. strengthen familial love in general. Still, it's very interesting that while Biblical teaching often emphasizes the duty of children to honor their parents (e.g. the fifth Commandment), Luke has instead chosen to emphasize the duty of parents to love their children.

I think it's also interesting that Luke highlights familial love here when the rest of the Gospels often seem to give it short shrift. Jesus sometimes leaves his family waiting, he tells his followers they are his mother and brothers, and his disciples leave their families to follow him. Luke's choice here helps keep those events in perspective.

Sunday, 23 August 2020

What's So Amazing About Mark 10:32

I've probably read this sentence twenty times without really thinking about it:

They were on their way up to Jerusalem, with Jesus leading the way, and the disciples were astonished, while those who followed were afraid.

Sandwiched between Jesus explaining how hard it is for the rich to enter the kingdom of God and predicting his own death, this sentence is easy to gloss over as a mere connective, but it raises interesting questions. Why mention that Jesus was leading the way — didn't he always lead the way? Why were the disciples astonished? Why were those who followed afraid?

Fortunately in this case the context suggests satisfactory answers. Jesus is nearing Jerusalem, the seat of civil and religious powers who are hostile to him and his message. Everyone can sense an impending confrontation that will either end in his triumph, or if history is any guide, much more likely his death. (Spoiler! It will be both.) No wonder his followers are afraid. No wonder the disciples are amazed that Jesus is deliberately pressing on towards this confrontation — and perhaps that amazement gives them confidence to be less fearful than the other followers. We can't know exactly what Jesus was thinking at this moment, but it's easy to imagine him being afraid but driven onward by his Messianic mission.

With that in mind, this simple sentence paints an extraordinary psychological scene — a worthy subject for a painting or a short drama. Too bad those aren't my skills.

I expect to keep finding new treasures no matter how many times I read the Bible.

Saturday, 8 August 2020

Scaling Debuginfo For Zero-Cost Abstractions

In yesterday's post I raised the issue that Rust and C++ promise zero-cost abstractions, but with the standard build configurations people use today, you have to choose between "zero-cost abstractions" and "fast and debuggable builds". Reflecting on this a bit more, I realized that the DWARF debuginfo format used with ELF binaries inherently conflicts with the goal of having both zero-cost abstractions and fast debuggable builds, because of the way it handles inline functions (and possibly for other reasons).

Suppose we have a non-inline function F calling non-inline function G, at K sites, each time via a stack of N inline trivial wrapper functions I1 ... IN. Zero-cost abstraction means this compiles to the same code as F calling G directly K times. However, correct DWARF debuginfo for F must contain KxN explicit DW_TAG_inlined_subroutines, one for each instance of an inlined function :-(. Tracking and then emitting all this debuginfo will necessarily slow down the build, and make it slower as you add more layers of abstraction. Maybe this isn't (yet) a problem in practice compared to the cost of optimizations, but it is a fundamental limitation. Fixing it would probably require changing the debuginfo format so that, at least for the "minimal optimizations for fast debuggable builds" mode, the debuginfo for an inline function can be emitted just once and then parameterized for each call site where it is inlined. I don't have a clear idea of how that would work! This probably means the format and emission of debuginfo needs to be carefully codesigned with the optimizer to achieve fast debuggable builds with zero-cost abstractions.

Friday, 7 August 2020

What Is The Minimal Set Of Optimizations Needed For Zero-Cost Abstraction?

A compelling feature of Rust and C++ is "zero-cost abstractions". You can write "high level" code, e.g. using iterators, that compiles down to the same machine code as the low-level code that you'd write by hand. You can add layers of abstraction, e.g. wrapping a primitive value in a struct and providing a specialized API for it, without adding run-time overhead. However, "zero-cost" only applies if you enable an adequate set of compiler optimizations. Unfortunately, enabling those optimizations slows down compilation and, with current compilers, trashes a lot of debug information, making it very difficult to debug with those binaries. Since the Rust standard library (and increasingly the C++ standard library) makes heavy use of "zero-cost" abstractions, using non-optimized builds for the sake of better debugging and build times creates binaries that are many times slower than release builds, often untenably slow. So the question is: how can we get fast builds and quality debuginfo while keeping zero-cost abstractions?

An obvious approach is limit the set of enabled compiler optimizations to the minimum necessary to achieve zero-cost abstraction, and hope that that produces acceptable build speed and debuginfo quality. But what is that set? If it's "all optimizations", then this is no help. I guess it depends on the exact set of "zero-cost abstraction" patterns we want to support. Here are some optimizations that I think need to be on the list:

  • Inlining: Most or all abstractions depend on inlining functions to achieve zero-cost, because they encapsulate code into functions that you'd otherwise write yourself. So, we must have aggressive inlining.
  • Copy propagation: After inlining functions, there will be chains of assignments through parameters and results that will need to be shortened using copy propagation.
  • Limited dead store elimination and temporary elimination: After copy propagation, a lot of temporary values aren't needed. We shouldn't store their values or allocate space for them.
  • Scalar replacement: Many abstractions package up variables into structs and encapsulate operations on the struct into functions. It seems likely that the optimizer needs to be able to undo that, taking structs apart and treating primitive components like any other scalar variable. This would typically be needed to make the above optimizations work on those components.

Here are some optimizations that I think don't need to be on the list:

  • Register allocation: Unoptimized debug builds typically do almost no register allocation: local variables live in stack slots. Thus, if unnecessary temporaries are eliminated (see above) storing the surviving variables on the stack is not penalizing abstraction.
  • Vectorization: Unoptimized debug builds typically don't do any vectorization, so again we can just keep on not doing it without penalizing abstraction.
  • Tail calls: I can't think of a Rust or C++ abstraction that relies on TCO to be zero-cost.

Here's an optimization I'm not sure about:

  • Common subexpression elimination: Are there common abstractions that we think should be zero-cost in Rust and C++ that require CSE? I can't think of any off the top of my head.

It would be very interesting to dive into this further and actually configure the Rust compiler with a promising set of optimizations and evaluate the results.

A somewhat orthogonal approach to the whole problem is to simply improve the debuginfo quality and build speed of optimized builds. The former is mostly a lot of engineering and architectural work to preserve debuginfo through various optimization passes. There are issues where some optimizations (e.g. register allocation) cause variable values to be unavailable at program points where they're technically still in scope — but good record-and-replay debuggers like rr or better still omniscient debuggers like Pernosco can get around that. Build speed is really the harder problem: doing lots of optimizations, especially with all the bookkeeping to preserve debuginfo, is inherently expensive.

I wonder what a compiler backend would look like if you designed from scratch with the goal of good debuginfo and the fastest possible builds while enabling zero-cost abstractions.

Monday, 8 June 2020

Cape Brett 2020

This weekend I walked to Cape Brett Hut for the third time, with some friends and family. It's a tough hike but once again we had great weather, especially for a winter tramp.

Getting to and from the start at Rawhiti was quite exciting, but once we were on the track the walk was straightforward — but arduous. There is no especially difficult terrain, just endless up-and-down steep hills. The bush is beautiful; there are consistently spectacular views of the Bay of Islands to the north and the Tutukaka coast to the south, with steep cliffs, rocks, and sheltered bays; and the ocean looks appropriately vast. We took our time and got to the hut after about eight hours, at 4pm. One of our group went down to the rocks to fish, and caught a snapper which we had for dinner along with sausages and bread. There were a dozen seals on the rocks across the channel that were fun to watch.

The DoC website has a permanent warning that the hut water is salty. Last year it seemed fine, but this time it was certainly not fit for human consumption, so we had to conserve water carefully.

The next day we watched the sun rise over the ocean and had a good bacon-and-egg breakfast. The return walk was again both arduous and glorious. We ran a little late and finished right as the sun was setting over the Bay of Islands, which was very beautiful. Everyone was pretty tired and some were quite sore, but over time I think the good memories will last!

My Google Maps Disaster

Last weekend I did the Cape Brett Track with some friends and family (more about that in a separate post). On Saturday morning I left Whangarei at 6am, planning to take Russell Road to the trailhead at Rawhiti around 7:30am, so we could start the walk just after dawn.

The first thing that went wrong is that we missed the turnoff to Russell Road from State Highway One. Google Maps (and my road atlas) shows the turnoff at Whakapara, but no road signs in that area mention Whakapara, so it's easy to miss in the dark.

We carried on, looking for a place to turn around. The next opportunity was at the intersection with Puhipuhi Road. Google Maps shows Puhipuhi Road connecting with Russell Road via Teatree Flat Road and Peach Orchard Road, so I thought we might as well just take it. (Narrator: big mistake!)

Upon reaching Teatree Flat Road, we discovered the street sign was marked "no exit". Worrying, but Google Maps clearly shows it connecting to Peach Orchard Road, so we might as well have a look, right? (Narrator: no.)

At the end of Teatree Flat Road, a sign says Council Maintenance Ends, but a road continues where Peach Orchard Road should be and it doesn't look too bad. Backtracking now might have forced our friends to wait for quite a while ... so we gave it a go. (Narrator shakes head.)

Peach Orchard Road deteriorated rapidly. Before we knew it, our people-mover was bouncing down a track that seemed to be mostly rocks and tree roots. It was clear that I had made a dreadful mistake and should not have been on that road in that car. I was genuinely scared and praying fervently. There was nowhere to turn around and even if there was, I'm not sure our car could have gotten back up the hill, so I gritted my teeth and kept going, just hoping the road wouldn't get worse and our car wouldn't fall apart. Thank God, the road started to improve again, connected to a proper gravel road, and then we were on Russell Road. We caught up to our friends and were back on schedule.

But there is a sequel! After we finished the walk on Sunday night, we started driving out and immediately discovered we had a flat tire! I assume this was a slow leak from damage inflicted by Peach Orchard Road :-(. Still, I'm grateful it didn't stop us from getting to Rawhiti. We changed the tire and were able to get home after several hours of cautious driving.

All in all it was a cheap lesson. Don't trust Google Maps for connectivity or state of rural roads! At least check to see if there is Street View for the roads you want to use. In fact the drivers of the Google Maps car were wise enough to skip the scary part of Peach Orchard Road :-).

Monday, 11 May 2020

Why Forking HTML Into A Static Language Doesn't Make Sense

Quite often someone proposes something like this:

HTML should never have grown into the mutated application runtime it is today. The presentational concerns for documents are different from application rendering. The javascript stack should have been something entirely separate.

I strongly feel we should create a lightweight HTML fork that is again document-centric and doesn't allow for all of this javascript nonsense. Something that doesn't allow for stupid custom UI or behavioural tracking. Just text, images, videos, and links. The painting algorithm would be dead simple, documents would load lightning fast, and we'd be confident there would be no ad malware.

Rather than reply directly, I decided to write this blog post so I can refer to it again next time this comes up.

There are incorrect factual assumptions here. "The presentational concerns for documents are different from application rendering" is incorrect; there's a lot of overlap, and there is a continuum between static documents and interactive applications, with use-cases all the way along that continuum. "Something that doesn't allow for stupid custom UI or behavioural tracking. Just text, images, videos, and links" is contradictory; embedded images, videos and links allow for lots of tracking and custom UI.

A bigger issue is that defining a static fork of HTML is easy, but persuading Web developers to use it is where the real problem lies, and that is immensely difficult and no-one has any good ideas for how to do it. Of course, if you do find a way to persuade Web developers to avoid features you don't like, there isn't much value in defining a specific "static subset" of HTML.

By the way, I just partially lied. Defining a static fork of HTML is easy, but I'm sure everyone who thinks the modern Web has gone off the rails has a slightly different list of features that should be allowed, so getting your entire constituency to agree on a specific list of allowed features is going to be very difficult. (E.g., are forms in that subset? contenteditable? Video? Animated GIFs? CSS :hover? CSS media queries? Different people will give different answers to these questions.)

Some people believe that defining a static subset of HTML would benefit Web developers and users because we could build a much more efficient browser (or browser mode) for that subset. As a former Mozilla distinguished engineer, I'm skeptical. Even something as fundamental as window size changes requires the ability to do incremental relayout. Any kind of editing or forms requires support for DOM changes, as does rendering of incrementally loaded documents. You could take a few shortcuts but you won't easily beat the performance of existing browsers on the same content, especially keeping in mind the incredibly impressive optimizations those browsers have already implemented.

The only possible benefit I can see of a defined static HTML subset is that for documents opting into that subset, browsers might be able to put all those documents into a single sandboxed child process instead of splitting them into a different sandboxed process per site, on the assumption that these documents are a negligible security risk to each other. I don't think this is at the forefront of the minds of advocates for a static HTML subset.

Friday, 8 May 2020

Omniscient JS Debugging In Pernosco

Until recently Pernosco was limited to debugging statically-compiled languages with DWARF debuginfo. Many of our potential customers would like to debug other languages as well, so we're doing that, starting with support for debugging Javascript running in V8. Here's a demo:

You also can open the Pernosco session in the demo yourself and follow along!

This is an rr recording of Chrome, processed by Pernosco. We haven't modified Chrome/V8 source code. Instead Pernosco observes the operation of V8 — using our omnisicient infrastructure — and infers what the JS code is doing. We get an omniscient debugging experience for Javascript — the first tool in this space, as far as I know. Not only that, you get C++ and Javascript debugging tightly integrated together, which is invaluable in some scenarios.

Currently we only understand the V8 interpreter, so we disable the JIT (--js-flags=--jitless). Handling the V8 baseline compiler would be more work, but probably straightforward. What we already have should be enough to debug many Node applications, Electron apps, and even Web apps.

Pernosco is quite a different sort of experience to the "browser devtools" debugging most Web developers are used to. We can certainly implement more devtools features, but for some applications a tool like WebReplay will be more appropriate.

As usual we're only scratching the surface of what can be done here. We've built infrastructure to make it easier to support more languages, and we're eager to add whatever our customers might need. If you're interested in using Pernosco for your debugging needs, get in touch with your requirements and we may be able to set up you with a free demo.

Friday, 17 April 2020

Have Some Humility, Mike Hosking

Today Mike Hosking is saying that our lockdown was an overreaction.

On March 23 he was saying "If we want to beat Covid-19, shut New Zealand down". "Michael Baker is right, we need to be shutting doors." No concern expressed about overreacting.

March 24, after the lockdown was announced, he wrote: "Thank God we got there in the end". "My only real fear at this point is us." "Being stuck will be fun for some for a while, then it'll be a pain. That's a very small price to pay for, at last, the urgency of this starting to drive decision making."

I do not see anything from Hosking, from then until recently, saying our lockdown was an overreaction.

So it's easy now to look at results in Australia and say maybe what they did was enough. But who could be confident saying that three weeks ago? Not Mike Hosking apparently, and I think not anyone.

Have some humility, Mike Hosking.

Friday, 27 March 2020

What If C++ Abandoned Backward Compatibility?

Some C++ luminaries have submitted an intriguing paper to the C++ standards committee. The paper presents an ambitious vision to evolve C++ in the direction of safety and simplicity. To achieve this, the authors believe it is worthwhile to give up backwards source and binary compatibility, and focus on reducing the cost of migration (e.g. by investing in tool support), while accepting that the cost of each migration will be nonzero. They're also willing to give up the standard linking model and require whole-toolchain upgrades for each new version of C++.

I think this paper reveals a split in the C++ community. I think the proposal makes very good sense for organizations like Google with large legacy C++ codebases that they intend to continue investing in heavily for a long period of time. (I would include Mozilla in that set of organizations.) The long-term gains from improving C++ incompatibly will eventually outweigh the ongoing migration costs, especially because they're already adept at large-scale systematic changes to their code (e.g. thanks to gargantuan monorepo, massive-scale static and dynamic checking, and risk-mitigating deployment systems). Lots of existing C++ software really needs those safety improvements.

I think it also makes sense for C++ developers whose projects are relatively short-lived, e.g. some games. They don't need to worry about migration costs and will reap the benefits of C++ improvement.

For mature, long-lived projects that are poorly resourced, such as rr, it doesn't feel like a good fit. I don't foresee adding a lot of new code to rr, so we won't reap much benefit from improvements in C++. On the other hand it would hurt to pay an ongoing migration tax. (Of course rr already needs ongoing maintenance due to kernel changes etc, but every extra bit hurts.)

I wonder what compatibility properties toolchains would have if this proposal carries the day. I suspect the intent is the latest version of a compiler implements only the latest version of C++, but it's not clear. An aggressive policy like that would increase the pain for projects like rr (and, I guess, every other C++ project packaged by Linux distros) because we'd be dragged along more relentlessly.

It'll be interesting to see how this goes. I would not be completely surprised if it ends with a fork in the language.

Wednesday, 11 March 2020

Debugging Gdb Using rr: Ptrace Emulation

Someone tried using rr to debug gdb and reported an rr issue because it didn't work. With some effort I was able to fix a couple of bugs and get it working for simple cases. Using improved debuggers to improve debuggers feels good!

The main problem when running gdb under rr is the need to emulate ptrace. We had the same problem when we wanted to debug rr replay under rr. In Linux a process can only have a single ptracer. rr needs to ptrace all the processes it's recording — in this case gdb and the process(es) it's debugging. Gdb needs to ptrace the process(es) it's debugging, but they can't be ptraced by both gdb and rr. rr circumvents the problem by emulating ptrace: gdb doesn't really ptrace its debuggees, as far as the kernel is concerned, but instead rr emulates gdb's ptrace calls. (I think in principle any ptrace user, e.g. gdb or strace, could support nested ptracing in this way, although it's a lot of work so I'm not surprised they don't.)

Most of the ptrace machinery that gdb needs already worked in rr, and we have quite a few ptrace tests to prove it. All I had to do to get gdb working for simple cases was to fix a couple of corner-case bugs. rr has to synthesize SIGCHLD signals sent to the emulated ptracer; these signals weren't interacting properly with sigsuspend. For some reason gdb spawns a ptraced process, then kills it with SIGKILL and waits for it to exit; that wait has to be emulated by rr because in Linux regular "wait" syscalls can only wait for a non-child process if the waiter is ptracing the target process, and under rr gdb is not really the ptracer, so the native wait doesn't work. We already had logic for that, but it wasn't working for process exits triggered by signals, so I had to rework that, which was actually pretty hard (see the rr issue for horrible details).

After I got gdb working I discovered it loads symbols very slowly under rr. Every time gdb demangles a symbol it installs (and later removes) a SIGSEGV handler to catch crashes in the demangler. This is very sad and does not inspire trust in the quality of the demangling code, especially if some of those crashes involve potentially corrupting memory writes. It is slow under rr because rr's default syscall handling path makes cheap syscalls like rt_sigaction a lot more expensive. We have the "syscall buffering" fast path for the most frequent syscalls, but supporting rt_sigaction along that path would be rather complicated, and I don't think it's worth doing at the moment, given you can work around the problem using maint set catch-demangler-crashes off. I suspect that (with KPTI especially) doing 2-3 syscalls per symbol demangle (sigprocmask is also called) hurts gdb performance even without rr, so ideally someone would fix that. Either fix the demangling code (possibly writing it in a safe language like Rust), or batch symbol demangling to avoid installing and removing a signal handler thousands of times, or move it to a child process and talk to it asynchronously over IPC — safer too!

Sunday, 19 January 2020

Static Customization Of Function Signatures In Rust

Sometimes I have a big function that does a lot, and in new code I need to do almost the same thing, but slightly differently. Often the best approach is to call the same function but add parameters (often described as "options") to select specific variations. This can get ugly when the function has optional outputs — results that are only produced when certain options were passed in — because typically there is the possibility of an error when code looks for an output (e.g. unwraps a Rust Option) at a site that did not request it. It would be great if the compiler could check that you only use an output when you passed in the option to enable it. Fortunately, some simple Rust coding patterns let us achieve this.

Here's an example simplified from some Pernosco code. We have a function that pretty-prints a data structure at a specific moment in time, which takes an optional parameter specifying another time to render the change in values between the two times. If that optional parameter is specified, the function returns an additional result — whether or not there was any difference in the two values. The naive signature would look something like this:

fn print(&self, moment: Moment, other: Option<Moment>)
-> (String, Option<Difference>) {
  ...
  let difference = other.map(|moment| ...);
  ...
  (..., difference)
}
...
let (string1, difference1) = value.print(moment, None);
let (string2, difference2) = value.print(moment, Some(other_moment));
println!("{:?}", difference1.unwrap()); // PANIC
println!("{:?}", difference2.unwrap());
It is possible to misuse this function by passing None in other and then expecting to find a meaningful value in the second part of the result pair. We'd probably catch it in tests, but it would be good to catch it at compile time. There is also a small efficiency issue: passing None for other is a bit less efficient than calling a customized version of print that has been optimized to remove other and the Difference result.

Here's a way to avoid those problems:

trait PrintOptionalMoment {
  type PrintOptionalDifference;
  fn map<F>(&self, closure: F) -> Self::PrintOptionalDifference
     where F: FnOnce(Moment) -> Difference;
}
impl PrintOptionalMoment for () {
  type PrintOptionalDifference = ();
  fn map<F>(&self, closure: F) -> Self::PrintOptionalDifference
     where F: FnOnce(Moment) -> Difference {}
}
impl PrintOptionalMoment for Moment {
  type PrintOptionalDifference = Difference;
  fn map<F>(&self, closure: F) -> Self::PrintOptionalDifference
     where F: FnOnce(Moment) -> Difference { closure(*self) }
}
fn print<Opt: PrintOptionalMoment>(&self, moment: Moment, other: Opt)
-> (String, Opt::PrintOptionalDifference) {
  ...
  let difference = other.map(|moment| ...);
  ...
  (..., difference)
}
let (string1, ()) = value.print(moment, ());
let (string2, difference2) = value.print(moment, other_moment);
println!("{:?}", difference2);

Rust playground link.

This cleans up the call sites nicely. When you don't pass other, the "difference" result has type (), so you can't misuse it or cause a panic trying to unwrap it. When you do pass other, the "difference" result is not an Option, so you don't need to unwrap it. The implementation of print is basically unchanged, but now Rust will generate two versions of the function, and the version that doesn't take other should be optimized about as well as a handwritten function that removed other. (Unlike in C++, in Rust a () value does not take any space in a struct or tuple.)

If you're not familiar with Rust, the intuition here is that we define a trait PrintOptionalMoment that we use to mean "a type that is either nothing, or a Moment", and we declare that the "void" type () and the type Moment both satisfy PrintOptionalMoment. Then we make print generic over type Opt, which can be either of those. The PrintOptionalMoment trait defines an associated type PrintOptionalDifference which is the result type associated with each Opt that satisfies PrintOptionalMoment.

This approach easily extends to cover more complicated relationships between options and input and output types. In some situations it might be more trouble than it's worth, or the generated code duplication is undesirable, but I think it's a good tool to have.

Friday, 3 January 2020

Updating Pernosco To Rust Futures 0.3

The Pernosco debugger engine is written in Rust and makes extensive use of async code. We had been using futures-preview 0.2; sooner or later we had to update to "new futures" 0.3, and I thought the sooner we did it the easier it would be, so we just did it. The changes were quite extensive:

103 files changed, 3610 insertions(+), 3865 deletions(-)
That took about five days of actual work. The changes were not just mechanical; here are a few thoughts about the process.

The biggest change is that Future and Stream now have a single Output/Item instead of Item and Error, so if you need errors, you have to use an explicit Result. Fixing that was tedious, but I think it's a clear improvement. It encouraged me to reconsider whether we really needed to return errors at all in places where the error type was (), and determine that many of those results can in fact be infallible. Also we had places where the error type was Never but we still had to write unwrap() or similar, which are now cleaned up.

I mostly resisted rewriting code to use async/await, but futures 0.3 omits loop_fn, for the excellent reason that code using it is horrible, so I rewrote our uses of loop_fn with async/await. That code is far easier to read (and write) now. Now that the overall conversion has landed we can incrementally adopt async/await as needed.

It took me a little while to get my head around Pin. Pin adds unfortunate complexity, especially how it bifurcates the world of futures into "pinned" and "unpinned" futures, but I accept that there is no obviously better approach for Rust. The pin-project crate was really useful for porting our Future/Stream combinators without writing unsafe code.

A surprisingly annoying pain point: you can't assign an explicit return type to an async block, and it's difficult to work around. (I often wanted this to set an explicit error type when using ? inside the block.) Async closures would make it easy to work around, but they aren't stabilized yet. Typically I had to work around it by adding explicit type parameters to functions inside the block (e.g. Err).

Outside the main debugger engine we still have quite a lot of code that uses external crates with tokio and futures 0.1. We don't need to update that right now so we'll keep putting it off and hopefully the ecosystem will have evolved by the time we get to it. When the time comes we should be able to do that update more incrementally using the 0.1 ⟷ 0.3 compatibility features.

The really good news is that once I got everything to build, we had only two regressions in our (fairly good) test suite. One was because I had dropped a ! while editing some code. The other was because I had converted a warning into a fatal error to simplify some code, and it turns out that due to an existing bug we were already hitting that situation. Given this was a 4K line patch of nontrivial complexity, I think that's a remarkable testament to the power of Rust.