Wednesday, 11 July 2018

Why Isn't Debugging Treated As A First-Class Activity?

Mark Côté has published a "vision for engineering workflow at Mozilla": part 2, part 3. It sounds really good. These are its points:

  • Checking out the full mozilla-central source is fast
  • Source code and history is easily navigable
  • Installing a development environment is fast and easy
  • Building is fast
  • Reviews are straightforward and streamlined
  • Code is landed automatically
  • Bug handling is easy, fast, and friendly
  • Metrics are comprehensive, discoverable, and understandable
  • Information on “code flow” is clear and discoverable

Consider also Gitlab's advertised features:

  • Regardless of your process, GitLab provides powerful planning tools to keep everyone synchronized.
  • Create, view, and manage code and project data through powerful branching tools.
  • Keep strict quality standards for production code with automatic testing and reporting.
  • Deploy quickly at massive scale with integrated Docker Container Registry.
  • GitLab's integrated CI/CD allows you to ship code quickly, be it on one - or one thousand servers.
  • Configure your applications and infrastructure.
  • Automatically monitor metrics so you know how any change in code impacts your production environment.
  • Security capabilities, integrated into your development lifecycle.

One thing developers spend a lot of time on is completely absent from both of these lists: debugging! Gitlab doesn't even list anything debugging-related in its missing features. Why isn't debugging treated as worthy of attention? I genuinely don't know — I'd like to hear your theories!

One of my theories is that debugging is ignored because people working on these systems aren't aware of anything they could do to improve it. "If there's no solution, there's no problem." With Pernosco we need to raise awareness that progress is possible and therefore debugging does demand investment. Not only is progress possible, but debugging solutions can deeply integrate into the increasingly cloud-based development workflows described above.

Another of my theories is that many developers have abandoned interactive debuggers because they're a very poor fit for many debugging problems (e.g. multiprocess, time-sensitive and remote workloads — especially cloud and mobile applications). Record-and-replay debugging solves most of those problems, but perhaps people who have stopped using a class of tools altogether stop looking for better tools in the class. Perhaps people equate "debugging" with "using an interactive debugger", so when trapped in "add logging, build, deploy, analyze logs" cycles they look for ways to improve those steps, but not for tools to short-circuit the process. Update This HN comment is a great example of the attitude that if you're not using a debugger, you're not debugging.

Sunday, 24 June 2018

Yosemite: Clouds Rest And Half Dome

On Saturday morning, immediately after the Mozilla All Hands, I went with some friends to Yosemite for an outstanding five-night, five-day hiking-and-camping trip! We hiked from the Cathedral Lakes trailhead all the way down to Yosemite Valley, ascending Clouds Rest and Half Dome along the way. The itinerary:

  • Saturday night: camped at Tuolumne Meadows
  • Sunday: hiked from Cathedral Lakes trailhead past the Cathedral Lakes to Sunrise High Sierra Camp
  • Monday: hiked from Sunrise HSC past the Sunrise Lakes to camp just north of Clouds Rest
  • Tuesday: hiked up and over Clouds Rest and camped just north of the trail leading up to Half Dome
  • Wednesday: left most of our gear in camp, climbed Half Dome, returned to camp, and hiked down to camp in Little Yosemite Valley
  • Thursday: hiked out to Yosemite Valley

Apart from the first day, each day was relatively short in terms of distance, but the first few days were quite strenuous regardless because of the altitude. I've never spent much time above 2500m and I was definitely unusually short of breath. The highest points on the trail were around 3000m, where the air pressure was down to 700 millibars.

The weather was (predictably) good: cold at night the first couple of nights, warmer later, but always warm and sunny during the day.

We saw lots of animals — deer, marmots, chipmunks, woodpeckers, other birds, lizards, and other animals you don't see in New Zealand. Also lots of interesting trees, flowers and other plants.

The mosquitoes at Sunrise HSC were terrible in the morning! My friend said it was the worst he'd ever seen, even having grown up in South Florida.

I've never camped for so many consecutive nights before — in New Zealand we usually stay in huts. I got to use my "squeeze bag" mechanical water filter a lot; it works very well and doesn't have the latency of the chemical purifiers.

Swimming in the Merced River at Little Yosemite Valley after a hot day felt very good!

I thought my fear of heights would kick in climbing the cables to get to the top of Half Dome, but it didn't at all. The real challenge was upper body strength, using my arms to pull myself up the cables — my strength is all in my legs.

Needless to say, Clouds Rest and Half Dome had amazing views and they deserve their iconic status. I'm very thankful to have had the chance to visit them.

My companions on the trip were also great, a mix of old friends and new. Thank you.

Monday, 11 June 2018

Bay Area Visit

I'm on my way to San Francisco for a guest visit to the Mozilla All Hands ... thanks, Mozilla!

After that, I'll be taking a break for a few days and going hiking with some friends. Then I'll spending a week or so visiting more people to talk about rr and Pernosco. Looking forward to all of it!

Sunday, 10 June 2018

Crypto-Christians In Tech

This is not about cryptocurrencies; for that, watch this. Nor is it about cryptography. It's about the hidden Christians working in tech.

I sometimes get notes from Christian tech people thanking me for being open about my Christian commitment, because they feel that few of their colleagues are. That matches my experience, but it's a combination of factors: most tech people aren't Christians, but more are than you think — they're just not talking about it. Both of these are sad, but I expect the former. The latter is more problematic. I would encourage my brothers and sisters in tech to shine brighter. Here are some concerns I've had — or heard — over the years:

What can I do without being a jerk?
When asked what I did during the weekend, I say I worshiped the Creator. Sometimes I just say I went to church.

Sometimes I write blog posts about Christ. People don't have to read them if they don't want to.

I used to put Christian quotes in my email signature, but I got bashed over that and decided it wasn't worth fighting over. Now my email signatures are obscured. Those who seek, find. I should try emoji.

Sometimes I'm probably a jerk. Sorry!

Won't my career suffer?
It may. People I've worked with (but not closely) have told me they look down on me because I'm a Christian. Surely more have thought so, but not said so. But Jesus is super-clear that we need to take this on the chin and respond with love.

I don't want to be associated with THOSE OTHER Christians.
I know, right? This is a tough one because the easy path is to disavow Christians who embarrass us, but I think that is often a mistake. I could write a whole post about this, but Christians need unity and that sometimes means gritting our teeth and acknowledging our relationship with people who are right about Christ and wrong about everything else.

Another side of this is that if your colleagues only know of THOSE OTHER Christians (or perhaps just those who are particularly thick-skinned or combative), they need you to show them an alternative.

Woah, persecution!
No. Claiming I've ever experienced persecution would embarrass me among my brothers and sisters who really have.

People are generally very good about it, especially in person. People who are jerks about it generally turn out to be jerks to everyone. In the long run it will reduce the number of awkward conversations people have around you about how awful those Christians are, not knowing where you stand. But this is not about our comfort anyway.

What if I screw up and give people a bad impression?
Bad news: you will. Good news: if you were perfect, you might give people the false impression that Christianity is about being a good person (or worse, trying to make other people "good"). But of course it isn't: it's about us recognizing our sin, seeking reconciliation with people and God, and obtaining forgiveness through Christ; not just once, but every day. How can we demonstrate that if we never fail?

Saturday, 26 May 2018

rr 5.2.0 Released

I released rr 5.2.0. This is a minor update that fixes some bugs, improves chaos mode a bit, and adds some features to help with trace portability. If rr's working for you, you probably don't need to upgrade.

Friday, 25 May 2018

Intel CPU Bug Affecting rr Watchpoints

I investigated an rr bug report and discovered an annoying Intel CPU bug that affects rr replay using data watchpoints. It doesn't seem to be hit very often in practice, which is good because I don't know any way to work around it. It turns out that the bug is probably covered by an existing Intel erratum for Skylake and Kaby Lake (and probably later generations, but I'm not sure), which I even blogged about previously! However, the erratum does not mention watchpoints and the bug I've found definitely depends on data watchpoints being set.

I was able to write a stand-alone testcase to characterize the bug. The issue seems to be that if a rep stos (and probably rep movs) instruction writes between 1 and 64 bytes (inclusive), and you have a read or write watchpoint in the range [64, 128) bytes from the start of the writes (i.e., not triggered by the instruction), then one spurious retired conditional branch is (usually) counted. The alignment of the writes does not matter, and it's not related to speculative execution.

If you find rr failing during replay with watchpoints set, and the failures go away if you remove the watchpoints, it could well be this bug. Broadwell and earlier don't seem to have the bug.

A possible workaround would be to disable "fast-string optimization" in the kernel at boot time. I don't think there's any way for users to force this to happen in Linux, currently, but someone could write a kernel patch adding a command-line option for that and send it upstream. It would be great if they did!

Fortunately this sort of bug does not affect Pernosco.

Update: Pernosco

In the two years since I left Mozilla I've been working on a new debugger (with Kyle Huey, most of that time). We call it Pernosco — "know thoroughly". Pernosco takes rr recordings of failing runs, analyzes them in the cloud, and provides a Web interface for developers to debug them. It uses components of rr but the debugger implementation is completely different. Pernosco's data-oriented approach has features and performance that rr's approach cannot match — and nor can any other debugger. We've reached the point where external users have used Pernosco to fix real bugs, and their response has been very enthusiastic. It's exciting!

Developers spend a lot of time figuring out bugs. We think Pernosco will create a lot of value for software development organizations. We intend to capture some of that value by offering Pernosco as a paid cloud service. We have many ideas for making debugging easier and more fun, and we need a sustainable business model to make them happen. It's especially important because debugging is an area that has been chronically under-invested in across the industry.