Saturday, 30 December 2006

More About Amber

I have uploaded a document describing Amber in some detail. It focuses on the back end: the instrumentation, indexing, and compression. I also motivate Amber by discussing the debugging problem and related attempts to solve it. There are a few important ideas here:

  • A conceptual framework for classifying debuggers and execution recorders.
  • Treating instruction execution and memory writes as two instances of a common concept --- "memory effects" --- and providing a single implementation and API for recording, indexing, compressing, storing and querying them (a simplified sketch follows after this list).
  • The "bunching" optimization for memory effects.
  • The principle that repetitious program behaviour should produce repetitious input to the compressor in order to maximise the benefits of compression, and how to tweak the formatting of trace data to follow that principle.
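
To make the "memory effects" idea a bit more concrete, here's a highly simplified sketch (in C++) of what a unified effect record could look like. It's purely illustrative --- the names and layout are invented for this post, not the actual trace format, and it ignores the bunching optimization entirely:

    // Illustration only: one record type covering both "this instruction
    // executed" and "these bytes were written". Both are just "something
    // happened to a range of addresses at a particular time", so a single
    // recording/indexing/compression/query path can serve both.
    #include <cstdint>

    enum class EffectKind : uint8_t {
      InstructionExec,   // the address range is code that was executed
      MemoryWrite        // the address range is data that was written
    };

    struct MemoryEffect {
      EffectKind kind;
      uint64_t   timestamp;  // position in the execution, e.g. instruction count
      uint64_t   address;    // start of the affected address range
      uint32_t   length;     // number of bytes covered by this effect
      // For writes, the written bytes would follow; executions need nothing
      // more, since the code bytes themselves are already known.
    };

The point is that once executions and writes share one record type, one implementation handles recording, indexing, compression and querying for both.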

The document also contains screenshots of the prototype XULRunner-based debugger built on Amber, to motivate Amber's design by illustrating what is possible when you go beyond just reverse execution. I assure you that those screenshots are not faked in any way; it really works :-).



Thursday, 28 December 2006

A New Approach To Debugging

One of the painful truths about Linux development is that debugging sucks. Specifically, gdb on Linux sucks. Basic functionality simply does not work reliably. The details of the brokenness vary depending on the version, but the general suckiness seems to hold up across versions. Since it fails on the basics, we need not even discuss the lack of modern features or its appalling usability. This is a big problem for Linux because it is driving developers away from the platform.

Here's a deeper and less widely understood truth: all debuggers suck. Debugging is mainly about tracing bug symptoms back to buggy code, i.e. tracing effects to causes. Effects follow their causes in time, so debugging by nature requires backwards traversal of time. But traditional debuggers don't provide this because it's not obvious how to implement it. Instead they let the programmer monitor a program's forward execution, in the hope that the programmer will be able to somehow remember or reconstruct enough history, usually by running the program several times, to figure out for themselves the backwards cause and effect chain. Such tools aren't particularly suited to the debugging task; they're just a compromise forced by either limited imagination or implementation constraints.

Lots of other people have made this observation and have tried to extend traditional debuggers with some support for looking backwards in time. For example:


  • SoftICE supports a "back trace buffer" that records a limited number of instruction executions in a certain instruction address range. Only the program counter values are recorded.
  • Green Hills' "Time Machine" debugger also has a trace buffer, which apparently can include register and memory deltas so that old memory and register values can be restored by "backwards stepping" through the buffer. Unfortunately it seems this buffer is limited by available memory, and therefore for a large application only a small history window can be retained. Furthermore, Green Hills' implementation requires specialized hardware if you want to trace more than program counter values.
  • TTVM logs the interactions between a guest virtual machine and its host so that later the exact execution of the guest VM can be replayed. Periodic checkpoints are recorded so that the replay cost required to reach any desired point in time is limited.

In Time Machine and TTVM, the UI is a traditional debugger UI extended with "reverse" versions of the standard forward execution commands. The debugger still displays the program state at a certain point in time; there is now some capability to move the current point backwards as well as forwards.

This is good progress --- but those tools still need to overcome implementation limitations to become immediately useful for mainstream desktop application debugging (i.e. useful to me!). More importantly, I want more than just reverse execution. For example, when debugging we frequently want to know exactly when a variable was set to its current (incorrect) value, and see what the program was doing at that time. Reverse-execution debuggers can set a watchpoint on the variable and execute backwards to find that write, but it would be quicker and more powerful if the debugger just knew the history of the variable and could perform an efficient database lookup to find the last time the variable was written before the current time. The debugger should have the entire history of the program available for efficient inspection and querying. With such a debugger, after we've gathered the history we don't need to run the program any more; the user just browses the history in the debugger until the bug is located.
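
To give a flavour of what "efficient inspection and querying" means in practice, here's a rough sketch (in C++) of the kind of interface such a debugger wants to program against. The names and signatures are invented purely for illustration:

    // Illustration only: queries a history-based debugger wants to make
    // against a recorded execution, expressed as an API sketch.
    #include <cstdint>

    typedef uint64_t Timestamp;   // a point in the recorded execution

    struct WriteEvent {
      Timestamp when;      // when the write happened
      uint64_t  address;   // where it wrote
      uint32_t  length;    // how many bytes
    };

    class ExecutionHistory {
    public:
      // Reconstruct the contents of [address, address+length) as they were
      // at time t.
      void ReadMemoryAt(Timestamp t, uint64_t address, uint32_t length,
                        uint8_t* out);

      // "When was the last write to this location before time t?" --- the
      // indexed lookup that replaces setting a watchpoint and running
      // backwards. Returns false if there was no such write.
      bool LastWriteBefore(uint64_t address, uint32_t length, Timestamp t,
                           WriteEvent* result);

      // "When was this code address executed between t1 and t2?"
      bool FirstExecutionIn(uint64_t codeAddress, Timestamp t1, Timestamp t2,
                            Timestamp* when);
    };

The crucial property is that each of these is a lookup into data already on disk, not a re-execution of the program.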

The only other exposition of this vision I know of is Omniscient Debugging. Bil Lewis has created a Java debugger that takes this approach. Unfortunately it is restricted to Java programs, and apparently rather small and short-running Java programs since it keeps the trace data in memory. So the big question is: can we implement this vision so that it's a practical approach for debugging real applications on commodity hardware?

I have spent a few months of the last 18 investigating this question. I have built a system, which I'm calling Amber, to record the complete execution history of arbitrary Linux processes. The history is recorded using binary instrumentation based on Valgrind. The history is indexed to support efficient queries that debuggers need, and then compressed and written to disk in a format optimized for later query and retrieval. The history supports efficient reconstruction of the contents of any memory location or register at any point in time. It also supports efficient answers to "when was the last write to location X before time T", "when was location P executed between times T1 and T2", and other kinds of queries. I can record the 4.1 billion instructions of a Firefox debug build starting up, displaying a Web page, and exiting; the compressed, indexed trace is about 0.83 bytes per instruction executed, which works out to roughly 3.4GB in total. (With some cleverness, we achieve compression ratios of over 20 to 1.) It takes less than half an hour to record on my laptop. A 300X slowdown may sound like a lot, but spending computer time to save programmer time is almost always a good idea. Furthermore, Amber tracing easily parallelizes across multiple cores, so hardware trends are in my favour.

I've built a trace query engine in C and a prototype debugger front end using XULRunner to demonstrate the power of this approach. It works as expected. I've found that basing the debugger entirely on a "complete recording" approach dramatically simplifies some of the difficult parts of debugger implementation. Novell has filed for a patent on my work (and is planning to file another regarding an extension which I haven't described here yet), but I hope to get Novell to release the source code under an open source license so I can keep working on it. Maybe others will be interested in working on it too.

When I get a chance I'll write more about how Amber works, its full set of features, and what needs to be done to get it to prime-time.



Sunday, 24 December 2006

Pohutukawa

Someone on IRC asked about the pohutukawas flowering. The pohutukawa is a native tree that produces beautiful red flowers in December --- very apt for Christmas. The Internet has thousands of photos of flowering pohutukawas; here's mine.

Update I forgot to mention --- this photo was taken in Shakespear Regional Park, yesterday (i.e. Saturday). We drove up with some friends, had a picnic, and did the Tiri Tiri Track loop in about two hours. It was magnificent.


[Photo: a pohutukawa in flower]


Saturday, 23 December 2006

Parallelism

Here's an interesting interview with Hennessy and Patterson, the famous hardware architecture researchers. Nothing particularly surprising or new, but they remind me of the importance of the parallelism revolution that is happening right now. Namely, processors aren't getting much faster anymore; instead, we're all getting multicore chips. Software that doesn't take advantage of this parallelism will be slower than software that does.

This is a huge huge problem because after fifty years of research we still don't have good ways to write parallel software. Threads and locks are very difficult to use correctly, don't work when you combine software components, and don't even scale well in performance. Language-level transactions are one possible answer but we don't have language and runtime support for them in mainstream systems. (At IBM I worked on the design of X10, a programming language designed for highly concurrent and distributed programming that used transactions as one of its building blocks.)
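
Here's a tiny example of why "don't work when you combine software components" matters so much. Each of these components is correct on its own; the deadlock exists only in the combination, which neither author could have foreseen. (This is the standard textbook illustration, nothing specific to X10 or any particular system.)

    #include <mutex>

    // A self-contained "component": an account protected by its own lock.
    struct Account {
      std::mutex lock;
      long balance = 0;
    };

    // Perfectly reasonable code --- until thread 1 runs transfer(a, b, 10)
    // while thread 2 runs transfer(b, a, 20). Each thread takes its "from"
    // lock and then blocks forever waiting for the other's lock: a deadlock
    // that only appears when the two calls are composed.
    void transfer(Account& from, Account& to, long amount) {
      std::lock_guard<std::mutex> fromLock(from.lock);
      std::lock_guard<std::mutex> toLock(to.lock);
      from.balance -= amount;
      to.balance += amount;
    }

The promise of language-level transactions is that you could mark the whole transfer atomic and leave the scheduling to the runtime, so the composition stays correct without a global lock-ordering discipline.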

Even if we get language-level transactions, we still have several more problems to deal with. Reworking large legacy codebases to use them will be tons of work, and often impractical. Then there's the problem that many tasks are intrinsically very difficult or impossible to parallelize. In particular many higher-level programming models (such as oh say Javascript on the Web) assume single-threadedness.

For Mozilla, we need to be thinking about how we can exploit multiple cores in the browser to transparently accelerate existing Web content. We also need to be thinking about how we can add transaction support to Javascript to enable Web developers to exploit parallelism --- they don't ask for this today, but it will take a few years to do it right and by then, they'll probably be begging for it.

For open source in general, we need to be thinking about how the open source stack is going to evolve to incorporate more parallelism. If we can't get useful transaction support into C or C++ then two things will happen: first, most performance-sensitive projects will soon have to move away from those languages, except for the projects that can afford to invest in hairy threads-and-locks parallelism; then, when chips with dozens of cores become commonplace, even those projects will have to migrate. I'm talking mostly about clients here, not servers; servers are easier to parallelize because one server usually serves many independent clients, so you can parallelize over the clients --- and of course people have been doing that for a long time.

I think there's a huge opportunity here for a client software mass extinction followed by a Cambrian explosion. There's a new environmental pressure; projects that aren't agile enough to adapt quickly will be replaced by more agile projects, or by entirely new projects, perhaps ones rewritten from scratch to take advantage of new languages. Exciting times.



Wednesday, 20 December 2006

Changes Afoot

I am leaving Novell at the end of the year. Starting from the beginning of January I will be working as a contractor for the Mozilla Corporation.

Novell has been very good to me, and I've very much enjoyed the last two years, but the lure of MoCo has proven too great. In particular Mozilla has offered to support additional developers here in Auckland. When I came back here, I had a vague, long-term goal of helping develop New Zealand's software industry by creating the sort of jobs that would have lured me back. I'm very excited to be realizing this goal much sooner than I expected, and I'm grateful to the corporation for making it possible. I'm also greatly looking forward to working in an office with other real live Gecko developers. I will be able to say more about this soon. Also, staff at Auckland University have shown interest in having students work on Firefox-related projects, and that's something else I'm very excited about. It just so happens I have the ideal qualifications to make that work. Perhaps God does have a plan :-).

A secondary issue is that over the last two years I've also felt I wasn't serving Novell's business needs very well. I spent most of my time working on core Gecko, some time working on Novell-specific needs, and some time working on other things. They never indicated they were dissatisfied with me, but I think they might be better off with someone more focused on Novell-specific issues. As it is, I'll continue to do 80% of what I was doing for them and they won't have to pay for any of it :-).

Just for the record, my move has nothing to do with Novell's deal with Microsoft or any of that stuff. My plans were well underway before I heard about it.

I've been busy in the last few weeks tidying up work at Novell and organizing my new position. Meanwhile it's a lovely time of year to be in Auckland; the pohutukawas are flowering and most people are winding down for the end of the year, assisted by an extraordinary number of Christmas "do's" (i.e., parties). Novell's Auckland staff had a wonderful lunch at Mudbrick bar and restaurant on Waiheke Island. I haven't really been able to wind down myself yet, but now that my Mozilla contract is signed, I can relax a bit ... but only for a couple of weeks. Yet it's good stress; the coming year looks very exciting indeed.



Saturday, 16 December 2006

The Ultimate Square Peg

I'm annoyed by stereotypes --- and naturally, particularly so when I'm in the group being stereotyped. So I was pleased to read about a new US reality show "One Punk Under God" about Christian minister Jay Bakker. That's cool.

I also dislike the way that Christians, especially (but not only) in the USA, have become identified with some key causes that all seem to revolve around making other people behave in certain ways. This has unfortunately led to people believing that that's what Christianity is all about --- when actually one of the most important truths is that just doing things is not the way to make things right between us and God.

So, I thought I'd really like this article by Jay Bakker, which leads with:

What the hell happened? Where did we go wrong? How was Christianity co-opted by a political party? Why are Christians supporting laws that force others to live by their standards?

In fact, there's little in there that I would specifically disagree with. But there's a problem with it as a whole: it implies that Jesus was about nothing but unconditional love. That is an error in another direction, and it's also a common one. His parables and lessons do talk a lot about love and forgiveness, but they also talk a lot about hell, judgement, and "I have come not to bring peace, but a sword". To paraphrase a common Christian saying, he does have a message of "come as you are, not as you should be" --- but he doesn't want you to stay that way.

The problem with Jesus is that he doesn't fit into boxes. He doesn't seem to countenance the imposition of a morality state, but he also won't be coerced into being a comfortable anything-goes Saviour. He never conforms to the image we demand of him. That's uncomfortable and disconcerting ... but it's one of the things I love about him.



Wednesday, 13 December 2006

Patent Sea Change

My brother alerted me to some very interesting news that I haven't seen widely reported yet.

Guess who filed the most US patent applications in November? IBM? Microsoft? Guess again. The answer is Samsung --- and by a mile.

It was IBM 347, Microsoft 360, Samsung 523. And this isn't a one-off spike.

I'm not sure what's happening here. 6K patent applications per year means tens of millions of dollars in filing and prosecution costs alone, so it's obviously a huge investment with some kind of strategy behind it. I don't believe Samsung has suddenly built a network of research labs to rival IBM and Microsoft. It seems likely that they've just decided to become very aggressive about patenting absolutely every idea their people can think of. Over the years a lot of people have warned about what could happen if someone tried to exploit the weaknesses of the US patent system to the maximum (and the patent industry has pushed back with "don't worry, that's not happening"). It may well be happening now.

What next? Things could be very interesting. We may see US companies ramping up their patent-farming efforts to match. But I think we will also see good old-fashioned nationalism (perhaps with a racist flavour) coming to bear on the question of patent reform. Will Americans stand for the transfer of public ideas into private monopolies when it's foreigners locking things up? But if the US government starts to backpedal, and possibly even breach its WIPO commitments, then that gives cover to other countries to implement sensibly restrictive patent laws. That would be a good result.



Sunday, 10 December 2006

Congratulations David!

... on landing the reflow branch. This is a great day.

Now we finally pass acid2; we've improved performance and fixed a lot of bugs. Yay!

Also, we can now make major reflow changes again without fear of making David's life miserable with difficult merges to the reflow branch.

Update Just to clarify --- the reflow branch landed after we branched for 1.9 alpha 1 (this was intentional!), so 1.9 alpha 1 does not have the reflow branch and does not pass acid2. You'll have to get a nightly build or wait for 1.9 alpha 2 to come out.



Wednesday, 6 December 2006

XUL Dark Matter

I had lunch with some of my old friends yesterday, as I often do (one of the great things about being back in Auckland!) and one of them told me about his latest consulting job, involving a project using XUL to build a very large intranet application.

This isn't the first time I've heard about such things, but this is definitely closer to home than before. If a random company in Auckland is working on a big XUL intranet app, then how many other such projects might there be? One might guess hundreds or even thousands. This is rather disconcerting for several reasons. These projects have no visibility in Bugzilla --- why aren't they filing bugs? We don't really know anything about these projects: how many there are, what features they use, how we're hurting or helping them. Gecko development is totally ignorant of their needs. But how can we find out since these intranet apps by design never become visible to the public?

Maybe it's a good thing in some ways to not be giving much attention to these apps. It means we can focus more on Firefox which is more in line with Mozilla's mission of driving the Web forward. And we certainly haven't promised much to XUL app developers. Still, it sucks to cause pain for people, and if there are thousands of XUL apps below the waterline and we could support them better in simple ways, I want to.

For one thing, our QA for XUL is basically "do Firefox and Thunderbird still work?" This sounds inadequate now that it seems likely that Firefox and Thunderbird are less than 1% of the XUL apps out there. If we had a lot of unit tests for XUL, that would give us some more confidence that we're not breaking things ... or at least that if we are breaking things, we do so deliberately.

The other thing we should do is think about how to harness the power of this community that probably is quite large and influential ... even if we can barely detect it.



Linux

Mike Connor has blogged about the Linux policy discussion we had during the summit and the conclusions we reached. I don't have much to add. Because Mozilla release schedules and distribution release schedules often aren't aligned, distributions are often adding features and non-security fixes to their branch builds. It will be very helpful to be able to share those changes between distributions via Mozilla CVS.

We are going to have to deal with situations where one set of distributions wants only security fixes on a branch and other distributions want to ship features off that branch. I'm not quite sure how that's going to work yet.



NetworkManager, dbus and Offline Status

I've finally checked in support for offline status detection on Linux. We hook into dbus to listen for signals from NetworkManager saying that it has gone offline or gone online. I originally did this work about a year ago but it's taken ... a while ... to get reviewed. After I wrote the infrastructure, the Windows support actually got written and landed while I was waiting for review on the Linux bits, so Windows has had this on trunk for a while and it even landed in FF2.

Anyway, now if your network link goes down, FF will automatically switch to offline mode and if the link comes back, FF will switch to online mode. If you manually change the "Work Offline" state then FF will stop switching automatically for the duration of your FF session.

Other XULRunner apps can take advantage of this too. By default the IOService will watch for network link status changes and automatically toggle its online/offline state. But if an application wants to do things differently, it can tell the IOService to stop the automatic tracking. Then the application can listen for network link status changes via nsIObserver and modify the IOService's online/offline status as desired.
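
For app developers who want to see roughly what the "do it yourself" path looks like from C++, here's a sketch. The observer pattern is the normal Gecko one; the topic string and the "up"/"down" data values are written from memory, so treat them as assumptions and check the tree before relying on them, and I've deliberately left out the call that turns off the automatic tracking because I don't remember its exact name off-hand.

    #include "nsCOMPtr.h"
    #include "nsIIOService.h"
    #include "nsIObserver.h"
    #include "nsIObserverService.h"
    #include "nsServiceManagerUtils.h"
    #include "nsString.h"

    // An observer that takes over offline handling itself.
    class LinkStatusObserver : public nsIObserver {
    public:
      NS_DECL_ISUPPORTS
      NS_DECL_NSIOBSERVER
    };
    NS_IMPL_ISUPPORTS1(LinkStatusObserver, nsIObserver)

    NS_IMETHODIMP
    LinkStatusObserver::Observe(nsISupports*, const char* aTopic,
                                const PRUnichar* aData)
    {
      // aData reports the new link state; I believe the values are "up" and
      // "down", but that's an assumption. Here the app decides for itself
      // whether to flip the browser-wide offline flag.
      PRBool goOffline =
          !NS_LITERAL_STRING("up").Equals(nsDependentString(aData));
      nsCOMPtr<nsIIOService> io =
          do_GetService("@mozilla.org/network/io-service;1");
      if (io)
        io->SetOffline(goOffline);
      return NS_OK;
    }

    void RegisterLinkStatusObserver()
    {
      nsCOMPtr<nsIObserver> observer = new LinkStatusObserver();
      nsCOMPtr<nsIObserverService> obs =
          do_GetService("@mozilla.org/observer-service;1");
      if (obs)
        obs->AddObserver(observer, "network:link-status-changed",  // topic string is an assumption
                         PR_FALSE);
    }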

We still want Mac support. Also, if you happen to be using a Linux distribution without dbus or without NetworkManager, the build will still work; you just won't get the automation.