Sunday, 31 January 2010

Three Lessons From A Pernicious Bug

We were having a lot of problems on Tinderbox with xpcshell tests randomly failing on Windows by returning exit code 1. Curiously, it seemed that almost any test could fail this way, with low probability, and the output log for each test indicated that all the actual subtest assertions passed, the process just mysteriously returned an exit code of 1 for no apparent reason.

Naturally I charged into battle with my sword +5 against orange (i.e., VMWare's record-and-replay debugging). The tests failed randomly about one run in a thousand, so capturing a failure wasn't hard. Then I verified that the problematic xpcshell run was passing 0 as the exit code for the final NtTerminateProcess system call, but the outer python script was indeed getting 1 from GetExitCodeProcess. Very mysterious!

I was at a loss for a while until some questions from Benjamin Smedberg led me to observe that in the failing run, the Google Breakpad helper thread was still alive at process termination, but in passing runs, that thread had exited before process termination.

Benjamin then looked into the Breakpad shutdown code and noticed it using TerminateThread to shut down its helper thread. The MSDN documentation for TerminateThread warns that that function is an extremely dangerous API, because it abruptly terminates the thread, leaving any shared resources it was manipulating in a possibly inconsistent state. Nevertheless, the Breakpad shutdown code seems to use it in a safe way ... mostly. We observed that that code was passing exit code 1 to TerminateThread. Changing that exit code to 44 made the exit code returned in our randomly failing tests change to 44. Clearly, somehow the exit code set for the Breakpad helper thread by TerminateThread was becoming the exit code for the process!

We think that, although it's not mentioned in MSDN, TerminateThread is actually asynchronous. The thread is not terminated immediately. If you're (un)lucky, the main thread can trigger process exit (passing exit code 0) before the helper thread's termination is complete. Then, somewhere in the netherworld of NT kernel process finalization, after the main thread has been terminated, termination of the helper thread finally completes and as the last thread to terminate, its exit code is recorded as the process exit code!

Moral of the story: never use TerminateThread. If you must use TerminateThread, you should probably try to wait for termination to finish (e.g. by calling WaitForSingleObject) before exiting the process, if you care about your process exit code. Another moral is that record-and-replay debugging rocks (but we already knew that). Yet another moral is that debugging closed-source operating systems sucks (but we already knew that too).



8 comments:

  1. I really enjoy these "tales of how I found hard-to-find bugs"!

    ReplyDelete
  2. "You've read the stories; you've read the bugmail. The agony, and the ecstasy, of debugging code written in unsafe languages, running across multiple racing processors. Now: see them in real life at a desk next to yours!" (In a couple weeks whenever the desktop machine arrives so I can set it up and attack some random-orange bugs with it, that is.)
    (This comment is directed entirely toward Luke. :-D )

    ReplyDelete
  3. On Vista, if someones tries to enter a critical section owned by a thread that was terminated with TerminateThread, TerminateProcess is called immediately.
    Maybe the exit code for this call to TerminateProcess is the one of the thread who orphaned the critical section?
    I mostly talking out of my butt, I haven't worked through the code you pointed to, I just saw that the target thread made used of critical sections.

    ReplyDelete
  4. "Yet another moral is that debugging closed-source operating systems sucks (but we already knew that too)." Yes, it does. But, at least in this case Microsoft's documentation was available and accurate. If the person who wrote the Breakpad shutdown code had also looked there, maybe this problem would not have occurred. In this case, I wonder if things would have turned out that much better even if the OS source code were available.

    ReplyDelete
  5. Robert O'Callahan31 January 2010 22:01

    alex: That's not the problem. Quite early in the analysis I determined that NtTerminateProcess was only called when the main thread exits after 'main' has returned.
    Jon: Microsoft's documentation was incomplete in two important ways. It doesn't mention that TerminateThread is asynchronous. And I haven't found anything that indicates the exit code of another thread may be used as the process exit code even after the main thread has passed its own exit code to ExitProcess. (For example, http://msdn.microsoft.com/en-us/library/ms686722%28VS.85%29.aspx doesn't mention it.)
    Microsoft's documentation is actually better than the documentation for most free software projects. However, since software has bugs, and no documentation is complete, there's no substitute for having the source code available so you can actually see what is going on.

    ReplyDelete
  6. Now the obvious question for a Mozillian: Where's the bug report in that where I can CC myself to see when this gets patched and landed in some way?
    And thanks for fixing one more random orange - everyone watching tinderbox or tbpl should love you for that!

    ReplyDelete
  7. https://bugzilla.mozilla.org/show_bug.cgi?id=535585

    ReplyDelete
  8. I too enjoy reading these. So which fix did you go for? Bug number?

    ReplyDelete

Note: only a member of this blog may post a comment.