Saturday 30 January 2010
Three Lessons From A Pernicious Bug
We were having a lot of problems on Tinderbox with xpcshell tests randomly failing on Windows by returning exit code 1. Curiously, it seemed that almost any test could fail this way, with low probability, and the output log for each test indicated that all the actual subtest assertions passed, the process just mysteriously returned an exit code of 1 for no apparent reason.
Naturally I charged into battle with my sword +5 against orange (i.e., VMWare's record-and-replay debugging). The tests failed randomly about one run in a thousand, so capturing a failure wasn't hard. Then I verified that the problematic xpcshell run was passing 0 as the exit code for the final NtTerminateProcess system call, but the outer python script was indeed getting 1 from GetExitCodeProcess. Very mysterious!
I was at a loss for a while until some questions from Benjamin Smedberg led me to observe that in the failing run, the Google Breakpad helper thread was still alive at process termination, but in passing runs, that thread had exited before process termination.
Benjamin then looked into the Breakpad shutdown code and noticed it using TerminateThread to shut down its helper thread. The MSDN documentation for TerminateThread warns that that function is an extremely dangerous API, because it abruptly terminates the thread, leaving any shared resources it was manipulating in a possibly inconsistent state. Nevertheless, the Breakpad shutdown code seems to use it in a safe way ... mostly. We observed that that code was passing exit code 1 to TerminateThread. Changing that exit code to 44 made the exit code returned in our randomly failing tests change to 44. Clearly, somehow the exit code set for the Breakpad helper thread by TerminateThread was becoming the exit code for the process!
We think that, although it's not mentioned in MSDN, TerminateThread is actually asynchronous. The thread is not terminated immediately. If you're (un)lucky, the main thread can trigger process exit (passing exit code 0) before the helper thread's termination is complete. Then, somewhere in the netherworld of NT kernel process finalization, after the main thread has been terminated, termination of the helper thread finally completes and as the last thread to terminate, its exit code is recorded as the process exit code!
Moral of the story: never use TerminateThread. If you must use TerminateThread, you should probably try to wait for termination to finish (e.g. by calling WaitForSingleObject) before exiting the process, if you care about your process exit code. Another moral is that record-and-replay debugging rocks (but we already knew that). Yet another moral is that debugging closed-source operating systems sucks (but we already knew that too).
Comments
(This comment is directed entirely toward Luke. :-D )
Maybe the exit code for this call to TerminateProcess is the one of the thread who orphaned the critical section?
I mostly talking out of my butt, I haven't worked through the code you pointed to, I just saw that the target thread made used of critical sections.
Jon: Microsoft's documentation was incomplete in two important ways. It doesn't mention that TerminateThread is asynchronous. And I haven't found anything that indicates the exit code of another thread may be used as the process exit code even after the main thread has passed its own exit code to ExitProcess. (For example, http://msdn.microsoft.com/en-us/library/ms686722%28VS.85%29.aspx doesn't mention it.)
Microsoft's documentation is actually better than the documentation for most free software projects. However, since software has bugs, and no documentation is complete, there's no substitute for having the source code available so you can actually see what is going on.
And thanks for fixing one more random orange - everyone watching tinderbox or tbpl should love you for that!