Saturday 30 January 2010
We were having a lot of problems on Tinderbox with xpcshell tests randomly failing on Windows by returning exit code 1. Curiously, it seemed that almost any test could fail this way, with low probability, and the output log for each test indicated that all the actual subtest assertions passed, the process just mysteriously returned an exit code of 1 for no apparent reason.
Naturally I charged into battle with my sword +5 against orange (i.e., VMWare's record-and-replay debugging). The tests failed randomly about one run in a thousand, so capturing a failure wasn't hard. Then I verified that the problematic xpcshell run was passing 0 as the exit code for the final NtTerminateProcess system call, but the outer python script was indeed getting 1 from GetExitCodeProcess. Very mysterious!
I was at a loss for a while until some questions from Benjamin Smedberg led me to observe that in the failing run, the Google Breakpad helper thread was still alive at process termination, but in passing runs, that thread had exited before process termination.
Benjamin then looked into the Breakpad shutdown code and noticed it using TerminateThread to shut down its helper thread. The MSDN documentation for TerminateThread warns that that function is an extremely dangerous API, because it abruptly terminates the thread, leaving any shared resources it was manipulating in a possibly inconsistent state. Nevertheless, the Breakpad shutdown code seems to use it in a safe way ... mostly. We observed that that code was passing exit code 1 to TerminateThread. Changing that exit code to 44 made the exit code returned in our randomly failing tests change to 44. Clearly, somehow the exit code set for the Breakpad helper thread by TerminateThread was becoming the exit code for the process!
We think that, although it's not mentioned in MSDN, TerminateThread is actually asynchronous. The thread is not terminated immediately. If you're (un)lucky, the main thread can trigger process exit (passing exit code 0) before the helper thread's termination is complete. Then, somewhere in the netherworld of NT kernel process finalization, after the main thread has been terminated, termination of the helper thread finally completes and as the last thread to terminate, its exit code is recorded as the process exit code!
Moral of the story: never use TerminateThread. If you must use TerminateThread, you should probably try to wait for termination to finish (e.g. by calling WaitForSingleObject) before exiting the process, if you care about your process exit code. Another moral is that record-and-replay debugging rocks (but we already knew that). Yet another moral is that debugging closed-source operating systems sucks (but we already knew that too).