Eyes Above The Waves

Robert O'Callahan. Christian. Repatriate Kiwi. Hacker.

Saturday 28 April 2018

CPUID Features, XSAVE, And rr Trace Portability

Last year we made rr use "CPUID faulting" to record and replay the results of CPUID instructions. This has enabled a reasonable level of rr trace portability in practice. However, we've run into some interesting limitations, some of which we've addressed and others that I don't know how to address.

Sometimes the recording CPUs support features that are missing on the CPUs you want to eventually replay on. If your application detects and uses those features, replay won't work. Fortunately, because CPUID faulting lets us control the results of application CPUID invocations, it was easy to extend rr with the ability to mask off CPUID feature bits during recording. I added recording options --disable-cpuid-features, --disable-cpuid-features-ext, and --disable-cpuid-features-xsave to do this. To make it easier to determine which bits to mask off, I also added a new command rr cpufeatures which you can run on a replay machine to print the command line options you should use for recording, to disable features unsupported on the replay machine. This seems to work reasonably well.

Unfortunately there's a portability hazard related to the XSAVE instruction. The size of the memory range written by XSAVE depends on the XSAVE features supported by the CPU and enabled by the OS via the XCR0 register. This can be different on the recording and replay machines and there's nothing rr can do about it! User-space code can (and should!) use CPUID queries to determine what that size is, but overriding what we report doesn't change what the CPU will actually write. Even though applications shouldn't really depend on the what's written by XSAVE, as long as XSAVE/XRSTOR pairs actually restore state as expected, in practice the size of XSAVE writes affects uninitialized values on the stack which leak into registers which cause rr to detect divergence and abort.

A possible work-around is to mask off XSAVE feature bit in CPUID results, so well-behaved user-space programs don't use XSAVE at all. Unfortunately that also effectively disables AVX and other XSAVE-dependent features :-(.

A possible fix is to add a Linux kernel feature to let us set the XCR0 value for specific processes, i.e. disable XSAVE state saving for some XSAVE components. Then during recording we could limit the XSAVE components to those that are supported by the replay machine. Unfortunately, from experience we know that adding new a kernel API upstream is quite difficult, especially those that require adding code to context switches.

A simpler kernel patch would be to provide a boot command-line option to mask bits out of XCR0 for all processes.

Another possible work-around would be to record in a virtual machine whose XCR0 is specifically configured to match the replay machine's. I haven't tried it but I assume it's possible; VM migration would require this.

For now I plan to just have rr print an error during replay when XSAVE is enabled and the replay machine's XCR0 features are not equal to the recording machine's.

Update I realized that the XSAVEC instruction ("XSAVE compressed") avoids the above issues. State components that are not in use do not affect what is written by XSAVEC. If a state component is actually in use during recording but not supported during replay, replay is already impossible. Therefore applications that stick to XSAVEC and avoid XSAVE will not incur the above problems. The only user-level use of XSAVE I currently know of is in the glibc dynamic loader, which prefers XSAVEC if available; therefore recording (and replaying) on machines supporting XSAVEC should generally work fine regardless of what XSAVE features are enabled via XCR0. The situation is not as bad as I feared.