Sunday, 12 September 2021

Emulating AMD Approximate Arithmetic Instructions On Intel

Pernosco accepts uploaded rr recordings from customers and replays them with binary instrumentation to build a database of all program execution, to power an amazing debugging experience. Our infrastructure is Intel-based AWS instances. Some customers upload recordings made on AMD (Zen) machines; for these recordings to replay correctly on Intel machines, instruction execution needs to produce bit-identical results. This is almost always true, but I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. Fortunately, since Pernosco replays with binary instrumentation, we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.

Reverse engineering AMD's exact algorithm and reimplementing it with Intel's instructions seemed like it would be a lot of work and tricky to reimplement correctly. Instead, we take advantage of the fact that RSQRT/RCP are unary operations on single-precision float values. This means there are only 232 possible inputs, so a lookup table of all results is not out of the question: in the worst case it would only be 16GB. Of course we would prefer something smaller, so I computed the full table of Intel and AMD results and looked for patterns we can exploit.

Since the Intel and AMD values should always be pretty close together, I computed the XOR of the Intel and AMD values. Storing just this table lets us convert from AMD to Intel and vice versa. It turns out that for RSQRT there are only 22 distinct difference values, and for RCP only 17. This means we can store just one byte per table entry, an index into a secondary lookup table that gives the actual difference value. Another key observation is that the difference value depends only on the upper 21 bits of the input. (I suspect RSQRT/RCP completely ignore the bottom 11 bits of the input mantissa, but I haven't verified that.) Thus the main table can be stored in just 221 bytes, i.e. 2MB, and of course we need one table for RSQRT and one for RCP, so 4MB total, which is acceptable. With deeper analysis we might find more patterns we can use to compress the table further, but this is already good enough for our purposes.

That part was pretty easy. It turned out that most of the work was actually implementing the instrumentation. The problem is that each instruction comes in five distinct flavours. For RSQRT, there is:

  • RSQRTSS: compute RSQRT of the bottom 32 bits of the input operand, store the result in the bottom 32 bits of the output register, leaving all other output register bits unchanged
  • RSQRTPS: compute RSQRT of the bottom four 32-bit lanes of the input operand, store the results in the bottom four 32-bit lanes of the output register, leaving all other output register bits unchanged
  • VRSQRTSS (two input operands): compute RSQRT of the bottom 32 bits of the second input operand, store the result in the bottom 32 bits of the output register, copy bits 32-127 from the first input register to the output register, zero all bits >= 128 of the output register (seriously, Intel?)
  • VRSQRTPS, 128-bit version: compute RSQRT of the bottom four 32-bit lanes of the input operand, store the results in the bottom four 32-bit lanes of the output register, zero all bits >= 128 of the output register
  • VRSQRTPS, 256-bit version: compute RSQRT of the eight 32-bit lanes of the input operand, store the results in the eight 32-bit lanes of the output register
In each of these instructions the primary input operand can be a memory load operand instead of a register.

So our generated instrumentation has to perform one table lookup per lane and also handle the correct effects on other bits of the output register. If we really cared about performance we'd probably want to vectorize the table lookups, but that's hard and the performance impact is unlikely to matter in our case, so I kept it simple with serial logic using general purpose registers.

Anyway it's working well now and Pernosco is able to process AMD submissions using these instructions, so go ahead and send us your recordings to debug! (The logic also handles emulating Intel semantics if you happen to be running Pernosco on-prem on Zen hardware.) Tracing replay divergence back to RSQRTSS (through many long def-use chains) was extremely painful so I wrote a fairly good automated test suite for this work; I want to never again have to debug divergence caused by this.

Thursday, 9 September 2021

rr Trace Portability: Diverging Behavior of RSQRTSS in AMD vs Intel

When we added Zen support to rr, it was an open question whether it would be possible to reliably replay Zen recordings on Intel CPUs or vice versa. It wasn't clear whether CPU instructions normally used by applications had bit-identical semantics across vendors. Over time the news was good: replaying Zen recordings on Intel generally works — if you trap and emulate CPUID to return the Zen results, and work around a difference in x87 FIP handling. So Pernosco has been able to handle submissions from Zen users.

Unfortunately, today I discovered a new difference between AMD and Intel: the RSQRTSS instruction. Perhaps this is unsurprising, since it is described as: "computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand" (emphasis mine). A simple test program:

#include <stdio.h>
#include <string.h>
int main(void) {
  float in = 256;
  float out;
  unsigned int raw;
  asm ("rsqrtss %1,%0" : "=x"(out) : "x"(in));
  memcpy(&raw, &out, 4);
  printf("out = %x, float = %f\n", raw, out);
  return 0;
}
On Intel Skylake I get
out = 3d7ff000, float = 0.062485
On AMD Rome I get
out = 3d7ff800, float = 0.062492
Intel's result just stays within the documented 1.5 x 2-12 relative error bound. (Seems unfortunate given that the exact reciprocal square root of 256 is so easily computed to 0.0625, but whatever...)

The net effect of this is that rr recordings captured on Zen that use RSQRTSS may not replay correctly on Intel machines. The instructions will execute fine but it's possible that the slight differences in results may later lead to diverging control flow which break the rr recording. We have seen this in practice with a Pernosco user.

I have some ideas about how to fix this for Pernosco. If they work that'll be fodder for another post.

Update For what it's worth, the same issue exists with RCPSS (and presumably the SIMD versions (V)RCPPS and (V)RSQRTPS). Intel also has a number of new approximate-arithmetic instructions in AVX512, but has published software reference implementations of those, so hopefully if AMD does ever implement them Zen will match those. I'm not (yet) aware of any other non-AVX512 "approximate" instructions.