Monday, 4 October 2021

How WHO Failed

I see that WHO is in contention for the Nobel Peace Prize. This is absurd. WHO got almost everything wrong early in the COVID19 pandemic and probably made the pandemic much worse. Here's a list:

  • As late as April 2020 WHO was advising countries against closing borders. (NZ eliminated COVID19 after closing borders in March against WHO advice. Later, WHO had the gall to pretend NZ eliminated COVID19 by following WHO advice.)
  • As late as June 2020 WHO was advising that asymptomatic spread of COVID19 was "rare". We now know that asymptomatic spread of COVID19 was and is a major factor in transmission.
  • Until June 2020 WHO was advising people to not wear masks unless they were sick with COVID19 or caring for someone sick, because it would be either ineffective or harmful. We now know that general mask-wearing is helpful at preventing transmission.
  • Until May 2021 WHO was advising that COVID19 was spread mainly by droplets. Now we know that it is spread mainly via aerosols.
  • As late as July 2020 WHO was advising that fomite transmission was a "likely mode of transmission" for COVID19. Fomite transmission has never been demonstrated as far as I know.
  • WHO delayed declaring COVID19 a pandemic until 11 March 2020, long after it was obviously a pandemic.

It would be unreasonable to expect WHO to get everything right given the unknowns of a new pandemic. However, we should expect WHO to get more right than wrong, and the above list shows they were actually worse than useless. These failures demand serious investigation and reform, not a Nobel Prize. If that investigation and reform doesn't happen, in the next pandemic, countries will be best off ignoring WHO advice.

Sadly, I see little sign of such criticism and reform happening. Instead, as this Nobel talk illustrates, mainstream opinion backs WHO's COVID19 response and is almost completely silent on WHO's appalling COVID19 track record. I'm not sure why this has happened, but I suspect it's another casualty of American partisan politics: "Trump attacked WHO, therefore reasonable people have to uncritically support WHO". It's maddening.

(Note for those who don't know me: I am an enthusiastic supporter of mainstream science and institutions, in general. WHO bungled this one.)

Sunday, 12 September 2021

Emulating AMD Approximate Arithmetic Instructions On Intel

Pernosco accepts uploaded rr recordings from customers and replays them with binary instrumentation to build a database of all program execution, to power an amazing debugging experience. Our infrastructure is Intel-based AWS instances. Some customers upload recordings made on AMD (Zen) machines; for these recordings to replay correctly on Intel machines, instruction execution needs to produce bit-identical results. This is almost always true, but I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. Fortunately, since Pernosco replays with binary instrumentation, we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.

Reverse engineering AMD's exact algorithm and reimplementing it with Intel's instructions seemed like it would be a lot of work and tricky to reimplement correctly. Instead, we take advantage of the fact that RSQRT/RCP are unary operations on single-precision float values. This means there are only 232 possible inputs, so a lookup table of all results is not out of the question: in the worst case it would only be 16GB. Of course we would prefer something smaller, so I computed the full table of Intel and AMD results and looked for patterns we can exploit.

Since the Intel and AMD values should always be pretty close together, I computed the XOR of the Intel and AMD values. Storing just this table lets us convert from AMD to Intel and vice versa. It turns out that for RSQRT there are only 22 distinct difference values, and for RCP only 17. This means we can store just one byte per table entry, an index into a secondary lookup table that gives the actual difference value. Another key observation is that the difference value depends only on the upper 21 bits of the input. (I suspect RSQRT/RCP completely ignore the bottom 11 bits of the input mantissa, but I haven't verified that.) Thus the main table can be stored in just 221 bytes, i.e. 2MB, and of course we need one table for RSQRT and one for RCP, so 4MB total, which is acceptable. With deeper analysis we might find more patterns we can use to compress the table further, but this is already good enough for our purposes.

That part was pretty easy. It turned out that most of the work was actually implementing the instrumentation. The problem is that each instruction comes in five distinct flavours. For RSQRT, there is:

  • RSQRTSS: compute RSQRT of the bottom 32 bits of the input operand, store the result in the bottom 32 bits of the output register, leaving all other output register bits unchanged
  • RSQRTPS: compute RSQRT of the bottom four 32-bit lanes of the input operand, store the results in the bottom four 32-bit lanes of the output register, leaving all other output register bits unchanged
  • VRSQRTSS (two input operands): compute RSQRT of the bottom 32 bits of the second input operand, store the result in the bottom 32 bits of the output register, copy bits 32-127 from the first input register to the output register, zero all bits >= 128 of the output register (seriously, Intel?)
  • VRSQRTPS, 128-bit version: compute RSQRT of the bottom four 32-bit lanes of the input operand, store the results in the bottom four 32-bit lanes of the output register, zero all bits >= 128 of the output register
  • VRSQRTPS, 256-bit version: compute RSQRT of the eight 32-bit lanes of the input operand, store the results in the eight 32-bit lanes of the output register
In each of these instructions the primary input operand can be a memory load operand instead of a register.

So our generated instrumentation has to perform one table lookup per lane and also handle the correct effects on other bits of the output register. If we really cared about performance we'd probably want to vectorize the table lookups, but that's hard and the performance impact is unlikely to matter in our case, so I kept it simple with serial logic using general purpose registers.

Anyway it's working well now and Pernosco is able to process AMD submissions using these instructions, so go ahead and send us your recordings to debug! (The logic also handles emulating Intel semantics if you happen to be running Pernosco on-prem on Zen hardware.) Tracing replay divergence back to RSQRTSS (through many long def-use chains) was extremely painful so I wrote a fairly good automated test suite for this work; I want to never again have to debug divergence caused by this.

Thursday, 9 September 2021

rr Trace Portability: Diverging Behavior of RSQRTSS in AMD vs Intel

When we added Zen support to rr, it was an open question whether it would be possible to reliably replay Zen recordings on Intel CPUs or vice versa. It wasn't clear whether CPU instructions normally used by applications had bit-identical semantics across vendors. Over time the news was good: replaying Zen recordings on Intel generally works — if you trap and emulate CPUID to return the Zen results, and work around a difference in x87 FIP handling. So Pernosco has been able to handle submissions from Zen users.

Unfortunately, today I discovered a new difference between AMD and Intel: the RSQRTSS instruction. Perhaps this is unsurprising, since it is described as: "computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand" (emphasis mine). A simple test program:

#include <stdio.h>
#include <string.h>
int main(void) {
  float in = 256;
  float out;
  unsigned int raw;
  asm ("rsqrtss %1,%0" : "=x"(out) : "x"(in));
  memcpy(&raw, &out, 4);
  printf("out = %x, float = %f\n", raw, out);
  return 0;
}
On Intel Skylake I get
out = 3d7ff000, float = 0.062485
On AMD Rome I get
out = 3d7ff800, float = 0.062492
Intel's result just stays within the documented 1.5 x 2-12 relative error bound. (Seems unfortunate given that the exact reciprocal square root of 256 is so easily computed to 0.0625, but whatever...)

The net effect of this is that rr recordings captured on Zen that use RSQRTSS may not replay correctly on Intel machines. The instructions will execute fine but it's possible that the slight differences in results may later lead to diverging control flow which break the rr recording. We have seen this in practice with a Pernosco user.

I have some ideas about how to fix this for Pernosco. If they work that'll be fodder for another post.

Update For what it's worth, the same issue exists with RCPSS (and presumably the SIMD versions (V)RCPPS and (V)RSQRTPS). Intel also has a number of new approximate-arithmetic instructions in AVX512, but has published software reference implementations of those, so hopefully if AMD does ever implement them Zen will match those. I'm not (yet) aware of any other non-AVX512 "approximate" instructions.

Saturday, 19 June 2021

Spectre Mitigations Murder *Userspace* Performance In The Presence Of Frequent Syscalls

I just made a performance improvement to the (single-threaded) rr sources command to cache the results of access system calls checking for directory existence. It's a simple and very effective optimization on my Skylake, Linux 5.12 laptop:

Before:
[roc@localhost code]$ time rr sources ~/pernosco/main/test-tmp/basics-demo >& ~/tmp/output2
real	3m19.648s
user	1m9.157s
sys	2m9.416s

After:
[roc@localhost code]$ time rr sources  ~/pernosco/main/test-tmp/basics-demo >& ~/tmp/output2
real	0m36.160s
user	0m36.009s
sys	0m0.053s

One interesting thing is that we cut the userspace execution time in half even though we're executing more userspace instructions than before. Frequent system calls actually slow down code execution in userspace. I assumed this was at least partly due to Spectre mitigations so I turned those off (with mitigations=off) and reran the test:

Before:
[roc@localhost code]$ time rr sources ~/pernosco/main/test-tmp/basics-demo >& ~/tmp/output2
real	2m5.776s
user	0m33.052s
sys	1m32.280s

After:
[roc@localhost code]$ time rr sources  ~/pernosco/main/test-tmp/basics-demo >& ~/tmp/output2
real	0m33.422s
user	0m32.934s
sys	0m0.110s
So those Spectre mitigations make pre-optimization userspace run 2x slower (due to cache and TLB flushes I guess) and the whole workload overall 1.6x slower! Before Spectre mitigations, those system calls hardly slowed down userspace execution at all.

Monday, 7 June 2021

Tama Lakes Winter Tramp 2021

This weekend my kids and I went down to Tongariro National Park for a winter tramp, repeating a similar trip in 2019. Overall it was great but apparently word got out and Waihohonu Hut was a lot busier than last time.

The weather forecast wasn't great so it was just me and my kids on this trip. The first sign of busyness was that the car park of the Desert Road was about full. We got to the hut about 2:15pm and were able to claim the last bunks, but people kept arriving. I guess there were probably more than 50 people there on Saturday night, 30+ squeezed into bunks and a lot of people who had to sleep on the floor of the main room. It's such a huge hut that this was still tolerable and we had a fun afternoon and evening. We got some views of the lower flanks of Mt Ruapehu on the walk in, and a good view of Mt Ngaurahoe topped by cloud.

Sunday was drizzly with low cloud, as forecast. One of my kids stayed at the hut to study for exams. The other one and I walked to the Tama Lakes via an off-track route I heard about years ago and had been hoping to try out ever since. I don't have much experience with off-track walking and the conditions weren't ideal, but: they weren't bad, we were carrying all relevant gear, my son and I are reasonably fit and fast walkers, we left at 8am so had plenty of time, and the return trip from the lakes was via the Great Walk track (easy). It worked out well and we had a lot of fun, though the views were obscured by cloud — on a good day they would be magnificent. I'd like to do it again but only with a small group and not often; I don't want the environment to be damaged by the route becoming popular.

We got back to the hut about 2:20pm after some fast walking and found it already pretty full, again. In fact it looked like it would be even fuller than on Saturday night, so my kids and I decided to just walk out to the car and come home a day early. That seemed like the right decision for us, but also freed up a little bit of space in the hut.

So unfortunately it seems that the Queens Birthday trip to Waihohonu Hut will not be such a great option in the future; given it was barely tolerable with a poor weather forecast, it probably would be really intolerable had the forecast been good.

Wednesday, 19 May 2021

Forward Compatibility Of rr Recordings

Since 2017 rr has maintained backward compatibility for recordings, i.e. new rr versions can replay any recording made by any earlier rr version back to 5.0. When we set that goal, it wasn't clear for how long we'd be able to sustain it, but so far so good!

However, we have said nothing about forward compabitility — whether old rr versions are able to replay recordings produced by new rr versions — and in practice we have broken that many times. In practice that's generally OK. However, when we do break forward compatibility, when an old rr tries to replay an incompatible recording, it often just crashes mysteriously. This is suboptimal.

So, I have added "forward compatibility version checking" to rr. rr builds have a forward compability version; each recording is stamped with the forward compatibility version it was created with; and rr will refuse to replay a recording with a later forward compatibility version than the rr build supports. When we make an rr change that means old rrs can no longer replay new rr recordings, we'll bump the forward compatibility version in the source.

Note that rrs built before this change don't have the check, will continue to merrily try to replay recordings they can't replay, and die in exciting ways.

Tuesday, 4 May 2021

Lake Waikaremoana 2021

Last week I did the Lake Waikaremoana Great Walk again, in Te Uruwera National Park. Last time it was just me and my kids, and that was our first "Great Walk" and first more-than-one-night tramp, so it has special memories for me (hard to believe it was only seven years ago!). Since then my tramping group has grown; this time I was with one of my kids and ten friends. The weather was excellent and once again, I think everyone had a great time — I certainly did! I really thank God for all of it: the weather; the location; the huts and tracks and the people who maintain them; my tramping friends, whom I love so much; and the time I get to spend with them.

We did the track in the clockwise direction, starting from Hopuruahine at the north end of the lake, walking the west edge of the lake to Onepoto in the south. Most people go the other way but I like this direction because it leaves the best views for last, along the Panekiri Bluff. We took four days, even though it's definitely doable in three, because I like spending time around huts, and the only way to spread the walk evenly over three days is to skip staying at Panekiri Hut, which has the best views.

I heard from a local that this summer was the busiest ever at Lake Waikaremoana, which makes sense because NZers haven't wanted to leave the country. We were right at the end of the first-term school holidays and the track was reasonably busy (the most popular huts (Marauiti Hut, Waiopaoa Hut) were full, but the others weren't) but by no means crowded. We would see a couple of dozen other people on the track each day and fewer people at the huts we stayed at.

So six of our group stayed at Marauiti Hut the first night, and four at Waiharuru Hut (because Marauiti Hut had filled up before they booked). On the second day one of my friends and his six-year-old son (on his first overnight tramp) were dropped by water taxi at the beach by Maraunui campsite; that worked well. Unfortunately one of our group had a leg problem and had to take the same water taxi out. That night we all stayed at Waiopaoa Hut (after most of us did the side trip to Korokoro Falls). On the third day we raced up to Panekiri Hut in three hours and had the whole afternoon there — a bit cold, but amazing views when the cloud lifted. Many games of Bang were played. The last day was pretty slow for various reasons so we didn't exit until about 1:30pm, but it was great to be in the extraordinary "goblin forest" along the Panekiri Bluff, with amazing views over the lake on a perfectly clear day.

For this trip I ran my dehydrator four times — lots of fruit chips, and half a kilogram of beef jerky. That was all well received. We also had lots of other usual snacks (chocolate, muesli bars). As a result we didn't eat many of the cracker boxes I had brought for lunch; we'll cut down on those next time. We did nearly run out of gas and actually run out of toilet paper (at the end), so I need to increase our budget for those. We rehydrated my dehydrated carrots by draining pasta water over them; they didn't grow much, but they took on the texture of pickled vegetables and were popular, so we'll do that again.

With such a big group it's easy to just talk to each other, so I need to make a special effort to talk to other people at the huts. We did meet a few interesting characters. One group at Panekiri Hut, who had just started the track, needed their car moved from Onepoto to Hopuruahine so we did that for them after we finished the track. I hope they picked it up OK!

In my opinion, Lake Waikaremoana isn't the best of the Great Walks in any particular dimension, but it's certainly very good and well worth doing if you've done the others. Kia ora Ngai TÅ«hoe!