Tuesday, 9 November 2021

Some Observations On The NZ CovidPass System

NZ's Ministry of Health has published a specification for the data in the CovidPass QR code. The spec looks pretty good to me; there's probably enough information for anyone to go ahead and implement a verifier app today, and it should only take a few days to put together a bare-bones verifier app. The spec also tells us a lot about how the system will work. I see some confusion/misinformation out there so here are some observations.

The main idea is very simple. You ask the Ministry (probably via the My Covid Record Web site, but possibly in other ways) to generate a statement of the form "<full-name>, <date-of-birth> is considered fully vaccinated". The Ministry computer system checks your records to ensure that they agree you're fully vaccinated, then generates that statement, digitally signs it with the Ministry's private key, and encodes the statement and the signature as a QR code. You can store that code on your phone, or print it out on a piece of paper. Later, you show that QR code to a gatekeeper who wants to check your vaccination status. They scan it with their own app, which decodes the statement, checks that the statement has a valid signature from the Ministry, and if it does, tells the gatekeeper "<full-name>, <date-of-birth> is considered fully vaccinated". To confirm that you're the person the statement is talking about, the gatekeeper will need to check your driver's license or other ID.

If you're not familiar with digital signatures, it's important to know that unlike pen-and-paper signatures, altering the statement invalidates the signature and only the Ministry of Health can generate new signatures that verifiers will accept. This is basic "public key cryptography" and generally very secure. To generate a fake vaccine certificate someone would have to break into Ministry computer systems, or feed false data into the Ministry database recording them as vaccinated, or find an egregious bug in the verification software. So of course you can easily copy someone else's statement, but if you change the details to match your own details, verifier apps will reject the new statement; a copied statement is only useful if you can pretend to be the person you copied it from.

For privacy: be aware that when you let someone view your QR code, you're telling them your full name and date of birth. They could record that information if they want to (though there may be legislation soon that restricts what they can do with that information). There is no need for a verifier app to notify anyone of these QR code scans, and I would expect the government's app to not notify or record scans. (Hopefully they'll release the source code like they do for CovidTracer.)

As I mentioned above, you don't need a phone to prove you're vaccinated; your code printed on a piece of paper will work fine. Verifiers will need a phone or similar device, but it doesn't have to be connected to the Internet to verify certificates (though the app will need to be updated once in a while). So DoC rangers could scan vaccination certificates at huts for example.

The data in the QR code currently doesn't record which vaccines you have had or when. In fact the Ministry could choose to issue these certificates to people who haven't even been vaccinated, if there's a good reason.

These signed statements have an expiration date on them, so periodically a particular QR code will expire. People using their phones will probably get the new one automatically but if you carry a printed one, you will need to print a new one every so often. This means the Ministry could change the criteria for issuing new certificates (e.g. to require a booster shot) in the future.

I like the way this has been designed. It could perhaps be a bit simpler (I'm not sure using W3C DID is worthwhile), but it's simple enough. Because the Ministry has committed to this spec, it will be pretty easy to integrate certificate verification into other apps. People might even be able to implement interesting enhancements like scanning a QR code alongside a driver's license to verify the name and DoB automatically with one action. Let's hope the Ministry's contractors can finish their backend work and verifier app before the end of this month!

Monday, 4 October 2021

How WHO Failed

I see that WHO is in contention for the Nobel Peace Prize. This is absurd. WHO got almost everything wrong early in the COVID19 pandemic and probably made the pandemic much worse. Here's a list:

  • As late as April 2020 WHO was advising countries against closing borders. (NZ eliminated COVID19 after closing borders in March against WHO advice. Later, WHO had the gall to pretend NZ eliminated COVID19 by following WHO advice.)
  • As late as June 2020 WHO was advising that asymptomatic spread of COVID19 was "rare". We now know that asymptomatic spread of COVID19 was and is a major factor in transmission.
  • Until June 2020 WHO was advising people to not wear masks unless they were sick with COVID19 or caring for someone sick, because it would be either ineffective or harmful. We now know that general mask-wearing is helpful at preventing transmission.
  • Until May 2021 WHO was advising that COVID19 was spread mainly by droplets. Now we know that it is spread mainly via aerosols.
  • As late as July 2020 WHO was advising that fomite transmission was a "likely mode of transmission" for COVID19. Fomite transmission has never been demonstrated as far as I know.
  • WHO delayed declaring COVID19 a pandemic until 11 March 2020, long after it was obviously a pandemic.

It would be unreasonable to expect WHO to get everything right given the unknowns of a new pandemic. However, we should expect WHO to get more right than wrong, and the above list shows they were actually worse than useless. These failures demand serious investigation and reform, not a Nobel Prize. If that investigation and reform doesn't happen, in the next pandemic, countries will be best off ignoring WHO advice.

Sadly, I see little sign of such criticism and reform happening. Instead, as this Nobel talk illustrates, mainstream opinion backs WHO's COVID19 response and is almost completely silent on WHO's appalling COVID19 track record. I'm not sure why this has happened, but I suspect it's another casualty of American partisan politics: "Trump attacked WHO, therefore reasonable people have to uncritically support WHO". It's maddening.

(Note for those who don't know me: I am an enthusiastic supporter of mainstream science and institutions, in general. WHO bungled this one.)

Sunday, 12 September 2021

Emulating AMD Approximate Arithmetic Instructions On Intel

Pernosco accepts uploaded rr recordings from customers and replays them with binary instrumentation to build a database of all program execution, to power an amazing debugging experience. Our infrastructure is Intel-based AWS instances. Some customers upload recordings made on AMD (Zen) machines; for these recordings to replay correctly on Intel machines, instruction execution needs to produce bit-identical results. This is almost always true, but I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. Fortunately, since Pernosco replays with binary instrumentation, we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.

Reverse engineering AMD's exact algorithm and reimplementing it with Intel's instructions seemed like it would be a lot of work and tricky to reimplement correctly. Instead, we take advantage of the fact that RSQRT/RCP are unary operations on single-precision float values. This means there are only 2^32 possible inputs, so a lookup table of all results is not out of the question: in the worst case it would only be 16GB (2^32 entries of 4 bytes each). Of course we would prefer something smaller, so I computed the full table of Intel and AMD results and looked for patterns we can exploit.

Since the Intel and AMD values should always be pretty close together, I computed the XOR of the Intel and AMD values. Storing just this table lets us convert from AMD to Intel and vice versa. It turns out that for RSQRT there are only 22 distinct difference values, and for RCP only 17. This means we can store just one byte per table entry, an index into a secondary lookup table that gives the actual difference value. Another key observation is that the difference value depends only on the upper 21 bits of the input. (I suspect RSQRT/RCP completely ignore the bottom 11 bits of the input mantissa, but I haven't verified that.) Thus the main table can be stored in just 2^21 bytes, i.e. 2MB, and of course we need one table for RSQRT and one for RCP, so 4MB total, which is acceptable. With deeper analysis we might find more patterns we can use to compress the table further, but this is already good enough for our purposes.

That part was pretty easy. It turned out that most of the work was actually implementing the instrumentation. The problem is that each instruction comes in five distinct flavours. For RSQRT, there is:

  • RSQRTSS: compute RSQRT of the bottom 32 bits of the input operand, store the result in the bottom 32 bits of the output register, leaving all other output register bits unchanged
  • RSQRTPS: compute RSQRT of the bottom four 32-bit lanes of the input operand, store the results in the bottom four 32-bit lanes of the output register, leaving all other output register bits unchanged
  • VRSQRTSS (two input operands): compute RSQRT of the bottom 32 bits of the second input operand, store the result in the bottom 32 bits of the output register, copy bits 32-127 from the first input register to the output register, zero all bits >= 128 of the output register (seriously, Intel?)
  • VRSQRTPS, 128-bit version: compute RSQRT of the bottom four 32-bit lanes of the input operand, store the results in the bottom four 32-bit lanes of the output register, zero all bits >= 128 of the output register
  • VRSQRTPS, 256-bit version: compute RSQRT of the eight 32-bit lanes of the input operand, store the results in the eight 32-bit lanes of the output register
In each of these instructions the primary input operand can be a memory load operand instead of a register.

So our generated instrumentation has to perform one table lookup per lane and also handle the correct effects on other bits of the output register. If we really cared about performance we'd probably want to vectorize the table lookups, but that's hard and the performance impact is unlikely to matter in our case, so I kept it simple with serial logic using general purpose registers.

Anyway it's working well now and Pernosco is able to process AMD submissions using these instructions, so go ahead and send us your recordings to debug! (The logic also handles emulating Intel semantics if you happen to be running Pernosco on-prem on Zen hardware.) Tracing replay divergence back to RSQRTSS (through many long def-use chains) was extremely painful so I wrote a fairly good automated test suite for this work; I want to never again have to debug divergence caused by this.

Thursday, 9 September 2021

rr Trace Portability: Diverging Behavior of RSQRTSS in AMD vs Intel

When we added Zen support to rr, it was an open question whether it would be possible to reliably replay Zen recordings on Intel CPUs or vice versa. It wasn't clear whether CPU instructions normally used by applications had bit-identical semantics across vendors. Over time the news was good: replaying Zen recordings on Intel generally works — if you trap and emulate CPUID to return the Zen results, and work around a difference in x87 FIP handling. So Pernosco has been able to handle submissions from Zen users.

Unfortunately, today I discovered a new difference between AMD and Intel: the RSQRTSS instruction. Perhaps this is unsurprising, since it is described as: "computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand" (emphasis mine). A simple test program:

#include <stdio.h>
#include <string.h>
int main(void) {
  float in = 256;
  float out;
  unsigned int raw;
  asm ("rsqrtss %1,%0" : "=x"(out) : "x"(in));
  memcpy(&raw, &out, 4);
  printf("out = %x, float = %f\n", raw, out);
  return 0;
}

On Intel Skylake I get:

out = 3d7ff000, float = 0.062485

On AMD Rome I get:

out = 3d7ff800, float = 0.062492

Intel's result just stays within the documented 1.5 x 2^-12 relative error bound. (Seems unfortunate given that the exact reciprocal square root of 256 is so easily computed to 0.0625, but whatever...)

The net effect of this is that rr recordings captured on Zen that use RSQRTSS may not replay correctly on Intel machines. The instructions will execute fine but it's possible that the slight differences in results may later lead to diverging control flow, which breaks replay of the rr recording. We have seen this in practice with a Pernosco user.

I have some ideas about how to fix this for Pernosco. If they work that'll be fodder for another post.

Update: For what it's worth, the same issue exists with RCPSS (and presumably the SIMD versions (V)RCPPS and (V)RSQRTPS). Intel also has a number of new approximate-arithmetic instructions in AVX512, but has published software reference implementations of those, so hopefully if AMD does ever implement them Zen will match those. I'm not (yet) aware of any other non-AVX512 "approximate" instructions.

Saturday, 19 June 2021

Spectre Mitigations Murder *Userspace* Performance In The Presence Of Frequent Syscalls

I just made a performance improvement to the (single-threaded) rr sources command to cache the results of access system calls checking for directory existence. It's a simple and very effective optimization on my Skylake, Linux 5.12 laptop:

Before:
[roc@localhost code]$ time rr sources ~/pernosco/main/test-tmp/basics-demo >& ~/tmp/output2
real	3m19.648s
user	1m9.157s
sys	2m9.416s

After:
[roc@localhost code]$ time rr sources  ~/pernosco/main/test-tmp/basics-demo >& ~/tmp/output2
real	0m36.160s
user	0m36.009s
sys	0m0.053s

One interesting thing is that we cut the userspace execution time in half even though we're executing more userspace instructions than before. Frequent system calls actually slow down code execution in userspace. I assumed this was at least partly due to Spectre mitigations so I turned those off (with mitigations=off) and reran the test:

Before:
[roc@localhost code]$ time rr sources ~/pernosco/main/test-tmp/basics-demo >& ~/tmp/output2
real	2m5.776s
user	0m33.052s
sys	1m32.280s

After:
[roc@localhost code]$ time rr sources  ~/pernosco/main/test-tmp/basics-demo >& ~/tmp/output2
real	0m33.422s
user	0m32.934s
sys	0m0.110s
So those Spectre mitigations make pre-optimization userspace run 2x slower (due to cache and TLB flushes I guess) and the whole workload overall 1.6x slower! Before Spectre mitigations, those system calls hardly slowed down userspace execution at all.

Monday, 7 June 2021

Tama Lakes Winter Tramp 2021

This weekend my kids and I went down to Tongariro National Park for a winter tramp, repeating a similar trip in 2019. Overall it was great but apparently word got out and Waihohonu Hut was a lot busier than last time.

The weather forecast wasn't great so it was just me and my kids on this trip. The first sign of busyness was that the car park on the Desert Road was about full. We got to the hut about 2:15pm and were able to claim the last bunks, but people kept arriving. I guess there were probably more than 50 people there on Saturday night, 30+ squeezed into bunks and a lot of people who had to sleep on the floor of the main room. It's such a huge hut that this was still tolerable and we had a fun afternoon and evening. We got some views of the lower flanks of Mt Ruapehu on the walk in, and a good view of Mt Ngauruhoe topped by cloud.

Sunday was drizzly with low cloud, as forecast. One of my kids stayed at the hut to study for exams. The other one and I walked to the Tama Lakes via an off-track route I heard about years ago and had been hoping to try out ever since. I don't have much experience with off-track walking and the conditions weren't ideal, but: they weren't bad, we were carrying all relevant gear, my son and I are reasonably fit and fast walkers, we left at 8am so had plenty of time, and the return trip from the lakes was via the Great Walk track (easy). It worked out well and we had a lot of fun, though the views were obscured by cloud — on a good day they would be magnificent. I'd like to do it again but only with a small group and not often; I don't want the environment to be damaged by the route becoming popular.

We got back to the hut about 2:20pm after some fast walking and found it already pretty full, again. In fact it looked like it would be even fuller than on Saturday night, so my kids and I decided to just walk out to the car and come home a day early. That seemed like the right decision for us, but also freed up a little bit of space in the hut.

So unfortunately it seems that the Queen's Birthday trip to Waihohonu Hut will not be such a great option in the future; given it was barely tolerable with a poor weather forecast, it probably would be really intolerable had the forecast been good.

Wednesday, 19 May 2021

Forward Compatibility Of rr Recordings

Since 2017 rr has maintained backward compatibility for recordings, i.e. new rr versions can replay any recording made by any earlier rr version back to 5.0. When we set that goal, it wasn't clear for how long we'd be able to sustain it, but so far so good!

However, we have said nothing about forward compatibility (whether old rr versions are able to replay recordings produced by new rr versions), and we have broken it many times. In practice that's generally OK. However, when we do break forward compatibility and an old rr tries to replay an incompatible recording, it often just crashes mysteriously. This is suboptimal.

So, I have added "forward compatibility version checking" to rr. rr builds have a forward compatibility version; each recording is stamped with the forward compatibility version it was created with; and rr will refuse to replay a recording with a later forward compatibility version than the rr build supports. When we make an rr change that means old rrs can no longer replay new rr recordings, we'll bump the forward compatibility version in the source.

Note that rrs built before this change don't have the check, will continue to merrily try to replay recordings they can't replay, and die in exciting ways.