Sunday, 9 January 2022

The End Of The Runway

As of January 10 I'm leaving day-to-day work on Pernosco and joining Google's NZ office.

Pernosco is doing OK. We signed up significant new customers in 2021 and many of our customers really love the product. I'm very proud of it and I continue to think it has immense potential. However, I've been working on Pernosco for nearly 6 years now and my runway has simply run out. On the other side of the coin, Google has announced they want to start engineering work in their Auckland office, they reached to out to me, and ended up making me an incredibly attractive offer. I think having Google engineering here is important for NZ and I'm eager to contribute to building that up. I'm also very much looking forward to working with people in an office again after six years at home, on an interesting project too.

Pernosco is not going away! Kyle will continuing maintaining, improving and selling it, and there is significant revenue to support him doing that. I'm retaining my stake in the company, but I won't be doing Pernosco development, sales or operations while I'm at Google. I have high hopes that Pernosco usage will continue to grow and Pernosco will be increasingly successful.

Sunday, 19 December 2021

Mt Pirongia 2021

On Friday I took advantage of the Auckland border having opened (on Wednesday) to travel down to Mt Pirongia and tramp to the summit, staying at Pahautea Hut overnight and then walking out again on Saturday (yesterday). This is the second time I've done Mt Pirongia (the last one was April 2016). It was intense but pretty great!

IIRC last time we took the shortest route — Corcoran Rd end, up Tirohanga Track to Ruapane peak and then along the track to Pirongia summit, returning via the same route. This time we did a loop from the Grey Road end: taking the link track to Ruapane Track, then joining the Tirohanga track to Pirongia summit, then back down Mahaukura Track to the car park via Mahaukura and Wharauroa peaks. It's definitely longer this way but you see and do more.

Pirongia is extremely rugged and the tracks reflect this. There aren't many steps or boardwalk sections and the tracks stick to the ridgelines, and there are many peaks along those ridges (the remains of many volcanic cores), so you're constantly scrambling up and down steep slopes with the aid of rocks and roots. Where the rock faces are nontrivial, chains have been installed to help with the climbing. Pirongia gets a lot of rainfall and its soils don't drain well so there's a lot of mud along the way. Although the tracks are quite short horizontally, they're hard going. Good fitness, good boots, and determination are all pretty important here. But Pirongia isn't huge so you should get there in the end.

On Friday night the hut was not very full — twenty bunks but only seven people there, me and my three friends and another group, three young women. We got talking to their leader, who told us about her extensive tramping experience and an upcoming 10-day tramp around the infamous Northwest Circuit on Stewart Island that she was organising. Later she mentioned she's 17 years old. I was a bit flabbergasted to be honest. Good for her, and well done to her parents!

There's a real shortage of tramping huts in the Auckland region. Within a two-hour drive there's really only Pahautea at Pirongia and Crosbie's/Pinnacles in the Kauaeranga Valley in Coromandel, as far as I know. The latter are super busy and Pirongia is just a bit too hard for inexperienced trampers. But if you are fit and at least a little bit experienced, it's good option.

Saturday, 18 December 2021

Do We Really Need A Link Step?

mold looks pretty cool, and a faster drop-in ld replacement is obviously extremely useful. But having no link step at all would be even faster. Why do native-code compilers write out temporary object files which then have to be linked together, anyway? Could we stop doing that and have compilers emit compiled translation units directly into a final executable file that the OS can execute directly --- a "zero-link" approach? I think we could ... in many cases.

The basic idea is to treat the final executable file (an ELF file, say) as a mutable data structure. When the compiler would emit an object file it instead allocates space in that executable file using a shared memory allocator, and writes the object code directly into that space. To make this tractable we'll assume we aren't going to generate optimal code in size or space; we're going to build an executable that runs "pretty fast", for testing purposes (manual or automated).

An obvious problem is how to handle symbol resolution, i.e. what happens when the compiler emits code for a translation unit that uses symbol A from some other unit --- especially if that other unit hasn't been compiled yet? Here's an option for function symbols: when A is used for the first time, write a stub for A to the final binary and call that. When a definition for A is seen, patch the stub to jump to the definition. Internally this will mean maintaining a parallel hashtable of all undefined symbols that all compiler instances can share efficiently.

For data symbols, instead of using a stub, we can emit a pointer that can be patched to point to the data definition. For complex constants, we might need to defer initialization until load time or emit code to initialize them lazily.

To challenge the design a bit more, let's think about why object files are useful.

Sometimes compilers emit object files for a project which are then linked into multiple different output binaries. True, but it's more efficient to link them once into a single shared library which is then loaded by each of those output binaries, so projects should just do that.

Compilers use object files for incremental compilation: when a translation unit hasn't changed, its object file can be reused. We can capture the same benefits with the zero-link approach: reuse the final executable and keep around its symbol hashtable; when an object file changes, release the object file's space in the final executable, and allocate new space for the new object file.

You can combine multiple object files into a static library and the linker will select the object files that are needed to satistfy undefined symbols. In many projects this feature is only used for "system libraries" --- a project's build system should avoid building project object files that will not be used in the final link. System libraries are usually dynamically linked for sharing reasons. When we really need to subset static libraries, we could link objects from those libraries into our final executable on-demand when we first see them being used.

Another issue is debuginfo (particularly important to me!) Supporting debuginfo would require extending DWARF5 debuginfo sections to allow their data to be scattered over the final executable.

There are lots of unresolved questions, enough that I wouldn't bet money on this actually being practical. But I think it's worth questioning the assumption that we have to have a link step for native binaries.

Update Zig can do something like this.

Tuesday, 9 November 2021

Some Observations On The NZ CovidPass System

NZ's Ministry of Health has published a specification for the data in the CovidPass QR code. The spec looks pretty good to me; there's probably enough information for anyone to go ahead and implement a verifier app today, and it should only take a few days to put together a bare-bones verifier app. The spec also tells us a lot about how the system will work. I see some confusion/misinformation out there so here are some observations.

The main idea is very simple. You ask the Ministry (probably via the My Covid Record Web site, but possibly in other ways) to generate a statement of the form "<full-name>, <date-of-birth> is considered fully vaccinated". The Ministry computer system checks your records to ensure that they agree you're fully vaccinated, then generates that statement, digitally signs it with the Ministry's private key, and encodes the statement and the signature as a QR code. You can store that code on your phone, or print it out on a piece of paper. Later, you show that QR code to a gatekeeper who wants to check your vaccination status. They scan it with their own app, which decodes the statement, checks that the statement has a valid signature from the Ministry, and if it does, tells the gatekeeper "<full-name>, <date-of-birth> is considered fully vaccinated". To confirm that you're the person the statement is talking about, the gatekeeper will need to check your driver's license or other ID.

If you're not familar with digital signatures, it's important to know that unlike pen-and-paper signatures, altering the statement invalidates the signature and only the Ministry of Health can generate new signatures that verifiers will accept. This is basic "public key crytography" and generally very secure. To generate a fake vaccine certificate someone would have to break into Ministry computer systems, or feed false data into the Ministry database recording them as vaccinated, or find an egregious bug in the verification software. So of course you can easily copy someone else's statement, but if you change the details to match your own details, verifier apps will reject the new statement; a copied statement is only useful if you can pretend to be the person you copied it from.

For privacy: be aware that when you let someone view your QR code, you're telling them your full name and date of birth. They could record that information if they want to (though there may be legislation soon that restricts what they can do with that information). There is no need for a verifier app to notify anyone of these QR code scans, and I would expect the government's app to not notify or record scans. (Hopefully they'll release the source code like they do for CovidTracer.)

As I mentioned above, you don't need a phone to prove you're vaccinated; your code printed on a piece of paper will work fine. Verifiers will need a phone or similar device, but it doesn't have to be connected to the Internet to verify certificates (though the app will need to be updated once in a while). So DoC rangers could scan vaccination certificates at huts for example.

The data in the QR code currently doesn't record which vaccines you have had or when. In fact the Ministry could choose to issue these certificates to people who haven't even been vaccinated, if there's a good reason.

These signed statements have an expiration date on them, so periodically a particular QR code will expire. People using their phones will probably get the new one automatically but if you carry a printed one, you will need to print a new one every so often. This means the Ministry could change the criteria for issuing new certificates (e.g. to require a booster shot) in the future.

I like the way this has been designed. It could perhaps be a bit simpler — I'm not sure using W3C DID is worthwhile — but it's simple enough. By committing to this spec, it will be pretty easy to integrate certificate verification into other apps. People might even be able to implement interesting enhancements like scanning a QR code alongside a drivers license to verify the name and DoB automatically with one action. Let's hope the Ministry's contractors can finish their backend work and verifier app before the end of this month!

Monday, 4 October 2021

How WHO Failed

I see that WHO is in contention for the Nobel Peace Prize. This is absurd. WHO got almost everything wrong early in the COVID19 pandemic and probably made the pandemic much worse. Here's a list:

  • As late as April 2020 WHO was advising countries against closing borders. (NZ eliminated COVID19 after closing borders in March against WHO advice. Later, WHO had the gall to pretend NZ eliminated COVID19 by following WHO advice.)
  • As late as June 2020 WHO was advising that asymptomatic spread of COVID19 was "rare". We now know that asymptomatic spread of COVID19 was and is a major factor in transmission.
  • Until June 2020 WHO was advising people to not wear masks unless they were sick with COVID19 or caring for someone sick, because it would be either ineffective or harmful. We now know that general mask-wearing is helpful at preventing transmission.
  • Until May 2021 WHO was advising that COVID19 was spread mainly by droplets. Now we know that it is spread mainly via aerosols.
  • As late as July 2020 WHO was advising that fomite transmission was a "likely mode of transmission" for COVID19. Fomite transmission has never been demonstrated as far as I know.
  • WHO delayed declaring COVID19 a pandemic until 11 March 2020, long after it was obviously a pandemic.

It would be unreasonable to expect WHO to get everything right given the unknowns of a new pandemic. However, we should expect WHO to get more right than wrong, and the above list shows they were actually worse than useless. These failures demand serious investigation and reform, not a Nobel Prize. If that investigation and reform doesn't happen, in the next pandemic, countries will be best off ignoring WHO advice.

Sadly, I see little sign of such criticism and reform happening. Instead, as this Nobel talk illustrates, mainstream opinion backs WHO's COVID19 response and is almost completely silent on WHO's appalling COVID19 track record. I'm not sure why this has happened, but I suspect it's another casualty of American partisan politics: "Trump attacked WHO, therefore reasonable people have to uncritically support WHO". It's maddening.

(Note for those who don't know me: I am an enthusiastic supporter of mainstream science and institutions, in general. WHO bungled this one.)

Sunday, 12 September 2021

Emulating AMD Approximate Arithmetic Instructions On Intel

Pernosco accepts uploaded rr recordings from customers and replays them with binary instrumentation to build a database of all program execution, to power an amazing debugging experience. Our infrastructure is Intel-based AWS instances. Some customers upload recordings made on AMD (Zen) machines; for these recordings to replay correctly on Intel machines, instruction execution needs to produce bit-identical results. This is almost always true, but I recently discovered that the approximate arithmetic instructions RSQRTSS, RCPSS and friends do not produce identical results on Zen vs Intel. Fortunately, since Pernosco replays with binary instrumentation, we can insert code to emulate the AMD behavior of these instructions. I just needed to figure out a good way to implement that emulation.

Reverse engineering AMD's exact algorithm and reimplementing it with Intel's instructions seemed like it would be a lot of work and tricky to reimplement correctly. Instead, we take advantage of the fact that RSQRT/RCP are unary operations on single-precision float values. This means there are only 232 possible inputs, so a lookup table of all results is not out of the question: in the worst case it would only be 16GB. Of course we would prefer something smaller, so I computed the full table of Intel and AMD results and looked for patterns we can exploit.

Since the Intel and AMD values should always be pretty close together, I computed the XOR of the Intel and AMD values. Storing just this table lets us convert from AMD to Intel and vice versa. It turns out that for RSQRT there are only 22 distinct difference values, and for RCP only 17. This means we can store just one byte per table entry, an index into a secondary lookup table that gives the actual difference value. Another key observation is that the difference value depends only on the upper 21 bits of the input. (I suspect RSQRT/RCP completely ignore the bottom 11 bits of the input mantissa, but I haven't verified that.) Thus the main table can be stored in just 221 bytes, i.e. 2MB, and of course we need one table for RSQRT and one for RCP, so 4MB total, which is acceptable. With deeper analysis we might find more patterns we can use to compress the table further, but this is already good enough for our purposes.

That part was pretty easy. It turned out that most of the work was actually implementing the instrumentation. The problem is that each instruction comes in five distinct flavours. For RSQRT, there is:

  • RSQRTSS: compute RSQRT of the bottom 32 bits of the input operand, store the result in the bottom 32 bits of the output register, leaving all other output register bits unchanged
  • RSQRTPS: compute RSQRT of the bottom four 32-bit lanes of the input operand, store the results in the bottom four 32-bit lanes of the output register, leaving all other output register bits unchanged
  • VRSQRTSS (two input operands): compute RSQRT of the bottom 32 bits of the second input operand, store the result in the bottom 32 bits of the output register, copy bits 32-127 from the first input register to the output register, zero all bits >= 128 of the output register (seriously, Intel?)
  • VRSQRTPS, 128-bit version: compute RSQRT of the bottom four 32-bit lanes of the input operand, store the results in the bottom four 32-bit lanes of the output register, zero all bits >= 128 of the output register
  • VRSQRTPS, 256-bit version: compute RSQRT of the eight 32-bit lanes of the input operand, store the results in the eight 32-bit lanes of the output register
In each of these instructions the primary input operand can be a memory load operand instead of a register.

So our generated instrumentation has to perform one table lookup per lane and also handle the correct effects on other bits of the output register. If we really cared about performance we'd probably want to vectorize the table lookups, but that's hard and the performance impact is unlikely to matter in our case, so I kept it simple with serial logic using general purpose registers.

Anyway it's working well now and Pernosco is able to process AMD submissions using these instructions, so go ahead and send us your recordings to debug! (The logic also handles emulating Intel semantics if you happen to be running Pernosco on-prem on Zen hardware.) Tracing replay divergence back to RSQRTSS (through many long def-use chains) was extremely painful so I wrote a fairly good automated test suite for this work; I want to never again have to debug divergence caused by this.

Thursday, 9 September 2021

rr Trace Portability: Diverging Behavior of RSQRTSS in AMD vs Intel

When we added Zen support to rr, it was an open question whether it would be possible to reliably replay Zen recordings on Intel CPUs or vice versa. It wasn't clear whether CPU instructions normally used by applications had bit-identical semantics across vendors. Over time the news was good: replaying Zen recordings on Intel generally works — if you trap and emulate CPUID to return the Zen results, and work around a difference in x87 FIP handling. So Pernosco has been able to handle submissions from Zen users.

Unfortunately, today I discovered a new difference between AMD and Intel: the RSQRTSS instruction. Perhaps this is unsurprising, since it is described as: "computes an approximate reciprocal of the square root of the low single-precision floating-point value in the source operand" (emphasis mine). A simple test program:

#include <stdio.h>
#include <string.h>
int main(void) {
  float in = 256;
  float out;
  unsigned int raw;
  asm ("rsqrtss %1,%0" : "=x"(out) : "x"(in));
  memcpy(&raw, &out, 4);
  printf("out = %x, float = %f\n", raw, out);
  return 0;
}
On Intel Skylake I get
out = 3d7ff000, float = 0.062485
On AMD Rome I get
out = 3d7ff800, float = 0.062492
Intel's result just stays within the documented 1.5 x 2-12 relative error bound. (Seems unfortunate given that the exact reciprocal square root of 256 is so easily computed to 0.0625, but whatever...)

The net effect of this is that rr recordings captured on Zen that use RSQRTSS may not replay correctly on Intel machines. The instructions will execute fine but it's possible that the slight differences in results may later lead to diverging control flow which break the rr recording. We have seen this in practice with a Pernosco user.

I have some ideas about how to fix this for Pernosco. If they work that'll be fodder for another post.

Update For what it's worth, the same issue exists with RCPSS (and presumably the SIMD versions (V)RCPPS and (V)RSQRTPS). Intel also has a number of new approximate-arithmetic instructions in AVX512, but has published software reference implementations of those, so hopefully if AMD does ever implement them Zen will match those. I'm not (yet) aware of any other non-AVX512 "approximate" instructions.