Tuesday, 10 April 2018

Payment Express's "Account2Account" Is Bad For Security

Today I discovered that an Australian company called Payment Express has started offering, in addition to credit-card payment processing, a feature called "Account2Account". With this feature, customers enter their online banking credentials into Payment Express' Web site which then performs a payment transaction on the customer's behalf. This is insane and I don't know why banks allow it.

The security FAQ presented to customers (which I can't find a public URL for) emphasizes that Payment Express does not store the customer's credentials or other information. Good, but the problem is that even if customers can completely trust Payment Express (and I don't know why they should; Payment Express' terms and conditions disclaim all liability), any workflow that trains customers to enter their banking credentials into a Web site other than their bank's site makes them vulnerable to phishing attacks. Online banking phishing attacks are ubiquitous. Even worse, Payment Express advertises "Payment page look and feel customisable via an online wizard", which suggests that the appearance of pages presented to customers can vary, making those customers even more vulnerable to phishing. Payment Express doesn't even use an EV certificate.

Maybe banks allow this because they get a bigger cut of the transactions than they do with a credit-card transaction, enough to cover the cost to them of any increase in phishing losses? If so I don't know how they're calculating the cost to our entire social ecosystem of people being trained to enter critical login credentials into random Web sites.

Thursday, 29 March 2018

Speeding Up `dwarfdump` With Rust

Writing a debugger for C++ on Linux, you spend a lot of time examining pretty-printed DWARF debug information using tools like readelf, objdump or dwarfdump. Unfortunately this can be quite slow. For Firefox's libxul.so, dwarfdump's pretty-printed output of just the main .debug_info is 25GiB. The standard objdump and readelf tools take about three minutes to print it to /dev/null.

The Rust gimli crate includes a Rust-implemented version of dwarfdump. Initially it took about eight minutes to dump libxul, although the comparison is unfair because dwarfdump dumps more data than readelf. I decided to try to speed dwarfdump up. TL;DR: I reduced the dump time from 506s to 26s by fixing some simple issues and taking advantage of Rust "fearless parallelism". I think there are interesting opportunities for speeding up many kinds of command-line tools using Rust and parallelism.

Initially gimli dwarfdump was using println! to print every line of output. Every println! call temporarily locks the Rust stdout stream and performs a write system call, even when redirected to a file. This is extremely inefficient; it's much more efficient to buffer output so multiple lines are written with each system call. Since dwarfdump only uses one thread, it can take the lock at the start and hold it permanently. These changes reduced the run time to 215s.
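A minimal sketch of the lock-once-and-buffer pattern, using only the standard library. This is not gimli's actual code: `dump_lines` and `dump_to_stdout` are hypothetical names chosen for illustration.

```rust
use std::io::{self, BufWriter, Write};

// Hypothetical stand-in for the dumping code: writes one line per entry
// to whatever sink it is given.
fn dump_lines<W: Write>(out: &mut W, lines: &[&str]) -> io::Result<()> {
    for line in lines {
        // On a buffered writer this usually just appends to memory;
        // a write system call only happens when the buffer fills up.
        writeln!(out, "{}", line)?;
    }
    Ok(())
}

// Takes the stdout lock once, wraps it in a buffer, and flushes at the end.
fn dump_to_stdout(lines: &[&str]) -> io::Result<()> {
    let stdout = io::stdout();
    let mut out = BufWriter::new(stdout.lock());
    dump_lines(&mut out, lines)?;
    out.flush()
}
```

Calling `dump_to_stdout` from `main` replaces a lock and a write per line with one lock for the whole run and one write per full buffer.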

I did some profiling to look for hotspots amenable to micro-optimizations. I found that we spent a lot of CPU time in Rust's formatted-string padding code. dwarfdump used right-padding formatting of an empty string ("{:indent$}") to insert a specified number of spaces to indent each line. This is quite slow because the padding implementation writes the padded value to a temporary buffer, measures the length of the resulting string, and then writes the buffered value and the padding bytes. I changed dwarfdump to indent by printing slices from a large string of spaces, reducing the run time to 142s. I also found that where some non-empty string tokens were being right-padded, we could speed things up by emitting the padding as a string slice, taking advantage of the fact that these are static strings whose lengths we know. That reduced the run time to 99s.
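The slicing trick can be sketched roughly like this. The buffer size of 64 and the names `SPACES`, `write_indent` and `write_padded` are assumptions made for the example, not what gimli uses.

```rust
use std::io::{self, Write};

// A fixed run of spaces to slice from (the size is an arbitrary choice here).
static SPACES: [u8; 64] = [b' '; 64];

// Writes `n` spaces by emitting slices of SPACES, with no formatting machinery.
fn write_indent<W: Write>(w: &mut W, mut n: usize) -> io::Result<()> {
    while n > 0 {
        let chunk = n.min(SPACES.len());
        w.write_all(&SPACES[..chunk])?;
        n -= chunk;
    }
    Ok(())
}

// Right-pads a static ASCII token to `width` columns. The byte length of the
// token is used as its column width, which is only valid for ASCII text.
fn write_padded<W: Write>(w: &mut W, token: &'static str, width: usize) -> io::Result<()> {
    w.write_all(token.as_bytes())?;
    write_indent(w, width.saturating_sub(token.len()))
}
```

Tokens longer than `width` are written unpadded, which matches what `{:width$}` does for strings.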

There are probably opportunities to improve the performance of Rust formatted text output by combining static knowledge of the format string and the types and other details of the parameters.

I then turned to parallelism. dwarfdump prints the contents of .debug_info by looping over each compilation unit, and this is easy to do in parallel — in Rust, at least. In Rust, a method that takes an immutable &self reference guarantees it can be called safely on the same object from multiple threads, so it's evident from a library's API types whether and how it can be used with multiple threads, and the compiler checks your usage. Better still, since immutability is the default, in practice Rust libraries tend to work well with multithreading, and gimli is no exception.

One important detail is that we want to output the results for compilation units in the correct order: the output for a compilation unit has to be buffered and printed to stdout after the output for all previous compilation units. You don't want to buffer the output of too many compilation units at once; each of those buffers can be tens of megabytes. I created a parallel-output utility function to handle that. It assigns compilation units to a fixed number of worker threads (N) in a strict round-robin order and ensures that a worker thread doesn't start working on a new compilation unit until the results for its previous compilation unit have been printed. Thus at most N output buffers are required. With eight worker threads (for my quad-core, eight-hardware-threads Skylake laptop) this reduced the run time to 26s.
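Here is one way such a utility could look using only the standard library (it needs `std::thread::scope`, so Rust 1.63 or later). It is a sketch under my own assumptions, not the function that went into gimli: `parallel_output` and `render` are hypothetical names, and a zero-capacity channel per worker stands in for whatever synchronization the real code uses.

```rust
use std::io::{self, Write};
use std::sync::mpsc::sync_channel;
use std::thread;

// Renders units 0..num_units on `num_workers` threads and writes the results
// to `out` in unit order. `render(unit, buf)` fills `buf` for one unit.
fn parallel_output<W, F>(
    out: &mut W,
    num_units: usize,
    num_workers: usize,
    render: F,
) -> io::Result<()>
where
    W: Write,
    F: Fn(usize, &mut Vec<u8>) + Sync,
{
    assert!(num_workers > 0);
    let render = &render;
    thread::scope(|scope| -> io::Result<()> {
        let mut receivers = Vec::with_capacity(num_workers);
        for worker in 0..num_workers {
            // Capacity 0 makes send() block until the printer takes the
            // buffer, so a worker never runs ahead by more than one unit.
            let (tx, rx) = sync_channel::<Vec<u8>>(0);
            receivers.push(rx);
            scope.spawn(move || {
                // Strict round-robin: worker k handles units k, k+N, k+2N, ...
                for unit in (worker..num_units).step_by(num_workers) {
                    let mut buf = Vec::new();
                    render(unit, &mut buf);
                    if tx.send(buf).is_err() {
                        break; // the printer stopped early, e.g. on a write error
                    }
                }
            });
        }
        // The calling thread acts as the printer and emits units in order.
        for unit in 0..num_units {
            let buf = receivers[unit % num_workers]
                .recv()
                .expect("worker exited early");
            out.write_all(&buf)?;
        }
        Ok(())
    })
}
```

In this version each worker holds at most one finished buffer while it waits, and the printer holds one more while writing, so roughly N + 1 buffers exist at any time.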

Even with a single worker thread, run time dropped to 77s; some of the improvement actually came from changing writes from a BufWriter to a Vec<u8>. That may be due to io::Result error propagation and checking. Two threads give 47s, and four threads give 30s. Exceeding the physical cores provides negligible returns.

At peak performance gimli dwarfdump produces 1GB of output per second. It's interesting that even for relatively simple pretty-printing and such high data volume, serialized large writes to stdout are not the bottleneck. This suggests that even simple Unixy stdio-pipeline tools might benefit from internal parallelism.

Achieving this performance by improving existing C tools would have been a lot more difficult than with gimli and Rust. There's no way to be sure that the relevant DWARF processing code, e.g. binutils/dwarf.c, is safe for use on multiple threads. (Given the presence of many unadorned static variables, it probably isn't.) Efficiently switching output backends would have been more difficult than in Rust, where it's idiomatic to parameterize output code on the static type of the output backend, e.g. fn dump_info<W: Writer>(w: &mut W, ...).

Tuesday, 20 March 2018

Too Many DWARF Packaging Options

There are too many ways in which DWARF debuginfo can be packaged, and it makes building DWARF-consuming tools a nightmare.

Debuginfo can be packaged into the executable file, or into a separate external debuginfo file. This debuginfo file can be referenced by a build id stored in the .note.gnu-build-id section, or via a filename stored in the .gnu_debuglink section.

Either of those kinds of debuginfo file can use "split DWARF". With "split DWARF", most of the debuginfo is not present in the file itself. Instead most of the debuginfo is left in the object files that were input to the linker, and the debuginfo binary references those files. Unfortunately there are two flavours of this scheme: the DWARF5 standard "DWO" and the GNU variant "DWZ", and they are different. You can merge the debuginfo from multiple DWO/DWZ files into a single file, although I don't think that adds to the complexity of debuginfo-consuming tools.

On top of that, there is also "multi-file DWZ". This lets you extract DWARF data that's common to multiple binaries into a single "alternative debuginfo" file referenced by multiple binaries via a .gnu_debugaltlink section. This has been standardized as "supplementary object files". Fedora is using the GNU variant in its debuginfo packages already.

You can optionally gzip-compress individual ELF sections in any of those files.

These techniques could, in theory, be combined. Fedora uses external debuginfo files with multi-file DWZ. I think you could have a binary that uses both "supplementary object files" and "split DWARF". I think the DWARF spec rules out a "supplementary object file" itself using "split DWARF", or vice versa, but it's a bit vague.

It's unfortunate that "supplementary object files" and "split DWARF" are different.

It's also very unfortunate that many new DWARF features exist in a GNU flavour and a standard flavour. In practice DWARF-consuming tools need to support both, which is extra work. The DWARF community could learn from the painfully-won wisdom of the Web standards community.

I apologise if I got any of the above wrong. It's complicated and confusing, and documentation is scattered or in some cases nonexistent.

Update: Turns out that the ELF per-section compression is more complex than I knew. Any section may be compressed with SHF_COMPRESSED, in which case it starts with an Elf32_External_Chdr or Elf64_External_Chdr. Some Fedora packages use this. However, you can also have compressed sections containing a different sort of header, "ZLIB" followed by an 8-byte big-endian uncompressed size.
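To make the second header layout concrete, here is a small sketch that recognizes it and splits off the payload. It only follows the description above ("ZLIB" magic, then an 8-byte big-endian size); `parse_zlib_header` is a hypothetical name, and actually inflating the payload would need a zlib implementation that is not shown.

```rust
// Splits a section that starts with the "ZLIB" header into
// (uncompressed size, compressed payload). Returns None if the magic is
// missing or the section is too short to hold the 12-byte header.
fn parse_zlib_header(section: &[u8]) -> Option<(u64, &[u8])> {
    if section.len() < 12 || !section.starts_with(b"ZLIB") {
        return None;
    }
    // Bytes 4..12 hold the uncompressed size in big-endian order.
    let mut size_bytes = [0u8; 8];
    size_bytes.copy_from_slice(&section[4..12]);
    Some((u64::from_be_bytes(size_bytes), &section[12..]))
}
```

A consumer would try this only on sections that are not flagged SHF_COMPRESSED, and treat a `None` result as "not compressed this way".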

Also, it appears the GNU tools will accept .zdebug variants of every .debug. The "z" is supposed to indicate compression, but nothing seems to require that .zdebug be compressed or .debug uncompressed.

Tuesday, 6 March 2018

"Zach": AI Fraud In Christchurch

David Farrier's story is amazing. Go and read it.

I have no doubt whatsoever that this is all a ridiculous fraud. Apart from the implausibility of it all, the purported technical details make no sense. Serious AI outfits use racks of cheap servers, not expensive supercomputers like the Cray XC50. The XC50 is cooled with water, not liquid nitrogen. Training an AI by sending it emails in English would only work if it has already achieved human-level intelligence, in which case a) impressive effort by Albi Whale, purported "boy genius" b) why bother taking the time of medical professionals to train it, when you could be learning much faster from Wikipedia and other resources on the Internet c) why are you applying the pinnacle of mankind's technological achievement to transcribing medical records in Christchurch?

David Whale's emails to Farrier are full of bluster: they read like someone who doesn't know much about computers trying to impress someone whom he thinks also doesn't know much. He's not telling the truth. Robert Seddon-Smith and John Pickering are either active fraudsters or victims.

Most concerning is that it sounds like government health boards are either on the verge of funneling funds to this fraud, or are already doing so. That needs to be stopped.

Friday, 2 March 2018

Tongariro Northern Circuit #2

A couple of weeks ago I went with a friend to do the Tongariro Northern Circuit for the second time in less than a year. The weather wasn't quite as good as last time but we had another great trip.

We drove down to the mountains, stopping at Orakei Korako on the way — a pretty good thermal area. We spent Friday night in Taupo so we could arrive at the first hut, Mangatepopo, in good time on Saturday. We had a swim in Lake Taupo — surprisingly warm. On Saturday morning we drove to Whakapapa with a stop for a two-hour walk around Lake Rotopounamu — lovely and peaceful.

We knew Cyclone Gita was scheduled to hit New Zealand on Tuesday, the planned fourth day of our walk, so I checked at the DoC office at the start of our walk in Whakapapa. They advised us to just go ahead, that if necessary on the fourth day we could walk out to the Desert Road in 1.5 hours instead of returning to Whakapapa over the exposed saddle between Ruapehu and Tongariro.

The first day to Mangatepopo Hut was a bit rainy and the track is in poor condition ... perhaps the worst one-day section of track in the entire Great Walks system. Nevertheless the area is full of pretty wildflowers this time of year, and we got to Mangatepopo in good time, just after 3pm. Having just two in our group — instead of ten on my last tramp! — meant we talked to a lot more people, most of whom seemed to be Americans for some reason. We had a great time, made popcorn and shared it out, taught San Juan to a few people, and got to know some trampers we'd see a lot of for the rest of the walk.

Sunday's walk across Tongariro to Oturere Hut was busy as expected, as we shared the track with hundreds of people doing the Tongariro Crossing that day. Nevertheless the landscape never ceases to amaze. Oturere is quite cramped (we heard it's scheduled for an upgrade in 2019) but we had another great time: more San Juan, more popcorn, and finally a clear view of Ngauruhoe.

On Monday morning at Oturere I overheard one man propounding "religious people think they have all the answers ... must be comforting (to be so ignorant)". That's totally contrary to my experience. Who contemplates the mystery of the Trinity, or their own sin, for example, and comes away thinking they have all the answers? I resisted interjecting.

The forecast for Tuesday was that with Cyclone Gita approaching we'd have 100+km/hour winds and heavy rain. That would be a bad combination for a four-hour walk in very exposed terrain between Ruapehu and Ngauruhoe. We decided to accelerate plans and complete Tuesday's leg on Monday. So, we had a two-and-a-half hour walk to Waihohonu hut (still the best hut in New Zealand AFAIK) and after a half-hour break carried on all the way back to Whakapapa (five and a half hours) with short side trips to Lower Tama Lake and Taranaki Falls. Rain was forecast for the afternoon but we only had a light sprinkle before we arrived at Whakapapa around 5pm. We drove all the way back to Auckland that night without difficulty. All in all, another excellent tramp.

Sunday, 21 January 2018

Neal Stephenson's "Seveneves" (Mild Spoilers)

There's much discussion of orbital mechanics, disguised as a story. The rest isn't as good.

OK, actually I rather enjoyed it, but only because I'm a sucker for apocalyptic fiction and hard-ish science, and I gave immense credit for the chutzpah of his opening sentence, in which the moon explodes for no reason.

I found his treatment of religion more annoying than usual for sci-fi. His atheist wish-fulfillment fantasy "then everyone realized there's no God" is par for the course. Projecting thousands of years of human development without belief in God recurring, and with no other apparent solution to the meaning of life, is sloppy but also usual. What really grates is the ending, which reveals that — surprise! — people do care about having a supernatural purpose and, oddly, a powerful cabal has found one but they're keeping it secret. It reminded me of Contact where after relentlessly bashing religious rubes, at the very end Sagan reveals that the universe has been designed by, if not God, something seriously God-like. I find their lack of faith in lack of faith disturbing.

Wednesday, 17 January 2018

Long-Term Consequences Of Spectre And Its Mitigations

The dust is settling on the initial wave of responses to Spectre and Meltdown. Meltdown was relatively simple to deal with; we can consider it fixed. Spectre is much more difficult and has far-reaching consequences for the software ecosystem.

The community is treating Spectre as two different issues, "variant 1" involving code speculatively executed after a conditional branch, and "variant 2" involving code speculatively executed via an indirect branch whose predicted destination is attacker-controlled. I wish these had better names, but c'est la vie.

Spectre variant 1 mitigations

Proposals for mitigating variant 1 have emerged from Webkit, the Linux kernel, and Microsoft. The former two propose similar ideas: masking array indices so that even speculative array loads can't load out-of-bounds. MSVC takes a different approach, introducing LFENCE instructions to block speculative execution when the load address appears to be guarded by a range check. Unfortunately Microsoft says

"It is important to note that there are limits to the analysis that MSVC and compilers in general can perform when attempting to identify instances of variant 1. As such, there is no guarantee that all possible instances of variant 1 will be instrumented under /Qspectre."

This seems to be a great weakness, as developers won't know whether this mitigation is actually effective on their code.

The Webkit and Linux kernel approaches have the virtue of being predictable, but at the cost of requiring manual code changes. The fundamental problem is that in C/C++ the compiler generally does not know with certainty the array length associated with an array lookup, thus the masking code must be introduced manually. Webkit goes further and adds protection against speculative loads guarded by dynamic type checks, but again this must be done manually in many cases since C/C++ have no built-in tagged union type.
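To illustrate the index-masking idea, here is the arithmetic in Rust, modeled on the Linux kernel's generic array_index_mask_nospec helper. Treat it as a demonstration of the bit trick only: `index_mask` and `load_masked` are hypothetical names, and an optimizing compiler is free to notice that the mask is redundant after the bounds check and remove it, so this code by itself is not a reliable mitigation.

```rust
// Returns all ones when `index < len` and zero otherwise, without branching.
// Assumes `len <= isize::MAX as usize`, which holds for any Rust slice of u8.
fn index_mask(index: usize, len: usize) -> usize {
    // The top bit of `v` is set exactly when `index >= len` (the subtraction
    // wraps) or when `index` itself is huge.
    let v = index | len.wrapping_sub(1).wrapping_sub(index);
    // Invert, then arithmetic-shift the sign bit across the whole word.
    (!(v as isize) >> (usize::BITS - 1)) as usize
}

// Bounds-checked load that also masks the index, so that a mispredicted
// branch would see index 0 rather than an out-of-bounds index.
fn load_masked(table: &[u8], index: usize) -> Option<u8> {
    if index < table.len() {
        let safe = index & index_mask(index, table.len());
        Some(table[safe])
    } else {
        None
    }
}
```

A real implementation would have to hide the mask from the optimizer, for example with inline assembly, which is what makes doing this by hand so tedious.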

I think "safe" languages like Rust should generalize the idea behind Webkit's mitigations: require that speculatively executed code adhere to the memory safety constraints imposed by the type system. This would make Spectre variant 1 a lot harder to exploit. It would subsume every variant 1 mitigation I've seen so far, and could be automatic for safe code. Unsafe Rust code would need to be updated.

Having said that, there could be variant-1 attacks that don't circumvent the type system, that none of these mitigations would block. Consider a browser running JS code:

  let x = bigArray[iframeElem.contentWindow.someProperty];

Conceivably that could get compiled to some mix of JIT code and C++ that does

  if (iframeElemOrigin == selfDocumentOrigin) {
    index = ... get someProperty ...
    x = bigArray[index];
  } else {
    ... error ...
  }

The speculatively executed code violates no type system invariants, but could leak the value of the property across origins. This example suggests that complete protection against Spectre variant 1 will require draconian mitigations, either pervasive and expensive code instrumentation or deep (and probably error-prone) analysis.

Spectre variant 2 mitigations

There are two approaches here. One is microcode and silicon changes to CPUs to enable flushing and/or disabling of indirect branch predictors. The other is "retpolines" — replace indirect branches with an instruction sequence that doesn't trigger the indirect branch predictor. (More precisely, that doesn't use the BTB; the RSB prediction is used instead, but its prediction is directed to a safe destination address.) Apparently the Linux community is advising all compilers and assembly writers to avoid all indirect branches on Intel even in user-space. This means, for example, that we should update rr's handwritten assembly to avoid indirect branches. On the other hand, Microsoft is not giving such advice and apparently is not planning to introduce retpoline support in MSVC. I don't know why this difference is occurring, but it seems like a problem.

Assuming the Linux community advice is followed, things get even more complicated. Future CPUs can be secure against variant 2 without requiring retpolines. We will want to avoid retpolines on those CPUs for performance reasons. Also, Intel's future CET control-flow-integrity hardware will not work with retpolines, so we'll want to turn retpolines off for security! So software will need to determine at run-time whether retpolines should be used. JITs and handwritten assembly will need to add code to do that. This is going to be a burden on lots of software developers for a very long time.

Security/performance tradeoffs

There is now a significant performance penalty for running untrusted code. If you know for sure there is no malicious code running in your (virtual) machine you can turn off these mitigations and get significant performance wins. This wasn't really true before. (Unikernels reaped some performance benefits but created too many other problems to be generally useful.) Inventorying the entire collection of software running in your VM to verify that it's all trusted may be difficult in practice and reduces defense-in-depth ... but no doubt people will be tempted to do it.

We could see increased interest in source-based distributions like Gentoo. Recompiling your software stack to include just the mitigations that you need could bring performance benefits.

Javascript implications

The isolation boundary between Javascript and a browser content process' native code is not visible to the CPU, which makes hardware mitigations difficult to use for JS (and for any other system running code in the same process with different levels of trust). It's hard to say what the immediate implications of this are, but I think it makes "one site per process" policies in browsers more appealing in the long term, at least as an option to deploy in case some future difficult-to-mitigate vulnerability hits. Right now browsers are trying to keep the problem manageable by making it difficult for JS to extract information from the timing channel (by limiting timer resolution and disabling features like SharedArrayBuffer that can be used to implement high-resolution timers), but this unfortunately limits the power of Web applications compared to native applications. For example, as long as SharedArrayBuffer stays disabled we can't run idiomatic parallel Rust code in browsers via WebAssembly :-(. Also I suspect in the medium term attackers will find other ways to read the timing channel that will be less feasible to disable.

I think it would be a grave mistake to simply give up on mixing code with different trust labels in the same address space. Apart from having to redesign a lot of software, that would set a hard lower bound on the cost of transitioning between trust zones. It would be much better if hardware mitigations can be designed to be usable within a single address space.

Other attacks

Perhaps the biggest question is whether we are seeing just the start of a flood of serious attacks based on Spectre-like ideas. I think it's entirely possible, and if so, then dealing with those attacks piecemeal as they surface is going to be incredibly expensive and painful. There is even a possibility that the cost of mitigations will compound as mitigations interfere with one another and fewer and fewer people are capable of understanding what's going on. Therefore I hope and pray that people in positions of power — CPU vendors, big software vendors, etc — work together to come up with comprehensive, preventative fixes that simply rule out these classes of attacks, and don't let themselves be entirely consumed by demands for immediate responses to zero-day vulnerabilities. I applaud the sentiment of RISC-V's statement to this end, self-serving as it is.