Wednesday 17 January 2018
Long-Term Consequences Of Spectre And Its Mitigations
The dust is settling on the initial wave of responses to Spectre and Meltdown. Meltdown was relatively simple to deal with; we can consider it fixed. Spectre is much more difficult and has far-reaching consequences for the software ecosystem.
The community is treating Spectre as two different issues, "variant 1" involving code speculatively executed after a conditional branch, and "variant 2" involving code speculatively executed via an indirect branch whose predicted destination is attacker-controlled. I wish these had better names, but c'est la vie.
Spectre variant 1 mitigations
Proposals for mitigating variant 1 have emerged from Webkit, the Linux kernel, and Microsoft. The former two propose similar ideas: masking array indices so that even speculative array loads can't load out-of-bounds. MSVC takes a different approach, introducing LFENCE instructions to block speculative execution when the load address appears to be guarded by a range check. Unfortunately Microsoft says
It is important to note that there are limits to the analysis that MSVC and compilers in general can perform when attempting to identify instances of variant 1. As such, there is no guarantee that all possible instances of variant 1 will be instrumented under /Qspectre.This seems to be a great weakness, as developers won't know whether this mitigation is actually effective on their code.
The Webkit and Linux kernel approaches have the virtue of being predictable, but at the cost of requiring manual code changes. The fundamental problem is that in C/C++ the compiler generally does not know with certainty the array length associated with an array lookup, thus the masking code must be introduced manually. Webkit goes further and adds protection against speculative loads guarded by dynamic type checks, but again this must be done manually in many cases since C/C++ have no built-in tagged union type.
I think "safe" languages like Rust should generalize the idea behind Webkit's mitigations: require that speculatively executed code adhere to the memory safety constraints imposed by the type system. This would make Spectre variant 1 a lot harder to exploit. It would subsume every variant 1 mitigation I've seen so far, and could be automatic for safe code. Unsafe Rust code would need to be updated.
Having said that, there could be variant-1 attacks that don't circumvent the type system, that none of these mitigations would block. Consider a browser running JS code:
let x = bigArray[iframeElem.contentWindow.someProperty];Conceivably that could get compiled to some mix of JIT code and C++ that does
if (iframeElemOrigin == selfDocumentOrigin) { index = ... get someProperty ... x = bigArray[index]; } else { ... error ... }The speculatively executed code violates no type system invariants, but could leak the value of the property across origins. This example suggests that complete protection against Spectre variant 1 will require draconian mitigations, either pervasive and expensive code instrumentation or deep (and probably error-prone) analysis.
Spectre variant 2 mitigations
There are two approaches here. One is microcode and silicon changes to CPUs to enable flushing and/or disabling of indirect branch predictors. The other is "retpolines" — replace indirect branches with an instruction sequence that doesn't trigger the indirect branch predictor. (More precisely, that doesn't use the BTB; the RSB prediction is used instead, but its prediction is directed to a safe destination address.) Apparently the Linux community is advising all compilers and assembly writers to avoid all indirect branches on Intel even in user-space. This means, for example, that we should update rr's handwritten assembly to avoid indirect branches. On the other hand, Microsoft is not giving such advice and apparently is not planning to introduce retpoline support in MSVC. I don't know why this difference is occurring, but it seems like a problem.
Assuming the Linux community advice is followed, things get even more complicated. Future CPUs can be secure against variant 2 without requiring retpolines. We will want to avoid retpolines on those CPUs for performance reasons. Also, Intel's future CET control-flow-integrity hardware will not work with retpolines, so we'll want to turn retpolines off for security! So software will need to determine at run-time whether retpolines should be used. JITs and handwritten assembly will need to add code to do that. This is going to be a burden on lots of software developers for a very long time.
Security/performance tradeoffs
There is now a significant performance penalty for running untrusted code. If you know for sure there is no malicious code running in your (virtual) machine you can turn off these mitigations and get significant performance wins. This wasn't really true before. (Unikernels reaped some performance benefits but created too many other problems to be generally useful.) Inventorying the entire collection of software running in your VM to verify that it's all trusted may be difficult in practice and reduces defense-in-depth ... but no doubt people will be tempted to do it.
We could see increased interest in source-based distributions like Gentoo. Recompiling your software stack to include just the mitigations that you need could bring performance benefits.
Javascript implications
The isolation boundary between Javascript and a browser content process' native code is not visible to the CPU, which makes hardware mitigations difficult to use for JS — and any other system running code in the same process with different levels of trust. It's hard to say what the immediate implications of this are, but I think it makes "one site per process" policies in browsers more appealing in the long term, at least as an option to deploy in case some future difficult-to-mitigate vulnerability hits. Right now browsers are trying to keep the problem manageable by making it difficult for JS to extract information from the timing channel (by limiting timer resolution and disabling features like SharedArrayBuffer that can be used to implement high-resolution timers), but this unfortunately limits the power of Web applications compared to native applications. For example, as long as it lasts we can't run idiomatic parallel Rust code in browsers via WebAssembly :-(. Also I suspect in the medium term attackers will find other ways to read the timing channel that will be less feasible to disable.
I think it would be a grave mistake to simply give up on mixing code with different trust labels in the same address space. Apart from having to redesign lot of software, that would set a hard lower bound on the cost of transitioning between trust zones. It would be much better if hardware mitigations can be designed to be usable within a single address space.
Other attacks
Perhaps the biggest question is whether we are seeing just the start of a flood of serious attacks based on Spectre-like ideas. I think it's entirely possible, and if so, then dealing with those attacks piecemeal as they surface is going to be incredibly expensive and painful. There is even a possibility that the cost of mitigations will compound as mitigations interfere with one another and fewer and fewer people are capable of understanding what's going on. Therefore I hope and pray that people in positions of power — CPU vendors, big software vendors, etc — work together to come up with comprehensive, preventative fixes that simply rule out these classes of attacks, and don't let themselves be entirely consumed by demands for immediate responses to zero-day vulnerabilities. I applaud the sentiment of RISC-V's statement to this end, self-serving as it is.
Comments