10th Jan 2018 by Kurt Garloff

Flaws

The vulnerabilities are a combination of modern CPUs doing speculative execution of CPU instructions and CPU caches exposing information on speculatively accessed memory addresses, creating a side-channel that allows slowly reading data that would not be accessible by normal means. The issues are explained in a bit more detail in the following sections.

Readers not interested in these details can skip to the next section "Spectre and Meltdown".

"Sandboxing"

All but the simplest (embedded) computer systems these days are capable of handling multiple tasks (or processes). Computers use memory management techniques to keep the tasks separated. One task cannot access the memory owned by another task, except through well-defined and security-policed interprocess communication mechanisms (such as system calls or sockets).

Unprivileged Separation

This task separation can be done entirely by a code interpreter, such as a JavaScript engine in your web browser -- when it interprets[1] code from one web site, it ensures that the JavaScript code cannot access memory belonging to another browser window by having the appropriate boundary checks in place. Getting all the checks right is not trivial, as the history of browser vulnerabilities shows.
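The idea of software-only boundary checks can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how any real JavaScript engine is built: the names `SandboxedMemory` and `SandboxViolation` are made up for this illustration, and the "memory" is just a shared `bytearray`.

```python
class SandboxViolation(Exception):
    """Raised when sandboxed code touches memory outside its own region."""


class SandboxedMemory:
    """A window [base, base + size) into a shared heap (hypothetical model)."""

    def __init__(self, heap: bytearray, base: int, size: int):
        if base < 0 or size < 0 or base + size > len(heap):
            raise ValueError("region does not fit into the heap")
        self._heap = heap
        self._base = base
        self._size = size

    def _check(self, offset: int) -> int:
        # The boundary check: every single access is validated in software.
        if not 0 <= offset < self._size:
            raise SandboxViolation(
                f"offset {offset} is outside a region of size {self._size}"
            )
        return self._base + offset

    def load(self, offset: int) -> int:
        return self._heap[self._check(offset)]

    def store(self, offset: int, value: int) -> None:
        self._heap[self._check(offset)] = value & 0xFF


# Two "web sites" share one heap but each only sees its own 16 bytes.
heap = bytearray(32)
site_a = SandboxedMemory(heap, base=0, size=16)
site_b = SandboxedMemory(heap, base=16, size=16)
site_a.store(3, 7)
site_b.store(0, 0x42)
```

With this setup, `site_a.load(16)` raises `SandboxViolation` instead of returning site B's byte. The protection is only as good as the check: forget it on one code path and the separation is gone.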

Hardware Assisted Separation

For stronger (and easier to get right) protection, sandboxing can also use the CPU features for memory management. This is how the operating system kernel separates the userspace processes from each other (and from the kernel itself): It uses the page-table permission settings that the CPU then enforces. If a process tries to access memory outside of its assigned range, an exception (trap/fault) will be raised and the kernel is invoked to handle it, typically killing the offending process.

In this model, the kernel has a higher privilege level and can thus control what the userspace processes are and are not allowed to do. Setting up the maps for the accessible memory is an important part of this. It allows multi-user multi-process systems where many users can safely use a computer without a lot of interference.
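The division of labour between kernel and CPU can be illustrated with a small simulation. It is a sketch only: `ToyMMU`, `PageFault` and `run_process` are hypothetical names, the 4096-byte page size is an assumption, and real page tables are multi-level structures walked by hardware rather than a Python dictionary.

```python
PAGE_SIZE = 4096  # assumed page size for this sketch


class PageFault(Exception):
    """Stands in for the trap the CPU raises on a forbidden access."""


class ToyMMU:
    def __init__(self):
        # page number -> set of permitted access kinds ("r" and/or "w")
        self.page_table = {}

    def map_page(self, page: int, perms: str) -> None:
        """Kernel side: grant a process access to one page."""
        self.page_table[page] = set(perms)

    def access(self, address: int, kind: str) -> None:
        """CPU side: enforce the permissions on every access."""
        page = address // PAGE_SIZE
        perms = self.page_table.get(page)
        if perms is None or kind not in perms:
            raise PageFault(f"{kind!r} access to {address:#x} not permitted")


def run_process(mmu: ToyMMU, accesses) -> str:
    """Run a list of (address, kind) accesses; 'kill' the process on a fault."""
    for address, kind in accesses:
        try:
            mmu.access(address, kind)
        except PageFault:
            return "killed"  # the kernel's typical reaction
    return "exited normally"


mmu = ToyMMU()
mmu.map_page(0, "rw")  # page 0: readable and writable
mmu.map_page(1, "r")   # page 1: read-only
print(run_process(mmu, [(10, "r"), (10, "w"), (PAGE_SIZE, "r")]))
print(run_process(mmu, [(PAGE_SIZE, "w")]))
```

The process code itself contains no checks here. The policy lives in the table that the privileged side sets up, which is why this model is easier to get right than per-access checks in an interpreter.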

Virtualization uses the same mechanism: here a piece of software (the host hypervisor) with even higher privileges than the operating system kernels of the virtual machines (VMs) controls access to the computer's hardware resources, not least by separating the physical memory between the virtual machines.

The hypervisor ensures that no VM can make memory accesses outside of its assigned region -- no VM can read or write memory in the hypervisor host or in another VM, and again CPU memory management techniques are used to enforce this.

The main memory (RAM) of current computers is very slow compared to the speed of the CPUs. While bandwidth has seen reasonable growth, the worst-case latency for randomly accessing a byte in memory can easily exceed 200 CPU cycles. This means that if the CPU needs to process data that it first has to fetch from RAM, it may be stalled for more than 200 cycles.

CPUs use caches to overcome this -- data that is often used is kept in small areas of very fast memory (cache) on the CPU, reducing the latency to a much smaller number of cycles. CPU caches are architecturally invisible to normal (non-system) code -- they just speed up things. The CPU takes care to ensure that cached data is coherent with main memory, so normal applications do not need to consider caching except when doing performance tuning.
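To get a feeling for the numbers, here is a small back-of-the-envelope sketch. The 200-cycle miss penalty is the figure mentioned above; the 4-cycle hit latency and the miss rates are assumptions picked for illustration, not measurements of any particular CPU.

```python
def average_access_cycles(hit_cycles: float, miss_rate: float, miss_penalty: float) -> float:
    """Average memory access time: every access pays the hit latency,
    and the fraction that misses additionally pays the trip to RAM."""
    if not 0.0 <= miss_rate <= 1.0:
        raise ValueError("miss_rate must be between 0 and 1")
    return hit_cycles + miss_rate * miss_penalty


# Assumed values: 4-cycle cache hit, 200-cycle penalty for going to RAM.
for miss_rate in (0.0, 0.05, 0.5, 1.0):
    cycles = average_access_cycles(4, miss_rate, 200)
    print(f"miss rate {miss_rate:4.0%}: {cycles:6.1f} cycles per access on average")
```

Even a small miss rate dominates the average, which is why CPUs invest so much effort into keeping the right data in the cache.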

The operating system kernel and the hypervisor do need to manage caches explicitly to some degree -- they sometimes need to invalidate caches or ensure write-back caches are written to main memory before passing memory access to a device (such as a network device which has direct memory access).

The CPU caches data that it recently used -- in addition, it uses heuristics to prefetch data that it expects to need in the near future. The goal is to avoid the latency of waiting for main memory (a cache miss) in most cases, so the CPU's execution units can always be kept busy with useful work. Besides the main data and instruction caches, there are also special caches that spare the memory management unit a walk through the page tables (the translation lookaside buffer, TLB) and caches that help with decoding machine instructions.

Whether or not the content of a memory address is cached can be easily observed by a normal user -- the access time is vastly different.
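How an observer can tell a cached address from an uncached one can be shown with a toy cache model. Everything here is an assumption made for illustration: the latencies (4 and 200 cycles), the 64-byte line size, the LRU replacement and the names `ToyCache` and `was_cached`. On real hardware the observer would measure wall-clock or cycle-counter time instead of reading a simulated latency.

```python
from collections import OrderedDict

HIT_CYCLES = 4     # assumed latency of a cache hit
MISS_CYCLES = 200  # assumed latency of going to RAM
LINE_SIZE = 64     # assumed cache line size in bytes


class ToyCache:
    def __init__(self, num_lines: int = 8):
        self.num_lines = num_lines
        self.lines = OrderedDict()  # cached line numbers in LRU order

    def access(self, address: int) -> int:
        """Touch an address and return the simulated latency in cycles."""
        line = address // LINE_SIZE
        if line in self.lines:
            self.lines.move_to_end(line)  # mark as most recently used
            return HIT_CYCLES
        self.lines[line] = True
        if len(self.lines) > self.num_lines:
            self.lines.popitem(last=False)  # evict the least recently used line
        return MISS_CYCLES

    def flush(self, address: int) -> None:
        """Drop the line containing this address from the cache."""
        self.lines.pop(address // LINE_SIZE, None)


def was_cached(cache: ToyCache, address: int, threshold: int = 50) -> bool:
    """Infer the cache state purely from the access time."""
    return cache.access(address) < threshold


cache = ToyCache()
cache.access(0x1000)               # someone touched this address earlier
print(was_cached(cache, 0x1000))   # fast access: it was cached
print(was_cached(cache, 0x2000))   # slow access: it was not
```

Note that the measurement is destructive: probing an address loads it into the cache, so an observer gets one reliable reading per address before having to flush or evict it again.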

Modern CPUs do many things in parallel to provide good performance. While one instruction is being executed, others are already being decoded, their addresses computed, etc. Modern CPUs have pipelines with 5 to 20 stages, each stage handling a specific aspect of instruction processing. So even simple in-order CPUs have more than one instruction in flight.

More complex and higher-performance CPUs use out-of-order (OoO) designs: these CPUs can hide the latency of a slow instruction or -- more relevant -- the very high latency of a cache miss by already starting to speculatively execute instructions that come later in the execution stream. The CPUs take care to track dependencies between the instructions (so one instruction that uses the result of another one waits for it), so to the code everything looks as if it were happening in order.

The results of speculatively executed instructions are kept in a buffer and remain invisible to the code until the CPU has completed all instructions that precede them in the instruction stream. Only then do the results become visible. This completion is called retirement -- CPU instructions are retired in order even if they have been reordered and calculated in a vastly different order behind the scenes.
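The difference between finishing and retiring can be sketched with a very small model. It assumes that all instructions are independent of each other and start at cycle 0, which real CPUs of course do not guarantee; the function name and the latencies are made up for this example.

```python
def simulate_retirement(finish_cycles):
    """finish_cycles[i] is the (assumed) cycle at which instruction i
    finishes executing, listed in program order.

    Returns (completion_order, retire_cycles): the order in which results
    become available internally, and the cycle at which each instruction
    retires, i.e. becomes visible to the program.
    """
    completion_order = sorted(range(len(finish_cycles)), key=lambda i: finish_cycles[i])
    retire_cycles = []
    latest_finish_so_far = 0
    for finish in finish_cycles:  # walk in program order
        # An instruction can retire only once it and all older ones are done.
        latest_finish_so_far = max(latest_finish_so_far, finish)
        retire_cycles.append(latest_finish_so_far)
    return completion_order, retire_cycles


# Instruction 0 is a load that misses the cache; 1 and 2 are cheap.
order, retired = simulate_retirement([200, 1, 3])
print("completion order:", order)
print("retire cycles:   ", retired)
```

In this example the two cheap instructions are done long before the load, but their results sit in the buffer until the load has completed, so the program never sees them "overtake" it.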

Well-performing CPUs do a lot of speculation -- they have enough buffers to execute dozens of instructions speculatively to avoid stalls. They predict branches (using a branch target buffer, BTB) and return addresses to increase the chance of executing the correct instructions.
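To illustrate what "predicting a branch" means, here is a sketch of a classic textbook scheme, the two-bit saturating counter. Note the assumptions: this predicts only the direction of a branch (taken or not), not its target as a BTB does, and it is not meant to describe the predictor of any specific CPU, which is far more elaborate. The class and function names are hypothetical.

```python
class TwoBitPredictor:
    """Per-branch counter in 0..3; values 2 and 3 mean 'predict taken'."""

    def __init__(self):
        self.counters = {}  # branch address -> counter value

    def predict(self, branch: int) -> bool:
        # Unknown branches start at 1 ("weakly not taken"), an arbitrary choice.
        return self.counters.get(branch, 1) >= 2

    def update(self, branch: int, taken: bool) -> None:
        counter = self.counters.get(branch, 1)
        if taken:
            self.counters[branch] = min(3, counter + 1)
        else:
            self.counters[branch] = max(0, counter - 1)


def count_mispredictions(outcomes, branch: int = 0x400) -> int:
    """Feed the real outcomes of one branch to the predictor, count misses."""
    predictor = TwoBitPredictor()
    misses = 0
    for taken in outcomes:
        if predictor.predict(branch) != taken:
            misses += 1
        predictor.update(branch, taken)
    return misses


# A loop branch: taken nine times, then falls through once at the end.
print(count_mispredictions([True] * 9 + [False]))
```

The point for what follows: a predictor learns from past behaviour, so whoever controls that past behaviour can influence what gets executed speculatively.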

Nevertheless, sometimes a processor misspeculates, for example because a branch goes a different way than predicted, or because a preceding instruction causes a fault (e.g. by accessing memory that the memory management configuration does not allow it to access). The instructions speculatively executed after the point of misspeculation should never have been executed. The CPU then performs an abort -- the results of the speculatively executed CPU instructions are undone and never become visible, just as if they had never been (speculatively) executed.

This way of doing caching and speculation to enhance performance is standard practice in CPU design and appeared to be safe -- the cache hardware (with a bit of help from the operating system and the hypervisor if present) kept the memory view coherent and the speculation was carefully undone when it had to be aborted.

That held until researchers found out that aborts on most contemporary OoO CPUs don't undo the cache effects, thus allowing code to determine whether speculatively executed code has caused certain memory to be loaded into the cache or not. This is a leak of memory address information from speculatively executed code, a side channel for reading addresses by doing timing measurements.

It is not just an address leak. In a two-step process, the processor can speculatively read a piece of data from normally inaccessible memory and then use the result to calculate an address that depends on the inaccessible data and try to access it. The cache effects of this second access are observable, allowing an attacker to recover some bits of information from the inaccessible data.
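The two-step process can be made concrete with a toy simulation. To be clear about what this is: a model of the logic, not exploit code. The class `ToyCPU`, the latencies, the 64-byte line size and the 256-line probe area are all assumptions for illustration; a real attack needs native code, a way to flush the cache and precise timers.

```python
LINE_SIZE = 64       # assumed cache line size in bytes
HIT, MISS = 4, 200   # assumed latencies in cycles


class ToyCPU:
    def __init__(self, memory, protected):
        self.memory = memory        # address -> byte value
        self.protected = protected  # addresses the program may not read
        self.cached_lines = set()

    def timed_load(self, address):
        """Normal load: returns (value, latency), faults on protected memory."""
        if address in self.protected:
            raise PermissionError(f"fault at {address:#x}")
        line = address // LINE_SIZE
        latency = HIT if line in self.cached_lines else MISS
        self.cached_lines.add(line)
        return self.memory.get(address, 0), latency

    def speculate(self, secret_address, probe_base):
        """Two dependent loads that run speculatively and are then aborted."""
        secret = self.memory[secret_address]                # step 1: transient read
        probe_address = probe_base + secret * LINE_SIZE     # address depends on the data
        self.cached_lines.add(probe_address // LINE_SIZE)   # step 2: leaves a cache trace
        # Abort: 'secret' is discarded and never handed to the program,
        # but the cache is not rolled back.
        return None


def recover_byte(cpu, secret_address, probe_base):
    cpu.cached_lines.clear()  # stand-in for flushing the probe area
    cpu.speculate(secret_address, probe_base)
    for guess in range(256):  # one probe line per possible byte value
        _, latency = cpu.timed_load(probe_base + guess * LINE_SIZE)
        if latency < 50:
            return guess      # the only fast line reveals the byte
    return None


cpu = ToyCPU(memory={0x9000: 0x5A}, protected={0x9000})
print(hex(recover_byte(cpu, 0x9000, probe_base=0x10000)))
```

A direct `cpu.timed_load(0x9000)` faults, as it should. The program never receives the secret value as a result of any instruction; it reconstructs it from which of the 256 probe lines is fast to access.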

We have thus found a mechanism to read data from protected memory.

While cache timing attacks have been performed by security researchers for years, the new idea of combining speculative execution with cache timing attacks has far more severe consequences. To our knowledge, these discoveries were made by security researchers at Graz University of Technology and other universities and by researchers at Google's Project Zero in the course of the first half of 2017. The issue was reported in June 2017 to the chip manufacturers (Intel and others), and a plan was made to address this industry-wide issue in coordination with operating system vendors. Public disclosure was planned for Jan 9.

Rather intrusive (and performance-costly) architectural changes in the Linux kernel started to trigger speculation in December 2017, and by the first week of January, security researchers had published speculations that came rather close to the real problems. This made it safer to disclose the issue publicly and start communicating and fixing, rather than give attackers a week in which they could have abused the issue.

[1] Whether or not the interpreter uses Just-In-Time (JIT) compilation to speed things up is irrelevant here.