About this document
On Aug 15, three new processor design issues (from the Spectre-NG family) were published. They are collectively called Level 1 Terminal Fault (L1TF) or ForeShadow issues. These flaws affect Intel CPUs used in the Open Telekom Cloud and are a severe concern.
The flaws are present in Intel processors, where data is speculatively loaded and used if it is present in the CPU core's Level 1 data (L1d) cache, even though the Page Table Entry (PTE) that controls access rights indicates that the page is not present or requires other special access privileges.
As with all the other Spectre family members, the speculatively loaded data can then be used for address calculations that cause data-dependent cache loads, which can be detected using timing attacks.
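The data-dependent cache load and its detection by timing can be illustrated with a toy simulation. This is only a sketch under stated assumptions: the cache model, the latency constants and the names `ToyCache`, `speculative_gadget` and `recover_byte` are invented for illustration, and nothing here performs a real attack or touches real hardware state.

```python
import random

CACHE_HIT_NS = 40      # assumed latency of a cached load (illustrative only)
CACHE_MISS_NS = 200    # assumed latency of an uncached load (illustrative only)
THRESHOLD_NS = 100     # anything faster than this is classified as a hit


class ToyCache:
    """Tracks which probe-array slots are cached; nothing else is modelled."""

    def __init__(self):
        self.cached = set()

    def flush(self):
        self.cached.clear()

    def load(self, slot):
        """Load a slot and return a simulated access time in nanoseconds."""
        latency = CACHE_HIT_NS if slot in self.cached else CACHE_MISS_NS
        self.cached.add(slot)
        return latency + random.randint(0, 20)  # small timing noise


def speculative_gadget(cache, secret_byte):
    """Stand-in for the transient code: the secret selects which slot is loaded."""
    cache.load(secret_byte)


def recover_byte(cache):
    """Time all 256 slots once and return the one that was already cached."""
    timings = {slot: cache.load(slot) for slot in range(256)}
    fastest = min(timings, key=timings.get)
    return fastest if timings[fastest] < THRESHOLD_NS else None


cache = ToyCache()
cache.flush()
speculative_gadget(cache, secret_byte=0x2A)
print(recover_byte(cache))  # prints 42
```

The secret never leaves the gadget as a value; it is only visible through which slot loads quickly afterwards, which is the essence of the timing side channel.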
As protections are not observed, these flaws can be used to retrieve data from different and potentially higher privileged security domains (CVE-2018-3620), escaping virtual machine restrictions (CVE-2018-3646) and even Intel's Secure Enclave extensions (SGX, CVE-2018-3615).
Like the other Spectre family members, these flaws require that an attacker is able to run skillfully crafted code on the attacked system; the attacker can then extract secrets from that system (an Information Disclosure issue). The flaws cannot be used to manipulate the system directly, except of course by abusing the gained secrets to overcome security protections.
As the L1 cache is local to a CPU core, the attacker needs to run the attack on the same core (same or sibling hyperthread) as a process that accesses the sensitive memory and causes it to be loaded into L1 cache either in parallel (hyperthreading) or after a context switch (which in general does not clear the cache).
The SGX flaw ("Foreshadow", CVE-2018-3615) was uncovered in January by international research teams (one from KU Leuven and, independently, one from Technion, U Michigan and U Adelaide) and reported to Intel, whose researchers then uncovered the similar attack scenarios CVE-2018-3620 and CVE-2018-3646 ("Foreshadow NG"). The flaw was presented at the USENIX Security 2018 conference and is described by Intel in SA00161 and in an Intel whitepaper. The ForeShadow web page set up by the researchers also has very good information.
According to the researchers, only out-of-order CPUs from Intel are affected, though some other CPU vendors are still double-checking that none of their CPUs can be exploited in a similar way.
With commit 958f338, Linus Torvalds committed mitigation mechanisms to the upstream Linux kernel. They are thus part of 4.19-rc1, and back-ports are part of the 4.18.1, 4.17.15, 4.14.63, 4.9.120 and 4.4.148 Linux kernels.
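As a rough aid, the version list above can be turned into a small check. This is a minimal sketch: the function name `has_upstream_l1tf_fix` is hypothetical, and the check only applies to upstream (kernel.org) version numbers. Distribution kernels commonly carry back-ports under different version strings, so the vendor advisories remain the authoritative source for those.

```python
import re

# Minimum patch level per stable series, taken from the list above.
FIXED_STABLE = {
    (4, 18): 1,
    (4, 17): 15,
    (4, 14): 63,
    (4, 9): 120,
    (4, 4): 148,
}
FIRST_FIXED_MAINLINE = (4, 19)


def has_upstream_l1tf_fix(release):
    """Return True/False for listed series, None when the series is not listed."""
    match = re.match(r"(\d+)\.(\d+)(?:\.(\d+))?", release)
    if not match:
        raise ValueError(f"unrecognised kernel release: {release!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    patch = int(match.group(3) or 0)
    if (major, minor) >= FIRST_FIXED_MAINLINE:
        return True
    if (major, minor) in FIXED_STABLE:
        return patch >= FIXED_STABLE[(major, minor)]
    return None  # series not covered by the list above


for release in ("4.19-rc1", "4.14.63", "4.14.62", "4.15.0"):
    print(release, has_upstream_l1tf_fix(release))
```

A result of `None` means the list makes no statement about that series; it should not be read as either "fixed" or "vulnerable".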
The following precautions have been taken:
- The Linux kernel mangles the PTE in a special way ("PTE inversion") for non-present (or otherwise protected) memory pages pointing to invalid physical addresses, causing the L1TF flaw not to trigger.
- The KVM hypervisor flushes the L1d cache if needed on entering a virtual machine. This is done using a new l1d_flush CPU feature as enabled by Intel's recent (already published) microcode if available and otherwise by just pruning the cache by loading the appropriate amount of data into it.
- The ability to control the L1d flushing policy for the hypervisor by boot-time parameters but also at runtime.
- The ability to disable sibling CPU threads (hyperthreading) at boot time but also dynamically at runtime for the kernel and the KVM hypervisor.
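The idea behind the PTE inversion mentioned in the list above can be sketched with plain bit arithmetic. This is a simplified model, not the kernel's code: the constants (`PHYS_BITS`, the `PAGE_PROTNONE` bit position) and the function names are assumptions chosen for illustration. The point it shows is that a non-present entry stores its page frame number with all address bits flipped, so the raw address a speculating CPU would see points high up in the physical address space rather than at real, possibly cached, memory.

```python
PAGE_PRESENT = 1 << 0
PAGE_PROTNONE = 1 << 8   # assumed flag bit for a protected, non-present page
PAGE_SHIFT = 12
PHYS_BITS = 46           # assumed physical address width (illustrative)
PFN_MASK = ((1 << PHYS_BITS) - 1) & ~((1 << PAGE_SHIFT) - 1)


def needs_invert(pte):
    """Non-zero entries without the present bit store an inverted address."""
    return pte != 0 and not (pte & PAGE_PRESENT)


def make_pte(pfn, flags):
    """Build an entry; flip the address bits if the entry is not present."""
    pte = ((pfn << PAGE_SHIFT) & PFN_MASK) | flags
    if needs_invert(pte):
        pte ^= PFN_MASK
    return pte


def pte_pfn(pte):
    """Recover the page frame number the software meant to store."""
    addr = pte & PFN_MASK
    if needs_invert(pte):
        addr ^= PFN_MASK
    return addr >> PAGE_SHIFT


protected = make_pte(0x1234, PAGE_PROTNONE)
print(hex(protected & PFN_MASK))  # raw address bits: inverted, far from 0x1234000
print(hex(pte_pfn(protected)))    # software still reads back 0x1234
```

Present entries are left untouched, so the scheme changes nothing for normal page table walks.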
The Xen hypervisor has received similar patches as KVM, see XSA-273.
With the kernel and hypervisor patches in place, the flaw is considered fully mitigated if simultaneous multithreading/hyperthreading (SMT/HT) is disabled (or not present or not used in the first place) or if VM guests are forced to use shadow page tables (which the hypervisor thus controls).
There is no performance data available currently; while the performance impact of the PTE mangling is expected to not be measurable, the L1d flushing will have an impact on workloads with many VM switches, especially on CPUs without support for
Switching off hyperthreading will negate the positive effect of hyperthreading which typically increases the compute capacity by ~20%.
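To put the ~20% figure into perspective, a back-of-the-envelope calculation helps. This is only a sketch; the function name is hypothetical and the 20% gain is the typical value quoted above, not a measurement.

```python
def capacity_after_disabling_smt(smt_gain=0.20):
    """Fraction of the SMT-enabled compute capacity that remains with SMT off."""
    return 1.0 / (1.0 + smt_gain)


remaining = capacity_after_disabling_smt()
print(f"remaining: {remaining:.1%}, lost: {1 - remaining:.1%}")
# prints "remaining: 83.3%, lost: 16.7%"
```

Under this assumption a host loses roughly one sixth of its compute capacity, while the number of schedulable hardware threads is halved.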
We are still discussing whether we need to switch off HT on all flavors or force shadow page-table usage (disabling nested paging aka EPT).
There are some discussions on ways to mitigate the issue without switching off hyperthreading. The conceptually simplest approach would be to ensure that two sibling threads are never used by two different VMs, or at least not by VMs belonging to different customer projects. This would require CPU pinning and tweaks to oversubscription; in addition, the hypervisor's IRQ handling would need to be pinned to CPUs that are not shared with untrusted VMs.
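The sibling-thread constraint can be sketched as a small placement routine. This is an illustration under stated assumptions, not a description of any scheduler in use: the function names and the VM tuple format are hypothetical, oversubscription and IRQ pinning are not modelled, and the sibling lists are assumed to come from the per-CPU `topology/thread_siblings_list` files that Linux exposes in sysfs.

```python
def parse_cpu_list(text):
    """Parse a sysfs CPU list such as '0,16' or '2-3' into a sorted list."""
    cpus = set()
    for part in text.strip().split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)


def pin_by_project(cores, vms):
    """Pin vCPUs so that the threads of one core never span two projects.

    cores: list of sibling-thread lists, e.g. [[0, 4], [1, 5]]
    vms:   list of (vm_name, project, vcpus) tuples
    Returns {vm_name: [cpu, ...]}; raises RuntimeError if cores run out.
    """
    free_cores = [list(core) for core in cores]
    spare = {}     # project -> unused sibling threads on a partly used core
    pinning = {}
    for name, project, vcpus in vms:
        cpus = []
        leftovers = spare.setdefault(project, [])
        while len(cpus) < vcpus:
            if not leftovers:
                if not free_cores:
                    raise RuntimeError(f"no free core left for {name}")
                leftovers.extend(free_cores.pop(0))
            cpus.append(leftovers.pop(0))
        pinning[name] = cpus
    return pinning


cores = [parse_cpu_list(s) for s in ("0,4", "1,5", "2,6", "3,7")]
vms = [("a1", "A", 1), ("b1", "B", 2), ("a2", "A", 1)]
print(pin_by_project(cores, vms))
# prints {'a1': [0], 'b1': [1, 5], 'a2': [4]}
```

In this example the two VMs of project A share a core, which corresponds to the weaker variant described above; handing out only whole cores per VM would give the stricter one at the price of lower utilization.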
Understanding our options and deciding the best approach will take another couple of days. In the end, we may end up disabling hyperthreading only for some of the flavors and issue advice to our customers to move to these for security-sensitive workloads.
Security advisories have been published by SUSE, RedHat, Ubuntu, Microsoft, Huawei and others. The advisories explain the issue rather well; there is also a more verbose RedHat Blog article and good coverage on LWN.
Updated kernels and hypervisors have been released or will be released within the next few days. We will ensure that the _latest public August images contain these updates with mitigation for the Intel L1TF flaws.
Please refer to the advisories to understand the boot command line and run time parameters to control L1d flushing and hyperthreading.
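On a patched Linux system, the current state of these controls can be inspected through sysfs. The following is a minimal sketch, assuming the file locations documented by the upstream kernel (`vulnerabilities/l1tf`, `smt/control`, `smt/active`, and the `vmentry_l1d_flush` parameter of the `kvm_intel` module); the helper name `read_l1tf_state` is hypothetical, and files that are absent (unpatched kernel, KVM module not loaded) are reported as `None`.

```python
from pathlib import Path

# Paths relative to the sysfs root, as documented for upstream kernels.
SYSFS_FILES = {
    "l1tf": "devices/system/cpu/vulnerabilities/l1tf",
    "smt_control": "devices/system/cpu/smt/control",
    "smt_active": "devices/system/cpu/smt/active",
    "kvm_l1d_flush": "module/kvm_intel/parameters/vmentry_l1d_flush",
}


def read_l1tf_state(sysfs_root="/sys"):
    """Return the content of each known file, or None where it is absent."""
    state = {}
    for key, rel in SYSFS_FILES.items():
        try:
            state[key] = (Path(sysfs_root) / rel).read_text().strip()
        except OSError:
            state[key] = None
    return state


if __name__ == "__main__":
    for key, value in read_l1tf_state().items():
        print(f"{key:14} {value if value is not None else '(not available)'}")
```

The corresponding boot-time switches in the upstream kernel are `l1tf=`, `nosmt` and `kvm-intel.vmentry_l1d_flush=`; the accepted values and the defaults chosen by each distribution are described in the advisories.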
In OTC, we use Intel CPUs that are affected.
We are not using Intel's secure enclave (SGX) feature, so we don't rely on additional security from SGX that would now be broken. (CVE-2018-3615)
But of course, customers rely on operating system kernel's protection between processes and kernel memory (CVE-2018-3620). Worse, the security model of our cloud depends on reliable memory isolation of virtual machines (VMs) - CVE-2018-3646.
While attacks require attackers to run specially crafted code on the attacked systems and are hard to carry out, an environment in which virtual machines of different customers share CPU resources is where the risk of such attacks is largest.
Customers using our general purpose flavors have the worst exposure; these flavors use simultaneous multithreading (SMT -- called HyperThreading HT by Intel) and several customers may end up sharing the same set of physical CPUs. The situation is a bit better on our high performance flavors, where we are not using CPU oversubscription and best on the flavors where no hyperthreading is used (h2, hl1, e2), so there can be no heavy sharing of CPU cores between different VMs. (See Mitigation below.)
According to our current assessment, customers using Dedicated Hosts (DeH) or even Bare Metal Systems are safe from leaking data to other customers due to these flaws. Of course, this is even more true for our Hybrid customers and their private OTC implementations, where all of the VMs are under their control and thus trusted.
We are currently working heavily on planning the roll-out of mitigation mechanisms. Please look for updates of this document to learn about the progress of our understanding and review the August OTC Patching document (to be published) to learn details about our patch rollout. The challenge of course will be how we go about SMT/HT, as this will be a painful tradeoff.
CPU hardware engineers have been ignoring security restrictions when speculatively executing instructions. This helps with performance, of course, but relies on micro-architectural state from that speculation not to leak. As we know since January, this is unfortunately untrue. Intel and IBM (Power) have been particularly aggressive with speculation and are thus affected by many issues; ARM and AMD seem to be affected somewhat less.
This particular issue is somewhat curious. For looking up the tag in a VIPT L1 cache, the TLB needs to be consulted -- does Intel's TLB implementation fail to notice that the present bit is cleared? What else does it miss? What speculation may we run into on a TLB miss? Allowing TLB entries for PTEs with a cleared present bit also seems like a strange idea -- when such an entry is used, it will always be misspeculation, so it's not helping performance at all, only wasting power and possibly even costing a tiny bit of performance. Maybe just an oversight? Could it be fixed with a future microcode update or CPU stepping? Only Intel knows ...
The software workarounds for these issues, somewhat supported by microcode-introduced hardware features, have all been painful and have hurt performance to various degrees. We have yet to see more fundamental approaches to fix the hardware -- involving things like stronger hardware partitioning, performing certain security checks synchronously even during speculation, and more careful cleanup after misspeculation, ...
Until that happens, we will continue to observe more ways in which speculative execution exposes architecturally protected data. So for the next years, there will be a painful trade-off between performance and best utilization (by sharing many resources) on the one hand and better security on the other, which comes at the price of lower performance (avoiding certain optimizations) and higher cost (un-sharing resources, for example by disabling hyperthreading or by avoiding shared usage of CPUs or even systems). Customers will have to understand the security sensitivity of their applications and then choose the offerings from a cloud provider that meet their needs. Cloud providers may see more demand for flavors with stronger isolation, such as flavors without hyperthreading, without oversubscription (and with pinning), or even Dedicated Hosts and Bare Metal systems or private cloud setups.