Like other cloud providers (e.g. Amazon and Azure), we apply mitigating patches to each host. We have engineered ways to deploy most key patches without rebooting all hosts; reboots would have led to service disruptions for customer applications that may not be fully cloud-ready.
The first priority for the OTC team is to reestablish full isolation between Virtual Machines (VMs) running on the same hypervisor; according to our analysis, this is the most critical issue that needs to be addressed in the infrastructure. Here, only Spectre-2 (BTB injection) needs to be addressed.
In OTC, we use Intel's microcode patches plus a few changes to the Xen and KVM hypervisors (along the lines of IBRS) to prevent BTB poisoning attacks from leaking hypervisor memory.
We have done extensive testing over the last few days in our test and reference environments and have confirmed that the patches can be applied at runtime without a host reboot. Microcode updates can be done at runtime (though for certain CPUs and patches, loading after the OS has fully booted is not normally recommended) and our hypervisors support this. The code changes required in the hypervisors to leverage the new microcode are relatively contained; we are using the hot-patch mechanism supported by our hypervisors to deploy these without a reboot.
Testing showed that Intel's microcode updates as of 2018-01-08 cause trouble on the v4 (Broadwell) CPU generation and on the Xeon E7-8880 v3 (used for the SAPHANA flavors), while the tests on other v3 (Haswell) and v5 (Skylake) CPUs did not show any trouble. Patching the v4 systems will therefore be delayed; we expect a delay of roughly one week.
After the first round of patching with v3 and v5, we will therefore have the following situation:
|GenPurp1||c1,c2,s1,m1||Xen||2658v3 + v4||partially (v3)|
|HighPerf1||h1||Xen||2690v3 + v4||partially (v3)|
The Spectre-2 mitigation comes at a performance cost; this is workload dependent -- the most significant hit happens on workloads that have many I/O transactions per second. We expect the performance impact on the s2 flavor to be more visible, as the 10Gbps network capability also requires many packets per second. See below for more details on performance impact.
Hypervisor patch schedule
We have scheduled the mass rollout of the patches for the evening (European time) of Thu, 2018-01-11.
The customer visible schedule looks like this (all times in CET):
|1||2018-01-11 17:00||eu-de-01||Deploy microcode updates to all systems in AZ, deploy hot-patches to Xen and KVM hypervisors, test systems and validate effectiveness.|
|2||2018-01-11 21:00||eu-de-02||Deploy microcode updates to all systems in AZ, deploy hot-patches to Xen and KVM hypervisors, test systems and validate effectiveness.|
|3||2018-01-12 12:00||ap-sg-01||Deploy microcode updates to all systems in AZ, deploy hot-patches to Xen and KVM hypervisors, test systems and validate effectiveness.|
As can be seen, we do this availability zone by availability zone, first in Europe/Germany and then Asia/Singapore. We will stop the rollout if the testing fails and are prepared to do rollbacks to the last working state.
As mentioned above, this schedule excludes the hosts with v4 CPUs (Bare Metal and high-performance flavors and the newer hosts in the old Xen general purpose pool).
During the patch application, the hosts and the virtual machines continue to run with minimal interruption: the microcode update stops the CPU for a short (sub-second) amount of time -- most applications will at worst see a small performance degradation due to this. Likewise, the hypervisor hot-patching needs to synchronize state between CPUs and thus also causes a performance decrease for a sub-second period.
Customers that are extremely performance sensitive or do not trust that the live patching can be done without problems can of course migrate applications into the other availability zone or shut down VMs. Be aware that the patch application is scheduled to take at most 3 hours within each 4-hour window, so the time between 20:00 and 20:59 could be used to migrate applications into eu-de-01, although we clearly do not deem this to be necessary.
The Spectre-2 mitigation does come with a performance impact; the overhead of system calls increases somewhat. See below for some performance data.
Next steps: Guest kernels
We conducted reviews on compute hosts to identify all secrets that are stored on those systems; they will be replaced after the patches have been applied.
We recommend that customers quickly deploy new vendor kernels to also address the flaws that allow userspace processes to read memory from the guest kernel and from other processes and containers running inside the same VM. This can be achieved by online updates or by deploying new images. We have provided an almost complete set of new public images, as can be seen below. Note that even with the online update approach, the guest has to be rebooted into the new kernel to make the changes effective.
The situation for Bare Metal machines is special -- there is no hypervisor or host system that OTC operations can deploy a workaround to, as the kernel is directly under the control of the customer. The Linux images can also be used to deploy updated microcode. We have included the Intel microcode (ucode-intel, microcode_ctl, intel-microcode) in our latest images. Alternatively, microcode updates can be delivered in the BIOS/UEFI firmware -- we intend to do so first for idle hardware. For running hardware, we ask customers to tell us when they shut it down and want us to do the firmware update.
We also run some guest systems ourselves or provide special images that run inside customer VMs to provide services such as ELB, DMS, CCE, RDS, MRS, ... We are currently prioritizing the redeployment of these guests based on the risk assessment that is currently under way. Our current assessment is that we can update the services that we run without service interruption by leveraging the fact that they are not built on single instances but on clusters of machines. For service VMs that run inside the customer-controlled VMs, we will create upgrade procedures that we will ask our customers to follow.
We produce most of the public images on OTC by a fully automated continuous process with automated daily builds (and the possibility to manually trigger additional builds) and tests in our OTC ImageFactory.
We quickly published new public images as soon as updated (KPTI-enabled) kernels were published by the operating system vendors. We also included the microcode updates needed for Spectre-2 mitigation where available. Note that microcode updates cannot be done from within a guest VM, but only from the hypervisor host and on Bare Metal machines.
We observed the first kernel and microcode updates on 2018-01-04 and started publishing image updates in the night. We expect to have this mostly completed before 2018-01-11. We also included the kernel.unprivileged_bpf_disabled = 1 setting in the images that use 4.4+ kernels.
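Whether this sysctl is active inside a guest can be checked from procfs. The following is a minimal sketch, not part of our image tooling; it assumes the setting is exposed as /proc/sys/kernel/unprivileged_bpf_disabled (the mainline spelling, with a trailing "d"), and the function name is purely illustrative.

```python
from pathlib import Path

# Assumption: mainline kernels expose the sysctl under this procfs path
# (note the trailing "d" in "disabled").
SYSCTL_PATH = Path("/proc/sys/kernel/unprivileged_bpf_disabled")


def unprivileged_bpf_disabled(path=SYSCTL_PATH):
    """Return True/False for the sysctl state, or None if it is not exposed."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        # Kernel without this sysctl, or not a Linux system.
        return None
    # Any non-zero value restricts bpf() to privileged users.
    return value != "0"


if __name__ == "__main__":
    print("unprivileged bpf() disabled:", unprivileged_bpf_disabled())
```

A result of None means only that the kernel does not expose the setting; it says nothing about whether the kernel is otherwise patched.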
As of midnight (24:00 CET) 2018-01-05, we had published updated public images for CentOS-7, CentOS-6, Oracle Linux Server 7, Oracle LS 6, and SLES12 SP3. (We always use the latest minor version available, so CentOS-7 is not 7.0, but 7.4 right now and will be 7.5 as soon as this is out.)
On 2018-01-06, additional updated public images were made available: SLES12 SP2, SLES11 SP4, SLES11 SP4 extended, SLES11 SP4 SAPHANA, openSUSE 42 JeOS, and openSUSE 42 Docker. Debian_9 followed on 2018-01-08.
Note that the images with the _latest label are as new as the latest dated image or newer, so they are good choices for customers to reference by name.
The table below lists the updated images with the versions of the updated kernels and microcode packages (201x* => ucode-intel/intel-microcode, 1.x or 2.x => microcode_ctl) and, most interestingly, the number of changelog entries mentioning the CVE numbers of Meltdown-3, Spectre-2 and Spectre-1. (The Spectre-2 = Var2 column has kernel plus microcode changelog entries.) Note that these numbers are only an indicator of whether an issue has been addressed comprehensively by the OS vendor; they need further analysis.
|OS version||Timestamp (UTC)||kernel||microcode||Var3||Var2||Var1|
|RedHat EL 7(.4)||2018-01-11 08:57||3.10.0-693.11.6.el7||2.1-22.2.el7||63||45+0||11|
|RedHat EL 6(.9)||2018-01-11 11:05||2.6.32-696.18.7.el6||1.17-25.2.el6_9||73||66+0||12|
|CentOS 7(.4)||2018-01-05 02:38||3.10.0-693.11.6.el7||2.1-22.2.el7||63||45+0||11|
|CentOS 6(.9)||2018-01-05 01:35||2.6.32-696.18.7.el6||1.17-25.2.el6_9||73||66+0||12|
|Oracle LS 7(.4)||2018-01-05 09:58||3.10.0-693.11.6.el7||2.1-22.2.el7||63||45+0||11|
|Oracle LS 6(.9)||2018-01-05 13:08||2.6.32-696.18.7.el6||1.17-25.2.el6_9||73||66+0||12|
|SLES 12 SP3||2018-01-05 03:51||4.4.103-6.38.1||20170707-13.8.1||48||0*+2||14|
|SLES 12 SP2||2018-01-06 11:18||4.4.103-92.56.1||20170707-13.8.1||54||0*+2||14|
|SLES 12 SP1||TBD||-||-||0||0+0||0|
|SLES 11 SP4||2018-01-06 09:47||3.0.101-108.21.1||1.17-184.108.40.206||29||0*+3||8|
|openSUSE 42(.3)||2018-01-06 01:48||4.4.104-39.1||20170707-13.1||48||0*+2||14|
|Debian 9||2018-01-08 03:24||4.9.65-3+deb9u2||-||1||0+0||0|
|Fedora 26||2018-01-08 08:14||4.14.11-200.fc26||-||0**||0+0||0|
|EulerOS 2 (SP2)||2018-01-11 07:43||3.10.0-3220.127.116.11. h44||-||69||64+0||12|
0* : According to the SUSE security team, the SUSE kernel updates include a feature that enables the microcode-dependent Spectre-2/BTB fixes; it predates IBRS but is similar to it.
0** : The Fedora 26 kernel has the KPTI patches from 4.14.11, so it is secured, and we have validated this; there is just no changelog entry. We could not find traces of Spectre-2 or Spectre-1 mitigation.
In general, the kernel updates in these images do comprehensively address Meltdown-3. Many also have workarounds for Spectre-1 (in eBPF and some drivers). However, currently only RedHat and derivatives (CentOS, Oracle, Euler) and SUSE seem to really address Spectre-2, both by shipping microcode updates and by leveraging them (IBRS). So we do expect more updates in the future, with additional workarounds (IBRS and/or retpolines).
We are also working with our partner Bitnami, who has been providing updated images (based on CentOS-7) to our Marketplace.
Customers can use a check script to test whether the mitigations are active. (The script does not currently detect the SUSE Spectre-2 mitigation.)
If you want to check the microcode version, you need to ask the kernel to reload the microcode. Use echo 1 > /sys/devices/system/cpu/microcode/reload on a running Linux system and then check /proc/cpuinfo again. Note that in a VM this won't change the microcode of your host CPU (only the hypervisor can do that, or the kernel on Bare Metal), but the kernel will reread the version from the CPU this way.
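Reading the version back from /proc/cpuinfo can be scripted. The sketch below is an illustration only: it assumes the x86 layout in which each logical CPU reports a line of the form "microcode : 0x...", and the function name is hypothetical. It does not trigger the reload itself, which requires root.

```python
def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo text."""
    revisions = set()
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        # Only lines of the form "microcode<tab>: <revision>" are of interest.
        if sep and key.strip() == "microcode":
            revisions.add(value.strip())
    return revisions


if __name__ == "__main__":
    try:
        with open("/proc/cpuinfo") as f:
            print(sorted(microcode_revisions(f.read())))
    except OSError as exc:
        print("cannot read /proc/cpuinfo:", exc)
```

More than one entry in the result would indicate that the logical CPUs do not all report the same revision.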
We want to remind customers that the Spectre-2 fix and also the KPTI workarounds in the (guest) kernel do come with a measurable performance cost.
|getppid() cycles||IBRS||Kernel||(pti=off) nopti||pti=auto (default)||pti=auto nopcid|
|KVM Gold6161 (v5)||no||4.4||120||520||1150|
|KVM Gold6148 (v5)||yes||4.4||130||560||1200|
|KVM Gold6161 (v5)||no||3.10||210||600||1430|
These measurements quantify the overhead of a system call (SYS_getppid) in CPU cycles with KPTI- and IBRS-enabled kernels from SUSE (4.4) and RedHat (3.10) on Xeon E5-2658v3 (Haswell, c1/c2/s1/m1 flavors, Xen), on Xeon Gold 6148/6161 (v5, Skylake) in our new KVM s2 pool, on E5-2667v4 (Broadwell) CPUs on BMS, and on a physical desktop system with a Core i5-4250U (Haswell aka v3). Source: Benchmark by the author.
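A comparable micro-benchmark can be approximated in a few lines. The sketch below is not the benchmark used for the table: it reports wall-clock nanoseconds rather than CPU cycles and includes Python interpreter overhead, so only relative changes between boot configurations on the same machine are meaningful. The function name and iteration count are arbitrary choices.

```python
import os
import time


def mean_getppid_ns(iterations=200_000):
    """Average wall-clock nanoseconds per os.getppid() call."""
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.getppid()  # a cheap system call, so entry/exit cost dominates
    elapsed = time.perf_counter_ns() - start
    return elapsed / iterations


if __name__ == "__main__":
    print(f"{mean_getppid_ns():.0f} ns per getppid() call "
          "(including interpreter overhead)")
```

Running it once with and once without the mitigation boot options gives a rough impression of the difference in system call cost.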
The overhead observed here is the theoretical worst-case scenario. Note that on RedHat 7, only nopti can be used to switch off kernel page table isolation (KPTI); the pti=on|auto|off syntax is only supported by SUSE kernels. For the RedHat tunables, see the RedHat tuning guide.
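Which of these boot options a running guest was started with can be read from /proc/cmdline. The following sketch only parses the option names mentioned here (nopti, nopcid, pti=...); the function name is hypothetical, and vendor-specific runtime tunables are not covered.

```python
def pti_boot_options(cmdline):
    """Extract the KPTI-related options from a kernel command line string."""
    options = {}
    for token in cmdline.split():
        if token in ("nopti", "nopcid"):
            options[token] = True
        elif token.startswith("pti="):
            # pti=on|auto|off (per the text, only SUSE kernels support this).
            options["pti"] = token.split("=", 1)[1]
    return options


if __name__ == "__main__":
    try:
        with open("/proc/cmdline") as f:
            print(pti_boot_options(f.read()))
    except OSError as exc:
        print("cannot read /proc/cmdline:", exc)
```

An empty result means that none of these options was passed and the kernel runs with its default.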
Note that there is no reason to use nopcid. The benchmark numbers are only here to show how bad things could have been (and might be if someone backports KPTI without pcid support to an old kernel).
As can be seen, the worst-case performance overhead of the Spectre-2 mitigation is very significant on bare metal v4 CPUs; the switch time from userspace to the kernel (by system calls or interrupts) is more than 10x higher on v4 CPUs. The good news is that the v3 (Haswell) and v5 (Skylake) CPUs do not suffer much beyond the impact of KPTI and are thus more in line with what we expected after the communication from Intel.
It should be noted that the v4 microcode updates from Intel are not final yet. We hope Intel will be able to reduce the cost. In general, it appears that Intel is still struggling to deliver microcode updates that work perfectly in all scenarios. We deployed new microcode to all v3 systems in eu-de on 2018-01-11 and have thus far been lucky not to observe any crashes. Our recommendation to customers that are concerned about this is to tell the platform to restart VMs automatically (autorecovery) on another host if a host fails.
Even without the Spectre-2 microcode fixes, KPTI alone approximately quadruples the constant overhead of system calls and interrupts (with the pcid optimization on; it is 2x worse otherwise), while most other operations are not affected in a measurable way. In macro-benchmarks, a performance degradation between 0% (e.g. number crunching) and 50% (syscall-heavy) was observed; for normal workloads we expect a degradation below 10% from KPTI, as reported by benchmarks. Our CPUs in OTC all support the PCID feature, which mitigates the KPTI performance impact somewhat.
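The factors quoted here can be recomputed from the getppid() table above. The short sketch below only divides the published cycle counts; the dictionary keys are shorthand labels for the table rows, and the function name is illustrative.

```python
# getppid() cycle counts copied from the table above, per row:
# (nopti, pti=auto, pti=auto nopcid)
MEASUREMENTS = {
    "KVM Gold6161 (v5), no IBRS, kernel 4.4": (120, 520, 1150),
    "KVM Gold6148 (v5), IBRS, kernel 4.4": (130, 560, 1200),
    "KVM Gold6161 (v5), no IBRS, kernel 3.10": (210, 600, 1430),
}


def slowdown_factors(nopti, pti, pti_nopcid):
    """Return (KPTI vs. no KPTI, KPTI without PCID vs. KPTI with PCID)."""
    return pti / nopti, pti_nopcid / pti


if __name__ == "__main__":
    for name, cycles in MEASUREMENTS.items():
        kpti, nopcid = slowdown_factors(*cycles)
        print(f"{name}: KPTI {kpti:.1f}x, without PCID a further {nopcid:.1f}x")
```

For the two 4.4 rows this gives a KPTI factor of about 4.3x, and for the 3.10 row about 2.9x; disabling PCID costs a further factor of roughly 2.1x to 2.4x in all three rows.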
While the kernel boot parameter nopti can be used to disable the Meltdown-3 mitigations, this cannot be recommended due to the security implications. (In our measurements, using nospec on SUSE kernels to disable the Spectre-2 mitigation did not make a difference on the v4 CPUs.)
Some applications make lots of gettimeofday() system calls, a call that is almost as fast as getppid(). By switching the clocksource (/sys/devices/system/clocksource/clocksource0/current_clocksource) to tsc, the vsyscall mechanism can be used, avoiding any performance impact from KPTI and IBRS. On new kernels with appropriate hardware, the kvm_clock clocksource is also implemented via vsyscalls, so the KVM default there has roughly the same performance as tsc.
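Before switching, it is worth checking which clocksource a guest currently uses and which ones it offers. The sketch below assumes the sysfs directory /sys/devices/system/clocksource/clocksource0 with its current_clocksource and available_clocksource files; the function name is illustrative. It only reads the values, because changing the clocksource requires root.

```python
from pathlib import Path

# Assumption: standard sysfs location of the first clocksource device.
CLOCKSOURCE_DIR = Path("/sys/devices/system/clocksource/clocksource0")


def clocksource_status(base=CLOCKSOURCE_DIR):
    """Return (current, available) clocksources, or (None, []) if unreadable."""
    base = Path(base)
    try:
        current = (base / "current_clocksource").read_text().strip()
        available = (base / "available_clocksource").read_text().split()
    except OSError:
        return None, []
    return current, available


if __name__ == "__main__":
    current, available = clocksource_status()
    print("current:", current)
    print("available:", ", ".join(available))
    # Switching (as root) means writing e.g. "tsc" to current_clocksource.
```

If tsc is not listed as available, the kernel will not accept it as the new clocksource.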
There are some real-world benchmark numbers from Phoronix that give a realistic idea of the impact of KPTI alone; we have not yet found comprehensive coverage of the performance impact of the microcode-based Spectre-2 mitigation (IBRS) -- but according to our own measurements, it's worse.
There are also some performance results by Intel, though the test environment is focused on desktop processors and workloads and may only give a rough indication of the impact on typical server workloads.
It should be noted that the performance impact is very much dependent on the workload; workloads that do many I/O operations per second are hit the worst. Number crunching, on the other hand, is not affected in a measurable way.