10th Jan 2018 by Kurt Garloff

OTC Patching

Like other cloud providers (e.g. Amazon and Azure), we apply mitigating patches to each host; we have engineered ways to deploy most key patches without having to reboot all hosts, which would have led to service disruptions for applications of our customers that may not be fully cloud-ready.

Patch strategy

The first priority for the OTC team is to reestablish full isolation between Virtual Machines (VMs) running on the same hypervisor; according to our analysis, this is the most critical issue that needs to be addressed in the infrastructure. Here, only Spectre-2 (BTB injection) needs to be addressed.

In OTC, we use Intel's microcode patches plus a few changes to the Xen and KVM hypervisors (along the lines of IBRS) to prevent BTB poisoning attacks from leaking hypervisor memory.

We have done extensive testing over the last days in our test and reference environments and have confirmed that the patches can be applied at runtime without a host reboot. Microcode updates can be done at runtime (though for certain CPUs and patches, loading after the OS has fully booted is not normally recommended) and our hypervisors support this. The code changes required in the hypervisors to leverage the new microcode are relatively contained; we are using the hot-patch mechanism supported by our hypervisors to deploy these without a reboot.

Testing showed that the microcode updates from Intel as of 2018-01-08 cause trouble on the v4 (Broadwell) CPU generation and on the Xeon E7-8880 v3 (used for the SAPHANA flavors), while the tests on other v3 (Haswell) and v5 (Skylake) CPUs did not show any trouble. This means that patching the v4 and E7-8880 v3 systems will be delayed. (We expect a delay of roughly one week.)

This means that after the first round of patching with v3 and v5, we will have the following situation (cwk 3 = calendar week 3):

Pool	Flavors	Hypervisor	CPUs	Patch status
GenPurp1	c1,c2,s1,m1	Xen	2658v3 + v4	partially (v3)
DiskInt	d1	Xen	2690v3	fully
SapHana	e1,e2	Xen	8880v3	cwk 3
vGPU	g1	Xen	2690v3	fully
GPU	g2	Xen	2690v4	cwk 3
HighPerf1	h1	Xen	2690v3 + v4	partially (v3)
HighPerf2	h2	KVM	2667v4	cwk 3
HighPerf1L	h1l	KVM	2690v4	cwk 3
MemOpt2	m2	KVM	2690v4	cwk 3
GenPurp2	s2	KVM	PG6161 (v5)	fully
BareMetal	physical	---	8890v4,2667v4	cwk 3

The Spectre-2 mitigation comes at a performance cost; this is workload dependent -- the most significant hit happens on workloads with many I/O transactions per second. We expect the performance impact on the s2 flavor to be more visible, as the 10Gbps network capability also requires many packets per second. See below for more details on performance impact.

Hypervisor patch schedule

We have scheduled the mass rollout of the patches for the evening (European time) of Thu, 2018-01-11.

The customer visible schedule looks like this (all times in CET):

Phase	Start time	AZ	Description
1	2018-01-11 17:00	eu-de-01	Deploy microcode updates to all systems in AZ, deploy hot-patches to Xen and KVM hypervisors, test systems and validate effectiveness.
2	2018-01-11 21:00	eu-de-02	Deploy microcode updates to all systems in AZ, deploy hot-patches to Xen and KVM hypervisors, test systems and validate effectiveness.
3	2018-01-12 12:00	ap-sg-01	Deploy microcode updates to all systems in AZ, deploy hot-patches to Xen and KVM hypervisors, test systems and validate effectiveness.
4	2018-01-12 15:00	ap-sg-02	Deploy microcode updates to all systems in AZ, deploy hot-patches to Xen and KVM hypervisors, test systems and validate effectiveness.

As can be seen, we do this availability zone by availability zone, first in Europe/Germany and then Asia/Singapore. We will stop the rollout if the testing fails and are prepared to do rollbacks to the last working state.
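The zone-by-zone approach with a stop on failed validation can be sketched as a simple loop. This is only an illustration of the control flow described above; the function staged_rollout and its patch, validate and rollback callbacks are hypothetical names, not part of any OTC tooling.

```python
from typing import Callable, Iterable, List


def staged_rollout(zones: Iterable[str],
                   patch: Callable[[str], None],
                   validate: Callable[[str], bool],
                   rollback: Callable[[str], None]) -> List[str]:
    """Patch zones one at a time; stop at the first failed validation.

    Returns the zones that were patched and validated successfully.
    All callbacks are hypothetical placeholders supplied by the caller.
    """
    done: List[str] = []
    for zone in zones:
        patch(zone)
        if not validate(zone):
            rollback(zone)  # return this zone to its last working state
            break           # leave the remaining zones untouched
        done.append(zone)
    return done
```

Called with the zones in schedule order (eu-de-01, eu-de-02, ap-sg-01, ap-sg-02), a failed validation in one zone means the later zones are never patched.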

As mentioned above, this schedule excludes the hosts with v4 CPUs (Bare Metal and high-performance flavors and the newer hosts in the old Xen general purpose pool).

During the patch application, the hosts and the virtual machines continue to run with minimal interruption: the microcode update stops the CPU for a short (sub-second) amount of time -- most applications will at worst see a small performance degradation due to this. Likewise, the hypervisor hot-patching needs to synchronize state between CPUs and thus causes a performance decrease for a sub-second period as well.

Customers that are extremely performance sensitive or do not trust that the live patching can be done without problems can of course migrate applications into the other availability zone or shut down VMs. Be aware that the patch application is scheduled to take at most 3 hours within each 4-hour window, so the time between 20:00 and 20:59 could be used to migrate applications into eu-de-01, although we clearly do not deem this to be necessary.
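The arithmetic behind the migration window can be checked with a few lines of Python. This is a small sketch that takes the phase start times from the schedule table and the 3-hour maximum from the text; the helper name migration_gap is made up for this example.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M"
MAX_PATCH_TIME = timedelta(hours=3)  # "max. 3 hours" per phase, from the text


def migration_gap(phase_start: str, next_phase_start: str):
    """Return (latest end of patching in this phase, slack before the next phase)."""
    patch_end = datetime.strptime(phase_start, FMT) + MAX_PATCH_TIME
    slack = datetime.strptime(next_phase_start, FMT) - patch_end
    return patch_end, slack


# Phase 1 (eu-de-01) and phase 2 (eu-de-02) start times from the schedule.
end, slack = migration_gap("2018-01-11 17:00", "2018-01-11 21:00")
print(end.strftime(FMT), slack)  # 2018-01-11 20:00 1:00:00
```

The one hour of slack is the 20:00 to 20:59 period mentioned above.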

The Spectre-2 mitigation does come with a performance impact; the overhead of system calls increases a bit. See below for some performance data.

Next steps: Guest kernels

We conducted reviews on compute hosts to identify all secrets that are stored on those systems; they will be replaced after the patches have been applied.

We recommend that customers quickly deploy new vendor kernels to also address the flaws that allow userspace processes to read memory from the guest kernel and from other processes and containers running inside the same VM. This can be achieved by online updates or by deploying new images. We have provided an almost complete set of new public images, as can be seen below. Note that even with the online update approach, the guest has to be rebooted into the new kernel to make the changes effective.

The situation for Bare Metal machines is special -- there is no hypervisor or host system that OTC operations can deploy a workaround to, as the kernel is directly under the control of the customer. The Linux images can also be used to deploy updated microcode. We have included the Intel microcode (ucode-intel, microcode_ctl, intel-microcode) in our latest images. Alternatively, microcode updates can be delivered in the BIOS/UEFI firmware -- we intend to do so first for idle hardware. For running hardware, we ask customers to tell us when they shut it down and want us to do the firmware update.

We also run some guest systems ourselves or provide special images that run inside customer VMs to provide services such as ELB, DMS, CCE, RDS, MRS, ... We are currently prioritizing the redeployment of these guests based on the risk assessment that is currently under way. Our current assessment is that we can update the services that we run without service interruption by leveraging the fact that those are not built on single instances but on clusters of machines. For service VMs that run inside the customer-controlled VMs, we will create upgrade procedures that we will ask our customers to follow.

Image updates

We produce most of the public images on OTC by a fully automated continuous process with automated daily builds (and the possibility to manually trigger additional builds) and tests in our OTC ImageFactory.

We have quickly published new public images as soon as updated (KPTI-enabled) kernels were published by the operating system vendors. We also included the microcode updates needed for Spectre-2 mitigation where available. Note that microcode updates cannot be done from within a guest VM, but only from the hypervisor host and on Bare Metal machines.

We observed the first kernel and microcode updates on 2018-01-04 and started publishing image updates that night. We expect to have this mostly completed before 2018-01-11. We also included the kernel.unprivileged_bpf_disabled = 1 setting in the images that use 4.4+ kernels.
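Whether the unprivileged BPF restriction is active in a running guest can be read back from procfs. The sketch below assumes the sysctl is exposed as /proc/sys/kernel/unprivileged_bpf_disabled (the upstream name); kernels without the sysctl have no such file, which the helper reports as None.

```python
from pathlib import Path

# Assumed procfs location of the sysctl on kernels that implement it.
BPF_SYSCTL = "/proc/sys/kernel/unprivileged_bpf_disabled"


def unprivileged_bpf_disabled(path: str = BPF_SYSCTL):
    """Return True/False for the setting, or None if the file cannot be read."""
    try:
        value = Path(path).read_text().strip()
    except OSError:
        return None
    # Any non-zero value means unprivileged use of bpf() is restricted.
    return value != "0"
```

On an image built with the setting described above, unprivileged_bpf_disabled() should return True.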

As of midnight (24:00 CET) 2018-01-05, we had published updated public images for CentOS-7, CentOS-6, Oracle Linux Server 7, Oracle LS 6, and SLES12 SP3. (We always use the latest minor version available, so CentOS-7 is not 7.0, but 7.4 right now, and will be 7.5 as soon as this is out.)

On 2018-01-06, additional updated public images were made available: SLES12 SP2, SLES11 SP4, SLES11 SP4 extended, SLES11 SP4 SAPHANA, openSUSE 42 JeOS, and openSUSE 42 Docker. Debian_9 followed on 2018-01-08.

Note that the images with the _latest label are as new as the latest dated image or newer, so they are good choices for customers to reference by name.

Below is a table with the updated images and the versions of the updated kernel and microcode packages (201x* => ucode-intel/intel-microcode, 1.x or 2.x => microcode_ctl). Most interesting is the number of changelog entries mentioning the CVE numbers of Meltdown-3, Spectre-2 and Spectre-1. (The Spectre-2 = Var2 column lists kernel plus microcode changelog entries.) Note that these numbers are only an indicator of whether an issue has been addressed comprehensively by the OS vendor; they need further analysis.

OS version	Timestamp (UTC)	kernel	microcode	Var3	Var2	Var1
RedHat EL 7(.4)	2018-01-11 08:57	3.10.0-693.11.6.el7	2.1-22.2.el7	63	45+0	11
RedHat EL 6(.9)	2018-01-11 11:05	2.6.32-696.18.7.el6	1.17-25.2.el6_9	73	66+0	12
CentOS 7(.4)	2018-01-05 02:38	3.10.0-693.11.6.el7	2.1-22.2.el7	63	45+0	11
CentOS 6(.9)	2018-01-05 01:35	2.6.32-696.18.7.el6	1.17-25.2.el6_9	73	66+0	12
Oracle LS 7(.4)	2018-01-05 09:58	3.10.0-693.11.6.el7	2.1-22.2.el7	63	45+0	11
Oracle LS 6(.9)	2018-01-05 13:08	2.6.32-696.18.7.el6	1.17-25.2.el6_9	73	66+0	12
SLES 12 SP3	2018-01-05 03:51	4.4.103-6.38.1	20170707-13.8.1	48	0*+2	14
SLES 12 SP2	2018-01-06 11:18	4.4.103-92.56.1	20170707-13.8.1	54	0*+2	14
SLES 12 SP1	TBD			0	0+0	0
SLES 11 SP4	2018-01-06 09:47	3.0.101-108.21.1	1.17-102.83.6.1	29	0*+3	8
openSUSE 42(.3)	2018-01-06 01:48	4.4.104-39.1	20170707-13.1	48	0*+2	14
Debian 9	2018-01-08 03:24	4.9.65-3+deb9u2		1	0+0	0
Fedora 26	2018-01-08 08:14	4.14.11-200.fc26		0**	0+0	0
Ubuntu 16.04	2018-01-09	4.4.0-108-generic		3	0+0	0
Ubuntu 14.04	2018-01-10	3.13.0-139-generic		1	0+0	0
EulerOS 2 (SP2)	2018-01-11 07:43	3.10.0-327.59.59.46.h44		69	64+0	12
Windows 2016	2018-01-08
Windows 2012	2018-01-08
Windows 2008	2018-01-09

0* : The SUSE kernel updates include a feature to enable the microcode-dependent Spectre-2/BTB fixes; according to the SUSE security team, it predates but is similar to IBRS.

0** : The Fedora 26 kernel has the KPTI patches from 4.14.11, so it is secured, and we have validated this; there is just no changelog entry. We could not find traces of Spectre-2 or Spectre-1 mitigation.

In general, the kernel updates in these images do comprehensively address Meltdown-3. Many also have workarounds for Spectre-1 (in eBPF and some drivers). However, currently only RedHat and derivatives (CentOS, Oracle, Euler) and SUSE seem to really address Spectre-2, both by shipping microcode updates and by leveraging them (IBRS). So we do expect more updates in the future, with additional workarounds (IBRS and/or retpolines).

We are also working with our partner Bitnami, who has been providing updated images (based on CentOS-7) to our Marketplace.

Customers can use a check script to test whether the mitigations are active. (The script does not currently detect the SUSE Spectre-2 mitigation.)

If you want to check the microcode version, you first need to ask the kernel to reload the microcode. Use echo 1 > /sys/devices/system/cpu/microcode/reload on a running Linux system and then check /proc/cpuinfo again. Note that this won't change the microcode of your host CPU from within a VM (only the hypervisor can do that, or the kernel on Bare Metal), but it makes the kernel reread the version from the CPU.
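Reading the revision back from /proc/cpuinfo can be scripted. The following is a minimal sketch that assumes the x86 Linux layout, where each logical CPU has a line of the form "microcode : 0x..."; the function names are made up for this example.

```python
import re
from collections import Counter


def microcode_revisions(cpuinfo_text: str) -> Counter:
    """Count how many logical CPUs report each microcode revision."""
    revisions = re.findall(r"^microcode\s*:\s*(\S+)", cpuinfo_text,
                           flags=re.MULTILINE)
    return Counter(revisions)


def read_microcode_revisions(path: str = "/proc/cpuinfo") -> Counter:
    """Convenience wrapper that parses the live /proc/cpuinfo."""
    with open(path) as f:
        return microcode_revisions(f.read())
```

Calling read_microcode_revisions() before and after the reload shows whether the reported revision changed; more than one key in the result would mean that the CPUs report different revisions.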

Performance impact

We do want to remind customers that the Spectre-2 fix and also the KPTI workarounds in the (guest) kernel do come with a measurable performance cost.

getppid() cycles	IBRS	Kernel	(pti=off) nopti	pti=auto (default)	pti=auto nopcid
Xen E5-2658v3	no	4.4	150	630	1400
Xen E5-2658v3	yes	4.4	150	750	1700
KVM Gold6161 (v5)	no	4.4	120	520	1150
KVM Gold6148 (v5)	yes	4.4	130	560	1200
KVM Gold6161 (v5)	no	3.10	210	600	1430
BMS E5-2667v4	no	4.4	140	580	900
BMS E5-2667v4	yes	4.4	2380	2850	3100
BMS E5-2667v4	no	3.10	250	700	950
BMS E5-2667v4	yes	3.10	2800	3350	3430
i5-4250U (v3)	no	4.4	115
i5-4250U (v3)	yes	4.4	180	550	800

These measurements quantify the overhead of a system call (SYS_getppid) in CPU cycles with KPTI- and IBRS-enabled kernels from SUSE (4.4) and RedHat (3.10) on Xeon E5-2658v3 (Haswell, c1/c2/s1/m1 flavors, Xen), on Xeon Gold 6148/6161 (v5 - Skylake) in our new KVM s2 pool, on E5-2667v4 (Broadwell) CPUs on BMS, and on a physical desktop system with Core i5-4250U (Haswell aka v3). Source: Benchmark by the author.

The overhead observed here is the theoretical worst-case scenario. Note that on RedHat 7, only nopti can be used to switch off kernel page table isolation (KPTI); the pti=on|auto|off syntax is only supported by SUSE kernels. For the RedHat tunables, see the RedHat tuning guide.

Note that there is no reason to use nopcid. The benchmark numbers are only here to show how bad things could have been (and might be if someone backports KPTI without pcid support to an old kernel).
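A rough way to compare the system call cost between two boot configurations of one machine is to time getppid() in a loop. The sketch below is not the author's benchmark: it reports wall-clock nanoseconds rather than CPU cycles and includes Python interpreter overhead, so only the difference between two runs on the same system is meaningful. The function name and defaults are assumptions made for this example.

```python
import os
import time


def ns_per_getppid(calls: int = 200_000, repeats: int = 5) -> float:
    """Best-of-N nanoseconds per os.getppid() call, interpreter overhead included."""
    best = float("inf")
    for _rep in range(repeats):
        start = time.perf_counter_ns()
        for _call in range(calls):
            os.getppid()  # issues a real getppid system call each time
        elapsed = time.perf_counter_ns() - start
        best = min(best, elapsed / calls)
    return best
```

Running print(ns_per_getppid()) once with the default boot parameters and once with nopti gives an indication of the per-call cost of KPTI on that system.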

As can be seen, the worst-case performance overhead of the Spectre-2 mitigation is very significant on bare metal v4 CPUs; the switch time from userspace to the kernel (by system calls or interrupts) is more than 10x higher on v4 CPUs. The good news is that the v3 (Haswell) and v5 (Skylake) CPUs do not suffer much beyond the impact of KPTI and are thus more in line with what we expected after the communication from Intel.

It should be noted that the v4 microcode updates from Intel are not final yet. We hope Intel will be able to reduce the cost. In general, it appears that Intel is still struggling to deliver microcode updates that work perfectly in all scenarios. We deployed new microcode to all v3 systems in eu-de on 2018-01-11 and have so far been lucky not to observe any crashes. Our recommendation to customers who are concerned about this is to tell the platform to restart VMs automatically (autorecovery) on another host if a host fails.

Even without the Spectre-2 microcode fixes, KPTI alone approximately quadruples the constant overhead of system calls and interrupts (with the pcid optimization on; roughly 2x worse otherwise), while most other operations are not affected in a measurable way. In macro-benchmarks, a performance degradation between 0% (e.g. number-crunching) and 50% (syscall-heavy) was observed; for normal workloads we expect a degradation below 10% from KPTI, as reported by benchmarks. Our CPUs in OTC all support the PCID feature, mitigating the KPTI performance impact somewhat.
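The factors quoted here can be recomputed from the getppid() table above. The snippet copies the rows measured without IBRS on the 4.4 kernel; the dictionary and function names are chosen for this example only.

```python
# getppid() cycle counts from the table above (IBRS = no, kernel 4.4).
# Columns: nopti, pti=auto (default), pti=auto nopcid.
CYCLES = {
    "Xen E5-2658v3": (150, 630, 1400),
    "KVM Gold6161 (v5)": (120, 520, 1150),
    "BMS E5-2667v4": (140, 580, 900),
}


def kpti_factors(nopti, pti, pti_nopcid):
    """Return (slowdown from KPTI with PCID, extra slowdown when PCID is off)."""
    return pti / nopti, pti_nopcid / pti


for host, row in CYCLES.items():
    with_pcid, extra_nopcid = kpti_factors(*row)
    print(f"{host}: x{with_pcid:.1f} with PCID, further x{extra_nopcid:.1f} without")
```

For these three rows the KPTI factor with PCID lies between roughly 4.1 and 4.3.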

While the kernel boot parameter nopti can be used to disable the Meltdown-3 mitigations, this cannot be recommended due to the security implications. (In our measurements, using nospec on SUSE kernels to disable the Spectre-2 mitigation did not make a difference on the v4 CPUs.)
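To audit whether a guest was booted with KPTI explicitly switched off, the kernel command line can be inspected. The sketch below only looks for the nopti and pti= parameters named in this article and lets the last matching token win, which is a simplification and may differ from how a given kernel resolves conflicting options; the function name is hypothetical.

```python
def kpti_boot_override(cmdline: str):
    """Return 'off', 'on' or 'auto' if the command line sets KPTI explicitly, else None.

    Simplification: the last matching token wins.
    """
    state = None
    for token in cmdline.split():
        if token == "nopti":
            state = "off"
        elif token.startswith("pti="):
            value = token[len("pti="):]
            if value in ("on", "off", "auto"):
                state = value
    return state
```

On a running system the input would typically be the contents of /proc/cmdline; a result of "off" flags a machine that runs without the Meltdown-3 mitigation.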

Some applications do lots of gettimeofday() system calls, a call that is almost as fast as getppid() -- by switching the clocksource (/sys/devices/system/clocksource/clocksource0/current_clocksource) to tsc, the vsyscall mechanism can be used, avoiding any performance impact from KPTI and IBRS. On new kernels with appropriate hardware, the kvm_clock clocksource is also implemented via vsyscalls, so the KVM default has roughly the same performance as tsc there.
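The active clocksource and the alternatives the kernel offers can be read from sysfs before deciding on a switch. This is a minimal sketch assuming the standard sysfs directory /sys/devices/system/clocksource/clocksource0 with its current_clocksource and available_clocksource files; the helper name is made up for this example.

```python
from pathlib import Path

# Standard sysfs directory for the clocksource on Linux.
CLOCKSOURCE_DIR = "/sys/devices/system/clocksource/clocksource0"


def clocksource_status(base: str = CLOCKSOURCE_DIR):
    """Return (current clocksource, list of available clocksources)."""
    root = Path(base)
    current = (root / "current_clocksource").read_text().strip()
    available = (root / "available_clocksource").read_text().split()
    return current, available
```

If tsc appears in the available list but is not the current clocksource, writing tsc to current_clocksource as root selects it.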

There are some real-world benchmark numbers from Phoronix that give some realistic idea on the impact of KPTI alone; we have not found comprehensive coverage of the microcode based Spectre-2 mitigation (IBRS) performance impact yet -- but according to our own measurements, it's worse.

There are also some performance results by Intel, though the test environment is focused on desktop processors and workloads and may only give a rough indication of the impact on typical server workloads.

It should be noted that the performance impact is very much dependent on the workload; workloads that do many I/O operations per second are hit the worst. Number crunching, on the other hand, is not affected in a measurable way.