Benchmarks
KPTI/IBRS/Retpoline Microbenchmarks
To evaluate the various mitigation approaches against Spectre-2 (BTI) and Meltdown-3, we have performed a number of measurements. These micro-benchmarks target the operations that are expected to be most affected by the workarounds for these CPU design bugs. All tests have been performed on instances in the Open Telekom Cloud, most of them on Bare Metal Flavors.
Old results
These results were produced in early January, after the Linux vendors had published kernels with the KPTI mitigation against Meltdown-3 and the IBRS/IBPB/STIBP mitigation based on new microcode-provided CPU features (microcode which intel later had to retract for many CPUs).
getppid() cycles | IBRS Hype | IBRS Kernel | Kernel version | nopti (pti=off) | pti=auto (default) | pti=auto nopcid |
---|---|---|---|---|---|---|
Xen E5-2658v3 | no | no | 4.4 | 150 | 630 | 1400 |
Xen E5-2658v3 | yes | no | 4.4 | 150 | 750 | 1700 |
KVM Gold6161 (v5) | no | no | 4.4 | 120 | 520 | 1150 |
KVM Gold6148 (v5) | yes | no | 4.4 | 130 | 560 | 1200 |
KVM Gold6161 (v5) | no | no | 3.10 | 210 | 600 | 1430 |
BMS E5-2667v4 | NA | no | 4.4 | 140 | 580 | 900 |
BMS E5-2667v4 | NA | yes | 4.4 | 2380 | 2850 | 3100 |
BMS E5-2667v4 | NA | no | 3.10 | 250 | 700 | 950 |
BMS E5-2667v4 | NA | yes | 3.10 | 2800 | 3350 | 3430 |
i5-4250U (v3) | NA | no | 4.4 | 115 | 550 | |
i5-4250U (v3) | NA | yes | 4.4 | 2150 | | |
As can be seen, the performance impact of KPTI is bad, very bad without PCID, and the impact of the IBRS-based Spectre-2 mitigation in the kernel is dramatic.
Meanwhile we have kernels and compilers which implement the retpoline mitigation mechanism. See below for some more numbers.
New System Call Measurements
The same [syscall test program][2] has been used as for the old results. The benchmark calls the getppid() system call in a tight loop and measures the time per call in ns. This is a worst case scenario for the mitigations: with KPTI, an address space switch has been added to the system call path to mitigate Meltdown-3; and in order to avoid branch targets being injected across privilege transitions, the caches for indirect jumps need to be protected on a system call -- both mitigations thus sit directly in the system call path. (The same happens for interrupts, which also normally incur a transition from userspace to kernel mode.)
We report ns here, as these numbers are rather constant even when the CPU frequency changes -- it looks like the time required for the operations the CPU performs when entering kernel space does not depend much on the CPU frequency. (Even the old results were obtained by taking the rather constant time measurements and multiplying by the nominal frequency of the CPU, not the current one.)
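The core of such a measurement is simple. Below is a minimal C sketch of the idea -- not the actual [syscall test program][2] linked above, just an illustration with an arbitrarily chosen loop count:

```c
/* Minimal sketch of a getppid() latency measurement; the loop count
 * and output format are arbitrary, this is not the actual test program. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long loops = 10 * 1000 * 1000;    /* 10M getppid() calls */
    struct timespec start, end;
    long i;
    double ns;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < loops; i++)
        getppid();    /* cheap syscall with almost no in-kernel work */
    clock_gettime(CLOCK_MONOTONIC, &end);

    ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns/syscall\n", ns / loops);
    return 0;
}
```

Since getppid() does almost no work inside the kernel, the measured time is dominated by the kernel entry/exit path -- exactly the path that KPTI and the Spectre-2 mitigations make more expensive.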
All new tests have been performed on a bare metal system in the Open Telekom Cloud. The system features a pair of Broadwell E5-2667v4 CPUs, either with microcode 0xb000021 (no uC) or with the later withdrawn 0xb000025 (w/ uC).
![System Call overhead][3]
With retpoline kernels now available (openSUSE 42.3, SLES12 SP3 and SP2 since early February 2018, 4.14.19), we can compare the mitigation versus the old IBRS results.
Miti:Kernel [ns/syscall] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
---|---|---|---|---|---|
None | 90 | NA | 45 | 45 | 45 |
IBRS/Retpol | 790 | 670 | 50 | 65 | 60 |
KPTI | 230 | NA | 170 | 190 | 200 |
Both | 960 | 820 | 190 | 230 | 220 |
Both nopcid | 960 | 880 | 270 | 315 | 330 |
Used kernel versions and settings:
- RHEL 7: Kernel 3.10.0-693.11.6.el7, microcode 0xb000025, None = ibpb/ibrs/pti_enabled=0/0/0, IBRS = 1/1/0, KPTI = 0/0/1, Both = 1/1/1.
- SLES12 SP3 January: kernel-default-4.4.103-6.38.1, microcode 0xb000025, None = pti=off boot parameter.
- SLES12 SP3 February no uC: kernel-default-4.4.114-94.11.3, microcode 0xb000021, None = pti=off spectre_v2=off boot parameters
- SLES12 SP3 February w/ uC: kernel-default-4.4.114-94.11.3, microcode 0xb000025, None = pti=off spectre_v2=off boot parameters
- 4.14.19 (4.8.5r): A vanilla 4.14.19 kernel (from linux-stable git), compiled with the retpoline enabled gcc-4.8.5 from SLES12 SP3, microcode 0xb000025, None = pti=off spectre_v2=off boot parameters.
The results can also be seen in the graph on the right.
There are a few noteworthy observations:
- The 3.10 kernel is systematically slower than 4.4+ for system calls. It seems that nopcid does not make a difference there.
- KPTI adds about 140 ns (2.6x) of system call overhead on the 3.10 kernel and about 150 ns (3.3x) on 4.x kernels.
- Without pcid, things would be significantly worse.
- IBRS Spectre-2 mitigation is incredibly expensive as can be seen by the RHEL-7 and SLES12 January results, with over 600ns loss (7.5 -- 14x) per system call.
- Not visible from the numbers shown here: enabling ibpb in the RHEL7 kernel has only a minimal performance impact, while the impact of ibrs is huge.
- Worst case performance loss is 770 ns (4.4.x) -- 870 ns (3.10.x) when both IBRS and KPTI are used.
- Retpolines, on the other hand, are much, much better than IBRS, as can be seen from the SLES12 February and 4.14.19 results; the overhead per system call is on the order of 15 ns (35%) and thus smaller than the difference between the older 3.10 and newer 4.4+ kernels.
- The new microcode does have a small impact even on retpoline kernels; the assumption is that there is minimal usage of ibpb on Broadwell CPUs to fully mitigate Spectre-2.
- The SUSE kernel is pretty much in line with the latest 4.14.19.
- Worst case performance loss on 4.4+ kernels is ~180 ns; this still means roughly 4x fewer system calls per second are possible.
New Pipe Context Switch Measurements
A second, equally simple benchmark has been created: writing a small amount of data into a [pipe][4] and reading it back in another process. As the pipe buffer is small, this also results in a huge number of system calls. In addition, there will be a lot of context switches -- so we should see how well the PCID feature works in the presence of multiple processes; the minimal PCID support in the KAISER patches prior to 4.14 only enables two context IDs -- one for the kernel and one for the userspace process -- which will not work as well when more than one userspace process is involved.
This is a less theoretical worst case scenario: sending lots of small pieces of data between processes via a pipe or via the network will still expose the performance losses caused by the mitigations, but since real data is being transferred, it is a bit more realistic than the pure system call loop.
![Time for one round of 512B pipe write/read in us (same thread)][5]
The program binds the reader and writer ends of the pipe to a CPU (hyperthread = HT).
The test is first run forcing both ends onto the same HT. This results in the worst performance, as the single HT needs to do lots of context switches.
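For illustration, here is a minimal C sketch of the idea behind this benchmark -- a hypothetical simplification, not the actual [pipe test program][4]: a parent process writes 512-byte blocks into a pipe, a forked child reads them back out, and both processes are pinned to hyperthreads given on the command line.

```c
/* Hypothetical simplification of the pipe write/read benchmark:
 * parent writes 512-byte blocks, child reads them, both pinned to a HT. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);    /* bind calling process */
}

int main(int argc, char **argv)
{
    const long loops = 5 * 1000 * 1000;         /* 5M writes/reads */
    int wcpu = argc > 1 ? atoi(argv[1]) : 0;    /* writer HT */
    int rcpu = argc > 2 ? atoi(argv[2]) : 0;    /* reader HT */
    char buf[512];
    int fd[2];
    long i;
    struct timespec start, end;
    double ns;

    memset(buf, 'x', sizeof(buf));
    pipe(fd);

    if (fork() == 0) {                          /* child: reader */
        pin_to_cpu(rcpu);
        close(fd[1]);
        while (read(fd[0], buf, sizeof(buf)) > 0)
            ;
        _exit(0);
    }

    pin_to_cpu(wcpu);                           /* parent: writer */
    close(fd[0]);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < loops; i++)
        write(fd[1], buf, sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd[1]);
    wait(NULL);

    ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per 512B write/read\n", ns / loops);
    return 0;
}
```

Because the pipe buffer is small, the writer quickly blocks until the reader has drained the pipe, so the elapsed time covers both the write and the read side. The real test program additionally takes the block size and iteration count on the command line (see the parameters quoted below).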
Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
---|---|---|---|---|---|
None | 890 | NA | 455 | 455 | 470 |
Retpol/IBRS | 3290 | 2560 | 555 | 555 | 540 |
KPTI | 1490 | NA | 740 | 800 | 770 |
Both | 3700 | 2900 | 830 | 910 | 880 |
Both nopcid | 3700 | 3120 | 1120 | 1250 | 1260 |
Parameters used: 512 5000000 0 0 -- this does 5M 512-byte pipe writes and reads and binds both the reader and the writer process to hyperthread 0.
The test is repeated using two hyperthreads of the same CPU core, benefiting from cache effects while avoiding context switches. The reported time is the elapsed time; the CPU time is higher, as two CPUs (HTs) are involved. Note that in most (but not all!) scenarios, exchanging data via the HTs of the same core gives the best performance. Parameters: 512 5000000 0 16 -- 16 is the sibling HT of 0.
![Time for one round of 512B pipe write/read in us (sibling thread)][6]
Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
---|---|---|---|---|---|
None | 550 | NA | 335 | 335 | 335 |
Retpol/IBRS | 2000 | 1615 | 375 | 385 | 370 |
KPTI | 870 | NA | 475 | 495 | 490 |
Both | 2140 | 1695 | 520 | 560 | 540 |
Both nopcid | 2325 | 1900 | 780 | 850 | 890 |
The settings are the same as in the above system call benchmarks.
Observations:
- The relative performance degradation from IBRS is roughly the same for both cases and huge again: 3.5x -- 5.5x slower execution with IBRS alone and 4.2x -- 6.4x slower with IBRS and KPTI. The absolute slowdown is worse for the single-HT case, while the relative slowdown varies between the two scenarios.
- Retpolines are much nicer again -- combined with KPTI, we get a 1.66x -- 2x slowdown.
- The new microcode does have a small visible effect on retpolines on our Broadwell system.
- 4.14.19 seems to be a tiny bit better than SUSE's 4.4 kernel. The old RHEL 7 kernel again is way behind.
Finally, we force the context switch rate up even further, to the maximum, by doing alternating reads and writes through two pipes, forcing the processes to run in an alternating manner per pipe read/write. This should show how badly PCID is needed and possibly even surface advantages from the more complete PCID implementation in 4.14.19. We run the test on the same hyperthread, so one HT needs to constantly switch between two processes and the kernel; see the sketch below. Parameters: -b 512 1000000 0 0
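Below is a minimal C sketch of this alternating (ping-pong) pattern -- again a hypothetical simplification of the actual test, with the CPU pinning from the previous sketch omitted for brevity:

```c
/* Hypothetical simplification of the alternating pipe test: the parent
 * sends a 512-byte block and waits for the child to echo it back through
 * a second pipe, forcing a context switch per read/write on a single HT. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    const long loops = 1000 * 1000;    /* 1M ping-pong rounds */
    char buf[512];
    int ab[2], ba[2];                  /* pipes: parent->child, child->parent */
    long i;
    struct timespec start, end;
    double ns;

    memset(buf, 'x', sizeof(buf));
    pipe(ab);
    pipe(ba);

    if (fork() == 0) {                 /* child: echo each block back */
        for (i = 0; i < loops; i++) {
            read(ab[0], buf, sizeof(buf));
            write(ba[1], buf, sizeof(buf));
        }
        _exit(0);
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < loops; i++) {      /* parent: send, wait for echo */
        write(ab[1], buf, sizeof(buf));
        read(ba[0], buf, sizeof(buf));
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    wait(NULL);

    ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per alternating 512B round trip\n", ns / loops);
    return 0;
}
```

Since neither side can proceed before the other has answered, every read/write pair forces a switch between the two processes, which is what makes this variant so sensitive to the cost of address space switches and thus to PCID.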
![Time for one 512B pipe write/read alternating (same hyperthread)][7]
Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
---|---|---|---|---|---|
None | 1370 | NA | 920 | 940 | 910 |
Retpol/IBRS | 4860 | 4210 | 1070 | 1110 | 1050 |
KPTI | 1880 | NA | 1480 | 1580 | 1230 |
Both | 5270 | 4760 | 1650 | 1720 | 1410 |
Both nopcid | 5270 | 4850 | 1810 | 1880 | 2000 |
Observations:
- Most of the above observations still hold true.
- The impact of the new microcode on the SLES12SP3 kernel is larger in this context-switch heavy workload. I assume this is caused by some usage of IBPB.
- The more complete PCID implementation in 4.14.19 indeed seems to provide advantages here -- the kPTI overhead is smaller in this scenario with 4.14.19. Switching PCID off (with nopcid) makes the impact of kPTI worse, by far most significantly so on 4.14.19.
- The relative performance loss is a bit smaller here compared with the benchmark that is less heavy on context switches. The reason is that context switches have always been a lot more expensive than plain system calls.
Conclusions
The Linux community has chosen retpolines for good reasons. Not only did intel fail to deliver working microcode for the CPU-based mitigation against Spectre-2, the performance impact was also really bad. Retpolines perform much, much better and appear to have a smaller impact than the well-understood KPTI patches. SUSE seems to have been well advised to follow the upstream approach with retpolines. Let's hope that the remaining open questions on the completeness of retpoline protection on the newest intel CPUs can be clarified or solved with minimal CPU support (which intel then hopefully succeeds in delivering in a stable way soon).
It should be noted that the pipe benchmark is roughly a worst case scenario -- this means that applications that process lots of small data packets can be expected to lose at worst about half of their performance when protected by Retpolines+KPTI in the kernel. (Note that additional mitigations against Spectre-1 and Spectre-2 in userspace might add further performance losses.) When protected by an IBRS+KPTI kernel (with supporting microcode), the losses in this worst case scenario could be up to 85% -- a 6.4x slowdown leaves only about 1/6.4 of the original throughput, i.e. 540% more time per operation if you count the reverse.
Real-world workloads will be better off; depending on the usage, you can have no measurable impact ... [Phoronix][8] has done some benchmarks that might be worth a look and here are some [newer Phoronix results][9] comparing unmitigated 4.14.0 with mitigated 4.15 and 4.16-git with mitigations on and off.
[Brendan Gregg][10] has done an excellent analysis that investigates the drivers of the performance loss caused by the mitigation patches; his work gives admins a way to classify their workloads and estimate the impact of the KPTI mitigation. Unfortunately, he has not yet looked at IBRS and Retpolines, but I expect that will change soon.