Benchmarks
KPTI/IBRS/Retpoline Microbenchmarks
To evaluate the various mitigation approaches against Spectre-2 (BTI) and Meltdown-3, we have performed a number of measurements. These micro-benchmarks target the operations that are expected to be most affected by the workarounds for these CPU design bugs. All tests have been performed on instances in the Open Telekom Cloud, most of them on Bare Metal Flavors.
Old results
These results were produced in early January, after the Linux vendors had published kernels with the KPTI mitigation against Meltdown-3 and the IBRS/IBPB/STIBP mitigation based on new microcode-provided CPU features (microcode which intel later had to retract for many CPUs).
getppid() cycles | IBRS Hype | IBRS Kernel | Kernel version | nopti (pti=off) | pti=auto (default) | pti=auto nopcid |
---|---|---|---|---|---|---|
Xen E5-2658v3 | no | no | 4.4 | 150 | 630 | 1400 |
Xen E5-2658v3 | yes | no | 4.4 | 150 | 750 | 1700 |
KVM Gold6161 (v5) | no | no | 4.4 | 120 | 520 | 1150 |
KVM Gold6148 (v5) | yes | no | 4.4 | 130 | 560 | 1200 |
KVM Gold6161 (v5) | no | no | 3.10 | 210 | 600 | 1430 |
BMS E5-2667v4 | NA | no | 4.4 | 140 | 580 | 900 |
BMS E5-2667v4 | NA | yes | 4.4 | 2380 | 2850 | 3100 |
BMS E5-2667v4 | NA | no | 3.10 | 250 | 700 | 950 |
BMS E5-2667v4 | NA | yes | 3.10 | 2800 | 3350 | 3430 |
i5-4250U (v3) | NA | no | 4.4 | 115 | 550 | |
i5-4250U (v3) | NA | yes | 4.4 | 2150 | | |
As can be seen, the performance impact of KPTI is bad, very bad without PCID, and the impact of the IBRS-based Spectre-2 mitigation in the kernel is dramatic.
Meanwhile we have kernels and compilers which implement the retpoline mitigation mechanism. See below for some more numbers.
New System Call Measurements
The same [syscall test program][2] has been used as for the old results. The benchmark calls the getppid() system call in a tight loop and measures the time per call in ns. This is a worst case scenario for the mitigations: with KPTI, an address space switch has been added to the system call path to mitigate Meltdown-3; and in order to avoid branch targets being injected across privilege transitions, the caches for indirect jumps need to be protected on a system call -- both mitigations thus sit directly in the system call path. (The same happens for interrupts, which also normally incur a transition from userspace to kernel mode.)
We report ns here, as these numbers are rather constant even when the CPU frequency changes -- it looks like the time required for the operations the CPU performs when entering kernel space does not depend much on the CPU frequency. (Even the old results were obtained by taking the rather constant time measurements and multiplying by the nominal frequency of the CPU, not the current one.)
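The core of such a measurement is simple. Below is a minimal C sketch of the idea -- not the actual [syscall test program][2] linked above, just an illustration with an arbitrarily chosen loop count:

```c
/* Minimal sketch of a getppid() latency measurement; the loop count
 * and output format are arbitrary, this is not the actual test program. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
    const long loops = 10 * 1000 * 1000;    /* 10M getppid() calls */
    struct timespec start, end;
    long i;
    double ns;

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < loops; i++)
        getppid();    /* cheap syscall with almost no in-kernel work */
    clock_gettime(CLOCK_MONOTONIC, &end);

    ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns/syscall\n", ns / loops);
    return 0;
}
```

Since getppid() does almost no work inside the kernel, the measured time is dominated by the kernel entry/exit path -- exactly the path that KPTI and the Spectre-2 mitigations make more expensive.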
All new tests have been performed on a bare metal system in the Open Telekom Cloud. The system features a pair of Broadwell E5-2667v4 CPUs, either with microcode 0xb000021 (no uC) or with the later withdrawn 0xb000025 (w/ uC).
![System Call overhead][3]
With retpoline kernels now available (openSUSE 42.3, SLES12 SP3 and SP2 since early February 2018, 4.14.19), we can compare the mitigation versus the old IBRS results.
Miti:Kernel [ns/syscall] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
---|---|---|---|---|---|
None | 90 | NA | 45 | 45 | 45 |
IBRS/Retpol | 790 | 670 | 50 | 65 | 60 |
KPTI | 230 | NA | 170 | 190 | 200 |
Both | 960 | 820 | 190 | 230 | 220 |
Both nopcid | 960 | 880 | 270 | 315 | 330 |
Used kernel versions and settings:
- RHEL 7: Kernel 3.10.0-693.11.6.el7, microcode 0xb000025, None = ibpb/ibrs/pti_enabled=0/0/0, IBRS = 1/1/0, KPTI = 0/0/1, Both = 1/1/1.
- SLES12 SP3 January: kernel-default-4.4.103-6.38.1, microcode 0xb000025, None = pti=off boot parameter.
- SLES12 SP3 February no uC: kernel-default-4.4.114-94.11.3, microcode 0xb000021, None = pti=off spectre_v2=off boot parameters
- SLES12 SP3 February w/ uC: kernel-default-4.4.114-94.11.3, microcode 0xb000025, None = pti=off spectre_v2=off boot parameters
- 4.14.19 (4.8.5r): A vanilla 4.14.19 kernel (from linux-stable git), compiled with the retpoline enabled gcc-4.8.5 from SLES12 SP3, microcode 0xb000025, None = pti=off spectre_v2=off boot parameters.
The results can also be seen in the graph on the right.
There are a few noteworthy observations:
- The 3.10 kernel is systematically slower than 4.4+ for system calls. It seems that nopcid does not make a difference there.
- KPTI adds about 140 ns (2.6x) of system call overhead on the 3.10 kernel and about 150 ns (3.3x) on 4.x kernels.
- Without pcid, things would be significantly worse.
- IBRS Spectre-2 mitigation is incredibly expensive as can be seen by the RHEL-7 and SLES12 January results, with over 600ns loss (7.5 -- 14x) per system call.
- Not visible from the numbers shown here: enabling ibpb in the RHEL7 kernel has only a minimal performance impact, while the impact of ibrs is huge.
- Worst case performance loss is 770 ns (4.4.x) -- 870 ns (3.10.x) when both IBRS and KPTI are used.
- Retpolines, on the other hand, are much, much better than IBRS, as can be seen from the SLES12 February and 4.14.19 results; the overhead per system call is on the order of 15 ns (35%) and thus smaller than the difference between the older 3.10 and newer 4.4+ kernels.
- The new microcode does have a small impact even on retpoline kernels; the assumption is that there is minimal usage of ibpb on Broadwell CPUs to fully mitigate Spectre-2.
- The SUSE kernel is pretty much in line with the latest 4.14.19.
- Worst case performance loss on 4.4+ kernels is ~180 ns; this still means roughly 4x fewer system calls per second are possible.
New Pipe Context Switch Measurements
A second, equally simple benchmark has been created: writing a small amount of data into a [pipe][4] and reading it back in another process. As the pipe buffer is small, this also results in a huge number of system calls. In addition, there will be a lot of context switches -- so we should see how well the PCID feature works in the presence of multiple processes; the minimal PCID support in the KAISER patches prior to 4.14 only enables two context IDs -- one for the kernel and one for the userspace process -- which will not work as well when more than one userspace process is involved.
This is a less theoretical worst case scenario: sending lots of small pieces of data between processes via a pipe or via the network will still expose the performance losses caused by the mitigations, but since real data is being transferred, it is a bit more realistic than the pure system call loop.
![Time for one round of 512B pipe write/read in us (same thread)][5]
The program binds the reader and writer ends of the pipe to a CPU (hyperthread = HT).
The test is first run forcing both ends onto the same HT. This results in the worst performance, as the single HT needs to do lots of context switches.
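For illustration, here is a minimal C sketch of the idea behind this benchmark -- a hypothetical simplification, not the actual [pipe test program][4]: a parent process writes 512-byte blocks into a pipe, a forked child reads them back out, and both processes are pinned to hyperthreads given on the command line.

```c
/* Hypothetical simplification of the pipe write/read benchmark:
 * parent writes 512-byte blocks, child reads them, both pinned to a HT. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

static void pin_to_cpu(int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    sched_setaffinity(0, sizeof(set), &set);    /* bind calling process */
}

int main(int argc, char **argv)
{
    const long loops = 5 * 1000 * 1000;         /* 5M writes/reads */
    int wcpu = argc > 1 ? atoi(argv[1]) : 0;    /* writer HT */
    int rcpu = argc > 2 ? atoi(argv[2]) : 0;    /* reader HT */
    char buf[512];
    int fd[2];
    long i;
    struct timespec start, end;
    double ns;

    memset(buf, 'x', sizeof(buf));
    pipe(fd);

    if (fork() == 0) {                          /* child: reader */
        pin_to_cpu(rcpu);
        close(fd[1]);
        while (read(fd[0], buf, sizeof(buf)) > 0)
            ;
        _exit(0);
    }

    pin_to_cpu(wcpu);                           /* parent: writer */
    close(fd[0]);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < loops; i++)
        write(fd[1], buf, sizeof(buf));
    clock_gettime(CLOCK_MONOTONIC, &end);
    close(fd[1]);
    wait(NULL);

    ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per 512B write/read\n", ns / loops);
    return 0;
}
```

Because the pipe buffer is small, the writer quickly blocks until the reader has drained the pipe, so the elapsed time covers both the write and the read side. The real test program additionally takes the block size and iteration count on the command line (see the parameters quoted below).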
Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
---|---|---|---|---|---|
None | 890 | NA | 455 | 455 | 470 |
Retpol/IBRS | 3290 | 2560 | 555 | 555 | 540 |
KPTI | 1490 | NA | 740 | 800 | 770 |
Both | 3700 | 2900 | 830 | 910 | 880 |
Both nopcid | 3700 | 3120 | 1120 | 1250 | 1260 |
Parameters used: 512 5000000 0 0 -- this does 5M 512-byte pipe writes and reads and binds both the reader and the writer process to hyperthread 0.
The test is repeated using two hyperthreads of the same CPU core, benefiting from cache effects while avoiding context switches. The reported time is the elapsed time; the CPU time is higher, as two CPUs (HTs) are involved. Note that in most (but not all!) scenarios, exchanging data via the HTs of the same core gives the best performance. Parameters: 512 5000000 0 16 -- 16 is the sibling HT of 0.
![Time for one round of 512B pipe write/read in us (sibling thread)][6]
Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
---|---|---|---|---|---|
None | 550 | NA | 335 | 335 | 335 |
Retpol/IBRS | 2000 | 1615 | 375 | 385 | 370 |
KPTI | 870 | NA | 475 | 495 | 490 |
Both | 2140 | 1695 | 520 | 560 | 540 |
Both nopcid | 2325 | 1900 | 780 | 850 | 890 |
The settings are the same as in the above system call benchmarks.
Observations:
- The relative performance degradation from IBRS is roughly the same for both cases and huge again: 3.5x -- 5.5x slower execution with IBRS alone and 4.2x -- 6.4x slower with IBRS and KPTI. The absolute slowdown is worse for the single-HT case, while the relative slowdown varies between the two scenarios.
- Retpolines are much nicer again -- combined with KPTI, we get a 1.66x -- 2x slowdown.
- The new microcode does have a small visible effect on retpolines on our Broadwell system.
- 4.14.19 seems to be a tiny bit better than SUSE's 4.4 kernel. The old RHEL 7 kernel again is way behind.
Finally, we force the context switch rate up even further, to the maximum, by doing alternating reads and writes through two pipes, forcing the processes to run in an alternating manner per pipe read/write. This should show how badly PCID is needed and possibly even surface advantages from the more complete PCID implementation in 4.14.19. We run the test on the same hyperthread, so one HT needs to constantly switch between two processes and the kernel; see the sketch below. Parameters: -b 512 1000000 0 0
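Below is a minimal C sketch of this alternating (ping-pong) pattern -- again a hypothetical simplification of the actual test, with the CPU pinning from the previous sketch omitted for brevity:

```c
/* Hypothetical simplification of the alternating pipe test: the parent
 * sends a 512-byte block and waits for the child to echo it back through
 * a second pipe, forcing a context switch per read/write on a single HT. */
#include <stdio.h>
#include <string.h>
#include <time.h>
#include <unistd.h>
#include <sys/wait.h>

int main(void)
{
    const long loops = 1000 * 1000;    /* 1M ping-pong rounds */
    char buf[512];
    int ab[2], ba[2];                  /* pipes: parent->child, child->parent */
    long i;
    struct timespec start, end;
    double ns;

    memset(buf, 'x', sizeof(buf));
    pipe(ab);
    pipe(ba);

    if (fork() == 0) {                 /* child: echo each block back */
        for (i = 0; i < loops; i++) {
            read(ab[0], buf, sizeof(buf));
            write(ba[1], buf, sizeof(buf));
        }
        _exit(0);
    }

    clock_gettime(CLOCK_MONOTONIC, &start);
    for (i = 0; i < loops; i++) {      /* parent: send, wait for echo */
        write(ab[1], buf, sizeof(buf));
        read(ba[0], buf, sizeof(buf));
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    wait(NULL);

    ns = (end.tv_sec - start.tv_sec) * 1e9 + (end.tv_nsec - start.tv_nsec);
    printf("%.1f ns per alternating 512B round trip\n", ns / loops);
    return 0;
}
```

Since neither side can proceed before the other has answered, every read/write pair forces a switch between the two processes, which is what makes this variant so sensitive to the cost of address space switches and thus to PCID.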
![Time for one 512B pipe write/read alternating (same hyperthread)][7]
Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
---|---|---|---|---|---|
None | 1370 | NA | 920 | 940 | 910 |
Retpol/IBRS | 4860 | 4210 | 1070 | 1110 | 1050 |
KPTI | 1880 | NA | 1480 | 1580 | 1230 |
Both | 5270 | 4760 | 1650 | 1720 | 1410 |
Both nopcid | 5270 | 4850 | 1810 | 1880 | 2000 |
Observations:
- Most of the above observations still hold true.
- The impact of the new microcode on the SLES12SP3 kernel is larger in this context-switch heavy workload. I assume this is caused by some usage of IBPB.
- The more complete PCID implementation in 4.14.19 indeed seems to provide advantages here -- the kPTI overhead is smaller in this scenario with 4.14.19. Switching PCID off (with nopcid) makes the impact of kPTI worse, by far most significantly so on 4.14.19.
- The relative performance loss is a bit smaller here compared with the benchmark that is less heavy on context switches. The reason is that context switches have always been a lot more expensive than plain system calls.
Conclusions
The Linux community has chosen retpolines for good reasons. Not only did intel fail to deliver working microcode for the CPU-based mitigation against Spectre-2, the performance impact was also really bad. Retpolines perform much, much better and appear to have a smaller impact than the well-understood KPTI patches. SUSE seems to have been well advised to follow the upstream approach with retpolines. Let's hope that the remaining open questions on the completeness of retpoline protection on the newest intel CPUs can be clarified or solved with minimal CPU support (which intel then hopefully succeeds in delivering in a stable way soon).
It should be noted that the pipe benchmark is roughly a worst case scenario -- this means that applications that process lots of small data packets can be expected to lose at worst about half of their performance when protected by Retpolines+KPTI in the kernel. (Note that additional mitigations against Spectre-1 and Spectre-2 in userspace might add further performance losses.) When protected by an IBRS+KPTI kernel (with supporting microcode), the losses in this worst case scenario could be up to 85% -- a 6.4x slowdown leaves only about 1/6.4 of the original throughput, i.e. 540% more time per operation if you count the reverse.
Real-world workloads will be better off; depending on the usage, you can have no measurable impact ... [Phoronix][8] has done some benchmarks that might be worth a look and here are some [newer Phoronix results][9] comparing unmitigated 4.14.0 with mitigated 4.15 and 4.16-git with mitigations on and off.
[Brendan Gregg][10] has done an excellent analysis that investigates the drivers of the performance loss caused by the mitigation patches; his work gives admins a way to classify their workloads and estimate the impact of the KPTI mitigation. Unfortunately, he has not yet looked at IBRS and Retpolines, but I expect that will change soon.