by Kurt Garloff

Benchmarks

KPTI/IBRS/Retpoline Microbenchmarks

To evaluate the various mitigation approaches against Spectre-2 (BTI) and Meltdown-3, we have performed a number of measurements. These are micro-benchmarks that try to benchmark operations that are supposedly most affected by the workarounds to mitigate these CPU design bugs. All tests have been performed on instances in the Open Telekom Cloud, most of them on Bare Metal Flavors.

Old results

These results were produced in early January 2018, after the Linux vendors had published kernels with the KPTI mitigation against Meltdown-3 and with IBRS/IBPB/STIBP mitigations based on CPU features provided by new microcode (which intel had to retract for many CPUs).

| System [getppid() cycles] | IBRS Hypervisor | IBRS Kernel | Kernel version | nopti (pti=off) | pti=auto (default) | pti=auto nopcid |
|---------------------------|-----------------|-------------|----------------|-----------------|--------------------|-----------------|
| Xen E5-2658v3             | no              | no          | 4.4            | 150             | 630                | 1400            |
| Xen E5-2658v3             | yes             | no          | 4.4            | 150             | 750                | 1700            |
| KVM Gold6161 (v5)         | no              | no          | 4.4            | 120             | 520                | 1150            |
| KVM Gold6148 (v5)         | yes             | no          | 4.4            | 130             | 560                | 1200            |
| KVM Gold6161 (v5)         | no              | no          | 3.10           | 210             | 600                | 1430            |
| BMS E5-2667v4             | NA              | no          | 4.4            | 140             | 580                | 900             |
| BMS E5-2667v4             | NA              | yes         | 4.4            | 2380            | 2850               | 3100            |
| BMS E5-2667v4             | NA              | no          | 3.10           | 250             | 700                | 950             |
| BMS E5-2667v4             | NA              | yes         | 3.10           | 2800            | 3350               | 3430            |
| i5-4250U (v3)             | NA              | no          | 4.4            | 115             | 550                |                 |
| i5-4250U (v3)             | NA              | yes         | 4.4            | 2150            |                    |                 |

As can be seen, the performance impact of KPTI is bad, very bad without PCID, and the impact of IBRS-based Spectre-2 mitigation in the kernel is dramatic.

Meanwhile we have kernels and compilers which implement the retpoline mitigation mechanism. See below for some more numbers.

New System Call Measurements

The same [syscall test program][2] has been used as for the old results. The benchmark calls the getppid() system call in a tight loop and measures the time per call in ns. This is a worst-case scenario for the mitigations: with KPTI, an address space switch has been added to the system call path to mitigate Meltdown-3, and in order to avoid branch targets being injected across privilege transitions, the caches for indirect jumps need to be protected on every system call -- both mitigations thus sit directly in the system call path. (The same happens for interrupts, which also normally incur a transition from userspace to kernel mode.)
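The linked test program is the authoritative source; the following is only a minimal sketch of the idea, with an assumed iteration count and clock_gettime()-based timing rather than the original program's measurement code:

```c
/* Minimal sketch of a getppid() syscall micro-benchmark -- not the original
 * test program, just the general idea. */
#include <stdio.h>
#include <time.h>
#include <unistd.h>

int main(void)
{
        const long loops = 10 * 1000 * 1000;   /* assumed iteration count */
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < loops; i++)
                (void)getppid();               /* trivial system call */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.1f ns/syscall\n", ns / loops);
        return 0;
}
```

Compiled with gcc -O2, such a loop spends nearly all of its time in the user/kernel transition, which is exactly where KPTI and the Spectre-2 mitigations add their cost.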

We report ns here, as these values are rather constant even when the CPU frequency changes -- it looks like the time required for the operations the CPU performs when entering kernel space does not depend very much on the CPU frequency. (Even the old results were obtained by taking the rather constant time measurements and multiplying them by the nominal frequency of the CPU, not the current one. For example, the 140 cycles measured on the BMS E5-2667v4 at its nominal 3.2 GHz correspond to roughly 44 ns, in line with the ~45 ns measured below.)

All new tests have been performed on a bare metal system in the Open Telekom Cloud. The system features a pair of Broadwell E5-2667v4 CPUs, with either microcode 0xb000021 ("no uC", i.e. without the new microcode) or the withdrawn 0xb000025 ("w/ uC").

![System Call overhead][3]

With retpoline kernels now available (openSUSE 42.3, SLES12 SP3 and SP2 since early February 2018, and vanilla 4.14.19), we can compare this mitigation against the old IBRS results.

| Miti:Kernel [ns/syscall] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
|--------------------------|-------|-------------------|---------------------|---------------------|------------------|
| None                     | 90    | NA                | 45                  | 45                  | 45               |
| IBRS/Retpol              | 790   | 670               | 50                  | 65                  | 60               |
| KPTI                     | 230   | NA                | 170                 | 190                 | 200              |
| Both                     | 960   | 820               | 190                 | 230                 | 220              |
| Both nopcid              | 960   | 880               | 270                 | 315                 | 330              |

Used kernel versions and settings:

  • RHEL 7: Kernel 3.10.0-693.11.6.el7, microcode 0xb000025, None = ibpb/ibrs/pti_enabled=0/0/0, IBRS = 1/1/0, KPTI = 0/0/1, Both = 1/1/1.
  • SLES12 SP3 January: kernel-default-4.4.103-6.38.1, microcode 0xb000025, None = pti=off boot parameter.
  • SLES12 SP3 February no uC: kernel-default-4.4.114-94.11.3, microcode 0xb000021, None = pti=off spectre_v2=off boot parameters
  • SLES12 SP3 February w/ uC: kernel-default-4.4.114-94.11.3, microcode 0xb000025, None = pti=off spectre_v2=off boot parameters
  • 4.14.19 (4.8.5r): A vanilla 4.14.19 kernel (from linux-stable git), compiled with the retpoline enabled gcc-4.8.5 from SLES12 SP3, microcode 0xb000025, None = pti=off spectre_v2=off boot parameters.
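For reference, the effective settings can be inspected at runtime: the ibpb/ibrs/pti_enabled switches mentioned above live in debugfs on RHEL 7 (assumed to be mounted at /sys/kernel/debug), and newer kernels additionally report the mitigation state under /sys/devices/system/cpu/vulnerabilities/ (whether a given distribution kernel from early 2018 already has that interface varies). A small sketch that simply dumps whichever of these files exist:

```c
/* Sketch: print the mitigation knobs/status files discussed above, if present.
 * The exact set of files is kernel dependent; missing ones are silently skipped. */
#include <stdio.h>
#include <string.h>

static void dump(const char *path)
{
        char buf[256];
        FILE *f = fopen(path, "r");

        if (!f)
                return;                         /* knob not present on this kernel */
        if (fgets(buf, sizeof(buf), f)) {
                buf[strcspn(buf, "\n")] = 0;
                printf("%-55s %s\n", path, buf);
        }
        fclose(f);
}

int main(void)
{
        /* RHEL 7 style debugfs switches (ibpb/ibrs/pti_enabled) */
        dump("/sys/kernel/debug/x86/pti_enabled");
        dump("/sys/kernel/debug/x86/ibrs_enabled");
        dump("/sys/kernel/debug/x86/ibpb_enabled");
        /* Upstream sysfs status files (newer kernels) */
        dump("/sys/devices/system/cpu/vulnerabilities/meltdown");
        dump("/sys/devices/system/cpu/vulnerabilities/spectre_v2");
        return 0;
}
```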

The results can also be seen in the graph on the right.

There are a few noteworthy observations:

  • The 3.10 kernel is systematically slower than 4.4+ for system calls. It seems that nopcid does not make a difference there.
  • KPTI adds about 140 ns (2.6x) to each system call on the 3.10 kernel and about 150 ns (3.3x) on 4.x kernels.
  • Without pcid, things would be significantly worse.
  • IBRS-based Spectre-2 mitigation is incredibly expensive, as can be seen from the RHEL 7 and SLES12 January results, with over 600 ns of loss (7.5 -- 14x) per system call.
  • Not visible from the numbers shown here: enabling ibpb in the RHEL 7 kernel has only a minimal performance impact, while the impact of ibrs is huge.
  • Worst-case performance loss is 770 ns (4.4.x) -- 870 ns (3.10.x) when both IBRS and KPTI are used.
  • Retpolines, on the other hand, are much much better than IBRS, as can be seen from the SLES12 February and 4.14.19 results; the overhead per system call is on the order of 15 ns (35%) and thus smaller than the difference between the older 3.10 and newer 4.4+ kernels.
  • The new microcode does have a small impact even on retpoline kernels; the assumption is that there is minimal usage of ibpb on Broadwell CPUs to fully mitigate Spectre-2.
  • The SUSE kernel is pretty much in line with latest 4.14.19.
  • Worst-case performance loss on 4.4+ kernels is ~180 ns; still, only about a fourth (4x fewer) of the system calls per second are possible.

New Pipe Context Switch Measurements

A second, equally simple benchmark has been created: writing a small amount of data into a [pipe][4] and reading it back from another process. As the pipe buffer is small, this also results in a huge number of system calls. In addition, there will be a lot of context switches -- so we should see how well the PCID feature works in the presence of multiple processes; the minimal PCID support in the KAISER patches (as used in the 4.4 kernels) only enables two context IDs -- one for the kernel and one for the userspace process -- which will not work as well when more than one userspace process is involved.

This is a less theoretical worst-case scenario -- sending lots of small pieces of data between processes via a pipe or via the network will still expose the performance losses caused by the mitigations, but as real data is being sent, it is a bit more realistic.

![Time for one round of 512B pipe write/read in us (same thread)][5]

The program binds the reader and writer ends of the pipe to a CPU (hyperthread = HT).

The test is done once forcing both ends to run on the same HT. This results in the worst performance, as the HT needs to do lots of context switches.
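The actual [pipe test program][4] is linked above; the sketch below only illustrates the mechanism. The block size and loop count match the parameters quoted further down, while the names, structure and CPU numbers are assumptions made for illustration:

```c
/* Sketch of the pipe test: the writer pushes 512-byte blocks into a pipe,
 * a forked reader process pulls them back out; both are pinned to a
 * hyperthread with sched_setaffinity().  Not the original test program. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define BLK        512        /* bytes per write/read */
#define LOOPS      5000000L   /* number of blocks, as in "512 5000000 0 0" */
#define READER_CPU 0          /* HT for the reader */
#define WRITER_CPU 0          /* 0 = same HT (worst case); 16 = sibling HT */

static void pin_to_cpu(int cpu)
{
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        if (sched_setaffinity(0, sizeof(set), &set))
                perror("sched_setaffinity");
}

int main(void)
{
        int pfd[2];
        char buf[BLK];
        struct timespec t0, t1;

        memset(buf, 0, sizeof(buf));
        if (pipe(pfd)) { perror("pipe"); return 1; }

        if (fork() == 0) {                    /* child: reader */
                pin_to_cpu(READER_CPU);
                close(pfd[1]);
                while (read(pfd[0], buf, BLK) > 0)
                        ;                     /* drain until the writer closes */
                return 0;
        }

        pin_to_cpu(WRITER_CPU);               /* parent: writer */
        close(pfd[0]);
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < LOOPS; i++)
                if (write(pfd[1], buf, BLK) != BLK)
                        break;
        close(pfd[1]);                        /* EOF for the reader */
        wait(NULL);                           /* wait until everything is read */
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        printf("%.0f ns per %dB write/read\n", ns / LOOPS, BLK);
        return 0;
}
```

Running this once with the writer on the reader's HT and once on the sibling HT corresponds to the two scenarios measured below.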

| Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
|--------------------------|-------|-------------------|---------------------|---------------------|------------------|
| None                     | 890   | NA                | 455                 | 455                 | 470              |
| Retpol/IBRS              | 3290  | 2560              | 555                 | 555                 | 540              |
| KPTI                     | 1490  | NA                | 740                 | 800                 | 770              |
| Both                     | 3700  | 2900              | 830                 | 910                 | 880              |
| Both nopcid              | 3700  | 3120              | 1120                | 1250                | 1260             |

Parameters used: 512 5000000 0 0 -- this does 5M 512-byte pipe writes and reads and binds both the reader and the writer process to hyperthread 0.

The test is repeated using two hyperthreads of the same CPU core, benefiting from cache effects while avoiding context switches. The reported time is the elapsed (wall clock) time; the consumed CPU time is higher, as two CPUs (HTs) are involved. Note that in most (but not all!) scenarios, exchanging data between HTs of the same core gives the best performance. Parameters: 512 5000000 0 16 -- 16 is the sibling HT of 0.

![Time for one round of 512B pipe write/read in us (sibling thread)][6]

| Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
|--------------------------|-------|-------------------|---------------------|---------------------|------------------|
| None                     | 550   | NA                | 335                 | 335                 | 335              |
| Retpol/IBRS              | 2000  | 1615              | 375                 | 385                 | 370              |
| KPTI                     | 870   | NA                | 475                 | 495                 | 490              |
| Both                     | 2140  | 1695              | 520                 | 560                 | 540              |
| Both nopcid              | 2325  | 1900              | 780                 | 850                 | 890              |

The settings are the same as in the above system call benchmarks.

Observations:

  • The relative performance degradation from IBRS is roughly the same for both cases and again huge: 3.5x -- 5.5x slower execution with IBRS alone and 4.2x -- 6.4x slower with IBRS and KPTI. The absolute slowdown is worse for the single-HT case, while the relative slowdown varies.
  • Retpolines are much nicer again -- combined with KPTI, we get a 1.66x -- 2x slowdown.
  • The new microcode does have a small visible effect on retpolines on our Broadwell system.
  • 4.14.19 seems to be a tiny bit better than SUSE's 4.4 kernel. The old RHEL 7 kernel again is way behind.

Finally, we force the kernel to drive the context switch rate to the maximum by doing alternating reads and writes through two pipes, forcing the two processes to run in an alternating manner for every pipe read/write. This should show how badly PCID is needed and possibly even surface advantages from the more complete PCID implementation in 4.14.19. We run the test on the same hyperthread, so one HT needs to constantly switch between two processes and the kernel. Parameters: -b 512 1000000 0 0
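Only as a sketch of that alternation (not the original program's -b mode; pinning both processes to one HT and the timing work as in the previous sketch and are omitted here):

```c
/* Sketch of the alternating variant: two pipes, each side writes a 512-byte
 * block into one pipe and then blocks reading the answer from the other, so
 * every write/read pair forces a context switch when both processes are
 * pinned to the same hyperthread. */
#include <stdio.h>
#include <string.h>
#include <sys/wait.h>
#include <unistd.h>

#define BLK   512
#define LOOPS 1000000L        /* as in "-b 512 1000000 0 0" */

int main(void)
{
        int ab[2], ba[2];                     /* pipe A->B and pipe B->A */
        char buf[BLK];

        memset(buf, 0, sizeof(buf));
        if (pipe(ab) || pipe(ba)) { perror("pipe"); return 1; }

        if (fork() == 0) {                    /* process B: echo everything back */
                for (long i = 0; i < LOOPS; i++) {
                        if (read(ab[0], buf, BLK) != BLK) break;
                        if (write(ba[1], buf, BLK) != BLK) break;
                }
                return 0;
        }
        for (long i = 0; i < LOOPS; i++) {    /* process A: write, wait for the echo */
                if (write(ab[1], buf, BLK) != BLK) break;
                if (read(ba[0], buf, BLK) != BLK) break;
        }
        wait(NULL);
        return 0;
}
```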

![Time for one 512B pipe write/read alternating (same hyperthread)][7]

| Miti:Kernel [ns/512B rw] | RHEL7 | SLES12SP3 January | SLES12SP3 Feb no uC | SLES12SP3 Feb w/ uC | 4.14.19 (4.8.5r) |
|--------------------------|-------|-------------------|---------------------|---------------------|------------------|
| None                     | 1370  | NA                | 920                 | 940                 | 910              |
| Retpol/IBRS              | 4860  | 4210              | 1070                | 1110                | 1050             |
| KPTI                     | 1880  | NA                | 1480                | 1580                | 1230             |
| Both                     | 5270  | 4760              | 1650                | 1720                | 1410             |
| Both nopcid              | 5270  | 4850              | 1810                | 1880                | 2000             |

Observations:

  • Most of the above observations still hold true.
  • The impact of the new microcode on the SLES12SP3 kernel is larger in this context-switch heavy workload. I assume this is caused by some usage of IBPB.
  • The more complete PCID implementation in 4.14.19 indeed seems to provide advantages here -- the KPTI overhead is smaller in this scenario with 4.14.19. Switching PCID off (with nopcid) makes the impact of KPTI worse, by far the most significantly so on 4.14.19.
  • The relative performance loss is a bit smaller here compared with the benchmark that is less heavy on context switches. The reason is that context switches have always been a lot more expensive than plain system calls.

Conclusions

The Linux community has chosen retpolines for good reasons. Not only did intel fail to deliver working microcode for the CPU-based mitigation against Spectre-2, but the performance impact was really bad. Retpolines do perform much much better and appear to have a smaller impact than the well understood KPTI patches. SUSE seems to have been well advised to follow the upstream approach with retpolines. Let's hope that the remaining open questions on the completeness of retpoline protection on the newest intel CPUs can be clarified or solved with minimal CPU support (which intel then hopefully succeeds in delivering in a stable way soon).

It should be noted that the pipe benchmark is roughly a worst-case scenario -- this means that we can expect applications that process lots of small data packets to lose at worst half of their performance when protected by Retpolines+KPTI in the kernel. (Note that additional mitigations against Spectre-1 and Spectre-2 in userspace might add further performance losses.) When protected by an IBRS+KPTI kernel (with supporting microcode), the performance loss in this worst-case scenario could be up to 85% (or 540% if you count the reverse).

Real-world workloads will be better off; depending on the usage, you can have no measurable impact ... [Phoronix][8] has done some benchmarks that might be worth a look and here are some [newer Phoronix results][9] comparing unmitigated 4.14.0 with mitigated 4.15 and 4.16-git with mitigations on and off.

[Brendan Gregg][10] has done an excellent analysis that investigates the drivers of the performance loss caused by the mitigation patches; his work gives admins a way to classify their workload and estimate the impact of the KPTI mitigation. Unfortunately, he has not looked at IBRS and retpolines yet, but I expect that will change soon.