Impact of Meltdown kernel updates on Hercules performance
First data (2018-01-14)
The Kernel page-table isolation (KPTI) patches recently introduced to mitigate the Meltdown security vulnerability increase the overhead of system calls and will thus impact system performance. I wondered whether that can be seen with Hercules, and indeed there are cases where the instruction time increases by more than a factor of two!
I used the s370_perf instruction time benchmark, now available as GitHub project wfjm/s370-perf.
I ran the benchmark under MVS 3.8J with Hercules as included in tk4-, in a dual CPU configuration (NUMCPU=2 MAXCPU=2), before and after the updates fighting Spectre/Meltdown were installed.
The CS, CDS, and TS tests in the lock missed configuration show a clear effect: times are up by more than a factor of two, while all other tests stay the same within measurement precision. See the tags T292, T297, and T621 in the test reports
- https://github.com/wfjm/s370-perf/blob/master/data/2018-01-14_sys1-a.dat
- https://github.com/wfjm/s370-perf/blob/master/data/2018-01-14_sys1-b.dat
Summarized (instruction times in ns):

    Tag   Comment             :   before     after
    T292  LR;CS R,R,m (ne)    :   333.92    726.15
    T297  LR;CDS R,R,m (ne)   :   334.79    742.46
    T621  MVI;TS m (ones)     :   342.58    729.77
As said, all other instruction times are essentially unchanged.
What happened is easy to explain. The CS, CDS, and TS emulation code contains

    if (sysblk.cpus > 1) sched_yield();

to get spinlocks in the lock missed case handled efficiently. That's why the lock missed case shows a substantially slower instruction time than the lock taken case (which takes only about 80-90 ns). So this test is essentially a system call benchmark, and thus very sensitive to the KPTI patch.
Really nice to see this with such clarity. The practical impact for normal code is likely negligible though, that's why I resisted the temptation to title the thread 'Hercules a factor 2 slower' :).
More data and analysis (2018-01-28)
The Meltdown vulnerability is caused by a combination of
- out-of-order execution
- speculative execution
- sub-optimal handling of L1 cache and TLB
- which leads to delayed exceptions
- which allows a side-channel attack
The key culprit is the delayed exception handling. This is a feature of a concrete implementation of a processor architecture, not of the architecture itself. That is why, for example, Intel CPUs have this unfortunate feature, while AMD claims its CPUs do not.
It is the host CPU that is vulnerable, and of course not an emulated CPU. The side-channel attack requires good time resolution, so it's imho unlikely that System/390 code executed by Hercules can be either source or target of an attack.
What one sees is only the performance impact of the mitigation. The Kernel page-table isolation (KPTI) patches rolled out by all OS vendors slow down system calls; by how much depends on CPU generation and OS version. Newer Intel CPUs, Haswell or later, support Process Context Identifiers (PCID), and newer Kernels, like Linux 4.14.11 and later, can use this to reduce the performance impact of KPTI. In general, older CPUs with older OS versions will take a bigger performance hit than newer CPUs with newer Kernel versions.
The test case sys1 shown in the last posting was generated on
- Intel(R) Core(TM)2 Duo CPU E8400
- Ubuntu 16.04 LTS with a 4.4.0 Linux Kernel
A second test case, nbk2, was generated on
- Intel(R) Core(TM) i5 CPU M520
- Ubuntu 14.04 LTS with a 3.13.0 Linux Kernel
- VirtualBox 5.0.12 r104815
- Windows 7
The test reports are under
- https://github.com/wfjm/s370-perf/blob/master/data/2018-01-21_nbk2-a.dat
- https://github.com/wfjm/s370-perf/blob/master/data/2018-01-21_nbk2-b.dat
In this case one gets (instruction times in ns)

    Tag   Comment             :   before     after
    T292  LR;CS R,R,m (ne)    :  2291.28   3854.92
    T297  LR;CDS R,R,m (ne)   :  2295.46   3831.74
    T621  MVI;TS m (ones)     :  2320.39   3812.82
Comparing both systems with s370_perf_sum gives

    Tag   Comment             :   sys1-a    sys1-b    nbk2-a    nbk2-b
    T100  LR R,R              :     3.07      3.06      3.53      3.56
    T101  LA R,n              :     3.91      3.90      4.07      4.09
    T102  L R,m               :    12.81     12.80     11.86     11.90
    T110  ST R,m              :    12.79     12.79     12.32     12.23
    ...
    T292  LR;CS R,R,m (ne)    :   333.92    726.15   2291.28   3854.92
    T297  LR;CDS R,R,m (ne)   :   334.79    742.46   2295.46   3831.74
    T621  MVI;TS m (ones)     :   342.58    729.77   2320.39   3812.82
Observations are
- simple instructions, like LR, LA, L, or ST, have very similar speeds on both systems.
- lock misses are apparently more costly in a Linux under VirtualBox under Windows environment. Not too astonishing, most likely all three layers get into action to process the sched_yield().
- the relative KPTI patch impact is smaller on the nbk2 system, which is slow anyway. So hard to judge what's behind this.
Both systems likely fall into the 'old CPU' plus 'old Kernel' category and thus show the worst-case impact of the KPTI kernel patches.