Meltdown patches and Hercules performance

Impact of Meltdown kernel updates on Hercules performance

First data (2018-01-14)

The Kernel page-table isolation (KPTI) patches recently introduced to mitigate the Meltdown security vulnerability increase the overhead of system calls and will thus impact system performance. I wondered whether that can be seen with Hercules, and indeed there are cases where the instruction time increases by more than a factor of two!

I used the s370_perf instruction time benchmark, now available as GitHub project wfjm/s370_perf.

I ran the benchmark under MVS 3.8J with Hercules as included in tk4-, in a dual CPU configuration (NUMCPU=2 MAXCPU=2), before and after the updates fighting Spectre/Meltdown were installed. The CS, CDS, and TS tests in the lock missed configuration show a clear effect: their times are up by more than a factor of two, while all other tests stay the same within measurement precision. See the test reports

and inspect tests T292, T297, and T621. In summary (instruction times in ns)

Tag   Comment                :    before     after
T292  LR;CS R,R,m (ne)       :    333.92    726.15
T297  LR;CDS R,R,m (ne)      :    334.79    742.46
T621  MVI;TS m (ones)        :    342.58    729.77

As said, all other instruction times are essentially unchanged. What happened is easy to explain. The CS, CDS, and TS emulation code contains

  if (sysblk.cpus > 1)  sched_yield();

to get spinlocks handled efficiently in the lock missed case. That's why the lock missed case shows a substantially slower instruction time than the lock taken case (which takes only about 80-90 ns). So this test is essentially a system call benchmark, and thus very sensitive to the KPTI patch.
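
To check how expensive a sched_yield() call is on a given host, the system call can be timed in isolation. The following is a minimal sketch in Python, not part of s370_perf or Hercules; the function name time_sched_yield is made up for this illustration, and it assumes a Linux host where os.sched_yield() is available. The result includes Python's own call overhead, so it is only useful for comparing the same machine before and after a kernel update, not for comparing with the Hercules numbers.

```python
import os
import time


def time_sched_yield(iterations=100_000):
    """Return the mean wall-clock time of one os.sched_yield() call in ns.

    Hypothetical helper for illustration; assumes a Unix host that
    provides sched_yield().
    """
    # Time an empty loop first so the interpreter's loop overhead
    # can be subtracted from the measurement.
    start = time.perf_counter_ns()
    for _ in range(iterations):
        pass
    loop_overhead = time.perf_counter_ns() - start

    # Each iteration issues exactly one sched_yield() system call.
    start = time.perf_counter_ns()
    for _ in range(iterations):
        os.sched_yield()
    elapsed = time.perf_counter_ns() - start

    return max(elapsed - loop_overhead, 0) / iterations
```

Calling time_sched_yield() once with the old kernel and once with the patched kernel should show a shift similar in kind to the one seen in T292, T297, and T621, if the system call overhead is indeed the dominant cost.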

Really nice to see this with such clarity. The practical impact for normal code is likely negligible though, that's why I resisted the temptation to title the thread 'Hercules a factor 2 slower' :).

More data and analysis (2018-01-28)

The Meltdown vulnerability is caused by a combination of

out-of-order execution
speculative execution
sub-optimal handling of L1 cache and TLB
which leads to delayed exceptions
which allows a side-channel attack

The key culprit is the delayed exception handling. It is a property of a concrete implementation of a processor architecture, not of the architecture itself. That is why, for example, Intel CPUs have this unfortunate property, while AMD claims its CPUs do not.

The host CPU is vulnerable, and of course not an emulated CPU. The side-channel attack requires good time resolution, so it's imho unlikely that System/390 code executed by Hercules can be either source or target of an attack.

What one sees is only the performance impact of the mitigation. The Kernel page-table isolation (KPTI) patches rolled out by all OS vendors slow down system calls; by how much depends on CPU generation and OS version. Newer Intel CPUs, Haswell or later, support Process Context Identifiers (PCID), and newer kernels, like Linux 4.14.11 and later, can use them to reduce the performance impact of KPTI. In general, older CPUs with older OS versions take a bigger performance hit than newer CPUs with newer kernel versions.
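
Whether a given Linux host falls into the 'PCID available' category can be read from the flags line of /proc/cpuinfo. The following is a small sketch under a few assumptions: an x86 Linux host, the flag names pcid and invpcid for the CPU features, and pti (mainline kernels) or kaiser (some older backports) as the marker that page-table isolation is active. The function names are made up for this illustration.

```python
def cpu_flags(cpuinfo_text):
    """Return the feature flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def kpti_summary(cpuinfo_text):
    """Summarize the KPTI-related flags (hypothetical helper, x86 Linux only)."""
    flags = cpu_flags(cpuinfo_text)
    return {
        "pcid": "pcid" in flags,        # CPU supports Process Context Identifiers
        "invpcid": "invpcid" in flags,  # CPU supports the INVPCID instruction
        # Assumption: 'pti' is shown by mainline kernels, 'kaiser' by some backports.
        "kpti_active": bool(flags & {"pti", "kaiser"}),
    }
```

On a Linux host it can be used as kpti_summary(open("/proc/cpuinfo").read()). A result with kpti_active true and pcid false would indicate the worst case described above.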

The test case sys1 shown in the last posting was generated on

Intel(R) Core(TM)2 Duo CPU E8400
Ubuntu 16.04 LTS with a 4.4.0 Linux Kernel

I've done another test case nbk2 on

Intel(R) Core(TM) i5 CPU M520
Ubuntu 14.04 LTS with a 3.13.0 Linux Kernel
VirtualBox 5.0.12 r104815
Windows 7

The test reports are under

In this case one gets (instruction times in ns)

Tag   Comment                :    before     after
T292  LR;CS R,R,m (ne)       :   2291.28   3854.92
T297  LR;CDS R,R,m (ne)      :   2295.46   3831.74
T621  MVI;TS m (ones)        :   2320.39   3812.82

Comparing both systems with s370_perf_sum gives

Tag   Comment                :    sys1-a    sys1-b    nbk2-a    nbk2-b
T100  LR R,R                 :      3.07      3.06      3.53      3.56
T101  LA R,n                 :      3.91      3.90      4.07      4.09
T102  L R,m                  :     12.81     12.80     11.86     11.90
T110  ST R,m                 :     12.79     12.79     12.32     12.23
...
T292  LR;CS R,R,m (ne)       :    333.92    726.15   2291.28   3854.92
T297  LR;CDS R,R,m (ne)      :    334.79    742.46   2295.46   3831.74
T621  MVI;TS m (ones)        :    342.58    729.77   2320.39   3812.82

Observations are

simple instructions, like LR, LA, L, or ST, have very similar speeds on both systems.
lock misses are apparently much more costly when Linux runs under VirtualBox under Windows. That is not too astonishing; most likely all three layers get involved in processing the sched_yield().
the relative KPTI patch impact is smaller on the nbk2 system (a factor of about 1.7 versus about 2.2 on sys1), but lock misses are slow there anyway, so it is hard to judge what is behind this.

Both systems likely fall into the 'old CPU' plus 'old Kernel' category and thus show the worst-case impact of the KPTI kernel patches.
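
The relative and absolute KPTI impact on both systems can be recomputed from the lock missed times in the table above. This is just a small sketch of the arithmetic; the numbers are copied from the table, while the names TIMES_NS and kpti_impact are made up for this illustration.

```python
# Lock missed instruction times in ns as (before, after), copied from the table above.
TIMES_NS = {
    "sys1": {"T292": (333.92, 726.15), "T297": (334.79, 742.46), "T621": (342.58, 729.77)},
    "nbk2": {"T292": (2291.28, 3854.92), "T297": (2295.46, 3831.74), "T621": (2320.39, 3812.82)},
}


def kpti_impact(before, after):
    """Return (slowdown factor, absolute increase in ns)."""
    return after / before, after - before


for system, tests in TIMES_NS.items():
    for tag, (before, after) in tests.items():
        factor, delta = kpti_impact(before, after)
        print(f"{system} {tag}: x{factor:.2f}  +{delta:.0f} ns")
```

The factors come out at roughly 2.1 to 2.2 for sys1 and 1.6 to 1.7 for nbk2, while the absolute increase per instruction is about 0.4 µs on sys1 and about 1.5 µs on nbk2. So the relative impact is smaller on nbk2, but the absolute cost added per lock miss is larger.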

For the original posting to the Yahoo! Group Hercules-390 see topic 82874 (dead link since 2020-12-15, when Yahoo! Groups was discontinued by Verizon).

Posted:	2018-01-14
Update:	2022-04-21
Tags:	mvs
