Microarchitecture
Because the w11a microarchitecture is very similar to the original 11/70
processor, the KB11-C CPU, the instruction timing in clock cycles is also very
similar. A register-register operation takes two clock cycles, a more
involved case like an "add @r1,a(r2)
" for example takes 12 cycles.
Notable exceptions are the MUL (5 instead of 22 cycles) and DIV (23
instead of 46 cycles) instructions.
Clock Rate
On Spartan-3,6,7 or Artix-7 classs FPGA's the w11a systems run with a clock frequency of at least 50 MHz. Specifically:
Familie FPGA Board System Clock Comment Spartan-7 xc7s50-1 Arty S7 w11a_as7 80 MHz MMCM (Viv 2019.1) Artix-7 xc7a35t-1 Cmod A7 w11a_c7 80 MHz MMCM (Viv 2019.1) Artix-7 xc7a35t-1l Arty A7 w11a_arty 72 MHz MMCM (Viv 2019.1) Artix-7 xc7a35t-1 Basys3 w11a_b3 80 MHz MMCM (Viv 2019.1) Artix-7 xc7a100t-1 Nexys A7 w11a_n4d 80 MHz MMCM (Viv 2019.1) Artix-7 xc7a100t-1 Nexys4 w11a_n4 80 MHz MMCM (Viv 2019.1) Spartan-6 xc6slx16-2 Nexys3 w11a_n3 64 MHz DCM (ISE 14.7) Spartan-3 xc3s1200e-4 Nexys2 w11a_n2 52 MHz DCM Spartan-3 xc3s1000-4 S3board w11a_s3 50 MHz no DCM
Expected Performance
Compared to KB11-C CPU: The KB11-C CPU had a 150 ns micro cycle time. Both the w11a and the 11/70 have a cache, which greatly reduces the impact of memory latencies. So one expects that the w11a is about a factor 50/6.7 or 7.5 faster than the original PDP-11/70.
Compared to J11 CPU: This later ASIC implementation of the 11/70 ran with up to 20 MHz clock rate. It needed 4 clocks per microcycle, resulting in a 200 ns micro cycle time. However, the J11 had a significantly improved microarchitecture yielding up to a factor of two better cpi (cycles-per-instruction) value. So one expects that the w11a is at least a factor (50/20)*(4/1)*(1/2) or 5 faster than the fastest J11 based system, the PDP-11/93.
Benchmarks
The Dhrystone 2 and Tower of Hanoi benchmark codes taken from the 'BYTE UNIX Benchmark' were used to compare the w11a with real PDP-11's and other processors. The w11a values were determined for both boards, the comparison values obtained from Michael Schneider's benchmark collection:
Type OS CPU/cache (Mhz) Dhry2 Hanoi Dhry Hanoi Dhry [lps] [lps] /MHz /MHz /Han w11a_s3 V0.5 BSD 2.11 w11a 8k (50) 11510 160.8 230 3.2 71.6 w11a_n2 V0.5 BSD 2.11 w11a 8k (50) 11519 160.4 230 3.2 71.8 w11a_n2 V0.51 BSD 2.11 w11a 8k (58) 13218 186.1 228 3.2 71.0 w11a_n4 V0.73 BSD 2.11 w11a 64k (80) 18095 250.7 226 3.1 72.2 pdp-11/53+ BSD 2.11 KDJ11-SD (4.5)* 828 12.2 184 2.7 67.8 Mac SE/30 A/UX 68030 (16) 3042 81.8 190 5.1 37.2 SUN 3/60 NetBSD 68020 (20) 6934 121.3 346 6.1 57.3 DECstation 2100 NetBSD R2000 (12) 13206 155.5 1100 13.0 85.2 NeXT N1100 NetBSD 68040 (25) 26882 386.1 1075 15.4 69.6 HP 9000/433t NetBSD 68040 (40) 55763 960.3 1394 24.0 58.1 NCR system 3230 NetBSD i486DX/2 (66) 63464 993.1 961 15.0 63.9 NCR system 3230 NetBSD i486DX/4 (100) 75010 1022.3 750 10.2 73.4 Power Mac G4 Gentoo PPC7455 (1400) 3713k 46.6k 2652 33.3 79.6 Lenovo TS S10 Gentoo i686 E8400(3000) 16464k 262.6k 5488 87.5 62.7
Note that the J11 system is listed with an effective microcycle rate of 4.5 MHz rather the chip clock rate of 18 MHz. This is also consistent with Bob Supnik's notes on the J11 where the J11 is classified as '4.5 MHz'. This gives a more meaningful value for the Dhry/MHz or 'Dhrystone per MHz' column. For a fair comparison, it is also important to remark that the PDP-11/53+ systems didn't have a cache and were therefore about a factor 2.3 slower than a PDP-11/93 with cache (see comparison , explaining the large factor between the w11a_s3 and the 11/53 benchmark results.
The Dhrystone, Tower of Hanoi, and 'syscall' benchmarks were also run on a simulated PDP-11 using SimH version V3.8-1 and natively on a Linux system. In both cases a Kubuntu 10.4 system with an Intel Core2 Duo E8400 CPU was used, cpufreg was fixed to 3 GHz.
System Platform (MHz) Dhry2 Hanoi syscall syscall syscall [lps] [lps] [lps] /Hanoi /Dhry2 2.11BSD w11a_s3 V0.5 (50) 11510 160.8 7080 44.0 0.615 2.11BSD w11a_n2 V0.5 (50) 11519 160.4 6888 42.9 0.598 2.11BSD w11a_n2 V0.51 (58) 13218 186.1 7616 40.9 0.576 2.11BSD w11a_n4 V0.73 (80) 18095 250.7 10837 43.2 0.599 2.11BSD SimH on Intel E8400 (--) 17174 250.0 10713 42.9 0.623 2.11BSD SimH on Rasp Pi 2 B (--) 4477 63.3 2651 41.9 0.592 2.11BSD SimH on Rasp Pi 3 b (--) 7294 104.5 4217 41.3 0.578 Ubuntu 10.4 Intel E8400 (3000) 10785k 74.1k 1020k 13.8 0.095
Some observations are:
- The Nexys2 and Nexys3 boards have a significantly larger main memory latency than the S3BOARD. Because Dhry2 and Hanoi run almost completely in cache and do rarely writes they execute with equal speed. syscall is more sensitive to memory latencies, either due to more cache misses or delays from write-thru's.
- The simulated PDP-11 on a vintage 2009 3 GHz PC is about as fast as the current FPGA implementation on an Artix-7. However, there is certainly room for improvement on the FPGA side, either with faster devices (e.g. a Kintext-7 instead of an Artix-7) and/or an improved microarchitecture (like J11 or even better).
- Comparing the arithmetic benchmarks (Dhry2 and Hanoi) with the 'syscall' benchmark on SimH-simulated 2.11BSD and native Linux suggests that the system call overhead, normalized by processor speed, is larger for contemporary Linux than for 2.11BSD.