2.11BSD: kernel panic after a 'here document' in tcsh
Detecting the problem (2017-06-06)
Using 2.11BSD Version 447 I found that a 'here document' in tcsh
leads to a kernel panic.
It's absolutely reproducible on my system, both
when running it on my FPGA PDP-11 w11a or in simh. Just doing

    tcsh
    cat << EOF

is enough, and I get

    ka6 31333 aps 147472
    pc 161324 ps 30004
    ov 4
    cpuerr 20
    trap type 0
    panic: trap
    syncing disks... done

Looking at the crash dump gives

    cd /etc/crash
    ./why 4
    Backtrace:
    0147372: _boot(05000,0100) from ~panic+072
    0147414: _etext(011350) from ~trap+0350
    0147450: ~trap() from call+040
    0147516: _psignal(0101520,0160750) from ~trap+0364
    0147554: ~trap() from call+040

so the crash is in psignal, which is afaik the kernel internal
mechanism to dispatch signals.
Refining the problem description (2017-06-08)
'here documents' are available and work fine in sh and csh. They are in fact used, for example in

    /usr/adm/daily (a /bin/sh script)
        su uucp << EOF
        /etc/uucp/clean.daily
        EOF

    /usr/crash/why (a /bin/csh script)
        adb -k {unix,core}.$1 << 'EOF'
        version/sn"Backtrace:"n
        $c
        'EOF'
211bsd uses split I/D space and uses all 64 kB I space for code. The top 8 kB are in fact the overlay area, and the crash happened in overlay 4 (as indicated by ov 4). With a simple

    nm /unix | sort | grep " 4"

one gets

    161254 t ~psignal 4
    162302 t ~issignal 4

so the crash is just 050 bytes after the entry point of psignal.
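The address arithmetic can be cross-checked with a minimal C sketch. The helper names pc_offset and in_overlay_area are made up for illustration; the numbers are the pc value from the panic message and the ~psignal entry from the nm output.

```c
#include <stdint.h>

/* Hypothetical helper: distance of a faulting PC from a function's entry
 * point in the 16-bit PDP-11 address space. */
uint16_t pc_offset(uint16_t pc, uint16_t entry)
{
    return (uint16_t)(pc - entry);
}

/* Hypothetical helper: true if an I-space address lies in the top 8 kB
 * (0160000-0177777), which the text identifies as the kernel overlay area. */
int in_overlay_area(uint16_t addr)
{
    return addr >= 0160000;
}
```

Calling pc_offset(0161324, 0161254) yields 050 (octal), and in_overlay_area(0161324) is true, which matches the 'ov 4' line in the panic message.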
So the PC address is fine and not the problem. For psignal look at
https://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#s:_psignal
The crash must be in one of the first lines.
psignal is an internal kernel function, called from
https://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#xref:s:_psignal
and has nothing to do with the libc function psignal,
see the man page psignal.0.html and the source psignal.c.html.
Whatever tcsh does, it should not lead to a kernel panic, and if it does,
it is primarily a bug of the kernel. It looks like there are two issues,
one in tcsh, and one in the kernel. I've got a hunch where this
might come from, but that will take a weekend or two to check on.
Finding the problem(s) (2017-06-10)
Two remarks by Johnny Billquist on June 7th and June 9th were very helpful. The essential hint was Johnny's observation that on his system he gets an "Illegal instruction - core dumped" and no kernel panic. I'm using a self-built PDP 11/70 on an FPGA, see GitHub w11 project and w11 home page, which doesn't have a floating-point unit (yet). Therefore the kernel is built with floating-point emulation, thus with

    FPSIM YES # floating-point simulator

In a kernel with FPSIM activated the trap handler trap(), see trap.c.html,
calls fpsim() for each user mode illegal instruction trap. In case
it was a floating-point instruction, fpsim() emulates it, returns 0,
and trap() simply returns. If not, fpsim() returns
the abort signal type, and trap() calls psignal()
with this signal type, which in general will terminate the offending process.
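The control flow just described can be sketched in C. This is a simplified model, not the kernel code: all names ending in _model are hypothetical, the opcode test only checks the 17xxxx range used by PDP-11 floating-point instructions, and the real fpsim() can return other signals as well.

```c
#include <stdint.h>

enum { MODEL_SIGILL = 4 };  /* SIGILL is signal 4, as in assym.h */

int last_signal_model;      /* records what the model psignal was handed */

/* Hypothetical stand-in for fpsim(): 0 if the instruction was emulated,
 * otherwise the signal to deliver. Simplification: every 17xxxx opcode
 * counts as an emulated floating-point instruction. */
int fpsim_model(uint16_t opcode)
{
    if ((opcode & 0170000) == 0170000)
        return 0;
    return MODEL_SIGILL;
}

/* Hypothetical stand-in for psignal(): just remembers the signal. */
void psignal_model(int sig)
{
    last_signal_model = sig;
}

/* Hypothetical stand-in for the illegal-instruction branch of trap(). */
int trap_model(uint16_t opcode)
{
    int sig = fpsim_model(opcode);

    if (sig == 0)
        return 0;           /* emulated, nothing else to do */
    psignal_model(sig);     /* otherwise hand the signal to psignal */
    return sig;
}
```

With this model, trap_model(0000045) ends in psignal_model(4), while trap_model(0170001) returns 0 without signalling. The point of the sketch is that psignal() trusts whatever number fpsim() hands back.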
The kernel panic is due to a coding error in mch_fpsim.s.
Look in the source code mch_fpsim.s.html after label badins:

    badins: / Illegal Instruction
        mov $SIGILL.,r0
        br 2b

The constant SIGILL is defined in assym.h as

    #define SIGILL 4.

Thus after substitution the mov instruction is

    mov $4..,r0

with *two dots* !!! The as assembler generates from this

    mov #160750,r0

presumably because as reads '4.' as decimal 4, the second '.' as the location counter, and adds juxtaposed operands, which gives 4 plus the address of the mov instruction (160744 in the trace below).
So r0 will contain an invalid signal number, which is returned
by fpsim() to trap().
This signal number is passed to psignal(), which starts with

    mask = sigmask(sig);
    prop = sigprop[sig];

The access to sigprop[sig] results in an address in IO space,
causes a UNIBUS timeout, and in consequence the kernel panic.
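The faulting address can be reproduced with a small C sketch. Two things here are assumptions and not taken from the kernel source: the base address 013066 of sigprop is inferred from the index word of the movb instruction in the CPU monitor trace at the end of this post, and the I/O page is taken to be the top 8 kB of the kernel data address space. The helper names are made up.

```c
#include <stdint.h>

#define SIGPROP_BASE 0013066u   /* assumption: inferred from the trace */
#define IOPAGE_START 0160000u   /* assumption: I/O page is the top 8 kB */

/* sigprop[] is accessed with movb, so the index is a plain byte offset. */
uint16_t sigprop_addr(uint16_t sig)
{
    return (uint16_t)(SIGPROP_BASE + sig);
}

/* True if a kernel data address falls into the I/O page. */
int hits_io_page(uint16_t addr)
{
    return addr >= IOPAGE_START;
}
```

For the bogus signal number, sigprop_addr(0160750) gives 0174036, the address that shows the bus timeout in the trace, and hits_io_page() is true for it. For the intended value, sigprop_addr(4) gives 0013072, an ordinary data address.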
After fixing the "$SIGILL." to "$SIGILL" (removing the extraneous '.') and
three similar cases the kernel doesn't panic anymore; tcsh now crashes with an
illegal instruction trap.
The question remains why tcsh runs into an illegal instruction.
With a tcsh core dump now available, adb gives the answer

    adb tcsh tcsh.core
    $c
    0172774: _rscan(0176024,0174434) from ~heredoc+0246
    0176040: _heredoc(067676) from ~execute+0234
    0176126: _execute(067040,01512,0,0) from ~execute+03410
    0176222: _execute(066754,01512,0,0) from ~process+01224
    0176274: _process(01) from ~main+06030
    0177414: _main() from start+0104
heredoc(), which is located in OV1, calls rscan(), which is in OV6, with

    rscan(Dv, Dtestq);

where Dtestq is a function pointer to Dtestq(), which is, like heredoc(), in OV1.
rscan(), which has the signature

    rscan(t, f)
        register Char **t;
        void (*f) ();

uses f in the statement

    (*f) (*p++);

The problem is that
- heredoc() and Dtestq() are in OV1. That's why in the end ~Dtestq is used as the function pointer, like for all overlay internal function invocations.
- rscan() is in OV6. When it's called, the overlay is switched OV1->OV6.
- this invalidates the function pointer, which now points to some random code location, which happens to hold '000045', causing a trap.
It is clear that in this context _Dtestq, the forwarder in the
base, must be used and not ~Dtestq, the entry point in the
overlay. The generated code for rscan(Dv, Dtestq) is

    ~heredoc+0230: mov $0174434,(sp)   # arg Dtestq: uses ~Dtestq
    ~heredoc+0234: mov r5,-(sp)
    ~heredoc+0236: add $0177764,(sp)   # arg Dv
    ~heredoc+0242: jsr pc,*$_rscan

Since rscan() is very small and only used by heredoc() I simply moved the code of rscan()
from sh.glob.c (OV6) to sh.dol.c where also heredoc() and Dtestq() are defined.
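Why the raw overlay address fails and a base-segment forwarder would not can be illustrated with a toy model in C. This is only a sketch: the overlay window is reduced to a single slot, a wrong call returns -1 instead of trapping on an illegal instruction, and all function names are hypothetical stand-ins for ~Dtestq, _Dtestq and rscan().

```c
/* One function "address" inside the overlay window; what runs there
 * depends on which overlay is currently mapped. */
typedef int (*fn_t)(int);

enum { OV1, OV6, NOVERLAYS };

int current_overlay = OV1;

static int dtestq_code(int c) { return c + 1000; }        /* stands in for ~Dtestq */
static int unrelated_code(int c) { (void)c; return -1; }  /* whatever OV6 holds there */

/* Contents of the same window address under each overlay mapping. */
static const fn_t window_slot[NOVERLAYS] = { dtestq_code, unrelated_code };

/* Call through the raw in-overlay address (like ~Dtestq). */
int call_raw(int c)
{
    return window_slot[current_overlay](c);
}

/* Base-segment forwarder (like _Dtestq): map OV1, call, restore. */
int call_forwarder(int c)
{
    int saved = current_overlay;
    int r;

    current_overlay = OV1;
    r = call_raw(c);
    current_overlay = saved;
    return r;
}

/* rscan() lives in OV6, so entering it remaps the window before it calls f. */
int rscan_model(fn_t f, int c)
{
    int saved = current_overlay;
    int r;

    current_overlay = OV6;
    r = f(c);
    current_overlay = saved;
    return r;
}
```

In this model rscan_model(call_raw, 5) returns -1, because the window holds OV6 code at the moment of the call, while rscan_model(call_forwarder, 5) returns 1005. Moving rscan() into the same overlay as Dtestq() sidesteps the issue, because then no overlay switch happens between taking the pointer and calling through it.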
After that tcsh works fine with here documents

    ./tcsh
    cat >x.x << EOF
    1 $TERM $PWD
    EOF
    cat x.x
    1 vt100-long /usr/src/bin/tcsh
Bottom line
- fpsim was broken all the time
- tcsh was broken all the time

I'll convert this into proper patches and send them to Steven, but this will take some time because I first have to tidy up my system to be in a position again to provide proper and clean patch sets.
P.S.: debugging the kernel issue was quite easy because the w11a CPU has three essential debug tools built into the CPU:
- a CPU monitor, which records 144 bits of processor state for the last 256 instructions or vector fetches, see pdp11_dmcmon.vhd.
- a breakpoint unit that allows setting instruction or data breakpoints, see pdp11_dmhbpt.vhd.
- an ibus monitor which records the last 512 ibus transactions, see ibd_ibmon.vhd.

    nc ....pc cprptnzvc ..dsrc ..ddst ..dres vmaddr vmdata
    #
    # the "(*f) (*p++)" in tcsh, running onto an illegal instruction
    #
    15 145210 uu00-.... 000105 173052 000105 w d 173052 000105 mov r0,(sp)
    25 145212 uu00-.... 173050 174434 174434 w d 173050 145216 jsr pc,@n(r5)
    19 174434 uu00-.... 000010 173064 000010 r i 174434 000045 ?000045?
    1 174434 uu00-.... 000012 173064 000012 r d 000010 000045 !VFETCH 010 RIT
    #
    # the "mov $SIGILL.,r0" in fpsim(), load 160750 instead of 000004
    #
    17 160744 ku00-n..c 160750 000045 160750 r i 160746 160750 mov #n,r0
    14 160750 ku00-n..c 160752 160750 160732 r i 160750 000770 br .-14
    #
    # the "sigprop[sig]" access in psignal(), which accesses 174036
    # which leads to a external bus (or UNIBUS) time out and IIT trap
    #
    23 161314 ku00-.z.. 000000 147500 000000 w d 147500 000000 mov r1,n(r5)
    9 161320 ku00-.z.. 174036 000000 000000 Ebto 174036 013066 movb n(r3),r0
    3 161320 ku00-.z.. 000006 000000 000006 r d 000004 013066 !VFETCH 004 IIT

    cd /usr/src/sys/pdp
    diff mch_fpsim.s.orig mch_fpsim.s
    249c249
    < mov $SIGTRAP.,r0
    ---
    > mov $SIGTRAP,r0 / wfjm: fixed constant usage, here and below
    252c252
    < mov $SIGILL.,r0
    ---
    > mov $SIGILL,r0
    257c257
    < mov $SIGSEGV.,r0
    ---
    > mov $SIGSEGV,r0
    273c273
    < mov $SIGFPE.,r0
    ---
    > mov $SIGFPE,r0

    cd /usr/src/bin/tcsh
    diff sh.dol.c.orig sh.dol.c
    673a674,691
    > /* wfjm: 2017-06-11: moved rscan from sh.glob.c to sh.dol.c
    >  * rscan must be in same overlay as Dtestq and heredoc.
    >  * If they are in different overlays, the function pointer to Dtestq
    >  * passed in heredoc to rscan will be invalid after the overlay switch
    >  * to rscan.
    >  */
    > void
    > rscan(t, f)
    >     register Char **t;
    >     void (*f) ();
    > {
    >     register Char *p;
    >
    >     while (p = *t++)
    >         while (*p)
    >             (*f) (*p++);
    > }
    >
    diff sh.glob.c.orig sh.glob.c
    536,547d535
    < rscan(t, f)
    <     register Char **t;
    <     void (*f) ();
    < {
    <     register Char *p;
    <
    <     while (p = *t++)
    <         while (*p)
    <             (*f) (*p++);
    < }
    <
    < void
BSD2.11 patch 453 (2019-10-15)
The corrections finally made it into the BSD2.11 patch 453, released by Steven Schultz on 2019-10-15:

    Subject: fp simulator kernel crash, tcsh here doc crash, welcome y2k bug
    Index: src/sys/pdp/mch_fpsim.s, src/bin/tcsh/(sh.glob.c,sh.dol.c,sh.decls.h), src/local/welcome/welcome.c

See https://www.retro11.de/data/211bsd/patches/453.