Kernel panic, tcsh, and here documents

2.11BSD: kernel panic after a 'here document' in tcsh

Detecting the problem (2017-06-06)

Using 2.11BSD Version 447 I found that a 'here document' in tcsh leads to a kernel panic. It's absolutely reproducible on my system, both on my FPGA PDP-11 w11a and in simh. Just doing

tcsh
cat << EOF

is enough, and I get

ka6 31333 aps 147472
pc 161324 ps 30004
ov 4
cpuerr 20
trap type 0
panic: trap
syncing disks... done

looking at the crash dump gives

cd /etc/crash
./why 4
  Backtrace:
  0147372: _boot(05000,0100) from    ~panic+072
  0147414: _etext(011350) from ~trap+0350
  0147450: ~trap() from call+040
  0147516: _psignal(0101520,0160750) from ~trap+0364
  0147554: ~trap() from call+040

so the crash is in psignal, which is afaik the kernel internal mechanism to dispatch signals.

Refining the problem description (2017-06-08)

'here documents' are available and work fine in sh and csh, and are in fact used. Examples:

/usr/adm/daily     (a /bin/sh script)
  su uucp << EOF
        /etc/uucp/clean.daily
  EOF

/usr/crash/why     (a /bin/csh script)
  adb -k {unix,core}.$1 << 'EOF'
  version/sn"Backtrace:"n
  $c
  'EOF'

2.11BSD uses split I/D space, and the kernel uses the full 64 kB of I space for code. The top 8 kB are the overlay area, and the crash happened in overlay 4 (as indicated by ov 4). With a simple

nm /unix | sort | grep " 4"

one gets

161254 t ~psignal 4
162302 t ~issignal 4

so the crash is just 050 bytes after the entry point of psignal. So the PC address is fine and not the problem. For psignal look at

   https://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#s:_psignal

the crash must be in one of the first lines. psignal is an internal kernel function, called from

   https://www.retro11.de/ouxr/211bsd/usr/src/sys/sys/kern_sig.c.html#xref:s:_psignal

and has nothing to do with the libc function psignal, see the man page psignal.0.html and the source psignal.c.html.

Whatever tcsh does, it should not lead to a kernel panic, and if it does, it is primarily a bug in the kernel. It looks like there are two issues, one in tcsh, and one in the kernel. I've got a hunch where this might come from, but that will take a weekend or two to check.

Finding the problem(s) (2017-06-10)

Two remarks by Johnny Billquist on June 7th and June 9th were very helpful. The essential hint was Johnny's observation that on his system he gets an "Illegal instruction - core dumped" and no kernel panic.

I'm using a self-built PDP 11/70 on an FPGA, see GitHub w11 project and w11 home page, which doesn't have a floating-point unit (yet). Therefore the kernel is built with floating-point emulation, thus with

FPSIM   YES      # floating-point simulator

In a kernel with FPSIM activated, the trap handler trap(), see trap.c.html, calls fpsim() for each user-mode illegal instruction trap. If the instruction was a floating-point instruction, fpsim() emulates it and returns 0, and trap() simply returns. If not, fpsim() returns the type of the signal to raise, and trap() calls psignal() with this signal type, which in general will terminate the offending process.

The kernel panic is due to a coding error in mch_fpsim.s. Look in the source code mch_fpsim.s.html after label badins:

badins:                         / Illegal Instruction
      mov     $SIGILL.,r0
      br      2b

The constant SIGILL is defined in assym.h as

#define SIGILL 4.

Thus after substitution the mov instruction is

      mov     $4..,r0

with *two dots* !!! The as assembler generates from this

      mov #160750,r0

So r0 will contain an invalid signal number, which is returned by fpsim() to trap(). This signal number is passed to psignal(), which starts with

      mask = sigmask(sig);
      prop = sigprop[sig];

The access to sigprop[sig] then resolves to an address in I/O space, which causes a UNIBUS timeout and, in consequence, the kernel panic.

After changing "$SIGILL." to "$SIGILL" (removing the extraneous '.') and fixing three similar cases, the kernel no longer panics; tcsh now crashes with an illegal instruction trap instead.

That leaves the question of why tcsh runs into an illegal instruction. With a tcsh core dump now available, adb gives the answer

adb tcsh tcsh.core
  $c
    0172774: _rscan(0176024,0174434) from ~heredoc+0246
    0176040: _heredoc(067676) from ~execute+0234
    0176126: _execute(067040,01512,0,0) from ~execute+03410
    0176222: _execute(066754,01512,0,0) from ~process+01224
    0176274: _process(01) from ~main+06030
    0177414: _main() from start+0104

heredoc(), which is located in OV1, calls rscan(), which is in OV6, with

      rscan(Dv, Dtestq);

where Dtestq is a function pointer to Dtestq(), which, like heredoc(), is in OV1. rscan(), which has the signature

rscan(t, f)
   register Char **t;
   void    (*f) ();

uses f in the statement

   (*f) (*p++);

The problem is that

heredoc() and Dtestq() are both in OV1
because they share an overlay, the function pointer ends up referring to ~Dtestq, the direct entry point in the overlay, as is done for all overlay-internal function invocations
rscan() is in OV6, so when it is called the overlay is switched OV1 -> OV6
this invalidates the function pointer, which now points to some random code location that happens to hold '000045', causing a trap.

It is clear that in this context _Dtestq, the forwarder in the base, must be used and not ~Dtestq, the entry point in the overlay. The generated code for rscan(Dv, Dtestq) is

~heredoc+0230:  mov     $0174434,(sp)         # arg Dtestq: uses ~Dtestq
~heredoc+0234:  mov     r5,-(sp)
~heredoc+0236:  add     $0177764,(sp)         # arg Dv
~heredoc+0242:  jsr     pc,*$_rscan

Since rscan() is very small and only used by heredoc(), I simply moved the code of rscan() from sh.glob.c (OV6) to sh.dol.c, where heredoc() and Dtestq() are also defined.

After that tcsh works fine with here documents

./tcsh
cat >x.x << EOF
  1
  $TERM
  $PWD
  EOF
  
cat x.x
  1
  vt100-long
  /usr/src/bin/tcsh

Bottom line

fpsim had been broken all along
tcsh had been broken all along

I'll convert this into proper patches and send them to Steven, but this will take some time because I first have to tidy up my system to be in a position again to provide proper and clean patch sets.

P.S.: debugging the kernel issue was quite easy because the w11a CPU has three essential debug tools built into the CPU:

a CPU monitor, which records 144 bits of processor state for the last 256 instructions or vector fetches, see pdp11_dmcmon.vhd.
a breakpoint unit that allows setting instruction or data breakpoints, see pdp11_dmhbpt.vhd.
an ibus monitor which records the last 512 ibus transactions, see ibd_ibmon.vhd.

After setting a breakpoint on the trap 004/010 handler an inspection of the instruction trace gave the essential information. Below is a very condensed and annotated excerpt

  nc ....pc cprptnzvc ..dsrc ..ddst ..dres      vmaddr vmdata
#
# the "(*f) (*p++)" in tcsh, running onto an illegal instruction
#
  15 145210 uu00-.... 000105 173052 000105 w  d 173052 000105 mov r0,(sp)
  25 145212 uu00-.... 173050 174434 174434 w  d 173050 145216 jsr pc,@n(r5)
  19 174434 uu00-.... 000010 173064 000010 r  i 174434 000045 ?000045?
   1 174434 uu00-.... 000012 173064 000012 r  d 000010 000045 !VFETCH 010 RIT
#
# the "mov $SIGILL.,r0" in fpsim(), loads 160750 instead of 000004
#
  17 160744 ku00-n..c 160750 000045 160750 r  i 160746 160750 mov #n,r0
  14 160750 ku00-n..c 160752 160750 160732 r  i 160750 000770 br .-14
#
# the "sigprop[sig]" access in psignal(), which accesses 174036
# which leads to an external bus (or UNIBUS) timeout and IIT trap
#
  23 161314 ku00-.z.. 000000 147500 000000 w  d 147500 000000 mov r1,n(r5)
   9 161320 ku00-.z.. 174036 000000 000000 Ebto 174036 013066 movb n(r3),r0
   3 161320 ku00-.z.. 000006 000000 000006 r  d 000004 013066 !VFETCH 004 IIT

For the original thread on TUHS see posting 011650 and follow the Next message links.

For reference, here are the three patches

cd /usr/src/sys/pdp
  
diff mch_fpsim.s.orig mch_fpsim.s
249c249
<       mov     $SIGTRAP.,r0
---
>       mov     $SIGTRAP,r0     / wfjm: fixed constant usage, here and below
252c252
<       mov     $SIGILL.,r0
---
>       mov     $SIGILL,r0
257c257
<       mov     $SIGSEGV.,r0
---
>       mov     $SIGSEGV,r0
273c273
<       mov     $SIGFPE.,r0
---
>       mov     $SIGFPE,r0

cd /usr/src/bin/tcsh
  
diff sh.dol.c.orig sh.dol.c
673a674,691
> /* wfjm: 2017-06-11: moved rscan from sh.glob.c to  sh.dol.c
>  *   rscan must be in same overlay as Dtestq and heredoc.
>  *   If they are in different overlays, the function pointer to Dtestq
>  *   passed in heredoc to rscan will be invalid after the overlay switch
>  *   to rscan.
> */
> void
> rscan(t, f)
>     register Char **t;
>     void    (*f) ();
> {
>     register Char *p;
> 
>     while (p = *t++)
>       while (*p)
>           (*f) (*p++);
> }
> 

diff sh.glob.c.orig sh.glob.c
536,547d535
< rscan(t, f)
<     register Char **t;
<     void    (*f) ();
< {
<     register Char *p;
< 
<     while (p = *t++)
<       while (*p)
<           (*f) (*p++);
< }
< 
< void

2.11BSD patch 453 (2019-10-15)

The corrections finally made it into 2.11BSD patch 453, released by Steven Schultz on 2019-10-15

  Subject: fp simulator kernel crash, tcsh here doc crash, welcome y2k bug
  Index: src/sys/pdp/mch_fpsim.s, src/bin/tcsh/(sh.glob.c,sh.dol.c,sh.decls.h),
         src/local/welcome/welcome.c

See https://www.retro11.de/data/211bsd/patches/453.

Posted:	2017-06-06
Update:	2022-06-08
Tags:	211bsd
