2.11BSD: kernel panic after a 'here document' in tcsh

Detecting the problem (2017-06-06)

Using 2.11BSD Version 447 I found that a 'here document' in tcsh leads to a kernel panic. It's absolutely reproducible on my system, both when I run it on my FPGA PDP-11 w11a and when I run it in simh. Just doing

cat << EOF
is enough, and I get
ka6 31333 aps 147472
pc 161324 ps 30004
ov 4
cpuerr 20
trap type 0
panic: trap
syncing disks... done
Looking at the crash dump gives
cd /etc/crash
./why 4
  0147372: _boot(05000,0100) from    ~panic+072
  0147414: _etext(011350) from ~trap+0350
  0147450: ~trap() from call+040
  0147516: _psignal(0101520,0160750) from ~trap+0364
  0147554: ~trap() from call+040
so the crash is in psignal, which is afaik the kernel internal mechanism to dispatch signals.

Refining the problem description (2017-06-08)

'here documents' are available and work fine in sh and csh. They are in fact used by the system, for example in
/usr/adm/daily     (a /bin/sh script)
  su uucp << EOF

/usr/crash/why     (a /bin/csh script)
  adb -k {unix,core}.$1 << 'EOF'

2.11BSD uses split I/D space and uses the full 64 kB of I space for code. The top 8 kB are in fact the overlay area, and the crash happened in overlay 4 (as indicated by ov 4). With a simple

nm /unix | sort | grep " 4"
one gets
161254 t ~psignal 4
162302 t ~issignal 4
so the crash is just 050 bytes after the entry point of psignal. So the PC address is fine and not the problem. Looking at the source of psignal, the crash must be in one of the first lines. psignal is an internal kernel function, called here from trap() as the crash dump shows, and has nothing to do with the libc function psignal, see the man page psignal.0.html and the source psignal.c.html.
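The offset can be checked with a few lines of C. This is only a sketch: the helper names pc_offset and pc_in_function are made up for illustration, the addresses are the ones from the panic message and the nm output above, and it assumes that ~issignal is the next symbol after ~psignal in overlay 4.

```c
#include <stdint.h>

/* Hypothetical helper (not part of 2.11BSD): distance of a PC value
 * from a function entry point, both 16-bit PDP-11 addresses. */
uint16_t pc_offset(uint16_t pc, uint16_t entry)
{
    return (uint16_t)(pc - entry);
}

/* Returns 1 if pc lies in [entry, next_entry), i.e. inside the function
 * starting at entry. Assumes next_entry is the next symbol in the same
 * overlay (here ~issignal). */
int pc_in_function(uint16_t pc, uint16_t entry, uint16_t next_entry)
{
    return pc >= entry && pc < next_entry;
}
```

With pc = 0161324, entry = 0161254 and next_entry = 0162302, pc_offset gives 050 and pc_in_function reports that the PC is inside psignal.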

Whatever tcsh does, it should not lead to a kernel panic, and if it does, it is primarily a bug in the kernel. It looks like there are two issues, one in tcsh and one in the kernel. I have a hunch where this might come from, but that will take a weekend or two to check.

Finding the problem(s) (2017-06-10)

Two remarks by Johnny Billquist on June 7th and June 9th were very helpful. The essential hint was Johnny's observation that on his system he gets an "Illegal instruction - core dumped" and no kernel panic.

I'm using a self-built PDP 11/70 on an FPGA, see GitHub w11 project and w11 home page, which doesn't have a floating point unit (yet). Therefore the kernel is built with floating point emulation, thus with

FPSIM   YES      # floating point simulator

In a kernel with FPSIM activated the trap handler trap(), see trap.c.html, calls fpsim() for each user mode illegal instruction trap. If it was a floating point instruction, fpsim() emulates it and returns 0, and trap() simply returns. If not, fpsim() returns the abort signal type, and trap() calls psignal() with this signal type, which in general will terminate the offending process.
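The control flow just described can be pictured with a toy model in C. It is not the kernel code: all names ending in _model are hypothetical, and the test "every 17xxxx opcode is an emulated floating point instruction" is a deliberate simplification of what the real fpsim() decides.

```c
#include <stdint.h>

#define MODEL_SIGILL 4          /* SIGILL has the value 4 */

static int last_signal;         /* what the model's psignal() received */

/* Returns 0 if the opcode was "emulated", else a signal number. */
int fpsim_model(uint16_t opcode)
{
    /* Simplification: PDP-11 floating point opcodes live in 17xxxx. */
    return ((opcode & 0170000u) == 0170000u) ? 0 : MODEL_SIGILL;
}

/* Stand-in for the kernel's psignal(): just records the signal. */
void psignal_model(int sig)
{
    last_signal = sig;
}

/* Stand-in for the illegal instruction path of trap(). */
void trap_model(uint16_t opcode)
{
    int sig = fpsim_model(opcode);

    if (sig == 0)
        return;                 /* emulated, the user program continues */
    psignal_model(sig);         /* otherwise post the returned signal */
}

int model_last_signal(void)
{
    return last_signal;
}
```

In this model an opcode like 0170001 posts no signal, while 000045, the word tcsh ran into, posts signal 4. The point is that trap() passes on whatever fpsim() returns, so a wrong return value goes straight into psignal().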

The kernel panic is due to a coding error in mch_fpsim.s. Look in the source code mch_fpsim.s.html after label badins:

badins:                         / Illegal Instruction
      mov     $SIGILL.,r0
      br      2b
The constant SIGILL is defined in assym.h as
#define SIGILL 4.
Thus after substitution the mov instruction is
      mov     $4..,r0
with *two dots* !!! The as assembler generates from this
      mov #160750,r0
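The substitution step can be replayed with a naive textual replacement in C. This is a sketch that stands in for the real macro substitution; the function name substitute is made up for illustration.

```c
#include <string.h>

/* Replace the first occurrence of name in line by value and write the
 * result to out (capacity outsz). Returns 1 on success, 0 if name does
 * not occur or the result would not fit. */
int substitute(const char *line, const char *name, const char *value,
               char *out, size_t outsz)
{
    const char *hit = strstr(line, name);
    const char *tail;
    size_t head;

    if (hit == NULL)
        return 0;
    head = (size_t)(hit - line);
    tail = hit + strlen(name);
    if (head + strlen(value) + strlen(tail) + 1 > outsz)
        return 0;
    memcpy(out, line, head);        /* text before the name */
    strcpy(out + head, value);      /* the substituted value */
    strcat(out, tail);              /* text after the name */
    return 1;
}
```

Substituting SIGILL by "4." in "mov $SIGILL.,r0" yields "mov $4..,r0", while the same line without the trailing dot yields the intended "mov $4.,r0". The source does not spell out why the assembler turns $4.. into 160750. One reading consistent with the trace in the P.S., where the mov sits at 160744, is that as takes 4. and . (the location counter) as two operands and adds them, 4 + 160744 = 160750; treat that as an assumption.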

So r0 will contain an invalid signal number, which is returned by fpsim() to trap(). This signal number is passed to psignal(), which starts with

      mask = sigmask(sig);
      prop = sigprop[sig];
The access to sigprop[sig] results in an address in I/O space, which causes a UNIBUS timeout and, in consequence, the kernel panic.
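The address arithmetic can be reproduced in a small C sketch. Two things are assumptions: the base address 013066 for sigprop is read off the vmdata column of the movb n(r3),r0 line in the instruction trace in the P.S., and the I/O page is taken to be the top 8 kB of the kernel's 16-bit data space, as is usual on a PDP-11. The function names are made up for illustration.

```c
#include <stdint.h>

#define SIGPROP_BASE 013066u    /* assumed: index word seen in the trace */
#define IOPAGE_START 0160000u   /* assumed: top 8 kB map to the I/O page */

/* Address of sigprop[sig] for a byte array, wrapped to 16 bits. */
uint16_t sigprop_addr(uint16_t sig)
{
    return (uint16_t)(SIGPROP_BASE + sig);
}

/* Returns 1 if addr falls into the assumed I/O page range. */
int in_io_page(uint16_t addr)
{
    return addr >= IOPAGE_START;
}
```

For the bogus signal number 0160750 this gives 0174036, the address seen in the trace, which lies in the I/O page. For the intended value 4 it gives 013072, an ordinary data address.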

After fixing the "$SIGILL." to "$SIGILL" (removing the extraneous '.') and three similar cases, the kernel doesn't panic anymore, and tcsh now crashes with an illegal instruction trap.

The question remains why tcsh runs into an illegal instruction. Now that a tcsh core dump is available, adb gives the answer

adb tcsh tcsh.core
    0172774: _rscan(0176024,0174434) from ~heredoc+0246
    0176040: _heredoc(067676) from ~execute+0234
    0176126: _execute(067040,01512,0,0) from ~execute+03410
    0176222: _execute(066754,01512,0,0) from ~process+01224
    0176274: _process(01) from ~main+06030
    0177414: _main() from start+0104
heredoc(), which is located in OV1, calls rscan(), which is in OV6 with
      rscan(Dv, Dtestq);
where Dtestq is a function pointer to Dtestq(), which, like heredoc(), is in OV1. rscan(), which has the signature
rscan(t, f)
   register Char **t;
   void    (*f) ();
uses f in the statement
   (*f) (*p++);

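The problem is that heredoc() passes the overlay entry point ~Dtestq, an address inside OV1, as the function pointer. Calling rscan() switches the overlay area to OV6, so when rscan() calls through f, this address no longer points to Dtestq() but to whatever OV6 holds at that location.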

It is clear that in this context _Dtestq, the forwarder in the base, must be used and not ~Dtestq, the entry point in the overlay. The generated code for rscan(Dv, Dtestq) is

~heredoc+0230:  mov     $0174434,(sp)         # arg Dtestq: uses ~Dtestq
~heredoc+0234:  mov     r5,-(sp)
~heredoc+0236:  add     $0177764,(sp)         # arg Dv
~heredoc+0242:  jsr     pc,*$_rscan
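The effect can be imitated in a toy C model. Everything in it is hypothetical and only illustrates the mechanism: one window holds a single overlay at a time, a raw slot number stands in for an overlay address such as ~Dtestq, and a call that maps the right overlay first stands in for the base forwarder _Dtestq. The wrong code simply returns -1 here, where the real machine ran into an illegal instruction.

```c
typedef int (*handler_t)(int);

/* Two pieces of code that occupy the same slot of the overlay window. */
static int dtestq_model(int c) { return c + 1000; }     /* in overlay 1 */
static int other_code(int c)   { (void)c; return -1; }  /* in overlay 6 */

#define NOVL  7
#define NSLOT 1
static handler_t overlay_code[NOVL][NSLOT];
static int mapped_overlay;      /* overlay currently in the window */

void overlay_init(void)
{
    overlay_code[1][0] = dtestq_model;
    overlay_code[6][0] = other_code;
    mapped_overlay = 1;         /* heredoc() runs in overlay 1 */
}

/* Raw call: jump to a slot in whatever overlay is mapped right now. */
int call_raw(int slot, int arg)
{
    return overlay_code[mapped_overlay][slot](arg);
}

/* Forwarder call: map the target overlay first, restore afterwards. */
int call_via_forwarder(int ovl, int slot, int arg)
{
    int saved = mapped_overlay, r;

    mapped_overlay = ovl;
    r = overlay_code[ovl][slot](arg);
    mapped_overlay = saved;
    return r;
}

/* rscan() lives in overlay 6, so entering it maps overlay 6. */
int rscan_model(int use_forwarder, int arg)
{
    int saved = mapped_overlay, r;

    mapped_overlay = 6;
    r = use_forwarder ? call_via_forwarder(1, 0, arg) : call_raw(0, arg);
    mapped_overlay = saved;
    return r;
}
```

After overlay_init(), rscan_model(1, 5) reaches dtestq_model and returns 1005, while rscan_model(0, 5) lands in the overlay 6 code and returns -1. Keeping caller, callee and callback in one overlay, as done in the fix, avoids the switch altogether.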
Since rscan() is very small and only used by heredoc(), I simply moved the code of rscan() from sh.glob.c (OV6) to sh.dol.c, where heredoc() and Dtestq() are also defined.

After that tcsh works fine with here documents

cat >x.x << EOF
cat x.x

Bottom line

I'll convert this into proper patches and send them to Steven, but this will take some time because I have to tidy up my system first to be in a position again to provide proper and clean patch sets.

P.S.: debugging the kernel issue was quite easy because the w11a CPU has three essential debug tools built into the CPU. After setting a breakpoint on the trap 004/010 handler, an inspection of the instruction trace gave the essential information. Below is a very condensed and annotated excerpt
  nc ....pc cprptnzvc ..dsrc ..ddst ..dres      vmaddr vmdata
# the "(*f) (*p++)" in tcsh, running onto an illegal instruction
  15 145210 uu00-.... 000105 173052 000105 w  d 173052 000105 mov r0,(sp)
  25 145212 uu00-.... 173050 174434 174434 w  d 173050 145216 jsr pc,@n(r5)
  19 174434 uu00-.... 000010 173064 000010 r  i 174434 000045 ?000045?
   1 174434 uu00-.... 000012 173064 000012 r  d 000010 000045 !VFETCH 010 RIT
# the "mov $SIGILL.,r0" in fpsim(), loads 160750 instead of 000004
  17 160744 ku00-n..c 160750 000045 160750 r  i 160746 160750 mov #n,r0
  14 160750 ku00-n..c 160752 160750 160732 r  i 160750 000770 br .-14
# the "sigprop[sig]" access in psignal(), which accesses 174036
# which leads to an external bus (or UNIBUS) timeout and IIT trap
  23 161314 ku00-.z.. 000000 147500 000000 w  d 147500 000000 mov r1,n(r5)
   9 161320 ku00-.z.. 174036 000000 000000 Ebto 174036 013066 movb n(r3),r0
   3 161320 ku00-.z.. 000006 000000 000006 r  d 000004 013066 !VFETCH 004 IIT
For the original thread on TUHS see posting 011650 and follow the Next message links.
Here, for reference, are the three patches
cd /usr/src/sys/pdp
diff mch_fpsim.s.orig mch_fpsim.s
<       mov     $SIGTRAP.,r0
>       mov     $SIGTRAP,r0     / wfjm: fixed constant usage, here and below
<       mov     $SIGILL.,r0
>       mov     $SIGILL,r0
<       mov     $SIGSEGV.,r0
>       mov     $SIGSEGV,r0
<       mov     $SIGFPE.,r0
>       mov     $SIGFPE,r0  
cd /usr/src/bin/tcsh
diff sh.dol.c.orig sh.dol.c
> /* wfjm: 2017-06-11: moved rscan from sh.glob.c to  sh.dol.c
>  *   rscan must be in same overlay as Dtestq and heredoc.
>  *   If they are in different overlays, the function pointer to Dtestq
>  *   passed in heredoc to rscan will be invalid after the overlay switch
>  *   to rscan.
> */
> void
> rscan(t, f)
>     register Char **t;
>     void    (*f) ();
> {
>     register Char *p;
>     while (p = *t++)
>       while (*p)
>           (*f) (*p++);
> }

diff sh.glob.c.orig sh.glob.c
< rscan(t, f)
<     register Char **t;
<     void    (*f) ();
< {
<     register Char *p;
<     while (p = *t++)
<       while (*p)
<           (*f) (*p++);
< }
< void