[j-nsp] what happens if HDD on routing-engine fails during the router operation?

Morgan McLean wrx230 at gmail.com
Wed Jun 26 14:20:32 EDT 2013


Interestingly enough, last night one of my EX switches had something happen
to its onboard flash, and the thing ate it pretty hard.

It came back up with errors like these, and then just crashed again shortly
afterwards. Note the unrecovered read errors on da0 and the vm_fault/pager
read errors below: once the flash stopped answering reads, running processes
could no longer page their binaries back in, so they started dying.

Jun 26 00:48:14  tor-205-a.sv.<snipped> fpc0 Route TCAM rows need not be redirected on device 0.
Jun 26 00:48:14  tor-205-a.sv.<snipped> fpc0 Route TCAM rows need not be redirected on device 1.
Jun 26 00:48:15  tor-205-a.sv.<snipped> fpc0 PFEM: Enabling traffic for dev 0
Jun 26 00:48:15  tor-205-a.sv.<snipped> chassisd[985]: LIBJSNMP_SA_PARTIAL_SEND_FRAG: Attempted to send 68 bytes, actually sent 4 bytes
Jun 26 00:48:15  tor-205-a.sv.<snipped> chassisd[985]: LIBJSNMP_SA_PARTIAL_SEND_REM: Queuing message remainder, 64 bytes
Jun 26 00:48:15  tor-205-a.sv.<snipped> fpc0 PFEM: Enabling traffic for dev 1
Jun 26 00:48:17  tor-205-a.sv.<snipped> /kernel: RT_PFE: RT msg op 1 (PREFIX ADD) failed, err 5 (Invalid)
Jun 26 00:48:17  tor-205-a.sv.<snipped> chassisd[985]: LIBJSNMP_SA_PARTIAL_SEND_FRAG: Attempted to send 68 bytes, actually sent 52 bytes
Jun 26 00:48:17  tor-205-a.sv.<snipped> chassisd[985]: LIBJSNMP_SA_PARTIAL_SEND_REM: Queuing message remainder, 16 bytes
Jun 26 00:48:19  tor-205-a.sv.<snipped> chassisd[985]: LIBJSNMP_SA_PARTIAL_SEND_FRAG: Attempted to send 68 bytes, actually sent 56 bytes
Jun 26 00:48:20  tor-205-a.sv.<snipped> chassisd[985]: LIBJSNMP_SA_PARTIAL_SEND_REM: Queuing message remainder, 12 bytes
Jun 26 00:48:21  tor-205-a.sv.<snipped> lldpd[1009]: LIBESPTASK_SNMP_CONN_RETRY: snmp_epi_reg_refresh: reattempting connection to SNMP agent (register MIBs): Resource temporarily unavailable
Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): READ(10). CDB: 28 0 0 19 6a 0 0 0 20 0
Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): CAM Status: SCSI Status Error
Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): SCSI Status: Check Condition
Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): MEDIUM ERROR asc:11,0
Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): Unrecovered read error
Jun 26 00:48:22  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): Retrying Command (per Sense Data)
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): READ(10). CDB: 28 0 0 19 6b 0 0 0 80 0
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): CAM Status: SCSI Status Error
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): SCSI Status: Check Condition
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): ILLEGAL REQUEST asc:20,0
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): Invalid command operation code
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: (da0:umass-sim0:0:0:0): Unretryable error
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: g_vfs_done():da0s3e[READ(offset=67502080, length=65536)]error = 22
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: vnode_pager_getpages: I/O read error
Jun 26 00:48:23  tor-205-a.sv.<snipped> /kernel: vm_fault: pager read error, pid 1047 (cp)
Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 pfe_pme_max 24
Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 PFEMAN: Sent Resync request to Master
Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 MRVL-L2:mrvl_brg_port_stg_entry_set(),293:l2ifl not found for ifl 4!
Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 MRVL-L2:mrvl_brg_port_stg_create(),539:Port-STG-Set failed(Invalid Params:-2)
Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 RT-HAL,rt_entry_add_msg_proc,2790: l2_halp_vectors->l2_entry_create failed
Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 RT-HAL,rt_entry_add_msg_proc,2883: proto MSTI,len 48 prefix 00004:00254 nh 82
Jun 26 00:48:25  tor-205-a.sv.<snipped> fpc0 RT-HAL,rt_msg_handler,597: route process failed
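
If anyone else hits this: the two things I would check once such a box
limps back up are the alarm and storage state -- standard Junos CLI,
nothing exotic:

    show chassis alarms
    show system storage

In this case da0 (the internal flash, attached via umass) was throwing
unrecovered medium errors, so there was not much left to salvage.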


On Wed, Jun 26, 2013 at 5:16 AM, Martin T <m4rtntns at gmail.com> wrote:

> I did not try "set chassis redundancy failover on-disk-failure", as
> that knob should be for GRES configurations, and I have a single RE in
> both the M10i and the M20.
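>
> ("show chassis routing-engine" is a quick way to confirm the slot
> population, for what it's worth.)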
>
>
> regards,
> Martin
>
> 2013/6/26, Per Granath <per.granath at gcc.com.cy>:
> > Note that these are two different configurations:
> >
> > set chassis routing-engine on-disk-failure disk-failure-action reboot
> > set chassis redundancy failover on-disk-failure
> >
> > Did you try both?
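> >
> > In hierarchy form, to show that they sit in different stanzas:
> >
> > chassis {
> >     routing-engine {
> >         on-disk-failure disk-failure-action reboot;
> >     }
> >     redundancy {
> >         failover {
> >             on-disk-failure;
> >         }
> >     }
> > }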
> >
> >
> > -----Original Message-----
> > From: Martin T [mailto:m4rtntns at gmail.com]
> > Sent: Wednesday, June 26, 2013 11:58 AM
> > To: Per Granath
> > Cc: merlyn at geeks.org; juniper-nsp at puck.nether.net
> > Subject: Re: [j-nsp] what happens if HDD on routing-engine fails during
> > the router operation?
> >
> > Hi,
> >
> > I did now :) However, it had no effect. On the other hand, forcibly
> > unmounting /var is not nearly the same thing as the complete removal or
> > failure of the HDD on a working routing-engine.
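> >
> > (A closer simulation might be to detach the HDD's ATA channel from the
> > shell, assuming the Junos build ships FreeBSD's atacontrol -- which I
> > have not verified -- and assuming the HDD sits on its own channel,
> > since detaching a channel takes every device on it away:
> >
> > % atacontrol list
> > % atacontrol detach <channel-with-ad1>
> >
> > Definitely not something to try on a production box.)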
> >
> >
> > Example with M20:
> >
> > root@M20> show configuration chassis
> > routing-engine {
> >     on-disk-failure disk-failure-action reboot;
> > }
> >
> > root@M20> show system processes brief
> > last pid:  1475;  load averages:  0.00,  0.12,  0.15  up 0+00:11:35   07:08:28
> > 105 processes: 3 running, 86 sleeping, 16 waiting
> >
> > Mem: 136M Active, 115M Inact, 32M Wired, 132M Cache, 69M Buf, 1580M Free
> > Swap: 2048M Total, 2048M Free
> >
> >
> >
> >
> > root@M20> start shell csh
> > root@M20% mount
> > /dev/ad0s1a on / (ufs, local, noatime)
> > devfs on /dev (devfs, local)
> > devfs on /dev/ (devfs, local, noatime, noexec, read-only)
> > /dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only)
> > /dev/md1 on /packages/mnt/jkernel-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md2 on /packages/mnt/jpfe-M40-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md3 on /packages/mnt/jdocs-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md4 on /packages/mnt/jroute-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md5 on /packages/mnt/jcrypto-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md6 on /packages/mnt/jpfe-common-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md7 on /tmp (ufs, local, noatime, soft-updates)
> > /dev/md8 on /mfs (ufs, local, noatime, soft-updates)
> > /dev/ad0s1e on /config (ufs, local, noatime)
> > procfs on /proc (procfs, local, noatime)
> > /dev/ad1s1f on /var (ufs, local, noatime)
> > root@M20% umount -f /var
> > root@M20% mount
> > /dev/ad0s1a on / (ufs, local, noatime)
> > devfs on /dev (devfs, local)
> > devfs on /dev/ (devfs, local, noatime, noexec, read-only)
> > /dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only)
> > /dev/md1 on /packages/mnt/jkernel-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md2 on /packages/mnt/jpfe-M40-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md3 on /packages/mnt/jdocs-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md4 on /packages/mnt/jroute-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md5 on /packages/mnt/jcrypto-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md6 on /packages/mnt/jpfe-common-9.4R3.5 (cd9660, local, noatime, read-only)
> > /dev/md7 on /tmp (ufs, local, noatime, soft-updates)
> > /dev/md8 on /mfs (ufs, local, noatime, soft-updates)
> > /dev/ad0s1e on /config (ufs, local, noatime)
> > procfs on /proc (procfs, local, noatime)
> > root@M20% exit
> > exit
> >
> > root@M20> ?
> > No valid completions
> > root@M20>
> > error: unknown command: .noop-command
> >
> >
> > root@M20> Jun 26 07:09:49 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 07:09:54 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 07:09:59 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 07:10:04 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 07:10:04 init: can't chdir to /var/tmp/: No such file or directory
> >
> >
> >
> > Example with M10i:
> >
> > root@M10i> show configuration chassis
> > routing-engine {
> >     on-disk-failure disk-failure-action reboot;
> > }
> >
> > root@M10i> show system processes brief
> > last pid:  1473;  load averages:  3.97,  1.22,  0.47  up 0+00:02:46   08:17:13
> > 111 processes: 5 running, 89 sleeping, 17 waiting
> >
> > Mem: 181M Active, 54M Inact, 33M Wired, 216M Cache, 69M Buf, 1012M Free
> > Swap: 2048M Total, 2048M Free
> >
> >
> >
> >
> > root@M10i> start shell csh
> > root@M10i% mount
> > /dev/ad0s1a on / (ufs, local, noatime)
> > devfs on /dev (devfs, local, multilabel)
> > /dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only, verified)
> > /dev/md1 on /packages/mnt/jkernel-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md2 on /packages/mnt/jpfe-M7i-10.4R12.4 (cd9660, local, noatime, read-only)
> > /dev/md3 on /packages/mnt/jdocs-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md4 on /packages/mnt/jroute-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md5 on /packages/mnt/jcrypto-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md6 on /packages/mnt/jpfe-common-10.4R12.4 (cd9660, local, noatime, read-only)
> > /dev/md7 on /packages/mnt/jruntime-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md8 on /tmp (ufs, asynchronous, local, noatime)
> > /dev/md9 on /mfs (ufs, asynchronous, local, noatime)
> > /dev/ad0s1e on /config (ufs, local, noatime)
> > procfs on /proc (procfs, local, noatime)
> > /dev/ad1s1f on /var (ufs, local, noatime)
> > root@M10i% umount -f /var
> > root@M10i% mount
> > /dev/ad0s1a on / (ufs, local, noatime)
> > devfs on /dev (devfs, local, multilabel)
> > /dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only, verified)
> > /dev/md1 on /packages/mnt/jkernel-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md2 on /packages/mnt/jpfe-M7i-10.4R12.4 (cd9660, local, noatime, read-only)
> > /dev/md3 on /packages/mnt/jdocs-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md4 on /packages/mnt/jroute-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md5 on /packages/mnt/jcrypto-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md6 on /packages/mnt/jpfe-common-10.4R12.4 (cd9660, local, noatime, read-only)
> > /dev/md7 on /packages/mnt/jruntime-10.4R12.4 (cd9660, local, noatime, read-only, verified)
> > /dev/md8 on /tmp (ufs, asynchronous, local, noatime)
> > /dev/md9 on /mfs (ufs, asynchronous, local, noatime)
> > /dev/ad0s1e on /config (ufs, local, noatime)
> > procfs on /proc (procfs, local, noatime)
> > root@M10i% Jun 26 08:18:04 init: can't chdir to /var/tmp/: No such file or directory
> > exit
> > exit
> >
> > root@M10i> Jun 26 08:18:09 init: can't chdir to /var/tmp/: No such file or directory
> > ?
> > No valid completions
> > root@M10i> Jun 26 08:18:15 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 08:18:20 init: can't chdir to /var/tmp/: No such file or directory
> > Jun 26 08:18:20 init: can't chdir to /var/tmp/: No such file or directory
> >
> >
> > One other important consequence of an HDD failure is that the swap space
> > is lost. This is probably rather critical with, for example, an
> > RE-333-256. In addition, it looks like the RE-850 has no problem booting
> > up without the HDD, while the RE-600 and RE-333 do not boot without it.
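> >
> > (From the shell, FreeBSD's swapinfo shows which device backs the swap:
> >
> > % swapinfo
> >
> > On these REs it should be a partition on the same HDD -- ad1s1b, if
> > memory serves, so treat that detail as a guess.)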
> >
> >
> > Still, what exactly makes the RE reload when the HDD is lost?
> >
> >
> > regards,
> > Martin
> >
> > 2013/6/26, Per Granath <per.granath at gcc.com.cy>:
> >> Did you try it with this configuration?
> >>
> >> chassis {
> >>     redundancy {
> >>         failover {
> >>             on-loss-of-keepalives;
> >>             on-disk-failure;
> >>         }
> >>     }
> >> }
> >>
> >>
> >>
> >> _______________________________________________
> >> juniper-nsp mailing list juniper-nsp at puck.nether.net
> >> https://puck.nether.net/mailman/listinfo/juniper-nsp
> >>
> >
> _______________________________________________
> juniper-nsp mailing list juniper-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/juniper-nsp
>



-- 
Thanks,
Morgan

