[j-nsp] what happens if HDD on routing-engine fails during the router operation?

Martin T m4rtntns at gmail.com
Tue Jun 25 11:27:29 EDT 2013


Hi,

M and MX series routing-engines have an HDD (or SSD) installed which
holds a UFS file system mounted on /var. The /var directory contains
many important sub-directories like "log" for log files, "crash" for
core-dumps, "tmp" for some temporary files, etc. However, what happens
if the HDD fails while the routing-engine is operational? As there is
no easy way to remove an HDD from an operating RE, I force-unmounted
the HDD file system (/var) on an operational routing-engine. The first
example is with an M20 (RE-600):

root at M20> show chassis hardware detail | match ad
  ad0     245 MB  SanDisk SDCFB-256    101120L0703U0953  Compact Flash
  ad1   28615 MB  FUJITSU MHR2030AT D  NJ69T3A14196      Hard Disk

root at M20> start shell sh
# uname -a
JUNOS M20 9.4R3.5 JUNOS 9.4R3.5 #0: 2009-07-24 23:24:53 UTC
builder at firth.juniper.net:/volume/build/junos/9.4/release/9.4R3.5/obj-i386/sys/compile/JUNIPER
 i386
# mount
/dev/ad0s1a on / (ufs, local, noatime)
devfs on /dev (devfs, local)
devfs on /dev/ (devfs, local, noatime, noexec, read-only)
/dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only)
/dev/md1 on /packages/mnt/jkernel-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md2 on /packages/mnt/jpfe-M40-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md3 on /packages/mnt/jdocs-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md4 on /packages/mnt/jroute-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md5 on /packages/mnt/jcrypto-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md6 on /packages/mnt/jpfe-common-9.4R3.5 (cd9660, local, noatime,
read-only)
/dev/md7 on /tmp (ufs, local, noatime, soft-updates)
/dev/md8 on /mfs (ufs, local, noatime, soft-updates)
/dev/ad0s1e on /config (ufs, local, noatime)
procfs on /proc (procfs, local, noatime)
/dev/ad1s1f on /var (ufs, local, noatime)
# umount -f /var
# mount
/dev/ad0s1a on / (ufs, local, noatime)
devfs on /dev (devfs, local)
devfs on /dev/ (devfs, local, noatime, noexec, read-only)
/dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only)
/dev/md1 on /packages/mnt/jkernel-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md2 on /packages/mnt/jpfe-M40-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md3 on /packages/mnt/jdocs-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md4 on /packages/mnt/jroute-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md5 on /packages/mnt/jcrypto-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md6 on /packages/mnt/jpfe-common-9.4R3.5 (cd9660, local, noatime,
read-only)
/dev/md7 on /tmp (ufs, local, noatime, soft-updates)
/dev/md8 on /mfs (ufs, local, noatime, soft-updates)
/dev/ad0s1e on /config (ufs, local, noatime)
procfs on /proc (procfs, local, noatime)
# clJun 25 12:03:55 init: can't chdir to /var/tmp/: No such file or directory
^R

# Jun 25 12:04:01 init: can't chdir to /var/tmp/: No such file or directory

# exit

root at M20> Jun 25 12:04:06 init: can't chdir to /var/tmp/: No such file
or directory
error: unknown command: .noop-command


WARNING: cli has been replaced by an updated version:
CLI release 9.4R3.5 built by builder on 2009-07-24 23:11:30 UTC
Restart cli using the new version ? [yes,no] (yes)

Restarting cli ...
Jun 25 12:04:11 init: can't chdir to /var/tmp/: No such file or directory
Jun 25 12:04:11 init: can't chdir to /var/tmp/: No such file or directory
could not open user interface connection: management daemon not responding
Retry connection attempts ? [yes,no] (yes) yes
could not open user interface connection: management daemon not responding
Retry connection attempts ? [yes,no] (yes) no
root at M20% ps aux | grep mgd
root at M20% /usr/sbin/mgd -N
mgd: error: could not open database: /var/run/db/schema.db: No such
file or directory
mgd: error: Database open failed for file '/var/run/db/schema.db': No
such file or directory
mgd: error: could not open database schema: /var/run/db/schema.db
mgd: error: could not open database schema
mgd: error: database schema is out of date, rebuilding it
mgd: error: could not open database: /var/run/db/juniper.data: No such
file or directory
mgd: error: Database open failed for file '/var/run/db/juniper.data':
No such file or directory
mgd: error: Cannot read configuration: Could not open configuration database
mgd: error: daemon MGD detects existing daemon using lock file
'/var/run/mgd.pid'
root at M20% mount /dev/ad1s1f /var
root at M20% /usr/sbin/mgd
root at M20% cli
root at M20>
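
For reference, the check I did by eye in the mount output above can be
scripted. This is only a minimal sketch, assuming a POSIX sh;
is_mounted is a made-up helper name, not a Junos command:

```shell
#!/bin/sh
# Sketch: report whether a mount point appears in `mount` output.
# is_mounted is a hypothetical helper, not part of Junos.
# It reads lines of the form "<dev> on <dir> (<options>)" on stdin.
is_mounted() {
    # $1 = mount point to look for, e.g. /var
    while read -r dev on dir rest; do
        # Wrapped continuation lines such as "read-only)" never match.
        [ "$on" = "on" ] && [ "$dir" = "$1" ] && return 0
    done
    return 1
}

# Usage in the RE shell:
#   mount | is_mounted /var || echo "/var is not mounted"
```

It only looks at the mount table, so it would not notice a disk that
is still mounted but no longer responds.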



The second example is with an M10i (RE-850):

root at M10i> show chassis hardware detail | match ad
  ad0     999 MB  SILICONSYSTEMS INC 1GB C9183198528209048W01 Compact Flash
  ad1   38154 MB  FUJITSU MHV2040AS    NT19T842CY34      Hard Disk

root at M10i> start shell sh
# uname -a
JUNOS M10i 10.4R12.4 JUNOS 10.4R12.4 #0: 2013-01-09 10:01:08 UTC
builder at larth.juniper.net:/volume/build/junos/10.4/release/10.4R12.4/obj-i386/bsd/sys/compile/JUNIPER
 i386
# mount
/dev/ad0s1a on / (ufs, local, noatime)
devfs on /dev (devfs, local, multilabel)
/dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only, verified)
/dev/md1 on /packages/mnt/jkernel-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md2 on /packages/mnt/jpfe-M7i-10.4R12.4 (cd9660, local, noatime, read-only)
/dev/md3 on /packages/mnt/jdocs-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md4 on /packages/mnt/jroute-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md5 on /packages/mnt/jcrypto-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md6 on /packages/mnt/jpfe-common-10.4R12.4 (cd9660, local,
noatime, read-only)
/dev/md7 on /packages/mnt/jruntime-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md8 on /tmp (ufs, asynchronous, local, noatime)
/dev/md9 on /mfs (ufs, asynchronous, local, noatime)
/dev/ad0s1e on /config (ufs, local, noatime)
procfs on /proc (procfs, local, noatime)
/dev/ad1s1f on /var (ufs, local, noatime)
# umount -f /var
# mount
/dev/ad0s1a on / (ufs, local, noatime)
devfs on /dev (devfs, local, multilabel)
/dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only, verified)
/dev/md1 on /packages/mnt/jkernel-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md2 on /packages/mnt/jpfe-M7i-10.4R12.4 (cd9660, local, noatime, read-only)
/dev/md3 on /packages/mnt/jdocs-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md4 on /packages/mnt/jroute-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md5 on /packages/mnt/jcrypto-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md6 on /packages/mnt/jpfe-common-10.4R12.4 (cd9660, local,
noatime, read-only)
/dev/md7 on /packages/mnt/jruntime-10.4R12.4 (cd9660, local, noatime,
read-only, verified)
/dev/md8 on /tmp (ufs, asynchronous, local, noatime)
/dev/md9 on /mfs (ufs, asynchronous, local, noatime)
/dev/ad0s1e on /config (ufs, local, noatime)
procfs on /proc (procfs, local, noatime)
# exit

root at M10i> sho
           ^
unknown command.
root at M10i> show
           ^
unknown command.
root at M10i> ?
No valid completions
root at M10i> start
           ^
unknown command.
root at M10i> exit
           ^
unknown command.

root at M10i>
error: unknown command: .noop-command


root at M10i>
error: unknown command: .noop-command


root at M10i> Jun 25 13:24:38 init: can't chdir to /var/tmp/: No such
file or directory
Jun 25 13:24:43 init: can't chdir to /var/tmp/: No such file or directory



In the case of the M10i (RE-850) I waited for a few hours after
unmounting /var for some watchdog timer to kick in, but nothing
happened. Finally I just remounted the HDD and restarted the mgd
process. The RE then worked as it should.
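
The recovery was the same on both routers and can be sketched as
below. The device name and the mgd path are the ones from the
transcripts above; recover_var and RUN are made-up names, and by
default the sketch only prints the commands instead of running them:

```shell
#!/bin/sh
# Sketch of the recovery steps used in this post (/var on ad1s1f).
# recover_var and RUN are hypothetical names for illustration.
# By default the commands are only printed; run with RUN= (empty)
# as root to actually execute them.
RUN=${RUN-echo}

recover_var() {
    # $1 = device holding /var (ad1s1f on both routers here)
    $RUN mount "$1" /var || return 1
    # Start mgd again so that the cli can connect to it.
    $RUN /usr/sbin/mgd
}

recover_var /dev/ad1s1f
```

After that, "cli" opened a normal session again on the M20.
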
According to KB19024, "Hard drive access suddenly lost" is at least
one of the reasons which cause the watchdog timer to reload the
routing engine. Is the watchdog timer triggered only when the HDD is
physically removed, i.e. when the HDD actually fails? What exactly
does this watchdog timer check?


regards,
Martin

