[j-nsp] what happens if HDD on routing-engine fails during the router operation?
Martin T
m4rtntns at gmail.com
Tue Jun 25 11:27:29 EDT 2013
Hi,
M and MX series routing engines have an HDD (or SSD) installed, which
holds a UFS file system mounted on /var. The /var directory contains
many important sub-directories, such as "log" for log files, "crash"
for core dumps and "tmp" for temporary files. However, what happens if
the HDD fails while the routing engine is operational? As there is no
easy way to remove an HDD from an operating RE, I instead unmounted the
HDD's file system on an operational routing engine. The first example
is with an M20 (RE-600):
root@M20> show chassis hardware detail | match ad
ad0 245 MB SanDisk SDCFB-256 101120L0703U0953 Compact Flash
ad1 28615 MB FUJITSU MHR2030AT D NJ69T3A14196 Hard Disk
root@M20> start shell sh
# uname -a
JUNOS M20 9.4R3.5 JUNOS 9.4R3.5 #0: 2009-07-24 23:24:53 UTC builder@firth.juniper.net:/volume/build/junos/9.4/release/9.4R3.5/obj-i386/sys/compile/JUNIPER i386
# mount
/dev/ad0s1a on / (ufs, local, noatime)
devfs on /dev (devfs, local)
devfs on /dev/ (devfs, local, noatime, noexec, read-only)
/dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only)
/dev/md1 on /packages/mnt/jkernel-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md2 on /packages/mnt/jpfe-M40-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md3 on /packages/mnt/jdocs-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md4 on /packages/mnt/jroute-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md5 on /packages/mnt/jcrypto-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md6 on /packages/mnt/jpfe-common-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md7 on /tmp (ufs, local, noatime, soft-updates)
/dev/md8 on /mfs (ufs, local, noatime, soft-updates)
/dev/ad0s1e on /config (ufs, local, noatime)
procfs on /proc (procfs, local, noatime)
/dev/ad1s1f on /var (ufs, local, noatime)
# umount -f /var
# mount
/dev/ad0s1a on / (ufs, local, noatime)
devfs on /dev (devfs, local)
devfs on /dev/ (devfs, local, noatime, noexec, read-only)
/dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only)
/dev/md1 on /packages/mnt/jkernel-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md2 on /packages/mnt/jpfe-M40-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md3 on /packages/mnt/jdocs-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md4 on /packages/mnt/jroute-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md5 on /packages/mnt/jcrypto-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md6 on /packages/mnt/jpfe-common-9.4R3.5 (cd9660, local, noatime, read-only)
/dev/md7 on /tmp (ufs, local, noatime, soft-updates)
/dev/md8 on /mfs (ufs, local, noatime, soft-updates)
/dev/ad0s1e on /config (ufs, local, noatime)
procfs on /proc (procfs, local, noatime)
# clJun 25 12:03:55 init: can't chdir to /var/tmp/: No such file or directory
^R
# Jun 25 12:04:01 init: can't chdir to /var/tmp/: No such file or directory
# exit
root@M20> Jun 25 12:04:06 init: can't chdir to /var/tmp/: No such file or directory
error: unknown command: .noop-command
WARNING: cli has been replaced by an updated version:
CLI release 9.4R3.5 built by builder on 2009-07-24 23:11:30 UTC
Restart cli using the new version ? [yes,no] (yes)
Restarting cli ...
Jun 25 12:04:11 init: can't chdir to /var/tmp/: No such file or directory
Jun 25 12:04:11 init: can't chdir to /var/tmp/: No such file or directory
could not open user interface connection: management daemon not responding
Retry connection attempts ? [yes,no] (yes) yes
could not open user interface connection: management daemon not responding
Retry connection attempts ? [yes,no] (yes) no
root@M20% ps aux | grep mgd
root@M20% /usr/sbin/mgd -N
mgd: error: could not open database: /var/run/db/schema.db: No such file or directory
mgd: error: Database open failed for file '/var/run/db/schema.db': No such file or directory
mgd: error: could not open database schema: /var/run/db/schema.db
mgd: error: could not open database schema
mgd: error: database schema is out of date, rebuilding it
mgd: error: could not open database: /var/run/db/juniper.data: No such file or directory
mgd: error: Database open failed for file '/var/run/db/juniper.data': No such file or directory
mgd: error: Cannot read configuration: Could not open configuration database
mgd: error: daemon MGD detects existing daemon using lock file '/var/run/mgd.pid'
root@M20% mount /dev/ad1s1f /var
root@M20% /usr/sbin/mgd
root@M20% cli
root@M20>
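
For anyone repeating this test, the before-and-after "mount" listings
above can be compared mechanically instead of by eye. The following is
only a minimal sketch: is_mounted is a hypothetical helper name (not a
Junos or FreeBSD command), and it assumes that mount prints lines of
the form "device on mountpoint (options)", as in the output above.

```shell
# Hypothetical helper: reads mount-style output on stdin and succeeds
# (exit status 0) if the given mount point appears as a mounted path.
is_mounted() {
    mount_point="$1"
    # Each listing line looks like: <device> on <path> (<options>)
    while read -r _device word path _rest; do
        if [ "$word" = "on" ] && [ "$path" = "$mount_point" ]; then
            return 0
        fi
    done
    # No line matched, so the mount point is not in the listing.
    return 1
}
```

From the shell it could be used as: mount | is_mounted /var || echo "/var is not mounted"
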
The second example is with an M10i (RE-850):
root@M10i> show chassis hardware detail | match ad
ad0 999 MB SILICONSYSTEMS INC 1GB C9183198528209048W01 Compact Flash
ad1 38154 MB FUJITSU MHV2040AS NT19T842CY34 Hard Disk
root@M10i> start shell sh
# uname -a
JUNOS M10i 10.4R12.4 JUNOS 10.4R12.4 #0: 2013-01-09 10:01:08 UTC builder@larth.juniper.net:/volume/build/junos/10.4/release/10.4R12.4/obj-i386/bsd/sys/compile/JUNIPER i386
# mount
/dev/ad0s1a on / (ufs, local, noatime)
devfs on /dev (devfs, local, multilabel)
/dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only, verified)
/dev/md1 on /packages/mnt/jkernel-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md2 on /packages/mnt/jpfe-M7i-10.4R12.4 (cd9660, local, noatime, read-only)
/dev/md3 on /packages/mnt/jdocs-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md4 on /packages/mnt/jroute-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md5 on /packages/mnt/jcrypto-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md6 on /packages/mnt/jpfe-common-10.4R12.4 (cd9660, local, noatime, read-only)
/dev/md7 on /packages/mnt/jruntime-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md8 on /tmp (ufs, asynchronous, local, noatime)
/dev/md9 on /mfs (ufs, asynchronous, local, noatime)
/dev/ad0s1e on /config (ufs, local, noatime)
procfs on /proc (procfs, local, noatime)
/dev/ad1s1f on /var (ufs, local, noatime)
# umount -f /var
# mount
/dev/ad0s1a on / (ufs, local, noatime)
devfs on /dev (devfs, local, multilabel)
/dev/md0 on /packages/mnt/jbase (cd9660, local, noatime, read-only, verified)
/dev/md1 on /packages/mnt/jkernel-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md2 on /packages/mnt/jpfe-M7i-10.4R12.4 (cd9660, local, noatime, read-only)
/dev/md3 on /packages/mnt/jdocs-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md4 on /packages/mnt/jroute-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md5 on /packages/mnt/jcrypto-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md6 on /packages/mnt/jpfe-common-10.4R12.4 (cd9660, local, noatime, read-only)
/dev/md7 on /packages/mnt/jruntime-10.4R12.4 (cd9660, local, noatime, read-only, verified)
/dev/md8 on /tmp (ufs, asynchronous, local, noatime)
/dev/md9 on /mfs (ufs, asynchronous, local, noatime)
/dev/ad0s1e on /config (ufs, local, noatime)
procfs on /proc (procfs, local, noatime)
# exit
root@M10i> sho
^
unknown command.
root@M10i> show
^
unknown command.
root@M10i> ?
No valid completions
root@M10i> start
^
unknown command.
root@M10i> exit
^
unknown command.
root@M10i>
error: unknown command: .noop-command
root@M10i>
error: unknown command: .noop-command
root@M10i> Jun 25 13:24:38 init: can't chdir to /var/tmp/: No such file or directory
Jun 25 13:24:43 init: can't chdir to /var/tmp/: No such file or directory
Jun 25 13:24:43 init: can't chdir to /var/tmp/: No such file or directory
In the case of the M10i (RE-850), I waited a few hours after
unmounting /var for some watchdog timer to kick in, but nothing
happened. Finally, I just remounted the HDD and restarted the mgd
process. After that, the RE worked as it should.
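
The recovery used on both routers amounts to two commands from the
shell: remount the HDD partition on /var, then start mgd again. A
minimal sketch follows, assuming the /var partition is /dev/ad1s1f as
in the mount output above. The run and recover_var helper names and
the DRY_RUN switch are made up for illustration; by default the sketch
only prints what it would do.

```shell
# Sketch only: replays the recovery steps described in this post.
# DISK and DRY_RUN are assumptions made for illustration.
DISK="${DISK:-/dev/ad1s1f}"   # /var partition as shown in the mount output
DRY_RUN="${DRY_RUN:-1}"       # 1 = print the commands, anything else = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

recover_var() {
    run mount "$DISK" /var    # put the HDD file system back on /var
    run /usr/sbin/mgd         # start mgd again once /var is available
}
```

Calling recover_var with DRY_RUN=1 lists the two commands; setting
DRY_RUN=0 would actually execute them, which should only be done as
root on a box where /var is really missing.
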
According to KB19024, "Hard drive access suddenly lost" is one of the
reasons that cause the watchdog timer to reload the routing engine. Is
the watchdog timer triggered only when the HDD is physically removed,
i.e. when the HDD actually fails? What exactly does this watchdog
timer check?
regards,
Martin