[j-nsp] mx960 crashed
Aaron Gould
aaron1 at gvtc.com
Wed Apr 4 15:02:09 EDT 2018
Thanks
JTAC says "DMA errors indicate a faulty SSD" and is processing an RMA for RE1.
agould at mx960> show chassis routing-engine | grep "reason|slot|uptime"
Slot 0:
Uptime 19 days, 23 hours, 37 minutes, 40 seconds
Last reboot reason 0x4000:VJUNOS reboot
Slot 1:
Uptime 5 hours, 52 minutes, 22 seconds
Last reboot reason 0x800:reboot due to exception
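The "Last reboot reason" values above are bit flags. A tiny sketch of decoding them, mapping only the two codes that appear in this output (an illustrative helper, not an official Junos reason table):

```python
# Decode "Last reboot reason" bit flags from "show chassis routing-engine".
# Only the two bits observed in the output above are mapped; this is an
# illustrative sketch, not an authoritative Junos table.
REBOOT_REASONS = {
    0x4000: "VJUNOS reboot",
    0x0800: "reboot due to exception",
}

def decode_reboot_reason(code: int) -> list[str]:
    """Return the names of all reason bits set in `code`."""
    return [name for bit, name in REBOOT_REASONS.items() if code & bit]

print(decode_reboot_reason(0x4000))  # slot 0
print(decode_reboot_reason(0x800))   # slot 1
```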
No /dev/mapper/ entries are seen anywhere on either RE:
RE0
…
/dev/kmem Last changed: Mar 15 14:14:49
/dev/ksyms Last changed: Mar 15 14:15:16
/dev/led/ Last changed: Mar 15 14:14:52
/dev/mch Last changed: Mar 15 14:23:50
/dev/md0 Last changed: Mar 15 14:23:50
/dev/md0.uzip Last changed: Mar 15 14:14:49
/dev/md1 Last changed: Mar 15 14:14:53
/dev/md1.uzip Last changed: Mar 15 14:14:53
…
BU RE1
…
/dev/ksyms Last changed: Apr 04 08:08:38
/dev/led/ Last changed: Apr 04 08:08:16
/dev/mch Last changed: Apr 04 08:09:06
/dev/md0 Last changed: Apr 04 08:09:06
/dev/md0.uzip Last changed: Apr 04 08:08:13
…
From: Graham Brown [mailto:juniper-nsp at grahambrown.info]
Sent: Wednesday, April 4, 2018 12:55 PM
To: Aaron Gould
Cc: juniper-nsp at puck.nether.net
Subject: Re: [j-nsp] mx960 crashed
Hi Aaron,
Looks like you have a core file there; raise a TAC case and they'll be able to determine the root cause for you.
Also see if there is anything readable in /dev/mapper/jvg_P-jlvmrootrw/var/crash/override_reboot_reason that may point you to the root cause.
HTH,
Graham
On Thu, 5 Apr 2018 at 01:13, Aaron Gould <aaron1 at gvtc.com> wrote:
I'm still in the process of turning up this new five-node MX960 100-gig ring. I
went to the backup RE console, went to log in, and saw this happen.
Any idea why this happened, and how do I troubleshoot the cause?
FreeBSD/amd64 (mx960) (ttyu0)
login: root
Password:SysRq : Trigger a crash
dmar: DRHD: handling fault status reg 2
dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 151d9000
DMAR:[fault reason 01] Present bit in root entry is clear
dmar: DRHD: handling fault status reg 102
dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 1540c000
DMAR:[fault reason 01] Present bit in root entry is clear
dmar: DRHD: handling fault status reg 202
dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 152fc000
DMAR:[fault reason 01] Present bit in root entry is clear
dmar: DRHD: handling fault status reg 302
dmar: DMAR:[DMA Write] Request device [06:0a.0] fault addr 1ed10d000
DMAR:[fault reason 01] Present bit in root entry is clear
dmar: DRHD: handling fault status reg 402
dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
DMAR:[fault reason 01] Present bit in root entry is clear
dmar: INTR-REMAP: Request device [[05:00.1] fault index 40
INTR-REMAP:[fault reason 34] Present field in the IRTE entry is clear
dmar: DRHD: handling fault status reg 602
dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
DMAR:[fault reason 01] Present bit in root entry is clear
dmar: DRHD: handling fault status reg 702
dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
DMAR:[fault reason 01] Present bit in root entry is clear
4 logical volume(s) in volume group "jvg_S" now active
4 logical volume(s) in volume group "jvg_P" now active
Override SW Exception reboot reason saved to
/dev/mapper/jvg_P-jlvmrootrw/var/crash/override_reboot_reason
Compressed level 31 vmhost kernel core will be dumped to jvg_P-jlvmrootrw.
Please use crash utility to analyze the core
Copying data : [ 13 %]
.
.
.
Copying data : [100 %]
The dumpfile is saved to vmcore-0-compressed-201804041305.
makedumpfile Completed.
(many more boot messages seen)
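For what it's worth, the DMAR fault lines above all implicate two PCI functions (02:10.1 and 06:0a.0). A quick hypothetical helper to tally faults per device from a console log like the one above (not a Juniper tool, just a sketch):

```python
import re
from collections import Counter

# Tally DMAR fault messages by (device, access type). The regex matches the
# 'dmar: DMAR:[DMA Write] Request device [02:10.1] ...' lines from the
# console log above; this is a quick illustrative helper, not a Juniper tool.
FAULT_RE = re.compile(r"DMAR:\[DMA (Read|Write)\] Request device \[([0-9a-f:.]+)\]")

def tally_dmar_faults(log: str) -> Counter:
    """Count faults per (device, access) pair found in the log text."""
    return Counter((dev, access) for access, dev in FAULT_RE.findall(log))

log = """\
dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 151d9000
dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 1540c000
dmar: DMAR:[DMA Write] Request device [02:10.1] fault addr 152fc000
dmar: DMAR:[DMA Write] Request device [06:0a.0] fault addr 1ed10d000
dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
dmar: DMAR:[DMA Read] Request device [06:0a.0] fault addr ad4a000
"""
print(tally_dmar_faults(log))
```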
-Aaron