[c-nsp] Strange crash with 7206 VXR NPE-400 IO-FE

Ed Ravin eravin at panix.com
Sat Sep 9 01:06:00 EDT 2006

I've been having a weird problem with a 7206 VXR with an NPE-400, an
IO-FE, and a PA-GE.  The router works fine until you power it down.
When it is powered back up, it runs the bootloader and then:

  %ERR-1-GT64120 (PCI-0): Fatal error, PCI Master abort
   GT=0xB4000000, cause=0x0100E483, mask=0x0ED01F00, real_cause=0x00000400

Then it dumps out a whole pile of diagnostic data and reloads, and does
the same thing again ad infinitum.  A full sequence is shown below.
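
The three hex values in that error line appear to be related by a plain
bitwise AND: masking "cause" with "mask" gives the reported "real_cause".
The Python sketch below only checks that arithmetic on the values copied
from the message.  The function name is made up for illustration, and
what the surviving bit means on the GT64120 is not something this post
establishes.

```python
def real_cause(cause: int, mask: int) -> int:
    """Return the cause bits that survive the mask.

    Assumption: real_cause = cause AND mask, inferred from the numbers
    in the error line, not taken from any Cisco documentation.
    """
    return cause & mask


# Values copied from the %ERR-1-GT64120 line.
cause = 0x0100E483
mask = 0x0ED01F00

result = real_cause(cause, mask)
print(f"real_cause=0x{result:08X}")

# List which bit positions are still set after masking.
set_bits = [bit for bit in range(32) if (result >> bit) & 1]
print("set bits:", set_bits)
```

With these inputs the result matches the logged real_cause=0x00000400,
i.e. a single bit (bit 10) is left after masking.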

The story behind this router is that we bought it used.  I noticed that
the boot image in the bootflash on the IO-FE didn't support the NPE-400,
so I thought I'd be helpful and put in one that did (the 12.0S version
that was the subject of some discussion here a bunch of months ago).
Shortly after that, my troubles began.

I Googled for the error message above, and got exactly one useful hit -
the page was in Russian, but with the help of Babelfish and an expat
acquaintance, I eventually figured out that the guy thought the problem
was a loose NPE-400 even though it seemed to be "thoroughly secured".  So
I took out the NPE-400, shoved it back in, and voila, router stopped crashing.

I brought the router to another site and started using it - then did
some config changes and decided to boot it from power off just to make
sure everything was OK.  Alas, those nasty error messages popped up again,
and once again, the router was useless until I re-seated the NPE.

I then erased the suspect bootflash and rebooted, but it was still no
good.  I reseated the NPE-400 again, stuck in a 12.0 non-S bootflash,
and after a couple of reseats and reboots, the router seemed to be
stable.  But I'm rather worried that it will start again.

I suspect some issue with the bootflash change, since the symptoms
only happen on power-up, not on reload, and bootflash seems to be read
only on power-up.  But what kind of software error can only be fixed by
reseating the NPE-400?  And yes, the NPE-400 is in there nice and tight,
tied down by the captive screws.

The full dump is below.  Anyone have any suggestions?

Bad CPU ID 00002732
System Bootstrap, Version 12.1(20000710:044039) [nlaw-121E_npeb 117], DEVELOPMENT SOFTWARE
Copyright (c) 1994-2000 by cisco Systems, Inc.
C7200 platform with 524288 Kbytes of main memory

Self decompressing the image : #################################################################################################]

Cisco IOS Software, 7200 Software (C7200-ADVIPSERVICESK9-M), Version 12.4(9)T, RELEASE SOFTWARE (fc1)
Copyright (c) 1986-2006 by Cisco Systems, Inc.
Compiled Fri 16-Jun-06 17:27 by prod_rel_team
Image text-base: 0x60009084, data-base: 0x6308C000

%ERR-1-GT64120 (PCI-0): Fatal error, PCI Master abort
 GT=0xB4000000, cause=0x0100E483, mask=0x0ED01F00, real_cause=0x00000400

   Possible software fault. Upon reccurence, please collect
   crashinfo, "show tech" and contact Cisco Technical Support.

 bus_err_high=0x00000000, bus_err_low=0x00000000, addr_decode_err=0x00000470

%ERR-1-FATAL: Fatal error interrupt, No reloading
 err_stat=0x1, err_enable=0xFF, mgmt_event=0x40

GT64120 External PCI Configuration registers:
 Vendor / Device ID   : 0xAB112046 (b/s 0x462011AB)
 Status / Command     : 0x4601A062 (b/s 0x62A00146)
 Class / Revision     : 0x11008005 (b/s 0x05800011)
 Latency              : 0x0F000000 (b/s 0x0000000F)
 RAS[1:0] Base        : 0x00000000 (b/s 0x00000000)
 RAS[3:2] Base        : 0x00000010 (b/s 0x10000000)
 CS[2:0] Base         : 0x00000000 (b/s 0x00000000)
 CS[3] Base           : 0x00000000 (b/s 0x00000000)
 Mem Map Base         : 0x00000014 (b/s 0x14000000)
 IO Map Base          : 0x00000000 (b/s 0x00000000)
 Subsystem Vendor / D : 0x00000000 (b/s 0x00000000)
 Int Pin / Line       : 0x00010000 (b/s 0x00000100)
 Swap RAS[1:0] Base   : 0x000000C0 (b/s 0xC0000000)
 Swap RAS[3:2] Base   : 0x000000D0 (b/s 0xD0000000)
 Swap CS[3] Base      : 0x00000000 (b/s 0x00000000)
 Vendor / Device ID   : 0xAB112046 (b/s 0x462011AB)
 Status / Command     : 0x4601A022 (b/s 0x22A00146)
 Class / Revision     : 0x11008005 (b/s 0x05800011)
 Latency              : 0x0F000000 (b/s 0x0000000F)
 RAS[1:0] Base        : 0x00000000 (b/s 0x00000000)
 RAS[3:2] Base        : 0x00000010 (b/s 0x10000000)
 CS[2:0] Base         : 0x00000000 (b/s 0x00000000)
 CS[3] Base           : 0x00000000 (b/s 0x00000000)
 Mem Map Base         : 0x00000014 (b/s 0x14000000)
 IO Map Base          : 0x00000000 (b/s 0x00000000)
 Subsystem Vendor / D : 0x00000000 (b/s 0x00000000)
 Int Pin / Line       : 0x00010000 (b/s 0x00000100)
 Swap RAS[1:0] Base   : 0x000000C0 (b/s 0xC0000000)
 Swap RAS[3:2] Base   : 0x000000D0 (b/s 0xD0000000)
 Swap CS[3] Base      : 0x00000000 (b/s 0x00000000)
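
The "b/s" column in the register dump above looks like the same 32-bit
word with its four bytes reversed (byte-swapped).  A small Python sketch
that checks this reading against a few rows copied from the dump; the
helper name is hypothetical, and reading "b/s" as "byte-swapped" is an
assumption based on the numbers.

```python
def byte_swap32(value: int) -> int:
    """Reverse the byte order of a 32-bit word."""
    return int.from_bytes(value.to_bytes(4, "big"), "little")


# (raw value, b/s value) pairs copied from the register dump.
rows = {
    "Vendor / Device ID": (0xAB112046, 0x462011AB),
    "Status / Command": (0x4601A062, 0x62A00146),
    "Mem Map Base": (0x00000014, 0x14000000),
    "Int Pin / Line": (0x00010000, 0x00000100),
}

# Collect any row where swapping the raw word does not give the b/s word.
mismatches = [
    name for name, (raw, swapped) in rows.items()
    if byte_swap32(raw) != swapped
]
print("mismatches:", mismatches)
```

For the rows listed here the swap reproduces the b/s value exactly, so
the two columns carry the same information in opposite byte order.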

System bridge dump:

Bridge 0, for PA Bay 0 (I/O Card, PCMCIA, Interfaces), Handle=0
DEC21150 bridge chip, Primary Bus 0, Secondary Bus 1,config=0x0
(0x00):dev, vendor id       = 0x00231011
(0x04):status, command      = 0x02B00147
(0x08):class code, revid    = 0x06040006
(0x0C):hdr, lat timer, cls  = 0x00012E10
(0x18):sec lat,cls & bus no = 0x18020100
(0x1C):sec status, io base  = 0x02A03101
(0x20):mem base & limit     = 0x48704000
(0x24):prefetch membase/lim = 0x0001FF01
(0x30):io base/lim upper16  = 0x00000000
(0x3C):bridge ctrl          = 0x00030000
(0x40):arb/serr, chip ctrl  = 0x02000000
(0x64):serr disable, gpio   = 0xF0000000
(0x68):sec clk ctrl,serrsta = 0x000001FF

PA bridge dump:

Bridge 4, Port Adaptor 1, Handle=1

Invalid bridge chip, vendor/id=0xFFFFFFFF

Bridge 5, Port Adaptor 2, Handle=2

Invalid bridge chip, vendor/id=0xFFFFFFFF
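
In PCI configuration space the dword at offset 0x00 holds the vendor ID
in the low 16 bits and the device ID in the high 16 bits, and a read
that no device answers normally comes back as all ones.  The Python
sketch below applies that to the two kinds of values in the bridge
dumps.  The function names are made up, and the check says nothing
about why a bridge did not answer (empty slot, unseated card, or
something else).

```python
from typing import Tuple


def split_pci_id(dword: int) -> Tuple[int, int]:
    """Split config dword 0x00 into (vendor_id, device_id)."""
    vendor = dword & 0xFFFF
    device = (dword >> 16) & 0xFFFF
    return vendor, device


def device_present(dword: int) -> bool:
    """A vendor ID of 0xFFFF is not a valid ID; treat it as 'no response'."""
    vendor, _ = split_pci_id(dword)
    return vendor != 0xFFFF


# Values copied from the dump: Bridge 0, and the Port Adaptor bridges.
for label, dword in [("Bridge 0", 0x00231011), ("Bridge 4", 0xFFFFFFFF)]:
    vendor, device = split_pci_id(dword)
    print(f"{label}: vendor=0x{vendor:04X} device=0x{device:04X} "
          f"present={device_present(dword)}")
```

Bridge 0 decodes to vendor 0x1011 and device 0x0023, while the
0xFFFFFFFF reported for the Port Adaptor bridges fails the presence
check.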

=== Flushing messages (17:30:57 UTC Fri Sep 8 2006) ===

Buffered messages:
Queued messages:

 17:30:57 UTC Fri Sep 8 2006: Interrupt exception, CPU signal 22, PC = 0x0

   Possible software fault. Upon reccurence,  please collect
   crashinfo, "show tech" and contact Cisco Technical Support.

$0 : 00000000, AT : 00000000, v0 : 00000000, v1 : 00000000
a0 : 00000000, a1 : 00000000, a2 : 00000000, a3 : 00000000
t0 : 00000000, t1 : 00000000, t2 : 00000000, t3 : 00000000
t4 : 00000000, t5 : 00000000, t6 : 00000000, t7 : 00000000
s0 : 00000000, s1 : 00000000, s2 : 00000000, s3 : 00000000
s4 : 00000000, s5 : 00000000, s6 : 00000000, s7 : 00000000
t8 : 00000000, t9 : 00000000, k0 : 00000000, k1 : 00000000
gp : 00000000, sp : 00000000, s8 : 00000000, ra : 00000000
EPC  : 00000000, ErrorEPC : 00000000, SREG     : 00000000
MDLO : 00000000, MDHI     : 00000000, BadVaddr : 00000000
CacheErr : 00000000, DErrAddr0 : 00000000, DErrAddr1 : 00000000
Cause 00000000 (Code 0x0): Interrupt exception

File bootflash:crashinfo_20060908-173058 Device Error :No such device
No warm reboot Storage 
*** System received an Error Interrupt ***
signal= 0x16, code= 0x0, context= 0x656afce0
PC = 0x6072b520, Cause = 0x20, Status Reg = 0x34008002

