[c-nsp] Sup720 hang while writing SP crashinfo?

Tue Aug 18 23:33:47 EDT 2009

> There are multiple causes of crashes and several causes of system 'hang' (high 
> CPU, memory depletion, etc) and both should be investigated independently.

Yes, the crash itself didn't seem particularly interesting, but I am pursuing
that w/ TAC. It looked like a "good and orderly" reset, which is why the
failure to complete the reboot (combined w/ an incomplete SP crashinfo and a
full sup-bootflash) was curious.
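
For anyone wanting to check for the same condition ahead of time, a rough
sketch in Python. It assumes the captured text of a "dir sup-bootflash:"
ends with a summary line of the form "N bytes total (M bytes free)". The
total size in the sample and the 1 MB threshold are made up for illustration;
only the "1 byte free" figure is from this box.

```python
import re

# Assumed shape of the summary line at the end of the directory listing.
SUMMARY_RE = re.compile(r"(\d+) bytes total \((\d+) bytes free\)")


def bootflash_space(dir_output):
    """Return (total_bytes, free_bytes), or None if no summary line is found."""
    m = SUMMARY_RE.search(dir_output)
    if m is None:
        return None
    return int(m.group(1)), int(m.group(2))


def has_room_for_crashinfo(dir_output, needed_bytes=1_000_000):
    """True if the free space is at least needed_bytes (hypothetical threshold)."""
    space = bootflash_space(dir_output)
    if space is None:
        raise ValueError("no 'bytes total (... bytes free)' line in output")
    _total, free = space
    return free >= needed_bytes


# Illustrative input: the total is invented, the single free byte is what we saw.
SAMPLE_DIR_TAIL = "65536000 bytes total (1 bytes free)"

if __name__ == "__main__":
    print(bootflash_space(SAMPLE_DIR_TAIL))
    print(has_room_for_crashinfo(SAMPLE_DIR_TAIL))
```

This only flags a full filesystem; it says nothing about whether a full
filesystem is what actually caused the hang.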

> Do you have any syslogs from a few minutes before the crash? If yes send over 
> along with RP crashinfo, whatever was captured from SP and console logs.

Only what was captured in RP crashinfo (sparing the list the rest of the spam,
but symptoms were consistent w/ very high RP CPU, starting w/ HSRP state flaps
and dropped OSPF adjacencies). The last gasps were:

094893: Aug 18 10:53:19.694 PDT: icc_send_request_internal: ipc_send_rpc_blocked failed, result 6 : ios-base : (PID=16407, TID=21) : -Traceback=(s72033_rp-ipservicesk9-6-dso-b.so+0x164B40) ([33:0]+0x164DAC) ([33:0]+0x165320) ([23:-9]3+0x316100) ([33:0]+0x306158) ([23:-9]1+0x2B81A8) ([33:0]+0x2FBFF8) ([23:-9]6+0x4E3BC4) ([33:0]+0x4E3B9C)
094894: Aug 18 10:53:25.910 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 120 seconds [6/0]
094895: Aug 18 10:53:55.990 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 150 seconds [6/0]
094896: Aug 18 10:54:26.049 PDT: %CPU_MONITOR-3-TIMED_OUT: CPU_MONITOR messages have failed, resetting system [6/0]
Crashdump : 17:54:26.944  Tue Aug 18 2009 : ios-base : (PID=16407, TID=1) : -Traceback=(s72033_rp-ipservicesk9-9-dso-b.so+0x2E46C8) ([33:0]+0x3577B4) ([33:0]+0x359CF8) ([23:-9]6+0x4E3BC4) ([33:0]+0x4E3B9C)
crashdump called (with pause = 0 sec)

%ALIGN-1-FATAL: Illegal access to a low address 10:54:26 PDT Tue Aug 18 2009
 addr=0x0, pc=0x74C7D940, ra=0x74C7D86C, sp=0x389EBC8
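
To make the watchdog timeline easier to read, here is a minimal Python sketch
that pulls the CPU_MONITOR messages out of log text like the above and
measures the gap between the first warning and the reset. The sample lines are
the three messages from this crashinfo with the terminal line wraps joined.
The regex, function names and dict keys are my own, and the year is passed in
separately since the syslog timestamps don't carry one.

```python
import re
from datetime import datetime

# Sequence-numbered syslog line: "NNNNNN: Mon DD HH:MM:SS.mmm TZ: %FAC-SEV-MNEMONIC: text"
SYSLOG_RE = re.compile(
    r"^(?P<seq>\d+): (?P<ts>\w{3} +\d+ \d\d:\d\d:\d\d\.\d+) (?P<tz>\w+): "
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): "
    r"(?P<text>.*)$"
)
SILENT_RE = re.compile(r"not been heard for (\d+) seconds")


def parse_line(line, year=2009):
    """Return a dict for one syslog line, or None if it doesn't match."""
    m = SYSLOG_RE.match(line.strip())
    if m is None:
        return None
    when = datetime.strptime(f"{year} {m.group('ts')}", "%Y %b %d %H:%M:%S.%f")
    return {
        "seq": int(m.group("seq")),
        "time": when,
        "facility": m.group("facility"),
        "severity": int(m.group("severity")),
        "mnemonic": m.group("mnemonic"),
        "text": m.group("text"),
    }


def cpu_monitor_events(lines, year=2009):
    """Keep only CPU_MONITOR messages, adding the reported silence in seconds."""
    events = []
    for line in lines:
        rec = parse_line(line, year)
        if rec is None or rec["facility"] != "CPU_MONITOR":
            continue
        m = SILENT_RE.search(rec["text"])
        rec["silent_s"] = int(m.group(1)) if m else None
        events.append(rec)
    return events


SAMPLE = [
    "094894: Aug 18 10:53:25.910 PDT: %CPU_MONITOR-6-NOT_HEARD: "
    "CPU_MONITOR messages have not been heard for 120 seconds [6/0]",
    "094895: Aug 18 10:53:55.990 PDT: %CPU_MONITOR-6-NOT_HEARD: "
    "CPU_MONITOR messages have not been heard for 150 seconds [6/0]",
    "094896: Aug 18 10:54:26.049 PDT: %CPU_MONITOR-3-TIMED_OUT: "
    "CPU_MONITOR messages have failed, resetting system [6/0]",
]

events = cpu_monitor_events(SAMPLE)
elapsed = (events[-1]["time"] - events[0]["time"]).total_seconds()

if __name__ == "__main__":
    for e in events:
        print(e["time"].time(), e["mnemonic"], e["silent_s"])
    print(f"first warning to reset: {elapsed:.3f} s")
```

On these three lines the reset lands roughly 60 seconds after the first
NOT_HEARD, i.e. about 30 seconds after the "150 seconds" message.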

> On Aug 18, 2009, at 11:04 PM, Kevin Graham wrote:
> 
> > We had a Sup720B (non-redundant, running modular SXI) crash, apparently due
> > to a CPU_MONITOR watchdog event. What was nasty though was that rather than
> > reload, it hung (dead and unresponsive console) and required a power cycle.
> > 
> > The RP crashinfo made it out fine, however SP crashinfo was incomplete.
> > Looking at that, it's due to sup-bootflash running out of space (1 byte
> > left w/ an incomplete/inaccessible crashinfo).
> > 
> > Unfounded speculation is that the "hung" state was due to system pounding away
> > trying to finish writing crashinfo to a full filesystem.
> > 
> > Is that hypothesis at all reasonable, or is there something else that
> > should be explored?