[c-nsp] Sup720 hang while writing SP crashinfo?

Wed Aug 19 22:39:45 EDT 2009

Kevin,

Looks like the RP reset the system because the SP failed to respond to the
RP<->SP CPU availability heartbeat keepalives (aka CPU_MONITOR). The TAC
engineer should not bother decoding the RP tracebacks, as these will most
likely be generic functions. The root cause lies in the SP: the question is
why it failed, or why it failed to respond to the RP heartbeat keepalives.

Some possible causes:

   - SP crashed because of a software bug. Make room for future crashinfo
   files, since the trigger is likely still present.
   - SP heartbeat response got stuck behind other EOBC management activity
   during a traffic spike (e.g. CSCsm21728, etc.).

It is always a good idea to set up syslog so that all events can be captured
for future troubleshooting.
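
Once the syslogs are being collected, the CPU_MONITOR messages can be pulled
out to see how long the SP was silent before the reset. A rough Python sketch
follows; it assumes each syslog message sits on a single line in the format
quoted below, and the function name, the SAMPLE list and the year argument
are just for illustration (IOS timestamps in this format carry no year):

```python
import re
from datetime import datetime

# Matches IOS syslog lines such as:
#   094894: Aug 18 10:53:25.910 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR ...
CPU_MONITOR_RE = re.compile(
    r"(?P<ts>[A-Z][a-z]{2} +\d{1,2} \d{2}:\d{2}:\d{2}\.\d{3}) \w+: "
    r"%CPU_MONITOR-(?P<sev>\d)-(?P<mnemonic>[A-Z_]+): "
    r"(?P<text>.*)"
)
SECONDS_RE = re.compile(r"for (\d+) seconds")


def cpu_monitor_events(lines, year=2009):
    """Return one dict per CPU_MONITOR message found in `lines`."""
    events = []
    for line in lines:
        m = CPU_MONITOR_RE.search(line)
        if not m:
            continue  # not a CPU_MONITOR message
        # The year is an assumption supplied by the caller.
        when = datetime.strptime(f"{year} {m['ts']}", "%Y %b %d %H:%M:%S.%f")
        silent = SECONDS_RE.search(m["text"])
        events.append({
            "time": when,
            "severity": int(m["sev"]),
            "mnemonic": m["mnemonic"],
            "silent_seconds": int(silent.group(1)) if silent else None,
        })
    return events


# Messages taken from the quoted crashinfo, rejoined onto single lines.
SAMPLE = [
    "094893: Aug 18 10:53:19.694 PDT: icc_send_request_internal: "
    "ipc_send_rpc_blocked failed, result 6",
    "094894: Aug 18 10:53:25.910 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR "
    "messages have not been heard for 120 seconds [6/0]",
    "094895: Aug 18 10:53:55.990 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR "
    "messages have not been heard for 150 seconds [6/0]",
    "094896: Aug 18 10:54:26.049 PDT: %CPU_MONITOR-3-TIMED_OUT: CPU_MONITOR "
    "messages have failed, resetting system [6/0]",
]

events = cpu_monitor_events(SAMPLE)
for e in events:
    print(e["time"].time(), e["mnemonic"], e["silent_seconds"])
```

Counting back from the first NOT_HEARD message by its "seconds" value gives an
approximate time at which the SP stopped answering, which is the window worth
checking in the other logs.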

-Eninja

On Tue, Aug 18, 2009 at 8:33 PM, Kevin Graham <
kgraham at industrial-marshmallow.com> wrote:

> > There are multiple causes of crashes and several causes of system 'hang'
> > (high CPU, memory depletion, etc) and both should be investigated
> > independently.
>
> Yes, crash itself didn't seem particularly interesting, but am pursuing that
> w/ TAC. It looked like it was a "good and orderly" reset, which is why the
> failure to complete the reboot (combined w/ incomplete SP crashinfo and full
> sup-bootflash) were curious.
>
> > Do you have any syslogs from a few minutes before the crash? If yes send
> > over along with RP crashinfo, whatever was captured from SP and console
> > logs.
>
> Only what was captured in RP crashinfo (sparing the list the rest of the
> spam, but symptoms were consistent w/ very high RP cpu. Starting w/ HSRP
> state flaps, drop of OSPF adjacencies). The last gasps were:
>
> 094893: Aug 18 10:53:19.694 PDT: icc_send_request_internal: ipc_send_rpc_blocked failed, result 6 : ios-base : (PID=16407, TID=21) : -Traceback=(s72033_rp-ipservicesk9-6-dso-b.so+0x164B40) ([33:0]+0x164DAC) ([33:0]+0x165320) ([23:-9]3+0x316100) ([33:0]+0x306158) ([23:-9]1+0x2B81A8) ([33:0]+0x2FBFF8) ([23:-9]6+0x4E3BC4) ([33:0]+0x4E3B9C)
> 094894: Aug 18 10:53:25.910 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 120 seconds [6/0]
> 094895: Aug 18 10:53:55.990 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 150 seconds [6/0]
> 094896: Aug 18 10:54:26.049 PDT: %CPU_MONITOR-3-TIMED_OUT: CPU_MONITOR messages have failed, resetting system [6/0]
> Crashdump : 17:54:26.944  Tue Aug 18 2009 : ios-base : (PID=16407, TID=1) : -Traceback=(s72033_rp-ipservicesk9-9-dso-b.so+0x2E46C8) ([33:0]+0x3577B4) ([33:0]+0x359CF8) ([23:-9]6+0x4E3BC4) ([33:0]+0x4E3B9C)
> crashdump called (with pause = 0 sec)
>
> %ALIGN-1-FATAL: Illegal access to a low address 10:54:26 PDT Tue Aug 18 2009
>  addr=0x0, pc=0x74C7D940, ra=0x74C7D86C, sp=0x389EBC8
>
>
> > On Aug 18, 2009, at 11:04 PM, Kevin Graham wrote:
> >
> > > We had a Sup720B (non-redundant, running modular SXI) crash, apparently
> > > due to a CPU_MONITOR watchdog event. What was nasty though was that
> > > rather than reload, it hung (dead and unresponsive console) and required
> > > a power cycle.
> > >
> > > The RP crashinfo made it out fine, however SP crashinfo was incomplete.
> > > Looking at that, it's due to sup-bootflash running out of space (1 byte
> > > left w/ an incomplete/inaccessible crashinfo).
> > >
> > > Unfounded speculation is that the "hung" state was due to the system
> > > pounding away trying to finish writing crashinfo to a full filesystem.
> > >
> > > Is that hypothesis at all reasonable, or is there something else that
> > > should be explored?
> > >
> > > _______________________________________________
> > > cisco-nsp mailing list  cisco-nsp at puck.nether.net
> > > https://puck.nether.net/mailman/listinfo/cisco-nsp
> > > archive at http://puck.nether.net/pipermail/cisco-nsp/