[c-nsp] Sup720 hang while writing SP crashinfo?

Thu Aug 20 14:38:32 EDT 2009

One thing I notice here is that you are running the modular software.

Be sure you have it set up to write core dumps properly to the disk0/disk1
devices.

This will make it much easier to isolate and resolve these bugs quickly
when working with TAC/DEs.

	- Jared

On Aug 19, 2009, at 10:39 PM, e ninja wrote:

> Kevin,
>
> Looks like the RP reset the system because the SP failed to respond to
> RP<->SP CPU availability heartbeat keepalives (aka CPU MONITOR). The TAC
> engineer should not bother decoding the RP tracebacks, as these would most
> likely be generic functions. The root cause lies in the SP and in
> understanding why it failed, or failed to respond to RP heartbeat keepalives.
>
> Some possible causes:
>
>   - SP crashed because of a software bug. Make room for future crashinfo
>   files since the trigger still looms.
>   - SP heartbeat response got stuck behind other EOBC management activity
>   during a traffic spike (e.g. CSCsm21728).
>
> It is always a good idea to set up syslog so that all events can be captured
> for future troubleshooting.
>
> -Eninja
>
>
> On Tue, Aug 18, 2009 at 8:33 PM, Kevin Graham
> <kgraham at industrial-marshmallow.com> wrote:
>
>>> There are multiple causes of crashes and several causes of system 'hang'
>>> (high CPU, memory depletion, etc) and both should be investigated
>>> independently.
>>
>> Yes, crash itself didn't seem particularly interesting, but am pursuing that
>> w/ TAC. It looked like it was a "good and orderly" reset, which is why the
>> failure to complete the reboot (combined w/ incomplete SP crashinfo and full
>> sup-bootflash) was curious.
>>
>>> Do you have any syslogs from a few minutes before the crash? If yes send
>>> over along with RP crashinfo, whatever was captured from SP and console
>>> logs.
>>
>> Only what was captured in RP crashinfo (sparing the list the rest of the
>> spam), but symptoms were consistent w/ very high RP CPU, starting w/ HSRP
>> state flaps and drop of OSPF adjacencies. The last gasps were:
>>
>> 094893: Aug 18 10:53:19.694 PDT: icc_send_request_internal: ipc_send_rpc_blocked failed, result 6 : ios-base : (PID=16407, TID=21) : -Traceback=(s72033_rp-ipservicesk9-6-dso-b.so+0x164B40) ([33:0]+0x164DAC) ([33:0]+0x165320) ([23:-9]3+0x316100) ([33:0]+0x306158) ([23:-9]1+0x2B81A8) ([33:0]+0x2FBFF8) ([23:-9]6+0x4E3BC4) ([33:0]+0x4E3B9C)
>> 094894: Aug 18 10:53:25.910 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 120 seconds [6/0]
>> 094895: Aug 18 10:53:55.990 PDT: %CPU_MONITOR-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 150 seconds [6/0]
>> 094896: Aug 18 10:54:26.049 PDT: %CPU_MONITOR-3-TIMED_OUT: CPU_MONITOR messages have failed, resetting system [6/0]
>> Crashdump : 17:54:26.944  Tue Aug 18 2009 : ios-base : (PID=16407, TID=1) : -Traceback=(s72033_rp-ipservicesk9-9-dso-b.so+0x2E46C8) ([33:0]+0x3577B4) ([33:0]+0x359CF8) ([23:-9]6+0x4E3BC4) ([33:0]+0x4E3B9C)
>> crashdump called (with pause = 0 sec)
>>
>> %ALIGN-1-FATAL: Illegal access to a low address 10:54:26 PDT Tue Aug 18 2009
>> addr=0x0, pc=0x74C7D940, ra=0x74C7D86C, sp=0x389EBC8
>>
>>
>>> On Aug 18, 2009, at 11:04 PM, Kevin Graham wrote:
>>>
>>>> We had a Sup720B (non-redundant, running modular SXI) crash due to what
>>>> looks like a CPU_MONITOR watchdog event. What was nasty though was that
>>>> rather than reload, it hung (dead and unresponsive console) and required a
>>>> power cycle.
>>>>
>>>> The RP crashinfo made it out fine, however SP crashinfo was incomplete.
>>>> Looking at that, it's due to sup-bootflash running out of space (1 byte
>>>> left w/ an incomplete/inaccessible crashinfo).
>>>>
>>>> Unfounded speculation is that the "hung" state was due to system pounding
>>>> away trying to finish writing crashinfo to a full filesystem.
>>>>
>>>> Is that hypothesis at all reasonable, or is there something else that
>>>> should be explored?
>>>>
>>>> _______________________________________________
>>>> cisco-nsp mailing list  cisco-nsp at puck.nether.net
>>>> https://puck.nether.net/mailman/listinfo/cisco-nsp
>>>> archive at http://puck.nether.net/pipermail/cisco-nsp/
>>
>>
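
A note on the CPU_MONITOR lines quoted above: each NOT_HEARD message reports
how long the heartbeat has been silent, so subtracting that interval from the
message timestamp gives a rough estimate of when the SP stopped answering.
What follows is a minimal Python sketch of that arithmetic, not anything
posted in the thread. It assumes each log message has been rejoined onto a
single line, that the year is 2009 (taken from the Crashdump line, since the
syslog stamps omit it), and that month names parse in an English locale. The
function name last_heard_times is made up for illustration.

```python
import re
from datetime import datetime, timedelta

# Matches e.g. "Aug 18 10:53:25.910 PDT: %CPU_MONITOR-6-NOT_HEARD: ... for 120 seconds"
NOT_HEARD = re.compile(
    r"(?P<stamp>[A-Z][a-z]{2} +\d+ \d{2}:\d{2}:\d{2}\.\d+) \w+: "
    r"%CPU_MONITOR-6-NOT_HEARD: .*? for (?P<secs>\d+) seconds"
)


def last_heard_times(log_lines, year=2009):
    """Estimate when heartbeats were last heard, one value per NOT_HEARD line.

    The year is an assumption passed in by the caller, because the syslog
    timestamps in the thread do not carry one.
    """
    estimates = []
    for line in log_lines:
        match = NOT_HEARD.search(line)
        if match is None:
            continue  # not a NOT_HEARD message
        logged_at = datetime.strptime(
            f"{year} {match.group('stamp')}", "%Y %b %d %H:%M:%S.%f"
        )
        silent_for = timedelta(seconds=int(match.group("secs")))
        estimates.append(logged_at - silent_for)
    return estimates


# The two NOT_HEARD lines and the TIMED_OUT line from the thread, unwrapped.
SAMPLE = [
    "094894: Aug 18 10:53:25.910 PDT: %CPU_MONITOR-6-NOT_HEARD: "
    "CPU_MONITOR messages have not been heard for 120 seconds [6/0]",
    "094895: Aug 18 10:53:55.990 PDT: %CPU_MONITOR-6-NOT_HEARD: "
    "CPU_MONITOR messages have not been heard for 150 seconds [6/0]",
    "094896: Aug 18 10:54:26.049 PDT: %CPU_MONITOR-3-TIMED_OUT: "
    "CPU_MONITOR messages have failed, resetting system [6/0]",
]

if __name__ == "__main__":
    for estimate in last_heard_times(SAMPLE):
        print(estimate.time())
```

For the two NOT_HEARD lines in this thread, both estimates land at roughly
10:51:26 PDT, a little under two minutes before the
icc_send_request_internal failure logged at 10:53:19.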