[c-nsp] Cisco 6500 experiencing %CPU_MONITOR-SP-6-NOT_HEARD

Youssef Bengelloun-Zahr youssef at 720.fr
Wed Jun 16 07:48:23 EDT 2010


Hello List,

Just for the record, I am posting this in case someone out there runs into
the same problem some day.

Last Friday, one of my core routers, a Cisco 6509 with two SUP720-3BXL
modules running s72033-advipservicesk9_wan-mz.122-33.SXH2a, crashed and
restarted out of the blue.

The crashinfo file says the following:

*Jun 11 06:43:06.310: %CPU_MONITOR-SP-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 30 seconds [5/1]
Jun 11 06:43:36.310: %CPU_MONITOR-SP-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 60 seconds [5/1]
Jun 11 06:44:06.310: %CPU_MONITOR-SP-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 90 seconds [5/1]
*Jun 11 06:44:25.366: SP: icc_send_request_internal: ipc_send_rpc_blocked failed, result 6
Jun 11 06:44:25.366: SP: -Traceback= 40BC1538 40BC16F8 40BC19E0 40B11AE4 40B120D0 40752F58 40752F44
*Jun 11 06:44:36.310: %CPU_MONITOR-SP-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 120 seconds [5/1]
*Jun 11 06:44:51.366: SP: IPC: Message 43EDD2BC timed out waiting for Ack
Jun 11 06:44:51.366: SP: IPC:  MSG: ptr: 0x43EDD2BC, flags: 0x20101, retries: 21, seq: 0x2155C10, refcount: 2, retry: 00:00:00, rpc_result = 0x0, data_buffer = 0x503FAF5C, header = 0x8C7A7C8, data = 0x8C7A7E8  || HDR: src: 0x10000, dst: 0x2150010, index: 0, seq: 23568, sz: 80, type: 1, flags: 0x404 hi: 0x6F4F386, lo: 0x8C7A7E8  || DATA: 00 00 00 05 00 00 00 00 00 00 1B 59 00 00 00 01 00 00 00 07
Jun 11 06:44:51.366: SP: IPC: Send failed: IPC msg timeout MSG: ptr: 0x43EDD2BC, flags: 0x20101, retries: 21, seq: 0x2155C10, refcount: 2, retry: 00:00:00, rpc_result = 0x0, data_buffer = 0x503FAF5C, header = 0x8C7A7C8, data = 0x8C7A7E8  || HDR: src: 0x10000, dst: 0x2150010, index: 0, seq: 23568, sz: 80, type: 1, flags: 0x404 hi: 0x6F4F386, lo: 0x8C7A7E8  || DATA: 00 00 00 05 00 00 00 00 00 00 1B 59 00 00 00 01 00 00 00 07
Jun 11 06:44:51.366: SP: -Traceback= 403E6CB0 403EB96C 403EC00C 40405988 40752F58 40752F44
Jun 11 06:44:51.366: %C6K_PROCMIB-SP-3-IPC_TRANSMIT_FAIL: Failed to send process statistics update : error code = timeout
-Traceback= 409A39A4 409A39F4 409A3C00 409A3E60 40752F58 40752F44
*Jun 11 06:45:06.310: %CPU_MONITOR-SP-6-NOT_HEARD: CPU_MONITOR messages have not been heard for 150 seconds [5/1]
Jun 11 06:45:36.310: %CPU_MONITOR-SP-3-TIMED_OUT: CPU_MONITOR messages have failed, resetting system [5/1]
%Software-forced reload

06:45:36 UTC Fri Jun 11 2010: Breakpoint exception, CPU signal 23, PC = 0x41183348
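
For anyone who wants to pull the CPU_MONITOR timeline out of a crashinfo like
the one above, here is a minimal Python sketch. It is not a Cisco tool; the
function names (unwrap, cpu_monitor_events) and the regular expressions are my
own, written only against the lines shown in this post. The log stamps carry
no year, so the year is passed in as an assumption (2010 here, taken from the
Breakpoint exception line). Month names are parsed with %b, which assumes an
English/C locale.

```python
import re
from datetime import datetime

# A few records copied from the crash log, still wrapped as in the mail.
SAMPLE = """\
*Jun 11 06:43:06.310: %CPU_MONITOR-SP-6-NOT_HEARD: CPU_MONITOR messages have
not been heard for 30 seconds [5/1]
Jun 11 06:43:36.310: %CPU_MONITOR-SP-6-NOT_HEARD: CPU_MONITOR messages have
not been heard for 60 seconds [5/1]
Jun 11 06:45:36.310: %CPU_MONITOR-SP-3-TIMED_OUT: CPU_MONITOR messages have
failed, resetting system [5/1]
"""

# A record starts with an optional '*' and a "Mon DD HH:MM:SS.mmm:" stamp.
STAMP = re.compile(r"^\*?(\w{3} +\d+ \d\d:\d\d:\d\d\.\d{3}): (.*)$")
EVENT = re.compile(r"%CPU_MONITOR-SP-\d-(NOT_HEARD|TIMED_OUT):")
SECONDS = re.compile(r"heard for (\d+) seconds")


def unwrap(text):
    """Join mail-wrapped continuation lines back onto their record."""
    records = []
    for line in text.splitlines():
        if STAMP.match(line) or not records:
            records.append(line.strip())
        else:
            records[-1] += " " + line.strip()
    return records


def cpu_monitor_events(text, year=2010):
    """Return (timestamp, kind, reported_seconds_or_None) tuples."""
    events = []
    for record in unwrap(text):
        stamp = STAMP.match(record)
        kind = EVENT.search(record)
        if not (stamp and kind):
            continue  # not a CPU_MONITOR record
        when = datetime.strptime(
            f"{year} {stamp.group(1)}", "%Y %b %d %H:%M:%S.%f"
        )
        secs = SECONDS.search(record)
        events.append(
            (when, kind.group(1), int(secs.group(1)) if secs else None)
        )
    return events


events = cpu_monitor_events(SAMPLE)
# Time between the first NOT_HEARD message and the TIMED_OUT reset.
elapsed = (events[-1][0] - events[0][0]).total_seconds()
print(events[0][2], events[-1][1], elapsed)
```

On this excerpt the first NOT_HEARD and the TIMED_OUT reset are 150 seconds
apart. Since the first message already reports 30 seconds of silence, the SP
in this log gave up roughly 180 seconds after it last heard from the RP.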



For some reason, the RP and SP were not able to communicate over the EOBC. I
have been googling around, and it looks like folks out there (on c-nsp too)
have already seen this on Cisco 6500 and 7600.
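
The IPC record in the crash log is what shows the SP retrying before it gave
up. As a rough sketch, the record can be split into its MSG, HDR and DATA
parts like this. The helper name parse_ipc_record and the regular expression
are hypothetical, written only against the single line in this post; the
field names are copied verbatim from the log and I am not assuming anything
about what they mean.

```python
import re

# The IPC record from the crash log, unwrapped onto a single line.
RECORD = (
    "Jun 11 06:44:51.366: SP: IPC:  MSG: ptr: 0x43EDD2BC, flags: 0x20101, "
    "retries: 21, seq: 0x2155C10, refcount: 2, retry: 00:00:00, "
    "rpc_result = 0x0, data_buffer = 0x503FAF5C, header = 0x8C7A7C8, "
    "data = 0x8C7A7E8  || HDR: src: 0x10000, dst: 0x2150010, index: 0, "
    "seq: 23568, sz: 80, type: 1, flags: 0x404 hi: 0x6F4F386, "
    "lo: 0x8C7A7E8  || DATA: 00 00 00 05 00 00 00 00 00 00 1B 59 "
    "00 00 00 01 00 00 00 07"
)

# "name: value" or "name = value"; values are hex, decimal or hh:mm:ss.
FIELD = re.compile(r"(\w+)\s*[:=]\s*([0-9A-Fa-fx:]+)")


def parse_ipc_record(record):
    """Split one IPC record into MSG fields, HDR fields and DATA bytes."""
    msg, hdr, data = record.split("||")
    msg = msg.split("MSG:", 1)[1]    # drop the timestamp and "SP: IPC:"
    hdr = hdr.split("HDR:", 1)[1]
    data = data.split("DATA:", 1)[1]
    return {
        "msg": dict(FIELD.findall(msg)),
        "hdr": dict(FIELD.findall(hdr)),
        "data": bytes.fromhex(data.strip()),
    }


parsed = parse_ipc_record(RECORD)
print(parsed["msg"]["ptr"], parsed["msg"]["retries"], parsed["hdr"]["dst"])
```

Values are kept as strings on purpose, so nothing is lost or reinterpreted.
The ptr value matches the "Message 43EDD2BC timed out waiting for Ack" line,
so both records refer to the same message.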

In this particular case, Cisco says:

CPU_MONITOR-3-TIMED_OUT or CPU_MONITOR-6-NOT_HEARD

Problem

The switch reports these error messages:

 CPU_MONITOR-3-TIMED_OUT: CPU monitor messages have failed, resetting system
 CPU_MONITOR-6-NOT_HEARD: CPU monitor messages have not been heard for [dec] seconds

Description

These messages indicate that CPU monitor messages have not been heard for a
significant amount of time. A time-out most probably occurs, which resets
the system. [dec] is the number of seconds.

The problem possibly occurs because of these reasons:

   - Badly seated line card or module                 <=== Not likely
   - Bad ASIC or bad backplane                        <=== Not likely
   - Software bugs                                    <=== Probably
   - Parity error                                     <=== Don't know
   - High traffic in the Ethernet out of band channel (EOBC)
     <=== According to the IPC stats, nothing fancy

   The EOBC channel is a half duplex channel that services many other
   functions, which includes Simple Network Management Protocol (SNMP) traffic
   and packets that are destined to the switch. If the EOBC channel is full of
   messages because of a storm of SNMP traffic, then the channel is subjected
   to collisions. When this happens, EOBC is possibly not able to carry IPC
   messages. This makes the switch display the error message.

Workaround

Reseat the line card or module. If a maintenance window can be scheduled,
reset the switch in order to clear any transient issues.


Personally, I'd say I hit a bug here, but I can't seem to find it using the
Cisco web tools. Could anyone point me in the right direction?

Thank you all.

Best regards.

Y.

-- 
Youssef BENGELLOUN-ZAHR ………………………………………………
Network and Telecom Engineer


Technopole de l'Aube  en Champagne - BP 601 - 10901 TROYES  Cedex 9
Agence Paris : 6, rue Charles Floquet - 92120 MONTROUGE
Tel                 +33 (0) 825 000 720
Tel. direct      +33 (0) 1 77 35 59 14
Tel. portable  +33 (0) 6 22 42 63 80
Email            ybz at 720.fr
……………………………………………………………………………….....www.720.fr

