[c-nsp] RES: RES: activ/standby cpu card status changed

Fri Feb 29 15:43:13 EST 2008

Actually, this bug was already fixed in SXF2...

________________________________

From: e ninja [mailto:eninja at gmail.com]
Sent: Fri 29/2/2008 17:29
To: Nemeth Laszlo
Cc: Leonardo Gama Souza; cisco-nsp at puck.nether.net
Subject: Re: [c-nsp] RES: activ/standby cpu card status changed

Nemeth,

Your SUP crashed because TestSPRPInbandPing failed more than 10 consecutive times. See the fix/workaround for CSCsc33990 below.

/eninja

CSCsc33990 

Symptoms: A supervisor engine may unexpectedly reset when TestSPRPInbandPing, which is part of the Cisco Generic Online Diagnostics (GOLD), fails 10 consecutive times.

The following syslog error messages are typically generated right before the supervisor engine resets, and can also be found in the crashinfo files: 

%CONST_DIAG-SP-3-HM_TEST_FAIL: Module <slot#> TestSPRPInbandPing consecutive failure count:5
%CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=10% RP=0% Traffic=0% netint_thr_active[0], Tx_Rate[4412], Rx_Rate[0]
%CONST_DIAG-SP-3-HM_TEST_FAIL: Module <slot#> TestSPRPInbandPing consecutive failure count:10
%CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=10% RP=0% Traffic=0% netint_thr_active[0], Tx_Rate[4652], Rx_Rate[0]
%CONST_DIAG-SP-2-HM_SUP_CRSH: Supervisor crashed due to unrecoverable errors, Reason: Failed TestSPRPInbandPing 
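
To spot this pattern in a saved syslog before the supervisor resets, something like the following could work. It is only a rough sketch: the function names and the threshold constant are made up for illustration, the threshold of 10 is taken from the caveat text, and it assumes each syslog message sits on a single unwrapped line.

```python
import re

# Matches the HM_TEST_FAIL messages shown above, including the
# -STDBY variant logged by a standby supervisor.
FAIL_RE = re.compile(
    r"%CONST_DIAG-SP(?:-STDBY)?-3-HM_TEST_FAIL: "
    r"Module (?P<module>\S+) (?P<test>\S+) "
    r"consecutive failure count:\s*(?P<count>\d+)"
)

# Assumption: 10 consecutive failures, as stated in the caveat text.
CRASH_THRESHOLD = 10


def failure_counts(log_text):
    """Return a (module, test, count) tuple for each HM_TEST_FAIL message."""
    return [
        (m.group("module"), m.group("test"), int(m.group("count")))
        for m in FAIL_RE.finditer(log_text)
    ]


def at_risk(log_text, threshold=CRASH_THRESHOLD):
    """Return True if any test reached the consecutive-failure threshold."""
    return any(count >= threshold for _, _, count in failure_counts(log_text))
```

Feeding it the text of a log file and checking at_risk() should flag the runs where the failure count has reached the reset threshold.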

Conditions: This symptom is observed on a Cisco Catalyst 6500 series switch and Cisco 7600 series router that run an integrated Cisco IOS software image. The trigger for the symptom may be possible corruption in TCAM entries that are used to perform the TestSPRPInbandPing. 

Workaround: Enter the no diagnostic crash global configuration command to disable exceptions that are being triggered by failed diagnostic monitoring. However, you should do this with discretion because it may also prevent the system from taking proactive measures to mitigate problems that could impact user traffic.

Further Information: The fix for this caveat is more of an enhancement because it only prevents the system from being over-aggressive in taking exceptions when the TestSPRPInbandPing fails under specific conditions. Therefore, the fix for this caveat does not address all triggers that may cause the TestSPRPInbandPing to fail. Please consult Cisco TAC for further assistance if you experience the same problem after upgrading to a Cisco IOS software image that contains the fix for this caveat. 

On Fri, Feb 29, 2008 at 1:24 AM, Nemeth Laszlo <csirek at externet.hu> wrote:

	Hi!

	I put the crash files here:

	ftp://195.70.33.12/crashinfo_20080228-151329_cpu1
	ftp://195.70.33.12/crashinfo_20080228-151329_cpu2

	If anybody knows what the problem was, please don't keep quiet about it :)

	Could it be an IOS problem?

	Thanks
	Laci

	Leonardo Gama Souza wrote:

	> Hi.
	>
	> It sounds like your MSFC crashed.
	> You ought to look into the crashinfo file in order to figure out why.
	>
	> cheers,
	> Leonardo Gama.
	>
	> ------------------------------------------------------------------------
	> *From:* cisco-nsp-bounces at puck.nether.net on behalf of Nemeth Laszlo
	> *Sent:* Thu 28/2/2008 13:43
	> *To:* cisco-nsp at puck.nether.net
	> *Subject:* [c-nsp] activ/standby cpu card status changed
	>
	> Hi!
	>
	> My 7604 router has two WS-SUP32-10GE-3B CPU cards in RPR-PLUS mode.
	>
	> System image file is "sup-bootdisk:s3223-ipservices_wan-mz.122-18.SXF9.bin"
	>
	> I got the syslog messages below, and afterwards the CPU cards swapped
	> roles: the standby became active and the active became standby. The CPU
	> ran at 100% for 15 minutes. I also saw an L2 loop in the network, but I
	> don't know whether the L2 loop was caused by the CPU change or the CPU
	> change was caused by the L2 loop. I use RSTP. This router and two others
	> are members of a small 10G ring.
	>
	> I can't find these error messages on cisco.com <http://cisco.com/>.
	>
	> We had a similar problem on 1 January 2008, when a CPU state change also
	> happened (the CPU was at 100% like now; at other times it sits at 0-2%).
	>
	> Any idea?
	>
	> Thanks
	> Laci
	>
	> core2#sh redundancy history  | inc state
	> Feb 28 16:13:33 *my state = ACTIVE(13) *peer state = DISABLED(1)
	> Feb 28 16:17:12 *my state = ACTIVE(13) *peer state = UNKNOWN(0)
	> Feb 28 16:17:21 *my state = ACTIVE(13) *peer state = STANDBY COLD(4)
	> Feb 28 16:18:09 *my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)
	> Feb 28 16:18:19 *my state = ACTIVE(13) *peer state = STANDBY HOT(8)
	>
	> core2#sh redundancy switchover
	> Switchovers this system has experienced          : 1
	> Last switchover reason                           : Active crashed.
	> Uptime since this supervisor switched to active  : 8 weeks, 1 day, 18 hours, 50 minutes
	> Total system uptime from reload                  : 28 weeks, 1 day, 1 hour, 29 minutes
	>
	> core2#sh redundancy switchover history
	> Index  Previous  Current  Switchover             Switchover
	>         active    active   reason                 time
	> -----  --------  -------  ----------             ----------
	>     1       1        2     active unit failed     22:44:19 MET Tue Jan 1 2008
	>
	>
	>
	> *Feb 28 16:11:12 MET: %CONST_DIAG-SP-STDBY-3-HM_TEST_FAIL: Module 1 TestSPRPInbandPing consecutive failure count:7
	> *Feb 28 16:11:12 MET: %CONST_DIAG-SP-STDBY-6-HM_TEST_INFO: CPU util(5sec): SP=7% RP=0% Traffic=0% netint_thr_active[0], Tx_Rate[70], Rx_Rate[4946], dev=1[IPv4, fail=7]
	> *Feb 28 16:13:12 MET: %CONST_DIAG-SP-STDBY-3-HM_TEST_FAIL: Module 1 TestSPRPInbandPing consecutive failure count:14
	> *Feb 28 16:13:12 MET: %CONST_DIAG-SP-STDBY-6-HM_TEST_INFO: CPU util(5sec): SP=2% RP=0% Traffic=0% netint_thr_active[0], Tx_Rate[70], Rx_Rate[8290], dev=1[IPv4, fail=14]
	> Feb 28 16:13:33 MET: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1, changed state to down
	> Feb 28 16:13:33 MET: %BGP-5-ADJCHANGE: neighbor xx.xxx.xxx.xxx Down Interface flap
	> Feb 28 16:13:33 MET: %PFREDUN-SP-6-ACTIVE: Standby processor removed or reloaded, changing to Simplex mode
	> Feb 28 16:13:33 MET: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet1/1, changed state to down
	> Feb 28 16:13:33 MET: %LINEPROTO-SP-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1, changed state to down
	> Feb 28 16:17:11 MET: %PFREDUN-SP-6-ACTIVE: Standby initializing for RPR-PLUS mode
	> Feb 28 16:17:11 MET: %SYS-SP-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
	>
	> -
	> _______________________________________________
	> cisco-nsp mailing list  cisco-nsp at puck.nether.net
	> https://puck.nether.net/mailman/listinfo/cisco-nsp
	> archive at http://puck.nether.net/pipermail/cisco-nsp/
	>
