[c-nsp] RES: activ/standby cpu card status changed

Mon Mar 3 05:08:10 EST 2008

Hello!

I found this bug on Cisco TAC, where it is listed as fixed in 12.2(18)SXF2.

But I am running s3223-ipservices_wan-mz.122-18.SXF9.bin, so has this bug come back?

Thanks
Laci

e ninja wrote:
> Nemeth,
> 
> Your SUP crashed because TestSPRPInbandPing failed more than 10
> consecutive times. See the fix/workaround for CSCsc33990 below.
> 
> /eninja
> 
> 
> CSCsc33990
> 
> Symptoms: A supervisor engine may unexpectedly reset when the 
> TestSPRPInbandPing test, run as part of the Cisco Generic Online 
> Diagnostics (GOLD), fails 10 consecutive times.
> 
> The following syslog error messages are typically generated right before 
> the supervisor engine resets, and can also be found in the crashinfo files:
> 
> %CONST_DIAG-SP-3-HM_TEST_FAIL: Module <slot#> TestSPRPInbandPing consecutive failure count:5
> %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=10% RP=0% Traffic=0%
> netint_thr_active[0], Tx_Rate[4412], Rx_Rate[0]
> %CONST_DIAG-SP-3-HM_TEST_FAIL: Module <slot#> TestSPRPInbandPing consecutive failure count:10
> %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=10% RP=0% Traffic=0%
> netint_thr_active[0], Tx_Rate[4652], Rx_Rate[0]
> %CONST_DIAG-SP-2-HM_SUP_CRSH: Supervisor crashed due to unrecoverable errors, Reason: Failed TestSPRPInbandPing
> 
> Conditions: This symptom is observed on a Cisco Catalyst 6500 series 
> switch and Cisco 7600 series router that run an integrated Cisco IOS 
> software image. The trigger for the symptom may be possible corruption 
> in TCAM entries that are used to perform the TestSPRPInbandPing.
> 
> Workaround: Enter the no diagnostic crash global configuration command 
> to disable exceptions that are being triggered by failed diagnostic 
> monitoring. However, you should do this with discretion because it may 
> also prevent the system from taking proactive measures to mitigate 
> problems that could impact user traffic.
> 
> Further Information: The fix for this caveat is more of an enhancement 
> because it only prevents the system from being over-aggressive in taking 
> exceptions when the TestSPRPInbandPing fails under specific conditions. 
> Therefore, the fix for this caveat does not address all triggers that 
> may cause the TestSPRPInbandPing to fail. Please consult Cisco TAC for 
> further assistance if you experience the same problem after upgrading to 
> a Cisco IOS software image that contains the fix for this caveat.
> 
> 
> 
> 
> 
> On Fri, Feb 29, 2008 at 1:24 AM, Nemeth Laszlo <csirek at externet.hu 
> <mailto:csirek at externet.hu>> wrote:
> 
>     Hi!
> 
>     I put the crash file here:
> 
>     ftp://195.70.33.12/crashinfo_20080228-151329_cpu1
>     ftp://195.70.33.12/crashinfo_20080228-151329_cpu2
> 
> 
>     If anybody knows what the problem was, please don't keep it to yourself :)
> 
>     Could it be an IOS problem?
> 
>     Thanks
>     Laci
> 
> 
>     Leonardo Gama Souza wrote:
>      > Hi.
>      >
>      > It sounds like your MSFC crashed.
>      > You ought to look into the crashinfo file in order to figure out why.
>      >
>      > cheers,
>      > Leonardo Gama.
>      >
>      >
>     ------------------------------------------------------------------------
>      > *From:* cisco-nsp-bounces at puck.nether.net
>     <mailto:cisco-nsp-bounces at puck.nether.net> on behalf of Nemeth Laszlo
>      > *Sent:* Thu 28/2/2008 13:43
>      > *To:* cisco-nsp at puck.nether.net <mailto:cisco-nsp at puck.nether.net>
>      > *Subject:* [c-nsp] activ/standby cpu card status changed
>      >
>      > Hi!
>      >
>      > My 7604 router has 2 WS-SUP32-10GE-3B CPU cards in RPR-PLUS mode.
>      >
>      > System image file is
>     "sup-bootdisk:s3223-ipservices_wan-mz.122-18.SXF9.bin"
>      >
>      > I got these syslog messages, and after them the standby CPU card
>      > became active and the active one became standby. The CPU ran at 100%
>      > for 15 minutes. I saw a network L2 loop, but I don't know whether the
>      > L2 loop was caused by the CPU switchover or the switchover was caused
>      > by the L2 loop. I use RSTP. This router and 2 others are members of a
>      > small 10G ring.
>      >
>      > I can't find these error messages on cisco.com <http://cisco.com>.
>      >
>      > We had a similar problem on 1 January 2008, when a CPU state change
>      > also happened (the CPU was at 100% like now; at other times it sits
>      > at 0-2%).
>      >
>      > Any idea?
>      >
>      > Thanks
>      > Laci
>      >
>      > core2#sh redundancy history  | inc state
>      > Feb 28 16:13:33 *my state = ACTIVE(13) *peer state = DISABLED(1)
>      > Feb 28 16:17:12 *my state = ACTIVE(13) *peer state = UNKNOWN(0)
>      > Feb 28 16:17:21 *my state = ACTIVE(13) *peer state = STANDBY COLD(4)
>      > Feb 28 16:18:09 *my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)
>      > Feb 28 16:18:19 *my state = ACTIVE(13) *peer state = STANDBY HOT(8)
>      >
>      > core2#sh redundancy switchover
>      > Switchovers this system has experienced          : 1
>      > Last switchover reason                           : Active crashed.
>      > Uptime since this supervisor switched to active  : 8 weeks, 1 day, 18 hours, 50 minutes
>      > Total system uptime from reload                  : 28 weeks, 1 day, 1 hour, 29 minutes
>      >
>      > core2#sh redundancy switchover history
>      > Index  Previous  Current  Switchover             Switchover
>      >         active    active   reason                 time
>      > -----  --------  -------  ----------             ----------
>      >     1       1        2     active unit failed     22:44:19 MET Tue Jan 1 2008
>      >
>      >
>      >
>      > *Feb 28 16:11:12 MET: %CONST_DIAG-SP-STDBY-3-HM_TEST_FAIL: Module 1 TestSPRPInbandPing consecutive failure count:7
>      > *Feb 28 16:11:12 MET: %CONST_DIAG-SP-STDBY-6-HM_TEST_INFO: CPU util(5sec): SP=7% RP=0% Traffic=0%
>      > netint_thr_active[0], Tx_Rate[70], Rx_Rate[4946], dev=1[IPv4, fail=7]
>      > *Feb 28 16:13:12 MET: %CONST_DIAG-SP-STDBY-3-HM_TEST_FAIL: Module 1 TestSPRPInbandPing consecutive failure count:14
>      > *Feb 28 16:13:12 MET: %CONST_DIAG-SP-STDBY-6-HM_TEST_INFO: CPU util(5sec): SP=2% RP=0% Traffic=0%
>      > netint_thr_active[0], Tx_Rate[70], Rx_Rate[8290], dev=1[IPv4, fail=14]
>      > Feb 28 16:13:33 MET: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1, changed state to down
>      > Feb 28 16:13:33 MET: %BGP-5-ADJCHANGE: neighbor xx.xxx.xxx.xxx Down Interface flap
>      > Feb 28 16:13:33 MET: %PFREDUN-SP-6-ACTIVE: Standby processor removed or reloaded, changing to Simplex mode
>      > Feb 28 16:13:33 MET: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet1/1, changed state to down
>      > Feb 28 16:13:33 MET: %LINEPROTO-SP-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1, changed state to down
>      > Feb 28 16:17:11 MET: %PFREDUN-SP-6-ACTIVE: Standby initializing for RPR-PLUS mode
>      > Feb 28 16:17:11 MET: %SYS-SP-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
>      >
>      > -
>      > _______________________________________________
>      > cisco-nsp mailing list  cisco-nsp at puck.nether.net
>     <mailto:cisco-nsp at puck.nether.net>
>      > https://puck.nether.net/mailman/listinfo/cisco-nsp
>      > archive at http://puck.nether.net/pipermail/cisco-nsp/
>      >
> 
> 
>
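
The HM_TEST_FAIL messages quoted in this thread carry a running "consecutive failure count", and the CSCsc33990 text says the supervisor may reset once that count reaches 10. As a rough sketch of how a saved syslog could be scanned for modules getting close to that point, something like the following might work. The names FAIL_RE, CRASH_THRESHOLD, max_failures and at_risk are made up for this illustration, and it assumes each syslog message sits on one unwrapped line in the format shown above; it is not a Cisco tool.

```python
import re

# Pattern for the HM_TEST_FAIL format quoted in this thread, with or
# without the -STDBY tag. Assumes the message is on a single line.
FAIL_RE = re.compile(
    r"%CONST_DIAG-SP(?:-STDBY)?-3-HM_TEST_FAIL: "
    r"Module (\S+) TestSPRPInbandPing consecutive failure count:(\d+)"
)

# Taken from the CSCsc33990 description: reset after 10 consecutive failures.
CRASH_THRESHOLD = 10


def max_failures(lines):
    """Return {module: highest consecutive failure count seen}."""
    worst = {}
    for line in lines:
        match = FAIL_RE.search(line)
        if match:
            module, count = match.group(1), int(match.group(2))
            worst[module] = max(worst.get(module, 0), count)
    return worst


def at_risk(lines, threshold=CRASH_THRESHOLD):
    """Return the sorted modules whose count reached the threshold."""
    return sorted(m for m, n in max_failures(lines).items() if n >= threshold)


# Two messages copied from the original post, unwrapped onto single lines.
SAMPLE = [
    "*Feb 28 16:11:12 MET: %CONST_DIAG-SP-STDBY-3-HM_TEST_FAIL: Module 1 "
    "TestSPRPInbandPing consecutive failure count:7",
    "*Feb 28 16:13:12 MET: %CONST_DIAG-SP-STDBY-3-HM_TEST_FAIL: Module 1 "
    "TestSPRPInbandPing consecutive failure count:14",
]

if __name__ == "__main__":
    print(max_failures(SAMPLE))
    print(at_risk(SAMPLE))
```

Lines that do not match the pattern (HM_TEST_INFO, LINK, BGP and so on) are simply skipped, so a whole log file can be fed in as-is, for example with at_risk(open("syslog.txt")) where syslog.txt is a placeholder file name.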