[c-nsp] RES: activ/standby cpu card status changed

Fri Feb 29 16:13:31 EST 2008

Leo,

FYI, from the bug notes:

"....the fix for this caveat does not address all triggers that may cause
the TestSPRPInbandPing to fail"

/eninja ;)

On Fri, Feb 29, 2008 at 12:43 PM, Leonardo Gama Souza <leonardo.souza at nec.com.br> wrote:

> Actually, this bug had already been corrected in SXF2...
>
> ------------------------------
> *From:* e ninja [mailto:eninja at gmail.com]
> *Sent:* Fri 29/2/2008 17:29
> *To:* Nemeth Laszlo
> *Cc:* Leonardo Gama Souza; cisco-nsp at puck.nether.net
> *Subject:* Re: [c-nsp] RES: activ/standby cpu card status changed
>
> Nemeth,
>
> Your SUP crashed because it failed more than 10 consecutive
> TestSPRPInbandPing tests. See the fix/workaround for CSCsc33990 below.
>
> /eninja
>
>
> CSCsc33990
>
> Symptoms: A supervisor engine may unexpectedly reset when the
> TestSPRPInbandPing as part of the Cisco Generic Online Diagnostics (GOLD)
> fails for 10 consecutive times.
>
> The following syslog error messages are typically generated right before
> the supervisor engine resets, and can also be found in the crashinfo files:
>
> %CONST_DIAG-SP-3-HM_TEST_FAIL: Module <slot#> TestSPRPInbandPing consecutive failure count:5
> %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=10% RP=0% Traffic=0% netint_thr_active[0], Tx_Rate[4412], Rx_Rate[0]
> %CONST_DIAG-SP-3-HM_TEST_FAIL: Module <slot#> TestSPRPInbandPing consecutive failure count:10
> %CONST_DIAG-SP-6-HM_TEST_INFO: CPU util(5sec): SP=10% RP=0% Traffic=0% netint_thr_active[0], Tx_Rate[4652], Rx_Rate[0]
> %CONST_DIAG-SP-2-HM_SUP_CRSH: Supervisor crashed due to unrecoverable errors, Reason: Failed TestSPRPInbandPing
>
> Conditions: This symptom is observed on a Cisco Catalyst 6500 series
> switch and Cisco 7600 series router that run an integrated Cisco IOS
> software image. The trigger for the symptom may be possible corruption in
> TCAM entries that are used to perform the TestSPRPInbandPing.
>
> Workaround: Enter the no diagnostic crash global configuration command to
> disable exceptions that are being triggered by failed diagnostic monitoring.
> However, you should do this with discretion because it may also prevent the
> system from taking proactive measures to mitigate problems that could impact
> user traffic.
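>
> A minimal sketch of how the workaround above might be entered, assuming
> the usual IOS configuration-mode sequence. The hostname core2 is borrowed
> from the output later in this thread, and saving the configuration
> afterwards is left out:
>
> ```
> core2# configure terminal
> core2(config)# no diagnostic crash
> core2(config)# end
> ```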
>
> Further Information: The fix for this caveat is more of an enhancement
> because it only prevents the system from being over-aggressive in taking
> exceptions when the TestSPRPInbandPing fails under specific conditions.
> Therefore, the fix for this caveat does not address all triggers that may
> cause the TestSPRPInbandPing to fail. Please consult Cisco TAC for further
> assistance if you experience the same problem after upgrading to a Cisco IOS
> software image that contains the fix for this caveat.
>
>
>
>
> On Fri, Feb 29, 2008 at 1:24 AM, Nemeth Laszlo <csirek at externet.hu> wrote:
>
> > Hi!
> >
> > I put the crash file here:
> >
> > ftp://195.70.33.12/crashinfo_20080228-151329_cpu1
> > ftp://195.70.33.12/crashinfo_20080228-151329_cpu2
> >
> >
> > If anybody knows what the problem was, please don't keep it to yourself :)
> >
> > Is it possibly an IOS problem?
> >
> > Thanks
> > Laci
> >
> >
> > Leonardo Gama Souza wrote:
> > > Hi.
> > >
> > > It sounds like your MSFC crashed.
> > > You ought to look into the crashinfo file in order to figure out why.
> > >
> > > cheers,
> > > Leonardo Gama.
> > >
> > >
> > > ------------------------------------------------------------------------
> > > *From:* cisco-nsp-bounces at puck.nether.net on behalf of Nemeth Laszlo
> > > *Sent:* Thu 28/2/2008 13:43
> > > *To:* cisco-nsp at puck.nether.net
> > > *Subject:* [c-nsp] activ/standby cpu card status changed
> > >
> > > Hi!
> > >
> > > My 7604 router has two WS-SUP32-10GE-3B CPU cards in RPR-PLUS mode.
> > >
> > > System image file is "sup-bootdisk:s3223-ipservices_wan-mz.122-18.SXF9.bin"
> > >
> > > I got these syslog messages, and afterwards the CPU cards swapped
> > > roles: the standby became active and the active became standby. The
> > > CPU stayed at 100% for 15 minutes. I saw an L2 loop in the network,
> > > but I don't know whether the L2 loop was caused by the CPU switchover
> > > or the switchover was caused by the L2 loop. I use RSTP. This router
> > > and two others are members of a small 10G ring.
> > >
> > > I can't find these error messages on cisco.com.
> > >
> > > We had a similar problem on 1 January 2008, when a CPU state change
> > > also happened (the CPU was at 100% like now; at other times it runs
> > > at 0-2%).
> > >
> > > Any idea?
> > >
> > > Thanks
> > > Laci
> > >
> > > core2#sh redundancy history  | inc state
> > > Feb 28 16:13:33 *my state = ACTIVE(13) *peer state = DISABLED(1)
> > > Feb 28 16:17:12 *my state = ACTIVE(13) *peer state = UNKNOWN(0)
> > > Feb 28 16:17:21 *my state = ACTIVE(13) *peer state = STANDBY COLD(4)
> > > Feb 28 16:18:09 *my state = ACTIVE(13) *peer state = STANDBY COLD-CONFIG(5)
> > > Feb 28 16:18:19 *my state = ACTIVE(13) *peer state = STANDBY HOT(8)
> > >
> > > core2#sh redundancy switchover
> > > Switchovers this system has experienced          : 1
> > > Last switchover reason                           : Active crashed.
> > > Uptime since this supervisor switched to active  : 8 weeks, 1 day, 18 hours, 50 minutes
> > > Total system uptime from reload                  : 28 weeks, 1 day, 1 hour, 29 minutes
> > >
> > > core2#sh redundancy switchover history
> > > Index  Previous  Current  Switchover             Switchover
> > >         active    active   reason                 time
> > > -----  --------  -------  ----------             ----------
> > >     1       1        2     active unit failed     22:44:19 MET Tue Jan 1 2008
> > >
> > >
> > >
> > > *Feb 28 16:11:12 MET: %CONST_DIAG-SP-STDBY-3-HM_TEST_FAIL: Module 1 TestSPRPInbandPing consecutive failure count:7
> > > *Feb 28 16:11:12 MET: %CONST_DIAG-SP-STDBY-6-HM_TEST_INFO: CPU util(5sec): SP=7% RP=0% Traffic=0% netint_thr_active[0], Tx_Rate[70], Rx_Rate[4946], dev=1[IPv4, fail=7]
> > > *Feb 28 16:13:12 MET: %CONST_DIAG-SP-STDBY-3-HM_TEST_FAIL: Module 1 TestSPRPInbandPing consecutive failure count:14
> > > *Feb 28 16:13:12 MET: %CONST_DIAG-SP-STDBY-6-HM_TEST_INFO: CPU util(5sec): SP=2% RP=0% Traffic=0% netint_thr_active[0], Tx_Rate[70], Rx_Rate[8290], dev=1[IPv4, fail=14]
> > > Feb 28 16:13:33 MET: %LINEPROTO-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1, changed state to down
> > > Feb 28 16:13:33 MET: %BGP-5-ADJCHANGE: neighbor xx.xxx.xxx.xxx Down Interface flap
> > > Feb 28 16:13:33 MET: %PFREDUN-SP-6-ACTIVE: Standby processor removed or reloaded, changing to Simplex mode
> > > Feb 28 16:13:33 MET: %LINK-SP-3-UPDOWN: Interface TenGigabitEthernet1/1, changed state to down
> > > Feb 28 16:13:33 MET: %LINEPROTO-SP-5-UPDOWN: Line protocol on Interface TenGigabitEthernet1/1, changed state to down
> > > Feb 28 16:17:11 MET: %PFREDUN-SP-6-ACTIVE: Standby initializing for RPR-PLUS mode
> > > Feb 28 16:17:11 MET: %SYS-SP-3-LOGGER_FLUSHED: System was paused for 00:00:00 to ensure console debugging output.
> > >
> > > -
> > > _______________________________________________
> > > cisco-nsp mailing list  cisco-nsp at puck.nether.net
> > > https://puck.nether.net/mailman/listinfo/cisco-nsp
> > > archive at http://puck.nether.net/pipermail/cisco-nsp/
> > >
> >
>
>