[c-nsp] Packet Memory buffer test detected errors

Thu Jul 6 15:33:12 EDT 2006

On Thursday,  6 July 2006 at 17:48:20 +0100, David Freedman wrote:
> from DDTS CSCdz57255:
> 
> --------------------------------------------------------------------
> 
> Symptoms:
> 
> Devices connected to a Catalyst 4000 or 4500 family IOS switch may
> sometimes experience poor network connectivity. Some packets are
> dropped by the switch due to a faulty component (SRAM) on the
> supervisor. Larger packets are affected more than the smaller ones.
> The transient SRAM failure is very rare and has been seen on only a
> very small number of supervisors. A similar problem has been
> identified on Catalyst 4000 Supervisor Engine II. Please see
> CSCdy46288 for details.
> 
> Note that poor connectivity can also be a result of various network
> misconfigurations, and replacing the supervisor in those cases will
> not fix the problem. Hence it is highly recommended that the
> following steps be taken to confirm the transient SRAM problem.
> 
> The following indications will be present. Please capture the output
> of the tests requested below.
> 
> 1. Successive iterations of the command show platform cpu packets
> statistics all for Cisco IOS Releases 12.1(12c)EW and higher will
> show the counter for VlanZeroBadCrc under the "Packets Dropped In
> Processing by Reason" steadily increasing. The increase over several
> minutes will be in the range of hundreds or thousands. If a small
> number of VlanZeroBadCrc are seen and the number is not increasing,
> then that is not an indication of a problem.

Heh.. Thanks for the quick response, David, but VlanZeroBadCrc is always
zero for me.. So I wonder what kind of problem I have? :)
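
To make the "steadily increasing" criterion from step 1 concrete, here is a
minimal sketch in Python. It assumes the VlanZeroBadCrc values were copied
by hand from successive runs of the show command. The function name and the
threshold of 100 (my reading of "hundreds or thousands") are assumptions,
not anything taken from the DDTS text.

```python
def counter_is_steadily_increasing(samples, min_total_increase=100):
    """Return True if the readings match the pattern described in step 1.

    samples: counter values from successive runs of the show command,
    oldest first. min_total_increase is an assumed lower bound for
    "hundreds or thousands"; adjust it as needed.
    """
    if len(samples) < 2:
        return False  # a single reading cannot show a trend
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    # Every reading must be higher than the one before it...
    rising = all(delta > 0 for delta in deltas)
    # ...and the total growth must reach the assumed threshold.
    return rising and (samples[-1] - samples[0]) >= min_total_increase


# Counter always zero, as in the reply above: no indication of this bug.
print(counter_is_steadily_increasing([0, 0, 0, 0]))
# Small, non-increasing count: also not an indication of a problem.
print(counter_is_steadily_increasing([3, 3, 3, 3]))
# Growth by thousands over a few readings: matches the described symptom.
print(counter_is_steadily_increasing([10, 250, 900, 2100]))
```

With a counter that stays at zero, the check returns False, so this first
indication is not met and the remaining steps cannot confirm the bug.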

> 
> For Cisco IOS Releases 12.1(11b)EW1 and lower, successive iterations
> of the command show platform cpuport all will show the VlanZeroBadCrc
> counter (for 12.1(11b)EW1) or the VlanZero counter (for 12.1(8a)EW1)
> under the "Packets Dropped In Processing by Reason" steadily
> increasing in the range of hundreds or thousands over several minutes.
> 
> The following information should be captured immediately:
> 
> show platform cpu packets statistics all or
> show platform cpuport all (several iterations)
> show platform software interface all (several iterations)
> 
> 2. Perform a soft reset by issuing the reload command. In the
> presence of this bug, the Supervisor will fail POST. The POST results
> should be captured to a text file.
> 3. If the customer is running an image equal to or higher than
> 12.2(18)EW, capture the output of the following command: show
> diagnostics result module all detail
> 4. Perform a power cycle (power off/on) of the switch. On booting up,
> the Supervisor will pass POST and there will be no further symptoms of
> the problem. The POST results should be captured to a text file, and a
> show tech should be collected.
> 
> All indications must be present in order to conclude that the problem
> was due to this bug.
> 
> If you are running 12.2(18)EW or later and encounter a message such as
> "%C4K_L3HWFORWARDING-3-FTECONSISTENCYCHECKFAILED: FwdTableEntry
> Consistency Check Failed: index 98339", then most probably the
> supervisor has encountered an SRAM corruption in the memory used as
> "forwarding memory". Please refer to the bug CSCed49194.
> 
> 
> Conditions:
> 
> This problem has been traced to an SRAM component failure which is
> transient in nature. The incidence of this failure is extremely rare
> and is well below the predicted failure rates for this component. If
> you believe that you have encountered this bug please open a case with
> the TAC (Technical Assistance Center) and attach all the above
> captured information to the case. Boards exhibiting this failure
> should be replaced using RMA.
> 
> 
> Workaround:
> 
> For the software releases earlier than 12.1(19)EW, hard resetting the
> switch by powering it OFF & ON is the only workaround.
> 
> For the software releases 12.1(19)EW and later but prior to
> 12.2(18)EW, the SRAM workaround incorporated is "partial". For the
> software release 12.1(21)E and earlier, the SRAM workaround is
> "partial" too. The software detects, logs and takes appropriate
> action, depending upon the configuration mode. Here the SRAM
> workaround can be configured in any of the 3 modes: normal,
> conservative or aggressive, using the following command:
> 
> (config) diagnostic monitor action <conservative | normal | aggressive>
> 
> conservative : Directed memory tests are not run, so does not reliably
> detect the failure. Does not reset the switch on error detection, but
> does syslog the message.
> 
> normal : Directed memory tests are run, so reliably detects the
> failure. Does not reset the switch on error detection, but does syslog
> the message.
> 
> aggressive : Directed memory tests are run, so reliably detects the
> failure. Soft-resets the switch on error detection & syslogs the
> message. On bootup, the supervisor remains in the faulty state. This
> action allows for either a redundant supervisor engine or
> network-level redundancy to take over.
> 
> For software release 12.2(18)EW onwards, an SRAM workaround is
> incorporated to automatically detect the failure and take action to
> recover from the failed state depending upon the configuration. This
> SRAM workaround can be configured in any of the 3 modes: conservative,
> normal or aggressive as described below:
> 
> (config) diagnostic monitor action <conservative | normal | aggressive>
> 
> conservative : Directed memory tests are not run, so does not reliably
> detect the failure. Does not reset the switch on error detection, but
> does syslog the message.
> 
> normal : Directed memory tests are run, so reliably detects the
> failure. On detection of the failure, supervisor resets and on bootup,
> removes the affected memory from the usage and continues to function
> with the available "good" memory. It syslogs the message at regular
> intervals.
> 
> aggressive : Directed memory tests are run, so reliably detects the
> failure. Soft-resets the switch on error detection & syslogs the
> message. On bootup, the supervisor fails to come online. This action
> allows for either a redundant supervisor engine or network-level
> redundancy to take over.
> 
> For detailed explanation, please refer to the release-notes for the bug
> CSCed61591.
> 
> In any case, on the detection of the problem, the supervisor needs to be
> RMA'ed.
> 
> All diagnostics and all actions can be completely disabled (even if
> there is a standby supervisor present) with this CLI:
> 
> (config) no diagnostic monitor action

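For reference, the behaviour of the three "diagnostic monitor action" modes
in the quoted workaround text can be summarised as a small lookup table. This
is only a sketch of my reading of that text: the Python names and the two
release-range labels are hypothetical, and nothing here talks to a switch.
In every mode the message is syslogged, so that is not repeated per entry.

```python
from typing import NamedTuple


class Behaviour(NamedTuple):
    directed_tests: bool   # directed memory tests run, so detection is reliable
    resets_on_error: bool  # supervisor is reset when the error is detected
    after_reset: str       # state after that reset ("" when there is no reset)


# Hypothetical labels for the two release ranges named in the quoted text.
PARTIAL = "12.1(19)EW and later, prior to 12.2(18)EW"
FULL = "12.2(18)EW onwards"

BEHAVIOUR = {
    (PARTIAL, "conservative"): Behaviour(False, False, ""),
    (PARTIAL, "normal"): Behaviour(True, False, ""),
    (PARTIAL, "aggressive"): Behaviour(True, True, "remains in the faulty state"),
    (FULL, "conservative"): Behaviour(False, False, ""),
    (FULL, "normal"): Behaviour(True, True, "continues on the remaining good memory"),
    (FULL, "aggressive"): Behaviour(True, True, "fails to come online"),
}


def describe(release_range, mode):
    """Return a one-line summary for a release range label and mode."""
    b = BEHAVIOUR[(release_range, mode)]
    detection = "reliable detection" if b.directed_tests else "unreliable detection"
    if b.resets_on_error:
        action = "resets, then " + b.after_reset
    else:
        action = "no reset"
    return f"{mode}: {detection}, {action}, syslogs the message"


for key in BEHAVIOUR:
    print(key[0], "|", describe(*key))
```

The only difference between the two ranges in this reading is the "normal"
mode, which resets and carries on with the good memory from 12.2(18)EW
onwards but only logs in the earlier releases.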
-- 
========================================================================= 
= Best regards, Nikolay Pavlov. <<<------------------------------------ = 
=========================================================================