[c-nsp] Packet Memory buffer test detected errors

David Freedman david.freedman at uk.clara.net
Thu Jul 6 12:48:20 EDT 2006


from DDTS CSCdz57255:

--------------------------------------------------------------------

Symptoms:

Devices connected to a Catalyst 4000 or 4500 family IOS switch may sometimes
experience poor network connectivity. Some packets are dropped by the switch
due to a faulty component (SRAM) on the supervisor. Larger packets are
affected more than smaller ones. The transient SRAM failure is very rare and
has been seen on only a very small number of supervisors. A similar problem
has been identified on Catalyst 4000 Supervisor Engine II. Please see
CSCdy46288 for details.

Note that poor connectivity can also result from various network
misconfigurations, and replacing the supervisor in those cases will not fix
the problem. Hence it is highly recommended that the following steps be taken
to confirm the transient SRAM problem.

The following indications will be present. Please capture the output of the
tests requested below.

1. Successive iterations of the command show platform cpu packets statistics
all for Cisco IOS Releases 12.1(12c)EW and higher will show the counter for
VlanZeroBadCrc under "Packets Dropped In Processing by Reason" steadily
increasing. The increase over several minutes will be in the range of
hundreds or thousands. If a small number of VlanZeroBadCrc are seen and the
number is not increasing, then that is not an indication of a problem.

For Cisco IOS Releases 12.1(11b)EW1 and lower, successive iterations of the
command show platform cpuport all will show the VlanZeroBadCrc counter (for
12.1(11b)EW1) or the VlanZero counter (for 12.1(8a)EW1) under "Packets
Dropped In Processing by Reason" steadily increasing in the range of hundreds
or thousands over several minutes.

The following information should be captured immediately:

show platform cpu packets statistics all, or
show platform cpuport all (several iterations)
show platform software interface all (several iterations)

2. Perform a soft reset by issuing the reload command. In the presence of
this bug, the Supervisor will fail POST. The POST results should be captured
to a text file.
3. If the customer is running an image equal to or higher than 12.2(18)EW,
capture the output of the following command: show diagnostics result module
all detail
4. Perform a power cycle (power off/on) of the switch. On booting up, the
Supervisor will pass POST and there will be no further symptoms of the
problem. The POST results should be captured to a text file, and a show tech
should be collected.

All indications must be present in order to conclude that the problem was due
to this bug.
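
As a rough sketch of the counter check in step 1 and of the "all indications"
rule, the Python below works on VlanZeroBadCrc values that have already been
read by hand from successive iterations of the command. It does not parse any
switch output, because the exact output layout is not shown here. The function
names and the threshold of 100 are illustrative assumptions taken from the
"hundreds or thousands" wording above, not Cisco-defined values.

```python
from typing import Sequence


def counter_steadily_increasing(samples: Sequence[int],
                                min_total_increase: int = 100) -> bool:
    """Return True if each sample is higher than the previous one and the
    overall increase reaches min_total_increase (an assumed threshold)."""
    if len(samples) < 2:
        return False  # a single reading cannot show a trend
    rising = all(later > earlier
                 for earlier, later in zip(samples, samples[1:]))
    return rising and (samples[-1] - samples[0]) >= min_total_increase


def all_indications_present(samples: Sequence[int],
                            post_failed_after_reload: bool,
                            post_passed_after_power_cycle: bool,
                            symptoms_gone_after_power_cycle: bool) -> bool:
    """Every indication must hold before the problem is attributed to the
    transient SRAM fault."""
    return (counter_steadily_increasing(samples)
            and post_failed_after_reload
            and post_passed_after_power_cycle
            and symptoms_gone_after_power_cycle)
```

A small, static counter (for example 12, 12, 12) makes the first function
return False, which matches the note in step 1 that such a reading is not an
indication of a problem.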

If you are running 12.2(18)EW or later and encounter a message such as
"%C4K_L3HWFORWARDING-3-FTECONSISTENCYCHECKFAILED: FwdTableEntry Consistency
Check Failed: index 98339", then the supervisor has most probably encountered
an SRAM corruption in the memory used as "forwarding memory". Please refer to
bug CSCed49194.
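
If logs are collected on a syslog server, a search for this message could be
sketched in Python as follows. The pattern is built only from the single
example message quoted above, so any other variants of the text are an
unknown; the function name is hypothetical.

```python
import re
from typing import Optional

# Pattern derived from the one example message quoted in this note.
FTE_CHECK_FAILED = re.compile(
    r"%C4K_L3HWFORWARDING-3-FTECONSISTENCYCHECKFAILED:\s*"
    r"FwdTableEntry Consistency Check Failed:\s*index\s+(\d+)"
)


def find_fte_failure_index(log_line: str) -> Optional[int]:
    """Return the reported index from a matching log line, else None."""
    match = FTE_CHECK_FAILED.search(log_line)
    return int(match.group(1)) if match else None
```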


Conditions:

This problem has been traced to an SRAM component failure which is transient
in nature. The incidence of this failure is extremely rare and is well below
the predicted failure rates for this component. If you believe that you have
encountered this bug, please open a case with the TAC (Technical Assistance
Center) and attach all the above captured information to the case. Boards
exhibiting this failure should be replaced using RMA.


Workaround:

For software releases earlier than 12.1(19)EW, hard resetting the switch by
powering it off and on is the only workaround.

For software releases 12.1(19)EW and later but prior to 12.2(18)EW, the
SRAM workaround incorporated is "partial". For software release 12.1(21)E
and earlier, the SRAM workaround is "partial" too. The software detects, logs
and takes appropriate action, depending upon the configuration mode. Here the
SRAM workaround can be configured in any of the three modes (conservative,
normal or aggressive) using the following command:

(config) diagnostic monitor action <conservative | normal | aggressive>

conservative : Directed memory tests are not run, so the failure is not
reliably detected. Does not reset the switch on error detection, but does
syslog the message.

normal : Directed memory tests are run, so the failure is reliably detected.
Does not reset the switch on error detection, but does syslog the message.

aggressive : Directed memory tests are run, so the failure is reliably
detected. Soft-resets the switch on error detection and syslogs the message.
On bootup, the supervisor remains in the faulty state. This action allows for
either a redundant supervisor engine or network-level redundancy to take over.

For software release 12.2(18)EW onwards, an SRAM workaround is incorporated
to automatically detect the failure and take action to recover from the
failed state, depending upon the configuration. This SRAM workaround can be
configured in any of the three modes (conservative, normal or aggressive) as
described below:

(config) diagnostic monitor action <conservative | normal | aggressive>

conservative : Directed memory tests are not run, so the failure is not
reliably detected. Does not reset the switch on error detection, but does
syslog the message.

normal : Directed memory tests are run, so the failure is reliably detected.
On detection of the failure, the supervisor resets and, on bootup, removes the
affected memory from use and continues to function with the available "good"
memory. It syslogs the message at regular intervals.

aggressive : Directed memory tests are run, so the failure is reliably
detected. Soft-resets the switch on error detection and syslogs the message.
On bootup, the supervisor fails to come online. This action allows for either
a redundant supervisor engine or network-level redundancy to take over.
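
The mode descriptions for the two EW release ranges can be summarised as a
lookup table. The Python below is only a restatement of the text above for
quick comparison; the band labels, the Behavior type and the describe()
helper are hypothetical names, and the 12.1(21)E remark is left out.

```python
from typing import Dict, NamedTuple


class Behavior(NamedTuple):
    directed_memory_tests: bool  # reliable detection depends on these tests
    resets_on_detection: bool
    after_reset: str             # empty when no reset takes place


# Hypothetical labels for the two release ranges described in the text.
PARTIAL = "12.1(19)EW up to, not including, 12.2(18)EW"
FULL = "12.2(18)EW and later"

# Every mode in both ranges also syslogs the message.
BEHAVIOR: Dict[str, Dict[str, Behavior]] = {
    PARTIAL: {
        "conservative": Behavior(False, False, ""),
        "normal": Behavior(True, False, ""),
        "aggressive": Behavior(True, True,
                               "supervisor remains in the faulty state"),
    },
    FULL: {
        "conservative": Behavior(False, False, ""),
        "normal": Behavior(True, True,
                           "affected memory removed; runs on good memory"),
        "aggressive": Behavior(True, True,
                               "supervisor fails to come online"),
    },
}


def describe(band: str, mode: str) -> Behavior:
    """Look up the behaviour for a release band and mode (KeyError if unknown)."""
    return BEHAVIOR[band][mode]
```

The one difference between the two ranges is the normal mode: it only logs in
the earlier range, while from 12.2(18)EW onwards it resets and carries on
with the remaining good memory.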

For a detailed explanation, please refer to the release notes for bug
CSCed61591.

In any case, on detection of the problem, the supervisor needs to be RMA'ed.

All diagnostics and all actions can be completely disabled (even if there is
a standby supervisor present) with this CLI:

(config) no diagnostic monitor action




