[c-nsp] FABRIC-3-ERR_HANDLE

Aaron dudepron at gmail.com
Mon Nov 16 11:19:12 EST 2009


It is normal to have a CSC in standby mode. If something goes wrong with the
other CSC, it takes over.


Step 1 - Gather data before making any changes

        term length 0    (so you don’t have to hit enter)
        show log
        show tech
        show monitor event-trace fab
        show monitor event-trace agent-ctrl
        show monitor event-trace board_mgr
        show monitor event-trace lci
        execute-on all show controllers fia (x5 times or so)
        show controllers errors fabric counters (x5 times or so)
        show controllers errors (x5 times or so)
        show controllers xbar (x5 times or so)
        show controllers sca (x5 times or so)
        show controllers clock
        show controllers fab-clk
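
If you want to paste the whole capture into a logged session in one go, the list above can be expanded by a small script. This is only a sketch: it builds the command list and does not talk to the router. The helper name `build_capture_plan` and the choice to repeat each command back to back (rather than cycling through the set 5 times) are my assumptions, not part of the procedure.

```python
# Commands are copied from Step 1; all Python names here are hypothetical.
ONCE_FIRST = [
    "term length 0",
    "show log",
    "show tech",
    "show monitor event-trace fab",
    "show monitor event-trace agent-ctrl",
    "show monitor event-trace board_mgr",
    "show monitor event-trace lci",
]
# Counter-style outputs, captured several times so increments are visible.
REPEATED = [
    "execute-on all show controllers fia",
    "show controllers errors fabric counters",
    "show controllers errors",
    "show controllers xbar",
    "show controllers sca",
]
ONCE_LAST = [
    "show controllers clock",
    "show controllers fab-clk",
]


def build_capture_plan(repeats=5):
    """Return the Step 1 commands in order, repeating the counter commands."""
    plan = list(ONCE_FIRST)
    for command in REPEATED:
        plan.extend([command] * repeats)
    plan.extend(ONCE_LAST)
    return plan


if __name__ == "__main__":
    print("\n".join(build_capture_plan()))
```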





Step 2 - Determine if the issue is with a single or multiple slots,
including the RP slots

Step 3 - Check the location of the primary clock scheduler, whether both CSCs
are active (from *show controllers clock*), and the number of SFCs. If there
is only 1 CSC, troubleshoot the missing CSC first. Ensure that you will have 4
active fabric cards before OIRing a card, since line cards may go out of
service due to lack of fabric bandwidth.
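
As a sanity check before pulling anything, the rule in Step 3 can be written down as a one-liner. A sketch only: I am reading "will have 4 active fabric cards" as "4 cards remain active while one card is out", and the function name `safe_to_oir` is made up.

```python
def safe_to_oir(active_fabric_cards, minimum_remaining=4):
    """True if at least `minimum_remaining` fabric cards (CSC + SFC)
    stay active while one card is pulled.

    Assumption: the count includes the card about to be removed.
    """
    return active_fabric_cards - 1 >= minimum_remaining


if __name__ == "__main__":
    for count in (5, 4):
        print(count, "active ->", "ok to OIR" if safe_to_oir(count) else "do not OIR yet")
```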

Step 4 - *CRC- and LOS errors in control path from CSC to SFC cards*

Explanation <#CRC_and_LOS_errors_control_path>

From *show controllers xbar*: on a 120XX chassis look at the Interrupt status
field; on 124XX and 128XX look at the Control LOS status and Control CRC error
fields. If they are 0, go to step 5.

Check which card is primary from *show controllers clock* and whether both
are present.

If the errors are incrementing on all fabric cards, OIR the primary CSC.

If the errors are incrementing on only one fabric card, OIR that fabric card.

If *show controllers xbar* shows no more errors afterwards, the issue was
seating; otherwise RMA the card.
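
The Step 4 decision can be sketched in a few lines once you have copied the error counters out of the repeated *show controllers xbar* captures by hand. Everything here is illustrative: the input layout (card name mapped to a list of counter samples), the card names and the function names are my assumptions, and nothing parses real router output.

```python
def is_incrementing(samples):
    """True if the counter grew between the first and the last capture."""
    return len(samples) >= 2 and samples[-1] > samples[0]


def step4_action(samples_by_card):
    """samples_by_card: e.g. {"CSC_0": [0, 0, 0], "SFC_1": [3, 7, 12]}.

    Each list holds one error counter value per capture and must not be empty.
    """
    if all(max(samples) == 0 for samples in samples_by_card.values()):
        return "go to step 5"
    bad = sorted(card for card, samples in samples_by_card.items()
                 if is_incrementing(samples))
    if not bad:
        return "errors present but not incrementing (not covered by step 4)"
    if len(bad) == len(samples_by_card):
        return "OIR primary CSC"
    if len(bad) == 1:
        return "OIR fabric card " + bad[0]
    return "errors on several but not all cards (not covered by step 4)"


if __name__ == "__main__":
    print(step4_action({"CSC_0": [0, 0, 0], "SFC_1": [3, 7, 12]}))
```

After the OIR, run the same comparison on fresh captures: if nothing increments any more, it was seating.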



Step 5 - *CSC Clocking and Synchronization problems*

Explanation <#CSC_clocking_and_sync>

Use *show controllers clock* and *show controllers errors* (CLKSTS field).

Check which card is primary from *show controllers clock*.

If all the cards are using the primary clock (default is CSC_0), go to step 6.

Only cards in IOS RUN, RP ACTV or RP STBY state count as not using the same
clock; if the cards in question are in any other state, go to step 6.

If multiple cards are not using the primary clock, OIR the primary CSC; if the
problem persists, RMA the primary CSC.

If a single card is not using the primary clock, OIR the suspect card; if the
problem persists, RMA the suspect card.
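
The same logic as a rough sketch, assuming you have already noted each slot's state and which clock it reports. The dictionary layout, the slot numbers and the function name are hypothetical; the state names are the ones from the step above.

```python
# States in which a card is expected to follow the primary clock (from Step 5).
RELEVANT_STATES = {"IOS RUN", "RP ACTV", "RP STBY"}


def step5_action(cards, primary="CSC_0"):
    """cards: {slot: (state, clock_in_use)}, filled in by hand from the captures."""
    off_primary = sorted(
        slot for slot, (state, clock) in cards.items()
        if clock != primary and state in RELEVANT_STATES
    )
    if not off_primary:
        return "go to step 6"
    if len(off_primary) == 1:
        return f"OIR card in slot {off_primary[0]}; RMA it if the problem persists"
    return "OIR primary CSC; RMA it if the problem persists"


if __name__ == "__main__":
    example = {
        0: ("IOS RUN", "CSC_0"),
        1: ("IOS RUN", "CSC_1"),
        9: ("RP ACTV", "CSC_0"),
    }
    print(step5_action(example))
```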



Step 6 - *ToFab FIA Halt*

Explanation <#ToFab_FIA_Halt>

This applies if a syslog message or *execute-on all show controllers fia*
shows errors.

If the RP has failed over and line cards are also halted, suspect the chassis
or backplane. If only a line card is halted, the router tries to recover it
several times; if it cannot recover, the RP resets the line card and runs
additional tests. If the line card fails those tests, RMA the line card.
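
For completeness, Step 6 as a small decision helper. This is a sketch under my own naming (`step6_action` and its parameters are made up), and the case of an RP failover with no halted line cards is not covered by the step, so the function says so instead of guessing.

```python
def step6_action(rp_failed_over, halted_line_cards, failed_additional_tests=False):
    """halted_line_cards: list of slot numbers showing a ToFab FIA halt."""
    if rp_failed_over and halted_line_cards:
        return "suspect chassis or backplane"
    if halted_line_cards:
        if failed_additional_tests:
            return "RMA line card"
        return "wait for the router's recovery attempts and tests"
    if rp_failed_over:
        return "RP failover without halted line cards (not covered by step 6)"
    return "no ToFab FIA halt seen"


if __name__ == "__main__":
    print(step6_action(rp_failed_over=True, halted_line_cards=[3, 5]))
```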



Step 7 - *CRC and LOS errors between fabric cards and line cards/RPs*


Explanation from LC/RP to Fabric <#CRC_and_LOS_Errors_from_LC>

Explanation from Fabric to LC/RP <#CRC_and_LOS_Errors_from_Fabric>

Errors are observed from *show controller error* (not useful on 120XX) and
*show controller errors fabric counters*. The DAT_LOS (124XX and 128XX) and
DAT_CRC (128XX only) fields identify the cards. On a 120XX, the cause of
errors from LC/RP to fabric can only be determined by removing 1 card at a
time to see if the errors stop. Since the possibility is high that an in-use
line card is the problem, start with the backbone-facing cards one at a time,
then the customer-facing cards one at a time, then the cards not in use one at
a time.

If multiple cards show DAT_CRC and DAT_LOS errors, the cause is most likely a
fabric card, which can be determined from the bitmap. Reseat the suspect card
to see if the errors continue. If so, RMA the card.

*show controller errors fabric counters* shows errors from the fabric. The
bitmask will determine which card is suspect. Reset the suspect card to see if
the errors continue. If so, RMA the card.
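
Reading the bitmask by eye gets error-prone at 3 a.m., so here is a sketch of the decode. Big caveat: the bit-to-card order used below (bit 0 = CSC_0 up to bit 4 = SFC_2) is purely my assumption for illustration; check the actual bit assignment for your chassis before trusting the result. The function name is made up as well.

```python
# ASSUMED bit order, lowest bit first. Verify against your platform's output.
ASSUMED_BIT_ORDER = ("CSC_0", "CSC_1", "SFC_0", "SFC_1", "SFC_2")


def cards_from_bitmask(mask, bit_order=ASSUMED_BIT_ORDER):
    """Return the names of the cards whose bit is set in `mask`."""
    return [name for bit, name in enumerate(bit_order) if mask & (1 << bit)]


if __name__ == "__main__":
    # 0x04 has only bit 2 set, which is SFC_0 under the assumed order.
    print(cards_from_bitmask(0x04))
```

If several line cards report the same single bit, that fabric card is the one to reseat first.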

