[c-nsp] 7600 sup7203bxl error msg
Sukumar Subburayan
sukumars at cisco.com
Fri Nov 10 15:07:42 EST 2006
Just to close on this: it looks like the DEs want to print a syslog message
every time the software does a system controller soft reset, so that we
can correlate the information with the event that happened (in
the case below, the event that caused the system controller reset
is TM_DATA_PARITY_ERROR). For cases where no event is happening
and the system controller is still resetting, we can investigate further.
However, the syslog text was not changed appropriately to simply say "RESET";
it is misleading to print "Excessive Reset".
We will fix this minor cosmetic bug in an upcoming release.
The DDTS tracking this fix is CSCsg69605.
sukumar
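A minimal sketch of the correlation described above, written against the two
message formats quoted in this thread. The function name and regular
expression are illustrative assumptions, not anything IOS provides, and the
sketch assumes each syslog message has been joined onto a single line. It
pairs each reset message with the error logged at the same timestamp, and
reports None when a reset has no accompanying error (the case worth
investigating further).

```python
import re

# Matches the two SYSTEM_CONTROLLER messages quoted in this thread.
# The optional "SP-" covers messages logged by the switch processor.
MSG_RE = re.compile(
    r"^(?P<ts>.+?): %SYSTEM_CONTROLLER-(?:SP-)?3-"
    r"(?P<kind>ERROR|EXCESSIVE_RESET): (?P<text>.*)$"
)


def correlate_resets(lines):
    """Return one (timestamp, cause) pair per reset message.

    cause is the error name logged with the same timestamp, or None if
    no ERROR message shares that timestamp.
    """
    causes = {}  # timestamp -> error name, e.g. TM_DATA_PARITY_ERROR
    resets = []  # timestamps of reset messages, in log order
    for line in lines:
        m = MSG_RE.match(line.strip())
        if not m:
            continue
        if m.group("kind") == "ERROR":
            # Text looks like "Error condition detected: <NAME>".
            causes[m.group("ts")] = m.group("text").rsplit(": ", 1)[-1]
        else:
            resets.append(m.group("ts"))
    return [(ts, causes.get(ts)) for ts in resets]


if __name__ == "__main__":
    log = [
        "Nov 9 05:25:34: %SYSTEM_CONTROLLER-3-ERROR: "
        "Error condition detected: TM_DATA_PARITY_ERROR",
        "Nov 9 05:25:34: %SYSTEM_CONTROLLER-3-EXCESSIVE_RESET: "
        "System Controller is getting reset so frequently",
    ]
    print(correlate_resets(log))
```

Note that this sketch keys on the timestamp only, so RP and SP messages
logged at the same instant would be treated as one event.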
On Thu, 9 Nov 2006, Sukumar Subburayan wrote:
> I think we have a bug here.
>
> We wanted to add a syslog message only when we have excessive system
> controller soft resets (n resets within y seconds), to warn the user. This
> enhancement was requested via CSCed21601.
>
> However, I looked at the bug, and it looks like we are simply printing this
> syslog message every time we soft reset the system controller, which
> is not correct.
>
> For now, you can ignore the 2nd message. We will fix this in an upcoming
> release.
>
> Also, there is no correlation between 'Mistral Error Interrupt' and IBC
> resets in 'show ibc'. So, they don't have to match.
>
> sukumar
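To illustrate the "n resets within y seconds" behaviour that was originally
intended, here is a hedged sketch of a sliding-window check. It is not the
IOS implementation; the class name, method name, and the values of n and y
are hypothetical, since the thread does not give them.

```python
from collections import deque


class ExcessiveResetDetector:
    """Warn only when at least max_resets occur within window_seconds."""

    def __init__(self, max_resets, window_seconds):
        self.max_resets = max_resets          # "n" in the message above
        self.window_seconds = window_seconds  # "y" in the message above
        self._times = deque()                 # recent reset times, oldest first

    def record_reset(self, now):
        """Record a reset at time `now` (in seconds).

        Returns True if this reset makes the count within the window
        reach max_resets, i.e. the warning should be printed.
        """
        self._times.append(now)
        # Drop resets that are older than the window.
        while self._times and now - self._times[0] > self.window_seconds:
            self._times.popleft()
        return len(self._times) >= self.max_resets


if __name__ == "__main__":
    detector = ExcessiveResetDetector(max_resets=3, window_seconds=60)
    for t in (0, 10, 20):
        print(t, detector.record_reset(t))
```

With a check like this, a single transient reset would not trigger the
"excessive" warning, whereas the behaviour reported in this thread prints
it on every reset.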
>
> On Thu, 9 Nov 2006, Dale W. Carder wrote:
>
>> We too got this error on a SXF4 router last night. Guess you're
>> not alone :-)
>>
>> Nov 9 05:25:34: %SYSTEM_CONTROLLER-3-ERROR: Error condition detected:
>> TM_DATA_PARITY_ERROR
>> Nov 9 05:25:34: %SYSTEM_CONTROLLER-3-EXCESSIVE_RESET: System Controller is
>> getting reset so frequently
>>
>> In our case, it looks like the problem came from the RP instead
>> of the SP. From "sh ibc" it looks like the number of IBC resets
>> that occurred was 2, and any other error counter is 0. "sh stack"
>> says the Mistral Error Interrupt process has only been called once.
>>
>> Dale
>>
>> ----------------------------------
>> Dale W. Carder - Network Engineer
>> University of Wisconsin at Madison
>> http://net.doit.wisc.edu/~dwcarder
>>
>>
>> On Nov 8, 2006, at 6:07 PM, Sukumar Subburayan wrote:
>>> Every time we get a parity error in the system controller, we try to soft
>>> reset the IBC and recover from the condition. Things should be just fine
>>> if this was a transient, one-off case.
>>>
>>> However, if the issue is persistent, we will be constantly resetting the
>>> IBC and hence dropping packets. The second syslog message is to warn you
>>> that you are seeing excessive IBC resets.
>>>
>>> According to your syslog your SP's system controller is what is reporting
>>> the parity error.
>>>
>>> Is the output of 'show ibc' below from the SP-side?
>>>
>>> If not, can you get us 'remote command switch show ibc'?
>>>
>>> sukumar
>>>
>>>
>>>
>>> On Wed, 8 Nov 2006, matt carter wrote:
>>>
>>>>
>>>> hey all,
>>>>
>>>> hoping someone may have an insight into some errors i have not seen
>>>> before
>>>>
>>>> Nov 7 11:24:46.535 AEST: %SYSTEM_CONTROLLER-SP-3-ERROR: Error condition
>>>> detected: TM_DATA_PARITY_ERROR
>>>>
>>>> Nov 7 11:24:46.535 AEST: %SYSTEM_CONTROLLER-SP-3-EXCESSIVE_RESET: System
>>>> Controller is getting reset so frequently
>>>>
>>>> first one is fair enough
>>>>
>>>> Explanation: The most common errors from the Mistral ASIC on the
>>>> supervisor engine are TM_DATA_PARITY_ERROR and TM_NPP_PARITY_ERROR.
>>>> Possible causes of these parity errors are random static discharge or
>>>> other external factors.
>>>> Recommended Action: If the error message is only seen once (or rarely),
>>>> the recommendation is to monitor the switch syslog to confirm the error
>>>> message was an isolated incident. If these error messages are
>>>> reoccurring, open a case with the Technical Assistance Center.
>>>>
>>>> but the construct of the second error seems to suggest this is not an
>>>> "isolated incident", since my system controller is being reset "so
>>>> frequently", which makes the log messages somewhat contradictory.
>>>> when i go looking at the SP stack i can only see the mistral error
>>>> interrupt called once.
>>>>
>>>> from show stack
>>>> Interrupt level stacks:
>>>> Level Called Unused/Size Name
>>>> 5 1 7168/9000 Mistral Error Interrupt
>>>>
>>>> when i check the ibc stats i can see the mistral hardware was definitely
>>>> reset at the time of the incident, and has been reset 3 times, but there
>>>> are 0 errors.
>>>>
>>>> anyone have ideas on how to proceed in terms of catching this if it
>>>> happens again?
>>>>
>>>> from show ibc
>>>> Interface information:
>>>> Interface IBC0/0(idb 0x43090238)
>>>> Hardware is Mistral IBC (revision 5)
>>>> 5 minute rx rate 5000 bits/sec, 11 packets/sec
>>>> 5 minute tx rate 12000 bits/sec, 22 packets/sec
>>>> 4900820 packets input, 317526199 bytes
>>>> 4788796 broadcasts received
>>>> 10253179 packets output, 730435109 bytes
>>>> 68213 broadcasts sent
>>>> 0 Packets CEF Switched, 0 Packets Fast Switched
>>>> 0 Packets SLB Switched, 0 Packets CWAN Switched
>>>> IBC resets = 3; last at 11:24:46.535 AEST Fri Nov 7 2006
>>>> MISTRAL ERROR COUNTERS
>>>> System address timeouts = 0 BUS errors = 0
>>>> IBC Address timeouts = 0 (addr 0x0)
>>>> Page CRC errors = 0 IBL CRC errors = 0
>>>> ECC Correctable errors = 0
>>>> Packets with padding removed (0/0/0) = 0
>>>> Packets expanded (0/0) = 0
>>>> Packets attempted tail end expansion > 1 page and were dropped = 0
>>>> IP packets dropped with frag offset of 1 = 0
>>>> 0 packets (aggregate) dropped on throttled interfaces
>>>> Hazard Illegal packet length = 0 Illegal Offset = 0
>>>> Hazard Packet underflow = 0 Packet Overflow = 0
>>>> IBL fill hang count = 0 Unencapsed packets = 0
>>>> LBIC RXQ Drop pkt count = 0 LBIC drop pkt count = 0
>>>> LBIC Drop pkt stick = 0
>
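On the question of catching this if it happens again, one option is to
collect 'show ibc' output periodically (or 'remote command switch show ibc'
for the SP side) and compare the reset counter between samples. The sketch
below only parses the "IBC resets" line in the format shown in this thread;
how the output is collected is left out, and the function names are
hypothetical.

```python
import re

# Matches e.g. "IBC resets = 3; last at 11:24:46.535 AEST Fri Nov 7 2006".
# The "last at" part is treated as optional, as an assumption, in case it
# is absent on some outputs.
IBC_RESET_RE = re.compile(r"IBC resets = (\d+)(?:; last at (.+))?")


def parse_ibc_resets(show_ibc_output):
    """Return (reset_count, last_reset_text_or_None), or None if not found."""
    m = IBC_RESET_RE.search(show_ibc_output)
    if m is None:
        return None
    last = m.group(2).strip() if m.group(2) else None
    return int(m.group(1)), last


def new_resets(previous_output, current_output):
    """Return how many resets occurred between two 'show ibc' samples.

    Returns None if either sample lacks the "IBC resets" line.
    """
    prev = parse_ibc_resets(previous_output)
    curr = parse_ibc_resets(current_output)
    if prev is None or curr is None:
        return None
    return curr[0] - prev[0]
```

A non-zero result from new_resets() between two samples gives a time
bracket for the reset, which can then be matched against the syslog.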