[c-nsp] Sup-720 fabric failures

Pete Lumbis alumbis at gmail.com
Fri Jul 5 00:19:57 EDT 2013


Robert,

You probably have bad hardware, but two sups both being bad sounds a little
suspect.

This partially matches CSCtx83944. Would you be willing to send me the
serial numbers of the failed modules?
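
If you don't have them to hand, 'show inventory' should list a serial for
each module, and 'show idprom module 1' dumps the supervisor's full
SEEPROM contents, serial included:

#show inventory
#show idprom module 1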

Otherwise, it's probably worth opening a case with TAC and seeing what they
say. My guess is an RMA.
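
If you do open one, capture the fabric error counters and the diagnostic
results from the failed state first - roughly along these lines (exact
output format varies by release):

#show fabric errors
#show diagnostic result module 1 detail

'show fabric errors' should report per-slot, per-channel error counters
(sync/heartbeat/CRC), which you can line up against the slot 1 channel in
your log.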

Regards,
Pete Lumbis
TAC Routing Protocols Technical Leader


On Thu, Jul 4, 2013 at 7:44 PM, Robert Williams <Robert at custodiandc.com> wrote:

> Hi,
>
> Got a weird, persistent issue and I'd like to know if anyone else has
> seen it. We have a site with a 6503-E chassis, with a 720-3bxl in slot 1
> and a 6516A-gbic in slot 3. It had been running fine (for 310 days) until
> recently, when the facility it's hosted at got very cold (supply air at
> 15 degrees C, hitting the base of the chassis through the in-rack floor
> vent). It then started crashing, and once it warmed up it was fine again.
>
> It crashed each time with the same SP error:
>
> %FABRIC-SP-3-DISABLE_FAB: The fabric manager disabled active fabric in
> slot 1 due to the error (2) on this channel (FPOE 4) connected to slot 1
>
> The most useful commands I found were:
>
> #show fabric fpoe map
> slot   channel   fpoe
>  1        0       4
>  1        1       0
>  2        0       5
>  2        1       14
>  3        0       2
>  3        1       11
>
> #show fabric fpoe interface gi1/1
> fpoe for GigabitEthernet1/1 is 4
>
> #show fabric fpoe interface gi1/2
> fpoe for GigabitEthernet1/2 is 4
>
> This suggests that the fabric channel in question serves the supervisor's
> own onboard ports.
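>
> The per-slot, per-channel fabric state can also be checked directly, and
> should show whether the channel to slot 1 is flapping:
>
> #show fabric status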
>
> Now, once we realised it was temperature-related, we decided to get it
> back to our lab and test it to see which component was at fault. So we
> simply swapped the whole unit out in one go - the chassis, line card,
> sup, both PSUs and PEMs included, even the fan tray. Only the 6 GBICs
> remained, and they were connected back into the new line card.
>
> In the lab, we made it fail consistently at around 17 degrees C (it would
> last around 3 minutes at that temperature before crashing). If you took
> it to about 21 degrees C, it would run all day just fine.
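>
> For anyone repeating this test, the chassis's own per-module temperature
> sensors should be readable alongside the room readings:
>
> #show environment temperature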
>
> Then, 2 days later, the new chassis we had installed back at the same
> site suddenly crashed. Believe it or not, it was exactly the same error!
>
> It's running 15.1(1)SY1, doing full-table iBGP with 5 peers, and has
> around 10 VLANs and not a lot else. Average traffic is around 500 Mbit/s
> total.
>
> Prior to this failure, the (original) chassis had an uptime of 310 days
> (running an older IOS; we upgraded it when it was replaced).
>
> Is it possible that one of the GBICs (the only components not swapped)
> could cause this?
>
> Any ideas or suggestions most welcome, as we've literally run out of
> components to swap over!
>
> Cheers,
>
>
> Robert Williams
> Custodian Data Centre
> Email: Robert at CustodianDC.com
> http://www.CustodianDC.com
>
>
>
>
> _______________________________________________
> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/
>
