[c-nsp] Sup-720 fabric failures

Thu Jul 4 19:44:36 EDT 2013

Hi,

Got a weird persistent issue which I'd like to know if anyone else has seen. We have a site with a 6503-E chassis, with a 720-3bxl in slot 1 and a 6516A-gbic in slot 3. It had been running fine (for 310 days) until recently, when the facility it's hosted at got very cold (supply air at 15 degrees, hitting the base of the chassis through the in-rack floor vent). It then kept crashing until the facility warmed up, after which it was fine again.

It crashed each time with the same SP error:

%FABRIC-SP-3-DISABLE_FAB: The fabric manager disabled active fabric in slot 1 due to the error (2) on this channel (FPOE 4) connected to slot 1

The most useful commands I found were:

#show fabric fpoe map
slot   channel   fpoe
 1        0       4
 1        1       0
 2        0       5
 2        1       14
 3        0       2
 3        1       11

#show fabric fpoe interface gi1/1
fpoe for GigabitEthernet1/1 is 4

#show fabric fpoe interface gi1/2
fpoe for GigabitEthernet1/2 is 4

This suggests that the channel in question (FPOE 4) is the one serving the supervisor's own onboard ports.
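
For anyone who wants to repeat the cross-reference above on their own box, here is a rough Python sketch that pulls the FPOE number out of the DISABLE_FAB message and looks it up in the "show fabric fpoe map" output. The function names (parse_fpoe_map, parse_disable_fab) and the regex are made up for illustration and are only matched against the exact text pasted in this post; other IOS releases may format the message or the table differently.

```python
import re

# Output of "show fabric fpoe map", pasted verbatim from this post.
FPOE_MAP_OUTPUT = """\
slot   channel   fpoe
 1        0       4
 1        1       0
 2        0       5
 2        1       14
 3        0       2
 3        1       11
"""

# The SP error logged at each crash, pasted verbatim from this post.
LOG_LINE = (
    "%FABRIC-SP-3-DISABLE_FAB: The fabric manager disabled active fabric "
    "in slot 1 due to the error (2) on this channel (FPOE 4) connected to slot 1"
)


def parse_fpoe_map(text):
    """Return {fpoe: (slot, channel)} built from the three-column table."""
    mapping = {}
    for line in text.splitlines():
        fields = line.split()
        # Skip the header and anything that is not three integers.
        if len(fields) == 3 and all(f.isdigit() for f in fields):
            slot, channel, fpoe = map(int, fields)
            mapping[fpoe] = (slot, channel)
    return mapping


def parse_disable_fab(line):
    """Extract error code, FPOE and slot from a DISABLE_FAB message.

    The pattern is an assumption based on the single sample above.
    Returns None if the line does not match.
    """
    match = re.search(
        r"error \((\d+)\) on this channel \(FPOE (\d+)\) connected to slot (\d+)",
        line,
    )
    if match is None:
        return None
    error, fpoe, slot = map(int, match.groups())
    return {"error": error, "fpoe": fpoe, "slot": slot}


event = parse_disable_fab(LOG_LINE)
fpoe_map = parse_fpoe_map(FPOE_MAP_OUTPUT)
slot, channel = fpoe_map[event["fpoe"]]
print(f"FPOE {event['fpoe']} -> slot {slot}, channel {channel}")
```

With the data from this post it reports slot 1, channel 0, which matches what the per-interface lookups for gi1/1 and gi1/2 show.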

Now, once we realised it was temperature related, we decided to get it back to our lab and test it to see which component it was. So we simply swapped the whole unit out in one go - the chassis, line card, sup, both PSUs and PEMs included, even the fan tray. Only the 6 GBICs remained and were connected back into the new line card.

In the lab, we made it consistently fail around 17 degrees (it would last around 3 minutes at that temperature before crashing). If you took it to about 21 degrees it would run all day just fine.

Then 2 days later, the new chassis we had installed back at the same site suddenly crashed. And, you guessed it, with exactly the same error!

It's running 15.1(1)SY1, doing full-table BGP to 5 other iBGP peers, and has around 10 VLANs and not a lot else. Average traffic is around 500 Mbit/s total.

Prior to this failure, the (original) chassis had an uptime of 310 days (running an older IOS; we upgraded it when it got replaced).

Is it possible that a GBIC (the only component not swapped) could cause this?

Any ideas or suggestions most welcome, as we've literally run out of components to swap over!

Cheers,

Robert Williams
Custodian Data Centre
Email: Robert at CustodianDC.com
http://www.CustodianDC.com