[c-nsp] C6K VSS outage after forced SSO switchover

Alasdair McWilliam alasdairm at gmail.com
Mon Jul 27 06:30:54 EDT 2009


Hello all,

We've got a Cisco 6509 VSS deployment at a new data centre running
12.2(33)SXI1. The DC itself isn't live yet, so we were doing some final
resilience testing, which involved forcing a node failover to record
what traffic loss, if any, we would experience if a node failed.

We had various pings going to pieces of kit during the test, and as
soon as the 'redundancy force-switchover' command was entered, latency
started to increase and pings started to drop out. Within 15-20
seconds, access to the VSS was lost and all our management VPNs went
offline.

We had an engineer on site who was able to pull some logs, and our
EIGRP sessions to a pair of ASR1k boxes were cycling constantly
(timeouts, peer terminations). CPU utilisation on the newly active
node was around 90%:

CPU utilization for five seconds: 74%/67%; one minute: 87%; five minutes: 90%

I've gone through every process on the MSFC and at most can account
for 5% utilisation, from the ARP Input process. Everything else was
showing roughly 0%. I will note that we didn't get the CPU info for
the SP, but instinct suggests this was an STP issue because the VSS
itself was OK. The failed node came back OK and assumed the Standby
role, and its interfaces came back online. I could tell this from the
ASR1k boxes, as the interfaces went up/up and I could see the VSS in
CDP.
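
A note on reading the CPU line quoted earlier: in IOS 'show processes
cpu' output, the five-second figure is printed as total/interrupt, so
the second number is time spent at interrupt level and would not show
up against any individual process. The sketch below is only an
illustration of that arithmetic; the function name split_cpu is made
up for this example, and it assumes the line format shown in this post.

```python
import re

# The line exactly as captured from the newly active node.
CPU_LINE = ("CPU utilization for five seconds: 74%/67%; "
            "one minute: 87%; five minutes: 90%")


def split_cpu(line):
    """Split the five-second figure into total, interrupt and process load.

    Hypothetical helper: assumes the 'five seconds: X%/Y%' layout, where
    X is total CPU and Y is the share spent at interrupt level.
    """
    match = re.search(r"five seconds: (\d+)%/(\d+)%", line)
    if match is None:
        raise ValueError("unrecognised CPU utilization line")
    total = int(match.group(1))
    interrupt = int(match.group(2))
    # Whatever is left after interrupt time is what the per-process
    # table can account for.
    return {"total": total, "interrupt": interrupt,
            "process": total - interrupt}


print(split_cpu(CPU_LINE))
# -> {'total': 74, 'interrupt': 67, 'process': 7}
```

On that reading, only about 7% of the five-second load was process
level, which would be in line with the roughly 5% seen against ARP
Input. High interrupt-level CPU on the route processor often points
to traffic being punted and switched in software, though nothing in
the logs captured here confirms that was the cause.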

The failed node was reporting this via the active MSFC:

%FABRIC-SW2_SPSTBY-6-TIMEOUT_ERR: Fabric in slot 5 detected excessive
flow-control on channel 18 (Module 5, fabric connection 0)

The VSS itself never recovered and in the end we just had to ask our
engineer to physically power down both boxes. The VSS then came back
up as normal.

Has anyone else experienced this, or a similar issue, with a VSS?

I've found bug ID CSCsx27836 on the Cisco bug tracker, which in
summary advises that a VSS can get stuck in an L2 loop with high CPU
utilisation after a node failover. However, it specifically
stipulates that the issue occurs when the standby node is failed. We
failed the active node. I've raised a query via our account team and
will probably request a TAC case to be opened via our partner.

Any info would be appreciated!

Regards
Alasdair

