[c-nsp] Very strange ME3600 err-disabled on Te0/1 Te0/2 problem

Joe Bender jbender at clearrate.com
Wed Sep 30 15:45:45 EDT 2015


Everyone,

We've had a very strange problem with our ME3600s (specifically the ME-3600X-24TS-M), which we use in a ring topology with the ten-gigabit ports running purely as "no switchport" L3 links between sites.  All of them are running 15.3(3)S5 at the moment.

We've had several incidents on multiple switches (5, to be exact) where both Te0/1 and Te0/2 drop offline for unknown reasons.  Which switch fails, and when, seems to be somewhat random.

Symptoms are:

BFD detects a node down on both interfaces simultaneously.
30-45 seconds later, UDLD detects a unidirectional link and puts both interfaces into err-disabled.
The other side doesn't notice anything amiss, other than the interface going down because the node having issues shuts the port down.
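
The symptom sequence above could be checked for in collected logs with something like the following. This is only a rough sketch: it assumes the log lines have already been parsed into (timestamp, interface, kind) tuples, and the kind labels "bfd_down" and "udld_errdisable", the function name, the 2-second and 60-second windows, and the sample timestamps are all made up for illustration rather than taken from IOS output.

```python
from datetime import datetime, timedelta


def find_incidents(events, ports=("Te0/1", "Te0/2"),
                   bfd_window=timedelta(seconds=2),
                   udld_window=timedelta(seconds=60)):
    """Return the BFD-down times on ports[0] that match the symptom pattern.

    events: iterable of (timestamp, interface, kind) tuples, where kind is
    the hypothetical label "bfd_down" or "udld_errdisable".
    """
    events = list(events)

    def times(port, kind):
        return [t for t, p, k in events if p == port and k == kind]

    incidents = []
    for start in sorted(times(ports[0], "bfd_down")):
        # Every other port must lose BFD at (nearly) the same moment.
        bfd_together = all(
            any(abs(t - start) <= bfd_window for t in times(p, "bfd_down"))
            for p in ports[1:]
        )
        # Every port must then be err-disabled by UDLD shortly afterwards.
        udld_followed = all(
            any(timedelta(0) <= t - start <= udld_window
                for t in times(p, "udld_errdisable"))
            for p in ports
        )
        if bfd_together and udld_followed:
            incidents.append(start)
    return incidents


# Made-up sample data, not taken from a real switch.
t0 = datetime(2015, 1, 1, 12, 0, 0)
sample = [
    (t0, "Te0/1", "bfd_down"),
    (t0 + timedelta(seconds=1), "Te0/2", "bfd_down"),
    (t0 + timedelta(seconds=35), "Te0/1", "udld_errdisable"),
    (t0 + timedelta(seconds=40), "Te0/2", "udld_errdisable"),
    # A BFD flap on one port only, which should not count as an incident.
    (t0 + timedelta(hours=1), "Te0/1", "bfd_down"),
]
print(find_incidents(sample))
```

Only the first BFD drop in the sample is reported, because the later one affects a single port and is not followed by UDLD err-disable.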

Active interventions, like removing UDLD from the ports and then doing a shut/no shut, won't bring the ports back online.  They'll come out of err-disabled, but that's about it.  The *only* thing that'll bring the ten-gigabit interfaces back online is a switch reload.  There are no tracebacks or anything else.

Recent changes in the network include installing two ASR920s running a current IOS XE version (03.16.00.S) to replace a single ME3600, in order to remove a single point of failure at a critical spot in the ring.  The nodes that have had issues are the "closest" nodes in the ring to where the ASRs were installed, but aren't necessarily directly connected to said ASRs.  These also tend to be the nodes with the highest amount of traffic flowing through them.

At the moment, we haven't had a problem with them in several days, and as a result Cisco TAC has been somewhat useless, as they're insisting they can't help us unless they can actively look at one of the switches when it fails.  This, for us, is very difficult, because I can't wait the 35-40 minutes it usually takes TAC to get in, start a WebEx, etc., as this is taking our customers down, and they're understandably getting a bit twitchy with some of these issues.  I can't even get a script to run against the switch to pull additional information beyond a show tech, because they're insisting they can't tell me what they might run to get info (which I find strange).

One TAC team thought it was a bug or some sort of ASIC/software issue, but they haven't been able to isolate anything that might be the cause, bug ID or otherwise.  Right now I'm dealing with someone who seems to be concentrating on the fact that he's seeing UDLD firing as a result of the interfaces going offline.  I'm also getting an immense amount of pushback from the support engineers about getting this escalated, because we don't have a lot of information about the problem and it hasn't come back in days.

As a result, I'm reaching out to the list here, because we're at our wits' end, afraid that we're sitting on a ticking time bomb in our network, waiting for one of these sites to have both TenGigabit interfaces fall offline at the same time again.  If anyone has seen ANYTHING like this happen on the ME3600 ten-gig ports, or can get me in touch with a Cisco resource who will actually take me seriously and help us look for causes instead of constantly blaming UDLD (seriously, both interfaces at the same time?), I'd appreciate the info.
-Joseph Bender


