[c-nsp] Very strange ME3600 err-disabled on Te0/1 Te0/2 problem

Fri Oct 2 01:08:40 EDT 2015

Apologies to everyone on the list, things got busy right after I sent that post.

>Does it affect both Te ports at EXACTLY the same time? Or does one go down then a few seconds later the other? If it's both at the same time, that smells a bit like an ASIC issue, as Adam said they share an ASIC for the TenG ports.
>
Yes, exactly at the same time, and I've been arguing that it's probably an ASIC issue of some kind.  I had one TAC team that seemed to think the same thing, but without them running their secret-sauce commands against a device while it was having the problem, they're stuck and can't/won't do anything else.

>When you say this has happened on 5 switches, all 5 were ME3600-X's?
>
	Yes.  I wasn't clear on that before, but they're all the same model of ME3600X (24TS)

>Has it ever happened on more than one switch at the same time, two neighbouring switches for example have had all their TenG interfaces cease to function at the same time?
>
	Not exactly at the same time.  They will fail within a couple of hours of each other, and tend to be the "next" switch in the immediate path of a switch that had the TenG ports die.

>Do you have out of band access to these PoPs? When you talk about getting TAC via WebEx and rebooting the switches I assume you do, is there anything else on the switch that isn't working? I.e. all the 1G ports (or however your customers are connected) are working? BGP sessions up etc?
>
	Yes, we do, which is how we can get at least a show tech off the devices, and now, if it does it again, some platform asic commands off the things.   The 1G ports work just fine, including BFD, BGP, LDP, etc, which is, again, why we also suspect an ASIC issue.

>Do any of these PoPs where you have had the issue have low enough traffic you could switch to a 1G link between them to trial for a while?
>
	Interestingly enough, we deployed a second ME3600 to act as a backstop to the first at a POP, and cross-connected them at a gig.  That switch had a port failure not 12 hours later...  But...LDP and OSPF/BFD were just fine into the thing via the 1gbit ports we had connected between them, which kept that unit online.  We're actually retrofitting 1gbit connections into POPs that seem the most vulnerable to keep those POPs at least somewhat online if the problem comes back.

>Adam pointed out the RPs could be too busy to service the BFD requests, is the CPU high when the issue occurs? Are you able to mirror the ports (or SPAN) at one of the PoPs on the 10G ports, to see if any traffic is actually coming over from the neighbouring PoP, or being sent from the local switch?
>
	Don't have SPAN capability yet to the things to record what's going on when things go down.  Working on a possible solution to that.  The CPU doesn't spike, though we can increase the polling frequency for that stat to see if we're missing a spike in there somewhere.  I'm honestly not sure if this is a CPU loading issue though, as UDLD also fires and error-disables the port, and shut/no-shut doesn't bring the ports back.  The ONLY thing that'll bring the TenGig ports back from the dead is a reload of the chassis.  We just got done backing UDLD off all of the ports to satisfy a TAC request to have it where the ports don't get error-disabled and are left in an up-but-hung state until they can look at it (which is also why we're hauling backup bandwidth around to mitigate interface outages).

>You said the switches seem to fail randomly, can you increase your NMS polling (since this sounds like a bug being triggered) to correlate the interface shutdowns to something like a spike in traffic, spike in latency between PoPs, IGP/BGP update/topology change coming through, LDP/RSVP update coming through?
>

This is something we're investigating, but have not yet implemented.
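
For what it's worth, a rough sketch of the kind of correlation being suggested, assuming the interface-down times and the NMS events can both be exported as timestamps. The switch names, event labels, sample times, and the five-minute window are all made up for illustration and are not data from our network:

```python
from datetime import datetime, timedelta

def events_before(failure_time, events, window=timedelta(minutes=5)):
    """Return the (timestamp, label) events seen within `window` before failure_time."""
    return [
        (ts, label)
        for ts, label in events
        if timedelta(0) <= failure_time - ts <= window
    ]

def correlate(failures, events, window=timedelta(minutes=5)):
    """Map each switch name to the events that shortly preceded its TenG failure."""
    return {
        switch: events_before(failed_at, events, window)
        for switch, failed_at in failures.items()
    }

# Hypothetical sample data: when each switch lost its TenG ports.
failures = {
    "pop-a": datetime(2015, 9, 30, 14, 2, 10),
    "pop-b": datetime(2015, 9, 30, 15, 47, 30),
}

# Hypothetical NMS events (traffic spikes, routing churn, and so on).
events = [
    (datetime(2015, 9, 30, 13, 58, 0), "traffic spike"),
    (datetime(2015, 9, 30, 14, 30, 0), "BGP update burst"),
    (datetime(2015, 9, 30, 15, 46, 45), "LDP session flap"),
]

if __name__ == "__main__":
    for switch, hits in correlate(failures, events).items():
        print(switch, [label for _, label in hits])
```

The window only looks backwards from each failure, since an event that happens after the ports die can't have triggered it. The catch is that this is only as good as the polling interval, so it won't show much until the NMS polling is tightened up.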

Thanks for taking the time to get back to me.  I've had other private responses that I haven't gotten to yet, but I did want to say that I am very appreciative of everyone who has written back to me on this issue so far.

-Joseph