[c-nsp] Very strange ME3600 err-disabled on Te0/1 Te0/2 problem

Adam Vitkovsky Adam.Vitkovsky at gamma.co.uk
Thu Oct 1 03:48:04 EDT 2015


That's interesting. I have never come across this one; it might be something new post S3 (but I have never used UDLD).
The fact that this happened on 5 switches shows it's not a case of both cables being squeezed in the closet door and causing an RX failure.
And since it's both TE ports, it's unlikely that both SFP+ modules lost power or RX capability at the same time due to module failure (and again, this happened on 5 switches).

> Joe Bender
> Sent: Wednesday, September 30, 2015 8:46 PM
> Symptoms are:
>
> Both interfaces have BFD detect a node down, simultaneously.
OK, so BFD stopped seeing incoming Hellos and declared the neighbour dead.
It would be worth checking whether the other side notices anything:
- if not, then only RX was hit by the issue
- if the other side went down as well, then both TX and RX went down.

Are you using echo mode (the default) or async mode, i.e. "no bfd echo"?
Are you using carrier-delay 0 on the interfaces?

On ME3600-x, BFD is handled by IOS on the RP of the switch.
So it might be that the RP gets busy and can't process BFD messages, or it could be a spike in traffic delaying them.
But that doesn't explain why UDLD goes down as well some time later,
- as that leaves the interface non-operational for quite some time.

Anyway, BFD can be examined with these:
-------------------------------------------------------
1. CPU congestion  <- Please be aware of CSCuc59105, "Crash when running 'show platform qos policer cpu' due to SSH"
      "show platform aspdma all_counters 0"   <- BFD uses Queue 1; taken 2 or 3 times in a row
      "show platform qos policer cpu 1 0"     taken 2 or 3 times in a row
2. CPU utilization
      "show processes cpu"
      "show processes cpu sorted"
3. BFD commands
      "show bfd drops"
      "show bfd neighbor detail"
4. Interface discard counters
      "show interfaces counters errors"       taken 2 or 3 times in a row
5. ISIS
      "show isis neighbor detail"
6. ARP  <- to check whether the flapping occurs because of incorrect or missing ARP learning
      "show ip arp"
7. MAC address learning
      "show mac-address-table dynamic"

sh platform ho-fpga tx-buffer-table de 1
-------------------------------------------------------
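Several of the outputs above are meant to be taken 2 or 3 times in a row, because what matters is which counters move between captures. A minimal sketch of that comparison in Python, assuming the counters have already been extracted from the CLI output into plain name-to-value dictionaries (the parsing is left out because the output format differs between commands and releases, and the counter names below are made up for illustration):

```python
def counter_deltas(snapshots):
    """Return {counter: [delta 1->2, delta 2->3, ...]} for counters that changed.

    snapshots: list of dicts mapping counter name -> integer value,
    in the order the captures were taken.
    """
    deltas = {}
    for earlier, later in zip(snapshots, snapshots[1:]):
        for name in later:
            if name in earlier:
                deltas.setdefault(name, []).append(later[name] - earlier[name])
    # Keep only counters that moved at least once between captures.
    return {name: d for name, d in deltas.items() if any(d)}


# Hypothetical values for illustration only, not real switch output.
captures = [
    {"queue1_drops": 0, "in_discards": 12},
    {"queue1_drops": 40, "in_discards": 12},
    {"queue1_drops": 95, "in_discards": 12},
]
print(counter_deltas(captures))
```

A counter that stays flat across all captures is filtered out, so anything left in the result is worth a closer look.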



So it might be that the whole RX function of the ASIC fails (the TenGE ports share the same ASIC),
- which is more likely, since shut/no-shut doesn’t help and the issue clears only after a restart.


> 30-45 seconds later UDLD detects unidirectional link, puts both interfaces into
> error-disabled.
Yeah, some time later UDLD realizes it hasn't received any messages, so it error-disables the port.
But I think you shouldn't need to run UDLD if you are using BFD echo mode.


> Other side doesn't notice anything amiss other than the interface going
> down because the node having issues shuts the port down.
>
So yeah it would be worth knowing if the remote end goes down immediately or only after UDLD brings the port down.
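
That timing check can be done by lining up three log timestamps. A rough sketch in Python, assuming both switches have synchronised clocks (e.g. NTP) and that the three timestamps have been read out of the logs by hand; the function name, the labels it returns and the 2 second slack are my own assumptions, not anything from the platform:

```python
from datetime import datetime, timedelta


def remote_down_trigger(bfd_down, udld_errdisable, remote_down,
                        slack=timedelta(seconds=2)):
    """Guess what took the remote end down, based on when it happened.

    bfd_down:        when the local node logged BFD down
    udld_errdisable: when the local node err-disabled the port via UDLD
    remote_down:     when the remote node logged its side going down
    slack:           tolerance for clock skew and logging delay (assumed)
    """
    if remote_down >= udld_errdisable - slack:
        return "after-udld"   # remote only reacted to the port being shut
    if abs(remote_down - bfd_down) <= slack:
        return "with-bfd"     # remote saw the failure at the same time
    return "inconclusive"


# Hypothetical timestamps for illustration only.
t0 = datetime(2015, 9, 30, 10, 0, 0)
print(remote_down_trigger(t0, t0 + timedelta(seconds=40),
                          t0 + timedelta(seconds=40, milliseconds=500)))
```

"after-udld" would point to an RX-only problem on the affected node, while "with-bfd" would suggest TX was hit as well.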

> I can't even get a script to run against the switch for us to pull additional
> information other than a show tech because they're insisting they can't tell
> me what they might run to get info (which I find strange)
>
Tell them to provide you with all the commands to check the status of the 10GE ASIC.
These would be in the service mode, which you enable with:
conf t
 service internal
ex
and you get in with:
sdcli


> One TAC team thought it was a bug or some sort of ASIC/software but
> haven't been able to isolate anything that might be the cause, bug-id or
> otherwise.  Right now I'm dealing with someone who seems to be
> concentrating on the fact that he's seeing UDLD firing as a result of the
> interfaces going offline.  I'm also getting an immense amount of pushback
> from the support engineers about getting this escalated because we don't
> have a lot of information about the problem, and it hasn't come back in days.
>
But it has already happened on 5 switches, so this should be a hot case.
Have them build the exact topology with the exact configs to see if they can replicate the bug.
If you are not satisfied with the engineer on the case, ask for a new one.

adam


