[c-nsp] ME3600's BFD-related "outage" between directly connected ME's and ASR1001's (Same rack)

Adam Vitkovsky Adam.Vitkovsky at gamma.co.uk
Fri Jan 22 04:45:30 EST 2016


Hi,

On the ME3600X, BFD is handled by the central CPU, so if two BFD sessions were affected at once, followed by a third a couple of seconds later, it could be some process hogging the CPU for a blip of time and causing the BFD sessions to reset.
That is something you wouldn't notice on the CPU utilization graph.
Maybe an SPF computation took place; check "show ip ospf statistics".
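To put numbers on that "blip": BFD (RFC 5880) declares a session down when no control packet arrives within the detection time, which is the peer's detect multiplier times the greater of our required minimum RX interval and the peer's desired minimum TX interval. Below is a minimal Python illustration of that arithmetic, assuming hypothetical 50 ms timers with a multiplier of 3, since the actual BFD settings on these boxes weren't posted:

```python
# Sketch: BFD asynchronous-mode detection time as described in RFC 5880.
# All timer values below are HYPOTHETICAL; the real BFD config wasn't posted.


def bfd_detection_time_ms(remote_detect_mult: int,
                          local_required_min_rx_ms: int,
                          remote_desired_min_tx_ms: int) -> int:
    """Peer's Detect Mult x max(our RequiredMinRxInterval,
    peer's DesiredMinTxInterval)."""
    return remote_detect_mult * max(local_required_min_rx_ms,
                                    remote_desired_min_tx_ms)


def stall_drops_session(cpu_stall_ms: int, detection_time_ms: int) -> bool:
    """A CPU that neither sends nor processes BFD packets for longer than
    the detection time lets the session time out."""
    return cpu_stall_ms > detection_time_ms


if __name__ == "__main__":
    detect = bfd_detection_time_ms(3, 50, 50)  # assumed 50 ms x 3
    print(f"detection time: {detect} ms")
    for stall in (100, 200, 1000):
        print(f"{stall} ms CPU stall -> session down: "
              f"{stall_drops_session(stall, detect)}")
```

With those assumed timers any stall longer than 150 ms is enough to drop the session, and a stall that short averages out to almost nothing on a per-minute or five-minute CPU graph.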

But from the log it looks like the OSPF sessions bounced down and up very quickly and then went down and stayed down, which I have no explanation for.
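Replaying the ADJCHG lines in order makes the "bounced, then stayed down" pattern easy to see. A minimal Python sketch, assuming the wrapped syslog lines are first joined back into one line each; only the transitions that matter are included, with addresses masked as posted:

```python
import re
from datetime import datetime

# ME02's %OSPF-5-ADJCHG lines, unwrapped (subset of the posted excerpt).
LOG = [
    "*Jan 22 2016 07:29:55.686 GMTEST: %OSPF-5-ADJCHG: Process 100, "
    "Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down: BFD node down",
    "*Jan 22 2016 07:29:55.842 GMTEST: %OSPF-5-ADJCHG: Process 100, "
    "Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: BFD node down",
    "*Jan 22 2016 07:29:57.970 GMTEST: %OSPF-5-ADJCHG: Process 100, "
    "Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from LOADING to FULL, Loading Done",
    "*Jan 22 2016 07:29:58.210 GMTEST: %OSPF-5-ADJCHG: Process 100, "
    "Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: BFD node down",
    "*Jan 22 2016 07:29:58.626 GMTEST: %OSPF-5-ADJCHG: Process 100, "
    "Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from LOADING to FULL, Loading Done",
    "*Jan 22 2016 07:29:58.986 GMTEST: %OSPF-5-ADJCHG: Process 100, "
    "Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down: BFD node down",
    "*Jan 22 2016 07:29:59.250 GMTEST: %OSPF-5-ADJCHG: Process 100, "
    "Nbr xxx.xxx.xxx.212 on GigabitEthernet0/2 from FULL to DOWN, Neighbor Down: BFD node down",
]

ADJCHG = re.compile(
    r"^\*(?P<ts>\w{3} \d{1,2} \d{4} [\d:.]+) \S+: %OSPF-5-ADJCHG: "
    r"Process \d+, Nbr (?P<nbr>\S+) on (?P<intf>\S+) "
    r"from (?P<old>[\w-]+) to (?P<new>[\w-]+),"
)


def replay(lines):
    """Return {neighbour: (timestamp, state)} for the last change seen per neighbour."""
    state = {}
    for line in lines:
        m = ADJCHG.match(line.strip())
        if not m:
            continue  # not an ADJCHG line
        ts = datetime.strptime(m["ts"], "%b %d %Y %H:%M:%S.%f")
        state[m["nbr"]] = (ts, m["new"])
    return state


if __name__ == "__main__":
    for nbr, (ts, st) in sorted(replay(LOG).items()):
        print(f"{nbr}: {st} since {ts.strftime('%H:%M:%S.%f')[:-3]}")
```

This prints all three neighbours as DOWN (.211 since 07:29:58.986, .212 since 07:29:59.250, .213 since 07:29:58.210). The posted excerpt ends at 07:29:59.250, so this only reports the last state in the excerpt, not what happened after it.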

What puzzles me is the LDP session: it's a TCP session, so it should have been rerouted over any working link (any of the three links on ME02).
So if the LDP session was down for 20 minutes, it looks as if the whole box was off the net for 20 minutes.
And even if the only problem was re-establishing the LDP session after it was knocked down, the fact that it went down at all (on KeepAlive timer expiry) suggests that connectivity to ME02 was lost for over 3 minutes, assuming the default LDP holdtime of 180 seconds.
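The log timestamps can be checked against that 3-minute reading. A minimal Python sketch of the arithmetic, assuming the default LDP holdtime of 180 seconds (the configured value wasn't posted) and that the holdtime timer is restarted by every LDP PDU received from the peer:

```python
from datetime import datetime, timedelta

# Timestamps copied from ME02's log (all on Jan 22 2016, date omitted).
FMT = "%H:%M:%S.%f"


def t(stamp: str) -> datetime:
    """Parse an HH:MM:SS.mmm syslog timestamp."""
    return datetime.strptime(stamp, FMT)


first_bfd_down = t("07:29:55.686")  # first OSPF ADJCHG "BFD node down"
ldp_down = t("07:33:15.846")        # LDP-5-NBRCHG ... KeepAlive Timer expired
ldp_up = t("07:53:09.866")          # LDP-5-NBRCHG ... is UP

# ASSUMPTION: default LDP holdtime (180 s); the configured value wasn't posted.
HOLDTIME = timedelta(seconds=180)

# The holdtime timer restarts on every LDP PDU received, so an expiry at
# ldp_down implies the last PDU from the peer arrived about HOLDTIME earlier.
last_pdu_approx = ldp_down - HOLDTIME

if __name__ == "__main__":
    print(f"BFD down -> LDP down: {(ldp_down - first_bfd_down).total_seconds():.3f} s")
    print(f"last LDP PDU from peer at about {last_pdu_approx:%H:%M:%S}")
    outage = (ldp_up - ldp_down).total_seconds()
    print(f"LDP outage: {outage:.3f} s ({outage / 60:.1f} min)")
```

Under those assumptions the gap from the first BFD drop to the LDP expiry is 200.160 s, the last LDP PDU from the peer would have arrived at about 07:30:15, and the LDP outage itself was 1194.020 s (about 19.9 minutes).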

adam
> CiscoNSP List
> Sent: Friday, January 22, 2016 6:13 AM
>
> Hi Everyone,
>
>
> At one of our POPs we have 2 x ME3600's, and 2 x ASR1001's - All directly
> connected in a mesh(All via single mode fibre), running OSPF, LDP,
> BGP...have not had any issues(connectivity) on them since they were put in
> some 30-odd weeks ago...
>
>
> This morning, we received BFD down notifications, then OSPF with all the
> units....the unit that seems to have had the issue is one of the ME's...ME02
> (The other devices only lost BFD/OSPF to this device), it lost BFD and OSPF to
> the other 3 units.
>
>
> *Jan 22 2016 07:29:55.686 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down:
> BFD node down
>
> *Jan 22 2016 07:29:55.842 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down:
> BFD node down
>
> *Jan 22 2016 07:29:57.950 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from DOWN to INIT, Received Hello
>
> *Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from INIT to 2WAY, 2-Way Received
>
> *Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from 2WAY to EXSTART, AdjOK?
>
> *Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from EXSTART to EXCHANGE,
> Negotiation Done
>
> *Jan 22 2016 07:29:57.958 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from EXCHANGE to LOADING,
> Exchange Done
>
> *Jan 22 2016 07:29:57.970 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from LOADING to FULL, Loading Done
>
> *Jan 22 2016 07:29:58.210 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down:
> BFD node down
>
> *Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from DOWN to INIT, Received Hello
>
> *Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from INIT to 2WAY, 2-Way Received
>
> *Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from 2WAY to EXSTART, AdjOK?
>
> *Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from EXSTART to EXCHANGE,
> Negotiation Done
>
> *Jan 22 2016 07:29:58.626 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from EXCHANGE to LOADING,
> Exchange Done
>
> *Jan 22 2016 07:29:58.626 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from LOADING to FULL, Loading Done
>
> *Jan 22 2016 07:29:58.986 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down:
> BFD node down
>
> *Jan 22 2016 07:29:59.250 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.212 on GigabitEthernet0/2 from FULL to DOWN, Neighbor Down:
> BFD node down
>
>
> OSPF recovered very quickly (i.e. few ms.....by 07:30)
>
>
> Once this occurred, the "problem" ME3600 (ME02) also lost LDP to another
> ME at a different POP....this ME is not directly connected to ME02 (But we do
> have FRR enabled on physical Ints (Not vlan Ints....got hit by that bug
> already!))
>
>
> *Jan 22 2016 07:33:15.846 GMTEST: %LDP-5-GR: GR session xxx.xxx.xxx.208:0
> (inst 4): interrupted--recovery pending
>
> *Jan 22 2016 07:33:15.846 GMTEST: %LDP-5-NBRCHG: LDP Neighbor
> xxx.xxx.xxx.208:0 (0) is DOWN (Session KeepAlive Timer expired)
>
>
> Then recovery of LDP to this ME, some 20 minutes later:
>
> *Jan 22 2016 07:53:09.866 GMTEST: %LDP-5-NBRCHG: LDP Neighbor
> xxx.xxx.xxx.208:0 (4) is UP
>
>
> No-one was physically at the devices, and no config changes were being
> made.....CPU utilisation on all was "normal", i.e. very low.....any suggestions
> as to what may have happened here would be greatly appreciated, or what
> other post outage investigation I can do prior to opening a TAC case.
>
>
> If it was just the one Int, potential bad SFP or cable....but all 3 Ints on the
> ME02 looked to be hit at the same time....
>
>
> All ME's are currently running 15.3(3)S4...we were in the process of
> upgrading them all (next week or 2) to 15.3(3)S6 or 15.4(3)S4...but want to
> get to the bottom of what's occurred today first.
>
>
> Cheers.
>
>
>
>
>

        Adam Vitkovsky
        IP Engineer

T:      0333 006 5936
E:      Adam.Vitkovsky at gamma.co.uk
W:      www.gamma.co.uk



_______________________________________________
cisco-nsp mailing list  cisco-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/


