[c-nsp] ME3600's BFD-related "outage" between directly connected ME's and ASR1001's (Same rack)

Fri Jan 22 18:16:51 EST 2016

Thanks for the reply, Adam.

ospf stats from PE02 @ POPA:

#show ip ospf statistics

            OSPF Router with ID (yyy.yyy.yyy.253) (Process ID 1)

  Summary OSPF SPF statistic

None

            OSPF Router with ID (10.10.4.41) (Process ID 40)

  Area 40: SPF algorithm executed 25 times

  Summary OSPF SPF statistic

  SPF calculation time
Delta T   Intra  D-Intra  Summ  D-Summ  Ext  D-Ext  Total  Reason
7w0d      0      0        0     0       0    0      0      R, N, X
7w0d      0      0        0     0       0    0      0      0x0
7w0d      0      0        0     0       0    0      0      R, N, X
7w0d      0      0        0     0       0    0      0      0x0
7w0d      0      0        0     0       0    0      0      R, N, X
7w0d      0      0        0     0       0    0      0      0x0
22:35:00  0      0        0     0       0    0      0      R, N, X
22:34:55  0      0        0     0       0    0      0      0x0
22:33:37  0      0        0     0       0    0      0      R, N, X
22:33:32  0      0        0     0       0    0      0      0x0

            OSPF Router with ID (xxx.xxx.xxx.210) (Process ID 100)

  Area 0: SPF algorithm executed 939 times

  Summary OSPF SPF statistic

  SPF calculation time
Delta T   Intra  D-Intra  Summ  D-Summ  Ext  D-Ext  Total  Reason
1w2d      0      0        0     0       0    4      4      R, SN, X
1w2d      0      0        0     0       0    0      0      0x0
1w2d      0      0        0     0       0    0      0      R, N
1d15h     0      0        0     0       0    0      0      R
1d15h     0      0        0     0       0    0      0      R
22:58:08  0      0        0     0       0    4      4      R, SN, X
22:58:03  0      0        0     0       4    4      8      R, SN, X
22:57:58  0      0        0     0       0    0      0      R
19:14:18  0      0        0     0       0    0      0      R, X
19:14:13  0      0        0     0       0    0      0      0x0
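
In case it helps anyone eyeballing the Process 100 table above, here is a minimal Python sketch that picks out the SPF runs with a non-zero Total. The names parse_spf_rows, FIELDS and SAMPLE are my own illustrative choices (the rows are copied from the output above), and it assumes the nine-column layout shown here; other IOS releases may format this output differently.

```python
# Rows copied from the Process 100 table above; the header is kept on purpose
# to show that non-data lines are skipped.
SAMPLE = """\
Delta T   Intra D-Intra Summ    D-Summ  Ext     D-Ext   Total   Reason
1w2d   0        0       0       0       0       4       4       R, SN, X
1w2d   0        0       0       0       0       0       0       0x0
1w2d   0        0       0       0       0       0       0       R, N
1d15h   0       0       0       0       0       0       0       R
1d15h   0       0       0       0       0       0       0       R
22:58:08   0    0       0       0       0       4       4       R, SN, X
22:58:03   0    0       0       0       4       4       8       R, SN, X
22:57:58   0    0       0       0       0       0       0       R
19:14:18   0    0       0       0       0       0       0       R, X
19:14:13   0    0       0       0       0       0       0       0x0
"""

# Hypothetical field names for the seven numeric columns.
FIELDS = ("intra", "d_intra", "summ", "d_summ", "ext", "d_ext", "total")


def parse_spf_rows(text):
    """Parse SPF statistic rows into dicts; skip anything that is not a data row."""
    rows = []
    for line in text.splitlines():
        # Delta T, seven numbers, then the free-form Reason column.
        parts = line.split(None, 8)
        if len(parts) < 9 or not all(p.isdigit() for p in parts[1:8]):
            continue
        row = {"delta_t": parts[0], "reason": parts[8].strip()}
        row.update(zip(FIELDS, map(int, parts[1:8])))
        rows.append(row)
    return rows


if __name__ == "__main__":
    for row in parse_spf_rows(SAMPLE):
        if row["total"] > 0:
            print(row["delta_t"], row["total"], row["reason"])
```

On the rows above, only the three runs with reason "R, SN, X" show any calculation time at all.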

The "issue" recurred some 4-5 hours later. I was able to telnet (remotely) to all loopbacks during this, so I'm a little stumped as to what is happening.

The main impact of this problem is on another ME3600 at "POPB" that has 2 paths to reach the ME's/ASR's at "POPA" (where the BFD problem arose).

At POPB we have 2 x ME3600, directly connected to each other. ME01 has a direct connection to ASR1 at POPA, and ME02 has a direct connection to an ASR1006 in another POP (POPC). During the outage, both ME's at POPB have the same route to ME02 at POPA, namely via ME01 (POPB)/ASR1 (POPA).

During the outage, ME02 at POPB loses the ability to reach ME02 at POPA (ping to its loopback IP), but ME02 at POPA can still ping ME02 at POPB's loopback. Very strange: ping works one way, but not the other. Note that ME01 (POPB) can reach ME02 (POPA) without issue during the outage.

I have temporarily "fixed" this by adding an OSPF cost to the link between the 2 ME's at POPB, so ME02's (POPB) path to ME02 (POPA) is now via POPC.

Really very strange, and I am at a loss as to why it has just started happening. (Note, the ME02s at POPA and POPB have customer PW's configured between them, i.e. we were notified by customers the moment it occurred.)

The above description is probably a bit difficult to follow without a diagram. I'm on the road at the moment, so I have limited access.

Cheers

________________________________________
From: Adam Vitkovsky <Adam.Vitkovsky at gamma.co.uk>
Sent: Friday, 22 January 2016 8:45 PM
To: CiscoNSP List; cisco-nsp at puck.nether.net
Subject: RE: ME3600's BFD-related "outage" between directly connected ME's and ASR1001's (Same rack)

Hi,

On the ME3600X, BFD is handled on the central CPU, so if two BFD sessions were affected at once, followed by a third a couple of seconds later, it could be some process hogging the CPU for a blip of time and causing the BFD sessions to reset.
That is something you wouldn't notice looking at the CPU utilization graph.
Maybe some SPF computation took place; check "show ip ospf statistics".

But from the log it looks like the OSPF sessions bounced down and up very quickly but then went down and stayed like that which I have no explanation for.
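
To put numbers on "very quickly", here is a rough Python sketch that measures how long each adjacency stayed FULL before BFD tore it down again. The log lines are copied from the quoted message below (re-joined onto single lines); the regex, the name full_durations and the assumption that every syslog line follows this exact timestamp layout are mine, so treat it as illustrative only.

```python
import re
from datetime import datetime

# ADJCHG lines from the quoted log, unwrapped onto single lines.
LOG = """\
*Jan 22 2016 07:29:57.970 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from LOADING to FULL, Loading Done
*Jan 22 2016 07:29:58.210 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: BFD node down
*Jan 22 2016 07:29:58.626 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from LOADING to FULL, Loading Done
*Jan 22 2016 07:29:58.986 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down: BFD node down
"""

# Assumed line layout: "*<Mon> <day> <year> <time> <tz>: %OSPF-5-ADJCHG: ... Nbr X on IF from A to B, ..."
PATTERN = re.compile(
    r"^\*(?P<ts>\w{3} \d+ \d{4} [\d:.]+) \S+ %OSPF-5-ADJCHG: .*?"
    r"Nbr (?P<nbr>\S+) on \S+ from (?P<old>\w+) to (?P<new>\w+),"
)


def full_durations(log_text):
    """Return (neighbor, seconds) for each FULL -> non-FULL interval in the log."""
    went_full = {}
    durations = []
    for line in log_text.splitlines():
        m = PATTERN.match(line)
        if not m:
            continue
        ts = datetime.strptime(m["ts"], "%b %d %Y %H:%M:%S.%f")
        if m["new"] == "FULL":
            went_full[m["nbr"]] = ts
        elif m["old"] == "FULL" and m["nbr"] in went_full:
            start = went_full.pop(m["nbr"])
            durations.append((m["nbr"], (ts - start).total_seconds()))
    return durations


if __name__ == "__main__":
    for nbr, secs in full_durations(LOG):
        print(f"{nbr}: FULL for {secs * 1000:.0f} ms")
```

Going by those timestamps, the .213 adjacency was FULL for about 240 ms and the .211 adjacency for about 360 ms before BFD took each one down again.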

What puzzles me is the LDP session, as it's a TCP session that should have been rerouted over any working link (any of the three links on ME02).
So if the LDP session was off for 20 minutes, it looks like the whole box was off the net for 20 minutes.
Even if there was a problem re-establishing the LDP session after it was knocked down, the fact that it went down at all suggests connectivity to ME02 was lost for over 3 minutes.
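
As a quick sanity check on those timings, the arithmetic can be sketched in Python using the timestamps from the quoted log below. The 180-second value is an assumption on my part (I believe it is the IOS default LDP session holdtime, but I have not seen this config), and the variable names are purely illustrative.

```python
from datetime import datetime

FMT = "%b %d %Y %H:%M:%S.%f"

# ASSUMPTION: default LDP session holdtime; not confirmed from the actual config.
ASSUMED_LDP_HOLDTIME_S = 180


def seconds_between(start, end):
    """Elapsed seconds between two syslog-style timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


# Timestamps taken from the quoted log.
first_bfd_down = "Jan 22 2016 07:29:55.686"  # first "BFD node down" event
ldp_down = "Jan 22 2016 07:33:15.846"        # LDP keepalive timer expired
ldp_up = "Jan 22 2016 07:53:09.866"          # LDP neighbor back UP

gap_to_ldp_down = seconds_between(first_bfd_down, ldp_down)
ldp_outage = seconds_between(ldp_down, ldp_up)

if __name__ == "__main__":
    print(f"first BFD down -> LDP down: {gap_to_ldp_down:.3f} s")
    print(f"LDP down -> LDP up: {ldp_outage:.3f} s ({ldp_outage / 60:.1f} min)")
    print("exceeds assumed holdtime:", gap_to_ldp_down > ASSUMED_LDP_HOLDTIME_S)
```

That gives roughly 200 s from the first BFD event to the LDP keepalive expiry, and just under 20 minutes (about 1194 s) with the LDP session down.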

adam
> CiscoNSP List
> Sent: Friday, January 22, 2016 6:13 AM
>
> Hi Everyone,
>
>
> At one of our POPs we have 2 x ME3600's, and 2 x ASR1001's - All directly
> connected in a mesh(All via single mode fibre), running OSPF, LDP,
> BGP...have not had any issues(connectivity) on them since they were put in
> some 30-odd weeks ago...
>
>
> This morning, we received BFD down notifications, then OSPF with all the
> units....the unit that seems to have had the issue is one of the ME's...ME02
> (The other devices only lost BFD/OSPF to this device), it lost BFD and OSPF to
> the other 3 units.
>
>
> *Jan 22 2016 07:29:55.686 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down:
> BFD node down
>
> *Jan 22 2016 07:29:55.842 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down:
> BFD node down
>
> *Jan 22 2016 07:29:57.950 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from DOWN to INIT, Received Hello
>
> *Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from INIT to 2WAY, 2-Way Received
>
> *Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from 2WAY to EXSTART, AdjOK?
>
> *Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from EXSTART to EXCHANGE,
> Negotiation Done
>
> *Jan 22 2016 07:29:57.958 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from EXCHANGE to LOADING,
> Exchange Done
>
> *Jan 22 2016 07:29:57.970 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from LOADING to FULL, Loading Done
>
> *Jan 22 2016 07:29:58.210 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down:
> BFD node down
>
> *Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from DOWN to INIT, Received Hello
>
> *Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from INIT to 2WAY, 2-Way Received
>
> *Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from 2WAY to EXSTART, AdjOK?
>
> *Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from EXSTART to EXCHANGE,
> Negotiation Done
>
> *Jan 22 2016 07:29:58.626 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from EXCHANGE to LOADING,
> Exchange Done
>
> *Jan 22 2016 07:29:58.626 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from LOADING to FULL, Loading Done
>
> *Jan 22 2016 07:29:58.986 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down:
> BFD node down
>
> *Jan 22 2016 07:29:59.250 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr
> xxx.xxx.xxx.212 on GigabitEthernet0/2 from FULL to DOWN, Neighbor Down:
> BFD node down
>
>
> OSPF recovered very quickly (i.e. few ms.....by 07:30)
>
>
> Once this occurred, the "problem" ME3600 (ME02), also lost LDP to another
> ME at a different POP....this ME is not directly connected to ME02 (But we do
> have FRR enabled on physical Ints (Not vlan Ints....got hit by that bug
> already!)
>
>
> *Jan 22 2016 07:33:15.846 GMTEST: %LDP-5-GR: GR session xxx.xxx.xxx.208:0
> (inst 4): interrupted--recovery pending
>
> *Jan 22 2016 07:33:15.846 GMTEST: %LDP-5-NBRCHG: LDP Neighbor
> xxx.xxx.xxx.208:0 (0) is DOWN (Session KeepAlive Timer expired)
>
>
> Then recovery of LDP to this ME, some 20 minutes later:
>
> *Jan 22 2016 07:53:09.866 GMTEST: %LDP-5-NBRCHG: LDP Neighbor
> xxx.xxx.xxx.208:0 (4) is UP
>
>
> No-one was physically at the devices, and no config changes were being
> made.....CPU utilisation on all was "normal", i.e. very low.....any suggestions
> as to what may have happened here would be greatly appreciated, or what
> other post outage investigation I can do prior to opening a TAC case.
>
>
> If it was just the one Int, potential bad SFP or cable....but all 3 Ints on the
> ME02 looked to be hit at the same time....
>
>
> All ME's are currently running 15.3(3)S4...we were in the process of
> upgrading them all (next week or 2) to 15.3.3(S6) or 15.4.3(S4)...but want to
> get to the bottom of what's occurred today first.
>
>
> Cheers.
>
>
>
>
>

        Adam Vitkovsky
        IP Engineer

T:      0333 006 5936
E:      Adam.Vitkovsky at gamma.co.uk
W:      www.gamma.co.uk


_______________________________________________
> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/