[c-nsp] ME3600's BFD-related "outage" between directly connected ME's and ASR1001's (Same rack)

Fri Jan 22 01:12:43 EST 2016

Hi Everyone,

At one of our POPs we have 2 x ME3600s and 2 x ASR1001s, all directly connected in a mesh (all via single-mode fibre), running OSPF, LDP and BGP. We have not had any connectivity issues on them since they were put in some 30-odd weeks ago.

This morning we received BFD down notifications, followed by OSPF down notifications, involving all the units. The unit that seems to have had the issue is one of the MEs, ME02: it lost BFD and OSPF to the other 3 units, while the other devices only lost BFD/OSPF to this device.

*Jan 22 2016 07:29:55.686 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down: BFD node down

*Jan 22 2016 07:29:55.842 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: BFD node down

*Jan 22 2016 07:29:57.950 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from DOWN to INIT, Received Hello

*Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from INIT to 2WAY, 2-Way Received

*Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from 2WAY to EXSTART, AdjOK?

*Jan 22 2016 07:29:57.954 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from EXSTART to EXCHANGE, Negotiation Done

*Jan 22 2016 07:29:57.958 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from EXCHANGE to LOADING, Exchange Done

*Jan 22 2016 07:29:57.970 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from LOADING to FULL, Loading Done

*Jan 22 2016 07:29:58.210 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: BFD node down

*Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from DOWN to INIT, Received Hello

*Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from INIT to 2WAY, 2-Way Received

*Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from 2WAY to EXSTART, AdjOK?

*Jan 22 2016 07:29:58.614 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from EXSTART to EXCHANGE, Negotiation Done

*Jan 22 2016 07:29:58.626 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from EXCHANGE to LOADING, Exchange Done

*Jan 22 2016 07:29:58.626 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from LOADING to FULL, Loading Done

*Jan 22 2016 07:29:58.986 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down: BFD node down

*Jan 22 2016 07:29:59.250 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.212 on GigabitEthernet0/2 from FULL to DOWN, Neighbor Down: BFD node down
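
To make the timing easier to see, here is a rough Python sketch that pulls the BFD-triggered FULL to DOWN events out of log lines like the ones above and reports how far apart they are. The regex, the `parse` and `bfd_downs` helper names, and the cut-down `LOG` sample are my own assumptions for illustration, not anything produced by IOS.

```python
import re
from datetime import datetime

# Sample copied from the log above: the five BFD-triggered DOWN events
# plus one unrelated transition to show that it gets filtered out.
LOG = """\
*Jan 22 2016 07:29:55.686 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down: BFD node down
*Jan 22 2016 07:29:55.842 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: BFD node down
*Jan 22 2016 07:29:57.970 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from LOADING to FULL, Loading Done
*Jan 22 2016 07:29:58.210 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.213 on GigabitEthernet0/1 from FULL to DOWN, Neighbor Down: BFD node down
*Jan 22 2016 07:29:58.986 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.211 on GigabitEthernet0/3 from FULL to DOWN, Neighbor Down: BFD node down
*Jan 22 2016 07:29:59.250 GMTEST: %OSPF-5-ADJCHG: Process 100, Nbr xxx.xxx.xxx.212 on GigabitEthernet0/2 from FULL to DOWN, Neighbor Down: BFD node down
"""

# Hypothetical pattern written for this exact log layout.
PATTERN = re.compile(
    r"^\*(?P<ts>\w{3} \d{1,2} \d{4} \d{2}:\d{2}:\d{2}\.\d{3}) \S+: "
    r"%OSPF-5-ADJCHG: Process \d+, Nbr (?P<nbr>\S+) on (?P<intf>\S+) "
    r"from (?P<old>\w+) to (?P<new>\w+), (?P<reason>.*)$"
)


def parse(text):
    """Return (timestamp, interface, neighbour, old, new, reason) tuples."""
    events = []
    for line in text.splitlines():
        m = PATTERN.match(line.strip())
        if m:
            ts = datetime.strptime(m["ts"], "%b %d %Y %H:%M:%S.%f")
            events.append((ts, m["intf"], m["nbr"], m["old"], m["new"], m["reason"]))
    return events


def bfd_downs(events):
    """Keep only transitions to DOWN that were triggered by BFD."""
    return [e for e in events if e[4] == "DOWN" and "BFD node down" in e[5]]


downs = bfd_downs(parse(LOG))
interfaces = sorted({e[1] for e in downs})
# Seconds between the first and the last BFD-triggered drop.
spread = (downs[-1][0] - downs[0][0]).total_seconds()

for ts, intf, nbr, *_ in downs:
    print(ts.time(), intf, nbr)
print(f"{len(downs)} BFD drops on {len(interfaces)} interfaces within {spread:.3f} s")
```

On the sample above this shows all three interfaces dropping within a window of a few seconds, with Gi0/1 and Gi0/3 each dropping twice.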

OSPF recovered very quickly (i.e. within seconds, all back by 07:30).

Once this occurred, the "problem" ME3600 (ME02) also lost LDP to another ME at a different POP. This ME is not directly connected to ME02, but we do have FRR enabled on physical interfaces (not VLAN interfaces, we got hit by that bug already!).

*Jan 22 2016 07:33:15.846 GMTEST: %LDP-5-GR: GR session xxx.xxx.xxx.208:0 (inst 4): interrupted--recovery pending

*Jan 22 2016 07:33:15.846 GMTEST: %LDP-5-NBRCHG: LDP Neighbor xxx.xxx.xxx.208:0 (0) is DOWN (Session KeepAlive Timer expired)

LDP to this ME then recovered some 20 minutes later:

*Jan 22 2016 07:53:09.866 GMTEST: %LDP-5-NBRCHG: LDP Neighbor xxx.xxx.xxx.208:0 (4) is UP
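
For the record, the exact gaps can be worked out from the timestamps in the logs above. This is just a small Python sketch; the variable names are mine. If the LDP session hold time is at the IOS default of 180 seconds (an assumption, I have not shown the configured value here), the gap between the last BFD drop and the LDP keepalive expiry is in the same ballpark.

```python
from datetime import datetime

FMT = "%b %d %Y %H:%M:%S.%f"

# Timestamps copied from the log lines above.
last_bfd_down = datetime.strptime("Jan 22 2016 07:29:59.250", FMT)
ldp_down = datetime.strptime("Jan 22 2016 07:33:15.846", FMT)
ldp_up = datetime.strptime("Jan 22 2016 07:53:09.866", FMT)

# Time from the last BFD-triggered OSPF drop to the LDP keepalive expiry.
bfd_to_ldp_down = (ldp_down - last_bfd_down).total_seconds()
# How long the LDP session to the remote ME stayed down.
ldp_outage = (ldp_up - ldp_down).total_seconds()

print(f"Last BFD drop to LDP down: {bfd_to_ldp_down:.3f} s")
print(f"LDP session down for:      {ldp_outage:.3f} s ({ldp_outage / 60:.1f} min)")
```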

No-one was physically at the devices and no config changes were being made. CPU utilisation on all of them was "normal", i.e. very low. Any suggestions as to what may have happened here, or what other post-outage investigation I can do prior to opening a TAC case, would be greatly appreciated.

If it was just the one interface, I would suspect a bad SFP or cable, but all 3 interfaces on ME02 looked to be hit at the same time.

All MEs are currently running 15.3(3)S4. We were in the process of upgrading them all (in the next week or 2) to 15.3(3)S6 or 15.4(3)S4, but want to get to the bottom of what occurred today first.

Cheers.