[j-nsp] Random BGP peer drops

Wed Feb 15 10:39:00 EST 2012

We do. It's standard on all our interfaces:

myuser at MYPE1-re0> show configuration protocols ospf area 0 interface xe-0/0/0 
interface-type p2p;
metric 100;
ldp-synchronization;

Serge

________________________________
From: Addy Mathur <addy.mathur at gmail.com>
To: Serge Vautour <serge at nbnet.nb.ca> 
Cc: "juniper-nsp at puck.nether.net" <juniper-nsp at puck.nether.net> 
Sent: Wednesday, February 15, 2012 10:54:29 AM
Subject: Re: [j-nsp] Random BGP peer drops

Serge:

Do you have ldp synchronization enabled?

http://www.juniper.net/techpubs/en_US/junos10.4/topics/usage-guidelines/routing-configuring-synchronization-between-ldp-and-igps.html

--Addy.

On Tuesday, February 14, 2012, Serge Vautour <sergevautour at yahoo.ca> wrote:
> Hello,
>
> We have an MPLS network made up of many MX960s and MX80s. We run OSPF as our IGP - all links in area 0. BGP is used for signaling of all L2VPN & VPLS. At this time we only have 1 L3VPN for mgmt. LDP is used for transport LSPs. We have M10i as dedicated Route Reflectors. Most MX are on 10.4S5. M10i still on 10.0R3. Each PE peers with 2 RRs and has 2 diverse uplinks for redundancy. If 1 link fails, there's always another path.
>
> It's been rare but we've seen random iBGP peer drops. The first was several months ago. We've now seen 2 in the last week. 2 of the 3 were related to link failures. The primary path from the PE to the RR failed. BGP timed out after a bit. Here's an example:
>
> Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
> Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
> Feb  8 14:06:33  OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt: 1056225956 snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold timer 0
>
> BGP holdtime is 90sec. This is more than enough time for OSPF to find the other path and converge. The BGP peer came back up before the link so things did eventually converge.
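
A rough way to put numbers on that timeline is to subtract the syslog timestamps. The sketch below is only an illustration: the two log lines are abbreviated copies of the ones quoted above, the year is assumed, and `syslog_time` is a hypothetical helper name. Since a BGP hold timer restarts whenever a KEEPALIVE or UPDATE arrives, the gap suggests (to syslog's one-second resolution) how long the PE went without hearing from the RR.

```python
from datetime import datetime

HOLD_TIME = 90  # seconds, the hold time stated in the post


def syslog_time(line: str, year: int = 2012) -> datetime:
    """Parse the leading 'Mon DD HH:MM:SS' stamp of a syslog line.

    Syslog omits the year, so it is supplied as an assumption.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")


# Abbreviated copies of the log lines quoted in the post.
link_down = "Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: SNMP_TRAP_LINK_DOWN ifName xe-0/0/0"
hold_expired = "Feb  8 14:06:33  OURBOX-re0 rpd[1413]: bgp_hold_timeout: NOTIFICATION sent to 10.1.1.2"

# Seconds between the link going down and the hold timer expiring.
gap = (syslog_time(hold_expired) - syslog_time(link_down)).total_seconds()

# The hold timer restarts on each KEEPALIVE/UPDATE received, so the last
# message from the peer arrived roughly HOLD_TIME seconds before expiry.
last_heard_before_failure = HOLD_TIME - gap

print(f"hold timer expired {gap:.0f}s after link down")
print(f"last BGP message arrived about {last_heard_before_failure:.0f}s before the link failed")
```

If those figures hold, nothing from the RR reached the PE for about a minute after the link went down, which is far longer than OSPF should need to move traffic to the second uplink.
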
>
> The last BGP peer drop happened without any link failure. Out of the blue, BGP just went down. The logs on the PE:
>
> Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold timer 0
> Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from Established to Idle (event HoldTime)
> Feb 13 20:41:21  OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from OpenConfirm to Established (event RecvKeepAlive)
>
> The RR side shows the same:
>
> Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from Established to Idle (event RecvNotify)
> Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: %DAEMON-4: bgp_read_v4_message:8927: NOTIFICATION received from 10.1.1.61 (Internal AS 123): code 4 (Hold Timer Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 2149037458, hold timer 1:03.112744
> Feb 13 20:41:21  OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from EstabSync to Established (event RsyncAck)
> Feb 13 20:41:30  OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30 bytes to 10.1.1.61 (Internal AS 123) blocked (no spooling requested): Resource temporarily unavailable
>
>
> You can see the peer wasn't down long and re-established on its own. The logs on the RR make it look like it received a msg from the PE that it was dropping the BGP session. The last error on the RR seems odd as well.
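
The TCP counters printed in the two NOTIFICATION log lines can be compared directly. The sketch below makes several assumptions: that `sndcc` is the number of bytes sitting in the socket send buffer (going by the BSD naming), that the two snapshots, taken about a second apart, are comparable, and that `tcp_fields` is just a hypothetical helper. The message sizes are the fixed lengths from RFC 4271: 19 bytes for a KEEPALIVE and at least 21 for a NOTIFICATION.

```python
import re

KEEPALIVE_LEN = 19     # BGP header only (RFC 4271)
NOTIFICATION_MIN = 21  # header + error code + error subcode (RFC 4271)


def tcp_fields(line: str) -> dict:
    """Extract the numeric TCP counters that rpd prints after 'socket buffer'."""
    pairs = re.findall(r"(sndcc|rcvcc|snd_una|snd_nxt|rcv_nxt): (\d+)", line)
    return {name: int(value) for name, value in pairs}


# Counter portions copied from the PE and RR log lines quoted in the post.
pe = tcp_fields("socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 "
                "snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217")
rr = tcp_fields("socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 "
                "snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 2149037458")

rr_unacked = rr["snd_nxt"] - rr["snd_una"]    # sent by the RR, not yet ACKed
pe_missing = rr["snd_nxt"] - pe["rcv_nxt"]    # sent by the RR, not yet seen by the PE
pe_delivered = rr["rcv_nxt"] - pe["snd_nxt"]  # PE bytes the RR received after the PE's snapshot

print(f"RR unacked bytes: {rr_unacked} ({rr_unacked / KEEPALIVE_LEN:g} keepalive-sized messages)")
print(f"RR send buffer:   {rr['sndcc']} ({rr['sndcc'] / KEEPALIVE_LEN:g} keepalive-sized messages)")
print(f"RR bytes the PE never saw: {pe_missing}")
print(f"PE bytes the RR received afterwards: {pe_delivered}")
```

Read this way, the numbers are consistent with the RR having queued keepalives that the PE never received or acknowledged, while the PE's 21-byte NOTIFICATION did reach the RR. That would point to a problem in the RR-to-PE direction only, but it is an interpretation of the counters, not proof.
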
>
>
> Has anyone seen something like this before? We do have a case open regarding a large number of LSA retransmits. TAC is saying this is a bug related to NSR but shouldn't cause any negative impacts. I'm not sure if this is related. I'm considering opening a case for this as well but I'm not very confident I'll get far.
>
>
> Any help would be appreciated.
>
>
> Thanks,
> Serge
>