[j-nsp] Random BGP peer drops

Wed Feb 15 09:54:29 EST 2012

Serge:

Do you have LDP synchronization enabled?

http://www.juniper.net/techpubs/en_US/junos10.4/topics/usage-guidelines/routing-configuring-synchronization-between-ldp-and-igps.html

--Addy.

On Tuesday, February 14, 2012, Serge Vautour <sergevautour at yahoo.ca> wrote:
> Hello,
>
> We have an MPLS network made up of many MX960s and MX80s. We run OSPF as
> our IGP - all links in area 0. BGP is used for signaling of all L2VPN &
> VPLS. At this time we only have 1 L3VPN for mgmt. LDP is used for
> transport LSPs. We have M10i as dedicated Route Reflectors. Most MX are on
> 10.4S5. M10i still on 10.0R3. Each PE peers with 2 RRs and has 2 diverse
> uplinks for redundancy. If 1 link fails, there's always another path.
>
> It's been rare but we've seen random iBGP peer drops. The first was
> several months ago. We've now seen 2 in the last week. 2 of the 3 were
> related to link failures. The primary path from the PE to the RR failed.
> BGP timed out after a bit. Here's an example:
>
> Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
> Feb  8 14:05:32  OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
> Feb  8 14:06:33  OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt: 1056225956 snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold timer 0
>
> BGP holdtime is 90sec. This is more than enough time for OSPF to find the
> other path and converge. The BGP peer came back up before the link so
> things did eventually converge.
>
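
The timing in that first example can be checked with a few lines of Python. This is only a sketch: the timestamps and the 90-second hold time are taken from the text above, the year 2012 is assumed from the date of this thread (syslog does not record it), and the variable names are made up for illustration.

```python
from datetime import datetime, timedelta

HOLD_TIME = timedelta(seconds=90)  # hold time stated in the original post

# Timestamps from the Feb 8 log lines; the year is an assumption.
link_down = datetime(2012, 2, 8, 14, 5, 32)     # SNMP_TRAP_LINK_DOWN
hold_expired = datetime(2012, 2, 8, 14, 6, 33)  # bgp_hold_timeout

# How long the session lasted after the links went down.
after_failure = hold_expired - link_down

# The hold timer restarts on every KEEPALIVE or UPDATE received, so the
# last message from the peer arrived one hold time before the expiry.
last_heard = hold_expired - HOLD_TIME

print(after_failure.total_seconds())             # 61.0
print((link_down - last_heard).total_seconds())  # 29.0
```

If the syslog timestamps are accurate, the PE last heard from 10.1.1.2 about 29 seconds before the links dropped and then received nothing from it for the following 61 seconds, so whatever blocked the session lasted about a minute after the failure.
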
> The last BGP peer drop happened without any link failure. Out of the
> blue, BGP just went down. The logs on the PE:
>
> Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold timer 0
> Feb 13 20:40:48  OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from Established to Idle (event HoldTime)
> Feb 13 20:41:21  OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from OpenConfirm to Established (event RecvKeepAlive)
>
> The RR side shows the same:
>
> Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from Established to Idle (event RecvNotify)
> Feb 13 20:40:49  OUR-RR1-re0 rpd[1187]: %DAEMON-4: bgp_read_v4_message:8927: NOTIFICATION received from 10.1.1.61 (Internal AS 123): code 4 (Hold Timer Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 2149037458, hold timer 1:03.112744
> Feb 13 20:41:21  OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from EstabSync to Established (event RsyncAck)
> Feb 13 20:41:30  OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30 bytes to 10.1.1.61 (Internal AS 123) blocked (no spooling requested): Resource temporarily unavailable
>
>
> You can see the peer wasn't down long and re-established on its own. The
> logs on the RR make it look like it received a message from the PE that it
> was dropping the BGP session. The last error on the RR seems odd as well.
>
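
The socket counters in the two Feb 13 NOTIFICATION lines can be compared directly. The Python below is a rough sketch of that comparison, not a diagnosis: the counter values are copied from the logs above, the 19-byte KEEPALIVE and 21-byte minimum NOTIFICATION sizes come from RFC 4271, and the `counters` helper and variable names are invented for illustration.

```python
import re

KEEPALIVE_LEN = 19     # a BGP KEEPALIVE is only the 19-byte header (RFC 4271)
NOTIFICATION_LEN = 21  # header plus error code and subcode, no data

# Counter portions of the two Feb 13 20:40 log lines, copied from the post.
rr_line = ("sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 "
           "snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095")
pe_line = ("sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 "
           "snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833")

def counters(line):
    """Return every 'name: integer' pair in a log fragment as a dict."""
    return {name: int(value) for name, value in re.findall(r"(\w+): (\d+)", line)}

rr, pe = counters(rr_line), counters(pe_line)

# Bytes sitting in the RR's send buffer, in units of one KEEPALIVE.
print(rr["sndcc"] / KEEPALIVE_LEN)                    # 3.0
# Bytes the RR has transmitted that the PE has not acknowledged.
print((rr["snd_nxt"] - rr["snd_una"]) / KEEPALIVE_LEN)  # 2.0
# The PE's next expected byte is the RR's oldest unacknowledged byte.
print(pe["rcv_nxt"] == rr["snd_una"])                 # True
# The RR received exactly one NOTIFICATION beyond what the PE had sent.
print(rr["rcv_nxt"] - pe["snd_nxt"] == NOTIFICATION_LEN)  # True
```

Read this way, the RR was holding 57 bytes, the size of three keepalives, that the PE had never acknowledged, while the PE's NOTIFICATION reached the RR without trouble. That would be consistent with traffic from the RR toward the PE being lost for roughly one hold time while the opposite direction kept working, assuming keepalives are sent every 30 seconds (one third of the 90-second hold time).
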
>
> Has anyone seen something like this before? We do have a case open
> regarding a large number of LSA retransmits. TAC is saying this is a bug
> related to NSR but shouldn't cause any negative impacts. I'm not sure if
> this is related. I'm considering opening a case for this as well but I'm
> not very confident I'll get far.
>
>
> Any help would be appreciated.
>
>
> Thanks,
> Serge