[j-nsp] Random BGP peer drops
Serge Vautour
sergevautour at yahoo.ca
Tue Feb 14 11:55:30 EST 2012
Hello,
We have an MPLS network made up of many MX960s and MX80s. We run OSPF as our IGP, with all links in area 0. BGP is used for signaling of all L2VPN & VPLS. At this time we only have 1 L3VPN, for mgmt. LDP is used for transport LSPs. We have M10i routers as dedicated Route Reflectors. Most MX are on 10.4S5. The M10i are still on 10.0R3. Each PE peers with 2 RRs and has 2 diverse uplinks for redundancy. If 1 link fails, there's always another path.
It's been rare, but we've seen random iBGP peer drops. The first was several months ago. We've now seen 2 in the last week. 2 of the 3 were related to link failures: the primary path from the PE to the RR failed, and BGP timed out about a minute later. Here's an example:
Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 129, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-7/0/0
Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN: ifIndex 120, ifAdminStatus up(1), ifOperStatus down(2), ifName xe-0/0/0
Feb 8 14:06:33 OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 1056225956 snd_nxt: 1056225956 snd_wnd: 16384 rcv_nxt: 3883304584 rcv_adv: 3883320968, hold timer 0
The BGP hold time is 90 seconds. This is more than enough time for OSPF to find the other path and converge. The BGP peer came back up before the link did, so things did eventually converge.
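To put numbers on the timing in the log lines above, here is a minimal Python sketch. It assumes the syslog timestamps are accurate to the second, that the year is 2012 (taken from the date of this post, since syslog omits it), and that the hold timer is restarted by every KEEPALIVE or UPDATE received, as RFC 4271 describes. The helper name `syslog_time` and the shortened log strings are illustrative only.

```python
from datetime import datetime

def syslog_time(line, year=2012):
    """Parse the leading 'Mon D HH:MM:SS' stamp of a syslog line.

    The year is an assumption; syslog lines do not carry one.
    """
    month, day, clock = line.split()[:3]
    return datetime.strptime(f"{year} {month} {day} {clock}", "%Y %b %d %H:%M:%S")

# Shortened copies of two of the log lines quoted above.
link_down = "Feb 8 14:05:32 OURBOX-re0 mib2d[2279]: %DAEMON-4-SNMP_TRAP_LINK_DOWN"
hold_expired = "Feb 8 14:06:33 OURBOX-re0 rpd[1413]: %DAEMON-4: bgp_hold_timeout:3660"

HOLD_TIME = 90  # seconds, as stated in the post

# Seconds between the link going down and the hold timer expiring.
gap = (syslog_time(hold_expired) - syslog_time(link_down)).total_seconds()

# If the timer expired HOLD_TIME seconds after the last message received,
# this is how long before the link failure that last message arrived.
last_msg_before_failure = HOLD_TIME - gap

print(f"link down -> hold expiry: {gap:.0f}s")
print(f"last BGP message seen ~{last_msg_before_failure:.0f}s before the link failed")
```

Under those assumptions the hold timer expired 61 seconds after the link went down, which would put the last message received from the RR roughly 29 seconds before the failure. That would mean no keepalive from the RR was received at any point after the link failed, even though OSPF should have converged within that window.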
The last BGP peer drop happened without any link failure. Out of the blue, BGP just went down. The logs on the PE:
Feb 13 20:40:48 OUR-PE1 rpd[1159]: %DAEMON-4: bgp_hold_timeout:3660: NOTIFICATION sent to 10.1.1.2 (Internal AS 123): code 4 (Hold Timer Expired Error), Reason: holdtime expired for 10.1.1.2 (Internal AS 123), socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217, hold timer 0
Feb 13 20:40:48 OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from Established to Idle (event HoldTime)
Feb 13 20:41:21 OUR-PE1 rpd[1159]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.2 (Internal AS 123) changed state from OpenConfirm to Established (event RecvKeepAlive)
The RR side shows the same:
Feb 13 20:40:49 OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from Established to Idle (event RecvNotify)
Feb 13 20:40:49 OUR-RR1-re0 rpd[1187]: %DAEMON-4: bgp_read_v4_message:8927: NOTIFICATION received from 10.1.1.61 (Internal AS 123): code 4 (Hold Timer Expired Error), socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 2149037458, hold timer 1:03.112744
Feb 13 20:41:21 OUR-RR1-re0 rpd[1187]: %DAEMON-4-RPD_BGP_NEIGHBOR_STATE_CHANGED: BGP peer 10.1.1.61 (Internal AS 123) changed state from EstabSync to Established (event RsyncAck)
Feb 13 20:41:30 OUR-RR1-re0 rpd[1187]: %DAEMON-3: bgp_send: sending 30 bytes to 10.1.1.61 (Internal AS 123) blocked (no spooling requested): Resource temporarily unavailable
You can see the peer wasn't down long and re-established on its own. The logs on the RR make it look like it received a message from the PE saying that it was dropping the BGP session. The last error on the RR seems odd as well.
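The TCP counters in the PE and RR log lines above can be compared directly. The following is a rough Python sketch, not a definitive diagnosis. It assumes the counters have the usual BSD socket meanings (`sndcc` is bytes sitting in the send buffer, `snd_una` is the oldest unacknowledged sequence number, `snd_nxt` is the next one to send, `rcv_nxt` is the next one expected), and it ignores sequence-number wraparound. The helper name `tcp_counters` is made up for illustration. The 19-byte KEEPALIVE size and the 21-byte size of a NOTIFICATION with no data come from the BGP message formats in RFC 4271.

```python
import re

KEEPALIVE_LEN = 19      # BGP header only (RFC 4271)
NOTIFICATION_LEN = 21   # header + error code + subcode, no data

def tcp_counters(line):
    """Extract the integer 'name: value' socket fields from an rpd log line."""
    wanted = ("sndcc", "rcvcc", "snd_una", "snd_nxt", "rcv_nxt")
    return {k: int(v) for k, v in re.findall(r"(\w+): (\d+)", line) if k in wanted}

# Socket portions of the Feb 13 log lines quoted above.
pe = tcp_counters(
    "socket buffer sndcc: 0 rcvcc: 0 TCP state: 4, snd_una: 2149021074 "
    "snd_nxt: 2149021074 snd_wnd: 16384 rcv_nxt: 2049196833 rcv_adv: 2049213217"
)
rr = tcp_counters(
    "socket buffer sndcc: 57 rcvcc: 0 TCP state: 4, snd_una: 2049196833 "
    "snd_nxt: 2049196871 snd_wnd: 16384 rcv_nxt: 2149021095 rcv_adv: 2149037458"
)

# Bytes the RR has transmitted that the PE has not acknowledged.
rr_unacked = rr["snd_nxt"] - rr["snd_una"]

# Bytes the RR believes were acknowledged but the PE never received (0 = in sync).
rr_to_pe_mismatch = rr["snd_una"] - pe["rcv_nxt"]

# Bytes the RR received from the PE after the PE logged its counters.
pe_to_rr_delta = rr["rcv_nxt"] - pe["snd_nxt"]

print("RR sent but unacked:", rr_unacked, "bytes =", rr_unacked // KEEPALIVE_LEN, "keepalives")
print("RR send buffer:", rr["sndcc"], "bytes =", rr["sndcc"] // KEEPALIVE_LEN, "keepalives")
print("PE -> RR delta:", pe_to_rr_delta, "bytes (NOTIFICATION is", NOTIFICATION_LEN, ")")
```

Read this way, the PE-to-RR direction looks healthy: the 21-byte difference matches the NOTIFICATION itself. In the other direction the RR shows 57 bytes buffered, 38 of them sent but never acknowledged, which is the size of three and two keepalives respectively. That would be consistent with the RR's keepalives not reaching the PE (or the PE's ACKs not getting back), but it is only one possible reading of these counters.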
Has anyone seen something like this before? We do have a case open regarding a large number of LSA retransmits. TAC is saying this is a bug related to NSR but shouldn't cause any negative impacts. I'm not sure if this is related. I'm considering opening a case for this as well but I'm not very confident I'll get far.
Any help would be appreciated.
Thanks,
Serge