[j-nsp] BGP Hold time expiry

Fri Sep 19 09:36:25 EDT 2008

On Fri, Sep 19, 2008 at 10:20:47AM -0700, Kevin Oberman wrote:
> Looks a lot like an MTU mismatch. BGP does not do PMTU and sets do not
> fragment, so the MTUs need to be the same on both ends. Things like VLAN
> or tunneling can mess this up.
> 
> You can try capturing the traffic to confirm this. I use tcpdump (in the
> shell as root) and move the output file to a box where I can run
> wireshark on it. You will see the big frame being re-transmitted several
> times. This is probably not really required, though.

You can enable path-mtu-discovery for the entire box under set system
internet-options path-mtu-discovery, or under bgp with set protocols bgp
mtu-discovery, but of course that won't help you if you don't have your
MTUs configured correctly on both sides. If your L3 devices are not both
configured to a value which can safely pass between them (and any L2
devices in the middle), fragmentation (or ICMP needfrag) will not
function, thus defeating PMTUD and causing blackholing.

The big confusion that I typically see people running into is that
Juniper and Cisco mean different things when you configure an interface
MTU. Juniper includes all L2 overhead, Cisco does not, so for example a
Juniper with interface mtu of 9192 (max) would only correctly talk to a
Cisco with its L3 interface configured to 9178 (or 9174 if the Juniper
is vlan-tagged, or 9170 if the Juniper is flexible/stacking
vlan-tagged). And of course, under 6500/7600 SVIs, you have to
configure the physical interface to 9216, and then the interface Vlan to
9178/9174/9170 (default is still 1500 even with the physical port mtu
bumped).
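
To make the arithmetic above concrete, here is a rough sketch in Python.
The function and constant names are made up for illustration; the only
assumptions are the usual 14 byte Ethernet header and 4 bytes per 802.1Q
tag, which is where the 9178/9174/9170 numbers come from.

```python
# Sketch only: convert a Juniper interface MTU (which counts L2 overhead)
# into the L3 MTU a Cisco peer would need. Names are hypothetical.

ETHERNET_HEADER = 14  # dst MAC (6) + src MAC (6) + EtherType (2)
VLAN_TAG = 4          # one 802.1Q tag


def cisco_l3_mtu(juniper_mtu, vlan_tags=0):
    """Return the Cisco-style L3 MTU matching a Juniper interface MTU.

    vlan_tags: 0 = untagged, 1 = vlan-tagged, 2 = stacked/flexible tagging.
    """
    if vlan_tags not in (0, 1, 2):
        raise ValueError("vlan_tags must be 0, 1, or 2")
    return juniper_mtu - ETHERNET_HEADER - VLAN_TAG * vlan_tags


if __name__ == "__main__":
    for tags in (0, 1, 2):
        print(tags, cisco_l3_mtu(9192, tags))
```

With a Juniper interface mtu of 9192 this prints 9178, 9174 and 9170 for
zero, one and two tags, matching the values above.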

A quick and dirty test is to force the tcp-mss on the bgp session lower,
say for example with set protocols bgp group blah neighbor x.x.x.x
tcp-mss 536. If this stops the flapping, you probably have an MTU issue. 
You can also ping across the link with the do-not-fragment bit set to
verify these issues, but remember that Cisco and Juniper also disagree
about what ping "size" means. Cisco means it to be the size of the
entire packet, Juniper means it to be the size of the ping payload, so
in the case of IPv4 you would need to subtract 28 (20 bytes IP, 8 bytes
ICMP) from the Cisco size to get the equivalent Juniper "size" param.
Between that and the MTU issue above, Cisco and Juniper have created a
real mess for inter-provider MTU negotiation.
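
The ping "size" conversion can be sketched the same way. This is only
an illustration with hypothetical function names, and it assumes a plain
20 byte IPv4 header with no options.

```python
# Sketch only: translate ping "size" between the Cisco meaning (whole
# IP packet) and the Juniper meaning (ICMP payload only), IPv4 only.

IPV4_HEADER = 20  # assumes no IP options
ICMP_HEADER = 8   # ICMP echo header


def juniper_ping_size(cisco_size):
    """Payload size to give a Juniper for a given Cisco packet size."""
    return cisco_size - IPV4_HEADER - ICMP_HEADER


def cisco_ping_size(juniper_size):
    """Packet size to give a Cisco for a given Juniper payload size."""
    return juniper_size + IPV4_HEADER + ICMP_HEADER


if __name__ == "__main__":
    # A 1500 byte packet on the Cisco side is a 1472 byte payload
    # on the Juniper side.
    print(juniper_ping_size(1500))
```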

Of course it could be any number of other things too, not just MTU. For
example, in my experience Cisco control-plane policing on 6500/7600 is
absolutely horrific at applying fair rate limits. If you do bump into
your CoPP rate-limit (by say, bouncing a bgp session, doing a soft
clear, etc), rather than simply causing tcp to back off and slow down
the transfer of data, more often than not what will happen is one stream
will monopolize the bandwidth and cause the other sessions to not
exchange keepalives. Being careful with said rate limits has resolved
almost all of the problems that initially looked like a poor scheduler
(though not to say that IOS doesn't have a poor scheduler anyways :P). 

Oh and while we're on the subject, am I the only one who is "concerned" 
by Juniper's configurable range of tcp-mss on BGP neighbors?

ras@router# set protocols bgp tcp-mss ?
Possible completions:
  <tcp-mss>            Maximum TCP segment size (1..4096)

Setting a TCP MSS to 1 and then trying to exchange a large amount of
data makes for an excellent DoS, and many operating systems now include
a minimum acceptable MSS setting as protection against this.
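
For a rough sense of scale, here is a back-of-the-envelope sketch. The
1 MB transfer size and the function name are arbitrary choices for
illustration, and it assumes 20 byte IPv4 and 20 byte TCP headers with
no options, ignoring L2 framing, ACKs and retransmits. The real cost is
mostly per-packet processing, which this does not try to model.

```python
# Sketch only: how many segments, and how many bytes of IP traffic,
# it takes to move a payload at a given TCP MSS.
import math

IPV4_HEADER = 20  # assumes no IP options
TCP_HEADER = 20   # assumes no TCP options


def segments_and_wire_bytes(payload_bytes, mss):
    """Return (segment count, total IP bytes) for a payload at this MSS."""
    if mss < 1:
        raise ValueError("mss must be at least 1")
    segments = math.ceil(payload_bytes / mss)
    wire_bytes = payload_bytes + segments * (IPV4_HEADER + TCP_HEADER)
    return segments, wire_bytes


if __name__ == "__main__":
    for mss in (1, 536, 1460):
        segs, wire = segments_and_wire_bytes(1_000_000, mss)
        print(f"mss={mss}: {segs} segments, {wire} bytes")
```

Under those assumptions an MSS of 1 turns 1 MB of data into a million
packets and about 41 MB of IP traffic, versus 1866 packets at an MSS of
536.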

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)