[j-nsp] MPLS-in-MPLS mtu
Jeff S Wheeler
jsw at inconcepts.biz
Mon Apr 16 09:06:17 EDT 2007
On Mon, 2007-04-16 at 08:21 -0400, Jared Mauch wrote:
> There was a presentation years ago at IETF where Cisco showed
> the performance increases from enabling path MTU discovery, and how this
> let BGP message sizes grow from the Cisco default to match your
> link MTU.
I find it interesting that you mention Cisco as an example, since their
TCP stack for BGP sessions is badly broken in many versions of IOS. The
output of `show ip bgp neighbor <ip>` displays the TCP MSS being used
for the BGP session. As an example, on recent 6500 IOS (12.2(18)SXF6 in
many of my environments) you will find that, with BGP MD5 enabled, the TCP
stack reduces the MSS by an extra 20 bytes as a hack to work around
the additional space consumed by the TCP MD5 option. This means an iBGP
session with MD5 has a TCP MSS of 516 (the default 536 less 20), which I
think is illegal!
In older 6500 IOS, for example 12.2(17r)S2, which I have tested for this,
the above TCP MSS adjustment does not happen, and the TCP stack (which I
believe is a separate TCP stack for the BGP process) can actually send
packets which exceed the MTU of the egress interface. For example, on a
VLAN configured for a 1500 byte MTU, it can send > 1500 byte IP
packets, which will be dropped by neighboring Ethernet switches and
counted as giant frames in their error counters.
Neither behavior is correct; the MSS should not be adjusted, and the TCP
stack should decide how much payload to send in the segment based on the
MSS and the size of the TCP header, including options.
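The arithmetic behind that can be sketched roughly as follows. This is only an illustration, assuming IPv4 with no IP options and the 18-byte TCP MD5 signature option padded to a 4-byte boundary; the function names are hypothetical and do not describe how any vendor's stack is actually written.

```python
# Illustrative sketch only: IPv4 without IP options is assumed,
# and all function names here are hypothetical.
IP_HEADER = 20    # IPv4 header, no options
TCP_HEADER = 20   # TCP header, no options
MD5_OPTION = 18   # TCP MD5 signature option: kind + length + 16-byte digest


def padded_options(option_bytes):
    """Round option bytes up to a multiple of 4, as the TCP header requires."""
    return (option_bytes + 3) // 4 * 4


def mss_for_mtu(mtu):
    """MSS derived from the MTU and the fixed headers only, ignoring options."""
    return mtu - IP_HEADER - TCP_HEADER


def max_payload(mss, option_bytes=0):
    """Payload the sender should place in a segment: MSS minus TCP options."""
    return mss - padded_options(option_bytes)


def ip_packet_size(payload, option_bytes=0):
    """Total IP packet size for a segment carrying `payload` bytes of data."""
    return IP_HEADER + TCP_HEADER + padded_options(option_bytes) + payload


mss = mss_for_mtu(1500)
# Sender accounts for the MD5 option: the packet fits the 1500 byte MTU.
fits = ip_packet_size(max_payload(mss, MD5_OPTION), MD5_OPTION)
# Sender fills the whole MSS with data and adds the option on top: too big.
oversized = ip_packet_size(mss, MD5_OPTION)
print(mss, fits, oversized)
```

With a 1500 byte MTU the MSS stays at 1460; only the payload per segment shrinks when the MD5 option is present. Filling the full MSS with data and then adding the option is what pushes the IP packet past the interface MTU.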
> I'm advocating a consistent internal MTU for your network, be that
> 1500, 4470, or something larger. If the underlying transport does not
> support it and you are dealing with broken host stacks from your vendors
> then you should discontinue using their equipment until they
> repair these critical defects.
I agree with you that a bigger MTU can be beneficial (there is a
parallel thread on the nanog mailing list) but that doesn't mean it's a
safe idea to take advantage of it with a bigger MSS for iBGP sessions.
I also agree that a consistent MTU for core links is a good idea, but
many technologies exist (such as RSVP MTU signaling, path MTU discovery,
IP fragmentation, transparent layer-3 devices that adjust TCP MSS, etc.)
because it is not always practical. To give another Cisco example: if I
recall correctly, the 3550 series will switch (layer-2) jumbo frames on
GE ports, but will not route (layer-3) jumbo IP packets on VLAN SVIs. It
is not very practical to throw away all the 3550s you might be using
because they won't work with the MTU you have decided must be uniformly
supported across your whole network.
> Scaling your bgp update messages to something larger than 500 bytes
> can have a significant win in route convergence as we're all carrying
> voip and other similar sensitive traffic on our networks (even if we don't
> know what all that sensitive traffic is).
If Cisco were half as good at software / control plane as they are at
hardware / data plane, I doubt they would give a presentation on why
raising the MSS on your TCP sessions (to a value which is not guaranteed
to be safe on all media capable of forwarding IP packets) could be a
"significant win."
To look at it from another perspective, the TCP MD5 fix that got applied
in the 6500 IOS software sometime between 12.2(17r)S2 and 12.2(18)SXF6
was really done because eBGP sessions were breaking; iBGP sessions are
assumed to be over a 576 byte MTU path by default, and it's extremely rare
that the path MTU is really that small, so iBGP sessions weren't breaking.
Yet the
"fix" Cisco did reduces the TCP MSS for both eBGP and iBGP sessions.
Do you think they had a conversation about how fixing the bug in this
manner, instead of correcting the broken behavior of their TCP stack,
would have an adverse impact on convergence time? Maybe they did, but
it obviously did not change their decision about how to "fix" it.
In any event, when changing the TCP MSS of iBGP sessions, it's important
to be aware of the implications of that decision. The MTU you use to
pick your MSS should be the minimum MTU on any link that might be used
to transport your BGP session.
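A minimal sketch of that rule, assuming IPv4 with no IP options. The link names and MTU values are made up for illustration, and `safe_mss` is a hypothetical helper, not a command or API from any router.

```python
# Illustrative sketch only: hypothetical link names and MTU values.
IP_HEADER = 20   # IPv4 header, no options
TCP_HEADER = 20  # TCP header, no options


def safe_mss(link_mtus):
    """Return an MSS based on the smallest MTU among the candidate links.

    `link_mtus` maps a link name to its IP MTU in bytes and should include
    every link the session might cross, backup paths included.
    """
    if not link_mtus:
        raise ValueError("at least one link MTU is required")
    return min(link_mtus.values()) - IP_HEADER - TCP_HEADER


# Hypothetical topology: two jumbo-capable core links, one 1500 byte backup.
links = {"core-a": 9000, "core-b": 4470, "backup-vlan": 1500}
print(safe_mss(links))
```

The backup path matters as much as the primary: if the session can reroute over the 1500 byte link, an MSS picked from the 9000 byte link will break exactly when the network is already in trouble.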
--
Jeff S Wheeler <jsw at inconcepts.biz> +1-212-981-0607
Sr Network Operator / Innovative Network Concepts