[c-nsp] MTU nightmare
Justin Shore
justin at justinshore.com
Wed Mar 11 20:59:15 EDT 2009
The LDP and BGP flapping from my recent post turns out to be caused by MTU
errors. The problem just happened early one morning. No config changes
were made; it just broke. RANCID confirms that no changes were made for
12hrs before and almost as many after the problem first happened.
I've been working with TAC on the issue but they seem to be stuck on
the fiber media converters I'm using (which I'm forced to use due to a lack of
single-strand optics from Cisco that go farther than 10k). I have
several dozen of these same MCs spread out across my network and almost
all of them carry links with MTUs of 9000 or 9216. The link in question
worked fine for several weeks, nearly 2 months, before it suddenly broke
last week. We've gone through more debugging than I can recall. We've
looked for frames with ELAM on my 7600s and dove into the DFC to look at
the registers on the Rohini. We haven't found anything at this point.
My layout is fairly simple. I have a pair of 7613s in the core with
6748s (and DFCs). One of the 6748s connects to a Versitron MC, crosses
several miles of our fiber, out another MC and into a 7201 (Gi0/0,
non-PCI bus interface). Interface MTUs are set to 9000 on both sides.
IP and MPLS MTUs were default originally when it was working (both
follow interface MTUs). CLNS MTU was 1496 to keep IS-IS from getting
pissy with me.
Up until tonight neither the 7201 nor the 7613 could ping the remote IP on the
directly-connected subnet with packets larger than 1500 bytes. 1501
fails. This is without df set so it should frag it. Dropping the MTU
back down to 1500 fixed pings including fragging packets but of course
that hurts other MPLS things. Neither box can hit the other's loopback
either. This evening I came out to this CO to disconnect from the MC
and connect directly to the 7201. I'm using a 2821 so I can up the MTU
on the onboard ports. I did that and ICMPs passed fine all the way up
to 9000. I reconnected to the MC and I can again ping the 7613 but not
constantly. If I start pinging at 9000 it works about half the time.
If I use 1600 it works maybe 95% of the time. That holds true all the
way down to 1501. The first packet or 2 are usually dropped for some
reason. Currently IP MTU on the 7613 is set to 1500 with interface set
to 9000 and MPLS MTU set to 1524. This was part of the testing
left-over from several hours on the phone today with TAC. At this
point I would say that I think the problem is the media converters BUT
when I remove the IP MTU from the 7613 all ICMPs greater than 1500 fail.
Clearly this MTU problem is not just the MCs but has to also be the
7613. Doesn't it?
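For anyone following the numbers, here is a minimal Python sketch of how the IP MTU and MPLS MTU values above relate. It assumes the usual 4 bytes per MPLS label stack entry; the function names are made up for illustration and are not anything from IOS.

```python
# Sketch only: relates IP MTU and MPLS MTU, assuming 4 bytes per label.
MPLS_LABEL_BYTES = 4  # size of one MPLS label stack entry


def labels_that_fit(ip_mtu: int, mpls_mtu: int) -> int:
    """Labels that can be stacked on a full-size IP packet without
    exceeding the MPLS MTU."""
    if mpls_mtu < ip_mtu:
        return 0
    return (mpls_mtu - ip_mtu) // MPLS_LABEL_BYTES


def needed_mpls_mtu(ip_mtu: int, labels: int) -> int:
    """MPLS MTU required to carry a full-size IP packet plus labels."""
    return ip_mtu + labels * MPLS_LABEL_BYTES


if __name__ == "__main__":
    # Values currently on the 7613: IP MTU 1500, MPLS MTU 1524.
    print(labels_that_fit(1500, 1524))  # 6
    # Hypothetical case: a 2-label stack on a 1500-byte IP packet.
    print(needed_mpls_mtu(1500, 2))     # 1508
```

So the 1524 left over from testing leaves room for six labels on a 1500-byte IP packet, under that 4-byte assumption.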
This has been a huge mind-bender for me. I can't for the life of me
figure out what's going on here. The worst part is that it's not 100%
consistent. Yesterday with a slightly more intelligent pair of MCs I
could send 1508 but not 1509. I swapped them last night with a known
good pair of completely dumb MCs (no settings, switches, etc at all).
These MCs are purely FIFO compared to the ones that are 2-port switches.
I have several of these dumb ones in my network. I have 8 of them
back to back spanning about 120mi carrying frames of 9000 bytes without
any problems. They just work.
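The kind of size sweep that finds a boundary like 1508/1509 can be sketched as a binary search. This is only an illustration under assumptions: `probe` is a hypothetical callable that would wrap whatever ping mechanism is at hand and return True on a reply, a size counts as passing if any of a few attempts gets through (since the first packet or two are often dropped), and the search assumes a clean threshold, which the intermittent results above may well violate. The simulated probe is made-up data that mirrors the 1508/1509 observation, not a measurement.

```python
# Sketch only: binary search for the largest packet size that passes.
from typing import Callable, Optional


def passes(probe: Callable[[int], bool], size: int, attempts: int = 5) -> bool:
    """Treat a size as passing if any one of several probes succeeds."""
    return any(probe(size) for _ in range(attempts))


def largest_passing_size(probe: Callable[[int], bool],
                         low: int = 1500,
                         high: int = 9000,
                         attempts: int = 5) -> Optional[int]:
    """Return the largest size in [low, high] that passes, or None if
    even the smallest size fails. Assumes every size at or below the
    threshold passes and every size above it fails."""
    if not passes(probe, low, attempts):
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if passes(probe, mid, attempts):
            low = mid
        else:
            high = mid - 1
    return low


if __name__ == "__main__":
    # Hypothetical link that silently drops anything above 1508 bytes.
    simulated = lambda size: size <= 1508
    print(largest_passing_size(simulated))  # 1508
```

With loss as inconsistent as what I'm seeing, a plain sweep that records the success rate per size would probably tell more than a single threshold.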
Does anyone have any ideas? WAGs would even be welcomed at this point.
I don't know if the problem is the 7201 or 7613 or the 7613's code
(SRB1). I'm ready to bump the case to P1 but I'm afraid the finger will
still be pointed at the MCs and I can't prove it's not them.
Confused,
Justin