[c-nsp] MTU nightmare

Justin Shore justin at justinshore.com
Wed Mar 11 20:59:15 EDT 2009


The LDP and BGP flapping from my recent post turns out to be caused by MTU 
errors.  The problem just happened early one morning.  No config changes 
were made; it just broke.  RANCID confirms that no changes were made for 
12 hours before and almost as many after the problem first happened.

I've been working with TAC on the issue but they seem to be stuck on 
the fiber media converters I'm using (which I'm forced to use due to a 
lack of single-strand optics from Cisco that go farther than 10km).  I have 
several dozen of these same MCs spread out across my network and almost 
all of them carry links with MTUs of 9000 or 9216.  The link in question 
worked fine for several weeks, nearly 2 months, before it suddenly broke 
last week.  We've gone through more debugging than I can recall.  We've 
looked for frames with ELAM on my 7600s and dove into the DFC to look at 
the registers on the Rohini.  We haven't found anything at this point.

My layout is fairly simple.  I have a pair of 7613s in the core with 
6748s (and DFCs).  One of the 6748s connects to a Versitron MC, crosses 
several miles of our fiber, out another MC and into a 7201 (Gi0/0, 
non-PCI bus interface).  Interface MTUs are set to 9000 on both sides. 
IP and MPLS MTUs were default originally when it was working (both 
follow interface MTUs).  CLNS MTU was 1496 to keep IS-IS from getting 
pissy with me.

Up until tonight, neither the 7201 nor the 7613 could ping the remote IP on 
the directly connected subnet with packets larger than 1500 bytes.  1501 
fails.  This is without DF set, so it should fragment.  Dropping the MTU 
back down to 1500 fixed pings, including ones that need fragmenting, but of 
course that hurts other MPLS things.  Neither box could hit the other's loopback 
either.  This evening I came out to this CO to disconnect from the MC 
and connect directly to the 7201.  I'm using a 2821 so I can up the MTU 
on the onboard ports.  I did that and ICMPs passed fine all the way up 
to 9000.  I reconnected to the MC and I can again ping the 7613 but not 
consistently.  If I start pinging at 9000 it works about half the time. 
If I use 1600 it works maybe 95% of the time.  That holds true all the 
way down to 1501.  The first packet or 2 are usually dropped for some 
reason.  Currently IP MTU on the 7613 is set to 1500 with interface set 
to 9000 and MPLS MTU set to 1524.  This is left over from the testing 
done during several hours on the phone today with TAC.  At this 
point I would say that I think the problem is the media converters BUT 
when I remove the IP MTU setting from the 7613 all ICMPs greater than 1500 
fail.  Clearly this MTU problem is not just the MCs; the 7613 has to be 
involved as well.  Doesn't it?
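
For anyone following the ping sizes above, here is a minimal Python sketch of the IPv4 fragmentation that should happen when DF is not set.  It is an illustration only, not output from any of the routers: the function name `fragment_sizes` is hypothetical, and it assumes a 20-byte IP header with no options and that the ping size is the full IP datagram length.

```python
def fragment_sizes(total_len, ip_mtu, header_len=20):
    """Return the total length (header + payload) of each IPv4 fragment.

    Hypothetical helper for illustration; assumes a fixed header_len
    (no IP options) on the original datagram and on every fragment.
    """
    if total_len <= ip_mtu:
        # Fits in one packet, no fragmentation needed.
        return [total_len]

    # Every fragment except the last must carry a payload that is a
    # multiple of 8 bytes, because the fragment offset is in 8-byte units.
    max_payload = (ip_mtu - header_len) // 8 * 8

    payload = total_len - header_len
    sizes = []
    while payload > max_payload:
        sizes.append(header_len + max_payload)
        payload -= max_payload
    sizes.append(header_len + payload)  # final, possibly short, fragment
    return sizes


if __name__ == "__main__":
    for size in (1500, 1501, 9000):
        print(size, fragment_sizes(size, ip_mtu=1500))
```

With an IP MTU of 1500, a 1501-byte datagram should go out as a 1500-byte fragment followed by a 21-byte fragment.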

This has been a huge mind-bender for me.  I can't for the life of me 
figure out what's going on here.  The worst part is that it's not 100% 
consistent.  Yesterday with a slightly more intelligent pair of MCs I 
could send 1508 but not 1509.  I swapped them out last night for a 
known-good pair of completely dumb MCs (no settings, switches, etc. at 
all).  These MCs are purely FIFO, unlike the others, which are 2-port 
switches.  I have several of these dumb ones in my network.  I have 8 of 
them back to back spanning about 120mi carrying frames of 9000 bytes without 
any problems.  They just work.
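
The 1508-versus-1509 boundary is the kind of thing a binary search finds quickly.  Below is a rough Python sketch, assuming a hypothetical `probe(size)` callable that returns True when a ping of that size gets through; it does not invoke any real router CLI.  It also assumes that every size below the boundary passes and every size above it fails, which does not hold for intermittent loss like the roughly 50% success at 9000, so in practice `probe` would need to send several pings per size.

```python
def largest_passing_size(probe, low=1500, high=9000):
    """Return the largest size in [low, high] for which probe(size) is True.

    probe is a hypothetical callable supplied by the caller.  Assumes the
    result is monotonic: sizes up to some limit pass, larger ones fail.
    Returns None if even the smallest size fails.
    """
    if not probe(low):
        return None
    while low < high:
        # Round up so the loop always makes progress when probe(mid) passes.
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # mid passes, the limit is at or above mid
        else:
            high = mid - 1   # mid fails, the limit is below mid
    return low


if __name__ == "__main__":
    # Simulated link that silently drops anything larger than 1508 bytes.
    print(largest_passing_size(lambda size: size <= 1508))
```

Searching 1500 to 9000 this way takes about 13 probes instead of stepping through sizes one at a time.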

Does anyone have any ideas?  Even WAGs would be welcome at this point. 
I don't know if the problem is the 7201, the 7613, or the 7613's code 
(SRB1).  I'm ready to bump the case to P1 but I'm afraid the finger will 
still be pointed at the MCs and I can't prove it's not them.

Confused,
  Justin

