[c-nsp] MTU nightmare

Bryan Campbell bbc at misn.com
Thu Mar 12 01:20:45 EDT 2009


O.K. WAG-ing away . . .

If you are willing to defend the MC hardware, so be it.  But, the first
thing you should do is look for a microbend in a fiber jumper.  Replace
all the jumpers with known good and clean jumpers.  Second, replace all
the copper cables with known good cables.  Make sure that you have
certification data for all the cables/jumpers.  You have to be able to
trust your cabling and jumpers.

Third, throw a switch in that can handle the traffic from the MC to the
router.  Mirror out the data to a workstation running Wireshark or
something of the like.  Capture your BGP data.  Do the same on the other
end of the link.  You should be able to isolate the problem this way.
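
One way to compare the two captures, as a rough sketch only: it assumes
you have already exported one record per frame (IP ID and frame length)
from each capture into a Python list.  The Frame record and both
function names are made up for illustration and are not part of
Wireshark.  Matching on IP ID plus length is a simplification;
fragments share an IP ID, so fragmented traffic needs a better key.

```python
from collections import namedtuple

# Hypothetical record: one row per captured frame, exported from each capture.
Frame = namedtuple("Frame", ["ip_id", "length"])


def lost_in_transit(near_end, far_end):
    """Return frames seen at the near end that never showed up at the far end."""
    arrived = {(f.ip_id, f.length) for f in far_end}
    return [f for f in near_end if (f.ip_id, f.length) not in arrived]


def loss_by_size(near_end, far_end, boundary=1500):
    """Count [sent, lost] frames at or below the boundary and above it."""
    lost = set(lost_in_transit(near_end, far_end))
    summary = {"small": [0, 0], "large": [0, 0]}  # [sent, lost]
    for f in near_end:
        bucket = "small" if f.length <= boundary else "large"
        summary[bucket][0] += 1
        if f in lost:
            summary[bucket][1] += 1
    return summary
```

If the losses all land in the "large" bucket, the problem follows frame
size rather than the BGP session itself.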

If Cisco can't find anything wrong with the configs, it has to be a
physical part.  Interfaces, media converters, cables/jumpers are the
first place to look.  It is not very common for Cisco parts to just up
and fail with no warning.  If Cisco stuff fails, it usually goes down in
a blaze of glory with a pile of data left behind to inspect.   

FWIW, every time we have had a failure of this nature it has been a
media converter failure, or a microbend in a fiber jumper.  

bbc at misn.com


On Wed, 2009-03-11 at 19:59 -0500, Justin Shore wrote:
> My recent post about LDP and BGP flapping turns out to be caused by MTU 
> errors.  The problem just happened early one morning.  No config changes 
> were made; it just broke.  RANCID confirms that no changes were made for 
> 12hrs before and almost as many after the problem first happened.
> 
> I've been working with TAC on the issue but they seem to be stuck on 
> the fiber media converters I'm using (I'm forced to use due to a lack of 
> single-strand optics from Cisco that go farther than 10k).  I have 
> several dozen of these same MCs spread out across my network and almost 
> all of them carry links with MTUs of 9000 or 9216.  The link in question 
> worked fine for several weeks, nearly 2 months, before it suddenly broke 
> last week.  We've gone through more debugging than I can recall.  We've 
> looked for frames with elam on my 7600s and dove into the DFC to look at 
> the registers on the rohini.  We haven't found anything at this point.
> 
> My layout is fairly simple.  I have a pair of 7613s in the core with 
> 6748s (and DFCs).  One of the 6748s connects to a Versitron MC, crosses 
> several miles of our fiber, out another MC and into a 7201 (Gi0/0, 
> non-PCI bus interface).  Interface MTUs are set to 9000 on both sides. 
> IP and MPLS MTUs were default originally when it was working (both 
> follow interface MTUs).  CLNS MTU was 1496 to keep IS-IS from getting 
> pissy with me.
> 
> Up until tonight neither the 7201 nor the 7613 could ping the remote IP on the 
> directly-connected subnet with packets larger than 1500 bytes.  1501 
> fails.  This is without df set so it should frag it.  Dropping the MTU 
> back down to 1500 fixed pings including fragging packets but of course 
> that hurts other MPLS things.  Neither box can hit the other's loopback 
> either.  This evening I came out to this CO to disconnect from the MC 
> and connect directly to the 7201.  I'm using a 2821 so I can up the MTU 
> on the onboard ports.  I did that and ICMPs passed fine all the way up 
> to 9000.  I reconnected to the MC and I can again ping the 7613 but not 
> consistently.  If I start pinging at 9000 it works about half the time. 
> If I use 1600 it works maybe 95% of the time.  That holds true all the 
> way down to 1501.  The first packet or 2 are usually dropped for some 
> reason.  Currently IP MTU on the 7613 is set to 1500 with interface set 
> to 9000 and MPLS MTU set to 1524.  This was part of the testing 
> left-over from several hours on the phone today with TAC.  At this 
> point I would say that I think the problem is the media converters BUT 
> when I remove the IP MTU from the 7613 all ICMPs greater than 1500 fail. 
>   Clearly this MTU problem is not just the MCs but has to also be the 
> 7613.  Doesn't it?
> 
> This has been a huge mind-bender for me.  I can't for the life of me 
> figure out what's going on here.  The worst part is that it's not 100% 
> consistent.  Yesterday with a slightly more intelligent pair of MCs I 
> could send 1508 but not 1509.  I swapped them last night with a known 
> good pair of completely dumb MCs (no settings, switches, etc at all). 
> These MCs are purely FIFO compared to the ones that are 2-port switches. 
>   I have several of these dumb ones in my network.  I have 8 of them 
> back to back spanning about 120mi carrying frames of 9000 bytes without 
> any problems.  They just work.
> 
> Does anyone have any ideas?  WAGs would even be welcomed at this point. 
>   I don't know if the problem is the 7201 or 7613 or the 7613's code 
> (SRB1).  I'm ready to bump the case to P1 but I'm afraid the finger will 
> still be pointed at the MCs and I can't prove it's not them.
> 
> Confused,
>   Justin
> _______________________________________________
> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/
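
The "1508 but not 1509" behaviour in the quoted message can be pinned
down with a size sweep.  The following is a minimal sketch, not a
finished tool: probe is a hypothetical callable you supply that sends
one ping of the given size and returns True on a reply.  Each size is
tried several times because the loss described above is intermittent,
and the search assumes that once a size fails, every larger size fails
too, which may not hold on this link.

```python
def pass_rate(probe, size, tries=10):
    """Fraction of `tries` probes of the given size that got a reply."""
    ok = sum(1 for _ in range(tries) if probe(size))
    return ok / tries


def largest_passing_size(probe, low, high, tries=10, threshold=0.9):
    """Binary search [low, high] for the largest size meeting the threshold.

    Returns None if even `low` does not pass.
    """
    if pass_rate(probe, low, tries) < threshold:
        return None
    while low < high:
        mid = (low + high + 1) // 2  # round up so the loop always shrinks
        if pass_rate(probe, mid, tries) >= threshold:
            low = mid
        else:
            high = mid - 1
    return low
```

Running the same sweep with the MCs in the path and then bypassed shows
whether the cutoff moves with the MCs or stays with the routers.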


