[c-nsp] Ethernet Freezeup

Andre Beck cisco-nsp at ibh.net
Mon Apr 7 12:04:28 EDT 2008


Hi Ed,

On Mon, Apr 07, 2008 at 10:10:38AM -0400, Ed Ravin wrote:
> On Mon, Apr 07, 2008 at 03:28:12PM +0200, Andre Beck wrote:
> > Sadly I've come to know this bug in recent months as well.
> ...
> > I was seeing this with a 7206/IO-FE that *has* other interfaces, though
> > what seemed to trigger it there was indeed single-armed routed traffic.
> ...
> > > Any thoughts about what might be going on in the innards of the IOS,
> > > and how to troubleshoot or prevent recurrence?
> > 
> > Ed, did you find a solution (other than going to an NPE-G1/2 or NPE-400)
> > or workaround? Is anyone else here on c-nsp still using these good old
> > chassis and able to offer advice?
> 
> I was seeing the problem in two routers - first in a 1750 with IOS
> 12.2.something, and then later on in a 7204 / NPE-225 non-VXR.  Both
> routers were using "router-on-a-stick" configurations.  We were able
> to get a close look with the sniffer at the 7204 in the "stuck" state:
> it was still sending ARP requests, OSPF Hellos, and HSRP UDP traffic, but
> apparently not "seeing" any received packets.  The latter was especially
> painful since the router's OSPF neighbors noticed nothing wrong and
> dutifully routed traffic to the zombie router, and since the zombie was
> still sending out HSRP packets, the backup router saw no reason to
> step in and take over the virtual IP address.

Exactly the same thing here. HSRP not failing over is especially bad in
my case, since failover paths exist but never get used.
 
> 11 weeks ago, I replaced the 1750 with a 1720 that had IOS 12.3(24a).
> I was originally planning to do just an IOS upgrade but the router
> was exhibiting some flaky behavior (would freeze up completely if I
> unplugged the console or aux port cable).  We've had no problems with
> the new router since then.  The old 1750 is still in use, with the same
> IOS, but it has been demoted to being a console server for the new
> router in case the problem returns.
> 
> 4 weeks ago, I also upgraded the 7204 to IOS 12.3(24a).  No problems
> since.

Interesting. I've searched a bit in the Bug Toolkit, but didn't find
anything conclusive.
 
> I don't know whether the bug is quenched with the new IOS - this is
> definitely an improvement, but we've had similar quiet periods before.
> If I don't see it for another 2-3 months, then I might declare victory.

How well I know this. The last change here was swapping power supplies,
and now it's back to waiting. But given your experience, it's probably
not the power supplies at all...
 
> We did find a workaround.  We set up a cron job to run every 3 minutes
> on a Unix host that had RANCID installed.  The job would try to ping
> the problem router, and if it didn't respond, it would tell RANCID to
> log in to the console port and issue a "clear int FastEthernet0" (or
> Faste0/0 in the case of the 7204).  That dirty trick worked remarkably
> well.  Of course, you need a console server that can be reached by
> the host running RANCID.

I thought about this, but not currently having a RANCID host on the right
side of the box (where it is still reachable) was a showstopper.
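
For reference, the watchdog Ed describes could be sketched roughly like
this in Python. This is only a sketch under assumptions, not his actual
script: the function names are made up, the ping flags are the Linux
iputils ones, and the recovery command is left as a parameter because
how you reach the console (RANCID, a console server, something else)
is site-specific.

```python
import subprocess
import sys


def host_is_up(host, count=3, timeout_s=2, run=subprocess.run):
    """Return True if at least one ICMP echo to `host` is answered.

    Assumes Linux iputils ping: -c is the packet count, -W the
    per-reply timeout in seconds. Other platforms use other flags.
    """
    result = run(
        ["ping", "-c", str(count), "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watchdog(host, recovery_cmd, is_up=host_is_up, run=subprocess.run):
    """Check the router once; if it is silent, run the recovery command.

    `recovery_cmd` is an argument list for whatever logs in to the
    console and issues the "clear int" (hypothetical, site-specific).
    Returns True if the recovery command was triggered.
    """
    if is_up(host):
        return False
    run(recovery_cmd, check=False)
    return True


# Only act when called as: script ROUTER_IP RECOVERY_COMMAND [ARGS...]
if __name__ == "__main__" and len(sys.argv) >= 3:
    watchdog(sys.argv[1], sys.argv[2:])
```

Run from cron every 3 minutes, with the router address and your own
recovery command as arguments, this gives the same check-then-clear
behaviour. Keeping the recovery step as an external command means the
console login details stay out of the script.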
 
> With a recent enough IOS, I suspect you could script a similar workaround
> on the router itself, using object tracking and/or the TCL capability.

OMG.

Thanks for this hint - I just put together something with IP SLA, object
tracking and EEM that might just do it. Let's see...

Thanks,
Andre.
-- 
   Real men don't make backups of their mail. They just send it out
    on the Internet and let the secret services do the hard work.

-> Andre Beck    +++ ABP-RIPE +++      IBH IT-Service GmbH, Dresden <-

