[c-nsp] Ethernet Freezeup
Andre Beck
cisco-nsp at ibh.net
Mon Apr 7 12:04:28 EDT 2008
Hi Ed,
On Mon, Apr 07, 2008 at 10:10:38AM -0400, Ed Ravin wrote:
> On Mon, Apr 07, 2008 at 03:28:12PM +0200, Andre Beck wrote:
> > Sadly I've come to know this bug in the last few months as well.
> ...
> > I was seeing this with a 7206/IO-FE that *has* other interfaces, though
> > what seemed to trigger it there was indeed single-armed routed traffic.
> ...
> > > Any thoughts about what might be going on in the innards of the IOS,
> > > and how to troubleshoot or prevent recurrence?
> >
> > Ed, did you find a solution (other than going to a NPE-G1/2 or NPE-400)
> > or workaround? Anyone else here on c-nsp still using these good old
> > chassis and having advice?
>
> I was seeing the problem in two routers - first in a 1750 with IOS
> 12.2.something, and then later on in a 7204 / NPE-225 non-VXR. Both
> routers were using "router-on-a-stick" configurations. We were able
> to get a close look with the sniffer at the 7204 in the "stuck" state:
> it was still sending ARP requests, OSPF Hellos, and HSRP UDP traffic, but
> apparently not "seeing" any received packets. The latter was especially
> painful since the router's OSPF neighbors noticed nothing wrong and
> dutifully routed traffic to the zombie router, and since the zombie was
> still sending out HSRP packets, the backup router saw no reason to
> step in and take over the virtual IP address.
Exactly the same thing here. HSRP failing here is especially bad, since
there would be failover paths, but they aren't used.
> 11 weeks ago, I replaced the 1750 with a 1720 that had IOS 12.3(24a).
> I was originally planning to do just an IOS upgrade but the router
> was exhibiting some flaky behavior (would freeze up completely if I
> unplugged the console or aux port cable). We've had no problems with
> the new router since then. The old 1750 is still in use, with the same
> IOS, but it has been demoted to being a console server for the new
> router in case the problem returns.
>
> 4 weeks ago, I also upgraded the 7204 to IOS 12.3(24a). No problems
> since.
Interesting. I've searched a bit in the Bug Toolkit, but didn't find
anything conclusive.
> I don't know whether the bug is quenched with the new IOS - this is
> definitely an improvement, but we've had similar quiet periods before.
> If I don't see it for another 2-3 months, then I might declare victory.
I know that feeling. The last change here was swapping power supplies,
and now it's waiting again. But given your experience, it's probably
not the power supplies at all...
> We did find a workaround. We set up a cron job to run every 3 minutes
> on a Unix host that had RANCID installed. The job would try to ping
> the problem router, and if it didn't respond, it would tell RANCID to
> log in to the console port and issue a "clear int FastEthernet0" (or
> Faste0/0 in the case of the 7204). That dirty trick worked remarkably
> well. Of course, you need a console server that can be reached by
> the host running RANCID.
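For the archive, a minimal sketch of the ping-and-recover logic described
above, in Python. This is not Ed's actual script. The recovery command is
a placeholder for whatever logs in to the console port and issues the
"clear int" (the RANCID step is site-specific and not shown), and the
function names are made up for illustration. It assumes a Linux/BSD
"ping" that takes "-c" and exits non-zero when no reply arrives.

```python
import subprocess


def is_reachable(host, count=3, run=subprocess.run):
    """Return True if ping exits 0, i.e. at least one echo reply came back."""
    # Assumption: "-c <count>" is accepted, as on Linux and BSD ping.
    result = run(
        ["ping", "-c", str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def watchdog(host, recovery_cmd, run=subprocess.run):
    """Ping the router once; if it is silent, run the recovery command.

    recovery_cmd is an argv list for a hypothetical helper that reaches
    the console port and clears the interface. Returns True if the
    recovery command was invoked, False if the router answered.
    """
    if is_reachable(host, run=run):
        return False
    run(recovery_cmd)
    return True
```

Wrapped in a small script that calls watchdog() with the router address
and the site's recovery command, this would be run from cron every 3
minutes as described. The "run" parameter exists only so the logic can
be exercised without sending real pings.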
I thought about this, but not currently having a RANCID host on the
right side of the box (the side where it is still reachable) was a
showstopper.
> With a recent enough IOS, I suspect you could script a similar workaround
> on the router itself, using object tracking and/or the TCL capability.
OMG.
Thanks for this hint - I just put together something with SLA, tracking
and EEM that might possibly do it. Let's see...
Thanks,
Andre.
--
Real men don't make backups of their mail. They just send it out
on the Internet and let the secret services do the hard work.
-> Andre Beck +++ ABP-RIPE +++ IBH IT-Service GmbH, Dresden <-