[c-nsp] Ethernet Freezeup

Mon Apr 7 10:10:38 EDT 2008

The story so far:
On Sat, Jul 15, 2006 at 05:23:20PM -0400, Ed Ravin wrote:
> A few times on this list, people have discussed how a Cisco 1700 series
> router can suddenly "freeze up" on its main Ethernet interface.  The
> problem as I've observed it hits routers that have a single Ethernet
> interface (and no other interfaces in use).  The symptom is that the router
> no longer receives traffic on the Ethernet - it still transmits ARP requests
> and retries of routing protocol packets, but nothing is received.  Getting to
> the console of the router and issuing "clear int faste0" always fixes the
> problem.

And then:
On Mon, Apr 07, 2008 at 03:28:12PM +0200, Andre Beck wrote:
> Sadly I've come to know this bug in the last months as well.
...
> I was seeing this with a 7206/IO-FE that *has* other interfaces, though
> what seemed to trigger it there was indeed single-armed routed traffic.
...
> 
> > Any thoughts about what might be going on in the innards of the IOS,
> > and how to troubleshoot or prevent recurrence?
> 
> Ed, did you find a solution (other than going to a NPE-G1/2 or NPE-400)
> or workaround? Anyone else here on c-nsp still using these good old
> chassis and having advice?

I was seeing the problem in two routers - first in a 1750 with IOS
12.2.something, and then later on in a 7204 / NPE-225 non-VXR.  Both
routers were using "router-on-a-stick" configurations.  We were able
to get a close look with the sniffer at the 7204 in the "stuck" state:
it was still sending ARP requests, OSPF Hellos, and HSRP UDP traffic, but
apparently not "seeing" any received packets.  That it kept transmitting
was especially painful: the router's OSPF neighbors noticed nothing wrong
and dutifully routed traffic to the zombie router, and since the zombie
was still sending out HSRP packets, the backup router saw no reason to
step in and take over the virtual IP address.

11 weeks ago, I replaced the 1750 with a 1720 that had IOS 12.3(24a).
I was originally planning to do just an IOS upgrade, but the router
was exhibiting some flaky behavior (it would freeze up completely if I
unplugged the console or aux port cable).  We've had no problems with
the new router since then.  The old 1750 is still in use, with the same
IOS, but it has been demoted to being a console server for the new
router in case the problem returns.

4 weeks ago, I also upgraded the 7204 to IOS 12.3(24a).  No problems
since.

I don't know whether the bug is quenched with the new IOS - this is
definitely an improvement, but we've had similar quiet periods before.
If I don't see it for another 2-3 months, then I might declare victory.

We did find a workaround.  We set up a cron job to run every 3 minutes
on a Unix host that had RANCID installed.  The job would try to ping
the problem router, and if it didn't respond, it would tell RANCID to
log in to the console port and issue a "clear int FastEthernet0" (or
Faste0/0 in the case of the 7204).  That dirty trick worked remarkably
well.  Of course, you need a console server that can be reached by
the host running RANCID.
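
For anyone who wants to try the same trick, here is a minimal sketch of
that watchdog in Python.  It is not the script we actually ran.  It
assumes Linux-style ping flags (-c and -W), that RANCID's clogin is on
the PATH, and that .cloginrc already has an entry for the console server
port.  The function names and the three command-line arguments are made
up for illustration.

```python
#!/usr/bin/env python3
"""Watchdog sketch: ping a router; if it is silent, clear its Ethernet
interface through the console using RANCID's clogin."""
import subprocess
import sys


def router_is_up(host, run=subprocess.run):
    """Return True if at least one of three pings gets a reply."""
    # Assumption: Linux iputils ping (-c count, -W reply timeout in seconds).
    result = run(["ping", "-c", "3", "-W", "2", host],
                 stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def clear_interface(console, interface, run=subprocess.run):
    """Ask clogin to run 'clear interface <interface>' via the console entry."""
    # Assumption: 'console' is a name that .cloginrc knows how to log in to
    # (for example, a console-server port wired to the router).
    result = run(["clogin", "-c", "clear interface " + interface, console])
    return result.returncode == 0


def check_and_recover(router, console, interface, run=subprocess.run):
    """Return 'up', 'cleared', or 'clear-failed' for one watchdog pass."""
    if router_is_up(router, run=run):
        return "up"
    if clear_interface(console, interface, run=run):
        return "cleared"
    return "clear-failed"


# Runs only when called with exactly: router console interface
if __name__ == "__main__" and len(sys.argv) == 4:
    print(check_and_recover(sys.argv[1], sys.argv[2], sys.argv[3]))
```

Run it from cron every 3 minutes (a "*/3 * * * *" schedule) with the
router address, the console entry, and the interface name (FastEthernet0
or FastEthernet0/0) as arguments.  The "run" parameter exists only so
the logic can be exercised without touching a real router.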

With a recent enough IOS, I suspect you could script a similar workaround
on the router itself, using object tracking and/or the Tcl capability.

	-- Ed