[nsp] 6500 stops ARPing

From: Matt Buford (matt@overloaded.net)
Date: Fri May 31 2002 - 03:46:27 EDT

Lately I've been having a problem with some 6509s that doesn't seem to make
any sense, and I was wondering if others had any ideas or have run into this
before. I spent the last few hours searching the web without finding any
discussion that sounded like the same problem I'm seeing.

I have some 6509s running Supervisor IOS 12.1(11b)E. The 6509s are in pairs
with VLAN interfaces, and are running HSRP on these interfaces. These VLANs
feed down to smaller switches where the hosts connect. The smaller switches
are connected to both of the 6509s in the pair. The smaller switches
consist of Cisco models plus some HPs. ARP table sizes of 15,000 to 30,000
are common.

Sometimes a router seems to just refuse to try to ARP certain IPs.
Traceroutes to the broken IPs show the last hop as the 6509. The router
shows no ARP entry. The 6509 does have the host's MAC address in the
forwarding table. "debug arp" shows nothing matching the broken IP when
left running for 30 minutes while a continuous ping to a broken IP runs from
another host. The debug does show other addresses generating ARPs (and
getting responses) so it isn't like ARP completely stops working. When a
few IPs break but most things are working, a "clear arp" will result in only
a small percentage of the ARPs returning, with the majority of the IPs
suddenly being broken and not coming back (at least not anytime soon).
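
The symptom above (a MAC address present in the forwarding table but no
matching ARP entry) can be checked in bulk. What follows is only a minimal
sketch in Python, under two assumptions: that the text passed in is saved
"show ip arp" output in the usual IOS column layout, and that the list of
forwarding-table MACs has been collected separately by whatever means is
convenient. The function names are hypothetical.

```python
import re

# Assumed row layout of "show ip arp" (standard IOS columns), e.g.
# "Internet  10.1.1.5   12   0010.7b3a.1c2d  ARPA   Vlan10"
# Rows whose hardware address is "Incomplete" deliberately do not match.
ARP_ROW = re.compile(
    r"^Internet\s+(\S+)\s+\S+\s+"
    r"([0-9a-f]{4}\.[0-9a-f]{4}\.[0-9a-f]{4})\b",
    re.IGNORECASE,
)


def arp_macs(show_ip_arp_text):
    """Return the set of resolved MAC addresses (lowercase) in the ARP output."""
    macs = set()
    for line in show_ip_arp_text.splitlines():
        match = ARP_ROW.match(line.strip())
        if match:
            macs.add(match.group(2).lower())
    return macs


def macs_without_arp(forwarding_macs, show_ip_arp_text):
    """Return forwarding-table MACs that have no resolved ARP entry, sorted."""
    known = arp_macs(show_ip_arp_text)
    candidates = {mac.lower() for mac in forwarding_macs}
    return sorted(candidates - known)
```

Any MAC this returns belongs to a host the switch has seen at layer 2 but
has not resolved at layer 3. Some of those will be hosts the router simply
has no reason to talk to, so the list is a starting point for spotting the
broken IPs, not proof of the fault.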

Here's the strange part. If I log onto the affected 6509 that has no ARP
entry, all I have to do is ping the broken IP from the management interface.
This generates an ARP, and instantly the ping I left running from a remote
host to a broken IP starts responding. Another way to fix it is to do
"shut" then "no shut" on the affected VLAN interface. This seems to clear
up something on the interface, as suddenly all the IPs on that interface get
ARP entries. A few hours ago I had identified roughly 10 IPs (all on the
same VLAN interface) that were having this problem. I decided to try "clear
arp" to see if that would reset things and correct the problem. After about
5 minutes, the ARP table was only back up to about 5,900 entries compared to
the normal 15,000 to 20,000 and there were huge numbers of IPs that were
unreachable and not generating ARPs. After 10 minutes, the table was only
up to about 6,000. I pasted in "int vlanXXX", "shut", "no shut" commands
for every VLAN interface, and within 1 minute after that the ARP table was
up to a reasonable 15,000 entries and everything became reachable.
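
The block of commands pasted in for that workaround could be generated
rather than typed by hand. This is a rough sketch in Python, not a tested
procedure: the VLAN IDs in the example are placeholders, the function name
is hypothetical, and the "configure terminal" / "end" wrapper is an
assumption about how the paste would be entered. Bouncing an interface
interrupts traffic on that VLAN, so it remains a workaround and not a fix.

```python
def bounce_vlan_commands(vlan_ids):
    """Return IOS config lines that shut and re-enable each VLAN interface."""
    lines = ["configure terminal"]
    for vlan in vlan_ids:
        lines.append(f"interface Vlan{vlan}")
        lines.append(" shutdown")
        lines.append(" no shutdown")
    lines.append("end")
    return lines


if __name__ == "__main__":
    # Placeholder VLAN IDs; substitute the real ones before pasting.
    print("\n".join(bounce_vlan_commands([10, 20, 30])))
```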

The fact that a ping from the management interface fixes it along with the
lack of any ARP attempts even showing up in the debug arp output leads me to
believe the fault is clearly within the 6509, and not any other part of the
network. It doesn't seem to be throttling or broadcast storm control as
evidenced by the fact that the ARPs *CAN* be learned very quickly if you can
just get the 6509 to generate the ARPs (by the shut and no shut). I'm
guessing perhaps the packets destined for the broken IPs are (incorrectly)
being switched in hardware, and thus never being seen by the CPU and never
generating ARPs.

This problem seems to happen regularly now. I'm open to ideas, suggestions,
and show/debug to collect next time this happens...
