Re: [nsp] 6500 stops ARPing

From: Steve Francis (sfrancis@expertcity.com)
Date: Fri May 31 2002 - 15:10:25 EDT


on the cat:

 sho mls entry cef ip 10.16.1.155/32 ad
Mod: 15
Destination-IP: 10.16.1.155 Destination-Mask: 255.255.255.255
FIB-Type: resolved

AdjType  NextHop-IP      NextHop-Mac       Vlan Encp Tx-Packets   Tx-Octets
-------- --------------- ----------------- ---- ---- ------------ -------------
frc drop 10.16.1.155

(I was pinging a non-existent host on a directly connected net from the
MSFC.)
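
If you want to check a saved capture of that command for forced-drop
entries, a rough Python sketch along these lines might help. It assumes the
adjacency line looks exactly like the one above ("frc drop" followed by the
next-hop IP); that layout is taken from this single sample and may differ on
other software versions. The names find_frc_drop and SAMPLE are made up for
illustration.

```python
import re
from typing import List

# Matches an adjacency line such as "frc drop 10.16.1.155".
# Assumption: the layout is inferred from the one sample in this thread.
FRC_DROP = re.compile(r"^\s*frc\s+drop\s+(\d{1,3}(?:\.\d{1,3}){3})\s*$")


def find_frc_drop(output: str) -> List[str]:
    """Return the next-hop IPs shown with a 'frc drop' adjacency type."""
    hits = []
    for line in output.splitlines():
        match = FRC_DROP.match(line)
        if match:
            hits.append(match.group(1))
    return hits


# Pasted output from the message above, used here as test input.
SAMPLE = """\
Mod: 15
Destination-IP: 10.16.1.155 Destination-Mask: 255.255.255.255
FIB-Type: resolved

AdjType  NextHop-IP      NextHop-Mac       Vlan Encp Tx-Packets   Tx-Octets
-------- --------------- ----------------- ---- ---- ------------ -------------
frc drop 10.16.1.155
"""

if __name__ == "__main__":
    print(find_frc_drop(SAMPLE))
```

Run against the capture above, it should list 10.16.1.155 as the only
forced-drop adjacency.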

Matt Buford wrote:

>Good info in that URL. Thanks.
>
>In "Remarks and Conclusions" #1, it talks about a "frc drop" (force drop)
>entry being entered into the adjacency table while the arp is in the
>incomplete state. I have started running pings to nonexistent hosts, and I
>see the incomplete arp in the arp table, however I am not able to see any
>kind of entry for the host in the adjacency table while the arp is
>incomplete. Is there some way to see these "frc drop" entries?
>
>A guess is that perhaps somehow I'm getting force drop adjacency entries
>left behind after an incomplete arp times out and is removed from the arp
>table. These force drop adjacency entries may then be orphaned, with
>nothing ever coming along to remove them, and thus the MSFC never sees any
>packets destined for this host, so it never feels the need to generate any
>further arps. However, I need to find a way to show these table entries to
>really know.
>
>----- Original Message -----
>From: "Steve Francis" <steve@expertcity.com>
>To: "Matt Buford" <matt@overloaded.net>
>Cc: <cisco-nsp@puck.nether.net>
>Sent: Friday, May 31, 2002 12:32 PM
>Subject: Re: [nsp] 6500 stops ARPing
>
>
>>Sounds like a CEF issue.
>>
>>I'd try the stuff in http://www.cisco.com/warp/customer/473/128.html#case1
>>and see if the mls on the switch and the cef adjacencies agree.
>>
>>
>>Matt Buford wrote:
>>
>>>Lately I've been having a problem with some 6509s that doesn't seem to
>>>make any sense, and I was wondering if others had any ideas or have run
>>>into this before. I spent the last few hours searching the web without
>>>finding any discussion that sounded like the same problem I'm seeing.
>>>
>>>I have some 6509s running Supervisor IOS 12.1(11b)E. The 6509s are in
>>>pairs with VLAN interfaces, and are running HSRP on these interfaces.
>>>These VLANs feed down to smaller switches where the hosts connect. The
>>>smaller switches are connected to both of the 6509s in the pair. The
>>>smaller switches consist of Cisco models plus some HPs. ARP table sizes
>>>of 15,000 to 30,000 are common.
>>>
>>>Sometimes a router seems to just refuse to try to ARP certain IPs.
>>>Traceroutes to the broken IPs show the last hop as the 6509. The router
>>>shows no ARP entry. The 6509 does have the host's mac address in the
>>>forwarding table. "debug arp" shows nothing matching the broken IP when
>>>left running for 30 minutes while a continuous ping to a broken IP runs
>>>from another host. The debug does show other addresses generating ARPs
>>>(and getting responses) so it isn't like ARP completely stops working.
>>>When a few IPs break but most things are working, a "clear arp" will
>>>result in only a small percentage of the ARPs returning, with the
>>>majority of the IPs suddenly being broken and not coming back (at least
>>>not anytime soon).
>>>
>>>Here's the strange part. If I log onto the affected 6509 that has no ARP
>>>entry, all I have to do is ping the broken IP from the management
>>>interface. This generates an arp, and instantly the ping I left running
>>>from a remote host to a broken IP starts responding. Another way to fix
>>>it is to do "shut" then "no shut" on the affected VLAN interface. This
>>>seems to clear up something on the interface, as suddenly all the IPs on
>>>that interface get ARP entries. A few hours ago I had identified roughly
>>>10 IPs (all on the same VLAN interface) that were having this problem. I
>>>decided to try "clear arp" to see if that would reset things and correct
>>>the problem. After about 5 minutes, the ARP table was only back up to
>>>about 5,900 entries compared to the normal 15,000 to 20,000 and there
>>>were huge numbers of IPs that were unreachable and not generating ARPs.
>>>After 10 minutes, the table was only up to about 6,000. I pasted in
>>>"int vlanXXX", "shut", "no shut" commands for every VLAN interface, and
>>>within 1 minute after that the ARP table was up to a reasonable 15,000
>>>entries and everything became reachable.
>>>
>>>The fact that a ping from the management interface fixes it along with
>>>the lack of any ARP attempts even showing up in the debug arp output
>>>leads me to believe the fault is clearly within the 6509, and not any
>>>other part of the network. It doesn't seem to be throttling or broadcast
>>>storm control as evidenced by the fact that the ARPs *CAN* be learned
>>>very quickly if you can just get the 6509 to generate the arps (by the
>>>shut and no shut). I'm guessing perhaps the packets destined for the
>>>broken IPs are (incorrectly) being switched in hardware, and thus never
>>>being seen by the CPU and never generating ARPs.
>>>
>>>This problem seems to happen regularly now. I'm open to ideas,
>>>suggestions, and show/debug to collect next time this happens...
>>>
>>
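
One way to test the guess about orphaned entries (forced-drop adjacencies
that outlive the incomplete ARP entry) would be to compare two lists of IPs
collected during an outage: hosts showing a "frc drop" adjacency, and hosts
that have any ARP entry at all, incomplete or resolved. The Python sketch
below only does the comparison; how the two lists are gathered is left open,
and the function name and the addresses in the example are hypothetical.

```python
import ipaddress
from typing import Iterable, List


def orphaned_frc_drops(frc_drop_ips: Iterable[str],
                       arp_ips: Iterable[str]) -> List[str]:
    """IPs that have a forced-drop adjacency but no ARP entry of any kind.

    A non-empty result would be consistent with the orphaned-adjacency
    guess; it does not prove it.
    """
    orphans = set(frc_drop_ips) - set(arp_ips)
    # Sort numerically by address rather than as plain strings.
    return sorted(orphans, key=ipaddress.ip_address)


if __name__ == "__main__":
    # Hypothetical data, not taken from a real switch.
    frc_drop = ["10.16.1.200", "10.16.1.155", "10.16.1.9"]
    arp_table = ["10.16.1.200", "10.16.1.1"]
    print(orphaned_frc_drops(frc_drop, arp_table))
```

With the made-up lists above, 10.16.1.9 and 10.16.1.155 come back as
candidates, since they have a forced-drop adjacency but no ARP entry.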



This archive was generated by hypermail 2b29 : Sun Aug 04 2002 - 04:13:46 EDT