[c-nsp] CEF Scanner eating CPU in Supervisor 720

Fri Jun 9 16:23:07 EDT 2006

Is there HSRP on any of the interfaces where there were duplicate IPs?
I've seen Sups die a 100% CPU death in this scenario, and not recover when
the duplicate was removed.

Ian

Rodney Dunn wrote:
> Peter,
> 
> Someone mentioned to me (and they were watching this alias and
> our thread) that they were seeing a similar issue.
> 
> I tried to recreate it in the lab with this:
> 
> R1 --
> R2 --  UUT --- R3 -- R4
> 
> I sent 1k routes from R2 to UUT as EBGP routes.
> I originated 1k OSPF routes and 1k iBGP routes from R4.
> I had MPLS on between UUT and R3.
> 
> I then turned up R1 with duplicate ip address.
> 
> I do see the adj flap as the two devices argue about who should
> own that ip address.
> 
> And I see the adj update in 'sh ip cef ev' but I'm not able to
> get the CEF scanner to run high.
> 
> On Monday I'll see if I can get a 76xx with SXE5 on it and try.
> 
> If you could try without MPLS and/or contact me offline and get
> me remote access I'll look at it with you if you can recreate it.
> 
> In newer code that is coming out the CEF scanner doesn't exist
> anymore. But I'd still like to understand what's happening here
> because even if one adj is flapping it doesn't seem normal that the
> scanner would constantly run like that.
> 
> Rodney
> 
> On Fri, Jun 09, 2006 at 08:39:21AM -0400, Rodney Dunn wrote:
> 
>>On Fri, Jun 09, 2006 at 08:56:47AM +0200, Peter Salanki wrote:
>>
> I think the problem is not at all MPLS related. I did a perl hack that
> parsed the output of sh ip cef event new and added static arp on the
> hosts which flapped rapidly; 7 IPs had abnormal activity. The load is
> now down to a more acceptable level of 20% avg. I could remove the
> statics and disable MPLS on the core facing interfaces just to make
> sure that MPLS has nothing to do with it if you want.
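
The approach described above (parsing the event log to find hosts whose /32 adjacency keeps updating) might be sketched roughly as follows. This is not the Perl script mentioned in the thread, which was never posted. It is a minimal Python illustration that assumes the event lines look like the ones pasted later in this thread. The names count_adj_updates and flapping and the threshold of 2 are hypothetical, and the sketch only reports candidates; it does not add any static ARP entries.

```python
import re
from collections import Counter

# Assumed line format, taken from the output pasted in this thread, e.g.:
# +00:00:00.000:           81.170.148.226/32     ADJ (Vl4001) update
EVENT_RE = re.compile(
    r"\+\d+:\d+:\d+\.\d+:\s+(?:\[\S+\]\s+)?"
    r"(?P<prefix>\d+\.\d+\.\d+\.\d+/32)\s+ADJ\s+\((?P<intf>[^)]+)\)\s+update"
)


def count_adj_updates(text):
    """Count ADJ update events per (prefix, interface) pair."""
    counts = Counter()
    for line in text.splitlines():
        match = EVENT_RE.search(line)
        if match:
            counts[(match.group("prefix"), match.group("intf"))] += 1
    return counts


def flapping(counts, threshold=2):
    """Return the pairs seen at least `threshold` times (arbitrary cutoff)."""
    return sorted(key for key, n in counts.items() if n >= threshold)


# A few lines copied from the event table in this thread.
SAMPLE = """\
+00:00:00.240:           81.170.148.118/32     ADJ (Vl4001) update
+00:00:00.304:           81.170.149.246/32     ADJ (Vl4001) update
+00:00:00.736: [Default] 199.3.108.0/24        NBD modified
+00:00:01.008:           81.170.149.246/32     ADJ (Vl4001) update
"""

if __name__ == "__main__":
    for prefix, intf in flapping(count_adj_updates(SAMPLE)):
        print(prefix, intf)
```

In practice the input would be several captures of the event log collected over time rather than a single short sample, and the cutoff would need tuning to the normal ARP churn on the box.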
>>>
>>>I hate asking for things from production networks but if you could
>>>do that without too much trouble that would be a good data point
>>>to have.
>>>
>>>Also, those macs are they on the downstream interfaces (non mpls
>>>enabled interfaces)?
>>>
>>>Are there any routes resolving through those arp/mac's on those interfaces?
>>>
>>>
>>>How are the arps changing?
>>>
>>>
>>> Do you have any  
>>>
> case about this, and/or any plans of "fixing" it? 
>>>
>>>I need to understand a little more about what the root problem is first.
>>>
>>>I don't like the  
>>>
> thought of directly connected kiddies being able to drain all cpu on
> my (imo. not cheap) sup720-3bxl by just stealing each other's IP
> addresses.
>>>
>>>Help me understand a bit more about what is actually going on to trigger
>>>it and we'll see what we can do.
>>>
>>>
>>>I'm not a l2 person but isn't there stuff about securing macs on ports, etc.?
>>>
>>>
> On 9 Jun 2006, at 00:32, Rodney Dunn wrote:
> 
> 
>>One trick is you can do a 'sh ip cef ev new' and do it over and
>>over. See which ones are flapping.
> 
>>How many routes do you have?
> 
>>Can you turn off logging to the console: no logg con
> 
>>and run a couple of mpls debugs and let's see what that says:
> 
>>debug mpls lfib cef
>>debug mpls lfib enc
> 
>>Set the log to a couple of meg.
> 
>>Rodney
> 
>>On Thu, Jun 08, 2006 at 11:13:02PM +0200, Peter Salanki wrote:
> 
>>>The CEF Scanner is now eating almost all CPU :/
>>>
>>>The events table doesn't look particularly unusual to me:
>>>--SNAP--
>>>
>>>+00:00:00.000:           81.170.148.226/32     ADJ (Vl4001) update [OK]
>>>+00:00:00.024:           195.178.160.138/32    ADJ (Vl19) update [OK]
>>>+00:00:00.052:           81.170.138.13/32      ADJ (Vl604) update [OK]
>>>+00:00:00.232:           81.170.152.129/32     ADJ (Vl4003) update [OK]
>>>+00:00:00.240:           81.170.148.118/32     ADJ (Vl4001) update [OK]
>>>+00:00:00.304:           81.170.149.246/32     ADJ (Vl4001) update [OK]
>>>+00:00:00.320:           81.170.152.50/32      ADJ (Vl4003) update [OK]
>>>+00:00:00.380:           81.170.154.117/32     ADJ (Vl4004) update [OK]
>>>+00:00:00.388:           213.136.56.90/32      ADJ (Vl39) update [OK]
>>>+00:00:00.400:           81.170.136.79/32      ADJ (Vl504) update [OK]
>>>+00:00:00.416:           195.178.160.173/32    ADJ (Vl19) update [OK]
>>>+00:00:00.512:           81.170.164.163/32     ADJ (Vl4009) update [OK]
>>>+00:00:00.728:           81.170.130.75/32      ADJ (Vl204) update [OK]
>>>+00:00:00.736: [Default] 199.3.108.0/24        NBD modified [OK]
>>>+00:00:00.736: [Default] 199.3.109.0/24        NBD modified [OK]
>>>+00:00:00.820:           195.178.186.24/32     ADJ (Vl666) update [OK]
>>>+00:00:00.832:           81.170.160.3/32       ADJ (Vl4007) update [OK]
>>>+00:00:00.868:           81.170.164.33/32      ADJ (Vl4009) update [OK]
>>>+00:00:00.944:           81.170.132.159/32     ADJ (Vl304) update [OK]
>>>+00:00:00.952:           81.170.128.77/32      ADJ (Vl104) update [OK]
>>>+00:00:01.008:           81.170.149.246/32     ADJ (Vl4001) update [OK]
>>>+00:00:01.128:           194.68.123.141/32     ADJ (Vl15) update [OK]
>>>--More--
>>>
>>>
>>>On 8 Jun 2006, at 19:40, Rodney Dunn wrote:
>>>
>>>
>>>>Are you running MPLS on the box?
> 
>>>>Check the sh ip cef event output and see if you have a /32 ADJ
>>>>for a mac constantly changing. That's the most common trigger
>>>>I've seen for the scanner running high.
> 
>>>>You are forcing CEF to constantly reresolve prefixes.
> 
>>>>Rodney
> 
>>>>On Thu, Jun 08, 2006 at 02:23:22PM +0200, Peter Salanki wrote:
> 
>>>>>-----BEGIN PGP SIGNED MESSAGE-----
>>>>>Hash: SHA1
>>>>>
>>>>>Hello,
>>>>>
>>>>>Process "CEF Scanner" is eating an average of 60% of the CPU on one of my
>>>>>Sup720-3BXLs. This leads to SNMP responses being delayed and full BGP
>>>>>updates taking a long time. I have not seen this on any of my other
>>>>>sup720s. What distinguishes this box from the rest is that it has a
>>>>>lot of directly connected hosts: ~10 SVIs with 300 hosts each (on /23
>>>>>subnets). I have tried setting arp timeout to 1200 on those SVIs,
>>>>>which resulted in a small decrease in CPU utilization. What can I do to
>>>>>calm down the CEF Scanner? I'm running 12.2(18)SXF4.
>>>>>
>>>>>CPU utilization for five seconds: 44%/4%; one minute: 38%; five minutes: 38%
>>>>>PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
>>>>>119   103495040    719635     143819 35.40% 23.87% 21.54%   0 CEF Scanner
>>>>>
>>>>>Sincerely
>>>>>
>>>>>Peter Salanki
>>>>>Chief Network Engineer
>>>>>Bahnhof AB (AS8473)
>>>>>www.bahnhof.se
>>>>>Office: +46855577132
>>>>>Cell: +46709174932
>>>>>
>>>>>
>>>>>