[c-nsp] 3750 CPU spikes (was: Sup720 CPU spikes, an academic question)

Jeff Kell jeff-kell at utc.edu
Tue May 3 13:44:05 EDT 2011


Just in case they're related (are you running redundant Sups or VSS?), let me throw this
one in the mix...

I have an open case regarding EIGRP route-flapping on a 3750 stack, which ended up being
CPU starvation due to configuration management (!).

We have a stack of 4 3750-48s running a few VRFs with EIGRP in each instance as the local
IGP.  When we made "certain" configuration changes, in our case adding or removing VLANs
on a switchport trunk allowed list, setting up a monitor session, or replacing ACLs with a
cut-and-paste operation, we got route flapping in the EIGRP instances (neighbors dropping
due to timeouts, then almost immediately re-establishing adjacencies).
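
For example, something as routine as this (interface and VLAN numbers are purely
illustrative) was enough to set it off:

  interface GigabitEthernet1/0/1
   switchport trunk allowed vlan add 200

or building a local SPAN session:

  monitor session 1 source interface Gi1/0/10
  monitor session 1 destination interface Gi1/0/48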

If you monitor "show proc cpu sorted" during the changes, you see consistently high CPU
utilization by the "hulc running con" process, which is responsible for applying the
configuration changes and propagating them to the other stack members.  Apparently this
process runs at a higher priority than EIGRP, so it starves out the hello packet
processing.
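
You can watch it happen with something like the following while the change is being
applied (the filter string is approximate; the process name is truncated in the output):

  show processes cpu sorted 5sec | include CPU|hulc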

There is no fix at this time (other than making your hello timers longer and screwing up
your reconvergence in the event of an actual failure), but it is supposed to be addressed
"in the next release" (*cough*).

Any possibility of a similar process in the 6500 train?

Jeff

On 5/3/2011 1:32 PM, Peter Rathlev wrote:
> I know a single 5-second interval of 100% CPU utilization now and then
> is rather irrelevant from an operational perspective. That's probably
> even more true when looking at a 600 MHz MIPS on a Sup720. This thing
> has me puzzled, though. :-)
>
> The context is a dozen or so C6k Sup720s running (mostly) SXI1 AIS, with
> IS-IS, MPLS L3VPN (including MP-BGP), a little L2VPN and not much more
> than that. They're doing really fine, no practical problems.
>
> The following is the output from "show proc cpu" (slightly reformatted)
> from a device that exceeded a 90% warning threshold we've configured. 
>
>   CPU utilization for five seconds: 100%/0%; one min: 10%; five min: 4%
>    PID Runtime(ms)   Invoked  uSecs  5Sec  1Min  5Min  Process
>      8   870373628  51977035  16745 1.27% 0.59% 0.64%  Check heaps
>    487    20306096  67521163    300 0.15% 0.04% 0.04%  Port manager per
>      2        9688   5187559      1 0.07% 0.00% 0.00%  Load Meter
>    358    18902200  40236967    469 0.07% 0.03% 0.02%  CEF: IPv4 proces
>     23    85574908 641372631    133 0.00% 0.12% 0.08%  IPC Seat Manager
>     51   111228136   4913752  22636 0.00% 0.07% 0.05%  Per-minute Jobs
>    272    28800268 228265577    126 0.00% 0.10% 0.07%  IP Input
>    561    55288392 590654988     93 0.00% 0.13% 0.09%  ISIS Adj
>    578    16540192 166947095     99 0.00% 0.05% 0.04%  HSRP IPv4
>
> I've excluded processes with 0% utilization for all three periods. To me
> the above means that 0% of the time (?) was spent interrupt switching, so
> the load must be either process-switched traffic or some other process.
> The specific example is generally representative of what we see, but the
> processes mentioned differ (e.g. "Check heaps" is not always on top, nor
> always taking > 0.5% load) and no single process tends to take more than
> 1%-2% of the load. I have all the alerts archived, by the way.
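>
> (For reference, alerting like this can be done with the stock CPU
> threshold trap; the values shown are only illustrative:)
>
>   process cpu threshold type total rising 90 interval 5
>   snmp-server enable traps cpu threshold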
>
> I've tested flood-pinging a similar device, and it seems that both the
> total CPU time (first number) and the interrupt time (second number)
> increase when the device has to handle incoming ICMP Echo Requests.
> Furthermore I can clearly see the "IP Input" process take a significant
> amount of CPU time. Similar results are seen when I flood the device with
> TCP SYNs, with floods targeting open ports putting more load on the
> device than floods targeting closed ports.
>
> Alongside this reactive alerting I have a continuous ERSPAN session
> monitoring all traffic sent to the RP ("source cpu rp tx") and all the
> traffic is logged to a 10GB rotating buffer that holds around 5-6 days
> of traffic. Each time I get an alert I take a look at the traffic
> surrounding the time of the alert. Every time I see nothing of interest:
> I see significantly more traffic at other times, and the specific
> traffic mix does not give me any clues.
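>
> (For anyone wanting to set up something similar, the session looks
> roughly like this; the addresses and session ID are made up, and the
> exact syntax varies a bit by release:)
>
>   monitor session 1 type erspan-source
>    source cpu rp tx
>    no shutdown
>    destination
>     erspan-id 100
>     ip address 192.0.2.10
>     origin ip address 192.0.2.1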
>
> The spikes do not seem to correlate with a lot of traffic, neither
> traffic for the RP nor traffic generally being forwarded by the box. It
> also does not correlate with IGP or BGP events or anything I'd consider
> relevant. Even the odd loop or ridiculous multicast flooding doesn't tax
> the CPU under normal circumstances.
>
> The device has no CoPP configured and only default rate-limiters.
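>
> (Easy to confirm, e.g.:)
>
>   show policy-map control-plane
>   show mls rate-limit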
>
> The only thing that might mean something is that we've not (yet) seen
> this on devices running newer software (SXI5+). That might be because
> they're outnumbered by devices still running SXI1. I haven't yet had the
> motivation to mine the Bug Toolkit for clues, sorry.
>
> What puzzles me is: What causes the RP to max out at 100% utilization in
> a case like this? Should I just ignore it altogether?
>


