[c-nsp] Sup720 CPU spikes, an academic question

Tue May 3 13:32:44 EDT 2011

I know a single 5 second interval of 100% CPU utilization now and then
is rather irrelevant seen from an operational perspective. That's
probably even more true when looking at a 600 MHz MIPS on a Sup720. This
thing has me puzzled though. :-)

The context is a dozen or so C6k Sup720s running (mostly) SXI1 AIS,
IS-IS, MPLS L3VPN (including MP-BGP), a little L2VPN and not much
more than that. They're doing really fine, with no practical problems.

The following is the output from "show proc cpu" (slightly reformatted)
from a device that exceeded a 90% warning threshold we've configured. 

  CPU utilization for five seconds: 100%/0%; one min: 10%; five min: 4%
   PID Runtime(ms)   Invoked  uSecs  5Sec  1Min  5Min  Process
     8   870373628  51977035  16745 1.27% 0.59% 0.64%  Check heaps
   487    20306096  67521163    300 0.15% 0.04% 0.04%  Port manager per
     2        9688   5187559      1 0.07% 0.00% 0.00%  Load Meter
   358    18902200  40236967    469 0.07% 0.03% 0.02%  CEF: IPv4 proces
    23    85574908 641372631    133 0.00% 0.12% 0.08%  IPC Seat Manager
    51   111228136   4913752  22636 0.00% 0.07% 0.05%  Per-minute Jobs
   272    28800268 228265577    126 0.00% 0.10% 0.07%  IP Input
   561    55288392 590654988     93 0.00% 0.13% 0.09%  ISIS Adj
   578    16540192 166947095     99 0.00% 0.05% 0.04%  HSRP IPv4

I've excluded processes with 0% utilization for all three periods. To me
the above means that 0% time (?) was spent interrupt switching, so the
load must be either process switched traffic or some other process. The
specific example is generally representative of what we see, but the
processes listed differ (e.g. "Check heaps" is not always on top, nor
does it always take > 0.5% load) and no single process tends to take
more than 1%-2% load. I have all the alerts archived, by the way.
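
To make the mismatch concrete, here is a rough Python sketch that parses
the output above and compares the headline five-second figure with the
sum of the per-process 5Sec column. It assumes the usual reading of the
header (first number is total utilization, second is interrupt level)
and that the rows look like the reformatted ones quoted above; the
function name and the regexes are just for illustration.

```python
import re

# The output quoted above, pasted verbatim.
SHOW_PROC_CPU = """\
CPU utilization for five seconds: 100%/0%; one min: 10%; five min: 4%
 PID Runtime(ms)   Invoked  uSecs  5Sec  1Min  5Min  Process
   8   870373628  51977035  16745 1.27% 0.59% 0.64%  Check heaps
 487    20306096  67521163    300 0.15% 0.04% 0.04%  Port manager per
   2        9688   5187559      1 0.07% 0.00% 0.00%  Load Meter
 358    18902200  40236967    469 0.07% 0.03% 0.02%  CEF: IPv4 proces
  23    85574908 641372631    133 0.00% 0.12% 0.08%  IPC Seat Manager
  51   111228136   4913752  22636 0.00% 0.07% 0.05%  Per-minute Jobs
 272    28800268 228265577    126 0.00% 0.10% 0.07%  IP Input
 561    55288392 590654988     93 0.00% 0.13% 0.09%  ISIS Adj
 578    16540192 166947095     99 0.00% 0.05% 0.04%  HSRP IPv4
"""

# Assumption: header is "<total>%/<interrupt>%" for the 5 second interval.
HEADER_RE = re.compile(r"five seconds:\s*(\d+)%/(\d+)%")
# PID, Runtime, Invoked, uSecs, then the three percentages and the name.
ROW_RE = re.compile(
    r"^\s*(\d+)\s+\d+\s+\d+\s+\d+\s+"
    r"([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s+(.*\S)\s*$"
)


def summarize(text):
    """Compare the headline 5 s utilization with the per-process 5Sec sum."""
    header = HEADER_RE.search(text)
    if header is None:
        raise ValueError("no 'CPU utilization for five seconds' line found")
    total, interrupt = int(header.group(1)), int(header.group(2))

    five_sec = {}
    for line in text.splitlines():
        row = ROW_RE.match(line)
        if row:
            five_sec[row.group(5)] = float(row.group(2))

    listed = sum(five_sec.values())
    process_level = total - interrupt
    return {
        "total": total,
        "interrupt": interrupt,
        "process_level": process_level,
        "listed_sum": round(listed, 2),
        "unaccounted": round(process_level - listed, 2),
    }


if __name__ == "__main__":
    for key, value in summarize(SHOW_PROC_CPU).items():
        print(f"{key:>14}: {value}")
```

Since only processes at 0% for all three periods were left out, the
listed rows should cover essentially all of the per-process 5Sec load,
give or take rounding. Almost none of the headline figure is attributed
to any process, which is exactly the puzzling part.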

I've tested flood-pinging a similar device and it seems that both
the total (first number) and the interrupt time (second number) increase
when the device has to handle incoming ICMP Echo Requests. Furthermore, I
can clearly see the "IP Input" process take a significant amount of CPU
time. Similar results are seen when I try flooding the device with TCP
SYN requests, with floods targeting open ports putting more load on the
device than closed ports.

Alongside this reactive alerting I have a continuous ERSPAN session
monitoring all traffic sent to the RP ("source cpu rp tx") and all the
traffic is logged to a 10 GB rotating buffer that holds around 5-6 days
of traffic. Each time I get an alert I take a look at the traffic
surrounding the time of the alert. Every time I see nothing of interest:
I see significantly more traffic at other times, and the specific
traffic mix does not give me any clues.
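
For what it's worth, the comparison I do by eye can be sketched roughly
like this in Python. It assumes the packet timestamps (epoch seconds)
have already been pulled out of the capture buffer by some other tool;
the function names and the half_window parameter are made up for the
example.

```python
import math
from collections import Counter
from statistics import median


def per_second_counts(timestamps):
    """Bin packet timestamps (epoch seconds, floats) into 1 s buckets."""
    return Counter(math.floor(ts) for ts in timestamps)


def window_vs_baseline(timestamps, alert_ts, half_window=5):
    """Compare packets/s around an alert with the whole capture.

    half_window is a hypothetical knob: how many seconds on each side
    of the alert time to inspect.
    """
    counts = per_second_counts(timestamps)
    if not counts:
        raise ValueError("no packets in capture")

    # Per-second rates over the whole capture, counting empty seconds as 0.
    first, last = min(counts), max(counts)
    all_rates = [counts.get(s, 0) for s in range(first, last + 1)]

    # Per-second rates in the window surrounding the alert.
    centre = math.floor(alert_ts)
    window_rates = [
        counts.get(s, 0)
        for s in range(centre - half_window, centre + half_window + 1)
    ]

    return {
        "window_peak": max(window_rates),
        "baseline_median": median(all_rates),
        "baseline_peak": max(all_rates),
    }
```

If window_peak sits well below baseline_peak, the alert did not coincide
with the busiest seconds in the buffer, which is what I keep finding.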

The spikes do not seem to correlate with a lot of traffic, neither
traffic for the RP nor traffic generally being forwarded by the box.
They also do not correlate with IGP or BGP events or anything I'd consider
relevant. Even the odd loop or ridiculous multicast flooding doesn't tax
the CPU under normal circumstances.

The device has no CoPP configured and only default rate-limiters.

The only thing that might mean something is that we've not (yet) seen
this on devices running newer software (SXI5+). That might be because
they're outnumbered by devices still running SXI1. I haven't yet had the
motivation to mine the Bug Toolkit for clues, sorry.

What puzzles me is: What causes the RP to max out at 100% utilization in
a case like this? Should I just ignore it altogether?

-- 
Peter