[c-nsp] IOS Upgrade to SXI3

Mon Dec 21 12:01:52 EST 2009

Responding to my own posting with an update: the SNMP issue described
below appears to have had nothing to do with the SNMP code on the
router.

Instead it appears to have been a hardware-related problem with
internal communication paths in the router, which was stalling the SCP
paths and "Async write" processes and causing failures in a number of
things, such as writing files to flash and answering SNMP queries.

It's not clear why the issue didn't show up until the router was
reloaded on SXI3 code, but Murphy's Law is always at work.

During a maintenance window we power-cycled the router with full
startup diagnostics, but found no hardware problems. However, the high
SP CPU load (99 percent) was present again on the slot 5 sup (slot 6
sup was OK). Replacing the slot 5 sup appears to have resolved all
issues. All modules in the box were reseated during the power-cycling,
just in case.

The other two routers running SXI3 are not having any CPU load
problems and have been stable. One is a border router doing full BGP
peering, the other an enterprise core router.

On the port-channel issue that was noted, the error counters on the
port-channel interface have not incremented since the counters were
cleared. It appears that there must have been a burst of errors at
startup but nothing since then.

-Charles

Charles E. Spurgeon / UTnet
UT Austin ITS / Networking
c.spurgeon at its.utexas.edu / 512.475.9265

On Tue, Dec 15, 2009 at 05:50:55PM -0600, Charles Spurgeon wrote:
> 
> We upgraded three core routers to monolithic 12.2(33)SXI3 on Sunday,
> Dec 13.
> 
> One of the upgraded routers started throwing SNMP input queue errors
> after several hours of runtime. All three routers are polled by the
> same servers asking for the same OIDs, but only one of the upgraded
> routers has thrown any SNMP errors: 
> "Dec 14 14:19:50: %SNMP-3-INPUT_QFULL_ERR: Packet dropped due to input queue full"
>
> SNMP graphing stopped working coincident with these error msgs.
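
A minimal Python sketch for tallying messages like the one above in a
saved syslog file, assuming the usual %FACILITY-SEVERITY-MNEMONIC tag
layout. The function name count_tags and the sample list are
illustrative only and are not part of any Cisco tool.

```python
import re
from collections import Counter

# Matches the %FACILITY-SEVERITY-MNEMONIC tag in an IOS syslog line.
TAG_RE = re.compile(r"%([A-Z0-9_]+)-(\d)-([A-Z0-9_]+):")


def count_tags(lines):
    """Return a Counter keyed by 'FACILITY-SEVERITY-MNEMONIC'."""
    counts = Counter()
    for line in lines:
        match = TAG_RE.search(line)
        if match:
            counts["-".join(match.groups())] += 1
    return counts


# The single log line quoted in this thread, used as sample input.
sample = [
    "Dec 14 14:19:50: %SNMP-3-INPUT_QFULL_ERR: "
    "Packet dropped due to input queue full",
]

for tag, n in count_tags(sample).most_common():
    print(tag, n)
```

Run against the logs of all three routers, a tally like this would
make it easy to confirm that only one of them is logging the error.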
> 
> In an attempt to clear the errors we applied these commands that
> were found when looking for info on this error:
> snmp-server view public-view iso included
> snmp-server view public-view ciscoMemoryPoolMIB excluded
> 
> Roughly coincident with applying those SNMP config lines, the SP CPU
> went to 100 percent load, where it has remained stuck ever since. RP
> CPU is running normally.
> 
> We have opened a TAC case, run a number of debugs, removed all SNMP
> commands, etc. But the SP CPU is still pegged and we haven't been able
> to find a smoking gun.
> 
> The biggest process load on the SP appears to be from an Async write
> process:
> --------------------
> NOCA9-sp#show proc cpu | exc 0.00
> Load for five secs: 100%/13%; one minute: 99%; five minutes: 99%
> Time source is hardware calendar, 10:46:59.677 CST Mon Dec 14 2009
> 
> CPU utilization for five seconds: 100%/13%; one minute: 99%; five minutes: 99%
>  PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process 
>   42       52936      2280      23217  0.63%  0.07%  0.01%   0 Per-minute Jobs  
>   93    51573408   1269609      40621 67.46% 65.15% 64.79%   0 Async write proc 
>  111     2197532   3855803        569  1.91%  1.88%  1.91%   0 slcp process   
> --------------------
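
A rough Python sketch for pulling the per-process rows out of output
like the above and ranking them by five-second load. It assumes the
column order shown in the header line (PID, Runtime, Invoked, uSecs,
5Sec, 1Min, 5Min, TTY, Process); the names ROW_RE and parse_rows are
made up for this example.

```python
import re

# One process row: eight numeric columns followed by a free-text name.
ROW_RE = re.compile(
    r"^\s*(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"([\d.]+)%\s+([\d.]+)%\s+([\d.]+)%\s+(\d+)\s+(.+?)\s*$"
)


def parse_rows(text):
    """Parse process rows; header and summary lines are skipped."""
    rows = []
    for line in text.splitlines():
        m = ROW_RE.match(line)
        if not m:
            continue
        rows.append({
            "pid": int(m.group(1)),
            "runtime_ms": int(m.group(2)),
            "five_sec": float(m.group(5)),
            "one_min": float(m.group(6)),
            "five_min": float(m.group(7)),
            "process": m.group(9),
        })
    return rows


# Rows copied from the output in this message.
SAMPLE = """\
 PID Runtime(ms)   Invoked      uSecs   5Sec   1Min   5Min TTY Process
  42       52936      2280      23217  0.63%  0.07%  0.01%   0 Per-minute Jobs
  93    51573408   1269609      40621 67.46% 65.15% 64.79%   0 Async write proc
 111     2197532   3855803        569  1.91%  1.88%  1.91%   0 slcp process
"""

rows = parse_rows(SAMPLE)
top = max(rows, key=lambda r: r["five_sec"])
print(top["pid"], top["process"], top["five_sec"])
```

Collecting this at intervals would show whether the Async write
process stays pinned or only spikes.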
> 
> We ran debug on SNMP packets and requests and found that the SNMP
> traffic consists of well-behaved queries from just our set of
> servers, polling only the MIB variables needed, and the request
> volume is not high.
> 
> Meanwhile, there are an insane number of VeryBig buffers on the RP and
> equally insane numbers of Medium buffers on the SP being created:
> --------------------
> RP
> --------------------
> VeryBig buffers, 4520 bytes (total 1013, permanent 10, peak 1016 @ 14:51:06):
>      12 in free list (0 min, 100 max allowed)
>      584335 hits, 21308 misses, 15077 trims, 16080 created
>      14417 failures (0 no memory)
> 
> --------------------
> SP
> --------------------
> Medium buffers, 256 bytes (total 30359, permanent 3000, peak 30359 @ 00:00:00):
>      66 in free list (64 min, 3000 max allowed)
>      1659825 hits, 9193 misses, 33 trims, 27392 created
>      0 failures (0 no memory)
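
To put numbers on the buffer growth shown above, here is a small
Python sketch that parses the two pool summaries and compares the
current total against the permanent allocation. It assumes the
three-line layout seen in this output; parse_pools and the dictionary
keys are names invented for this example.

```python
import re

HEADER_RE = re.compile(
    r"^(\w+) buffers, (\d+) bytes "
    r"\(total (\d+), permanent (\d+), peak (\d+) @"
)
COUNTS_RE = re.compile(
    r"(\d+) hits, (\d+) misses, (\d+) trims, (\d+) created"
)
FAIL_RE = re.compile(r"(\d+) failures \((\d+) no memory\)")


def parse_pools(text):
    """Return {pool name: stats dict} for each buffer pool found."""
    pools = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        m = HEADER_RE.match(line)
        if m:
            current = {
                "size": int(m.group(2)),
                "total": int(m.group(3)),
                "permanent": int(m.group(4)),
                "peak": int(m.group(5)),
            }
            pools[m.group(1)] = current
            continue
        if current is None:
            continue  # ignore anything before the first pool header
        m = COUNTS_RE.search(line)
        if m:
            keys = ("hits", "misses", "trims", "created")
            current.update(zip(keys, map(int, m.groups())))
            continue
        m = FAIL_RE.search(line)
        if m:
            current["failures"] = int(m.group(1))
            current["no_memory"] = int(m.group(2))
    return pools


# Pool summaries copied from this message (RP first, then SP).
SAMPLE = """\
VeryBig buffers, 4520 bytes (total 1013, permanent 10, peak 1016 @ 14:51:06):
     12 in free list (0 min, 100 max allowed)
     584335 hits, 21308 misses, 15077 trims, 16080 created
     14417 failures (0 no memory)
Medium buffers, 256 bytes (total 30359, permanent 3000, peak 30359 @ 00:00:00):
     66 in free list (64 min, 3000 max allowed)
     1659825 hits, 9193 misses, 33 trims, 27392 created
     0 failures (0 no memory)
"""

pools = parse_pools(SAMPLE)
for name, p in pools.items():
    growth = p["total"] / p["permanent"]
    print(f"{name}: total is {growth:.1f}x permanent, "
          f"{p['created']} created, {p['failures']} failures")
```

The ratio of total to permanent buffers is only a rough indicator,
but it makes the RP and SP pools easy to compare across snapshots.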
> 
> Other than this, we have not been able to find any other useful info.
> 
> Also, we have been seeing errors on a port-channel associated with one
> of the other routers that was upgraded to SXI3. 
> 
> There have been bursts of errors received on the upstream router from
> the upgraded router on the two 10GigE ints that make up the port
> channel. As far as we can tell these ints were running clean until
> SXI3 was loaded, but we're still investigating this issue.
> 
> -Charles
> 
> Charles E. Spurgeon / UTnet
> UT Austin ITS / Networking
> c.spurgeon at its.utexas.edu / 512.475.9265