[c-nsp] Total hangs on 7301/7200 (NPE-G1)

Osama I. Dosary oid at saudi.net.sa
Mon Dec 13 10:01:47 EST 2004


Thank you Rodney,

I ruled out a) a few days back, since I was able to catch a hang while 
connected to the console and found no messages. Also, we pulled out the 
one and only network cable, waited for over 20 minutes, and it still did 
not come back. In addition, prior to a hang there is no significant 
increase in traffic (bps and pps) or CPU noticeable on our monitoring 
system (NMIS, 5-minute polling interval). (And I don't know an easy way 
to sniff 100 Mbps and make sense of it. Suggestions welcome.)
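
The console watch above depends on the session staying open until the 
hang. A minimal sketch of turning off the console exec timeout, assuming 
the console is the default "line con 0" (the session logging itself is 
done by the terminal program on the attached PC, not by the router):

```
! Disable the exec timeout on the console so a logged session stays open
configure terminal
 line con 0
  exec-timeout 0 0
 end
```

With a timeout of 0 0 the console never logs itself out, so this is 
best kept only for the duration of the troubleshooting.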

Regarding b) "real OS/hardware hang": I believe this to be the culprit. 
I already set the register for a stack trace a few days back, but 
haven't been lucky enough to have a hang on that router since. (Also 
wondering: will a control-break be accepted by the console when the 
console hangs?)
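
A sketch of what such a register change looks like, for anyone following 
along. The value 0x2002 is an assumption here (it is the common default 
0x2102 with the "break disabled" bit, 0x0100, cleared); the tech note 
Rodney linked is the authority on the right value and on the rommon 
commands to run after sending the break:

```
! Assumed value: 0x2102 with bit 8 (0x0100, break disabled) cleared
configure terminal
 config-register 0x2002
 end
! The new value only takes effect after the next reload;
! show version reports the current register and the pending one
show version
```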

After a bit of testing, we came up with a workaround: using an external 
FastEthernet PA. This workaround has been working pretty well for the 
past 3 days (no hangs). There are very few "spurious interrupts", but 
still a lot of in-queue drops.

Someone on the mailing list pointed me to this unresolved bug: 
CSCef31300, "NPE-G1 hangs in unqueue_inline". (*Thank you Ian Chaloner*)
http://www.cisco.com/cgi-bin/Support/Bugtool/onebug.pl?bugid=CSCef31300+&Submit=Search
Although our routers are both 7200 and 7301, I figured that this bug 
might be it. To test it, I did "no ip cef" globally and pounded the 
router with L2TP. So far so good. We lost some CPU power, though (~30%).
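
For anyone wanting to repeat the test, a minimal sketch of the change. 
The two show commands are just one way to confirm the result and are not 
part of the test as described above:

```
! Disable CEF switching globally
configure terminal
 no ip cef
 end
! Confirm CEF is off and keep an eye on the CPU cost
show ip cef summary
show processes cpu
```

With CEF off, traffic falls back to the older fast/process switching 
paths, which is where the extra CPU load comes from.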

The status of the bug is: *Unreproducible*. Does that mean no one is 
working to fix it?
Is it our fate to run without CEF until more networks show up with this 
problem?

Help/Osama

Rodney Dunn wrote:

>So the deal with hangs is this.
>
>There are two kinds really.
>
>a) Ones that look like an OS hang but 
>   really aren't. 
>
>These result from a flood of traffic that causes
>the box to become unresponsive because it's
>doing too much work to service the telnet/console.
>
>One of the first things I tell people when they
>tell me their box hangs is to ask them to start
>doing some form of netflow, memory usage, and CPU
>trending of the box.  You would be amazed at the hangs
>I've seen solved by watching the spikes on an MRTG
>graph just prior to a reported "OS hang". :)
>
>A sniffer handy always helps in these scenarios too.
>
>I've also seen people on the console say it
>stops responding, and they will pull cables one by
>one or watch the LEDs to try and figure out if
>traffic is heavy on a particular interface.  If you
>pull the cables one by one and then the console starts
>responding again, that tells you at least that IOS or the
>hardware itself isn't in a hung condition, or it wouldn't
>come back at all.
> 
>What is also suggested is to put a PC on the console
>and start logging the session and turn off console
>exec timeout to see if there is anything printed
>to the console just prior to a failure.
>
>b) The second type of hang is a real OS/hardware hang.
>   If it's an OS hang that the watchdog process doesn't
>   detect, the only way to get a valid idea of what the
>   OS is doing is to get a stack trace from rommon.
>   Here are the instructions:
>
>http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a0080106fd7.shtml
>
>Also monitor everything you possibly can about what is
>happening to the box around the time of the hang:
>AAA logs for any commands that get entered, interfaces coming
>up/down, etc.
>
>The more you know about the system and what's happening about
>the time of a failure the more it helps us figure out what
>is really going wrong.
>
>And always correlate events between boxes to look for things
>that are similar.
>
>ie: NPE, modules, traffic flows, etc.
>
>For problems like this my motto is "it's always about the trigger
>for the problem. Figuring that out is 99% of the work". (rodunn)
>
>Rodney
>
>
>On Mon, Dec 13, 2004 at 01:11:39PM +0300, Osama I. Dosary wrote:
>
>>Janet Sullivan wrote:
>>
>>>Osama I. Dosary wrote:
>>>
>>>>In the past week or so we have been experiencing total hangs on 
>>>>*several* 7301 and 7200 routers (all NPE-G1).
>>>
>>>What version(s) of IOS are you running?
>>
>>The IOSs we ran when a hang occurred were the following: 12.3-5a.B1, 
>>12.3-5a.B3, 12.3-9c, and 12.3-10a.
>>When we could not figure out the problem, we thought it might be an 
>>unregistered IOS bug, so we jumped from one to another.

