[c-nsp] Total hangs on 7301/7200 (NPE-G1)
Osama I. Dosary
oid at saudi.net.sa
Tue Dec 14 00:29:30 EST 2004
Rodney Dunn wrote:
>On Mon, Dec 13, 2004 at 06:01:47PM +0300, Osama I. Dosary wrote:
>
>
>>Thank you Rodney,
>>
>>I ruled out a) a few days back, since I was able to catch a hang while
>>connected to the console -- finding no messages. Also, we pulled out the one
>>and only network cable, waited for over 20 minutes, and it still did not
>>come back. In addition, prior to a hang, there is no significant
>>traffic (bps/pps) or CPU increase noticeable on our monitoring system
>>(NMIS, 5-min polling interval). (And I don't know an easy way to sniff
>>100Mbps and make sense of it. Suggestions welcomed.)
>>
>>Regarding b), a "real OS/hardware hang": I believe this to be the culprit.
>>I've already set the register for stack trace a few days back, but haven't
>>been lucky enough to catch a hang on that router since. (Also wondering: will
>>a control-break be accepted by the console when the console hangs?)
>>
>>
>
>Yes.
>
>
>
>>After a bit of testing, we came up with a workaround: using an external
>>FastEthernet PA. This workaround has been working pretty well for the past
>>3 days (no hangs). There are very few "spurious interrupts", but still a
>>lot of in-queue drops.
>>
>>
>
>What do you mean an external FE PA?
>
>
I inserted a PA-FE-TX (or Fast-ethernet (TX-ISL) Port adapter, 1 port)
into the only 7301 slot.
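The in-queue drops are still there on the PA, by the way. If they keep
climbing, we may try raising the input hold queue from the default of 75,
something like this (the interface name is ours):

  show interfaces FastEthernet1/0   ! watch the "input queue" drops counter
  conf t
   interface FastEthernet1/0
    hold-queue 1500 in
  end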
>>Someone on the mailing list pointed me to this unresolved bug: "NPE-G1
>>hangs in unqueue_inline"/ CSCef31300. (*Thank you Ian Chaloner*)
>>http://www.cisco.com/cgi-bin/Support/Bugtool/onebug.pl?bugid=CSCef31300+&Submit=Search
>>Although our routers are both 7200s and 7301s, I figured that bug
>>might be it. To test it, I did "no ip cef" globally, and pounded the
>>router with L2TP. So far so good. We lost some CPU power, though (~30%).
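>>(Roughly what I did, for the record; the show commands were just to
>>confirm CEF was off and to watch where the extra load went:)
>>
>> conf t
>>  no ip cef
>> end
>> show ip cef                 <- should report "%CEF not running"
>> show processes cpu sorted   <- to see what picked up the ~30%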
>>
>>
>
>Please don't leave CEF off. Let's fix the problem. I'm ok with
>turning it off for just the short term until we get a plan together.
>
>
It's already off. So let's work out a plan. What do you suggest?
I set the register for stack trace, and I will follow the documented
procedure when it happens.
Is there anything else to do?
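For reference, this is roughly what I set (our old value was 0x2102; as I
understand it, clearing the 0x0100 bit leaves break enabled after boot, so
the console should drop into rommon on a break -- please correct me if
that's wrong for the NPE-G1):

  conf t
   config-register 0x2002
  end
  show version | include register   ! shows current value and the value at next reload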
>>The status of the bug is: *Unreproducible*. Does that mean no one is
>>working to fix it?
>>
>>
>
>No. It means we couldn't reproduce it in the lab and the particular
>customer that reported the problem isn't seeing it anymore.
>
>
We've reproduced it several times. I think we can continue to. But,
unfortunately, our customers will suffer.
>I looked at the bug. I've seen something similar a couple of times
>before, and this appears to be something along those lines. It
>appears to possibly be a problem in the switching vector.
>The decoded rommon trace will tell us for sure, but finding
>the root problem will most likely take a debug image.
>I see that for that bug DE had one ready to go for the customer.
>
>
DE?
>Please have the TAC engineer look at your configuration and
>match it up to see if it's similar to that bug. On the surface
>I'd say they look very close. If so, have them get you a debug
>image so we can figure out the root of the problem and get it fixed.
>
>Rodney
>
>
Will do.
Thank you,
Osama
>>Is our fate to run without CEF until more networks show up with this
>>problem?
>>
>>Help/Osama
>>
>>Rodney Dunn wrote:
>>
>>
>>
>>>So the deal with hangs is this.
>>>
>>>There are two kinds really.
>>>
>>>a) Ones that look like an OS hang but
>>>   really aren't.
>>>
>>>These result from a flood of traffic that causes
>>>the box to become unresponsive because it's
>>>doing too much work to service the telnet/console.
>>>
>>>One of the first things I tell people when they
>>>tell me their box hangs is to ask them to start
>>>doing some form of netflow, memory usage, and CPU
>>>trending on the box. You would be amazed at the hangs
>>>I've seen solved by watching the spikes on an MRTG
>>>graph just prior to a reported "OS hang". :)
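>>>Something like the following is usually enough to get trending
>>>started (12.3-era syntax; the interface name is just an example):
>>>
>>> conf t
>>>  interface GigabitEthernet0/1
>>>   ip route-cache flow        <- classic NetFlow accounting on the interface
>>> end
>>> show ip cache flow           <- top protocols/flows seen on the box
>>> show processes cpu history   <- built-in CPU graph over the last 72 hours
>>> show processes memory        <- memory totals and per-process usage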
>>>
>>>A sniffer handy always helps in these scenarios too.
>>>
>>>I've also seen people on the console say it
>>>stops responding, and they will pull cables one by
>>>one or watch the LEDs to try to figure out if
>>>traffic is heavy on a particular interface. If you
>>>pull the cables one by one and the console then starts
>>>responding again, that tells you at least that IOS or the
>>>hardware itself isn't in a hung condition, or it wouldn't
>>>come back at all.
>>>
>>>It is also suggested to put a PC on the console,
>>>start logging the session, and turn off the console
>>>exec timeout to see if there is anything printed
>>>to the console just prior to a failure.
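>>>i.e. something like this (the msec timestamps make it easier to line
>>>events up afterward):
>>>
>>> conf t
>>>  service timestamps log datetime msec localtime
>>>  line con 0
>>>   exec-timeout 0 0      <- 0 0 = never time out
>>>   logging synchronous
>>> end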
>>>
>>>b) The second type of hang is a real OS/hardware hang.
>>>   If it's an OS hang that the watchdog process doesn't
>>>   detect, the only way to get a valid idea of what the
>>>   OS is doing is to get a stack trace from rommon.
>>>   Here are the instructions:
>>>
>>>http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a0080106fd7.shtml
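>>>In short, once your terminal program sends the break, it's something
>>>like this at the prompt (the doc above has the full decode steps):
>>>
>>> rommon 1 > stack     <- stack trace of where IOS was stopped
>>> rommon 2 > context   <- CPU registers/context at the break
>>> rommon 3 > frame 0   <- decode an individual stack frame if needed
>>> rommon 4 > cont      <- let IOS continue where it left off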
>>>
>>>Also monitor, around that time, everything you possibly can
>>>about what is happening to the box:
>>>AAA logs for any commands that get entered, interfaces coming
>>>up/down, etc...
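>>>If you're running TACACS+, something along these lines captures the
>>>commands (the server group here is just the default placeholder):
>>>
>>> conf t
>>>  aaa new-model
>>>  aaa accounting exec default start-stop group tacacs+
>>>  aaa accounting commands 15 default start-stop group tacacs+
>>> end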
>>>
>>>The more you know about the system and what's happening around
>>>the time of a failure, the more it helps us figure out what
>>>is really going wrong.
>>>
>>>And always correlate events between boxes to look for things
>>>that are similar.
>>>
>>>e.g.: NPE, modules, traffic flows, etc.
>>>
>>>For problems like this my motto is "it's always about the trigger
>>>for the problem. Figuring that out is 99% of the work". (rodunn)
>>>
>>>Rodney
>>>
>>>
>>>On Mon, Dec 13, 2004 at 01:11:39PM +0300, Osama I. Dosary wrote:
>>>
>>>
>>>
>>>
>>>>Janet Sullivan wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>Osama I. Dosary wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>In the past week or so we have been experiencing total hangs on
>>>>>>*several* 7301 and 7200 routers (all NPE-G1).
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>What version(s) of IOS are you running?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>The IOS versions we ran when a hang occurred were the following: 12.3-5a.B1,
>>>>12.3-5a.B3, 12.3-9c, and 12.3-10a.
>>>>When we could not figure out the problem, we thought it might be an
>>>>unregistered IOS bug, so we jumped from one version to another.