[c-nsp] Total hangs on 7301/7200 (NPE-G1)

Rodney Dunn rodunn at cisco.com
Tue Dec 14 21:17:20 EST 2004


> >
> I inserted a PA-FE-TX (or Fast-ethernet (TX-ISL) Port adapter, 1 port) 
> into the only 7301 slot.

Ok.

> 
> >>Someone on the mailing list pointed me to this unresolved bug: "NPE-G1 
> >>hangs in unqueue_inline"/ CSCef31300. (*Thank you Ian Chaloner*)
> >>http://www.cisco.com/cgi-bin/Support/Bugtool/onebug.pl?bugid=CSCef31300+&Submit=Search
> >>Although our routers are both 7200s and 7301s, I figured that bug
> >>might be it. To test it, I did "no ip cef" globally and pounded the
> >>router with L2TP. So far so good. We lost some CPU power, though (~30%).
> >
> >Please don't leave CEF off.  Let's fix the problem.  I'm ok with
> >turning it off for just the short term until we get a plan together.
> >
> It's already off. So let's work out a plan. What do you suggest?
> I set the register for the stack trace, and I will follow the documented
> procedure when it happens.
> Is there anything else to do?
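While CEF is off, it's worth confirming the state and keeping an eye on how
much CPU the workaround is actually costing, since everything is now
fast/process switched. Roughly something like this (exec commands; exact
output wording varies a bit by release):

  ! Should report that CEF is not running once it's disabled globally
  show ip cef
  ! Total vs. interrupt-level CPU, plus the longer-term history
  show processes cpu sorted
  show processes cpu history
  ! Per-interface counts of which switching path packets are taking
  show interfaces stats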
> 
> >>The status of the bug is: *Unreproducible*. Does that mean no one is 
> >>working to fix it?
> >
> >No.  It means we couldn't reproduce it in the lab and the particular
> >customer that reported the problem isn't seeing it anymore.
> >
> We've reproduced it several times, and I think we can continue to. But,
> unfortunately, our customers will suffer.
> 
> >I looked at the bug.  I've seen something similar a couple of times
> >before, and this appears to be along the same lines.  It
> >appears to possibly be a problem in the switching vector.
> >The decoded rommon trace will tell us for sure, but finding
> >the root problem will most likely take a debug image.
> >I see that for that bug DE had one ready to go for the customer.
> >
> DE?


Sorry.  DE = Development Engineers (the ones who write the code).

> 
> >Please have the TAC engineer look at your configuration and
> >match it up to see if it's similar to that bug.  On the surface
> >I'd say they look very close.  If so, have them get you a debug
> >image so we can figure out the root of the problem and get it fixed.
> >
> >Rodney
> >
> Will do.

I think you probably need that debug image given the bug information
and the details you have provided.

> 
> Thank you,
> Osama
> 
> >>Is our fate to run without cef until more networks show up with this 
> >>problem?
> >>
> >>Help/Osama
> >>
> >>Rodney Dunn wrote:
> >>
> >>>So the deal with hangs is this.
> >>>
> >>>There are two kinds really.
> >>>
> >>>a) Ones that look like an OS hang but
> >>>  really aren't.
> >>>
> >>>These result from a flood of traffic that causes
> >>>the box to become unresponsive because it's
> >>>doing too much work to service the telnet/console.
> >>>
> >>>One of the first things I tell people when they
> >>>tell me their box hangs is to start
> >>>doing some form of netflow, memory usage, and CPU
> >>>trending of the box.  You would be amazed at the hangs
> >>>I've seen solved by watching the spikes on an MRTG
> >>>graph just prior to a reported "OS hang". :)
> >>>
> >>>Having a sniffer handy always helps in these scenarios too.
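As a sketch of that kind of trending (the interface name and community
string are just placeholders; any SNMP poller such as MRTG will do for the
CPU/memory graphs):

  configure terminal
   ! NetFlow accounting on the busy interfaces, so you can see what the
   ! traffic mix looked like right before a hang
   interface GigabitEthernet0/1
    ip route-cache flow
   exit
   ! Read-only SNMP so the poller can trend CPU and memory
   snmp-server community <read-only-string> RO
  end

  ! Spot checks from the CLI
  show ip cache flow
  show processes cpu history
  show memory summary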
> >>>
> >>>I've also seen people on the console, when it
> >>>stops responding, pull cables one by
> >>>one or watch the LEDs to try and figure out if
> >>>traffic is heavy on a particular interface.  If you
> >>>pull the cables one by one and the console starts
> >>>responding again, that tells you at least that IOS or the
> >>>hardware itself isn't in a hung condition, or it wouldn't
> >>>come back at all.
> >>>
> >>>It is also suggested to put a PC on the console,
> >>>start logging the session, and turn off the console
> >>>exec timeout, to see if there is anything printed
> >>>to the console just prior to a failure.
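A minimal console-line setup for that kind of capture (the PC just runs a
terminal program that logs everything to a file):

  configure terminal
   ! Timestamp anything that does make it to the console
   service timestamps log datetime msec localtime
   service timestamps debug datetime msec localtime
   line con 0
    ! Never time the session out while waiting for the failure
    exec-timeout 0 0
    logging synchronous
  end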
> >>>
> >>>b) The second type of hang is a real OS/hardware hang.
> >>>  If it's an OS hang that the watchdog process doesn't
> >>>  detect, the only way to get a valid idea of what the
> >>>  OS is doing is to get a stack trace from rommon.
> >>>  Here are the instructions:
> >>>
> >>>http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a0080106fd7.shtml
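In short, that procedure comes down to enabling the break key ahead of time
and then grabbing the trace when the hang happens; roughly like this
(double-check the register value in the document for your platform; 0x2102,
the usual default, ignores break once the box is up):

  ! Ahead of time: allow break so the console can drop into ROMmon
  configure terminal
   config-register 0x2002
  end
  ! (takes effect from the next reload)

  ! When the router hangs: send a break from the terminal, then at the
  ! ROMmon prompt collect the stack trace and CPU context, and resume IOS
  rommon 1 > stack
  rommon 2 > context
  rommon 3 > cont
  ! Capture all of that output for the TAC case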
> >>>
> >>>Also, around the time of the failure, monitor everything you possibly
> >>>can about what is happening to the box:
> >>>AAA logs for any commands that get entered, interfaces coming
> >>>up/down, etc.
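If you already have a TACACS+ and syslog server around, command accounting
plus shipping the link up/down messages off-box covers most of that; a
sketch, with placeholder addresses and key:

  configure terminal
   aaa new-model
   ! Account for every exec session and every privileged command entered
   aaa accounting exec default start-stop group tacacs+
   aaa accounting commands 15 default start-stop group tacacs+
   tacacs-server host 192.0.2.10 key <shared-secret>
   ! Send interface up/down and other events to a syslog host
   logging 192.0.2.20
   logging trap informational
  end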
> >>>
> >>>The more you know about the system and what's happening around
> >>>the time of a failure, the more it helps us figure out what
> >>>is really going wrong.
> >>>
> >>>And always correlate events between boxes to look for things
> >>>that are similar.
> >>>
> >>>e.g. NPE, modules, traffic flows, etc.
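For comparing hardware across the affected boxes, something like the usual
inventory commands should be enough (output format differs a bit between
the 7200 and the 7301):

  show version
  ! NPE type, midplane, and installed port adapters
  show c7200
  show diag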
> >>>
> >>>For problems like this my motto is "it's always about the trigger
> >>>for the problem. Figuring that out is 99% of the work". (rodunn)
> >>>
> >>>Rodney
> >>>
> >>>
> >>>On Mon, Dec 13, 2004 at 01:11:39PM +0300, Osama I. Dosary wrote:
> >>>>Janet Sullivan wrote:
> >>>>
> >>>>>Osama I. Dosary wrote:
> >>>>>
> >>>>>>In the past week or so we have been experiencing total hangs on 
> >>>>>>*several* 7301 and 7200 routers (all NPE-G1).
> >>>>>What version(s) of IOS are you running?
> >>>>The IOS versions we ran when a hang occurred were the following: 12.3-5a.B1,
> >>>>12.3-5a.B3, 12.3-9c, and 12.3-10a.
> >>>>When we could not figure out the problem, we thought it might be an
> >>>>unregistered IOS bug, so we jumped from one to another.

