[c-nsp] Total hangs on 7301/7200 (NPE-G1)

Mon Dec 13 10:16:29 EST 2004

On Mon, Dec 13, 2004 at 06:01:47PM +0300, Osama I. Dosary wrote:
> Thank you Rodney,
> 
> I ruled out a) a few days back, since I was able to catch a hang while 
> connected to console -- finding no messages. Also, we pulled out the one 
> and only network cable, waited for over 20 minutes, and it still did not 
> come back. In addition, prior to a hang, there is no significant 
> traffic (bps and pps) or CPU increase noticeable on our monitoring system 
> (NMIS, 5-minute polling interval). (And I don't know an easy way to sniff 
> 100Mbps and make sense of it. Suggestions welcome.)
> 
> Regarding b) "real OS/hardware hang". I believe this to be the culprit. 
> I've already set the register for stack trace a few days back, but haven't 
> been lucky enough to have a hang on that router since. (Also wondering: will a 
> control-break be accepted by the console when the console hangs?)

Yes.

> 
> After a bit of testing, we came up with a workaround: using an external 
> FastEthernet PA. This workaround has been working pretty well for the past 
> 3 days (no hangs). There are very few "spurious interrupts", but still a 
> lot of in-queue drops.

What do you mean by an external FE PA?

> 
> Someone on the mailing list pointed me to this unresolved bug: "NPE-G1 
> hangs in unqueue_inline"/ CSCef31300. (*Thank you Ian Chaloner*)
> http://www.cisco.com/cgi-bin/Support/Bugtool/onebug.pl?bugid=CSCef31300+&Submit=Search
> Although our routers are both 7200 and 7301, I figured that that bug 
> might be it. To test it, I did "no ip cef" globally, and pounded the 
> router with L2TP. So far so good. We lost some CPU power, though (~30%).

Please don't leave CEF off.  Let's fix the problem.  I'm ok with
turning it off for just the short term until we get a plan together.

> 
> The status of the bug is: *Unreproducible*. Does that mean no one is 
> working to fix it?

No.  It means we couldn't reproduce it in the lab and the particular
customer that reported the problem isn't seeing it anymore.

I looked at the bug.  I've seen something similar a couple of times
before and this appears to be something along those lines.  It
appears to possibly be a problem in the switching vector.
The decoded rommon trace will tell us for sure, but finding
the root problem will most likely take a debug image.
I see for that bug DE had one ready to go for the customer.
Please have the TAC engineer look at your configuration and
match it up to see if it's similar to that bug.  On the surface
I'd say they look very close.  If so, have them get you a debug
image so we can figure out the root of the problem and get it fixed.

Rodney

> Is our fate to run without cef until more networks show up with this 
> problem?
> 
> Help/Osama
> 
> Rodney Dunn wrote:
> 
> >So the deal with hangs is this.
> >
> >There are two kinds really.
> >
> >a) Ones that look like an OS hang but 
> >   really aren't. 
> >
> >These result from a flood of traffic that causes
> >the box to become unresponsive because it's
> >doing too much work to service the telnet/console.
> >
> >One of the first things I tell people when they
> >tell me their box hangs is to ask them to start
> >doing some form of netflow, memory usage, and CPU
> >trending of the box.  You would be amazed at the hangs
> >I've seen solved by watching the spikes on an MRTG
> >graph just prior to a reported "OS hang". :)
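
The "watch for spikes just prior to the hang" idea can be sketched in a few lines. This is only an illustration, not a tool anyone in this thread uses: the function name, the sample values, the hang time, and the 3x-median threshold are all assumptions made up for the example.

```python
from statistics import median


def spikes_before(samples, hang_time, window_s=1800, factor=3.0):
    """Return samples in the window before hang_time above factor x baseline median.

    samples: list of (unix_time, value) tuples, e.g. CPU % or pps per poll.
    The baseline is every sample older than the window.
    """
    baseline = [v for t, v in samples if t < hang_time - window_s]
    if not baseline:
        return []
    threshold = factor * median(baseline)
    return [(t, v) for t, v in samples
            if hang_time - window_s <= t < hang_time and v > threshold]


# Hypothetical 5-minute CPU samples (percent); the hang is assumed at t=3000.
cpu = [(0, 12), (300, 11), (600, 13), (900, 12), (1200, 14),
       (1500, 12), (1800, 13), (2100, 15), (2400, 48), (2700, 91)]
print(spikes_before(cpu, hang_time=3000, window_s=900))
```

The same function works on pps or memory samples. With a 5-minute polling interval a short burst is averaged away, so finer-grained samples make this kind of check more useful.
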
> >
> >Having a sniffer handy always helps in these scenarios too.
> >
> >I've also seen people on the console say it
> >stops responding, and they will pull cables one by
> >one or watch the LEDs to try to figure out if
> >traffic is heavy on a particular interface.  If you
> >pull the cables one by one and then the console starts
> >responding, that tells you at least that IOS or the
> >hardware itself isn't in a hung condition, or it wouldn't
> >come back at all.
> > 
> >What is also suggested is to put a PC on the console
> >and start logging the session and turn off console
> >exec timeout to see if there is anything printed
> >to the console just prior to a failure.
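
A rough sketch of the console-logging idea: stamp each captured line with its arrival time so the last output before a failure can be pulled out afterwards. How the lines get off the serial port is left out; `stamp`, `tail_before`, and the placeholder lines are hypothetical names and data for illustration only.

```python
import time


def stamp(lines, clock=time.time):
    """Yield (arrival_time, line) for each console line as it is read."""
    for line in lines:
        yield clock(), line.rstrip("\n")


def tail_before(log, hang_time, window_s=120):
    """Return lines that arrived within window_s seconds before hang_time."""
    return [line for t, line in log if hang_time - window_s <= t <= hang_time]


# Placeholder console lines and a fake clock, so the example is repeatable.
ticks = iter([100.0, 250.0, 290.0])
log = list(stamp(["line A\n", "line B\n", "line C\n"],
                 clock=lambda: next(ticks)))
print(tail_before(log, hang_time=300.0))
```

In real use `lines` would be whatever file-like object the terminal logger exposes, and `clock` would be left at its default.
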
> >
> >b) The second type of hang is a real OS/hardware hang.
> >   If it's an OS hang that the watchdog process doesn't
> >   detect, the only way to get a valid idea of what the
> >   OS is doing is to get a stack trace from rommon.
> >   Here are the instructions:
> >
> >http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a0080106fd7.shtml
> >
> >Also monitor everything you possibly can about what is
> >happening to the box around the time of the hang:
> >AAA logs for any commands that get entered, interfaces coming
> >up/down, etc...
> >
> >The more you know about the system and what's happening around
> >the time of a failure, the more it helps us figure out what
> >is really going wrong.
> >
> >And always correlate events between boxes to look for things
> >that are similar.
> >
> >ie: NPE, modules, traffic flows, etc.
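
Correlating between boxes amounts to asking what every affected box has in common. A minimal sketch, assuming each hung router is described by a small attribute dictionary; the chassis and NPE values come from this thread, while the other keys and values are hypothetical stand-ins.

```python
def common_attributes(boxes):
    """Return the attribute/value pairs shared by every box in the list."""
    if not boxes:
        return {}
    shared = set(boxes[0].items())
    for box in boxes[1:]:
        shared &= set(box.items())
    return dict(shared)


# Hypothetical inventory of routers that hung; "cef" and "uplink" are
# illustrative attributes, not facts reported in this thread.
hung = [
    {"chassis": "7301", "npe": "NPE-G1", "cef": "on", "uplink": "onboard"},
    {"chassis": "7200", "npe": "NPE-G1", "cef": "on", "uplink": "onboard"},
]
print(sorted(common_attributes(hung).items()))
```

Whatever survives the intersection is a candidate trigger. Running the same comparison against boxes that did not hang helps rule out attributes that are simply common to the whole network.
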
> >
> >For problems like this my motto is "it's always about the trigger
> >for the problem. Figuring that out is 99% of the work". (rodunn)
> >
> >Rodney
> >
> >
> >On Mon, Dec 13, 2004 at 01:11:39PM +0300, Osama I. Dosary wrote:
> >  
> >
> >>Janet Sullivan wrote:
> >>
> >>    
> >>
> >>>Osama I. Dosary wrote:
> >>>
> >>>      
> >>>
> >>>>In the past week or so we have been experiencing total hangs on 
> >>>>*several* 7301 and 7200 routers (all NPE-G1).
> >>>>        
> >>>>
> >>>What version(s) of IOS are you running?
> >>>      
> >>>
> >>The IOS versions we were running when hangs occurred were: 12.3-5a.B1, 
> >>12.3-5a.B3, 12.3-9c, and 12.3-10a.
> >>When we could not figure out the problem, we thought it might be an 
> >>un-registered IOS bug, so we jumped from one to another.
> >>_______________________________________________
> >>cisco-nsp mailing list  cisco-nsp at puck.nether.net
> >>https://puck.nether.net/mailman/listinfo/cisco-nsp
> >>archive at http://puck.nether.net/pipermail/cisco-nsp/
> >>    
> >>