[c-nsp] Total hangs on 7301/7200 (NPE-G1)
Osama I. Dosary
oid at saudi.net.sa
Tue Dec 14 00:29:30 EST 2004
Rodney Dunn wrote:
>On Mon, Dec 13, 2004 at 06:01:47PM +0300, Osama I. Dosary wrote:
>
>
>>Thank you Rodney,
>>
>>I ruled out a) a few days back, since I was able to catch a hang while
>>connected to the console -- finding no messages. Also, we pulled out the one
>>and only network cable, waited for over 20 minutes, and it still did not
>>come back. In addition, prior to a hang, there is no significant
>>traffic (bps/pps) or CPU increase noticeable on our monitoring system
>>(NMIS, 5-min polling interval). (And I don't know an easy way to sniff
>>100Mbps and make sense of it. Suggestions welcomed.)
>>
>>Regarding b), a "real OS/hardware hang": I believe this to be the culprit.
>>I've already set the register for stack trace a few days back, but haven't
>>been lucky enough to catch a hang on that router since. (Also wondering: will
>>a control-break be accepted by the console when the console hangs?)
>>
>>
>
>Yes.
>
>
>
>>After a bit of testing, we came up with a workaround: using an external
>>FastEthernet PA. This workaround has been working pretty well for the past
>>3 days (no hangs). There are very few "spurious interrupts", but still a
>>lot of in-queue drops.
>>
>>
>
>What do you mean an external FE PA?
>
>
I inserted a PA-FE-TX (or Fast-ethernet (TX-ISL) Port adapter, 1 port)
into the only 7301 slot.
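The in-queue drops are still there on the PA, by the way. If they keep
climbing, we may try raising the input hold queue from the default of 75,
something like this (the interface name is ours):

  show interfaces FastEthernet1/0   ! watch the "input queue" drops counter
  conf t
   interface FastEthernet1/0
    hold-queue 1500 in
  end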
>>Someone on the mailing list pointed me to this unresolved bug: "NPE-G1
>>hangs in unqueue_inline"/ CSCef31300. (*Thank you Ian Chaloner*)
>>http://www.cisco.com/cgi-bin/Support/Bugtool/onebug.pl?bugid=CSCef31300+&Submit=Search
>>Although our routers are both 7200s and 7301s, I figured that bug
>>might be it. To test it, I did "no ip cef" globally, and pounded the
>>router with L2TP. So far so good. We lost some CPU power, though (~30%).
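>>(Roughly what I did, for the record; the show commands were just to
>>confirm CEF was off and to watch where the extra load went:)
>>
>> conf t
>>  no ip cef
>> end
>> show ip cef                 <- should report "%CEF not running"
>> show processes cpu sorted   <- to see what picked up the ~30%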
>>
>>
>
>Please don't leave CEF off. Let's fix the problem. I'm ok with
>turning it off for just the short term until we get a plan together.
>
>
It's already off. So let's work out a plan. What do you suggest?
I set the register for stack trace, and I will follow the documented
procedure when it happens.
Is there anything else to do?
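For reference, this is roughly what I set (our old value was 0x2102; as I
understand it, clearing the 0x0100 bit leaves break enabled after boot, so
the console should drop into rommon on a break -- please correct me if
that's wrong for the NPE-G1):

  conf t
   config-register 0x2002
  end
  show version | include register   ! shows current value and the value at next reload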
>>The status of the bug is: *Unreproducible*. Does that mean no one is
>>working to fix it?
>>
>>
>
>No. It means we couldn't reproduce it in the lab and the particular
>customer that reported the problem isn't seeing it anymore.
>
>
We've reproduced it several times. I think we can continue to. But,
unfortunately, our customers will suffer.
>I looked at the bug. I've seen something similar a couple of times
>before, and this appears to be something along those lines. It
>appears to possibly be a problem in the switching vector.
>The decoded rommon trace will tell us for sure, but finding
>the root problem will most likely take a debug image.
>I see that for that bug DE had one ready to go for the customer.
>
>
DE?
>Please have the TAC engineer look at your configuration and
>match it up to see if it's similar to that bug. On the surface
>I'd say they look very close. If so, have them get you a debug
>image so we can figure out the root of the problem and get it fixed.
>
>Rodney
>
>
Will do.
Thank you,
Osama
>>Is our fate to run without CEF until more networks show up with this
>>problem?
>>
>>Help/Osama
>>
>>Rodney Dunn wrote:
>>
>>
>>
>>>So the deal with hangs is this.
>>>
>>>There are two kinds really.
>>>
>>>a) Ones that look like an OS hang but
>>>   really aren't.
>>>
>>>These result from a flood of traffic that causes
>>>the box to become unresponsive because it's
>>>doing too much work to service the telnet/console.
>>>
>>>One of the first things I tell people when they
>>>tell me their box hangs is to ask them to start
>>>doing some form of netflow, memory usage, and CPU
>>>trending on the box. You would be amazed at the hangs
>>>I've seen solved by watching the spikes on an MRTG
>>>graph just prior to a reported "OS hang". :)
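>>>Something like the following is usually enough to get trending
>>>started (12.3-era syntax; the interface name is just an example):
>>>
>>> conf t
>>>  interface GigabitEthernet0/1
>>>   ip route-cache flow        <- classic NetFlow accounting on the interface
>>> end
>>> show ip cache flow           <- top protocols/flows seen on the box
>>> show processes cpu history   <- built-in CPU graph over the last 72 hours
>>> show processes memory        <- memory totals and per-process usage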
>>>
>>>A sniffer handy always helps in these scenarios too.
>>>
>>>I've also seen people on the console say it
>>>stops responding, and they will pull cables one by
>>>one or watch the LEDs to try to figure out if
>>>traffic is heavy on a particular interface. If you
>>>pull the cables one by one and the console then starts
>>>responding again, that tells you at least that IOS or the
>>>hardware itself isn't in a hung condition, or it wouldn't
>>>come back at all.
>>>
>>>It is also suggested to put a PC on the console,
>>>start logging the session, and turn off the console
>>>exec timeout to see if there is anything printed
>>>to the console just prior to a failure.
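>>>i.e. something like this (the msec timestamps make it easier to line
>>>events up afterward):
>>>
>>> conf t
>>>  service timestamps log datetime msec localtime
>>>  line con 0
>>>   exec-timeout 0 0      <- 0 0 = never time out
>>>   logging synchronous
>>> end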
>>>
>>>b) The second type of hang is a real OS/hardware hang.
>>>   If it's an OS hang that the watchdog process doesn't
>>>   detect, the only way to get a valid idea of what the
>>>   OS is doing is to get a stack trace from rommon.
>>>   Here are the instructions:
>>>
>>>http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a0080106fd7.shtml
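>>>In short, once your terminal program sends the break, it's something
>>>like this at the prompt (the doc above has the full decode steps):
>>>
>>> rommon 1 > stack     <- stack trace of where IOS was stopped
>>> rommon 2 > context   <- CPU registers/context at the break
>>> rommon 3 > frame 0   <- decode an individual stack frame if needed
>>> rommon 4 > cont      <- let IOS continue where it left off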
>>>
>>>Also monitor, around that time, everything you possibly can
>>>about what is happening to the box:
>>>AAA logs for any commands that get entered, interfaces coming
>>>up/down, etc...
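>>>If you're running TACACS+, something along these lines captures the
>>>commands (the server group here is just the default placeholder):
>>>
>>> conf t
>>>  aaa new-model
>>>  aaa accounting exec default start-stop group tacacs+
>>>  aaa accounting commands 15 default start-stop group tacacs+
>>> end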
>>>
>>>The more you know about the system and what's happening around
>>>the time of a failure, the more it helps us figure out what
>>>is really going wrong.
>>>
>>>And always correlate events between boxes to look for things
>>>that are similar.
>>>
>>>e.g.: NPE, modules, traffic flows, etc.
>>>
>>>For problems like this my motto is "it's always about the trigger
>>>for the problem. Figuring that out is 99% of the work". (rodunn)
>>>
>>>Rodney
>>>
>>>
>>>On Mon, Dec 13, 2004 at 01:11:39PM +0300, Osama I. Dosary wrote:
>>>
>>>
>>>
>>>
>>>>Janet Sullivan wrote:
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>>Osama I. Dosary wrote:
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>>In the past week or so we have been experiencing total hangs on
>>>>>>*several* 7301 and 7200 routers (all NPE-G1).
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>What version(s) of IOS are you running?
>>>>>
>>>>>
>>>>>
>>>>>
>>>>The IOS versions we ran when a hang occurred were the following: 12.3-5a.B1,
>>>>12.3-5a.B3, 12.3-9c, and 12.3-10a.
>>>>When we could not figure out the problem, we thought it might be an
>>>>unregistered IOS bug, so we jumped from one version to another.