[c-nsp] Total hangs on 7301/7200 (NPE-G1)

Rodney Dunn rodunn at cisco.com
Mon Dec 13 08:49:18 EST 2004


So the deal with hangs is this.

There are two kinds really.

a) Ones that look like an OS hang but
   really aren't.

These result from a flood of traffic that causes
the box to become unresponsive because it's
doing too much work to service the telnet/console.

One of the first things I tell people when they
tell me their box hangs is to ask them to start
doing some form of netflow, memory usage, and CPU
trending of the box.  You would be amazed at the hangs
I've seen solved by watching the spikes on an MRTG
graph just prior to a reported "OS hang". :)
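
To make the trending idea concrete, here is a minimal sketch of
checking CPU samples for spikes just before a reported hang.  The
function name, the 15 minute window, and the 90% threshold are all
assumptions for illustration; the samples would come from whatever
poller you already run (MRTG or otherwise).

```python
from datetime import datetime, timedelta


def spikes_before(samples, hang_time,
                  window=timedelta(minutes=15), threshold=90.0):
    """Return the (timestamp, cpu_percent) samples at or above
    `threshold` that fall within `window` before `hang_time`.

    `samples` is an iterable of (datetime, float) pairs.  The window
    and threshold defaults are illustrative, not recommended values.
    """
    start = hang_time - window
    return [
        (ts, cpu)
        for ts, cpu in samples
        if start <= ts <= hang_time and cpu >= threshold
    ]
```

If this returns anything for the window before a reported hang, the
box was probably busy rather than hung, and the next question is what
traffic caused the spike.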

Having a sniffer handy always helps in these scenarios too.

I've also seen people on the console say it
stops responding, and they will pull cables one by
one or watch the LEDs to try and figure out if
traffic is heavy on a particular interface.  If you
pull the cables one by one and then the console starts
responding again, that tells you at least that IOS or the
hardware itself isn't in a hung condition, or it wouldn't
come back at all.
 
What is also suggested is to put a PC on the console
and start logging the session and turn off console
exec timeout to see if there is anything printed
to the console just prior to a failure.
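
As a rough sketch of the logging side, the snippet below copies
console output line by line into a log with a timestamp on each
line.  It assumes the console output is available as a text stream
(how you open the serial port is left out), and the name log_console
is made up for this example.

```python
import io
from datetime import datetime


def log_console(stream, log, now=datetime.now):
    """Copy each line from `stream` to `log`, prefixed with a timestamp.

    `stream` is any iterable of text lines (assumed to be the console
    session); `log` is any writable text file.  Returns the number of
    lines written.
    """
    count = 0
    for line in stream:
        stamp = now().isoformat(timespec="seconds")
        log.write(f"{stamp} {line.rstrip(chr(13) + chr(10))}\n")
        # Flush after every line so the last output before a hang
        # is not lost in a buffer.
        log.flush()
        count += 1
    return count
```

The timestamps are what make the log useful later, since they let you
line up the last console output with the traffic and CPU graphs.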

b) The second type of hang is a real OS/hardware hang.
   If it's an OS hang that the watchdog process doesn't
   detect, the only way to get a valid idea of what the
   OS is doing is to get a stack trace from rommon.
   Here are the instructions:

http://www.cisco.com/en/US/products/hw/routers/ps359/products_tech_note09186a0080106fd7.shtml

Also monitor everything you possibly can about what is
happening to the box around the time of the hang:
AAA logs for any commands that get entered, interfaces coming
up/down, etc.
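
One way to look at all of that together is to merge the separate
logs into a single timeline around the failure.  This is only a
sketch: the (timestamp, source, message) tuple layout and the
function name are assumptions, and each source is assumed to be
already sorted by time.

```python
import heapq
from datetime import datetime, timedelta


def timeline(failure_time, window, *sources):
    """Merge time-sorted event lists and keep events near a failure.

    Each source is a list of (timestamp, source_name, message) tuples
    already sorted by timestamp.  Events further than `window` from
    `failure_time`, before or after, are dropped.
    """
    merged = heapq.merge(*sources)  # relies on each source being sorted
    return [e for e in merged if abs(e[0] - failure_time) <= window]
```

Reading the AAA commands and the interface flaps in one ordered list
makes it easier to see which event came first.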

The more you know about the system and what's happening around
the time of a failure, the more it helps us figure out what
is really going wrong.

And always correlate events between boxes to look for things
that are similar.

e.g. NPE, modules, traffic flows, etc.
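
A small sketch of that correlation step: given one record per box
that hung, report the attributes every box has in common.  The
record fields are whatever you choose to collect; the function name
is hypothetical.

```python
def shared_traits(boxes):
    """Return the {attribute: value} pairs identical on every box.

    `boxes` is a list of dicts, one per affected router, with
    hashable values (strings, numbers, tuples).
    """
    if not boxes:
        return {}
    common = set(boxes[0].items())
    for box in boxes[1:]:
        common &= set(box.items())  # keep only pairs seen on every box
    return dict(common)
```

For example, with two made-up records that differ in platform and
IOS version but both carry an NPE-G1, the only shared trait reported
is the NPE, which is where you would start looking.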

For problems like this my motto is "it's always about the trigger
for the problem. Figuring that out is 99% of the work". (rodunn)

Rodney


On Mon, Dec 13, 2004 at 01:11:39PM +0300, Osama I. Dosary wrote:
> 
> 
> Janet Sullivan wrote:
> 
> > Osama I. Dosary wrote:
> >
> >> In the past week or so we have been experiencing total hangs on 
> >> *several* 7301 and 7200 routers (all NPE-G1).
> >
> >
> > What version(s) of IOS are you running?
> 
> The IOSs we ran when a hang occurred, were the following: 12.3-5a.B1, 
> 12.3-5a.B3, 12.3-9c, and 12.3-10a.
> When we could not figure out the problem, we thought it might be an 
> un-registered IOS bug, so we jumped from one to another.
> _______________________________________________
> cisco-nsp mailing list  cisco-nsp at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-nsp
> archive at http://puck.nether.net/pipermail/cisco-nsp/

