[c-nsp] Any way to disable crashdump on Catalyst 35xx/37xx?

Sun Mar 27 12:38:20 EDT 2011

EEM
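
For readers unfamiliar with it: EEM is the IOS Embedded Event Manager.
A minimal sketch of what an EEM-based workaround might look like
follows.  It assumes the switch runs an IOS release with EEM applet
support, which is not a given on every 35xx/37xx image.  The applet
name is hypothetical, the syslog pattern is simply taken from the
console output quoted below, and whether an applet still gets to run
once the watchdog has fired is untested.

    ! Hypothetical applet: reload as soon as the watchdog message is logged
    event manager applet FORCE-RELOAD
     event syslog pattern "SYS-2-WATCHDOG"
     action 1.0 syslog msg "EEM: watchdog timeout seen, forcing reload"
     action 2.0 reload

Matching on an earlier symptom (for example a TCAM or CPU warning)
would probably give the applet a better chance of completing than
matching on the crash message itself.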


-----Original Message-----
From: "David DeSimone" <fox at verio.net>
Sender: cisco-nsp-bounces at puck.nether.net
Date: Sun, 27 Mar 2011 10:42:39 
To: <cisco-nsp at puck.nether.net>
Subject: [c-nsp] Any way to disable crashdump on Catalyst 35xx/37xx?

This is a stupid problem, and I feel stupid for asking this, but here it
is.

We have some switches that sometimes have their TCAM overloaded, and
this causes them to thrash their CPUs.  Sometimes the thrashing gets
so bad that the devices crash.  Sometimes they crash so badly that they
don't reload, and we must manually power-cycle the devices to revive
them.

Obviously the right thing to do is to remove the sources of overload,
but for various reasons, things are not proceeding at a good pace
towards that solution.  In the meantime, I am trying to find ways to
keep downtime to a minimum.  A reloading switch gives us about 5 minutes
of downtime, but a switch that won't reload often goes 30 minutes before
an operator can reach it and lay hands upon it.  I'm trying to see
if I can improve the probability of the switch reloading, instead of
freezing, when it crashes.

I have noticed that whenever the switches crash, they often have this
final message on their serial console:

    %Software-forced reload

    Preparing to dump core...

    Mar  8 00:00:06.172: %SYS-2-WATCHDOG: Process aborted on watchdog timeout, process = MDFS Reload.
    -Traceback= 1A80E58 1A80C40 1A7AF04 1E199F4 1E1DA2C 1A8FE00 1A8FF38 14AD164 BC0648 BB7118

So, it looks like the switch starts trying to dump its core, but another
thread of execution generates an exception at the wrong time and
interrupts the process, so it never completes.

Since we already know why these switches are crashing, and we don't need
to do any analysis on them, I would like to try disabling crashdumps
entirely, so that perhaps this race condition is removed, and we can
improve the chances of the switches reloading instead of freezing.  But,
is there a way to disable the crashdump process?

My reading suggested that this was the command I was looking for:

    no exception crashinfo

The documentation states that this will disable crashdumps.  However,
what I find in practice is that it only stops the "extended" crashinfo
file from being written.  That is, on a crash, both a "crashinfo" and
a "crashinfo_ext" entry are normally written to flash.  The above
command disables the "crashinfo_ext" but the "crashinfo" entry
continues to be written, which is still a waste of time, and perhaps
helps my switches freeze up because they give themselves more
opportunity to run into lockup conditions.

Is there perhaps a better way to force a switch to reload immediately
when it hits an exception, without attempting to write anything to
flash?

Like I said, I feel stupid for asking, but just wondering if anyone has
found a way to do this.

-- 
David DeSimone == Network Admin
  "I don't like spinach, and I'm glad I don't, because if I
   liked it I'd eat it, and I just hate it." -- Clarence Darrow

_______________________________________________
cisco-nsp mailing list  cisco-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/