[j-nsp] EX 8200 deployment

Wed Mar 24 19:21:31 EDT 2010

On Thu, Mar 25, 2010 at 12:31:15AM +0300, Pavel Lunin wrote:
> Richard, one more thing. What do you do with the crash dumps when you
> untar and unzip them on the router/switch itself? I have never done
> anything with them except send them to JTAC. I believe it can make a
> lot of sense to pull them and investigate yourself (though I've never
> tried), but why on the switch itself? Am I missing something important?

You can run gdb on the coredump files locally and get a pretty good idea
of what blew up and where, which is often quite helpful in working
around the original problem. Also, JTAC is far too often surprisingly
bad at working with coredumps. Without being able to verify things
independently and tell them where they were confused, some of my cases
would probably never have been solved.
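
A minimal sketch of the gdb step, assuming a host that has both Python
and a gdb build that understands the target architecture. The helper
name and the paths are hypothetical; only the gdb options `--batch` and
`-ex` are real. The Python is just there to show the shape of the
command line, which you could equally type by hand.

```python
import shlex


def gdb_backtrace_cmd(binary, core, gdb="gdb"):
    """Build a gdb command line that prints a backtrace and exits.

    binary: path to the executable that crashed (hypothetical here)
    core:   path to the uncompressed core file (hypothetical here)
    """
    # --batch makes gdb exit once its commands have run;
    # -ex "bt" runs the backtrace command against the loaded core.
    return [gdb, "--batch", "-ex", "bt", binary, core]


if __name__ == "__main__":
    # Placeholder paths, not real Junos locations.
    cmd = gdb_backtrace_cmd("/path/to/daemon", "/path/to/daemon.core")
    print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run` on a machine where
gdb is installed, or simply read off and typed at a shell.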

The story that was explained to me was that JTAC has some point-and-click
tool that they load the core into, which parses it and searches their PR
database to find matching backtraces. The problem is that I'm convinced
at this point nobody in JTAC actually knows what a backtrace is or how
to read one. They just go with whatever their tool tells them, and
surprisingly often their tool is very, very wrong.

The other big problem of course is file size and compression. Apparently
their tool only works with .zip files, not .tgz files (which is a bit of
a problem, seeing as how the router only has gzip :P), so they have to
uncompress it locally before they can load it. I've had JTAC not know
what a .tgz file was, and I've had Advanced JTAC spend days trying to
figure out why they couldn't get any data out of a coredump when the
problem turned out to be that their local filesystem quota wasn't big
enough to work with a large core file, etc, etc. Even when things work
"right" it seems to take them 12-72 hours to parse a coredump, even on a
P1 case, and a healthy percentage of the time their analysis is just
flat out wrong. Without the ability to look at the dump yourself, you'd
never know they were barking up the wrong tree.
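
For what it's worth, repacking a .tgz as a .zip is not hard on any
workstation with Python. The following is a rough sketch using only the
standard library; the function name is made up for illustration, and it
says nothing about what JTAC's tool actually expects inside the archive.
It streams each file, so a multi-gigabyte core is never held in memory.

```python
import shutil
import tarfile
import zipfile


def tgz_to_zip(tgz_path, zip_path):
    """Repack the regular files of a .tgz archive into a .zip archive.

    Returns the number of files copied. Directories, links and device
    nodes in the tarball are skipped.
    """
    count = 0
    with tarfile.open(tgz_path, "r:gz") as tar, \
            zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
        for member in tar:
            if not member.isfile():
                continue
            src = tar.extractfile(member)
            # force_zip64 allows entries larger than 2 GiB, which large
            # core files can be.
            with src, zf.open(member.name, "w", force_zip64=True) as dst:
                shutil.copyfileobj(src, dst)
            count += 1
    return count
```

Note that this needs roughly the compressed size of the core in free
disk space for the output file, which is exactly the kind of quota
problem described above.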

Because EX uses PowerPC, it isn't even particularly easy to find a
FreeBSD ppc box where you can actually do any useful analysis of the
coredumps off the switch. That also assumes you have working
connectivity on the box in question and can quickly copy the sometimes
very large files off, which is often not the case because of the
original problem that caused the crash. And where do they plan on
writing a 2GB core dump when there is an EX kernel panic and you only
have 600MB of free space on an "empty" box? You can bet there will be
kernel panics; I run into them at least 2 or 3 times a year on MX,
easily. It's just a fact of life. I mean seriously, what does 32GB of
flash cost, $100? Think about the amount of grief that will be caused by
this in comparison, and tell me it was a smart move on their part. :)
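
To put numbers on the free-space point, here is a small sketch of the
check. The helper names are hypothetical, and the 2GB and 600MB figures
are simply the ones from the paragraph above, treated as binary units.
`shutil.disk_usage` is the standard-library call for reading free space
on a mounted filesystem.

```python
import shutil

MB = 1024 ** 2
GB = 1024 ** 3


def core_fits(core_bytes, free_bytes, reserve_bytes=0):
    """Return True if a core of core_bytes fits in free_bytes,
    leaving reserve_bytes untouched."""
    return core_bytes + reserve_bytes <= free_bytes


def free_bytes_at(path):
    """Free space, in bytes, on the filesystem that contains path."""
    return shutil.disk_usage(path).free


if __name__ == "__main__":
    # A 2GB core against 600MB free: it does not fit.
    print(core_fits(2 * GB, 600 * MB))  # False
```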

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)