[c-nsp] TAC using us as a NPE-G1 bug fix test site

Mike O'Connor mjo at dojo.mi.org
Mon Feb 7 09:03:41 EST 2005


Forgive me if this isn't particularly Cisco-specific, but speaking
as a vendor who occasionally tests fixes with live customers...

:On Mon, Feb 07, 2005 at 09:26:07AM +0300, Osama I Dosary wrote:
:> We have about 4 NPE-G1s (both 7200 and 7301) that keep hanging. After a
:> few weeks of dialogue the TAC agent said they identified the bug,
:> another customer has it, and that they've made a debug image with a fix.
:> Now the TAC agent is asking us to try it. Unfortunately the only way to
:> try it is on production, since it needs some load. But we've already

Usually, it's not quite that simple.  You usually need a particular
load or usage pattern that you might not have an easy time defining
from the nature of the fix -- especially true if you look at the code
corresponding to where it's hanging, say "ooh, this is gross", and do
some sort of rewrite as opposed to a highly targeted fix.  It's not
as if vendors aren't beating up systems in QA, but they can't beat up
systems in every way imaginable, hoping to find just that right set of
circumstances that triggers the bug.  Even if I could dream up an
endlessly abusive set of QA routines, the net result is that folks
would wait endlessly for product releases.  I doubt most customers'
networks could be instrumented sufficiently to capture what's going
on without inducing a lot of other problems.

And that's assuming you even have an honest-to-goodness "fix", vs. 
"debug code to grovel more information about the condition", which
can get even hairier...  the Heisenberg uncertainty principle and
I are on alcoholic terms with each other...

:> suffered enough service disruption because of this.

This is a past tense sort of statement, but it sounds like you are
continuing to suffer.  So, really, you have to balance between "known
pain" and "maybe something better, maybe something worse".  I can
generally sell folks on applying debug fixes if the current pain is
bad enough.  What gets me are the issues where three customers out of
a boatload have seen one similar hang each, over many years.  Odds
are, whether you give them debug fixes or not, they'll never see it
again.  You want to be sure that you don't have that obscure bug
happen again for any customer, but if the fix isn't blindingly obvious
and you have some understanding of why the bug is infrequent, you may
be hesitant to just arbitrarily apply some fix, debugging, or whatnot
for fear that you make something worse for someone else.

To summarize, not every bug will correspond to some discrete test case
that exists in the lab independent of the customer, much as customers
might want that to be the case.  That's just reality.

One other thing to note: Cisco (or any vendor) _could_ just arbitrarily
call something "production", just long enough for you to get it in your
hands and see if it works, then release another "production" release if
the fix didn't work.  Trying to decide the destiny of your production
site solely based on semi-arbitrary labels a vendor applies to a product
can put you in a world of hurt.  And Cisco has enough alphabet soup in
IOS to feed small countries...

:> When I told him such, and that we will wait until the fix is out on
:> mainline trains, he said that in order for the DE's (development
:> engineers) to apply the fix to mainline IOS we must verify that fix works.
:> 
:> I thought this demand strange, especially when the bug seems
:> reproducible, but this is my first encounter with TAC.
:> So I wanted to ask: Is this a reasonable demand? Is this how TAC usually
:> works, or just for middle east customers?
:
:	This is generally how the TAC works, they need a way to
:validate if a bugfix works..  If you're willing to run the image,
:please do so as it will allow the bugfix to be available sooner.

Depending on the specific nature of the fix, it is possible that you
might be given (or you might want to ask for) bits that "act the way
they always have been acting" with one tunable, then "act in the new
way that we think the bug fix will be" with another.  This can be a
non-trivial task in some instances, but sometimes it might just be the
case that the engineering guys just don't think along those lines when
giving fixes to the TAC and customers to try out. 
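
As a rough illustration of what I mean by a tunable, here is a minimal
sketch in Python.  Everything in it is hypothetical: the function names,
the use_new_behavior flag, and the queue limit are made up for the
example and have nothing to do with actual IOS code or the actual fix.

```python
# Hypothetical sketch: one image, two behaviors, selected by a tunable.
# None of these names come from IOS; they exist only for illustration.

def handle_packet_legacy(queue, packet):
    """Old behavior: always enqueue, exactly as before the fix."""
    queue.append(packet)
    return True

def handle_packet_fixed(queue, packet, limit=4):
    """Candidate fix: refuse to enqueue once the queue holds `limit` items."""
    if len(queue) >= limit:
        return False
    queue.append(packet)
    return True

def handle_packet(queue, packet, use_new_behavior=False):
    """Dispatch on the tunable; the default keeps the old code path."""
    if use_new_behavior:
        return handle_packet_fixed(queue, packet)
    return handle_packet_legacy(queue, packet)
```

The point of shipping it this way is that the customer can leave the
tunable off, confirm nothing has changed, then turn it on under real
load and turn it back off quickly if things get worse.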

-- 
 Mail: mjo at dojo.mi.org  WWW: http://dojo.mi.org/~mjo/  Phone: +1 248 427 4481
 =--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--==--=
"What happens if you get scared half to death twice?"          -Steven Wright
