[j-nsp] bfd = busted failure detection :)

Richard A Steenbergen ras at e-gerbil.net
Wed Dec 9 18:21:21 EST 2009


On Wed, Dec 09, 2009 at 09:13:28AM -0700, David Ball wrote:
> Do your KRT queues eventually flush though?  Is it just a slow
> control->fwding thing when large route updates occur?  I've done 2
> upgrades in as many years to resolve a KRT related bug, but that
> resulted in the queue NEVER emptying.  It's apparently related to a
> residual variable being set after an RPD restart (caused by another
> bug) resulting in a kernel/rpd inconsistency.  I'm told mine is
> resolved in 9.5R3 (PR291407), but I got nervous when I read Richard's
> earlier post.

The behavior we've always seen (from mid-7.x until today) is that
something seems to "block" the KRT queue while the pending changes keep
piling up; then whatever is causing the blockage eventually clears and
all the routes install within seconds. I saw the exact same behavior
last night during an upgrade to 9.5R3: 263k routes stuck in Pending
state for a hair over 10 minutes, then they all synced in just a few
seconds. But I think it's actually getting worse, because in older
versions the routes that were stuck in pending state didn't seem to be
advertised to peers. This time it seemed to advertise the routes even
though it didn't have them installed in hardware, resulting in
blackholing of traffic.
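
For anyone who wants to catch this from a monitoring script, here is a
rough Python sketch of the "queue stays backed up, then suddenly drains"
pattern described above. It assumes you already collect the text of
"show krt queue" at intervals by some means of your own, and that the
counters appear as lines of the form "<queue name>: <N> queued"; that
line format, the function names, and the threshold are assumptions for
illustration, not anything confirmed by Juniper.

```python
import re

# Assumed line format: "<queue name>: <N> queued". Other lines are ignored.
QUEUED_RE = re.compile(r"^\s*(?P<name>.+?):\s*(?P<count>\d+)\s+queued\s*$")


def parse_krt_queue(text):
    """Return {queue name: queued count} for every line matching the assumed format."""
    counts = {}
    for line in text.splitlines():
        match = QUEUED_RE.match(line)
        if match:
            counts[match.group("name")] = int(match.group("count"))
    return counts


def total_queued(text):
    """Sum the queued counts across all queues found in one sample of output."""
    return sum(parse_krt_queue(text).values())


def looks_stuck(samples, min_duration):
    """Decide whether the queue has been backed up without draining.

    samples is a time-ordered list of (timestamp_seconds, total_queued).
    Returns True when the latest total is nonzero and the totals have been
    nonzero and non-decreasing for at least min_duration seconds.
    """
    if not samples:
        return False
    end_ts, last = samples[-1]
    if last == 0:
        return False
    start_ts = end_ts
    newer = last
    # Walk backwards while the queue was nonzero and not draining.
    for ts, count in reversed(samples[:-1]):
        if count == 0 or count > newer:
            break
        start_ts = ts
        newer = count
    return end_ts - start_ts >= min_duration
```

With samples taken every few minutes, something like
looks_stuck(samples, 300) would have flagged the 10-minute stall above
about halfway through. Pick the threshold to suit your own convergence
times.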

> 2009/12/9 Mark Tinka <mtinka at globaltransit.net>:
> >
> > I'd be willing to help if we can offline this to a
> > reproduction in my lab.
> >
> > I have a case that will have been open for 1 year, if
> > February 2010 comes and we still haven't fixed it. So I know
> > what it's like :-).

I've personally never had any luck reproducing it in the lab, so I
understand Juniper's frustration. It seems to require a complexity of
routes, ports, and/or protocols which we simply don't have the time or
money to reproduce in the lab, but I can reproduce it in the field (with
undesired customer impact of course) nearly every time I reboot a
router. Maybe we just need to provide them with an example
configuration that they can try to reproduce themselves.

As for the cases which have been open for over 1 year... I had quite a
few a year ago, when JTAC was really dropping the ball in absolutely
abysmal ways. IMHO it has gotten significantly better over the last
year, both in JTAC and code quality, but they're still pretty hit or
miss in the initial stages. The sad part was that none of my issues were
complex problems that actually took a year to resolve; they were all
relatively simple problems that took JTAC a year and several escalations
just to find someone competent who could actually read, understand, and
follow my instructions to reproduce the problem.

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)

