[j-nsp] bfd = busted failure detection :)

Richard A Steenbergen ras at e-gerbil.net
Tue Dec 15 03:16:23 EST 2009


On Tue, Dec 15, 2009 at 02:59:04PM +0800, Mark Tinka wrote:
> On Monday 14 December 2009 05:23:45 pm Richard A Steenbergen 
> wrote:
> 
> > Oh what good timing, just had to reboot a router tonight
> >  to recover from a different Juniper bug (enabling
> >  graceful-switchover on a 9.5R3 box caused blackholing of
> >  traffic, disabling it didn't fix it, had to reboot the
> >  box to clear the issue which of course blew away all the
> >  state, so there will be no finding the root cause).
> 
> Sorry to side-track the thread a bit, but curious whether 
> you had NSR or Graceful Restart enabled when the bug hit.
> 
> We chose to wait for 9.5R4, but would like to be sure this 
> won't be an issue when we upgrade. However, we're a Graceful 
> Switchover + Graceful Restart house. No NSR.

Neither; both were off when graceful switchover was turned on (which is
what caused the breakage). The best way to do a non-ISSU upgrade is to
pre-stage the new code/config on the backup RE, then do a non-graceful
switchover. It still reboots the PFE, but it avoids the jinstall upgrade
time, or the RE bootup time if you were doing it from a cold boot.
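
A rough sketch of that sequence as Junos CLI, not taken from the
original message: it assumes a dual-RE chassis, graceful-switchover and
NSR already disabled, and a placeholder package path.

```
# On the master RE: hop over to the backup RE
user@router> request routing-engine login other-routing-engine

# On the backup RE: install the new code and reboot only this RE
# (the package path is a placeholder)
user@router> request system software add /var/tmp/<jinstall-package>.tgz
user@router> request system reboot

# Back on the master RE, once the backup is up on the new code:
# switch mastership. With graceful-switchover disabled this is a
# non-graceful switch, so the PFEs restart.
user@router> request chassis routing-engine master switch
```

The old master can then be upgraded the same way while it sits as
backup.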

I've actually seen similar bugs in the past, where graceful-switchover
caused blackholing of live traffic. It would stop blackholing when we
rebooted the backup RE, then started blackholing again when the backup
RE came back online. JTAC was unable to reproduce it, and we were
unwilling to turn graceful-switchover back on (and start blackholing
traffic again :P), so eventually they just closed the case with no
resolution of the actual problem. In this particular incident, we turned
graceful-switchover on for four recently upgraded 9.5R3 boxes, and only
one of them started blackholing traffic. But unlike the previous issues,
it didn't recover when g-s was turned off this time. If it makes you
feel any better I highly doubt 9.5R4 will fix the problem (though at
least it seems to be pretty rare). :)

As for GR vs NSR, we're actually in the process of turning GR off in
favor of NSR. So far, in very limited tests mind you, ISSU has actually
worked for us without anything exploding or catching on fire (surprising
I know :P). Also, you can turn NSR on and off without any impact to the
router, but GR causes protocol bounces when you turn it on or off.
Thus we decided it is better to ditch GR now, with the hope of working
NSR and widespread ISSU success in the future. :)
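
For reference, a minimal sketch of the GR-to-NSR change in Junos
set-style configuration. This is an illustration, not the actual config
from these boxes. NSR depends on graceful-switchover and synchronized
commits, and Junos will not commit with both graceful-restart and
nonstop-routing configured.

```
delete routing-options graceful-restart
set chassis redundancy graceful-switchover
set system commit synchronize
set routing-options nonstop-routing
```

After the commit, "show task replication" on the master RE should
report the per-protocol sync state.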

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)

