[j-nsp] bfd = busted failure detection :)

Richard A Steenbergen ras at e-gerbil.net
Fri Dec 4 15:40:14 EST 2009


On Sat, Nov 21, 2009 at 05:16:57PM -0600, Richard A Steenbergen wrote:
> On Sat, Nov 21, 2009 at 12:53:58PM -0800, Nilesh Khambal wrote:
> > Hi Richard,
> > 
> > Just talking from this router perspective, it looks like the remote
> > end router has problem receiving BFD packets from this router. It
> > signaled the BFD session down because of that.
> 
> There are actually two particular interfaces between this pair of
> routers (both MX960s running 9.4R3, both circuits are long-haul ~70ms
> latency) that are flapping because of BFD. The interesting part is that
> they both land on different DPCs (on both ends), there are other 
> circuits between these same devices which are not having BFD issues, and 
> I ran regular RE based pings between the devices (with src/dst set 
> correctly to force traffic over the links in question) and didn't record 
> any loss when BFD thought that it was detecting a failure.

FYI I found the root problem and hereby take back any comments impugning
BFD's reputation. It turns out there actually WAS some kind of PFE bug
which was causing intermittent blackholing of traffic for a few seconds
at a time at seemingly random intervals several times a day. Ping from
the affected devices didn't catch the issue because of the RE->PFE
forwarding path; only traffic routed entirely via the PFE was being
affected. BFD was actually doing its job and detected the failures that
were too short to be noticed by normal routing protocols. I discovered
the issue on several MX960s (mostly running 9.2R4, but one pair was
running 9.4R3), and upgrading them to 9.5R3 seems to solve it (or
perhaps it was just the PFE restart that did it; that remains to be
seen).
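
As a rough illustration of why a blackhole lasting a few seconds trips
BFD but goes unnoticed by the routing protocols, here is a minimal
sketch of the timer arithmetic. All timer values below are assumptions
picked for illustration (a 300 ms BFD interval with a multiplier of 3,
a 90 s BGP hold time, a 40 s OSPF dead interval); none of them are
taken from the routers discussed in this thread, and the comparison
ignores where in the timer cycle the outage starts.

```python
# Illustrative values only; assumptions, not the configuration from this thread.
BFD_INTERVAL_MS = 300           # assumed negotiated BFD interval
BFD_MULTIPLIER = 3              # assumed BFD detect multiplier
BGP_HOLD_TIME_MS = 90_000       # assumed BGP hold time
OSPF_DEAD_INTERVAL_MS = 40_000  # assumed OSPF dead interval


def bfd_detection_time_ms(interval_ms: int, multiplier: int) -> int:
    """BFD declares the session down after `multiplier` missed intervals."""
    return interval_ms * multiplier


def would_detect(outage_ms: int, detection_time_ms: int) -> bool:
    """Simplified model: a timer fires only if the outage outlasts it."""
    return outage_ms >= detection_time_ms


if __name__ == "__main__":
    outage_ms = 3_000  # hypothetical blackhole of "a few seconds"
    timers = {
        "BFD": bfd_detection_time_ms(BFD_INTERVAL_MS, BFD_MULTIPLIER),
        "OSPF dead interval": OSPF_DEAD_INTERVAL_MS,
        "BGP hold time": BGP_HOLD_TIME_MS,
    }
    for name, timer_ms in timers.items():
        verdict = "detects" if would_detect(outage_ms, timer_ms) else "misses"
        print(f"{name} ({timer_ms} ms) {verdict} a {outage_ms} ms outage")
```

Under these assumed timers only BFD reacts to the outage, which is
consistent with the flaps showing up as BFD events while everything
else stayed up.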

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)

