[j-nsp] bfd = busted failure detection :)

Kevin Day toasty at dragondata.com
Wed Dec 16 00:03:08 EST 2009


On Dec 9, 2009, at 5:21 PM, Richard A Steenbergen wrote:
>
> The behavior we've always seen (from mid 7.x's until today) is that
> something seems to "block" the KRT queue while the pending changes keep
> piling up, then eventually whatever is causing the blockage clears and
> all the routes quickly install immediately thereafter. I saw the exact
> same behavior last night during an upgrade to 9.5R3, 263k routes stuck
> in Pending state for a hair over 10 minutes, then they all synced in
> just a few seconds. But I think it's actually getting worse, because in
> older versions the routes that were stuck in pending state didn't seem
> to be advertised to peers. This time it seemed to advertise the routes
> even though it didn't have them installed in hardware, resulting in
> blackholing of traffic.

I went back and forth on this forever (pestering you while doing it),
because it was affecting us badly on old M20s. My "lab" boxes would
never ever show the problem, but it would happen on the production
routers. I finally gave up and decided to figure out what the
difference was between my production configuration and the lab
simulation by slowly changing my production config to match the nearly
identical lab config.

The problem went away when I removed a BGP session with a peer that
was extremely slow to accept routes and with which we were exchanging
full tables. I think it was some kind of deadlock where the
peer wasn't accepting routes because it was blocked trying to send me  
stuff, and I was in the same boat. Snooping at the TCP layer, I didn't  
see anything unusual except both peers ended up in a state where they  
were advertising 0 window size to each other. The moment the KRT queue  
cleared up, they finished exchanging routes and all was happy.
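
To make the suspected deadlock concrete, here is a toy Python model of
two peers that each refuse to read while they still have data to write.
It is only a sketch under stated assumptions: it is not rpd or Junos
code, the names (Peer, simulate, reads_while_sending) are made up for
illustration, and buffer sizes are counted in "routes" rather than
bytes. It only shows how both sides can end up advertising a zero
window at the same time.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    # Hypothetical model of one BGP speaker; all fields are illustrative.
    name: str
    to_send: int                 # routes still waiting to be written
    recv_buf_size: int           # receive buffer capacity, in routes
    recv_buf_used: int = 0       # routes received but not yet read
    reads_while_sending: bool = False

    def window(self) -> int:
        """Advertised receive window: free space left in the buffer."""
        return self.recv_buf_size - self.recv_buf_used


def send(sender: Peer, receiver: Peer) -> int:
    """Write as many routes as the receiver's window currently allows."""
    n = min(sender.to_send, receiver.window())
    sender.to_send -= n
    receiver.recv_buf_used += n
    return n


def drain(peer: Peer) -> int:
    """Read the receive buffer, unless the peer is stuck in a blocking write."""
    if peer.to_send == 0 or peer.reads_while_sending:
        n = peer.recv_buf_used
        peer.recv_buf_used = 0
        return n
    return 0


def simulate(a: Peer, b: Peer, max_rounds: int = 1000) -> str:
    """Run rounds until everything is exchanged or nobody can make progress."""
    for _ in range(max_rounds):
        progress = send(a, b) + send(b, a) + drain(a) + drain(b)
        if (a.to_send, b.to_send, a.recv_buf_used, b.recv_buf_used) == (0, 0, 0, 0):
            return "done"
        if progress == 0:
            return "deadlock"
    return "timeout"


if __name__ == "__main__":
    a = Peer("local", to_send=1000, recv_buf_size=64)
    b = Peer("slow_peer", to_send=1000, recv_buf_size=64)
    print(simulate(a, b), a.window(), b.window())
```

In this model, two peers that each have more to send than the other
side can buffer stall after the first round with both windows at 0,
which matches what I saw at the TCP layer. If either side keeps reading
while it writes (reads_while_sending=True), or if one side has only a
little to send, the exchange completes.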

I can't say for certain that was the problem, but shutting down that  
peer was a pretty reliable way to clear the KRT queue problem whenever  
it happened.

-- Kevin


