[j-nsp] bfd = busted failure detection :)

Wed Dec 16 15:28:25 EST 2009

On Tue, Dec 15, 2009 at 11:03:08PM -0600, Kevin Day wrote:
> 
> I went back and forth on this forever (pestering you while doing it),  
> because it was affecting us badly on old M20s. My "lab" boxes would  
> never ever show the problem, but it would happen on the production  
> routers. I finally gave up and decided to figure out what the  
> difference was between my production configuration and the lab  
> simulation by slowly changing my production config to match the nearly  
> identical lab config.
> 
> The problem went away when I removed a BGP session with a peer that  
> was extremely slow to accept routes, and we were exchanging full  
> tables with each other. I think it was some kind of deadlock where the  
> peer wasn't accepting routes because it was blocked trying to send me  
> stuff, and I was in the same boat. Snooping at the TCP layer, I didn't  
> see anything unusual except both peers ended up in a state where they  
> were advertising 0 window size to each other. The moment the KRT queue  
> cleared up, they finished exchanging routes and all was happy.
> 
> I can't say for certain that was the problem, but shutting down that  
> peer was a pretty reliable way to clear the KRT queue problem whenever  
> it happened.

What code was this? In theory, shouldn't the routes be sitting in a BGP 
queue regardless of what's happening at the TCP layer? We should see if 
we can reproduce this with modern hardware and code.
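
For illustration, the mutual zero-window state described in the quoted
message can be sketched with a toy model. This is only a sketch under
assumptions, not a description of how rpd or any real BGP implementation
behaves: the Peer class, the buffer sizes, and the reads_while_sending
flag are all hypothetical. It assumes each side refuses to read from its
socket until its own pending output has been written, which is one way
two peers could end up advertising a 0 window to each other.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    name: str
    pending: int                       # updates still to send (arbitrary units)
    rcv_capacity: int                  # size of the receive buffer
    rcv_used: int = 0                  # unread data sitting in the receive buffer
    reads_while_sending: bool = False  # hypothetical policy switch

    def window(self) -> int:
        # Advertised window = free space left in the receive buffer.
        return self.rcv_capacity - self.rcv_used


def step(a: Peer, b: Peer) -> bool:
    """Run one round; return True if either peer made progress."""
    progress = False
    # Each peer sends as much as the other side's window allows.
    for src, dst in ((a, b), (b, a)):
        n = min(src.pending, dst.window())
        if n:
            src.pending -= n
            dst.rcv_used += n
            progress = True
    # A peer blocked in send only reads once its own output is done,
    # unless it is allowed to read while sending.
    for p in (a, b):
        if p.rcv_used and (p.reads_while_sending or p.pending == 0):
            p.rcv_used = 0
            progress = True
    return progress


def run(reads_while_sending: bool, routes: int = 100, buf: int = 16) -> str:
    a = Peer("A", routes, buf, reads_while_sending=reads_while_sending)
    b = Peer("B", routes, buf, reads_while_sending=reads_while_sending)
    while step(a, b):
        pass
    if a.pending or b.pending:
        return f"deadlock: windows A={a.window()} B={b.window()}"
    return "done"


if __name__ == "__main__":
    print(run(reads_while_sending=False))
    print(run(reads_while_sending=True))
```

In this model, two peers that each have more to send than the other can
buffer stall as soon as both receive buffers fill, while peers that keep
draining their input finish the exchange.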

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)