[j-nsp] bfd = busted failure detection :)

Richard A Steenbergen ras at e-gerbil.net
Sun Dec 13 04:11:29 EST 2009


On Fri, Dec 11, 2009 at 02:50:51PM -0500, Ross Vandegrift wrote:
> On Wed, Dec 09, 2009 at 05:21:21PM -0600, Richard A Steenbergen wrote:
> > I've personally never had any luck reproducing it in the lab, so I
> > understand Juniper's frustration. It seems to require a complexity of
> > routes, ports, and/or protocols which we simply don't have the time or
> > money to reproduce in the lab, but I can reproduce it in the field (with
> > undesired customer impact of course) nearly every time I reboot a
> > router. Maybe we just need to help provide them with an example
> > configuration that they can try to reproduce themselves.
> 
> Hmmm, I may have just reproduced something like this in the lab.  I
> had two static routes for 10.57.55.0/24 and realized I hadn't applied
> a per-packet load balancing policy to the forwarding table export.  So
> I wrote a policy and applied it to the forwarding-table export.  This
> has been stuck in the KRT queue for over four hours now.
> 
> This is different than the indirect next-hop change, but I wonder if
> it's related.  Note that interesting error "EPERM -- Jtree walk in
> progress".

That one is pretty different from the usual slowness issue that seems to
be affecting most people. I just cleared bgp sessions on a router to
demonstrate the issue, portions of which you can see any time you make a
major routing change. Unfortunately (for my demonstration) this router
was pretty small and didn't exhibit any stalls in processing fib
updates. The performance was pretty acceptable, fully syncing in under a
minute. I'm sure the simultaneous loss of IGP routes and having more
complex routing protocol configurations have something to do with it too.

At any rate, what you'll see while the paths are trying to change is 
something like this:

2.0.0.0/16         +[BGP/170] 00:00:31, MED 0, localpref 100, from x.x.x.x
                      AS path: ### ### I
                    > to x.x.x.x via xe-2/1/0.54
                      to x.x.x.x via xe-2/2/0.54
                   -[BGP/170] 00:03:10, MED 0, localpref 100, from x.x.x.x
                      AS path: ### ### I
                    > to x.x.x.x via xe-2/1/0.54

And in a show bgp summary, you'll see the routes stuck in pending state 
(again this one processed the routes quite fast, this is just for demo 
purposes):

Groups: 3 Peers: 7 Down peers: 7
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0                 0          0          0          0          0      13831
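If you want to watch that Pending column from a script rather than by
eye, here is a minimal sketch. It assumes the summary text has already
been captured somehow and that the table rows keep the seven-column
layout shown above; the function name pending_by_table is made up for
illustration and is not part of any Juniper tooling.

```python
def pending_by_table(summary_text):
    """Return {table_name: pending_count} from 'show bgp summary' text.

    Assumes the seven-column table layout shown in this post:
    Table, Tot Paths, Act Paths, Suppressed, History, Damp State, Pending.
    """
    pending = {}
    in_table = False
    for line in summary_text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "Table":
            # Header row: the table rows follow.
            in_table = True
            continue
        # A table row is a name followed by six integer counters.
        if in_table and len(fields) == 7 and all(f.isdigit() for f in fields[1:]):
            pending[fields[0]] = int(fields[6])
    return pending
```

A nonzero value that stays put across repeated captures is the thing to
look for; a value that is merely large but dropping is just a router
doing its job.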

I've sent Juniper the krt queue output, captured while the stall was
happening, many dozens of times in the past, but it basically just looks
like a router syncing a bunch of new routes, like this:

Indirect next hop add/change: 0 queued
MPLS add queue: 0 queued
Indirect next hop delete: 0 queued
High-priority deletion queue: 0 queued
High-priority change queue: 0 queued
High-priority add queue: 0 queued
Normal-priority indirect next hop queue: 0 queued
Normal-priority deletion queue: 0 queued
Normal-priority composite next hop deletion queue: 0 queued
Normal-priority change queue: 0 queued
Normal-priority add queue: 53504 queued
                ADD gf 1 inst id 0 62.56.0.0/17 type 3
                ADD gf 1 inst id 0 62.49.0.0/16 type 3
                ADD gf 1 inst id 0 84.19.96.0/19 type 3
                ADD gf 1 inst id 0 85.91.40.0/22 type 3
                ADD gf 1 inst id 0 85.91.48.0/20 type 3
                ADD gf 1 inst id 0 83.104.0.0/14 type 3
                ADD gf 1 inst id 0 80.176.0.0/15 type 3
                ADD gf 1 inst id 0 80.255.192.0/19 type 3
                ADD gf 1 inst id 0 146.179.0.0/18 type 3
                ADD gf 1 inst id 0 34.252.140.0/22 type 3
                ADD gf 1 inst id 0 147.108.200.0/21 type 3
                ADD gf 1 inst id 0 146.23.177.0/24 type 3
                ADD gf 1 inst id 0 146.23.178.0/24 type 3
                ADD gf 1 inst id 0 134.144.108.0/24 type 3
                ADD gf 1 inst id 0 138.32.240.0/22 type 3
                ADD gf 1 inst id 0 32.97.185.0/24 type 3
                ADD gf 1 inst id 0 52.124.0.0/21 type 3
                etc etc

The problem is that something seems to cause this process to stall for
several minutes at a time, leaving the router in limbo without routes
installed in the fib, and in some cases blackholing packets as a result.
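Since a single krt queue snapshot looks normal, the stall only shows up
when you compare several snapshots taken over time. A rough sketch of
that comparison follows. It assumes you have already collected the
queue output as text at regular intervals; queue_depths and
looks_stalled are hypothetical helper names, and "total depth never
goes down" is a crude heuristic of my own, not anything Juniper defines
(a queue can also grow legitimately while new routes are still
arriving).

```python
import re

# Matches summary lines such as "Normal-priority add queue: 53504 queued".
# The indented per-route "ADD gf 1 ..." lines do not match.
QUEUE_LINE = re.compile(r"^(?P<name>\S.*?): (?P<count>\d+) queued\s*$")


def queue_depths(krt_text):
    """Return {queue_name: queued_count} from one krt queue capture."""
    depths = {}
    for line in krt_text.splitlines():
        m = QUEUE_LINE.match(line)
        if m:
            depths[m.group("name")] = int(m.group("count"))
    return depths


def looks_stalled(samples):
    """Heuristic: True if the total queued count is nonzero at the end
    and never decreased between consecutive captures (oldest first)."""
    totals = [sum(queue_depths(s).values()) for s in samples]
    if len(totals) < 2 or totals[-1] == 0:
        return False
    return all(later >= earlier for earlier, later in zip(totals, totals[1:]))
```

Feeding it captures taken, say, every 30 seconds would flag the
multi-minute freezes described above while ignoring a queue that is
large but steadily draining.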

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)

