[j-nsp] bfd = busted failure detection :)

Richard A Steenbergen ras at e-gerbil.net
Mon Dec 14 04:23:45 EST 2009


On Sun, Dec 13, 2009 at 03:11:29AM -0600, Richard A Steenbergen wrote:
> That one is pretty different from the usual slowness issue that seems to
> be affecting most people. I just cleared bgp sessions on a router to
> demonstrate the issue, which you can see portions of any time you make a
> major routing change. Unfortunately (for my demonstration) this router
> was pretty small and didn't exhibit any stalls in processing fib
> updates. The performance was pretty acceptable, fully syncing in under a
> minute. I'm sure the simultaneous loss of IGP routes and having more
> complex routing protocol configurations have something to do with it too.

Oh what good timing: I just had to reboot a router tonight to recover from
a different Juniper bug. Enabling graceful-switchover on a 9.5R3 box
caused blackholing of traffic, disabling it didn't fix it, and I had to
reboot the box to clear the issue, which of course blew away all the
state, so there will be no finding the root cause. But it did provide a
perfect example of the FIB blocking issue, with the vast majority of the
routing table blocking for over 13 minutes before finally installing
within a few seconds.

Here we are at just past the 13 minute mark, BGP fully synchronized, but 
the vast majority of the routing table not actually installed to FIB:

Groups: 65 Peers: 92 Down peers: 15
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0           2793497     333891          0          0          0     292429
inetflow.0            27         27          0          0          0          4
inet6.0             9438       2075          0          0          0        811
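
For anyone who wants to watch this over time rather than eyeball it, here
is a minimal Python sketch that pulls the per-table counters out of text
shaped like the summary above. The function names (parse_summary,
pending_fraction) are made up for illustration, the column layout is
assumed to match the sample exactly, and comparing Pending to Act Paths is
just a rough indicator chosen for this sketch, not anything Junos defines.

```python
import re

# Sample copied from the summary output quoted above.
SAMPLE = """\
Groups: 65 Peers: 92 Down peers: 15
Table          Tot Paths  Act Paths Suppressed    History Damp State    Pending
inet.0           2793497     333891          0          0          0     292429
inetflow.0            27         27          0          0          0          4
inet6.0             9438       2075          0          0          0        811
"""

# A table row is a table name followed by exactly six integer columns
# (assumption: same column order as the header in the sample).
ROW = re.compile(r"^(\S+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_summary(text):
    """Return {table: {"total", "active", "pending"}} for each table row."""
    tables = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # skips the "Groups:" line and the column header
        tot, act, _supp, _hist, _damp, pending = map(int, m.groups()[1:])
        tables[m.group(1)] = {"total": tot, "active": act, "pending": pending}
    return tables


def pending_fraction(entry):
    """Pending count relative to active paths (illustrative metric only)."""
    return entry["pending"] / entry["active"] if entry["active"] else 0.0


if __name__ == "__main__":
    for name, entry in parse_summary(SAMPLE).items():
        print(name, entry["pending"], f"{pending_fraction(entry):.1%}")
```

Run against periodic captures, this makes it easy to see whether the
pending count is draining steadily or sitting flat and then dropping all
at once.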

Here is the show krt queue from the same time, showing almost nothing in
the queue. A follow-up command a second later showed completely different
items in the queue, which suggests that the krt queue itself was not
stuck.

Routing table add queue: 0 queued
Interface add/delete/change queue: 0 queued
Indirect next hop add/change: 0 queued
MPLS add queue: 0 queued
Indirect next hop delete: 2 queued
             DELETE index 1048789
             DELETE index 1048790
High-priority deletion queue: 0 queued
High-priority change queue: 0 queued
High-priority add queue: 0 queued
Normal-priority indirect next hop queue: 0 queued
Normal-priority deletion queue: 0 queued
Normal-priority composite next hop deletion queue: 0 queued
Normal-priority change queue: 0 queued
Normal-priority add queue: 7 queued
                ADD gf 1 inst id 0 173.164.0.0/19 type 3
         (20)
                ADD gf 1 inst id 0 173.162.16.0/20 type 3
         (20)
                ADD gf 1 inst id 0 173.160.64.0/19 type 3
         (20)
                ADD gf 1 inst id 0 217.168.224.0/20 type 3
         (20)
                ADD gf 1 inst id 0 209.211.136.0/24 nexthop 
         x.x.x.x, xe-7/1/0.0
         (19)
                ADD gf 1 inst id 0 208.45.191.0/24 nexthop 
         x.x.x.x, xe-7/1/0.0
         (19)
                ADD gf 1 inst id 0 208.45.190.0/24 nexthop 
         x.x.x.x, xe-7/1/0.0
         (19)
Routing table delete queue: 0 queued
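
The "run it twice and see if the entries changed" check can be sketched
in Python along these lines. This is only an illustration: the names
parse_krt_queue and looks_stuck are hypothetical, the parser recognizes
only the ADD and DELETE entries visible in the output above (other
operation keywords are not handled), and wrapped continuation lines such
as "(20)" are simply ignored.

```python
import re

# Abbreviated sample copied from the queue output quoted above.
SAMPLE = """\
Routing table add queue: 0 queued
Indirect next hop delete: 2 queued
             DELETE index 1048789
             DELETE index 1048790
Normal-priority add queue: 7 queued
                ADD gf 1 inst id 0 173.164.0.0/19 type 3
         (20)
                ADD gf 1 inst id 0 173.162.16.0/20 type 3
         (20)
Routing table delete queue: 0 queued
"""

# "<queue name>: <n> queued" starts a new queue section.
QUEUE = re.compile(r"^(\S.*?):\s+(\d+) queued\s*$")
# Indented lines starting with ADD or DELETE are queue entries
# (assumption: only the keywords seen in the sample are matched).
ENTRY = re.compile(r"^\s+(ADD|DELETE)\b")


def parse_krt_queue(text):
    """Return {queue name: {"count": int, "entries": [str, ...]}}."""
    queues = {}
    current = None
    for line in text.splitlines():
        m = QUEUE.match(line)
        if m:
            current = m.group(1)
            queues[current] = {"count": int(m.group(2)), "entries": []}
        elif current is not None and ENTRY.match(line):
            queues[current]["entries"].append(line.strip())
    return queues


def looks_stuck(first, second):
    """True if any non-empty queue shows identical entries in both captures."""
    for name, queue in first.items():
        if queue["entries"] and second.get(name, {}).get("entries") == queue["entries"]:
            return True
    return False


if __name__ == "__main__":
    for name, queue in parse_krt_queue(SAMPLE).items():
        print(name, queue["count"], len(queue["entries"]))
```

Feeding two captures taken a second or so apart through parse_krt_queue
and then looks_stuck gives a crude yes/no on whether the same entries are
sitting at the head of a queue. Note the reported count can exceed the
number of parsed entries when the pasted output is truncated, as in the
abbreviated sample here.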

Here is an example of a route which has been stuck trying to install for
over 8 minutes (first entry in a show route, the rest all look roughly
the same though):

2.0.0.0/16         +[BGP/170] 00:08:40, MED 0, localpref 200, from xx.xx.xxx.xxx
                      AS path: 5413 12654 I
                    > to xx.xx.xxx.xx via xe-3/2/0.0, label-switched-path XXXXX
                      to xx.xx.xxx.xx via xe-3/2/0.0, label-switched-path XXXXX
                      to xx.xx.xxx.xx via xe-3/2/0.0, label-switched-path XXXXX
                      to xx.xx.xxx.xx via xe-3/2/0.0, label-switched-path XXXXX
                      to xx.xx.xxx.xx via ae0.50, label-switched-path Bypass->xx.xx.xxx.xx->xx.xx.xxx.xx
                      to xx.xx.xxx.xx via ae0.50, label-switched-path Bypass->xx.xx.xxx.xx->xx.xx.xxx.xx
                      to xx.xx.xxx.xx via ae0.50, label-switched-path Bypass->xx.xx.xxx.xx->xx.xx.xxx.xx
                      to xx.xx.xxx.xx via ae0.50, label-switched-path Bypass->xx.xx.xxx.xx->xx.xx.xxx.xx

The above is pretty representative of the issue, which has been going on 
in one form or another since around the mid 7.x's (confirmed by dozens 
of people I've talked to who saw the same behavior beginning at around 
the same time).

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)

