[c-nsp] ASR9k: RIB/FIB convergence
adamv0025 at netconsultings.com
Thu Aug 2 06:23:04 EDT 2018
> Thomas Schmid
> Sent: Thursday, August 02, 2018 10:14 AM
>
> Hi all,
>
> sort of a heads up ...
>
> I'd be interested to hear if, and under which circumstances others are seeing
> this behavior, since the root cause is still unknown.
>
> In the beginning there were some anecdotal complaints by customers that
> they experienced persistent reachability problems to some destinations
> when we did a scheduled maintenance in our network somewhere else.
> Further investigations pointed to routing inconsistencies during large RIB
> changes.
>
> To give you some numbers: we found out that in our environment
> processing 70k BGP changes takes 2-3 min to write the updates to FIB, 700k
> routes takes 20-30 min!!
>
> During that period, RIB and FIB are not consistent with all the nasty
> consequences:
> blackholing, routing loops etc.
>
> Convergence time seems to be somehow related to the number of eBGP
> sessions on the box. On routers with less than 200 sessions, convergence
> time looks ok, from 300+ sessions on, things get bad.
>
> This affects both XR 5.3.3, 6.2.3 and Typhoon, Tomahawk linecards.
>
> TAC/BU are currently working on this, but they have a hard time to find out
> what's going wrong here. Processing the updates on the RP takes less than
> 1s, but writing the updates to the LC takes forever ...
>
First things first.
To mitigate the damage from RIB-FIB inconsistencies you could use the "BGP-RIB Feedback Mechanism for Update Generation":
"To configure BGP to wait for feedback from RIB indicating that the routes that BGP installed in RIB are installed in FIB, before BGP sends out updates to neighbors, use the "update wait-install" command in router address-family IPv4 or router address-family VPNv4 configuration mode."
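A minimal sketch of how that could look on IOS XR. The AS number 64500 is only a placeholder from the documentation range, and IPv4 unicast is assumed as the address family; check the command reference for your release before applying it:

```
router bgp 64500
 address-family ipv4 unicast
  ! hold back updates to neighbors until RIB confirms the routes are in FIB
  update wait-install
 !
!
```

Note this only delays advertisements until the FIB has caught up, so it limits blackholing and loops during the inconsistent window; it does not make the FIB programming itself any faster.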
Are you seeing any log messages indicating a bottleneck between RIB and FIB?
Do you drop BGP updates on ingress with "as-path length ge 51"? Not only is it good practice, but long AS paths have apparently caused RIB-FIB clogging in the past.
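For reference, a rough RPL sketch of such an ingress filter. The policy name DROP-LONG-ASPATH, the AS number 64500 and the neighbor address 192.0.2.1 are made-up placeholders; in practice the check would normally go into whatever inbound policy you already run:

```
route-policy DROP-LONG-ASPATH
  ! discard anything with 51 or more ASNs in the path
  if as-path length ge 51 then
    drop
  else
    pass
  endif
end-policy
!
router bgp 64500
 neighbor 192.0.2.1
  address-family ipv4 unicast
   route-policy DROP-LONG-ASPATH in
  !
 !
!
```

The explicit "pass" is there because a route that reaches the end of an RPL policy without being passed or modified is dropped by default.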
On your note regarding the apparent relation to the number of peers:
How long does the process take to complete on the nodes with 200 peers? Is it linearly proportional to the 20-30 minutes seen on the nodes with 300 peers?
Or does the relation between number of peers and time follow more of an exponential function (e.g. 290 all good, then 301 and bang, 30 min)? In that case it could also indicate something special about those "delta" peers (e.g. some peers sending somewhat funky updates). Any slow peers, btw?
adam
netconsultings.com
::carrier-class solutions for the telecommunications industry::