[j-nsp] BGP output queue priorities between RIBs/NLRIs

Wed Nov 11 10:10:42 EST 2020

> Rob Foehl
> Sent: Tuesday, November 10, 2020 6:26 PM
> 
> On Tue, 10 Nov 2020, Jeffrey Haas wrote:
> 
> > The thing to remember is that even though you're not getting a given
> > afi/safi as front-loaded as you want (absolute front of queue), as soon
> > as we have routes for that priority they're dispatched accordingly.
> 
> Right, that turns out to be the essential issue -- the output queues
> actually are working as configured, but the AFI/SAFI routes relevant to a
> higher priority queue arrive so late in the process that it's basically
> irrelevant whether they get to cut in line at that point.  Certainly wasn't
> observable to human eyes, had to capture the traffic to verify.
> 
I agree: if the priority of route processing is to be user
controllable/selectable, then it needs to apply end-to-end, i.e. RX,
processing, and TX.
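
To illustrate why TX-only priority doesn't buy much, here is a toy model in
Python. It is only a sketch under my own assumptions, not how rpd works
internally: the function name, the single table walker, the one-route-per-tick
sender and the `depth` limit (how far the walker may run ahead of the sender)
are all made up for illustration.

```python
import heapq
from itertools import count

def first_sent_tick(tables, priority, depth=1000):
    """Return {family: tick at which its first route left the queue}.

    tables   -- list of (family, route_count) in the order the walker visits them
    priority -- {family: rank}; a lower rank is dequeued first
    depth    -- max routes the walker may have queued ahead of the sender
    """
    walker = (fam for fam, n in tables for _ in range(n))
    seq = count()              # tie-breaker keeps FIFO order within a rank
    queue, first_sent, tick = [], {}, 0
    exhausted = False
    while True:
        # The walker tops the output queue up to its depth limit.
        while not exhausted and len(queue) < depth:
            fam = next(walker, None)
            if fam is None:
                exhausted = True
            else:
                heapq.heappush(queue, (priority[fam], next(seq), fam))
        if not queue:
            return first_sent
        # The sender transmits one route per tick, best rank first.
        tick += 1
        _, _, fam = heapq.heappop(queue)
        first_sent.setdefault(fam, tick)

prio = {"evpn": 0, "inet": 1}
inet_walked_first = first_sent_tick([("inet", 16000), ("evpn", 30)], prio, depth=10)
evpn_walked_first = first_sent_tick([("evpn", 30), ("inet", 16000)], prio, depth=10)
```

With inet walked first, the first evpn route only leaves once nearly all of
the inet routes have gone, however high its queue priority is; it can only
jump ahead of whatever is sitting in the queue at that moment (at most
`depth - 1` routes in this model). With evpn walked first it leaves on the
first tick.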

> > Full table walks to populate the queues take some seconds to several
> > minutes depending on the scale of the router.  In the absence of
> > prioritization, something like the evpn routes might not go out for most
> > of a minute rather than getting delayed some number of seconds until the
> > rib walker has reached that table.
> 
> Ah, maybe this is the sticking point: on a route reflector with an
> RE-S-X6-64 carrying ~10M inet routes and ~10K evpn routes, a new session
> toward an RR client PE needing to be sent ~1.6M inet routes (full table,
> add-path 2) and maybe ~3K evpn routes takes between 11-17 minutes to get
> through the initial batch.  The evpn routes only arrive at the tail end of
> that, and may only preempt around 1000 inet routes in the output queues, as
> confirmed by TAC.
> 
> I have some RRs that tend toward the low end of that range and some that
> tend toward the high end -- and not entirely sure why in either case -- but
> that timing is pretty consistent overall, and pretty horrifying.  I could
> almost live with "most of a minute", but this is not that.
> 
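For a sense of scale, a quick back-of-the-envelope calculation on the figures
quoted above (~1.6M inet routes, ~3K evpn routes, 11-17 minutes). The big
assumption here, and it is only an assumption, is that evpn routes would be
sent at the same average per-route rate as the inet ones; the variable names
are mine.

```python
INET_ROUTES = 1_600_000
EVPN_ROUTES = 3_000

def routes_per_second(routes, minutes):
    """Average send rate over the whole initial batch."""
    return routes / (minutes * 60)

fast = routes_per_second(INET_ROUTES, 11)   # best case quoted
slow = routes_per_second(INET_ROUTES, 17)   # worst case quoted

# Time the evpn routes alone would need at those average rates.
evpn_seconds_fast = EVPN_ROUTES / fast
evpn_seconds_slow = EVPN_ROUTES / slow

print(f"{slow:.0f}-{fast:.0f} routes/s, "
      f"evpn alone: {evpn_seconds_fast:.1f}-{evpn_seconds_slow:.1f} s")
```

So on those assumptions the evpn routes themselves are a second or two of
work; nearly all of the wait is time spent behind the inet table walk.
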
Well, regardless of the issue at hand, I urge you to use separate RRs for
distribution of Internet prefixes and separate ones for VPN (L2/L3) prefixes.
Not only might it* address your problem, but it's also much safer, since the
probability of a malformed message arriving via the Internet (e.g. some
university doing experiments) is much higher than that of one being
originated by your own PEs.

*It won't address your issue, because PEs on the receiving end will still
have broken priority.

> 
> [on the topic of route refreshes]
> 
> > The intent of the code is to issue the minimum set of refreshes for new
> > configuration.  If it's provably not minimum for a given config, there
> > should be a PR on that.
> 
> I'm pretty sure that much is working as intended, given what is actually
> sent -- this issue is the time spent walking other RIBs that have no
> bearing on what's being refreshed.
> 
This is a notorious case actually, again probably because of missing state.
I ran into an issue with 2k VRFs plus a VRF containing Internet routes:
after adding, say, the 2001st VRF, it would take up to 10 minutes for routes
already in VPNv4 on the local PE to actually make it into the newly
configured VRF (directly connected prefixes and static routes appeared
instantly).

> > The cost of the refresh in getting routes sent to you is another
> > artifact of "we don't keep that state" - at least in that configuration.
> > This is a circumstance where family route-target (RT-Constrain) may help.
> > You should find when using that feature that adding a new VRF with support
> > for that feature results in the missing routes arriving quite fast - we
> > keep the state.
> 
> I'd briefly looked at RT-Constrain, but wasn't convinced it'd be useful
> here since disinterested PEs only have to discard at most ~10K EVPN routes
> at present.  Worth revisiting that assessment?
> 
It would definitely save some cycles and I'd say it's worth implementing.
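
Roughly, the "we keep the state" part can be pictured like this. It is a toy
Python sketch of the RT-Constrain (RFC 4684) idea only, not Junos internals:
the class, its method names and the route keys are hypothetical. The sender
indexes routes by route target and remembers which targets each peer asked
for, so a newly added target means a lookup instead of a walk over every
table.

```python
from collections import defaultdict

class RouteReflector:
    """Toy RR that indexes VPN routes by route target (illustrative only)."""

    def __init__(self):
        self.by_rt = defaultdict(set)        # route target -> route keys
        self.membership = defaultdict(set)   # peer -> route targets it asked for

    def add_route(self, key, route_targets):
        for rt in route_targets:
            self.by_rt[rt].add(key)

    def advertise_membership(self, peer, rt):
        """Peer signals interest in rt; return only the routes newly owed to it."""
        if rt in self.membership[peer]:
            return set()
        # Routes the peer already received via its other route targets.
        already = set()
        for known in self.membership[peer]:
            already |= self.by_rt.get(known, set())
        self.membership[peer].add(rt)
        return self.by_rt.get(rt, set()) - already

rr = RouteReflector()
rr.add_route("evpn-1", {"target:65000:1"})
rr.add_route("evpn-2", {"target:65000:2"})
rr.add_route("evpn-3", {"target:65000:1", "target:65000:2"})
owed = rr.advertise_membership("pe1", "target:65000:1")
```

Here `owed` holds only the routes carrying the target pe1 asked for; the
route tagged solely with the other target is never sent, so there is nothing
for a disinterested PE to discard.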

adam