[j-nsp] BGP output queue priorities between RIBs/NLRIs

Tue Nov 10 13:26:09 EST 2020

On Tue, 10 Nov 2020, Jeffrey Haas wrote:

> The thing to remember is that even though you're not getting a given afi/safi as front-loaded as you want (absolute front of queue), as soon as we have routes for that priority they're dispatched accordingly.

Right, that turns out to be the essential issue -- the output queues 
actually are working as configured, but the AFI/SAFI routes relevant to a 
higher priority queue arrive so late in the process that it's basically 
irrelevant whether they get to cut in line at that point.  It certainly 
wasn't observable to human eyes; I had to capture the traffic to verify.

> Full table walks to populate the queues take some seconds to several minutes depending on the scale of the router.  In the absence of prioritization, something like the evpn routes might not go out for most of a minute rather than getting delayed some number of seconds until the rib walker has reached that table.

Ah, maybe this is the sticking point: on a route reflector with an 
RE-S-X6-64 carrying ~10M inet routes and ~10K evpn routes, a new session 
toward an RR client PE needing to be sent ~1.6M inet routes (full table, 
add-path 2) and maybe ~3K evpn routes takes 11-17 minutes to get 
through the initial batch.  The evpn routes only arrive at the tail end of 
that, and may only preempt around 1000 inet routes in the output queues, 
as confirmed by TAC.
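
A toy model of the effect described above, assuming a single walker that fills a priority output queue slightly faster than a writer drains it. The function name, the rates, and the 1:100 scaled route counts are illustrative assumptions, not a description of how rpd actually schedules work. It shows only that priority can reorder what is already queued, not what the walker has yet to reach.

```python
import heapq

def send_order(walk, fill_per_tick, drain_per_tick):
    """Return families in the order they leave a priority output queue.

    walk: list of (family, rank) in RIB-walk order; lower rank = higher priority.
    """
    queue = []  # min-heap of (rank, arrival_seq, family)
    sent = []
    pos = 0
    while pos < len(walk) or queue:
        # The walker enqueues the next batch of routes it has reached.
        batch = walk[pos:pos + fill_per_tick]
        for seq, (family, rank) in enumerate(batch, start=pos):
            heapq.heappush(queue, (rank, seq, family))
        pos += fill_per_tick
        # The writer sends the highest-priority routes currently queued.
        for _ in range(min(drain_per_tick, len(queue))):
            sent.append(heapq.heappop(queue)[2])
    return sent

# Hypothetical scaled-down tables: 16000 inet and 30 evpn routes.
INET, EVPN = ("inet", 1), ("evpn", 0)
evpn_last = [INET] * 16000 + [EVPN] * 30   # evpn table walked last
evpn_first = [EVPN] * 30 + [INET] * 16000  # evpn table walked first

late = send_order(evpn_last, fill_per_tick=10, drain_per_tick=9)
early = send_order(evpn_first, fill_per_tick=10, drain_per_tick=9)
print(late.index("evpn"), early.index("evpn"))
```

With these made-up rates the first evpn route leaves after 14400 of the 16000 inet routes when the evpn table is walked last, jumping only the 1600 still queued, and leaves immediately when it is walked first.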

I have some RRs that tend toward the low end of that range and some that 
tend toward the high end -- and I'm not entirely sure why in either case -- 
but that timing is pretty consistent overall, and pretty horrifying.  I 
could almost live with "most of a minute", but this is not that.
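
For scale, a back-of-envelope calculation from the figures quoted above (~1.6M routes, 11-17 minutes, ~1000 inet routes preempted). It assumes routes are sent at a uniform rate over the whole window, which the measurements do not establish; the helper names are made up for this sketch.

```python
def implied_rate(routes, minutes):
    """Average routes per second if sending is spread evenly over the window."""
    return routes / (minutes * 60)

def head_start_seconds(preempted_routes, rate):
    """Time gained by jumping ahead of `preempted_routes` at `rate` routes/s."""
    return preempted_routes / rate

for minutes in (11, 17):
    rate = implied_rate(1_600_000, minutes)
    gain = head_start_seconds(1000, rate)
    print(f"{minutes} min: ~{rate:.0f} routes/s, head start ~{gain:.2f} s")
```

Under that assumption, cutting ahead of ~1000 inet routes buys the evpn routes well under a second in a transfer that takes over ten minutes.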

This blackholes traffic for long periods in several cases, but the 
consequences for DF elections are particularly disastrous, given that each 
PE makes up its own mind based on received state without any affirmative 
handshake: the only possible behaviors are discarding or looping traffic 
for every ethernet segment involved until the routes settle, depending on 
whether the PE involved believes it's going to win the election and how 
soon.  Setting extremely long 20-minute DF election hold timers is 
currently the least worst "solution", as losing traffic for up to 20 
minutes is preferable to flooding a segment into oblivion -- but only just.
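
A minimal sketch of why inconsistent received state produces either loops or discards. It assumes the default RFC 7432 election (sort the PE addresses learned for the segment, pick index VLAN mod N); whether these PEs run that algorithm or another one is not stated here. The addresses and views are hypothetical, and hold timers are not modeled.

```python
import ipaddress

def elect_df(known_pes, vlan):
    """DF as computed locally from the PEs this router currently knows about."""
    ordered = sorted(known_pes, key=ipaddress.ip_address)
    return ordered[vlan % len(ordered)]

def self_declared_dfs(views, vlan):
    """PEs that, from their own view of the segment, believe they are the DF."""
    return sorted(pe for pe, known in views.items()
                  if elect_df(known, vlan) == pe)

PE1, PE2, PE3 = "192.0.2.1", "192.0.2.2", "192.0.2.3"  # hypothetical PEs

# Both PEs have received each other's ES routes: exactly one DF.
settled = {PE1: {PE1, PE2}, PE2: {PE1, PE2}}
# PE2 has not yet received PE1's ES route: both elect themselves.
missing = {PE1: {PE1, PE2}, PE2: {PE2}}
# PE2 still holds a stale route from a departed PE3: nobody forwards.
stale = {PE1: {PE1, PE2}, PE2: {PE2, PE3}}

print(self_declared_dfs(settled, 100))
print(self_declared_dfs(missing, 100))
print(self_declared_dfs(stale, 101))
```

Two self-declared DFs corresponds to the looping case and none to the discarding case; nothing in the election itself tells a PE that its view is incomplete.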

I wouldn't be nearly as concerned with this if we weren't taking 15-20 
minute outages every time anything changes on one of the PEs involved...

[on the topic of route refreshes]

> The intent of the code is to issue the minimum set of refreshes for new configuration.  If it's provably not minimum for a given config, there should be a PR on that.

I'm pretty sure that much is working as intended, given what is actually 
sent -- the issue is the time spent walking other RIBs that have no 
bearing on what's being refreshed.

> The cost of the refresh in getting routes sent to you is another artifact of "we don't keep that state" - at least in that configuration.  This is a circumstance where family route-target (RT-Constrain) may help.  You should find when using that feature that adding a new VRF with support for that feature results in the missing routes arriving quite fast - we keep the state.

I'd briefly looked at RT-Constrain, but wasn't convinced it'd be useful 
here since uninterested PEs only have to discard at most ~10K EVPN routes 
at present.  Worth revisiting that assessment?

-Rob