[j-nsp] BGP output queue priorities between RIBs/NLRIs
Rob Foehl
rwf at loonybin.net
Tue Nov 10 13:26:09 EST 2020
On Tue, 10 Nov 2020, Jeffrey Haas wrote:
> The thing to remember is that even though you're not getting a given afi/safi as front-loaded as you want (absolute front of queue), as soon as we have routes for that priority they're dispatched accordingly.
Right, that turns out to be the essential issue -- the output queues
actually are working as configured, but the AFI/SAFI routes relevant to a
higher priority queue arrive so late in the process that it's basically
irrelevant whether they get to cut in line at that point. It certainly
wasn't observable to human eyes; I had to capture the traffic to verify.
> Full table walks to populate the queues take some seconds to several minutes depending on the scale of the router. In the absence of prioritization, something like the evpn routes might not go out for most of a minute rather than getting delayed some number of seconds until the rib walker has reached that table.
Ah, maybe this is the sticking point: on a route reflector with an
RE-S-X6-64 carrying ~10M inet routes and ~10K evpn routes, a new session
toward an RR client PE needing to be sent ~1.6M inet routes (full table,
add-path 2) and maybe ~3K evpn routes takes between 11 and 17 minutes to get
through the initial batch. The evpn routes only arrive at the tail end of
that, and may only preempt around 1000 inet routes in the output queues,
as confirmed by TAC.
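
To illustrate why the priority barely matters here, the following is a toy model only, not Junos' actual implementation. It assumes a single RIB walker that enqueues every inet route before any evpn route, a sender that drains the output queue slightly slower than the walker fills it, and route counts scaled down roughly 100:1 from the figures above. The function name, the rates, and the walk order are all assumptions made for illustration.

```python
import heapq
from collections import deque


def first_evpn_send_position(inet_routes, evpn_routes,
                             walk_per_tick, send_per_tick, prioritize):
    """Return the 1-based position of the first evpn route in the sent stream.

    Hypothetical model: the walker visits all inet routes, then all evpn
    routes; the sender always pops the highest-priority queued route.
    """
    walk = deque(["inet"] * inet_routes + ["evpn"] * evpn_routes)
    queue = []  # heap of (priority, arrival_seq, family); lower sorts first
    seq = 0
    sent = 0
    while walk or queue:
        # Walker: move up to walk_per_tick routes into the output queue.
        for _ in range(min(walk_per_tick, len(walk))):
            family = walk.popleft()
            prio = 0 if (prioritize and family == "evpn") else 1
            heapq.heappush(queue, (prio, seq, family))
            seq += 1
        # Sender: transmit up to send_per_tick routes, best priority first.
        for _ in range(min(send_per_tick, len(queue))):
            _, _, family = heapq.heappop(queue)
            sent += 1
            if family == "evpn":
                return sent
    return None


fifo = first_evpn_send_position(16000, 30, 100, 90, prioritize=False)
prio = first_evpn_send_position(16000, 30, 100, 90, prioritize=True)
print(fifo, prio, fifo - prio)  # 16001 14401 1600
```

In this model the prioritized evpn routes jump only the backlog that exists when the walker finally reaches them (1600 of 16000 inet routes), so they still go out in the last tenth of the stream.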
I have some RRs that tend toward the low end of that range and some that
tend toward the high end -- and I'm not entirely sure why in either case --
but that timing is pretty consistent overall, and pretty horrifying. I
could almost live with "most of a minute", but this is not that.
This ends up blackholing traffic for long periods in several
cases, but the consequences for DF elections are particularly disastrous,
given that they make up their own minds based on received state without
any affirmative handshake: the only possible behaviors are discarding or
looping traffic for every ethernet segment involved until the routes
settle, depending on whether the PE involved believes it's going to win
the election and how soon. Setting extremely long 20-minute DF election
hold timers is currently the least worst "solution", as losing traffic for
up to 20 minutes is preferable to flooding a segment into oblivion -- but
only just.
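
The two failure modes can be sketched with the default DF election procedure from RFC 7432 (order the candidate PE addresses numerically, pick index VLAN mod N). This is a minimal sketch, not a description of any particular implementation; the addresses, VLAN IDs, and helper names are made up for illustration. Each PE runs the calculation on whatever Ethernet Segment routes it has received so far, so differing views give differing answers.

```python
import ipaddress


def elect_df(candidate_pes, vlan):
    """RFC 7432 default election: sort candidates by IP, pick vlan mod N."""
    ordered = sorted(candidate_pes, key=ipaddress.ip_address)
    return ordered[vlan % len(ordered)]


def forwarders(views, vlan):
    """Return the PEs that believe they are the DF, given each PE's own view.

    views maps a PE address to the candidate list that PE currently knows
    about (hypothetical data standing in for received ES routes).
    """
    return [pe for pe, view in views.items() if elect_df(view, vlan) == pe]


pe1, pe2, stale = "192.0.2.1", "192.0.2.2", "192.0.2.3"

# PE2 has not yet received PE1's route: both PEs elect themselves.
looping = forwarders({pe1: [pe1, pe2], pe2: [pe2]}, vlan=100)

# PE2 still holds a route from a PE that has gone away: nobody forwards.
discarding = forwarders({pe1: [pe1, pe2], pe2: [pe1, pe2, stale]}, vlan=101)

# Both PEs see the same candidate list: exactly one DF.
converged = forwarders({pe1: [pe1, pe2], pe2: [pe1, pe2]}, vlan=100)

print(looping, discarding, converged)
```

With two self-elected forwarders the segment sees duplicated or looped traffic; with none, traffic is discarded until the slower PE's view catches up.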
I wouldn't be nearly as concerned with this if we weren't taking 15-20
minute outages every time anything changes on one of the PEs involved...
[on the topic of route refreshes]
> The intent of the code is to issue the minimum set of refreshes for new configuration. If it's provably not minimum for a given config, there should be a PR on that.
I'm pretty sure that much is working as intended, given what is actually
sent -- this issue is the time spent walking other RIBs that have no
bearing on what's being refreshed.
> The cost of the refresh in getting routes sent to you is another artifact of "we don't keep that state" - at least in that configuration. This is a circumstance where family route-target (RT-Constrain) may help. You should find when using that feature that adding a new VRF with support for that feature results in the missing routes arriving quite fast - we keep the state.
I'd briefly looked at RT-Constrain, but wasn't convinced it'd be useful
here since uninterested PEs only have to discard at most ~10K EVPN routes
at present. Worth revisiting that assessment?
-Rob