[c-nsp] XR6 process conflicts

nivalMcNd d nivalmcnd at gmail.com
Sun Sep 6 05:18:20 EDT 2020


Hi all,

Recently I was doing some testing on XR6 and noticed interesting behavior.

I enabled OSPF adjacency traps to see how the router performs with these
traps. I was getting only a handful of traps for a big router with only a
few OSPF sessions.
That behavior aligns with the RFC 1850 definition back from 1995:

    4.4.  Throttling Traps

       The mechanism for throttling the traps is similar to the mechanism
       explained in RFC 1224 [11], section 5.  The basic idea is that there
       is a sliding window in seconds and an upper bound on the number of
       traps that may be generated within this window.  Unlike RFC 1224,
       traps are not sent to inform the network manager that the throttling
       mechanism has kicked in.

       A single window should be used to throttle all OSPF trap types
       except for the ospfLsdbOverflow and the ospfLsdbApproachingOverflow
       trap which should not be throttled.  For example, if the window time
       is 3, the upper bound is 3 and the events that would cause trap types
       1, 3, 5 and 7 occur within a 3 second period, the type 7 trap should
       not be generated.

       Appropriate values are 7 traps with a window time of 10 seconds.
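For reference, the throttle the RFC describes boils down to something like this (a minimal Python sketch of my reading of section 4.4; the class name and method are mine, not anything from XR):

```python
from collections import deque
import time

class TrapThrottle:
    """Sliding-window trap throttle per RFC 1850, section 4.4.

    At most `upper_bound` traps may be emitted within any `window`-second
    period; excess traps are silently dropped, and no notification is sent
    that throttling has kicked in.
    """

    def __init__(self, window=10, upper_bound=7):  # RFC-suggested values
        self.window = window
        self.upper_bound = upper_bound
        self.sent = deque()  # timestamps of traps sent within the window

    def allow(self, now=None):
        """Return True if a trap may be sent now, False if suppressed."""
        now = time.monotonic() if now is None else now
        # Drop timestamps that have slid out of the window
        while self.sent and now - self.sent[0] >= self.window:
            self.sent.popleft()
        if len(self.sent) < self.upper_bound:
            self.sent.append(now)
            return True
        return False  # trap suppressed

# The RFC's own example: window 3, upper bound 3, four trap-worthy events
# inside a 3-second period -- the fourth one must not generate a trap.
t = TrapThrottle(window=3, upper_bound=3)
print([t.allow(now=x) for x in (0.0, 1.0, 2.0, 2.5)])  # [True, True, True, False]
```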


TAC mentioned that this is expected behavior aimed at protecting the CPU, and
that it can be changed by tweaking how snmpd reacts to the overload (OL)
signal. While that kind of makes sense overall, it seems very strange that a
lab XR router running on a multicore Xeon CPU is unable to send out a handful
of traps (I was expecting about 50 traps over a 1 minute period). At the same
time, TAC didn't mention anything about OSPF's built-in trap throttling
mechanism.
Further investigation showed that snmpd was silently dropping internal
messages because of the OL condition. This is not the way I would expect a
Linux-based system to behave.

Trying to diagnose XR, I collected some syslogs and wrote a quick script to
compare what XR sends out over its SNMP interface versus its syslog
interface. To my surprise, while the OSPF-related SNMP trap feed was really
bad, the syslog feed was complete and accurate, as if syslogd doesn't react
to the OL condition and relies on a different OS scheduling mechanism.
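The comparison itself is nothing fancy; roughly, it reduces each feed to a set of event keys and reports what syslog saw but the trap collector never received (this is an illustrative sketch, not my actual script, and the field choices are hypothetical; my real matching was on OSPF neighbor/state pairs):

```python
def missing_traps(syslog_events, trap_events):
    """Return syslog-observed events with no matching SNMP trap.

    Each event is a hashable key, e.g. a (neighbor, new_state) tuple
    extracted from the respective feed.
    """
    seen = set(trap_events)
    return [e for e in syslog_events if e not in seen]

# Toy data: three neighbor-state changes in syslog, only one trap received
syslog = [("10.0.0.1", "FULL"), ("10.0.0.2", "DOWN"), ("10.0.0.3", "FULL")]
traps = [("10.0.0.1", "FULL")]
print(missing_traps(syslog, traps))  # the two events the collector never saw
```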

I tried to put extra stress on the router: I created a few hundred
subinterfaces with only a few OSPF neighbors and flapped an interface. Once
again, the SNMP traps sent to the collector were crippled, while syslog was
clean and reflected everything.

Going deeper, I tested BGP reconvergence and tried to observe what was
happening there. Once again, the SNMP trap feed was very bad at reflecting
BGP FSM transitions (I looked at cbgpFsmStateChange). However, this time even
outgoing syslog messages were affected, which is also a bit surprising, as if
BGP on XR uses the CPUs differently. I still don't get how an idling lab
router can be unable to send traps and syslog messages indicating that it
just experienced a big outage.

On the one hand, some of that behavior derives from the fact that the router
is trying to bring connectivity up and operational asap; on the other hand,
these CPU throttling aspects were drafted back in the days when routers ran
on 800 MHz CPUs, while now we're running on server-grade multi-core x86s that
should not only be able to do the whole SPF computation in milliseconds, but
also swamp any alarm collector at the same time. It feels like either the
SNMP/syslog reporting functions in XR never accounted for the hardware
improvements that happened over time, or XR's OS process scheduler has major
deficiencies. And I feel like it is the latter.

I know SNMP traps and syslog are inherently unreliable, being UDP-based, but
I never expected that the router itself wouldn't even try to inform alarm
collectors about a potential large-scale outage. If it really is the OS
scheduler, a lot of existing processes and any upcoming features, like BGP-LS
or telemetry, may be affected.

Later I remembered that XR6 is now based on Wind River Linux, not QNX as it
was previously. While QNX is an RTOS, Linux is not, and its kernel relies on
a totally different scheduling principle. At the same time, it feels as if
XR's internals were bolted onto a new OS without much thought about its
architecture. Potentially that may lead to a large number of gray outages
that are not properly detected by XR6 routers, and not even reported to NOCs
globally.

Am I the only one seeing this behavior in XR?
Did anyone else test how XR routers running on multicore CPUs handle
concurrency? Has anyone compared how XR handles process concurrency for
network events?
Unfortunately, I am unable to share any data dumps, but I would be happy to
share the scripts and methods I used for data analysis.
I hope I am just overreacting.

Rgds,
Nival

