[j-nsp] Automation - The Skinny (Was: Re: ACX5448 & ACX710)

Mon Jan 27 03:06:58 EST 2020

On Mon, 27 Jan 2020 at 00:18, Robert Raszuk <robert at raszuk.net> wrote:

> The other one is actually of keeping your network running. Imagine router maintaining entire control plane perfectly fine, imagine BFD working fine to the box from peers but dropping between line cards via fabric from 20% to 80% traffic. Unfortunately this is not a theory but real world :(
>
> Without proper automation in place going way above basic IGP, BGP, LDP, BFD etc ... you need a bit of clever automation to detect it and either alarm noc or if they are really smart take such router out of the SPF network wide. If not you sit and wait till pissed customers call - which is already a failure.

To me, automation and monitoring are very different subjects.
Everyone has war stories of those long tail problems when something
utterly weird is happening in the network and how problematic it was
to find. But this particular example is fairly easy: either you are
polling a drop counter which shows the drops, or your delta of
packets in - (packets out + drops) is off.
But there will always be a massive number of long-tail risks which your
NMS won't know about; things break in very creative and complex
ways. And you can monitor these very carefully, you can screenscrape
all NPU counters and find that your network is behaving suboptimally
_right now_: you see NPU exceptions/trapstats increasing which should
not, and you can spend months figuring out 1 issue out of the hundred
you have, all of which are real issues, but which might affect one
packet in a billion.
Is it worth knowing these? We are screenscraping and graphing all NPU
counters, as these typically are not available in the GUI; in the case
of JunOS they are not even modelled, because they are PFE counters. We
rarely tend to them proactively, because fixing them causes more
outages than letting them be. But often when strange issues do happen
at scale which customers care about, these counters reduce MTTR.
So if you think you don't have active issues, you're not monitoring
well enough. When you do monitor well enough you have to decide which
issues to fix and which to let be.
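
For illustration only, the "packets in - (packets out + drops)" check
could be sketched roughly like this. The CounterSample type, the field
names and the 1% threshold are all made up for the example, not taken
from any real NMS or from JunOS. It assumes cumulative counters polled
twice, and it ignores counter wrap, locally terminated or generated
packets and multicast replication, all of which a real check would
have to handle.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CounterSample:
    """One poll of cumulative packet counters (hypothetical field names)."""
    packets_in: int
    packets_out: int
    drops: int


def drop_ratio(prev: CounterSample, curr: CounterSample) -> float:
    """Share of packets received between two polls that were counted as drops."""
    d_in = curr.packets_in - prev.packets_in
    if d_in <= 0:
        return 0.0  # no traffic (or a counter reset): nothing to judge
    return (curr.drops - prev.drops) / d_in


def unaccounted_ratio(prev: CounterSample, curr: CounterSample) -> float:
    """Share of received packets that were neither forwarded nor counted as drops."""
    d_in = curr.packets_in - prev.packets_in
    if d_in <= 0:
        return 0.0
    d_out = curr.packets_out - prev.packets_out
    d_drop = curr.drops - prev.drops
    return (d_in - (d_out + d_drop)) / d_in


def looks_unhealthy(prev: CounterSample, curr: CounterSample,
                    threshold: float = 0.01) -> bool:
    """True if either visible drops or unaccounted loss exceeds the threshold."""
    return (drop_ratio(prev, curr) > threshold
            or unaccounted_ratio(prev, curr) > threshold)
```

Feeding it two polls where half of the received packets never show up
in either the output or the drop counter flags the box, while a box
whose input is fully accounted for stays quiet. Deciding what to do
with that signal (alarm, drain, ignore) is the separate question
discussed above.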

-- 
  ++ytti