[j-nsp] Network automation vs. manual config

Sun Aug 19 20:56:38 EDT 2018

Hi,

We started on the automation journey some time ago. These are some
general topics that you might have to think about:

1. What to automate
There are many different types of 'automation' out there. We currently
concentrate on 'orchestration' of product instances and on 'automation' of
various operational tasks. That means that the 'base' configuration of any
device is assumed, so for example all the policers and shapers we use for
our customers are already pre-defined and can be used.

2. Orchestration - model
The key for us is the Product-Service-Resource model. In that approach each
product has a defined list of parameters (and their values) and consists of
a number of services (in the networking world, for example, an L3VPN product
would have services such as access interface, QoS, routing instance etc).
Both product and services are 'abstract' and don't reflect any
device/model. Services use 'resources' as a way of implementing the
configuration on individual devices. This allows for reusability of code,
but also allows for products that span multiple domains. For example, one
product can contain a number of routers, switches, firewalls and
applications. Some might be provisioned using SSH/CLI scraping, some using
APIs. Currently we only generate the instance configuration (for example an
L3VPN), and not the base configuration. Base configuration in our case falls
under 'automation' (adding a new PE to the network, for example) and is
subject to different rules.
We use Ansible as the engine, with multiple modules on top (including our
own).
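
A minimal Python sketch of what such a Product-Service-Resource layering
could look like. This is for illustration only: the class names, fields,
device name and the Junos-style line are all assumptions, not the actual
implementation described above.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# All names below are hypothetical and only illustrate the layering.

@dataclass
class Resource:
    """Device-specific implementation of (part of) a service."""
    device: str
    transport: str  # e.g. "cli" or "api"
    render: Callable[[Dict[str, object]], List[str]]

@dataclass
class Service:
    """Abstract building block, e.g. 'access interface' or 'QoS'."""
    name: str
    resources: List[Resource] = field(default_factory=list)

@dataclass
class Product:
    """Abstract product: a parameter set plus the services it consists of."""
    name: str
    params: Dict[str, object]
    services: List[Service] = field(default_factory=list)

    def render(self) -> Dict[str, List[str]]:
        """Return the generated config lines grouped per device."""
        per_device: Dict[str, List[str]] = {}
        for service in self.services:
            for res in service.resources:
                per_device.setdefault(res.device, []).extend(res.render(self.params))
        return per_device

def junos_access_interface(p: Dict[str, object]) -> List[str]:
    # Hypothetical resource: one Junos-style line for the access unit.
    return [f"set interfaces {p['ifd']} unit {p['unit']} vlan-id {p['vlan']}"]

l3vpn = Product(
    name="L3VPN",
    params={"ifd": "ge-0/0/1", "unit": 100, "vlan": 100},
    services=[Service("access interface",
                      [Resource("pe1", "cli", junos_access_interface)])],
)
```

Only the Resource layer knows about a particular device or transport, so
the same Service can be reused with a different Resource for another
platform.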

3. Handling errors
Things go wrong even when you automate them. When a product instance uses
resources across 10 devices (and takes 30 minutes to fully roll out) there
must be a reliable roll-back process available. We don't rely on the
devices to do it (as the config could have been changed by something else
already); instead we pre-generate a 'reversal' config that we deploy if we
run into problems. For upgrades, that config reverts to the previous
known-good state; for new installations it simply removes the deployed config.
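
A rough Python sketch of pre-generating a reversal config. It assumes
Junos-style 'set'/'delete' lines and a naive line-by-line comparison, and it
glosses over cases where the delete form differs from the set form; the
function name is hypothetical.

```python
from typing import List, Optional

def reversal_config(new_lines: List[str],
                    previous_lines: Optional[List[str]] = None) -> List[str]:
    """Pre-generate the config to push if a deployment has to be backed out.

    new_lines: 'set' commands about to be deployed.
    previous_lines: last known-good 'set' commands for this instance,
                    or None for a new installation.
    """
    previous = previous_lines or []
    # Remove everything the deployment adds that was not there before...
    removals = ["delete " + line[len("set "):]
                for line in new_lines
                if line.startswith("set ") and line not in previous]
    # ...then re-apply known-good lines that the deployment changes or drops.
    restores = [line for line in previous if line not in new_lines]
    return removals + restores
```

For a new installation previous_lines is empty, so the result is purely a
list of deletes. For an upgrade, the result deletes what changed and restores
the earlier values.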

4. Making updates
When a customer wants to upgrade their product from 500Mb/s to 1Gb/s on the
access layer - how do you do it? In our case we hold 'instance data', which
is the set of input values of the product parameters. Any change to that
set causes all the configs to be regenerated and reprovisioned (details are
down to individual devices: some actually roll out the config even if it's
not different, some don't).
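
As an illustration, a small Python sketch of regenerating from instance data
and pushing only to devices whose generated config changed (the "some don't"
variant). The renderer functions, device names and policer names are made up
for the example.

```python
from typing import Callable, Dict, List

Renderer = Callable[[Dict[str, object]], List[str]]

def plan_update(old_data: Dict[str, object],
                new_data: Dict[str, object],
                renderers: Dict[str, Renderer]) -> Dict[str, List[str]]:
    """Regenerate every device config from the new instance data and return
    only the devices whose generated config actually changed."""
    changed: Dict[str, List[str]] = {}
    for device, render in renderers.items():
        before, after = render(old_data), render(new_data)
        if before != after:
            changed[device] = after
    return changed

# Hypothetical renderers: the PE references a pre-defined policer by name,
# the switch only cares about the VLAN.
renderers = {
    "pe1": lambda d: [f"set interfaces ge-0/0/1 unit 100 family inet "
                      f"policer input {d['policer']}"],
    "sw1": lambda d: [f"vlan {d['vlan']}"],
}
old = {"policer": "POL-500M", "vlan": 100}
new = {"policer": "POL-1G", "vlan": 100}
plan = plan_update(old, new, renderers)
```

Here only pe1 ends up in the plan, because the switch config does not depend
on the bandwidth.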

5. Logging/reporting
All automated operations must be logged, including the changes they make to
all systems. High-level reporting (on the number of failures/successes) across
devices/types etc. helps to pinpoint problems quickly.

6. Dealing with shared resources
Sometimes making changes means changing objects that might already be
configured. For example creating a unit on an interface that requires
particular encapsulation on the interface. The easiest way to deal with
this is to standardise all shared resources, but we found it's not always
possible.

7. Good inventory system
You need a way of storing all the information about your network and
systems, and the ability to automatically allocate things like VLANs, IPs etc.
All of that must be available over an API.
We also store what we call 'instance data' - all the parameters that are
used to create the instance of the product on all devices.
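
To make the allocation part concrete, here is a toy in-memory Python
stand-in for an inventory that hands out VLAN IDs. A real inventory would sit
behind an API and a database; the class and method names are assumptions for
the example.

```python
from typing import Dict

class VlanPool:
    """In-memory stand-in for an inventory system that hands out VLAN IDs."""

    def __init__(self, first: int, last: int) -> None:
        self._range = range(first, last + 1)
        self._owners: Dict[int, str] = {}  # VLAN ID -> instance name

    def allocate(self, instance: str) -> int:
        # Idempotent: the same instance always gets the same VLAN back,
        # so regenerating a config does not burn a new ID.
        for vlan, owner in self._owners.items():
            if owner == instance:
                return vlan
        for vlan in self._range:
            if vlan not in self._owners:
                self._owners[vlan] = instance
                return vlan
        raise RuntimeError("VLAN pool exhausted")

    def release(self, instance: str) -> None:
        # Free whatever the instance holds; a no-op if it holds nothing.
        self._owners = {v: o for v, o in self._owners.items() if o != instance}
```

Keeping allocation idempotent per instance matters when configs are
regenerated from instance data on every change.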

8. Change process that allows for 'automatic deployments'
If you currently have a process that relies on peer reviews, CAB meetings
etc - those things will have to change. Our goal is to be able to provision
an instance using a single API call (but we're not there yet).

9. Offline generation and validation
We generate our configs offline, verify variables and syntax (where
possible) before deployment. This way a lot of errors and inconsistencies
can be detected even before touching the network/systems. Failing here is
'cheap' - nothing is really changed yet. If the failure happens during
deployment it is more 'expensive' - it has to be rolled back carefully on a
number of devices. Each service and resource is responsible for its own
validation. Some of them query external data sources, some query live
devices (for example to make sure that a VLAN ID is not already in use), some
only do syntactic and semantic validation.
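
A minimal Python sketch of a per-service offline validator. The parameter
names are hypothetical, and the set of VLANs in use is passed in as an
argument where a real validator might query an inventory or a live device.

```python
from typing import Dict, List, Set

def validate_access_interface(params: Dict[str, object],
                              vlans_in_use: Set[int]) -> List[str]:
    """Return a list of problems; an empty list means the parameters passed."""
    errors: List[str] = []
    vlan = params.get("vlan")
    # Syntactic check first, then the 'is it already taken' check.
    if not isinstance(vlan, int) or not 1 <= vlan <= 4094:
        errors.append(f"vlan must be an integer in 1..4094, got {vlan!r}")
    elif vlan in vlans_in_use:
        errors.append(f"vlan {vlan} is already in use")
    if not params.get("ifd"):
        errors.append("ifd (physical interface) is required")
    return errors
```

Returning all problems at once, instead of stopping at the first one, keeps
the 'cheap' failure stage useful: one run shows everything that needs fixing
before anything touches the network.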

10. Post-deployment verification
Once all the bits and pieces are in - how do you confirm that the setup is
actually working? For an L3VPN, for example, that might mean prefixes visible
in routing tables on devices, ICMP ping working between different PEs etc. For
things like BGP sessions with customers (and any other customer-dependent
services) it's worth marking them as 'soft' failures at this stage.
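
One way to express the hard/soft distinction, sketched in Python. The Check
structure and names are assumptions; the callables stand in for real probes
such as a routing-table lookup or a ping.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class Check:
    name: str
    run: Callable[[], bool]  # returns True when the check passes
    soft: bool = False       # soft failures are reported but do not fail the deployment

def verify(checks: List[Check]) -> Tuple[bool, List[str], List[str]]:
    """Run all checks; return (deployment_ok, hard_failures, soft_failures)."""
    hard: List[str] = []
    soft: List[str] = []
    for check in checks:
        try:
            ok = check.run()
        except Exception:
            # A probe that blows up counts as a failed check.
            ok = False
        if not ok:
            (soft if check.soft else hard).append(check.name)
    return (not hard, hard, soft)
```

A customer BGP session would be registered with soft=True, so a session that
is down because the customer side is not ready yet gets reported without
triggering a roll-back.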

11. RBAC
Who should have access to which products, and on which devices can they deploy?

kind regards
Pshem

On Fri, 17 Aug 2018 at 22:55 Antti Ristimäki <antti.ristimaki at csc.fi> wrote:

> Hi colleagues,
>
> This is something that I've been thinking about quite a lot, so I would be
> delighted to hear some comments, experiences or recommendations.
>
> So, now that more and more of us are automating our networks, there will
> be the question of how to manage the configurations, if they are
> partially automated and partially manually maintained. This will be the
> case especially while transitioning from a pure CLI jockey network towards
> a more automated one. There are probably multiple approaches to solve this,
> but below are a few of them:
>
> One option is to generate the whole config automatically e.g. from a
> template or a database and just _not_accepting_ any manual configurations
> at all. Then when there are needs to do something custom not yet supported
> by the automation tools, instead of manually configuring it one would take
> some additional time and build the support into the automation tools. The
> cost for this might be that deploying something new/custom/tailor-made
> might take a bit more time compared to just manually configuring it, but in
> the long run the benefits are obvious. I personally prefer this
> approach.
>
> Generating the _whole_ configuration automatically off-line from
> scratch also makes it easy to remove elements from the configuration, as
> the auto-generated config can completely replace the existing
> running-config.
>
> If the above mentioned is not doable for the entire configuration, one can
> take one configuration hierarchy level at a time and automate it, after
> which no manual configurations will be accepted under that hierarchy. This
> is rather trivial especially for those configuration hierarchies that tend
> to be static most of the time.
>
> Another option is to apply the auto-generated configuration via
> apply-groups and apply all manual configurations explicitly so that the
> automatic and manual configurations merge with each other. The positive
> side of this approach is that it makes it easy to develop the automation tools
> so that manual configs are not overridden by auto-generated config, but I
> personally find it somewhat inconvenient that one doesn't really see the
> effective running-config when using apply-groups, unless one remembers to
> display inheritance.
>
> Any thoughts appreciated.
>
> Antti