[j-nsp] Segment Routing Real World Deployment (was: VPC mc-lag)

Sun Jul 8 19:13:27 EDT 2018

Hi experts,

I had a pleasant time reading the whole thread. Thanks, folks!

Honestly, I also (a bit like Saku) feel that Alexandre's case is more about
throwing the *unneeded* complexity away than about BGP vs. LDP.

The whole point of Kompella-style signaling for L2VPN and VPLS is
auto-discovery in the multi-point VPN service case.
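
For readers less familiar with it, the auto-discovery part can be sketched
in a few lines. This is a minimal illustration, assuming the label-block
arithmetic described in RFC 4761 (BGP VPLS): each PE advertises one label
block per site, and every remote PE derives the pseudo-wire label it should
use from its own VE ID, so nothing is configured per neighbor. The class
name, function name and all numbers below are made up for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LabelBlock:
    """One advertised label block (field names are hypothetical)."""
    ve_id: int       # VE ID of the advertising site
    label_base: int  # LB
    offset: int      # VBO, first remote VE ID covered by this block
    size: int        # VBS, number of remote VE IDs covered


def label_towards(block: LabelBlock, remote_ve_id: int) -> Optional[int]:
    """Label a remote site uses to send frames to the advertising site.

    Returns None when the remote VE ID falls outside the block, in which
    case the advertiser would have to announce another block.
    """
    if block.offset <= remote_ve_id < block.offset + block.size:
        return block.label_base + remote_ve_id - block.offset
    return None


# Site 1 advertises a block covering remote VE IDs 1..8 (made-up values).
block = LabelBlock(ve_id=1, label_base=1000, offset=1, size=8)
print(label_towards(block, 3))  # 1002
print(label_towards(block, 9))  # None
```

Adding a new site then only means picking an unused VE ID; the existing PEs
compute the labels on their own.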

But yes, there is a whole bunch of reasons why multi-point L2 VPN sucks,
and when bridged it sucks 10x more. So if you can throw it away, just throw
it away and you won't need to discuss how to signal it and auto-discover
remote sites.

And yes, as the pseudo-wire data plane is way simpler than VPLS, depending on
your access network design, you can [try to] extend it end-to-end, all the
way to the access switch and [maybe, if you are lucky] dramatically
simplify your NOC's life.

However, a p2p pseudo-wire service is a rare thing these days. There
are [quite a lot of] poor folks who were never asked whether bridged
L2 VPN (aka VPLS) is needed in the network they operate. They don't have
much choice.

BGP signaling is the coolest part of the VPLS hell (some minimal magic is
required though). In general I agree with the idea that iBGP stability is
all about making the underlying stuff simple and clean (IGP, BFD, Loss of
Light, whatever). Who said "policies"? For VPLS BGP signaling? Please don't.

And yes, switching frames between fancy full-feature PEs is just half of
the game. The autodiscovery beauty breaks when the frames say bye bye to
the MPLS backbone and meet the ugly access layer. Now you need to switch it
down to the end-point and this often ends up in old good^W VLAN
provisioning. But it's not about BGP, it's about VPLS. Or rather about
those brave folks who build their services relying on all these
Ethernet-on-steroids things.

--
Kind regards,
Pavel

On Sun, Jul 8, 2018 at 10:57 PM, <adamv0025 at netconsultings.com> wrote:

> > From: James Bensley [mailto:jwbensley at gmail.com]
> > Sent: Friday, July 06, 2018 2:04 PM
> >
> >
> >
> > On 5 July 2018 09:56:40 BST, adamv0025 at netconsultings.com wrote:
> > >> Of James Bensley
> > >> Sent: Thursday, July 05, 2018 9:15 AM
> > >>
> > >> - 100% rLFA coverage: TI-LFA covers the "black spots" we currently
> > >> have.
> > >>
> > >Yeah that's an interesting use case you mentioned, one that I hadn't
> > >considered: no TE need, but an FRR need.
> > >But I guess if it was business critical to get those blind spots
> > >FRR-protected then you would have done something about it already
> > >right?
> >
> > Hi Adam,
> >
> > Yeah correct, no mission critical services are affected by this for us,
> > so the business obviously hasn't allocated resource to do anything about
> > it. If it was a major issue, it should be as simple as adding an extra
> > back haul link to a node or shifting existing ones around (to reshape
> > the P space and Q space to "please" the FRR algorithm).
> >
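
The P space / Q space remark above can be illustrated with a small sketch.
It assumes the usual remote LFA definitions (RFC 7490) for protecting a link
S-E with symmetric link costs: the P space is every node S reaches without
any shortest path crossing S-E, the Q space is every node that reaches E
without any shortest path crossing S-E, and a repair tunnel needs a node in
both. Only the basic P space is computed here, not the extended P space.
The topology, node names and function names are invented for the example.

```python
import heapq


def build_graph(links):
    """Build an undirected weighted graph from (a, b, cost) tuples."""
    graph = {}
    for a, b, cost in links:
        graph.setdefault(a, {})[b] = cost
        graph.setdefault(b, {})[a] = cost
    return graph


def shortest_dists(graph, src):
    """Plain Dijkstra; returns {node: cost of the shortest path from src}."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v, cost in graph[u].items():
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


def pq_nodes(graph, s, e):
    """Return (P space, Q space, PQ nodes) for protecting link s-e.

    Assumes a connected graph with symmetric costs. The strict "<" excludes
    nodes that have an equal-cost shortest path over the protected link.
    """
    link_cost = graph[s][e]
    from_s = shortest_dists(graph, s)
    from_e = shortest_dists(graph, e)
    others = set(graph) - {s, e}
    p_space = {n for n in others if from_s[n] < link_cost + from_e[n]}
    q_space = {n for n in others if from_e[n] < from_s[n] + link_cost}
    return p_space, q_space, p_space & q_space


# Six-node ring with unit costs: link S-E is a "black spot".
ring = [("S", "A", 1), ("A", "B", 1), ("B", "C", 1),
        ("C", "D", 1), ("D", "E", 1), ("E", "S", 1)]
print(pq_nodes(build_graph(ring), "S", "E")[2])  # set()

# One extra back haul link B-D reshapes the Q space and yields a PQ node.
print(pq_nodes(build_graph(ring + [("B", "D", 1)]), "S", "E")[2])  # {'B'}
```

In this toy ring the P space is {A, B} and the Q space is {C, D}, so they
never meet; the extra link pulls B into the Q space, which is the kind of
reshaping described above.
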
> > >So I guess it's more of a nice-to-have. Now, is it enough to justify
> > >exposing the business to additional risk?
> > >Yes, you'd test the feature to death to make sure it works under any
> > >circumstances (it's the very heart of the network after all; if that
> > >breaks, everything breaks). The problem I see is going to the next
> > >release a couple of years later: since SR is a new thing, it will have
> > >had a ton of new stuff added to it by then, resulting in a higher
> > >potential for regression bugs compared to LDP or RSVP, which have been
> > >around forever and where every new release is basically just bug fixes.
> >
> > Good point, I think it's worth breaking that down into two separate
> > points/concerns:
> >
> > Initial deployment bugs:
> > We've done stuff like pay for a CPoC with Cisco, then deployed, then had
> > it all blow up, then paid Cisco AS to assess the situation only to be
> > told it's not a good design :D So we just assume a default/safe view now
> > that no amount of testing will protect us. We ensure we have backout
> > plans if something immediately blows up, heightened reporting for
> > issues that take 72 hours to show up, change freezes to cover issues
> > that take a week to show up, etc. So I think as far as an initial SR
> > deployment goes, all we can do is our best with regards to being
> > cautious, just as we would with any major core change. So I don't see
> > the initial deployment as any more risky than other core projects we've
> > undertaken, like changing vendors, entire chassis replacements, code
> > upgrades between major versions etc.
> >
> > Regression bugs:
> > My opinion is that in the case of something like SR, which is being
> > deployed based on early drafts, regression bugs are potentially a bigger
> > issue than the initial deployment. I hadn't considered this. Again
> > though, I think it's something we can reasonably prepare for. Depending
> > on the potential impact to the business, you could go as far as standing
> > up a new chassis next to an existing one, but on the newer code version,
> > run them in parallel, migrate services over slowly, and keep the old one
> > up for a while before you take it down. You could do something as simple
> > as physically replacing the routing engine and keeping the old one on
> > site for a bit so you can quickly swap back. Or just drain the links in
> > the IGP, downgrade the code, and then un-drain the links, if you've got
> > some single-homed services on there. If you have OOB access and plan all
> > the rollback config in advance, we can operationally support the risks,
> > no differently to any other major core change.
> >
> > Probably the hardest part is assessing what the risk actually is: how to
> > know what level of additional support, monitoring and people you will
> > need. If you under-resource a rollback of a major failure, and fuck up
> > the rollback too, you might need some new pants :)
> >
> Well yes, I suppose one could look at it like any other major project,
> such as an upgrade to a new SW release, a migration from LDP to RSVP-TE,
> or adding a second plane (or all 3 together).
> And apart from the tedious and rigorous testing (god, there's got to be a
> better way of doing SW validation testing), you made me think about scoping
> the fallback and contingency options in case things don't work out.
> These huge projects are always carried out in a number of stages, each
> broken down into several individual steps. All this is to ease the
> deployment, but also to scope the fallout in case things go south.
> For example, in migrations from LDP to RSVP you go intra-POP first, then
> inter-POP between a pair of POPs, and so on in small incremental steps. All
> this time the fallback option is good old LDP, maybe even well after the
> project is done, until the operational confidence is high enough or until
> the next code upgrade. And I think a similar approach can be used to
> de-risk an SR rollout.
>
>
> adam
>
> netconsultings.com
> ::carrier-class solutions for the telecommunications industry::