[j-nsp] Segment Routing Real World Deployment (was: VPC mc-lag)

Sun Jul 8 16:57:52 EDT 2018

> From: James Bensley [mailto:jwbensley at gmail.com]
> Sent: Friday, July 06, 2018 2:04 PM
> 
> 
> 
> On 5 July 2018 09:56:40 BST, adamv0025 at netconsultings.com wrote:
> >> Of James Bensley
> >> Sent: Thursday, July 05, 2018 9:15 AM
> >>
> >> - 100% rLFA coverage: TI-LFA covers the "black spots" we currently have.
> >>
> >Yeah that's an interesting use case you mentioned, that I haven't
> >considered, that is no TE need but FRR need.
> >But I guess if it was business critical to get those blind spots
> >FRR-protected then you would have done something about it already
> >right?
> 
> Hi Adam,
> 
> Yeah correct, no mission critical services are affected by this for us, so the
> business obviously hasn't allocated resource to do anything about it. If it was
> a major issue, it should be as simple as adding an extra back haul link to a
> node or shifting existing ones around (to reshape the P space and Q space to
> "please" the FRR algorithm).
> 
> >So I guess it's more like it would be nice to have,  now is it enough
> >to expose the business to additional risk?
> >Like for instance yes you'd test the feature to death to make sure it
> >works under any circumstances (it's the very heart of the network after
> >all if that breaks everything breaks), but the problem I see is then
> >going to a next release couple of years later -since SR is a new thing
> >it would have a ton of new stuff added to it by then resulting in
> >higher potential for regression bugs with comparison to LDP or RSVP
> >which have been around since ever and every new release to these two is
> >basically just bug fixes.
> 
> Good point, I think it's worth breaking that down into two separate
> points/concerns:
> 
> Initial deployment bugs:
> We've done stuff like pay for a CPoC with Cisco, then deployed, then had it
> all blow up, then paid Cisco AS to assess the situation only to be told it's not a
> good design :D So we just assume a default/safe view now that no amount
> of testing will protect us. We ensure we have backout plans if something
> immediately blows up, and heightened reporting for issues that take 72
> hours to show up, and change freezes to cover issues that take a week to
> show up etc. etc. So I think as far as an initial SR deployment goes, all we can
> do is our best with regards to being cautious, just as we would with any
> major core changes. So I don't see the initial deployment as any more risky
> than other core projects we've undertaken like changing vendors, entire
> chassis replacements, code upgrades between major versions etc.
> 
> Regression bugs:
> My opinion is that in the case of something like SR which is being deployed
> based on early drafts, regression bugs are potentially a bigger issue than an
> initial deployment. I hadn't considered this. Again though I think it's
> something we can reasonably prepare for. Depending on the potential
> impact to the business you could go as far as standing up a new chassis next
> to an existing one, but on the newer code version, run them in parallel,
> migrating services over slowly, keep the old one up for a while before you
> take it down. You could just do something as simple as physically replacing
> the routing engine, keeping the old one on site for a bit so you can quickly swap
> back. Or just drain the links in the IGP, downgrade the code, and then
> un-drain the links, if you've got some single homed services on there. If you
> have OOB access and plan all the rollback config in advance, you can
> operationally support the risks, no differently to any other major core
> change.
> 
> Probably the hardest part is assessing what the risk actually is? How to know
> what level of additional support, monitoring, people, you will need. If you
> under resource a rollback of a major failure, and fuck the rollback too, you
> might need some new pants :)
> 
Well yes, I suppose one could actually look at it like any other major project, such as an upgrade to a new SW release, a migration from LDP to RSVP-TE, adding a second plane, or all three together.
And apart from the tedious and rigorous testing (god, there's got to be a better way of doing SW validation testing), you made me think about scoping the fallback and contingency options in case things don't work out.
These huge projects are always carried out in a number of stages, each broken down into several individual steps. All this is to ease the deployment, but also to scope the fallout in case things go south.
Like in migrations from LDP to RSVP: you go intra-POP first, then inter-POP between a pair of POPs, and so on, using small incremental steps. All this time the fallback option is the good old LDP, maybe even well after the project is done, until the operational confidence is high enough or until the next code upgrade. And I think a similar approach can be used to de-risk an SR rollout.
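
To make the staged idea concrete, here is a rough Python sketch of a rollout where each stage has a limited scope and LDP stays in place as the fallback. Everything in it is hypothetical and for illustration only: the `Stage` type, the stage names and the `healthy` callbacks are assumptions, not any vendor's tooling, and a real health check would be whatever monitoring you trust after a change.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Stage:
    name: str                    # scope of the change, e.g. "intra-pop A"
    healthy: Callable[[], bool]  # hypothetical post-change health check


def rollout(stages: List[Stage]) -> Dict[str, str]:
    """Move stages to SR one at a time; on the first unhealthy stage,
    revert only that stage to LDP and stop the rollout."""
    state = {s.name: "ldp" for s in stages}  # everything starts on LDP
    for stage in stages:
        state[stage.name] = "sr"             # prefer SR within this scope
        if not stage.healthy():
            state[stage.name] = "ldp"        # scoped fallback to LDP
            break                            # later stages are never touched
    return state


# Illustrative run: the second stage fails its check.
stages = [
    Stage("intra-pop A", lambda: True),
    Stage("inter-pop A-B", lambda: False),
    Stage("inter-pop B-C", lambda: True),
]
result = rollout(stages)
print(result)
```

The point of the sketch is only that a failure in one stage leaves the earlier stages on SR, the failed stage back on LDP, and the remaining stages untouched, so the fallout is bounded by the size of the stage.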


adam   

netconsultings.com
::carrier-class solutions for the telecommunications industry::