[j-nsp] Segment Routing Real World Deployment (was: VPC mc-lag)

Fri Jul 6 09:04:04 EDT 2018

On 5 July 2018 09:56:40 BST, adamv0025 at netconsultings.com wrote:
>> Of James Bensley
>> Sent: Thursday, July 05, 2018 9:15 AM
>> 
>> - 100% rLFA coverage: TI-LFA covers the "black spots" we currently have.
>> 
>Yeah that's an interesting use case you mentioned, that I haven't
>considered, that is no TE need but FRR need.
>But I guess if it was business critical to get those blind spots
>FRR-protected then you would have done something about it already
>right?

Hi Adam,

Yeah correct, no mission critical services are affected by this for us, so the business obviously hasn't allocated resource to do anything about it. If it were a major issue, it should be as simple as adding an extra backhaul link to a node or shifting existing ones around (to reshape the P space and Q space to "please" the FRR algorithm).
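
To make the P space / Q space point concrete, here is a rough Python sketch of the rLFA PQ-node check (in the spirit of RFC 7490, link protection only, using the extended P space). The topology, node names and metrics are made up for illustration and are not our network; the sketch also assumes symmetric metrics and a connected graph, and it is not how any vendor actually implements it.

```python
import heapq


def build_graph(links):
    """Build an undirected weighted graph from (a, b, metric) tuples."""
    graph = {}
    for a, b, metric in links:
        graph.setdefault(a, {})[b] = metric
        graph.setdefault(b, {})[a] = metric
    return graph


def dijkstra(graph, src):
    """Return shortest-path cost from src to every reachable node."""
    dist = {src: 0}
    heap = [(0, src)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist[node]:
            continue  # stale heap entry
        for nbr, metric in graph[node].items():
            nd = d + metric
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def pq_nodes(graph, s, e):
    """Candidate rLFA repair nodes for S protecting the link S-E."""
    dist = {n: dijkstra(graph, n) for n in graph}
    cost = graph[s][e]
    others = set(graph) - {s, e}

    # Extended P space: nodes that S, or any neighbour of S other than E,
    # reaches without any shortest path crossing the protected link S-E.
    p_space = set()
    for n in [s] + [n for n in graph[s] if n != e]:
        for y in others:
            if dist[n][y] < dist[n][s] + cost + dist[e][y]:
                p_space.add(y)

    # Q space: nodes whose shortest paths to E never cross S-E.
    # (Symmetric metrics assumed, so forward distances are reused.)
    q_space = {y for y in others if dist[y][e] < dist[y][s] + cost}

    return p_space & q_space


# Hypothetical five-node ring with one expensive span (C-B).
RING = [("S", "E", 1), ("E", "C", 1), ("C", "B", 10),
        ("B", "A", 1), ("A", "S", 1)]

print(pq_nodes(build_graph(RING), "S", "E"))
print(pq_nodes(build_graph(RING + [("A", "C", 1)]), "S", "E"))
```

On the plain ring the P and Q spaces don't overlap, so S has no rLFA repair node for the S-E link (a black spot). Adding the hypothetical A-C link pulls C into the extended P space, and since C is already in the Q space it becomes a usable PQ node.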

>So I guess it's more like it would be nice to have, now is it enough to
>expose the business to additional risk?
>Like for instance yes you'd test the feature to death to make sure it works
>under any circumstances (it's the very heart of the network after all if
>that breaks everything breaks), but the problem I see is then going to a
>next release couple of years later -since SR is a new thing it would have a
>ton of new stuff added to it by then resulting in higher potential for
>regression bugs with comparison to LDP or RSVP which have been around since
>ever and every new release to these two is basically just bug fixes.

Good point, I think it's worth breaking that down into two separate points/concerns:

Initial deployment bugs:
We've done stuff like pay for a CPoC with Cisco, then deployed, then had it all blow up, then paid Cisco AS to assess the situation only to be told it's not a good design :D So we just assume a default/safe view now that no amount of testing will protect us. We ensure we have backout plans if something immediately blows up, and heightened reporting for issues that take 72 hours to show up, and change freezes to cover issues that take a week to show up etc. etc. So I think as far as an initial SR deployment goes, all we can do is our best with regard to being cautious, just as we would with any major core change. So I don't see the initial deployment as any more risky than other core projects we've undertaken like changing vendors, entire chassis replacements, code upgrades between major versions etc.

Regression bugs:
My opinion is that in the case of something like SR, which is being deployed based on early drafts, regression bugs are potentially a bigger issue than the initial deployment. I hadn't considered this. Again though, I think it's something we can reasonably prepare for. Depending on the potential impact to the business, you could go as far as standing up a new chassis next to an existing one, but on the newer code version, run them in parallel, migrate services over slowly, and keep the old one up for a while before you take it down. You could do something as simple as physically replacing the routing engine and keeping the old one on site for a bit so you can quickly swap back. Or, if you've got some single homed services on there, just drain the links in the IGP, downgrade the code, and then un-drain the links. If you have OOB access and plan all the rollback config in advance, you can operationally support the risks, no differently to any other major core change.

Probably the hardest part is assessing what the risk actually is, and knowing what level of additional support, monitoring and people you will need. If you under resource the rollback of a major failure, and fuck up the rollback too, you might need some new pants :)

Cheers,
James.