[Outages-discussion] [outages] FRIDAY OUTAGE

Matthew Petach matt at petach.org
Sat Jul 18 16:18:56 EDT 2020


The outage blog highlights something I've put into nearly religious
practice ever since I learned it from Sean Doran so many years ago.

When doing traffic engineering, *always* use values that are
*below* (less preferred than) the "default" setting for that knob,
so that you are "pushing" traffic away from nodes, rather than
"pulling" traffic towards them; that way, in case a match clause is
mistakenly deactivated like this, at worst, you fall back to the
default value, and traffic spreads across all nodes at the default
level, rather than one node becoming a violent attractor like this.
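
To make the push-versus-pull point concrete, here is a minimal sketch in
Python. It is a toy model, not anyone's actual router policy: it assumes a
default local-preference of 100, a single prefix announced from three
hypothetical sites (site-a, site-b, site-c), and that a policy term with a
deactivated match clause ends up applying its value to that prefix at site-a.
The function names are made up for illustration.

```python
# Toy model: highest local-preference wins; equal values share the traffic.
DEFAULT_LOCAL_PREF = 100  # assumed default, as on many common implementations

def local_prefs(sites, policy_value, matched):
    """Give policy_value to matched sites; all others keep the default."""
    return {
        site: policy_value if site in matched else DEFAULT_LOCAL_PREF
        for site in sites
    }

def winning_sites(prefs):
    """Return the sites holding the highest local-preference."""
    best = max(prefs.values())
    return sorted(site for site, lp in prefs.items() if lp == best)

SITES = ["site-a", "site-b", "site-c"]  # hypothetical site names

# Policy working as intended: this prefix is not matched anywhere.
healthy = local_prefs(SITES, 200, matched=set())

# Match clause lost at site-a, "pull" style (value above the default).
pull_broken = local_prefs(SITES, 200, matched={"site-a"})

# Same mistake, "push" style (value below the default).
push_broken = local_prefs(SITES, 50, matched={"site-a"})

print(winning_sites(healthy))      # all three sites share the traffic
print(winning_sites(pull_broken))  # site-a alone attracts everything
print(winning_sites(push_broken))  # site-a sheds traffic; the rest share it
```

In this model the same mistake with a below-default value only moves
traffic off the broken site, while the above-default value turns it into
the single attractor for every other site's traffic.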

I suspect that's exactly what this bullet point in the
post-mortem is about:


   - Change the BGP local-preference for local server routes. This change
   will prevent a single location from attracting other locations’ traffic in
   a similar manner. This change has been deployed following the incident.



Kudos to the Cloudflare team for recognizing the
danger in "strong attractors" and moving away from
them.  I'd counsel anyone else doing traffic engineering
with similarly strong knobs to consider doing likewise;
shift your policies so that your traffic engineering consists
of de-preferencing sites and pathways *below* the default
value, rather than raising them *above* the default value.

Good work on being so open and transparent,
Cloudflare team--we should all have management
that is so honest.  :)

Matt

(not crossposting to NANOG, though I am debating with
myself if it might not be worth sharing my "lesson learned"
with that list as well, for any networks that might have a
similar ticking time bomb waiting to go off...)






On Fri, Jul 17, 2020 at 10:24 PM DaZZa <dazzagibbs at gmail.com> wrote:

> RFO here, if anyone is interested
>
> https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/
>
> D
>
> On Sat, 18 Jul 2020, 9:20 am Pete Templin, <petelists at templin.org> wrote:
>
>> It caused some potions? That's some magical stuff!
>>
>> On 7/17/20 3:11 PM, Daniel Marks via Outages wrote:
>> >  From Cloudflare:
>> >
>> > "This afternoon we saw an outage across some parts of our network. It
>> was not as a result of an attack. It appears a router on our global
>> backbone announced bad routes and caused some potions of the network to not
>> be available. We believe we have addressed the root cause and monitoring
>> systems for stability now."
>> >
>> > Daniel Marks
>> > Systems Administrator
>> > May Mobility
>
>
>