[Outages-discussion] [outages] Google network issues

Jared Mauch jared at puck.nether.net
Tue Jun 4 21:17:03 EDT 2019



> On Jun 4, 2019, at 8:51 PM, Matt Hoppes <mattlists at rivervalleyinternet.net> wrote:
> 
> Who makes critical changes in the middle of a day on Sunday to core infrastructure?
> 

That’s a very IT/enterprise question.

If you run a global network, and can’t do things during business hours, when can you do work?

Well, M-F.. but oh wait, some places it’s S-Th, then there’s the international date line, so uh..

Or you build systems that can be automated, have detection, fail over and redundancy in place.

It sounds like the efforts they undertook with search didn’t translate to other teams.  This is often the case when something is different.  Remember, YouTube was an acquisition and likely still has some legacy, versus search, which was the main product.

If you read up on what places do, they’re often making changes every few minutes or even every few seconds if operating at a large enough scale.  I remember telling a router vendor that “we are testing your software every 3 minutes and notice each time it fails in X way as we make network changes”.

The changes were on average every 3 minutes.. Sometimes they were faster on a device, sometimes a bit longer, but some device was changed on average every 3 minutes.

When you move from that being your rate of change to trying to do a scheduled maintenance off-hours, it may be infeasible.  It’s always someone’s time to call 911 or other emergency services like 0118 999 881 999 119 725...3 etc.

Sometimes it’s just always bad for someone, so maybe it’s just easier to do it whenever.  This time it just went south and took longer to recover.  I’m sure people will be studying it for weeks/months/years internally.  In the postmortems I’ve been involved in, we’ve reflected on the incident for a long time to avoid doing it again.

- Jared


