[Outages-discussion] More on the google outage Yesterday

virendra rode virendra.rode at outages.org
Tue Dec 11 15:14:31 EST 2012

On 12/11/2012 11:05 AM, Jeremy Chadwick wrote:
> Respectfully, that's just as speculative as my own statement; both beg
> the question.
> There's absolutely nothing in any of the ref. material that indicates the
> LB change was performed at an earlier time (e.g. the night before):
> http://www.google.com/appsstatus#hl=en&v=issue&ts=1355183999000&iid=4abb2f6c40f6bd39677195b9a60ad77d
> http://code.google.com/p/chromium/issues/detail?id=165171#c27
> Thus I am left to conclude that the LB config change was in fact done at
> 0854 PST on a Monday morning, because there is no evidence at this time
> proving otherwise.  In fact, the 2nd recurrence (0904 to 0916) almost
> implies someone made a mistake, tried to correct it, and failed a 2nd
> time.
> Frank's comment about timezone/area of the world is valid.  The way this
> is solved in "high-volume environments" is to actually have scheduled
> and (keyword!) announced maintenance windows.
I understand I'm assuming here, and I'm not pointing fingers; I've had
my share of lessons learned and I'm still learning. Having said that,
that's exactly what went through my head: how did a customer-facing CM
of this scale get approved during production hours?

Unfortunate coincidences happen, I get it; we have all been introduced
to Murphy. Could I have done this better? Maybe not, but I sure do
appreciate the team owning it and being transparent.

As the saying goes in any 5 Whys exercise: "people do not fail,
processes do".

Looking forward to the postmortem.

> Changes to high-volume production infrastructures *absolutely* happen on
> a whim.  There are many ride-em-cowboy operate-by-seats-of-their-pants
> engineers who exist at every company, regardless of their size of
> presence online.  This is why scheduled + announced maintenance windows
> are a win-win for both the engineers making the changes as well as the
> folks (users) who could be potentially impacted by mistakes.
