[Outages-discussion] More on the google outage Yesterday

Warren Kumari warren at kumari.net
Tue Dec 11 16:10:48 EST 2012


On Dec 11, 2012, at 3:14 PM, virendra rode <virendra.rode at outages.org> wrote:

> On 12/11/2012 11:05 AM, Jeremy Chadwick wrote:
>> Respectfully, that's just as speculative as my own statement; both beg
>> the question.
>> 
>> There's absolutely nothing in any of the ref. material that indicates the
>> LB change was performed at an earlier time (e.g. the night before):
>> 
>> http://www.google.com/appsstatus#hl=en&v=issue&ts=1355183999000&iid=4abb2f6c40f6bd39677195b9a60ad77d
>> http://code.google.com/p/chromium/issues/detail?id=165171#c27
>> 
>> Thus I am left to conclude that the LB config change was in fact done at
>> 0854 PST on a Monday morning, because there is no evidence at this time
>> proving otherwise.  In fact, the 2nd recurrence (0904 to 0916) almost
>> implies someone made a mistake, tried to correct it, and failed a 2nd
>> time.
>> 
>> Frank's comment about timezone/area of the world is valid.  The way this
>> is solved in "high-volume environments" is to actually have scheduled
>> and (keyword!) announced maintenance windows.
> ----------------------
> I understand I'm assuming here, and I'm not pointing fingers, as I've had my share of lessons learned and I'm still learning. Having said that, that's exactly what went through my head: how did a customer-facing CM of this scale get approved during production hours?
> 

You are assuming that this was a CM that was expected to be a large-scale change, and not simply a small-scale change that had unexpected side-effects.
You are also assuming that there are such things as "production hours" for a large global property -- unfortunately, Jimmy Buffett is always right with "It's Five O'Clock Somewhere".

Past a certain scale in a large, dynamic environment it becomes infeasible to only perform changes during scheduled times -- you end up with so many changes being applied at the same time that a: you risk causing instability, b: changes that rely on other changes end up being badly delayed, and c: troubleshooting issues becomes almost impossible. If there is any sort of issue, which of the N hundred or thousand changes caused it?
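
To make that last point concrete -- this is purely a toy sketch, with an entirely made-up change-log format, nothing resembling any real tooling:

    from datetime import datetime, timedelta

    # Entirely hypothetical change log: (timestamp, team, description).
    # In real life this would come out of whatever change-management
    # system records who changed what, when.
    changes = [
        (datetime(2012, 12, 10, 8, 52), "lb-team",  "adjust LB weights"),
        (datetime(2012, 12, 10, 8, 53), "dns-team", "rotate resolvers"),
        # ...imagine hundreds or thousands more entries per hour...
    ]

    def suspects(incident_start, lookback=timedelta(hours=2)):
        """Every change that landed shortly before the incident
        is a candidate culprit."""
        return [c for c in changes
                if incident_start - lookback <= c[0] <= incident_start]

    # With a handful of changes this list is useful; with N hundred
    # concurrent changes it is too long to reason about.
    for ts, team, desc in suspects(datetime(2012, 12, 10, 8, 54)):
        print(ts, team, desc)

Even with perfect records, every entry in that list is a suspect you have to rule out, and the cost of that grows with the number of concurrent changes.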

W


> Unfortunate coincidences, I get it; we have all been introduced to Murphy. Could I have done this better? Maybe not, but I sure do appreciate the team owning it and being transparent.
> 
> As the saying goes in any 5-whys exercise, "people do not fail, processes do".
> 
> Looking forward to the postmortem.
> 
> regards,
> /virendra
>> 
>> Changes to high-volume production infrastructures *absolutely* happen on
>> a whim.  There are many ride-em-cowboy, operate-by-the-seat-of-their-pants
>> engineers at every company, regardless of the size of their online
>> presence.  This is why scheduled + announced maintenance windows
>> are a win-win for both the engineers making the changes and the
>> folks (users) who could potentially be impacted by mistakes.
>> 
> 
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion
> 

--
The duke had a mind that ticked like a clock and, like a clock, it regularly went cuckoo.

    -- (Terry Pratchett, Wyrd Sisters)
