[Outages-discussion] More on the google outage Yesterday

Lori Barfield itdirector at gmail.com
Tue Dec 11 14:22:11 EST 2012


also respectfully, i personally won't be concluding anything about
what time a change was made, and whether or not it occurred in
violation of best practices, until we have the facts.  (the google
folks on this list would probably appreciate that.)  in a high
availability environment, unplanned customer impacting events usually
require complicated circumstances, and Murphy obliged yesterday.  we
do have the statement that their remedial procedure was a problem, and
that is interesting enough for now.

...lori

On Tue, Dec 11, 2012 at 11:05 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> Respectfully, that's just as speculative as my own statement; both beg
> the question.
>
> There's absolutely nothing in any of the ref. material that indicates the
> LB change was performed at an earlier time (e.g. the night before):
>
> http://www.google.com/appsstatus#hl=en&v=issue&ts=1355183999000&iid=4abb2f6c40f6bd39677195b9a60ad77d
> http://code.google.com/p/chromium/issues/detail?id=165171#c27
>
> Thus I am left to conclude that the LB config change was in fact done at
> 0854 PST on a Monday morning, because there is no evidence at this time
> proving otherwise.  In fact, the 2nd recurrence (0904 to 0916) almost
> implies someone made a mistake, tried to correct it, and failed a 2nd
> time.
>
> Frank's comment about timezone/area of the world is valid.  The way this
> is solved in "high-volume environments" is to actually have scheduled
> and (keyword!) announced maintenance windows.
>
> Changes to high-volume production infrastructures *absolutely* happen on
> a whim.  There are many ride-'em-cowboy, seat-of-their-pants engineers
> at every company, regardless of the size of its online presence.  This
> is why scheduled + announced maintenance windows
> are a win-win for both the engineers making the changes as well as the
> folks (users) who could be potentially impacted by mistakes.
>
> --
> | Jeremy Chadwick                                   jdc at koitsu.org |
> | UNIX Systems Administrator                http://jdc.koitsu.org/ |
> | Mountain View, CA, US                                            |
> | Making life hard for others since 1977.             PGP 4BD6C0CB |
>
> On Tue, Dec 11, 2012 at 10:49:44AM -0800, Lori Barfield wrote:
>> what we know based on the article is that the outage was during prime
>> time PST, not necessarily the originating LB config change.  in a
>> high-volume environment, delayed impact is a familiar phenomenon.
>>
>> ...lori
>>
>>
>> On Tue, Dec 11, 2012 at 10:38 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
>> > What better time to make LB config changes than at 0854 PST on a Monday?
>> >
>>
>> > On Tue, Dec 11, 2012 at 12:33:49PM -0500, kondrak wrote:
>> >> http://arstechnica.com/information-technology/2012/12/why-gmail-went-down-google-misconfigured-chromes-sync-server/
>> >>
>> >>   Why Gmail went down: Google misconfigured Chrome's sync server
>> >>
>> >>
>> >>     Load balancing change in Chrome servers affected multiple Google
>> >>     products

