[Outages-discussion] More on the google outage Yesterday

Jeremy Chadwick jdc at koitsu.org
Tue Dec 11 14:05:49 EST 2012


Respectfully, that's just as speculative as my own statement; both beg
the question.

There's absolutely nothing in any of the reference material that
indicates the LB change was performed at an earlier time (e.g. the night
before):

http://www.google.com/appsstatus#hl=en&v=issue&ts=1355183999000&iid=4abb2f6c40f6bd39677195b9a60ad77d
http://code.google.com/p/chromium/issues/detail?id=165171#c27

Thus I am left to conclude that the LB config change was in fact done at
0854 PST on a Monday morning, because there is no evidence at this time
proving otherwise.  In fact, the 2nd recurrence (0904 to 0916) almost
implies someone made a mistake, tried to correct it, and failed a 2nd
time.

Frank's comment about timezone/area of the world is valid.  The way this
is solved in "high-volume environments" is to actually have scheduled
and (keyword!) announced maintenance windows.

Changes to high-volume production infrastructures *absolutely* happen on
a whim.  Every company, regardless of the size of its online presence,
has ride-'em-cowboy, seat-of-the-pants engineers.  This is why scheduled
+ announced maintenance windows are a win-win for both the engineers
making the changes and the users who could potentially be impacted by
mistakes.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

On Tue, Dec 11, 2012 at 10:49:44AM -0800, Lori Barfield wrote:
> what we know based on the article is that the outage was during prime
> time PDT, not necessarily the originating LB config change.  in a
> high-volume environment, delayed impact is a familiar phenomenon.
> 
> ...lori
> 
> 
> On Tue, Dec 11, 2012 at 10:38 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> > What better time to make LB config changes than at 0854 PST on a Monday?
> >
> 
> > On Tue, Dec 11, 2012 at 12:33:49PM -0500, kondrak wrote:
> >> http://arstechnica.com/information-technology/2012/12/why-gmail-went-down-google-misconfigured-chromes-sync-server/
> >>
> >>   Why Gmail went down: Google misconfigured Chrome's sync server
> >>
> >>
> >>     Load balancing change in Chrome servers affected multiple Google
> >>     products

