[Outages-discussion] More on the google outage Yesterday
Jeremy Chadwick
jdc at koitsu.org
Tue Dec 11 14:05:49 EST 2012
Respectfully, that's just as speculative as my own statement; both beg
the question.
There's absolutely nothing in any of the ref. material that indicates the
LB change was performed at an earlier time (e.g. the night before):
http://www.google.com/appsstatus#hl=en&v=issue&ts=1355183999000&iid=4abb2f6c40f6bd39677195b9a60ad77d
http://code.google.com/p/chromium/issues/detail?id=165171#c27
Thus I am left to conclude that the LB config change was in fact done at
0854 PST on a Monday morning, because there is no evidence at this time
proving otherwise. In fact, the 2nd recurrence (0904 to 0916) almost
implies someone made a mistake, tried to correct it, and failed a 2nd
time.
Frank's comment about timezone/area of the world is valid. The way this
is solved in "high-volume environments" is to actually have scheduled
and (keyword!) announced maintenance windows.
Changes to high-volume production infrastructures *absolutely* happen on
a whim. There are ride-em-cowboy, seat-of-their-pants engineers at every
company, regardless of the size of its online presence. This is why
scheduled + announced maintenance windows are a win-win for both the
engineers making the changes and the folks (users) who could potentially
be impacted by mistakes.
--
| Jeremy Chadwick jdc at koitsu.org |
| UNIX Systems Administrator http://jdc.koitsu.org/ |
| Mountain View, CA, US |
| Making life hard for others since 1977. PGP 4BD6C0CB |
On Tue, Dec 11, 2012 at 10:49:44AM -0800, Lori Barfield wrote:
> what we know based on the article is that the outage was during prime
> time PDT, not necessarily the originating LB config change. in a
> high-volume environment, delayed impact is a familiar phenomenon.
>
> ...lori
>
>
> On Tue, Dec 11, 2012 at 10:38 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> > What better time to make LB config changes than at 0854 PST on a Monday?
> >
> >
> > On Tue, Dec 11, 2012 at 12:33:49PM -0500, kondrak wrote:
> >> http://arstechnica.com/information-technology/2012/12/why-gmail-went-down-google-misconfigured-chromes-sync-server/
> >>
> >> Why Gmail went down: Google misconfigured Chrome's sync server
> >>
> >>
> >> Load balancing change in Chrome servers affected multiple Google
> >> products
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion