[Outages-discussion] More on the google outage Yesterday

Frank Bulk frnkblk at iname.com
Tue Dec 11 14:55:44 EST 2012


Remember that Google Apps has a "no maintenance" approach
(http://www.theregister.co.uk/2011/01/14/google_apps_sla_change/), which may
carry over to other services they provide.  So when they do have scheduled
maintenance windows, those are increasingly likely to be for internal
operations that should not affect public-facing services.  What would once
have been considered maintenance is then day-to-day operational work.

Frank

-----Original Message-----
From: outages-discussion-bounces at outages.org
[mailto:outages-discussion-bounces at outages.org] On Behalf Of Jeremy Chadwick
Sent: Tuesday, December 11, 2012 1:06 PM
To: Lori Barfield
Cc: outages-discussion at outages.org
Subject: Re: [Outages-discussion] More on the google outage Yesterday

Respectfully, that's just as speculative as my own statement; both beg
the question.

There's absolutely nothing in any of the reference material that
indicates the LB change was performed at an earlier time (e.g. the night
before):

http://www.google.com/appsstatus#hl=en&v=issue&ts=1355183999000&iid=4abb2f6c40f6bd39677195b9a60ad77d
http://code.google.com/p/chromium/issues/detail?id=165171#c27

Thus I am left to conclude that the LB config change was in fact done at
0854 PST on a Monday morning, because there is no evidence at this time
proving otherwise.  In fact, the 2nd recurrence (0904 to 0916) almost
implies someone made a mistake, tried to correct it, and failed a 2nd
time.

Frank's comment about timezone/area of the world is valid.  The way this
is solved in "high-volume environments" is to actually have scheduled
and (keyword!) announced maintenance windows.

Changes to high-volume production infrastructures *absolutely* happen on
a whim.  There are many ride-'em-cowboy, seat-of-their-pants engineers
at every company, regardless of the size of its online presence.  This
is why scheduled + announced maintenance windows are a win-win for both
the engineers making the changes and the folks (users) who could
potentially be impacted by mistakes.

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

On Tue, Dec 11, 2012 at 10:49:44AM -0800, Lori Barfield wrote:
> what we know based on the article is that the outage was during prime
> time PST, not necessarily the originating LB config change.  in a
> high-volume environment, delayed impact is a familiar phenomenon.
> 
> ...lori
> 
> 
> On Tue, Dec 11, 2012 at 10:38 AM, Jeremy Chadwick <jdc at koitsu.org> wrote:
> > What better time to make LB config changes than at 0854 PST on a Monday?
> >
> 
> > On Tue, Dec 11, 2012 at 12:33:49PM -0500, kondrak wrote:
> >>
> >> http://arstechnica.com/information-technology/2012/12/why-gmail-went-down-google-misconfigured-chromes-sync-server/
> >>
> >>   Why Gmail went down: Google misconfigured Chrome's sync server
> >>
> >>
> >>     Load balancing change in Chrome servers affected multiple Google
> >>     products
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion



