[outages] Google outage after-action

Jay Ashworth jra at baylink.com
Sat Jan 25 14:01:56 EST 2014


Here's the formal after-action, 'pon the Google Blog, though not in any
great detail.  In short: whatever they build configurations with, for
pushing to servers via cfengine or puppet or whatever they use, had a bug;
it built a broken config, nobody caught it, and it got pushed.

  http://googleblog.blogspot.com/2014/01/todays-outage-for-several-google.html

No details on whether there was contributory control-plane congestion,
as often causes, or extends, AWS outages.

Ironically, Google's Site Reliability Engineering team *was doing an AMA
when the outage hit*.  No comment from the 'plex, either, on whether those
are connected, and the AMAers declined to comment.

  http://www.reddit.com/r/IAmA/comments/1w1y5m/we_are_the_google_site_reliability_engineering/

Highlights:

Q: Pager?!?! Why don't you use text messages for that?

A: "Pager" is a synonym for "A beepy thing that goes 'beep'."

and

I can assure you that Google does not use nagios.
Source: I used to be a Google SRE.

and

Q: Sooo....what's it like there when a Google service goes down? How much freaking out is done?

A: Very little freaking out actually, we have a well-oiled process for this that all services use - we use thoroughly documented incident management procedures, so people understand their role explicitly and can act very quickly. We also exercise this process regularly as part of our DiRT testing.

Running regular service-specific drills is also a big part of making sure that once something goes wrong, we’re straight on it.

Folo: In a highly parallel environment, it should not be necessary that all pants are shat in unison. The map-reduce pattern allows each dump to be taken in its own time, and then collected once all shits are complete.

Cheers,
-- jra

-- 
Jay R. Ashworth                  Baylink                       jra at baylink.com
Designer                     The Things I Think                       RFC 2100
Ashworth & Associates       http://www.bcp38.info          2000 Land Rover DII
St Petersburg FL USA      BCP38: Ask For It By Name!           +1 727 647 1274



