[Outages-discussion] More on the Google outage yesterday

Jeremy Chadwick jdc at koitsu.org
Tue Dec 11 16:31:51 EST 2012


Warren, you make good points, and I have worked in an environment that
suffered from that exact problem (repeatedly, across all of its
services and infrastructure).

The simple answer to that problem is two words: baby steps.

The problem with scheduled maintenances is that managerial folks embrace
them with the wrong mindset.  You do not schedule a maintenance window
and then cram 8 different things (some possibly intertwined, others
supposedly unrelated/segregated) into the same window.  That situation
results in exactly what you describe.  I can assure you that at one of
my past jobs this was done regularly (2-3 times a week), it had
catastrophic outcomes maybe 50-70% of the time, and in some cases it
did impact customers.  I won't get into a rant about why the managers'
mindsets never changed despite the evidence.
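
To make "baby steps" concrete, here is a toy Python sketch of the
policy I mean (purely hypothetical -- the window names and limit are
invented, this is not tooling anyone actually ran): refuse to pack more
than one change into a single maintenance window, so that when
something breaks you know exactly which change broke it.

    # Toy "baby steps" scheduler -- purely illustrative, all names made up.
    MAX_CHANGES_PER_WINDOW = 1          # the whole point: baby steps

    def schedule(windows, change):
        """Place a change in the first window with room; never batch."""
        for window_name, changes in windows:
            if len(changes) < MAX_CHANGES_PER_WINDOW:
                changes.append(change)
                return window_name
        raise RuntimeError("No open window -- add a new window instead of "
                           "cramming %r into an existing one." % (change,))

    windows = [("Tue 02:00", []), ("Wed 02:00", [])]
    print(schedule(windows, "LB config tweak"))   # -> Tue 02:00
    print(schedule(windows, "kernel upgrade"))    # -> Wed 02:00
    try:
        schedule(windows, "DNS migration")        # no room: demand a new window
    except RuntimeError as e:
        print(e)

The code is trivial on purpose; the invariant is what matters: one
window, one change.  More changes means more windows, not bigger ones.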

The other option, as I see it, is to stop deploying highly intertwined
and overly complex pieces that must interact with one another or that
have very strange nuances/requirements.  I cannot tell you how many
places I've worked at where the KISS principle had obviously *never*
been applied to anything that was engineered or created, all the way
down to the network level.  Sorry for cursing, but: the less shit you
have, the better.

Finally, a third option is to do what we did at my previous job: go to
great lengths to educate the engineers/folks touching the equipment and
deploying the changes so that they understand *every single intricacy*
of how everything interfaces/interacts.  DO NOT even for a moment tell
me this is unreasonable or impossible -- for the 8 years I was there,
this was a key part of my job and my department's.  Yes, it's very
stressful, and yes, after a few years it requires multiple humans
bouncing ideas/questions off one another before deploying a change, but
it can address this.
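
For what it's worth, here is an equally toy sketch of what enforcing
that "multiple humans" rule could look like (hypothetical policy and
names -- not what we actually ran): a change cannot be deployed until
at least two engineers besides its author have signed off.

    # Toy review gate -- hypothetical policy, names invented for the example.
    REQUIRED_REVIEWERS = 2     # humans besides the author

    def may_deploy(author, signoffs):
        """True only if enough people *other than the author* signed off."""
        return len(set(signoffs) - set([author])) >= REQUIRED_REVIEWERS

    print(may_deploy("jdc", ["jdc"]))            # False: self-review is not review
    print(may_deploy("jdc", ["jdc", "alice"]))   # False: only one real reviewer
    print(may_deploy("jdc", ["alice", "bob"]))   # True: two other humans agreed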

Instead, what we have today are segregated groups within companies that
refuse to look outside their little box.  Rose-tinted glasses, blinders
on, whatever you want to call it.  "I am the system administrator who
maintains the web servers, and I do not care about the 90 things that
rely on those web servers, nor do I know how any of the code on those
web servers works" -- this attitude is more common than you think, and
it's very depressing.  The best kinds of engineers/operations folks are
those who have the driving force and desire to understand (truly, if
needed, all the way down to the code) how something interacts and *how
it works*.

Sorry, that was a long-winded, ranty email, but this topic really hits
home with me because it's something I dealt with literally every single
day at my past job.  My point is that these nuances/complexities you
mention can be solved; it's just that most people don't care enough
about doing their jobs to the fullest to actually *understand* how
something works at all levels.  But then again, maybe the Aquarius in
me is showing...  ;-)

-- 
| Jeremy Chadwick                                 jdc at koitsu.org |
| UNIX Systems Administrator                 http://jdc.koitsu.org/ |
| Mountain View, CA, US                                             |
| Making life hard for others since 1977.              PGP 4BD6C0CB |

On Tue, Dec 11, 2012 at 04:10:48PM -0500, Warren Kumari wrote:
> 
> On Dec 11, 2012, at 3:14 PM, virendra rode <virendra.rode at outages.org> wrote:
> 
> > On 12/11/2012 11:05 AM, Jeremy Chadwick wrote:
> >> Respectfully, that's just as speculative as my own statement; both beg
> >> the question.
> >> 
> >> There's absolutely nothing in any of the ref. material that indicates the
> >> LB change was performed at an earlier time (e.g. the night before):
> >> 
> >> http://www.google.com/appsstatus#hl=en&v=issue&ts=1355183999000&iid=4abb2f6c40f6bd39677195b9a60ad77d
> >> http://code.google.com/p/chromium/issues/detail?id=165171#c27
> >> 
> >> Thus I am left to conclude that the LB config change was in fact done at
> >> 0854 PST on a Monday morning, because there is no evidence at this time
> >> proving otherwise.  In fact, the 2nd recurrence (0904 to 0916) almost
> >> implies someone made a mistake, tried to correct it, and failed a 2nd
> >> time.
> >> 
> >> Frank's comment about timezone/area of the world is valid.  The way this
> >> is solved in "high-volume environments" is to actually have scheduled
> >> and (keyword!) announced maintenance windows.
> > ----------------------
> > I understand I'm assuming here, and I'm not pointing fingers, as I've had my share of lessons learned and I'm still learning.  Having said that, that's exactly what went through my head: how did a customer-facing CM of this scale get approved during production hours?
> > 
> 
> You are assuming that this was a CM that was expected to be a large-scale change, and not simply a small-scale change that had unexpected side effects.
> You are also assuming that there is such a thing as "production hours" for a large-scale global property -- unfortunately, Jimmy Buffett is always right with "It's Five O'Clock Somewhere".
> 
> Past a certain scale in a large, dynamic environment, it becomes infeasible to only perform changes during scheduled times -- you end up with so many changes being applied at the same time that a: you risk causing instability, b: changes that rely on other changes end up being badly delayed, and c: troubleshooting issues becomes almost impossible.  If there is any sort of issue, which of the N hundred or thousand changes caused it?
> 
> W
> 
> 
> > Unfortunate coincidences, I get it; we have all been introduced to Murphy.  Could I have done this better?  Maybe not, but I sure do appreciate the team owning it and being transparent.
> > 
> > As the saying goes in any 5-whys exercise: "people do not fail, processes do".
> > 
> > Looking forward to the postmortem.
> > 
> > regards,
> > /virendra
> >> 
> >> Changes to high-volume production infrastructures *absolutely* happen on
> >> a whim.  There are many ride-em-cowboy, operate-by-the-seat-of-their-pants
> >> engineers at every company, regardless of the size of their online
> >> presence.  This is why scheduled + announced maintenance windows
> >> are a win-win both for the engineers making the changes and for the
> >> folks (users) who could potentially be impacted by mistakes.
> >> 
> > 
> > _______________________________________________
> > Outages-discussion mailing list
> > Outages-discussion at outages.org
> > https://puck.nether.net/mailman/listinfo/outages-discussion
> > 
> 
> --
> The duke had a mind that ticked like a clock and, like a clock, it regularly went cuckoo.
> 
>     -- (Terry Pratchett, Wyrd Sisters)
> 
> 
> 
> _______________________________________________
> Outages-discussion mailing list
> Outages-discussion at outages.org
> https://puck.nether.net/mailman/listinfo/outages-discussion

