[Outages-discussion] Avoiding puppet/cfengine as the next SPOF

Sun Jan 26 05:36:43 EST 2014

On Sat, Jan 25, 2014 at 07:48:03PM -0500, Jay Ashworth wrote:
> Sure, you need it on things as big as the Googleplex.  But that doesn't mean
> that you can't use Tom Limoncelli's celebrated "one, few, many, all" 
> staged deployment approach, when setting up pushes.

We did this decades ago as we began to manage ever-larger collections
of systems at Purdue.  Our tools were built around RCS, make, and shell
instead of puppet or chef, but they performed the same set of functions.

And of course, like so many others, we learned that partial/isolated
failure is better than systematic total failure.  We also learned
that (1) understanding and minimizing interdependencies would save our
butts and (2) time invested in checking syntactic and semantic correctness
of configurations would minimize our reliance on (1).
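
A minimal sketch of the "one, few, many, all" idea combined with an
up-front correctness check, written in Python rather than the RCS/make/shell
we actually used.  Every name here (plan_stages, staged_push, the validate,
apply, and healthy callbacks, and the batch sizes) is hypothetical and only
illustrates the shape of the approach; it is not how puppet, chef, or our
old tools work.

```python
from typing import Callable, Iterable, List, Sequence


def plan_stages(hosts: Sequence[str],
                sizes: Iterable[int] = (1, 5, 50)) -> List[List[str]]:
    """Split hosts into 'one, few, many' batches; a final batch takes the rest."""
    stages: List[List[str]] = []
    start = 0
    for size in sizes:
        if start >= len(hosts):
            break
        stages.append(list(hosts[start:start + size]))
        start += size
    if start < len(hosts):
        stages.append(list(hosts[start:]))  # the "all" stage
    return stages


def staged_push(hosts: Sequence[str],
                validate: Callable[[], bool],
                apply: Callable[[str], None],
                healthy: Callable[[str], bool],
                sizes: Iterable[int] = (1, 5, 50)) -> List[str]:
    """Push a change stage by stage; return the hosts that were changed.

    Assumptions: validate() checks the configuration before anything is
    pushed, apply(host) pushes it to one host, and healthy(host) reports
    whether that host still works afterwards.
    """
    if not validate():
        raise ValueError("configuration failed validation; nothing pushed")
    changed: List[str] = []
    for stage in plan_stages(hosts, sizes):
        for host in stage:
            apply(host)
            changed.append(host)
        # Stop before the next, larger stage if this one shows damage,
        # so the failure stays partial instead of total.
        if not all(healthy(h) for h in stage):
            break
    return changed
```

With ten hosts and sizes (1, 3), a host that turns unhealthy in the second
stage stops the push after four hosts have been touched; the other six are
left alone.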

We now have a (1) problem at Internet scale, because far too many
things depend on Google.  (s/Google/Amazon/ or others if you wish.)
This particular incident was an outage caused by an oops, but what
if it had been an outage caused by a successful attack?

---rsk