[Outages-discussion] [outages] Linode Fremont outage

Bill Wichers billw at waveform.net
Mon Jun 1 13:59:31 EDT 2015


It's entirely possible that they are doing weekly testing and that the
generator just by chance failed when needed but had been doing OK with
weekly exercise runs previously. Murphy's law does dictate that things fail
at the worst possible time :-)

It's also standard practice to do the weekly test runs on generators without
load, so it's possible that everything was OK until the unit was stressed
with load and then it failed. It's possible to run tests with a load bank,
but this is rarely done weekly (it's usually a yearly check, or maybe
monthly).

The simple reality is things can fail. The best solution is to spread your
critical applications over multiple facilities. Many people don't want to
pay for the maximum possible redundancy, and since I design datacenters for
a living I can tell you that each additional "9" on the way to "five nines"
costs a LOT more than the nine preceding it.
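
To put rough numbers on the "nines", here is a minimal sketch. It assumes a
365-day year, and for the multi-facility case it assumes the sites fail
independently, which may not hold if they share a utility feed or a region.
The function names are made up for illustration.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: non-leap year


def downtime_minutes_per_year(nines: int) -> float:
    """Downtime budget per year for N nines (3 -> 99.9%, 5 -> 99.999%)."""
    unavailability = 10.0 ** -nines
    return MINUTES_PER_YEAR * unavailability


def combined_availability(site_availability: float, sites: int) -> float:
    """Availability when the application survives as long as one site is up.

    Assumption: site failures are statistically independent.
    """
    return 1.0 - (1.0 - site_availability) ** sites


if __name__ == "__main__":
    for n in range(2, 6):
        print(f"{n} nines: {downtime_minutes_per_year(n):9.2f} minutes/year")
    print(f"two 99.9% sites: {combined_availability(0.999, 2):.6f}")
```

Each extra nine cuts the yearly downtime budget by a factor of ten, so "five
nines" leaves only about five minutes per year. Under the independence
assumption, two sites at three nines each get close to six nines combined,
which is the arithmetic behind spreading applications over multiple
facilities.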

I sometimes tell my customers that they also need to determine whether the
extra reliability is worth the expense. An example: a grade school can
continue to produce its primary "product" (teaching services to kids) with
its network offline. As such, it is not a wise investment for them to put
in a Tier 4 facility with N+1 generators and UPS systems, etc. It is cheaper
for them to put in "reasonable" levels of protection (a UPS and a generator)
and deal with restoration costs in the rare cases where that isn't enough of
a backup system. If I'm designing a full telecom facility with SLAs and "five
nines" requirements, then the cost of downtime is easily calculated (from the
SLA terms), and it's generally a wise investment to put in more redundancy
in the facility systems.
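
The comparison above can be sketched as a simple break-even check. This is
only an illustration: the function names and every dollar and hour figure
below are invented, not taken from any real school or telecom facility.

```python
def expected_annual_downtime_cost(hours_down_per_year: float,
                                  cost_per_hour: float) -> float:
    """Expected yearly cost of outages: hours of downtime times hourly cost."""
    return hours_down_per_year * cost_per_hour


def redundancy_is_worth_it(hours_down_without: float,
                           hours_down_with: float,
                           cost_per_hour: float,
                           annual_redundancy_cost: float) -> bool:
    """True if the downtime cost avoided exceeds the yearly cost of the extra gear."""
    avoided = (expected_annual_downtime_cost(hours_down_without, cost_per_hour)
               - expected_annual_downtime_cost(hours_down_with, cost_per_hour))
    return avoided > annual_redundancy_cost


if __name__ == "__main__":
    # Hypothetical numbers: same outage hours and same redundancy cost,
    # only the hourly cost of downtime differs.
    school = redundancy_is_worth_it(8.0, 0.5, 200.0, 50_000.0)
    telecom = redundancy_is_worth_it(8.0, 0.5, 100_000.0, 50_000.0)
    print("school:", school)
    print("telecom:", telecom)
```

With these made-up inputs the school avoids only a small downtime cost, far
below the cost of the extra redundancy, while the telecom facility's SLA
exposure makes the same spend pay for itself.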

It is unfortunately not possible to achieve 100% uptime. All you can do is
minimize the chances of an outage.

  -Bill

> 
> On Sun, May 31, 2015 at 11:00 AM, Mark Keymer via Outages
> <outages at outages.org> wrote:
> > But honestly I am guessing it was a failure with no foul play happening.
> 
> Apparently no solid proactive testing plan either.
> 
> -Jim P.


