[outages] Level 3 down in Atlanta
    Josh Luthman 
    josh at imaginenetworksllc.com
       
    Thu Oct 22 22:52:49 EDT 2009
    
    
  
This is the point of learning from past mistakes.  Create a
maintenance plan so it doesn't happen again!
Fool me once...
Josh Luthman
Office: 937-552-2340
Direct: 937-552-2343
1100 Wayne St
Suite 1337
Troy, OH 45373
"When you have eliminated the impossible, that which remains, however
improbable, must be the truth."
--- Sir Arthur Conan Doyle
On Thu, Oct 22, 2009 at 10:43 PM, George Herbert
<george.herbert at gmail.com> wrote:
> On Thu, Oct 22, 2009 at 7:03 PM, Jay R. Ashworth <jra at baylink.com> wrote:
> > ----- "Jeremy Chadwick" <outages at jdc.parodius.com> wrote:
> >> On Tue, Oct 20, 2009 at 09:28:21AM -0700, Scott Howard wrote:
> >> > Looks like it's all back up as of about 30 mins ago.
> >> >
> >> > Apparently either a core switch or router failed, which took down much of
> >> > their network in Atlanta, as well as Memphis and Nashville.
> >>
> >> Level 3 has a single router or switch handling packets at a major
> >> POP?
> >> I doubt this, but the outage is confirmation something bad happened.
> >> That said: where's the redundancy, and why didn't it kick in?
> >
> > Oh; you're *always* asking that.
> >
> > :-)
> >
> > The Internet Backbone<tm> has been a commercial, rather than an engineering,
> > construct for over 15 years now.
>
> The RFO that went out somewhat after he asked that was more useful...
> N=2 redundancy was in place.  However, when primary had hardware
> failure, secondary had (unknown / unstated) software, config, or
> hardware failure that hadn't been detected or checked, and it didn't
> work either.
>
> It's hard to test clusters of things well when they have near-100%
> uptime requirements.  The dependability of the untested failover unit
> is low, as you're not testing it well.
>
> Sometimes you can test failovers in stream.  But sometimes those
> supposedly harmless failover tests fail for baroque reasons, taking
> down a service when the primary was in fact just fine.
>
> This isn't (just) an economics problem.  Reliability of complex
> problems is a mathematically exponentially hard problem to crack from
> the engineering and theoretical levels.
>
> Some people don't try - and get what they deserve - and some people
> give it a good or best commercially reasonable effort, and still fail.
> Doing better than that is really hard.
>
>
> --
> -george william herbert
> george.herbert at gmail.com
> _______________________________________________
> outages mailing list
> outages at outages.org
> https://puck.nether.net/mailman/listinfo/outages
>
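
A minimal sketch of the point in the quoted RFO discussion above: N=2
redundancy only helps if the standby actually works when it is called on.
The probabilities are made-up illustrative numbers, not Level 3 data; the
function name is hypothetical, and failures are assumed to be independent.

```python
def pair_outage_probability(p_primary_fail, p_standby_latent_fault,
                            p_standby_fail=None):
    """Chance that both units of an N=2 pair are unavailable in one window.

    Illustrative model only: assumes independent failures.
    p_standby_latent_fault is the chance the standby already has an
    undetected software, config, or hardware fault before it is needed.
    """
    if p_standby_fail is None:
        # Assume the standby is the same kind of box as the primary.
        p_standby_fail = p_primary_fail
    # The standby is unusable if it has a latent fault, or if it is
    # healthy but fails during the same window as the primary.
    p_standby_unusable = (p_standby_latent_fault
                          + (1 - p_standby_latent_fault) * p_standby_fail)
    return p_primary_fail * p_standby_unusable


# Hypothetical numbers: 1% chance of a unit failing in the window,
# 10% chance an untested standby is silently broken.
tested = pair_outage_probability(0.01, 0.0)
untested = pair_outage_probability(0.01, 0.10)

print(f"regularly tested standby: {tested:.6f}")
print(f"untested standby:         {untested:.6f}")
print(f"ratio:                    {untested / tested:.1f}x")
```

Under these assumed numbers the latent-fault term dominates, which is why
an untested failover unit buys much less than the N=2 label suggests.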