[outages] Level 3 down in Atlanta

George Herbert george.herbert at gmail.com
Thu Oct 22 22:43:48 EDT 2009


On Thu, Oct 22, 2009 at 7:03 PM, Jay R. Ashworth <jra at baylink.com> wrote:
> ----- "Jeremy Chadwick" <outages at jdc.parodius.com> wrote:
>> On Tue, Oct 20, 2009 at 09:28:21AM -0700, Scott Howard wrote:
>> > Looks like it's all back up as of about 30 mins ago.
>> >
>> > Apparently either a core switch or router failed, which took down much of
>> > their network in Atlanta, as well as Memphis and Nashville.
>>
>> Level 3 has a single router or switch handling packets at a major
>> POP?
>> I doubt this, but the outage is confirmation something bad happened.
>> That said: where's the redundancy, and why didn't it kick in?
>
> Oh; you're *always* asking that.
>
> :-)
>
> The Internet Backbone<tm> has been a commercial, rather than an engineering,
> construct for over 15 years now.

The RFO that went out somewhat after he asked that was more useful...
N=2 redundancy was in place.  However, when the primary had a hardware
failure, the secondary had an (unknown / unstated) software, config, or
hardware failure that hadn't been detected or checked, so it didn't
work either.

It's hard to test clusters of things well when they have near-100%
uptime requirements.  The dependability of an untested failover unit
is low, precisely because you aren't exercising it.
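
A back-of-the-envelope sketch (Python, with made-up numbers -- nothing
below is from the RFO) of how a latent, undetected fault in the standby
eats most of the benefit of N=2:

    # Hypothetical probabilities for some window -- illustrative only.
    p_primary_fail   = 0.01   # primary suffers a hardware failure
    p_standby_latent = 0.30   # untested standby has an undetected fault

    # Healthy, independent standby: an outage needs both to fail.
    p_outage_ideal = p_primary_fail * p_primary_fail             # 0.0001

    # Untested standby: a latent fault counts as "already failed".
    p_outage_real = p_primary_fail * (p_primary_fail + p_standby_latent)

    print(p_outage_ideal, p_outage_real)  # 0.0001 vs 0.0031, ~30x worse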

Sometimes you can test failovers in-stream, on the live system.  But
sometimes those supposedly harmless failover tests fail for baroque
reasons, taking down a service when the primary was in fact just fine.

This isn't (just) an economics problem.  The reliability of complex
systems is an exponentially hard problem to crack, at both the
engineering and theoretical levels.
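
To gesture at the combinatorics (illustrative only): every component
that can independently be healthy or faulty doubles the number of
system states a failover design has to handle, so exhaustive testing
blows up fast:

    # Distinct healthy/faulty combinations across n independent parts.
    for n in (4, 10, 20, 40):
        print(n, 2 ** n)   # 16, 1024, ~1e6, ~1e12 states to reason about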

Some people don't try - and get what they deserve - and some people
give it a good-faith or commercially reasonable best effort, and still
fail.  Doing better than that is really hard.


-- 
-george william herbert
george.herbert at gmail.com


