[nsp] 5 9's infrastructure, stateful failover

Thu Jul 17 17:32:02 EDT 2003

"Cheung, Rick" <Rick.Cheung at NextelPartners.com> writes:

> 	Good afternoon, folks. We're tasked to 5 9's uptime as a recent
> challenge, and we're doing an audit on our infrastructure to see if we can
> deliver. 
> 
> 	What technologies would lead us closest to 5 9's for uptime, in a
> LAN and WAN environment? (HSRP, Rapid STP, VRRP, redundant circuits)
> 
> 	Ideally, the convergence would be as short as a stateful failover on
> the PIX. 

Five nines would be 5.26 minutes of downtime per year (six seconds per
week).  I wonder if your voice-side people do that well...  probably
not, given that residential POTS is usually in the 3 nines range of
reliability end-to-end (ever get a fast busy signal?) and cellular is
substantially worse than that.
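
For anyone who wants to check the arithmetic, here is a minimal sketch
of the downtime budget implied by a given number of nines.  The function
name and the choice of a 365.25-day average year are assumptions made
for illustration, not anything from a standard:

```python
# Downtime budget implied by an availability target expressed in "nines".
# Assumption: an average year of 365.25 days; a 365-day year gives a
# slightly smaller budget.
SECONDS_PER_YEAR = 365.25 * 24 * 3600
SECONDS_PER_WEEK = 7 * 24 * 3600


def downtime_budget(nines, period_seconds):
    """Return the allowed downtime in seconds over the given period.

    `nines` is the count of nines in the target, so 5 means 99.999%.
    (Hypothetical helper, named for this example only.)
    """
    unavailability = 10 ** -nines  # e.g. 5 nines -> 0.00001
    return period_seconds * unavailability


if __name__ == "__main__":
    for n in (3, 4, 5):
        per_year = downtime_budget(n, SECONDS_PER_YEAR)
        per_week = downtime_budget(n, SECONDS_PER_WEEK)
        print(f"{n} nines: {per_year / 60:8.2f} min/year, "
              f"{per_week:7.2f} s/week")
```

Running it for 3, 4 and 5 nines side by side makes it obvious how
quickly the budget shrinks with each extra nine.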

If you want to take a legitimate crack at making your network more
reliable, the first thing you have to do is to get people completely
away from touching the routers and switches, or changing
configurations at all - even in cases where the network is
misbehaving.  Most outages are caused by technicians, not
hardware.  RFC 3439, section 7, takes a look at this.  All recovery from
any kind of fault will have to be completely automated - there is not
enough time for human analysis or to call someone for escalation if
you're looking at a hard limit of 315 seconds per year.  Naturally,
this is an unattainable goal, but the closer you get to it the more
reliability you get.  Anyway, good policies and procedures (and config
change review) that help the NOC staff resist the urge to stick their
fingers in the boxes will eliminate 3/4 of your outages.  Once you get
the human factor as far out of the equation as possible, your
attention can shift hard in the direction of protocols and hardware
redundancy.

A more attainable approach is to bake your numbers until light golden
brown (mmmmm, tasty!) just like everyone else who claims five nines of
reliability.  Really.  For instance, anything that happens during a
scheduled maintenance window (even a window that was scheduled mere
minutes or seconds before the network died, without customer
notification) must not count towards your 5:15 per year.  Outages that
affect only a small fraction of the system (say, a couple of sites)
should have their outage time pro-rated by the fraction of the total
user set that is affected.  Certainly power availability and other
environmental factors shouldn't be counted as these are external
influences on your network, not reflective of the quality of the
network itself.  Be liberal about what you declare to be "legacy
equipment" and thus not subject to your service metrics.  And so on.
Cisco has a nice paper on the Five Nines myth at
http://www.cisco.com/warp/public/cc/so/neso/vvda/iptl/5nine_wp.htm
that you can read for further ideas.
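
To see how much the accounting rules above can shrink the reported
number, here is a rough sketch.  The Outage record, the user counts and
the example outages are all made up for illustration; the only rules
applied are the two described above (skip anything inside a maintenance
window, pro-rate the rest by the fraction of users affected):

```python
from dataclasses import dataclass


@dataclass
class Outage:
    """One outage record (hypothetical structure for this example)."""
    seconds: float                       # wall-clock duration
    users_affected: int                  # users who lost service
    in_maintenance_window: bool = False  # declared as scheduled work?


def counted_downtime(outages, total_users):
    """Return downtime in seconds after the 'creative' accounting rules."""
    total = 0.0
    for outage in outages:
        if outage.in_maintenance_window:
            continue  # scheduled work does not count at all
        # Pro-rate by the fraction of the user base that was affected.
        total += outage.seconds * (outage.users_affected / total_users)
    return total


if __name__ == "__main__":
    # Invented sample data: 10000 users in total.
    outages = [
        Outage(seconds=3600, users_affected=200),    # two small sites, 1 hour
        Outage(seconds=1800, users_affected=10000,
               in_maintenance_window=True),          # "scheduled" outage
        Outage(seconds=120, users_affected=10000),   # network-wide, 2 minutes
    ]
    raw = sum(o.seconds for o in outages)
    cooked = counted_downtime(outages, total_users=10000)
    print(f"raw downtime:     {raw:.0f} s")
    print(f"counted downtime: {cooked:.0f} s")
```

With this sample data the counted figure comes out far below the raw
wall-clock total, which is the whole point of the exercise.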

It sounds as if your corporate masters have fallen into the "you buy
what you read, not what you need" trap.  You can give them that warm
fuzzy feeling whilst snagging down some budget to make your network
somewhat more reliable (pointing out the naked emperor in the room is
somewhat impolitic although viscerally satisfying at some level), or
you can give them a budget showing how much it will really cost to get
you to five nines and let them say "no thanks".  Your choice.  :)

                                        ---Rob