[nsp] 5 9's infrastructure, stateful failover

Scott Morris swm at emanon.com
Fri Jul 18 11:36:26 EDT 2003


Interesting viewpoint, but if you take the measurements per the Bellcore
standards (brought to you by the folks who invented the concept of
5-9's), you'll find a number of interesting things that affect what
"counts".

If an interruption affects less than 250 users (phones/stations), it
doesn't count.
If a single interruption is less than 30 minutes, it doesn't count.

So missing 6 pings, even interpolated to a minute each, wouldn't count.
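
To make the effect of those two rules concrete, here is a minimal Python
sketch. It assumes the thresholds exactly as I quoted them above (250
users, 30 minutes) and is not taken from the standard text itself; the
Outage record and the function names are made up for illustration.

```python
from dataclasses import dataclass

# Thresholds as quoted in this message; treat them as assumptions,
# not as a reading of the standard itself.
MIN_USERS = 250     # interruptions affecting fewer users are ignored
MIN_MINUTES = 30    # single interruptions shorter than this are ignored


@dataclass
class Outage:
    """Hypothetical record of one interruption."""
    users_affected: int
    minutes: float


def counts(outage: Outage) -> bool:
    """True only if the outage is both big enough and long enough."""
    return outage.users_affected >= MIN_USERS and outage.minutes >= MIN_MINUTES


def counted_downtime_minutes(outages) -> float:
    """Sum the duration of only those outages that count."""
    return sum(o.minutes for o in outages if counts(o))


outages = [
    Outage(users_affected=1000, minutes=6),   # six missed 60 s polls: too short
    Outage(users_affected=100, minutes=120),  # long, but too few users
    Outage(users_affected=500, minutes=45),   # the only one that counts
]
print(counted_downtime_minutes(outages))
```

Under these rules the six missed pings contribute nothing to the
measured downtime, however many users were behind them.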

There are other specific things involved, but measuring five 9's
availability isn't quite as difficult as many people think, given all
of the redundancy capabilities that are possible with today's
technology.

Like any statistics though, it's all in how you read them!

Scott

-----Original Message-----
From: cisco-nsp-bounces at puck.nether.net
[mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Ryan O'Connell
Sent: Friday, July 18, 2003 5:00 AM
To: Cheung, Rick
Cc: cisco-nsp at puck.nether.net
Subject: Re: [nsp] 5 9's infrastructure, stateful failover


Cheung, Rick wrote:
>Good afternoon, folks. We're tasked to 5 9's uptime as a recent 
>challenge, and we're doing an audit on our infrastructure to see if we 
>can deliver.
>
>         What technologies would lead us closest to 5 9's for uptime, 
>in a LAN and WAN environment? (HSRP, Rapid STP, VRRP, redundant 
>circuits)
>
>         Ideally, the convergence would be as short as a stateful 
>failover on the PIX.

99.999% uptime is a business risk vs. profit-based gamble taken by
service providers (and sometimes IT departments!), not a realistic
figure in most (all?) cases. Consider how little outage time you can
actually have in a whole year with five nines - around 5 minutes.
What's the granularity of your monitoring systems in terms of doing
even simple pings? Probably not much better than 60 seconds at best. So
if you drop even six ICMP packets in a year you're over the limit,
because you have to treat each outage (probably of "20% packet loss" or
similar, which almost certainly violates your SLA) as lasting 60
seconds.
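
The arithmetic behind "around 5 minutes" and "six ICMP packets" can be
checked with a short Python sketch. It assumes a 365-day year and that
every failed poll is charged as one full polling interval of downtime;
the function names are only illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_seconds(availability: float) -> float:
    """Allowed downtime per year for an availability such as 0.99999."""
    return SECONDS_PER_YEAR * (1.0 - availability)


def max_failed_polls(availability: float, poll_interval_s: float) -> int:
    """How many failed polls fit in the budget if each one is charged
    as a full polling interval of downtime."""
    return int(downtime_budget_seconds(availability) // poll_interval_s)


budget = downtime_budget_seconds(0.99999)
print(f"budget: {budget:.1f} s/year (about {budget / 60:.2f} minutes)")
print(f"failed 60 s polls that still fit: {max_failed_polls(0.99999, 60)}")
```

With a 60-second polling interval the budget of roughly 315 seconds
covers five failed polls; the sixth puts you over.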

Of course, it never works like this in practice - the NOC will be
"unable to verify" the outage, so the ticket is just closed with no
downtime recorded. Or maybe it did go down, but no users complained, so
it doesn't get recorded. Or they did complain, but were unable to
quantify it because it came back so quickly and only 1% of users
complained anyway - so it doesn't get recorded, etc. etc.

If you want fast failover and to learn about fast failover network
design in a campus-style network, the Cisco Press book "LAN Switching"
is worth a read. (Of course, this doesn't help if you don't know what
you're doing in terms of network management - good NOC practices,
backing up router/switch configs, change control etc. contribute more
to reliability than network design.) Generally, Layer-2 Ethernet
technologies (i.e. spanning tree) can't do fast failover - even if you
have a relatively small network diameter and tweak the timers you'd
still expect 30s convergence times for each flap.

For critical networks, try to push L3 as far out to the edge as
possible and have an STP-free dual L2 core. You need to make sure
servers are built in redundant pairs in this case - it doesn't help
having a reliable infrastructure if the one switch or switch port the
server is connected to fails. (There are technologies to allow servers
to be dual-homed but I'm skeptical about them personally - Sun's
solution looks reasonable, Compaq's less so.)

_______________________________________________
cisco-nsp mailing list  cisco-nsp at puck.nether.net
http://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/



