[nsp] 5 9's infrastructure, stateful failover

Fri Jul 18 11:00:10 EDT 2003

Cheung, Rick wrote:
>Good afternoon, folks. We're tasked to 5 9's uptime as a recent
>challenge, and we're doing an audit on our infrastructure to see if we can
>deliver.
>
>         What technologies would lead us closest to 5 9's for uptime, in a
>LAN and WAN environment? (HSRP, Rapid STP, VRRP, redundant circuits)
>
>         Ideally, the convergence would be as short as a stateful failover on
>the PIX.

99.999% uptime is a business risk vs. profit gamble taken by service
providers (and sometimes IT departments!), not a realistic figure in most
(all?) cases. Consider how little outage time you can actually have in a
whole year with five nines - around 5 minutes (about 315 seconds). What's
the granularity of your monitoring systems in terms of doing even simple
pings? Probably 60 seconds at best. So if you drop even six ICMP packets
in a year you're over the limit, because you have to treat each outage
(probably logged as "20% packet loss" or similar, which almost certainly
violates your SLA) as lasting 60 seconds.
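
To make the arithmetic above concrete, here is a rough sketch in Python. The function names are made up for illustration, and it assumes a 365-day year, one ping per 60-second polling interval, and that every failed poll is charged as a full interval of downtime.

```python
# Downtime budget for an availability target, and how coarse polling
# inflates the outage time that gets recorded.
# Assumptions (not from any real monitoring tool): 365-day year, and each
# failed poll is charged as one whole polling interval of downtime.

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000


def downtime_budget_seconds(availability: float) -> float:
    """Allowed downtime per year, in seconds, for an availability fraction."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def recorded_downtime_seconds(failed_polls: int, poll_interval_s: int) -> int:
    """Downtime charged when each failed poll counts as a full interval."""
    return failed_polls * poll_interval_s


budget = downtime_budget_seconds(0.99999)
print(f"five nines budget: {budget:.1f} s (~{budget / 60:.2f} min)")

for failed in (5, 6):
    charged = recorded_downtime_seconds(failed, 60)
    print(f"{failed} failed polls -> {charged} s, over budget: {charged > budget}")
```

With 60-second polling, five failed polls (300 s) still fit inside the budget and the sixth (360 s) does not, which is where the "six ICMP packets" figure comes from.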

Of course, it never works like this in practice - the NOC will be "unable 
to verify" the outage so the ticket is just closed with no downtime 
recorded. Or maybe it did go down, but no users complained, so it doesn't 
get recorded. Or they did complain, but were unable to quantify it because
it came back so quickly and only 1% of users complained anyway - so it
doesn't get recorded, etc.

If you want fast failover and to learn about fast failover network design
in a campus-style network, the Cisco Press book "LAN Switching" is worth a
read. (Of course, this doesn't help if you don't know what you're doing in
terms of network management - good NOC practices, backing up router/switch
configs, change control etc. contribute more to reliability than network
design.)

Generally, Layer-2 Ethernet technologies (i.e. spanning tree) can't do
fast failover - even if you have a relatively small network diameter and
tweak the timers you'd still expect convergence times of around 30s for
each flap. For critical networks, try to push L3 as far out to the edge as
possible and have an STP-free dual L2 core. You need to make sure servers
are built in redundant pairs in this case - it doesn't help to have a
reliable infrastructure if the one switch or switch port the server is
connected to fails. (There are technologies that allow servers to be
dual-homed, but I'm skeptical about them personally - Sun's solution looks
reasonable, Compaq's less so.)
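
As a back-of-the-envelope illustration of why the single switch port matters, here is a small Python sketch using the usual series/parallel availability formulas. The availability figures and names are invented for the example, and it assumes component failures are independent, which real networks rarely manage.

```python
# Series vs. parallel availability, assuming independent failures.
# All figures below are hypothetical, chosen only to illustrate the point.


def series(*availabilities: float) -> float:
    """Everything in the chain must be up: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities: float) -> float:
    """Only one branch needs to be up: 1 minus the product of unavailabilities."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


core = 0.99999          # hypothetical redundant core
access_switch = 0.999   # hypothetical single access switch / port
server = 0.999          # hypothetical single server

# One server hanging off one access switch.
single_homed = series(core, access_switch, server)

# Two servers, each on its own access switch, either one can serve.
branch = series(access_switch, server)
redundant_pair = series(core, parallel(branch, branch))

print(f"single server, single switch: {single_homed:.6f}")
print(f"redundant server pair:        {redundant_pair:.6f}")
```

In the single-homed case the weakest link drags the whole chain down no matter how good the core is. The redundant pair gets much closer, but only to the extent the two branches really do fail independently.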