[nsp] 5 9's infrastructure, stateful failover
Ryan O'Connell
ryan at complicity.co.uk
Fri Jul 18 11:00:10 EDT 2003
Cheung, Rick wrote:
>Good afternoon, folks. We're tasked to 5 9's uptime as a recent
>challenge, and we're doing an audit on our infrastructure to see if we can
>deliver.
>
> What technologies would lead us closest to 5 9's for uptime, in a
>LAN and WAN environment? (HSRP, Rapid STP, VRRP, redundant circuits)
>
> Ideally, the convergence would be as short as a stateful failover on
>the PIX.

99.999% uptime is a risk-vs-profit gamble taken by service providers
(and sometimes IT departments!), not a realistic figure in most (all?)
cases. Consider how little outage time you can actually have in a whole
year with five nines - around 5 minutes. What's the granularity of your
monitoring systems in terms of doing even simple pings? Probably not much
better than 60 seconds at best. So if you drop even six ICMP packets in a
year you're over the limit, because you have to treat each outage
(probably defined as "20% packet loss" or similar, which almost certainly
violates your SLA) as lasting 60 seconds.
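
To make the arithmetic above concrete, here is a rough sketch in Python.
The function names and the 365.25-day year are my own assumptions for
illustration; the 60-second poll interval is the figure used above.

```python
# Rough downtime-budget arithmetic for an availability target.
# Assumption: a year is 365.25 days; the names below are illustrative only.
SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_budget_seconds(availability):
    """Seconds of outage allowed per year at the given availability."""
    return (1.0 - availability) * SECONDS_PER_YEAR


def max_failed_polls(availability, poll_interval_s):
    """How many failed polls fit in the budget if each failed poll
    has to be counted as a full polling interval of downtime."""
    return int(downtime_budget_seconds(availability) // poll_interval_s)


budget = downtime_budget_seconds(0.99999)
print("Five-nines budget: %.1f s (%.2f min) per year" % (budget, budget / 60))
print("Failed 60 s polls tolerated:", max_failed_polls(0.99999, 60))
```

With these assumptions the budget comes out a little over five minutes, so
five failed 60-second polls fit inside it and the sixth does not.
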

Of course, it never works like this in practice - the NOC will be "unable
to verify" the outage, so the ticket is just closed with no downtime
recorded. Or maybe it did go down, but no users complained, so it doesn't
get recorded. Or they did complain, but were unable to quantify it because
it came back so quickly and only 1% of users complained anyway - so it
doesn't get recorded, and so on.

If you want fast failover, and want to learn about fast-failover network
design in a campus-style network, the Cisco Press book "LAN Switching" is
worth a read. (Of course, this doesn't help if you don't know what you're
doing in terms of network management - good NOC practices, backing up
router/switch configs, change control etc. contribute more to reliability
than network design.) Generally, Layer-2 Ethernet technologies (i.e.
spanning tree) can't do fast failover - even if you have a relatively
small network diameter and tweak the timers, you'd still expect 30s
convergence times for each flap.
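
For a feel of where a figure like 30s comes from, here is a small sketch
based on the classic 802.1D default timers (max age 20s, forward delay
15s). The function name and the simplified direct/indirect failure model
are my own assumptions; real convergence depends on topology and platform.

```python
# Simplified model of classic (802.1D) spanning-tree reconvergence.
# Assumption: a blocked port must pass through the listening and learning
# states, each lasting forward_delay seconds, before it forwards traffic.
# For an indirect failure the bridge first waits max_age seconds for the
# stored BPDU information to expire.
def stp_convergence_seconds(max_age=20, forward_delay=15, direct_failure=False):
    listening_plus_learning = 2 * forward_delay
    if direct_failure:
        return listening_plus_learning
    return max_age + listening_plus_learning


print("Direct link failure:  ", stp_convergence_seconds(direct_failure=True), "s")
print("Indirect link failure:", stp_convergence_seconds(), "s")
```

At the defaults this gives 30s for a directly detected failure and 50s for
an indirect one, which is nowhere near the convergence of a stateful
failover pair.
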

For critical networks, try to push L3 as far towards the edge as possible
and have an STP-free dual L2 core. You need to make sure servers are built
in redundant pairs in this case - it doesn't help having a reliable
infrastructure if the one switch or switch port the server is connected to
fails. (There are technologies to allow servers to be dual-homed, but I'm
skeptical about them personally - Sun's solution looks reasonable, Compaq's
less so.)