[c-nsp] Stacking 3750X vs diverse 4948E

Sat May 19 08:10:11 EDT 2012

On (2012-05-19 07:47 -0400), Lee wrote:

> How about VSS?  We're considering it mainly because it would eliminate STP

There are already horror stories on c-nsp where a software defect has taken
a whole VSS cluster down. STP is very unlikely to do that, as the code is a
lot simpler and a lot more mature.

To me it is clear that the main things that cause outages are:

1. Operator
2. Software defect
3. Hardware defect

And there is a huge gap between the probability of each, i.e. the operator
is much more likely to break the network than a software defect, and so forth.

Yet typically even high-budget, high-clue, critically important networks are
designed only to work around outages caused by 3. Often these efforts
actually increase the probability of 1 and 2. Essentially, the 'well
designed' network often has lower MTBF due to the added software complexity.
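
As a rough sketch of that argument: if outages from the three causes are
treated as independent events with a constant yearly rate, the MTBF is simply
the inverse of the summed rates. Every number below is a made-up assumption
for illustration, not measured data; the point is only that masking cause 3
does not help if the clustering pushes causes 1 and 2 up by more.

```python
# Illustrative only: every rate below is an assumed value, not measured data.
HOURS_PER_YEAR = 8760.0


def mtbf_hours(rates_per_year):
    """MTBF of a system that goes down when any one independent cause hits."""
    total_rate = sum(rates_per_year.values())
    return HOURS_PER_YEAR / total_rate


# Single box: assumed outage-causing events per year, by cause.
single = {"operator": 0.50, "software": 0.20, "hardware": 0.05}

# Clustered pair sharing state: hardware faults are mostly masked, but the
# extra clustering code and configuration raise the other two (assumed).
cluster = {"operator": 0.60, "software": 0.30, "hardware": 0.005}

single_mtbf = mtbf_hours(single)
cluster_mtbf = mtbf_hours(cluster)

print(f"single : {single_mtbf:8.0f} h between outages")
print(f"cluster: {cluster_mtbf:8.0f} h between outages")
```

With these assumed rates the cluster comes out with the lower MTBF, even
though its hardware term is ten times smaller.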

A key example here is stateful firewall clusters, which I consistently see
failing more often than single firewalls.
When possible, I would separate elements with routing and accept that users
will see sessions break when there is a network fault.

If you keep an eye on the press, these examples are in the news all the time,
where a CIO explains that the setup was fully redundant, yada yada, and it
should never have failed.

The latest example I can think of was a large outsourcer/integrator losing
their whole 'redundant' storage setup, causing a 5-day outage. Or a bit longer
ago, public sector health care had to resort to dead wood as the LAN was down
for 1.5 weeks.
Both were designed not to fail, and neither was designed to work around (or
even rapidly recover from) software defects.

-- 
  ++ytti