[c-nsp] IOS reliability

Phil Mayers p.mayers at imperial.ac.uk
Wed Jan 7 10:57:02 EST 2009


Ross Vandegrift wrote:
> On Wed, Jan 07, 2009 at 11:18:06AM +0000, A.L.M.Buxey at lboro.ac.uk wrote:
>> ...and safeharbor is always a good option unless you can't use it (e.g.
>> you need a feature that's not in a safeharbor release!)
> 
> Don't put too much stock in the "Safe Harbor" label.  We have an
> internal control to only run Safe Harbor code on our 6500s.  I've seen
> more crashes from the 12.2S train than any other IOS, probably by an
> order of magnitude.
> 
> Most of the crashes have been related to SNMP.  For many MIBs, if you
> poll an object at the same time it is changed/removed, there's a race
> condition somewhere that kills IOS.  It's really horrible - we're just
> slowly whittling away at our SNMP view, losing management capabilities
> to keep the damn things from falling over.
> 
> According to TAC, these crashes are rare and hard to trigger.  We've
> done it twice in a lab and four times in production.  On the upside,
> if you don't use SNMP, you're probably golden!
> 

I've never triggered a crash over SNMP on a 6500 running 12.2SX IOS,
and we do some pretty extensive and aggressive MIB polling.

What MIBs have you had problems with?

In answer to the OP's question, I think it would be difficult to define
an MTBF for IOS because it's going to depend on the features enabled
and the traffic patterns, but I generally observe 3 patterns of
behaviour:

  1. known-bad IOS versions, e.g. SXF15, SXF2a/3, which have obvious
crash bugs that you can trivially trigger and find very quickly (in the
lab, hopefully).

  2. known-good IOS versions, which seem (for a given feature set and
traffic pattern) to be more or less indestructible. SXF9 and SXF10
fitted the bill for us (MPLS L3 VPN, MVPN, >300 SVIs, >5000 ARP/FDB
entries, multi-gigabit, IMIX traffic including default VRF exposed to
the internet but protected via CoPP). 10 routers running SXF10 with
this traffic mix ran for >1 year, which is roughly 10 x 8760 = 87,600
router-hours, so I guess the MTBF is on the order of 10^5 hours.

  3. buggy IOS versions which suddenly reach a threshold then "go bad". 
We ran into memory leak problems on SXF6 where a box which had been 
running for >1 year suddenly started dying as the number of SVIs or 
ARP/FDB entries got "too big".

In short - if the version runs for >2 weeks with a representative config
and traffic load, I assume an MTBF of >10^4 hours, and our experiences
support this.
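
The back-of-envelope MTBF figures above can be sketched in Python. This
is only an illustration under stated assumptions: the function names are
made up for this example, failures are assumed to follow a constant-rate
(exponential) model, and ">1 year" is taken as exactly 8760 hours per
router. With no crashes observed, pooled run time gives only a lower
bound on MTBF, not a point estimate.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8760; assumes ">1 year" means one full year


def device_hours(n_devices, hours_each):
    """Pooled run time across a fleet of identical devices."""
    return n_devices * hours_each


def mtbf_lower_bound(total_hours, confidence=0.95):
    """One-sided lower bound on MTBF after zero observed failures.

    Assumes exponentially distributed times between failures
    (hypothetical model, not something the original post states).
    """
    if not 0.0 < confidence < 1.0:
        raise ValueError("confidence must be strictly between 0 and 1")
    return total_hours / -math.log(1.0 - confidence)


# 10 routers on SXF10, each up for one year with the same traffic mix
fleet_hours = device_hours(10, HOURS_PER_YEAR)

print("pooled router-hours:", fleet_hours)
print("95%% lower bound on MTBF: %.0f hours" % mtbf_lower_bound(fleet_hours))
```

At 95% confidence the 10-router year supports an MTBF of at least about
2.9 x 10^4 hours; the 10^5 figure corresponds to treating the pooled
87,600 router-hours as the estimate itself.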
