[c-nsp] Parity Errors and Cosmic Rays

Church, Chuck cchurch at netcogov.com
Thu May 5 12:17:09 EDT 2005


There has been research done on it:
http://www.research.ibm.com/journal/rd40-1.html

For now, I guess I'll just keep wrapping my devices in lead... 


Chuck Church
Lead Design Engineer
CCIE #8776, MCNE, MCSE
Netco Government Services - Design & Implementation
1210 N. Parker Rd.
Greenville, SC 29609
Home office: 864-335-9473
Cell: 703-819-3495
cchurch at netcogov.com
PGP key: http://pgp.mit.edu:11371/pks/lookup?op=get&search=0x4371A48D


-----Original Message-----
From: cisco-nsp-bounces at puck.nether.net
[mailto:cisco-nsp-bounces at puck.nether.net] On Behalf Of Chris Roberts
Sent: Thursday, May 05, 2005 12:03 PM
To: 'John Neiberger'; cisco-nsp at puck.nether.net
Subject: RE: [c-nsp] Parity Errors and Cosmic Rays

> Is this actually a common problem? Or at least common enough 
> that I should expect to see it every other month or so? It 
> seems strange that this router has run for years and we've 
> never seen a memory parity error and now we've seen three in 
> three months.
> 

Sometime last year, we started seeing memory parity errors on our 7507s.
At first this affected one card. Over the course of around a month it
gradually spread to 3 cards in the same platform, the first two of which
were replaced. It then spread to another chassis in the same rack, which
started losing cards at the same rate over the course of a month. (See my
mails to this list at around the same time with around the same kind of
content as yours.) I'd run 7505s at other ISPs for ~5 or more years and
never seen anything like this. Cisco simply wanted to replace each of the
offending items of hardware, but this was not stopping the spread. We then
lost a PA-GE with parity errors in one of our 7206s in another rack in the
same suite.

After much sobbing we took the 7507s out and upgraded our 6509s to
Sup720s, which so far have been rock solid, besides some installation
issues and teething problems. I realise this isn't a possibility for
everyone though.

Some things that were suggested at the time:
* Cosmic rays
* Static protection in your data centres
* Metal filings getting into kit from people chopping floor tiles and
  such, and getting into the aircon
* Failing PSUs

Also, our offending 7507s were getting old (3-4 years apparently), but had
always been rock solid. I suspect it may have just been age that killed
them in the end. We never did find any trace of any of the above, although
obviously static and cosmic rays are hard to prove. At the time it was also
suggested that the TAC would be able to test the returned cards and provide
you with some kind of breakdown of the failure mode of the card and let you
know which components they had to replace, but that they would be loath to
do this. Sure enough, we requested the TAC do this, and they were loath to
do it. We've never followed this up, as we still have most of the dead
cards and didn't RMA them, but I guess that might be something you may want
to do.

> Any thoughts?
> 
> Thanks,
> John

Cheers,
Chris.


_______________________________________________
cisco-nsp mailing list  cisco-nsp at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-nsp
archive at http://puck.nether.net/pipermail/cisco-nsp/


