[j-nsp] High failure rates for M7i/M10i hard disks?
Kevin Day
toasty at dragondata.com
Fri Aug 26 17:14:48 EDT 2005
On Aug 26, 2005, at 2:55 PM, Richard A Steenbergen wrote:
>
> I hope they're not actually saying that the hard drive can't handle
> being written to every 10 secs?
>
> Bullshit meter: [........../]
A long time ago, several jobs ago, I worked as an arcade game
developer. We used 2.5" hard drives in a few arcade games.
The drives were in constant use. Some of the boards in the games had
less than 4 MB of RAM and were streaming video constantly. There was
no cache, no buffer. The activity light on the drives was on constantly.
Some models of drives appeared to work fine, some would have this 2-5
second "pause" every 10-30 minutes, especially when the drive was
just powered up. We thought it was some kind of correctable ECC
error, and RMA'ed drives by the hundreds. Eventually the manufacturer
explained to us that the drives did a thermal recalibration every so
often. The platters, head arms, and just about everything else would
stretch a few microns as the drive warmed up. The drive would have to
do something (look where some special tracks were at the inner and
outer limits of the drive, or just seek home and see where the home
sensor would be activated, etc.) to know how far things had stretched.
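
For a rough sense of scale, the "few microns" figure is consistent with
plain linear thermal expansion. This is only a back-of-envelope sketch:
it assumes an aluminum part (expansion coefficient of roughly 23e-6 per
kelvin), about 30 mm of material (roughly the radius of a 2.5" platter),
and a 10 K warm-up. All three numbers are illustrative assumptions, not
figures from the drive manufacturer.

```python
# Back-of-envelope linear thermal expansion: delta_L = alpha * L * delta_T.
# All inputs are illustrative assumptions, not manufacturer data.

ALPHA_ALUMINUM_PER_K = 23e-6  # approximate linear expansion coefficient of aluminum


def expansion_microns(length_mm: float, delta_t_kelvin: float,
                      alpha_per_k: float = ALPHA_ALUMINUM_PER_K) -> float:
    """Return the change in length, in microns, for a given temperature rise."""
    delta_mm = alpha_per_k * length_mm * delta_t_kelvin
    return delta_mm * 1000.0  # convert mm to microns


if __name__ == "__main__":
    # Assumed: ~30 mm of material warming by 10 K after power-up.
    print(f"{expansion_microns(30.0, 10.0):.1f} microns")  # prints "6.9 microns"
```

Under those assumptions the growth is about 7 microns, the same order
of magnitude as the stretch described above.
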
It turned out that the drives that were never showing this problem
were just never recalibrating because we never gave them a chance. We
had such immense cooling in our systems (a giant wooden arcade
cabinet with a lone hard drive in a cubic meter of air, and fans
blowing over it), they never got warm enough to really need it. The
drive manufacturer told us that, had we been doing writes to the drive
(all of our access was reads; any storage was done to CMOS), it
probably would have eventually caused problems by writing tracks
slightly off center.
Most drives do this without you ever knowing. They wait for a quiet
period of no activity after XX seconds and just do it. How drives
cope with never finding a quiet time seems to be manufacturer/model
dependent. Some stop responding while they force a recalibration
event, some never do one, some do some more intelligent things where
they approximate the recalibration steps needed by using whatever
seeks you're doing on your own.
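
The three behaviours can be compared with a toy model. This is a sketch
only: the policy names, the 600-second recalibration interval, and the
3-second stall are hypothetical values picked to resemble the symptoms
described above, not anything taken from real drive firmware.

```python
# Toy model of three recalibration policies on a drive that is never idle.
# Policy names and time constants are hypothetical.
from enum import Enum


class Policy(Enum):
    FORCE = "force"                  # stall the host and recalibrate when overdue
    NEVER = "never"                  # postpone indefinitely while busy
    OPPORTUNISTIC = "opportunistic"  # refine calibration from the host's own seeks


def simulate(policy: Policy, busy_seconds: int, recal_interval: int = 600,
             stall_seconds: int = 3) -> dict:
    """Simulate a drive that services host I/O during every second.

    Returns the total time the host was stalled and how long the drive
    has gone since its calibration was last refreshed.
    """
    stalled = 0
    since_recal = 0
    for _ in range(busy_seconds):
        since_recal += 1
        if policy is Policy.OPPORTUNISTIC:
            since_recal = 0           # every host seek refreshes the estimate
        elif policy is Policy.FORCE and since_recal >= recal_interval:
            stalled += stall_seconds  # host sees a multi-second pause
            since_recal = 0
        # Policy.NEVER: nothing happens; the calibration just gets staler.
    return {"stalled_seconds": stalled, "seconds_since_recal": since_recal}


if __name__ == "__main__":
    for policy in Policy:
        print(policy.name, simulate(policy, busy_seconds=3600))
```

Over a busy hour, the FORCE policy shows up as periodic pauses, while
NEVER shows no pauses at all but an ever-growing time since the last
recalibration, which is the case that would matter for writes.
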
For us, we could wait for a dead time where a fixed screen was being
displayed for 5 seconds and no hard drive activity was occurring and
say "If you need to re-calibrate, DO IT NOW!". This worked for us.
The command to do this is somewhat model dependent. Some drives that
don't support it will hang if you send it to them.
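
The host-side logic amounts to something like the sketch below. The
allow-list, the model string, and the send_recalibrate callback are all
hypothetical placeholders; the actual command is model dependent, as
noted above, so it is deliberately left abstract here.

```python
# Sketch of host-side scheduling: request a recalibration only during a
# known quiet window, and only on drives known to accept the command.
from typing import Callable, Set

SUPPORTED_MODELS: Set[str] = {"EXAMPLE-MODEL-A"}  # hypothetical allow-list


def maybe_recalibrate(model: str, idle_window_s: float,
                      send_recalibrate: Callable[[], None],
                      min_window_s: float = 5.0) -> bool:
    """Issue a recalibration if it is safe; return True if one was sent."""
    if model not in SUPPORTED_MODELS:
        return False    # unsupported drives may hang on the command, so skip
    if idle_window_s < min_window_s:
        return False    # not enough guaranteed dead time to absorb the pause
    send_recalibrate()  # placeholder for the model-specific command
    return True


if __name__ == "__main__":
    sent = maybe_recalibrate("EXAMPLE-MODEL-A", idle_window_s=5.0,
                             send_recalibrate=lambda: print("recalibrating"))
    print("sent:", sent)
```

Keeping an explicit allow-list is the conservative choice here, since
sending the command to the wrong drive is worse than skipping it.
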
If Juniper is using drives that will indefinitely postpone
recalibration based on activity AND the drives are getting hot enough
that thermal recalibration is necessary, I can see how this would
cause problems. Either it's missing the track, taking too long to
find the track, or writing data in the wrong places.
My memory is a little fuzzy, and I'm sure someone from Seagate or Maxtor
is gonna reply and correct everything, but this is my theory.