[j-nsp] High failure rates for M7i/M10i hard disks?

Fri Aug 26 17:14:48 EDT 2005

On Aug 26, 2005, at 2:55 PM, Richard A Steenbergen wrote:
>
> I hope they're not actually saying that the hard drive can't handle
> being written to every 10 secs?
>
> Bullshit meter: [........../]

A long time ago, several jobs ago, I worked as an arcade game  
developer. We used 2.5" hard drives in a few arcade games.

The drives were in constant use. Some of the boards in the games had
less than 4MB of RAM and were streaming video constantly. There was
no cache, no buffer. The activity light on the drives was on constantly.

Some models of drives appeared to work fine, some would have this 2-5  
second "pause" every 10-30 minutes, especially when the drive was  
just powered up. We thought it was some kind of correctable ECC  
error, and RMA'ed drives by the hundreds. Eventually the manufacturer  
explained to us that the drives did a thermal recalibration every so  
often. The platters, head arms, and just about everything else would  
stretch a few microns as the drive warmed up. The drive would have to  
do something (look where some special tracks were at the inner and  
outer limits of the drive, or just seek home and see where the home  
sensor would be activated, etc) to know how far things had stretched.

It turned out that the drives that were never showing this problem  
were just never recalibrating because we never gave them a chance. We  
had such immense cooling in our systems (a giant wooden arcade  
cabinet with a lone hard drive in a cubic meter of air, and fans  
blowing over it), they never got warm enough to really need it. The  
drive manufacturer told us that had we been doing writes to the drive  
(all of our access was reads, any storage was done to CMOS) it  
probably would have eventually caused problems by writing tracks  
slightly off center.

Most drives do this without you ever knowing. They wait until there has
been no activity for XX seconds and then just do it. How drives cope with
never finding a quiet time seems to be manufacturer/model dependent. Some
stop responding while they force a recalibration, some never do one, and
some do more intelligent things, approximating the recalibration steps
needed by using whatever seeks you're already doing on your own.
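
To make the difference between those behaviours concrete, here is a toy
simulation in Python. It is only a sketch: the policy names and every
number in it (recalibration interval, idle threshold, forced deadline)
are made up for illustration and do not describe any real drive's
firmware. The "approximate it from your own seeks" behaviour is left out.

```python
from dataclasses import dataclass


@dataclass
class DriveModel:
    """Toy model of a thermal recalibration policy. All values hypothetical."""
    policy: str                     # "idle_only", "forced", or "never"
    recal_interval: float = 600.0   # drive would like to recalibrate this often (s)
    idle_threshold: float = 5.0     # quiet time needed for a silent recalibration (s)
    force_deadline: float = 1200.0  # "forced" policy stalls the host after this long (s)
    last_recal: float = 0.0
    last_io: float = 0.0

    def on_tick(self, now: float) -> bool:
        """Drive housekeeping; True if it recalibrated without the host noticing."""
        if self.policy == "never":
            return False
        overdue = now - self.last_recal >= self.recal_interval
        idle = now - self.last_io >= self.idle_threshold
        if overdue and idle:
            self.last_recal = now
            return True
        return False

    def on_io(self, now: float) -> str:
        """Host request; returns "stall" if the drive pauses to recalibrate first."""
        result = "ok"
        if self.policy == "forced" and now - self.last_recal >= self.force_deadline:
            self.last_recal = now
            result = "stall"
        self.last_io = now
        return result


def simulate(policy: str, duration: float = 3600.0, io_period: float = 1.0):
    """Issue one request every io_period seconds.

    Returns (stalls seen by host, silent recalibrations,
             seconds since last recalibration at the end).
    """
    drive = DriveModel(policy=policy)
    stalls = 0
    silent = 0
    t = 0.0
    while t < duration:
        t += io_period
        if drive.on_tick(t):
            silent += 1
        if drive.on_io(t) == "stall":
            stalls += 1
    return stalls, silent, t - drive.last_recal
```

With a request every second the drive never sees 5 quiet seconds, so
under these assumed numbers "idle_only" and "never" both go the full
hour without recalibrating, while "forced" stalls the host instead.
Space the requests 10 seconds apart and "idle_only" recalibrates
silently every time it comes due.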

For us, we could wait for a dead time where a fixed screen was being  
displayed for 5 seconds and no hard drive activity was occurring and  
say "If you need to re-calibrate, DO IT NOW!". This worked for us.  
The command to do this is somewhat model dependent. Some drives that  
don't support it will hang if you send it to them.
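
Roughly, the host-side logic looked like the Python sketch below. This
is a reconstruction of the idea, not our actual code: the class name,
the allowlist of models, and the send_hint callback are all
hypothetical, and the real command you would send is drive specific, so
it is left as a callback here.

```python
from typing import Callable, Iterable


class IdleRecalHint:
    """Tell the drive "recalibrate now if you need to" during known dead time.

    Hypothetical sketch: send_hint stands in for whatever model-specific
    command the drive accepts. Drives not on the allowlist are never sent
    the command, since unsupported drives may hang on it.
    """

    def __init__(
        self,
        drive_model: str,
        supported_models: Iterable[str],
        send_hint: Callable[[], None],
        min_window_s: float = 5.0,  # assumed: dead time long enough to cover the pause
    ) -> None:
        self.enabled = drive_model in set(supported_models)
        self.send_hint = send_hint
        self.min_window_s = min_window_s

    def on_dead_time(self, window_s: float) -> bool:
        """Call when the game knows it will do no disk I/O for window_s seconds.

        Returns True if the hint was sent to the drive.
        """
        if not self.enabled or window_s < self.min_window_s:
            return False
        self.send_hint()
        return True
```

The game would call on_dead_time(5.0) whenever it put up a fixed
screen. The drive still decides for itself whether it actually needs to
recalibrate; the host only tells it that now is a safe moment.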

If Juniper is using drives that will indefinitely postpone  
recalibration based on activity AND the drives are getting hot enough  
that thermal recalibration is necessary, I can see how this would  
cause problems. Either the drive is missing the track, taking too long to
find the track, or writing data in the wrong places.

My memory is a little fuzzy, and I'm sure someone from Seagate or Maxtor
is gonna reply and correct everything, but this is my theory.