[j-nsp] High failure rates for M7i/M10i hard disks?

Fri Aug 26 17:05:05 EDT 2005

On Fri, Aug 26, 2005 at 10:33:16PM +0200, sthaug at nethelp.no wrote:
> > > And that field alert is now out: PSN-2005-08-014
> > > 
> > > https://www.juniper.net/alerts/viewalert.jsp?actionBtn=Search&txtAlertNumber=PSN-2005-08-014&viewMode=view
> > 
> > I hope they're not actually saying that the hard drive can't handle being 
> > written to every 10 secs?
> 
> I'm not going to defend Juniper here - I think we have suffered quite
> enough of these disk problems (got woken by a phone call from our NOC
> this morning - *another* M7i had stopped working during the night, from
> the same problem).
> 
> However, I *think* what they're saying is that writes every 10 seconds
> for a while is not a problem, but writes every 10 seconds 24x7 may be a
> problem. Remember what the disk manufacturers have been trying to tell
> us - there are differences between disks made for heavy-duty server use
> (typically SCSI) and disks made for PC/home use (typically ATA). The
> M7i/M10i disks are 2.5" ATA disks (laptop type disks) and are probably
> not made for continuous use.

As I understand it, they're claiming that writing to the drive every 10 
secs is preventing the thermal recalibration. I'm not a hard drive 
engineer, so I can't say for certain what is or isn't necessary, but like 
I said, my bullshit meter is going off on log file writes every 10 secs 
preventing thermal recal.

However, I will definitely say that the rest is nonsense. First, the 
drives are exactly the same; the only difference is the type of interface 
attached to the drive. Yes, the drive manufacturers will reserve the 
fastest and nicest drives for the more expensive commercial-use interfaces 
(Fibre Channel, SCSI, etc.), but there are in fact specifically targeted 
server-grade 2.5" ATA drives for use in blade servers. These drives are 
subjected to the same high-volume 24/7 reads and writes as any other 
server drive, and I can't imagine how a small log file write every 10 
secs could possibly compare.

I checked on the specific drive used in a RE-5.0/RE-400, based on 
information provided by some actual users of it. The drive detects as a:

ad1: 19077MB <HTS548020M9AT00> [38760/16/63] at ata0-slave using UDMA33

Which seems to be a Travelstar 5K80-20 5400RPM 20GB ATA-6 drive:

http://www.hitachigst.com/portal/site/en/menuitem.4a8443e5524e0c5deb4703e3aac4f0a0/
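
As a rough sanity check that the probe line really does describe a 20GB 
drive, the reported geometry can be multiplied out. This is only a sketch: 
the parse_probe() helper and its regular expression are made up for 
illustration, and the 512-byte sector size is an assumption (it is not 
printed in the probe line, though it is the usual size for ATA drives of 
this era).

```python
import re

# Probe line as quoted above.
PROBE = "ad1: 19077MB <HTS548020M9AT00> [38760/16/63] at ata0-slave using UDMA33"

SECTOR_BYTES = 512  # assumption: not stated in the probe line


def parse_probe(line):
    """Hypothetical helper: pull model and C/H/S geometry out of a probe line."""
    m = re.match(r"(\w+): (\d+)MB <([^>]+)> \[(\d+)/(\d+)/(\d+)\]", line)
    if m is None:
        raise ValueError("unrecognised probe line")
    dev, reported_mb, model, cyl, heads, spt = m.groups()
    sectors = int(cyl) * int(heads) * int(spt)
    return {
        "device": dev,
        "model": model,
        "reported_mb": int(reported_mb),
        "sectors": sectors,
        "bytes": sectors * SECTOR_BYTES,
    }


info = parse_probe(PROBE)
print(info["model"], info["sectors"], "sectors")
print(info["bytes"] // 2**20, "MB (binary),", round(info["bytes"] / 1e9, 1), "GB (decimal)")
```

Under that assumption the geometry works out to the same 19077MB the 
kernel reports, i.e. about 20GB in the decimal units drive vendors use.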

It seems the drives they are marketing for blade server use are the 
E#K##'s not the regular #K##'s. For example:

http://www.hitachigst.com/portal/site/en/menuitem.ec03cadee7c6fb5deb4703e3aac4f0a0/

vs

http://www.hitachigst.com/portal/site/en/menuitem.c8c3966a526cfb5deb4703e3aac4f0a0/

A search on pricewatch seems to put the price for these models (remember 
this is 60GB, 3x the capacity of the RE drive, and 7200RPM) at $179 for 
the non-E, and $209 for the E ($30 difference). Now, I don't know for 
certain if this drive is actually any "longer lived", or if it just offers 
faster access rates, but we do know that an actual server blade grade HD 
is obtainable very cheaply. Given that the list price of an RE-400-256 is 
$15,000, and an RE-850-1536 is $20,000, you're going to have a pretty damn 
hard time convincing me that Juniper couldn't make sure $30 more per unit 
was spent to get drives which could handle updating a log file every 10 
seconds.
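
To put numbers on that, here is a back-of-the-envelope sketch using only 
the figures quoted above (pricewatch street prices, Juniper list prices, 
and the 10 second write interval). The variable names are mine, and 
comparing a street price to a list price is obviously rough.

```python
# Figures taken from the text above.
PRICE_NON_E = 179          # USD, 60GB 7200RPM non-E model
PRICE_E = 209              # USD, 60GB 7200RPM E (blade server) model
RE_LIST_PRICE = {"RE-400-256": 15000, "RE-850-1536": 20000}  # USD
WRITE_INTERVAL_SECS = 10   # one log file write every 10 seconds

premium = PRICE_E - PRICE_NON_E

# Drive premium as a fraction of each routing engine's list price.
share = {name: premium / price for name, price in RE_LIST_PRICE.items()}

# How many log writes "every 10 seconds 24x7" actually amounts to.
writes_per_day = 24 * 60 * 60 // WRITE_INTERVAL_SECS
writes_per_year = writes_per_day * 365

print("premium: $%d" % premium)
for name, frac in share.items():
    print("%s: %.2f%% of list price" % (name, frac * 100))
print("writes per day: %d, per year: %d" % (writes_per_day, writes_per_year))
```

That is a $30 premium against a five-figure list price, for a workload of 
a few thousand small writes per day.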

I'm just not buying it. I'm speculating without any facts here, but a 
manufacturing defect or bad BIOS/firmware interaction that would require 
an RMA seems far more likely to me. Maybe Juniper doesn't want to RMA 
every M7i and M10i routing engine they've sold, and thinks that reducing 
the error rates by writing to the drive less is a fix for some of the 
problems?

-- 
Richard A Steenbergen <ras at e-gerbil.net>       http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)