[j-nsp] High failure rates for M7i/M10i hard disks?
Richard A Steenbergen
ras at e-gerbil.net
Fri Aug 26 17:05:05 EDT 2005
On Fri, Aug 26, 2005 at 10:33:16PM +0200, sthaug at nethelp.no wrote:
> > > And that field alert is now out: PSN-2005-08-014
> > >
> > > https://www.juniper.net/alerts/viewalert.jsp?actionBtn=Search&txtAlertNumber=PSN-2005-08-014&viewMode=view
> >
> > I hope they're not actually saying that the hard drive can't handle being
> > written to every 10 secs?
>
> I'm not going to defend Juniper here - I think we have suffered quite
> enough of these disk problems (got woken by a phone call from our NOC
> this morning - *another* M7i had stopped working during the night, from
> the same problem).
>
> However, I *think* what they're saying is that writes every 10 seconds
> for a while is not a problem, but writes every 10 seconds 24x7 may be a
> problem. Remember what the disk manufacturers have been trying to tell
> us - there are differences between disks made for heavy-duty server use
> (typically SCSI) and disks made for PC/home use (typically ATA). The
> M7i/M10i disks are 2.5" ATA disks (laptop type disks) and are probably
> not made for continuous use.

As I understand it, they're claiming that writing to the drive every 10
secs is preventing the thermal recalibration. I'm not a hard drive
engineer so I can't say for certain what is or isn't necessary, but like I
said, my bullshit meter is going off on log file writes every 10 secs
preventing thermal recal.

However, I will definitely say that the rest is nonsense. First, the
drives are exactly the same, the only difference is the type of interface
attached to the drive. Yes, the drive manufacturers will reserve the
fastest and nicest drives for the more expensive commercial-use interfaces
(Fibre Channel, SCSI, etc), but there are in fact specifically targeted
server grade 2.5" ATA drives for use in blade servers. These drives are
subjected to the same high-volume 24/7 reads and writes as any other
server drive; I can't imagine how a small log file every 10 secs could
possibly compare.

I checked on the specific drive used in an RE-5.0/RE-400, based on
information provided by some actual users of it. The drive detects as a:
ad1: 19077MB <HTS548020M9AT00> [38760/16/63] at ata0-slave using UDMA33
Which seems to be a Travelstar 5K80-20 5400RPM 20G ATA-6 drive:
http://www.hitachigst.com/portal/site/en/menuitem.4a8443e5524e0c5deb4703e3aac4f0a0/
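As a sanity check on that identification, the geometry in the probe line can be turned back into a capacity. This is only a rough sketch: it assumes the bracketed numbers are cylinders/heads/sectors-per-track and that sectors are 512 bytes, neither of which the probe line states. The function name is made up for illustration.

```python
SECTOR_BYTES = 512  # assumption: conventional ATA sector size, not stated in the probe line


def chs_capacity(cylinders, heads, sectors_per_track, sector_bytes=SECTOR_BYTES):
    """Return the capacity in bytes implied by a C/H/S geometry."""
    return cylinders * heads * sectors_per_track * sector_bytes


# Geometry copied from the probe line: [38760/16/63]
size_bytes = chs_capacity(38760, 16, 63)

# Whole binary megabytes (2**20 bytes), the unit the probe line appears to use
print(size_bytes // 2**20, "MB (binary)")

# Decimal gigabytes, the unit drive vendors advertise
print(round(size_bytes / 10**9, 2), "GB (decimal)")
```

Under those assumptions the geometry works out to 19077 binary MB, matching the probe line, and roughly 20 decimal GB, which fits a 20G model.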

It seems the drives they are marketing for blade server use are the
E#K## models, not the regular #K## ones. For example:
http://www.hitachigst.com/portal/site/en/menuitem.ec03cadee7c6fb5deb4703e3aac4f0a0/
vs
http://www.hitachigst.com/portal/site/en/menuitem.c8c3966a526cfb5deb4703e3aac4f0a0/

A search on pricewatch seems to put the price for these models (remember
this is 60G, 3x the capacity of the RE drive, and 7200RPM) at $179 for the
non-E, and $209 for the E ($30 difference). Now, I don't know for certain
if this drive is actually any "longer lived", or if it just offers faster
access rates, but we do know that an actual server blade grade HD is
obtainable very cheaply. Given that the list price of an RE-400-256 is
$15,000, and of an RE-850-1536 is $20,000, you're going to have a pretty
damn hard time convincing me that Juniper couldn't make sure $30 more per
unit was spent to get drives which could handle updating a log file every
10 seconds.
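To put the $30 in proportion, here is a back-of-the-envelope sketch using only the figures quoted above (the pricewatch drive prices and the two RE list prices). The names are made up for illustration, and it compares a retail price gap against list prices, so treat it as a rough ratio rather than a real bill-of-materials number.

```python
# Figures quoted in the post, in USD
RE_LIST_PRICES = {"RE-400-256": 15_000, "RE-850-1536": 20_000}
DRIVE_PREMIUM = 209 - 179  # E model price minus non-E model price


def premium_share(list_price, premium=DRIVE_PREMIUM):
    """Return the drive price premium as a percentage of the list price."""
    return 100.0 * premium / list_price


for model, price in RE_LIST_PRICES.items():
    print(f"{model}: ${DRIVE_PREMIUM} is {premium_share(price):.2f}% of list price")
```

That comes to about 0.20% of the RE-400-256 list price and 0.15% of the RE-850-1536 list price.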

I'm just not buying it. Just speculating without any facts here, but a
manufacturing defect or bad BIOS/firmware interaction that would require
an RMA seems far more likely to me. Maybe Juniper doesn't want to RMA
every M7i and M10i routing engine they've sold, and thinks that reducing
the error rates by writing to the drive less is a fix for some of the
problems?

--
Richard A Steenbergen <ras at e-gerbil.net> http://www.e-gerbil.net/ras
GPG Key ID: 0xF8B12CBC (7535 7F59 8204 ED1F CC1C 53AF 4C41 5ECA F8B1 2CBC)