[j-nsp] SSD disks high failure ratio ?

Paul Stewart paul at paulstewart.org
Tue Oct 8 11:37:52 EDT 2013


I already upgraded the software on our "affected" unit and didn't have a
problem - thankfully it looks like we got off lucky... :)

From:  Pierre-Yves Maunier <j-nsp at maunier.org>
Date:  Tuesday, 8 October, 2013 10:50 AM
To:  Paul Stewart <paul at paulstewart.org>
Cc:  Phil Mayers <p.mayers at imperial.ac.uk>, Saku Ytti <saku at ytti.fi>,
Juniper-Nsp List <juniper-nsp at puck.nether.net>
Subject:  Re: [j-nsp] SSD disks high failure ratio ?

> I confirmed just by serial number and also by the fact that during the reboot
> after a software upgrade my filesystem died on the /var partition.
> 
> I'm still waiting for confirmation from the TAC.
> 
> 
> On Tuesday, October 8, 2013, Paul Stewart  wrote:
>> Did you confirm by serial number that you were affected?  The reason I ask
>> is we had a pair of RE1800's that matched on part number, but after JTAC
>> ran the serial numbers they reassured us that we were not actually
>> affected (which is kind of scary in itself).
>> 
>> Paul
>> 
>> 
>> On 2013-10-07 7:58 PM, "Pierre-Yves Maunier" <j-nsp at maunier.org> wrote:
>> 
>>> >Hello,
>>> >
>>> >I have affected REs, and before I knew about the KB I found a
>>> >workaround to repair the filesystem, because the TAC was unable to tell me
>>> >anything about this KB.
>>> >
>>> >After an upgrade from 12.2R1.8 to 12.3R4.6 I got this :
>>> >
>>> >=================== Bootstrap installer starting ===================
>>> >Initialized the environment
>>> >Routing engine model is RE-S-1800x4
>>> >HW model is Intel(R) Xeon(R) CPU           C5518  @ 1.73GHz
>>> >[: kontron: unexpected operator
>>> >Discovered that flash disk = ad0 , hard disk = ad1
>>> >mount: /dev/ad1s1f : Invalid argument
>>> >ERROR: mount_partition: Mount /dev/ad1s1f /mnt failed
>>> >You are now in a debugging subshell (you may not see a prompt)...
>>> >#
>>> >
>>> >And after a reboot I got this :
>>> >
>>> >Automatic reboot in progress...
>>> >** /dev/ad1s1a
>>> >FILE SYSTEM CLEAN; SKIPPING CHECKS
>>> >clean, 1673532 free (124 frags, 209176 blocks, 0.0% fragmentation)
>>> >** /dev/ad1s1e
>>> >FILE SYSTEM CLEAN; SKIPPING CHECKS
>>> >clean, 201639 free (31 frags, 25201 blocks, 0.0% fragmentation)
>>> >Cannot find file system superblock
>>> >32 is not a file system superblock
>>> >28740192 is not a file system superblock
>>> >** /dev/ad1s1f
>>> >
>>> >
>>> >LOOK FOR ALTERNATE SUPERBLOCKS? yes
>>> >
>>> >
>>> >SEARCH FOR ALTERNATE SUPER-BLOCK FAILED. YOU MUST USE THE
>>> >-b OPTION TO FSCK TO SPECIFY THE LOCATION OF AN ALTERNATE
>>> >SUPER-BLOCK TO SUPPLY NEEDED INFORMATION; SEE fsck(8).
>>> >tunefs: /var: could not read superblock to fill out disk
>>> >mount: /dev/ad1s1f : Invalid argument
>>> >WARNING:
>>> >WARNING: /var mount failed, building emergency /var
>>> >WARNING:
>>> >Creating initial configuration...mgd: commit complete
>>> >Setting initial options:  debugger_on_panic=NO debugger_on_break=NO.
>>> >Starting optional daemons:  usbd.
>>> >Doing initial network setup:
>>> >.
>>> >Initial interface configuration:
>>> >
>>> >
>>> >So the /var partition on /dev/ad1s1f (SSD) needed an fsck, but it failed
>>> >because of a 'bad superblock'.
>>> >
>>> >Going into the shell as root, I issued the following command to get a list
>>> >of 'backup' super-blocks:
>>> >
>>> >root at CORE-01% newfs -N /dev/ad1s1f
>>> >/dev/ad1s1f: 18342.8MB (37566076 sectors) block size 16384, fragment size 2048
>>> >     using 100 cylinder groups of 183.69MB, 11756 blks, 23552 inodes.
>>> >super-block backups (for fsck -b #) at:
>>> > 32, 376224, 752416, 1128608, 1504800, 1880992, 2257184, 2633376, 3009568,
>>> > 3385760, 3761952, 4138144, 4514336, 4890528, 5266720, 5642912, 6019104,
>>> > 6395296, 6771488, 7147680, 7523872, 7900064, 8276256, 8652448, 9028640,
>>> > 9404832, 9781024, 10157216, 10533408, 10909600, 11285792, 11661984, 12038176,
>>> > 12414368, 12790560, 13166752, 13542944, 13919136, 14295328, 14671520,
>>> > 15047712, 15423904, 15800096, 16176288, 16552480, 16928672, 17304864,
>>> > 17681056, 18057248, 18433440, 18809632, 19185824, 19562016, 19938208,
>>> > 20314400, 20690592, 21066784, 21442976, 21819168, 22195360, 22571552,
>>> > 22947744, 23323936, 23700128, 24076320, 24452512, 24828704, 25204896,
>>> > 25581088, 25957280, 26333472, 26709664, 27085856, 27462048, 27838240,
>>> > 28214432, 28590624, 28966816, 29343008, 29719200, 30095392, 30471584,
>>> > 30847776, 31223968, 31600160, 31976352, 32352544, 32728736, 33104928,
>>> > 33481120, 33857312, 34233504, 34609696, 34985888, 35362080, 35738272,
>>> > 36114464, 36490656, 36866848, 37243040
>>> >
>>> >Then this command fixed the problem (376224 is the first super-block after
>>> >'32', which seems to have an issue):
>>> >
>>> >root at CORE-01% fsck_ufs -y -b 376224 /dev/ad1s1f
>>> >
>>> >Does anyone know what the 'software solution' is that 'has also been
>>> >developed to correct the affected REs in the field', as the KB puts it?
>>> >
>>> >Pierre-Yves
>>> >
>>> >
>>> >
>>> >2013/10/4 Phil Mayers <p.mayers at imperial.ac.uk>
>>> >
>>>> >> Saku Ytti <saku at ytti.fi> wrote:
>>>>> >> >On (2013-10-03 18:08 -0400), Paul Stewart wrote:
>>>>> >> >
>>>>>> >> >> "Article is in review and not yet ready for viewing"
>>>>> >> >
>>>>> >> >http://kb.juniper.net/InfoCenter/index?page=content&id=TSB16210
>>>>> >> >
>>>>>> >> >> http://kb.juniper.net/InfoCenter/index?page=content&id=S:TSB16164&smlogin=
>>>>> >> >
>>>>> >> >--
>>>>> >> >  ++ytti
>>>>> >> >_______________________________________________
>>>>> >> >juniper-nsp mailing list juniper-nsp at puck.nether.net
>>>>> >> >https://puck.nether.net/mailman/listinfo/juniper-nsp
>>>> >>
>>>> >> Thanks, this is very useful - does look like our new REs are affected :o(
>>> >
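
The recovery steps quoted above (list the backup super-blocks with
newfs -N, then point fsck_ufs -b at the first one after 32) could be
scripted roughly as below. This is only a sketch: it assumes a POSIX sh
and awk are available from the RE's root shell, which has not been
verified here, and the two function names are made up for illustration.
The actual fsck_ufs call is left commented out so that running the
snippet changes nothing on disk.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of Junos or FreeBSD.

# Read `newfs -N` output on stdin and print each backup super-block
# offset on its own line.
list_backup_superblocks() {
    awk '
        # Once the header has been seen, split the comma-separated
        # offsets and print every purely numeric field.
        found {
            gsub(/,/, " ")
            for (i = 1; i <= NF; i++)
                if ($i ~ /^[0-9]+$/) print $i
        }
        /super-block backups/ { found = 1 }
    '
}

# Print the first offset other than 32, the copy the boot log in this
# thread already reported as "not a file system superblock".
first_alternate_superblock() {
    list_backup_superblocks | awk '$1 != 32 { print; exit }'
}

# Intended use on the RE (repair step left commented out on purpose):
#   sb=$(newfs -N /dev/ad1s1f | first_alternate_superblock)
#   fsck_ufs -y -b "$sb" /dev/ad1s1f

# Demo against the first lines of the newfs -N output quoted in this thread.
first_alternate_superblock <<'EOF'
/dev/ad1s1f: 18342.8MB (37566076 sectors) block size 16384, fragment size 2048
     using 100 cylinder groups of 183.69MB, 11756 blks, 23552 inodes.
super-block backups (for fsck -b #) at:
 32, 376224, 752416, 1128608, 1504800, 1880992, 2257184, 2633376, 3009568,
EOF
```

With the sample input the demo prints 376224, the same offset used in
the quoted fsck_ufs command. If that copy were damaged too, the next
offsets from list_backup_superblocks would be the ones to try.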



