[cisco-voip] 7825-I4 RAID and HDD strange behaviour or not !!!

Tue Feb 14 13:39:55 EST 2012

Hi Ryan,

Thanks for your support.

R.R. At this point I'd guess that even 3B06 is old.  Make sure your drives
are on the latest firmware, and make sure your array is up and completely
stable before applying the update. If it's degraded when you try to patch it
can miss one of the drives and leave it on old firmware.  If you hit
CSCti52867 then you absolutely have to run the diskex...cop.sgn file (when
the array is optimal) and really should rebuild the server.  Just upgrading
the firmware is not sufficient to fix or prevent future outages once the
drives have started failing due to this bug.

S.DJ. The RAID was optimal before any action and after the install of new
CUCM 7.1.5 on the new server (from RMA). We did run the disk exerciser; after
that we upgraded the firmware to 3B06 (according to the CSCti52867
instructions). After the firmware upgrade, the RAID was in resync state for
several hours (as expected). So, at the time of testing and pulling one of
the HDDs out, the system was fully operational with a synced RAID. Regarding
3B06, I think there is a 3B07 out there (if I'm not mistaken), but I wasn't
aware that we should upgrade to it just like that.

R.R. If the BIOS doesn't detect the drive you've got a hardware issue either
with the array or the disk, period.  If the BIOS detects the drive but no OS
is found then the drive most likely didn't get fully sync'd from a previous
degraded state.

S.DJ. As I said, the RAID utility reported the RAID as degraded (as expected,
since an HDD was pulled out). The drive was fully synced according to the CLI
"show hardware" output while the server was up and running (before shutting
it down).

R.R. I wouldn't place bets on that just yet.

S.DJ. I cannot imagine having 4 failed drives in two different servers when
all IBM and Cisco diagnostic tools show them as fully operational. Of course,
everything is possible.

As I said, it's symptomatic that we have identical behavior on both servers,
the original one and the new one (so I can exclude a HW failure, except for a
serial problem).

regards..

Sinisa Djokic 

System Engineer
CCIE #25996 Voice

MDS Informaticki inzenjering
Milutina Milankovica 7d
11070 Novi Beograd, Serbia
Tel:  +381 11 2015 200, 2015 273
Fax: +381 11 3194 954
www.mds.rs
sdjokic at mds.rs

This e-mail message and any attachment are intended exclusively for the
named addressee. They may contain confidential information which may also be
protected by professional secrecy. Unless you are the named addressee (or
authorised to receive for the addressee) you may not copy or use this
message or any attachment or disclose the contents to anyone else. If this
e-mail was sent to you by mistake please notify the sender immediately and
delete this e-mail.

Save a tree. Don't print this e-mail unless it's really necessary.

From: Ryan Ratliff [mailto:rratliff at cisco.com] 
Sent: Tuesday, February 14, 2012 5:18 PM
To: Sinisa Djokic
Cc: 'Cisco VOIP'
Subject: Re: [cisco-voip] 7825-I4 RAID and HDD strange behaviour or not !!!

In general RAID1 should allow the system to boot off of a single drive after
a hardware failure.  That's what it is designed to do (afaik; I'm not a RAID
expert, I only play one on the internet).  On the 7825 array, in my
experience, if you are doing lots of pulling drives, failing drives,
upgrading firmware on drives, etc., that's going to mean lots and lots of
array resyncing.  If that resynchronization doesn't complete you don't have a
redundant array.  The server won't be able to use the backup drive if the
failure occurs when the array isn't optimal.
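
A minimal sketch of that point in Python, assuming a simplified two-drive
mirror model. The Member class, the synced_fraction field and the state names
are illustrative only; they are not the output of any Cisco or IBM tool. The
idea is that a member only counts as a usable copy once its resync has
completed, so losing the good drive mid-resync leaves nothing to boot from.

```python
from dataclasses import dataclass


@dataclass
class Member:
    """One drive of a RAID1 mirror (hypothetical model, not a vendor API)."""
    present: bool = True
    synced_fraction: float = 1.0  # 1.0 means a complete mirror copy

    @property
    def bootable(self) -> bool:
        # A partially resynced drive does not hold a complete copy.
        return self.present and self.synced_fraction >= 1.0


def array_state(members):
    """Classify the mirror using made-up state names."""
    if all(m.bootable for m in members):
        return "OPTIMAL"
    if any(m.bootable for m in members):
        return "NOT_REDUNDANT"
    return "FAILED"


def survives_loss_of(members, index):
    """True if a complete copy remains after losing members[index]."""
    remaining = [m for i, m in enumerate(members) if i != index]
    return any(m.bootable for m in remaining)


# BAY 0 holds a full copy, BAY 1 is only 60% through its resync.
mirror = [Member(), Member(synced_fraction=0.6)]
```

With this mirror, pulling BAY 1 still leaves a complete copy, but pulling
BAY 0 does not, even though both drives are physically healthy.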

More inline below...

-Ryan

On Feb 14, 2012, at 10:43 AM, Sinisa Djokic wrote:

Hi group,

Does anybody have expertise on this matter?

Maybe Wes or Ryan have inside info.

We had a TAC case with a failed HDD and a problematic RAID controller in an
MCS-7825-I4 server running CUCM 7.1.5.

The symptom happened when we shut down the server in order to check which IBM
FRU the memory inside the server is. We checked it and after that we powered
up the server. The RAID controller reported a DEGRADED state. The Cisco OS
started to boot and at a certain point it reported something like
"unsupported hardware......not for the production..without TAC
support..bla..bla..bla". We opened the case.

During the case we upgraded all kinds of firmware on the server and also
found out that one HDD had failed. We replaced it and did more
troubleshooting.

Finally we got an RMA for the server, but we have spotted very strange
behavior on the new server as well (we put it into the lab to test it). So
we're wondering: is the following expected behaviour, or is there a serious
problem in the Cisco OS or the IBM HW?

1.       First of all, we upgraded the HDD firmware on the new server we got
from RMA to 3B06 (it came with 3B05), having in mind CSCti52867, which we
hit earlier. After that we installed CUCM 7.1.5.

At this point I'd guess that even 3B06 is old.  Make sure your drives are on
the latest firmware, and make sure your array is up and completely stable
before applying the update. If it's degraded when you try to patch it can
miss one of the drives and leave it on old firmware.  If you hit CSCti52867
then you absolutely have to run the diskex...cop.sgn file (when the array is
optimal) and really should rebuild the server.  Just upgrading the firmware
is not sufficient to fix or prevent future outages once the drives have
started failing due to this bug.
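
The two checks described above (array optimal before patching, no drive left
on old firmware afterwards) could be sketched roughly like this in Python.
The status string "optimal", the bay-to-firmware mapping and both function
names are assumptions for illustration; they are not the output format of
"show hardware" or of the firmware update itself.

```python
def ready_for_firmware_update(array_status: str) -> bool:
    """Only patch when the array reports optimal (hypothetical status text)."""
    return array_status.strip().lower() == "optimal"


def drives_left_behind(drive_firmware: dict, target: str) -> list:
    """Return the bays whose drive is not on the target firmware level."""
    return sorted(bay for bay, fw in drive_firmware.items() if fw != target)


# Example input: bay number mapped to the firmware level read from that drive.
after_patch = {0: "3B06", 1: "3B05"}
missed = drives_left_behind(after_patch, "3B06")
```

Here missed would contain bay 1, the case where a patch applied to a degraded
array skipped one of the drives.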

2.       When the server is shut down and we pull out the right hard drive
(the one in BAY 1) and power up the server, the RAID controller detects one
drive missing and reports a DEGRADED state. The Cisco OS starts to boot and
at a certain point it again reports something like "unsupported
hardware......not for the production..without TAC support..bla..bla..bla".

See the bug Wes pointed out.  This is not expected behavior.

3.       When we shut down the server again and switch the scenario, putting
the right HDD back in the server (BAY 1) and pulling the left HDD out
(BAY 0), and then power up the server, it doesn't even detect a bootable
device, as if there were no HDDs inside.

If the BIOS doesn't detect the drive you've got a hardware issue either with
the array or the disk, period.  If the BIOS detects the drive but no OS is
found then the drive most likely didn't get fully sync'd from a previous
degraded state.
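
As a rough illustration, that rule of thumb can be written as a small
decision function in Python. The function name, the two boolean inputs and
the returned wording are made up for this sketch; in practice the inputs
would come from watching the BIOS and boot screens.

```python
def triage(bios_detects_drive: bool, os_found: bool) -> str:
    """Map the two boot-time observations to the likely cause."""
    if not bios_detects_drive:
        return "hardware issue with the array or the disk"
    if not os_found:
        return "drive most likely not fully synced from a previous degraded state"
    return "drive boots; this check shows no fault"
```

For scenario 3 (no bootable device found), the answer depends on whether the
BIOS lists the remaining drive at all, which is the first thing to confirm.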

4.       So, when we shut down the server again, put both HDDs back where
they belong and power up the server, everything works fine. Of course, the
RAID is resyncing.

Does it ever finish resyncing?  That can take hours and until it does the
drive that is not in a good state is not redundant.  

5.       Of course, both HDDs used in the server are in a correct state and
haven't failed.

I wouldn't place bets on that just yet.

So I'm a little bit confused, since it's happening on 2 different servers
with an identical scenario. My perception of RAID1 is in serious doubt.

Does this mean that if an HDD fails during a regular maintenance shutdown
(which is what we had in the first place), RAID1 won't provide an operational
system?

To be honest, we didn't have a failed disk while the server was up and
online, so we couldn't observe what happens in that situation.

But the fact is, with the old server and with the new server, when we pull
out one of the HDDs the system won't boot up properly.

Is this expected behaviour? What's the purpose of RAID1 then?

Is RAID1 expected to cover just an online HDD failure (if so), or should it
work in every scenario?

Any thoughts?

Thanks.

regards..

Sinisa Djokic


