[cisco-voip] 7825-I4 RAID and HDD strange behaviour or not !!!

Wes Sisk wsisk at cisco.com
Wed Feb 15 13:57:01 EST 2012


For anyone on the list following along we are working with Sinisa off-list to get this resolved.

Regards,
Wes

On Feb 15, 2012, at 1:20 AM, Sinisa Djokic wrote:

hi wes..
 
i did contact my cisco account team..and we’re going to discuss what to do next..
what i’m afraid of is that i’m not sure what to expect from my TAC case..
to be honest, i’m getting terrible support from cisco TAC and IBM, and the bottom line is:
1.       IBM thinks they don’t have a problem, since the RAID works in the sense that it reports degradation when one HDD is pulled out and resynchs when the disk is back in the server..although our situation is more specific: we don’t have a bootable device when the left HDD is pulled out..
2.       the cisco engineer doesn’t have a clue about almost anything in the case and just bounces it back to IBM..i even got a line from him which pretty much tells me about the level of expertise ( “Sinisa, I’m sorry, but I have not been able to find you an answer. Our group is geared to handle hardware break-fix. We were able to replace the drive, but in terms of failure analysis or RAID configurations, I just don’t have any documentation or materials to properly answer your questions.” )..
3.       from the TAC perspective we had an operational system when we initially replaced the failed HDD, and all the work and troubleshooting after that was the result of our own efforts to clear this up ( including the RMA )..
4.       so, keeping in mind that IBM doesn’t dig deep enough, and that i’m stuck with an engineer who “doesn’t know and doesn’t have docs and materials”..i’m in a pretty sparkling situation :-) ..i cannot see an end to it..
 
 
thanx..
 
regards..
 
Sinisa Djokic
System Engineer
CCIE #25996 Voice
 
 
MDS Informaticki inzenjering
Milutina Milankovica 7d
11070 Novi Beograd, Serbia
Tel:  +381 11 2015 200, 2015 273
Fax: +381 11 3194 954
www.mds.rs
sdjokic at mds.rs
This e-mail message and any attachment are intended exclusively for the named addressee. They may contain confidential information which may also be protected by professional secrecy. Unless you are the named addressee (or authorised to receive for the addressee) you may not copy or use this message or any attachment or disclose the contents to anyone else. If this e-mail was sent to you by mistake please notify the sender immediately and delete this e-mail.
Save a tree. Don’t print this e-mail unless it’s really necessary.
 
From: Wes Sisk [mailto:wsisk at cisco.com] 
Sent: Tuesday, February 14, 2012 9:42 PM
To: Sinisa Djokic
Cc: 'Cisco VOIP'
Subject: Re: [cisco-voip] 7825-I4 RAID and HDD strange behaviour or not !!!
 
Hi Sinisa,
 
Yes, your account team needs to advocate to get CSCtd86222 resolved.
 
Yes, you are still having hardware issues if drives will not sync or boot.  Work with TAC to get additional support from IBM.
 
Regards,
Wes
 
On Feb 14, 2012, at 1:39 PM, Sinisa Djokic wrote:


hi wes..
 
thanx on your support..
 
if we’re hitting this bug, then i’m pleased that i’ve passed the sanity check regarding expected RAID1 behavior :-) ..although i would like to be sure we’re hitting it, so i can explain it to the customer and be sure that’s the core issue..
but i’m a little bit confused about two things and it would be great if you could clarify..
 
 
W.S. Net result is any missing drive causes "unsupported platform". Let your Cisco account team know this is a problem. Get your TAC case attached to this to provide more evidence of customer impact.  This is all a software limitation that could be overcome with sufficient motivation.
 
S.DJ. do i understand you correctly?..you mean that we should inform our cisco account team so they can push it and ask for an enhancement linked to bug CSCtd86222 ( severity 6 – enhancement )..
we’re going to do that, although i’m not sure how we can describe a non-functional RAID as just an enhancement..it looks much more severe to me..since customers who bought servers with more than one HDD ( everything other than the 7816 ) don’t have redundancy..in our case we have a failed publisher because of one failed HDD..
so, in this case i suppose the motivation is driven by the number of enhancement requests :-) ..
 
W.S. Otherwise, the 7825 hardware uses a rather poor implementation of RAID IMHO.  Technically it is mirrored but we just do not see the same level of reliability in that platform.  You are still facing a hardware failure and we must look to IBM to resolve that
 
S.DJ. what do you mean when you say we’re still facing a HW failure?..we have a new server, with new HDDs, and the system is fully operational until we pull out either of the HDDs ( while the server is shut down )..the RAID is in an operational state, everything looks fine, until one HDD is pulled out..if the left one ( the primary one ) is pulled out we don’t even have a bootable device, and if the right one is pulled out we get the unsupported hardware issue..if you think we still have a HW issue, what should we tell TAC and IBM?..it’s new hardware and it’s behaving exactly the same as the old one..so it should be an issue on all the servers..
 
 
 
so, the bottom line is, do you think we should just ask our cisco account team to push it, or should we push IBM or TAC for further troubleshooting?..
 
thanx..
 
 
regards..
 
Sinisa Djokic
 
From: Wes Sisk [mailto:wsisk at cisco.com] 
Sent: Tuesday, February 14, 2012 5:09 PM
To: Sinisa Djokic
Cc: 'Cisco VOIP'
Subject: Re: [cisco-voip] 7825-I4 RAID and HDD strange behaviour or not !!!
 
Sinisa,
 
CallManager hot issues RSS feed is your friend:
https://supportforums.cisco.com/docs/DOC-5727
 
In the CallManager feed there is this well known defect:
Absence of a single physical disk is preventing License Manager to start, Open CSCtd86222

Symptom:

License Manager doesn't startup after rebooting a server.


Conditions:

CUCM Server will prompt about a Hardware Configuration Failure if one of the drives from either logical drive arrays are missing. If the customer chooses to continue License Manager service will never startup. Lack of License Manager on the Publisher node prevents any administrative changes being performed like adding a new phone or deleting.


Workaround:

Ensure all 4 drives are plugged in to the server even if they are failed/defunct drives.
 
 
 
Net result is any missing drive causes "unsupported platform". Let your Cisco account team know this is a problem. Get your TAC case attached to this to provide more evidence of customer impact.  This is all a software limitation that could be overcome with sufficient motivation.
 
Otherwise, the 7825 hardware uses a rather poor implementation of RAID IMHO.  Technically it is mirrored but we just do not see the same level of reliability in that platform.  You are still facing a hardware failure and we must look to IBM to resolve that.
 
/wes
 
 
 
On Feb 14, 2012, at 10:43 AM, Sinisa Djokic wrote:
 
hi group..
 
does anybody have expertise on this matter..
maybe wes or ryan have inside info..
 
we had a TAC case with a failed HDD and a problematic RAID controller in an MCS-7825-I4 server running CUCM 7.1.5..
the symptom appeared when we shut down the server in order to check which IBM FRU is on the memory inside the server..we checked it and after that we powered up the server..the RAID controller reported a DEGRADED state..cisco OS started to boot and at a certain point it reported something like “unsupported hardware......not for production..without TAC support..bla..bla..bla”..we opened the case..
during the case we upgraded all kinds of firmware on the server and also found out that one HDD had failed..we replaced it and did more troubleshooting..
finally we got an RMA for the server but have spotted very strange behavior on the new server as well ( we put it into the lab to test it )..so, we’re wondering: is the following expected behaviour, or is there a serious problem in cisco OS or the IBM HW?..
 
1.       first of all we upgraded the HDD firmware on the new server we got from the RMA to 3B06 ( it came with 3B05 ), keeping in mind CSCti52867, which we hit earlier..after that we installed CUCM 7.1.5..
2.       with the server shut down, we pull out the right hard drive ( the one in BAY 1 )..we power up the server..the RAID controller detects one drive missing and reports a DEGRADED state..cisco OS starts to boot and at a certain point it again reports something like “unsupported hardware......not for production..without TAC support..bla..bla..bla”..
3.       we shut down the server again and switch the scenario..we put the right HDD back in the server ( BAY 1 ) and pull the left HDD out ( BAY 0 )..when we power up the server, it doesn’t even detect a bootable device, as if there were no HDDs inside..
4.       so, we shut down the server again, put both HDDs back where they belong and power up the server..everything works fine..of course, the RAID is resynching..
5.       of course both HDDs used in the server are in a correct state and haven’t failed..
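
The tests above can be summarised in a short script. This is only an illustrative sketch: the names PullTest, EXPECTED and deviations are made up for this example, the outcomes are the ones described in the list above, and the "expected" result is the usual assumption that a healthy RAID1 mirror still boots ( degraded ) with one member missing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PullTest:
    removed_bay: str  # bay emptied while the server was powered off
    raid_state: str   # what the RAID controller reported at power-up
    observed: str     # what happened during boot


# Assumption: a healthy RAID1 mirror boots in degraded mode with one member missing.
EXPECTED = "boots degraded"

# Outcomes as described in the lab tests above (hypothetical labels, not tool output).
TESTS = [
    PullTest("BAY 1 (right)", "DEGRADED", "boots, then 'unsupported hardware' warning"),
    PullTest("BAY 0 (left)", "not stated", "no bootable device detected"),
    PullTest("none", "resynching", "boots normally"),
]


def deviations(tests):
    """Return the single-drive-removed tests that did not end in a clean degraded boot."""
    return [t for t in tests if t.removed_bay != "none" and t.observed != EXPECTED]


for t in deviations(TESTS):
    print(f"{t.removed_bay}: expected '{EXPECTED}', observed '{t.observed}'")
```

Both single-drive cases show up as deviations, which is the core of the question below.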
 
so, i’m a little bit confused, since it’s happening on 2 different servers..identical scenario..my RAID1 perception is in serious doubt..
does this mean that if HDD failed during regular maintenance shutdown ( which we had in the first place ), RAID shouldn’t provide operational system?..
to be honest , we didn’t have failed disk while server was  up and online so we couldn’t notice what’s going on in that situation..
but the fact is, with the old server and with the new server, when we pull out one of the HDDs the system wouldn’t boot up properly..
is this expected behaviour?..what’s the purpose of RAID1 then?..
is RAID1 expected to cover just online HDD failure ( if so ) or it should work in every scenario?..
any thoughts?..
 
thanx..
 
 
regards..
 
Sinisa Djokic
 
_______________________________________________
cisco-voip mailing list
cisco-voip at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-voip
 
 
