[cisco-voip] advice on upgrading large CUCM cluster with CoW from 8.6 to 10.5

Daniel Pagan dpagan at fidelus.com
Wed Dec 2 09:48:27 EST 2015


Comments below. Hope this helps.

- Dan

From: cisco-voip [mailto:cisco-voip-bounces at puck.nether.net] On Behalf Of Dave Goodwin
Sent: Tuesday, December 01, 2015 8:41 PM
To: cisco-voip at puck.nether.net
Subject: [cisco-voip] advice on upgrading large CUCM cluster with CoW from 8.6 to 10.5

Has anyone performed an upgrade of a large cluster that uses Clustering over the WAN from 8.6 to 10.5 that can share any lessons learned with that specific type of scenario? The cluster is already virtual, has 16 nodes, and is spread across 3 geographic areas. There is already a standalone PLM on the network that will be loaded with the licenses prior to the upgrade. I am trying to decide whether to do a standard upgrade, or utilize PCD to perform a Migration task.

DPagan: Yes - the largest would have been a 20-node cluster, and I decided not to use PCD for the task for a handful of reasons. First, I (and my colleagues on the support team) feel more comfortable upgrading UC platforms manually - not only do we have full control over the process, but we’re also staffed 24/7, so scheduling an upgrade for after hours isn’t an inconvenience that would otherwise be handled by a scheduled PCD task. Second, due to the sensitivity of PCD, recovering from a failed automated upgrade on a ~20 node cluster is much more difficult than on a smaller cluster (2-5 nodes, for example). Third, we upgrade UC platforms by hand pretty often and are very comfortable with the manual process. For these reasons, I decided against using PCD for an upgrade of this size.

For the standard upgrade, I know there are a handful of things that need to be done to prepare, like installing the necessary COP files. After the initial RU has been completed to 10.5, I know that I would feel the need to backup/reimage/restore, because I want the new VMs to have the ext4 filesystem, utilize the proper NIC driver for 10.5, and have the new partition sizing.

DPagan: No concerns here, and it’s a valid desire to restore your 10.5 cluster to new virtual machines. You might want to consider restoring while on 8.6, though, for two reasons: 1) you avoid having to re-license after going to 10.5, and 2) there are no problems installing 8.6 on a 10.5-spec VM… but it’s your call. Personally, I would schedule a restore of 8.6 to 10.5-spec virtual machines one week before the major upgrade.

For a PCD Migration, I have had a few issues trying to use it in the past for other tasks. It seems to have gotten incrementally better and less fussy over the past year or more, but it can still be somewhat fragile. I see that the latest version of PCD, 11.0, supports the use of remote SFTP servers for clusters that are remote from the PCD instance running the task. Assuming it works as advertised, that should eliminate the significant performance issues that would occur without that feature. The appeal here is that I know the VMs made by PCD are freshly installed machines with all the right 10.5 traits I mentioned above, with data imported from the source cluster. I would not have to worry about the time required (and the chance of missing one of the many steps) to do all the manual tasks on a 16-node cluster.

DPagan: Unfortunately the remote SFTP server feature does not apply to migrations, only to stand-alone upgrades, so this won’t be an option for you on this task. However, this should work without issues if you restore to 10.5-spec VMs while still on 8.6, then upgrade in-place using PCD and the remote SFTP server option… but then this means using PCD. Some of my coworkers on the deployment team swear by it, but I *personally* would avoid it for large-scale migrations/upgrades like this (then again, we’ve performed upgrades by hand for years, so there’s a level of comfort that comes with that).

The remote SFTP feature would allow you to avoid issues resulting from ISO and DB data transfer over low-bandwidth connections. PCD has built-in timers that aren’t configurable, and when these timers expire prematurely (due to a slow ISO transfer, for example) you can encounter false positives where PCD thinks the upgrade failed while it’s actually still running in the background. PCD determines this through AXL calls for CUCM’s active version - if the upgrade is still running and Tomcat/AXL has yet to start up, the upgrade is flagged as failed.
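That failure mode is why a manual check with a generous deadline beats a fixed timer. A rough sketch of the polling logic (Python; `get_active_version` is a hypothetical stand-in for however you query the node - an SSH `show version active`, an AXL version request, etc. - and the key point is that an unreachable AXL/Tomcat is treated as "still in progress", not as a failed upgrade):

```python
import time

def wait_for_upgrade(get_active_version, target, deadline_s=4 * 3600,
                     poll_s=60, sleep=time.sleep):
    """Poll a node until its active version matches `target`.

    Returns "upgraded" once the probe reports the target version,
    or "pending" if the deadline passes first. A probe failure
    (e.g. Tomcat/AXL still down mid-upgrade) is treated as
    "still in progress", never as a failed upgrade.
    """
    waited = 0
    while waited <= deadline_s:
        try:
            if get_active_version() == target:
                return "upgraded"
        except ConnectionError:
            pass  # AXL/Tomcat not up yet -- keep waiting
        sleep(poll_s)
        waited += poll_s
    return "pending"  # deadline hit; investigate manually before declaring failure
```

The deadline and poll interval here are illustrative defaults, not values from any Cisco guidance; size them to your maintenance window.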

My suggestion, if you migrate before restoring, is to upgrade manually using ISOs mounted on local datastores as opposed to sourced remotely from PCD.

Finally, since this is a megacluster with CoW on top of that, I am sensitive to the issues that can happen with DB replication after an upgrade, where it can take many hours to complete. Is there a way, either manually or via PCD, to speed that up? For example, I would like to consider running 'utils dbreplication setprocess 40' after the 10.5 publisher is up and running. Would that be the best way to handle it, and will it affect replication setups that have already begun? Does the server or do any services need to be restarted after running it? And does the command only need to be run on the publisher, or must I run it on each node after it comes up on 10.5?

DPagan: I personally haven’t encountered the need to use the setprocess command. For the large migration I mentioned above, the setrepltimeout command was helpful. Ryan Ratliff helped write an article that covers this in detail, including how to calculate the value needed for the command. We did run into replication issues during that migration, but nothing that couldn’t be resolved through standard DB replication troubleshooting. My suggestion: make sure you don’t go into the upgrade with existing replication problems, pause after each Subscriber group is upgraded (I’m assuming you’re grouping the Subs for the upgrade) and allow replication to set up if you want to be very careful, and have Cisco TAC’s number on speed dial.
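For a rough sense of how the setrepltimeout value scales: the tiered sizing I recall from Cisco’s replication troubleshooting guidance is roughly 1 minute per server for the first 5 nodes, 2 minutes each for nodes 6-10, and 3 minutes each beyond 10, applied on the Publisher via 'utils dbreplication setrepltimeout <seconds>'. The tiers below are from memory - verify against the article before relying on them:

```python
def repl_timeout_seconds(node_count):
    """Rough repltimeout sizing: 60 s/node for nodes 1-5,
    120 s/node for nodes 6-10, 180 s/node for nodes 11+.
    (Tiers recalled from Cisco replication troubleshooting
    guidance -- double-check before use.)"""
    tier1 = min(node_count, 5) * 60
    tier2 = min(max(node_count - 5, 0), 5) * 120
    tier3 = max(node_count - 10, 0) * 180
    return tier1 + tier2 + tier3

# For the 16-node cluster in question:
print(repl_timeout_seconds(16))  # 5*60 + 5*120 + 6*180 = 1980
```

Note that a 5-node cluster works out to 300 seconds, which matches the default repltimeout, so the command mainly matters for clusters larger than that.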

Remember to deactivate the EM service on the applicable nodes as well. Also, if you restore 8.6 on 10.5-spec VMs and upgrade in-place, make sure you switch versions on the Publisher before proceeding to upgrade the Subscribers. Avoid upgrading the Publisher, then the Subs, and only then switching -- I can provide the Cisco doc that says not to do this, and I can explain why in further detail if needed.
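The ordering constraint above can be captured as a quick sanity check on a planned runbook: the Publisher’s switch-version step must come before the first Subscriber upgrade begins. The step names here are illustrative labels, not PCD task names:

```python
def publisher_switch_precedes_subs(steps):
    """Return True if the 'switch pub' step occurs before the
    first 'upgrade sub*' step in the planned sequence."""
    for step in steps:
        if step == "switch pub":
            return True
        if step.startswith("upgrade sub"):
            return False
    return False

# Recommended order: upgrade Pub, switch Pub, then handle the Subs.
good_plan = ["upgrade pub", "switch pub",
             "upgrade sub1", "switch sub1",
             "upgrade sub2", "switch sub2"]
# The order to avoid: Pub, then Subs, then switching everything.
bad_plan = ["upgrade pub", "upgrade sub1", "upgrade sub2",
            "switch pub", "switch sub1", "switch sub2"]
print(publisher_switch_precedes_subs(good_plan))  # True
print(publisher_switch_precedes_subs(bad_plan))   # False
```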

If I can offer any additional pieces of advice, I’ll e-mail you directly, but that’s all that comes to mind over the past 10 minutes.

Any experiences or opinions are welcome. Thanks in advance!

-Dave

