[cisco-voip] CUCM split brain question

Wes Sisk wsisk at cisco.com
Wed Apr 6 15:22:36 EDT 2011


For the sake of example:
CM1: headquarters
CM2: branch

When IP connectivity exists between CM1 and CM2, SDL TCP sessions are 
established between the two ccm processes.  Through SDL each server tells 
every other server about its registered devices.  There is opportunity for 
duplicate registration and some propagation time, so there is a window of 
convergence involved.  Once things are in sync, each CM node tells every 
other node about all significant local state changes.  Make sense?  This 
usually helps folks understand why QoS is so critical on SDL links: they 
are the vehicle for synchronization between two real-time processes.  
This isn't quite as sensitive as parallel graphics processing, but it's 
not far off.
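
To make the synchronization idea concrete, here is a toy model in Python.  
It is only a sketch of the behaviour described above, not how the ccm 
process is implemented: the Node class, its methods, and the device name 
are all made up for illustration, and the "link" is a plain method call 
rather than an SDL TCP session.

```python
class Node:
    """Toy stand-in for one call-processing node; not CUCM code."""

    def __init__(self, name):
        self.name = name
        self.local = set()    # devices registered directly to this node
        self.learned = {}     # peer name -> devices that peer announced
        self.peers = []       # nodes reachable over an (assumed) SDL link

    def link(self, other):
        # Bring the link up in both directions and exchange current state.
        for a, b in ((self, other), (other, self)):
            a.peers.append(b)
            a.learned[b.name] = set(b.local)

    def register(self, device):
        # A local state change is announced to every linked peer.
        self.local.add(device)
        for peer in self.peers:
            peer.learned[self.name].add(device)

    def where(self, device):
        """Names of every node this node believes holds the device."""
        owners = [self.name] if device in self.local else []
        owners += [p for p, devs in self.learned.items() if device in devs]
        return sorted(owners)


cm1, cm2 = Node("CM1"), Node("CM2")
cm1.link(cm2)
cm2.register("SEP001122334455")        # hypothetical phone at the branch
print(cm1.where("SEP001122334455"))    # CM1 knows the phone lives on CM2
```

If the same device registered to both nodes during the propagation delay, 
where() would return two names, which is the duplicate registration case.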

When the SDL link goes down, each node forgets about all entities it 
learned from the remote node.  It literally purges them.  Devices have to 
register to their local node (even the best network admins miss some, 
especially when it comes to virtual devices like hunt pilots, route 
lists, and software media resources).  Again there is opportunity for 
some duplicate registration.  SDL links detect an outage on the order of 
10 seconds or less.  SCCP devices send keepalives on the order of every 
30 seconds, with allowance for 1-2 missed keepalives.  10 seconds vs 
60-90 seconds creates a window of overlap.  If a duplicate registration 
is detected, CM resets both device processes.  This extends downtime, but 
there really is no way of knowing which is the "right" registration.  
This is another window of convergence.
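
The arithmetic and the purge can be sketched in a few lines of Python.  
Everything here is illustrative: the function names and the device name 
are made up, and the timeout formula (interval times missed-plus-one) is 
just my reading of how 30 seconds with 1-2 missed keepalives becomes 
60-90 seconds.  Real timers are configurable and will differ.

```python
def registration_timeout_s(keepalive_s, missed_allowed):
    # Assumption: the device abandons its node one full interval after the
    # last tolerated miss, so 30 s with 1-2 misses gives 60-90 s.
    return keepalive_s * (missed_allowed + 1)


def overlap_window_s(sdl_detect_s, keepalive_s, missed_allowed):
    # Time during which the nodes have already purged each other's state
    # but the device has not yet given up on its current registration.
    return registration_timeout_s(keepalive_s, missed_allowed) - sdl_detect_s


def purge_peer(learned, peer):
    # On link loss, drop everything learned from that peer.
    learned.pop(peer, None)


def state_on_node(device, local, learned):
    known = device in local or any(device in devs for devs in learned.values())
    return "registered" if known else "unregistered"


# HQ node's view of a branch phone (hypothetical device name).
hq_local = set()
hq_learned = {"CM2": {"SEP001122334455"}}
before = state_on_node("SEP001122334455", hq_local, hq_learned)
purge_peer(hq_learned, "CM2")
after = state_on_node("SEP001122334455", hq_local, hq_learned)
windows = [overlap_window_s(10, 30, m) for m in (1, 2)]
print(before, after, windows)
```

With the figures above the window works out to 50 to 80 seconds, and 
since SDL detection can take less than 10 seconds that is a floor rather 
than a ceiling.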

So, after convergence the device appears unregistered on the remote 
node.  For an interesting dig into this scenario, take a look at
CSCsc62081    CCM SDL Out of Service / In Service causes Unexpected Unity Failover

and similarly related to realtime synchronization of state machines:
CSCsc62073    Locations Out of Bandwidth causes unexpected Unity Failover

It was the same customer who originated both of these.  This customer 
had truly the worst luck with timing that I have ever seen.

Regards,
Wes


On 4/6/2011 11:51 AM, Ovidiu Popa wrote:
> Hello everyone
>
> Here's an unusual scenario that kind of puzzles me.
>
> Here are the details:
> - 2 CUCM with Clustering over WAN (HQ and Branch)
> - Centralized PSTN Access at HQ (DID numbers routed to HQ)
> - 1 phone with the Branch CUCM as primary and the HQ CUCM as secondary
>
> Disaster strikes, the WAN link goes down and we have a split brain 
> condition.
>
> What is the state of the phone on the HQ CUCM? Will the phone be 
> Unregistered or in state Unknown?
>
> And the most important question is will the HQ CUCM follow the CFUR if 
> the state is Unknown?
>
> Thanks for your input.
>
> Regards,
> Ovidiu