[cisco-voip] CUCM split brain question

Wes Sisk wsisk at cisco.com
Wed Apr 6 17:57:32 EDT 2011


For call processing unknown = unregistered.

Remember those SDL links?  They send signals back and forth called 
"DMPropagateRegister" and UnRegister. Instead of querying RIS, CM just 
checks its internal memory for local or remote registrations.  If ccm 
does not know about registration state (either locally or remotely) then 
it follows CFUR.
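
As a rough illustration of that lookup, here is a minimal Python sketch. 
It is a toy model, not CUCM code: the Node class, its method names, and 
the returned strings are all hypothetical. Only the signal names, the 
purge on SDL link loss, and the CFUR fallback come from this thread.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Toy model of one node's in-memory registration view (hypothetical)."""
    name: str
    local: set = field(default_factory=set)      # DNs registered to this node
    remote: dict = field(default_factory=dict)   # DN -> peer node, learned over SDL

    def on_propagate_register(self, dn, peer):
        # Stand-in for receiving a DMPropagateRegister signal from a peer.
        self.remote[dn] = peer

    def on_propagate_unregister(self, dn):
        # Stand-in for receiving the matching UnRegister signal.
        self.remote.pop(dn, None)

    def on_sdl_link_down(self, peer):
        # Forget everything that was learned from the unreachable peer.
        self.remote = {dn: p for dn, p in self.remote.items() if p != peer}

    def route(self, dn):
        # No RIS query here: only the node's own memory is consulted.
        if dn in self.local:
            return "deliver-local"
        if dn in self.remote:
            return "deliver-via-" + self.remote[dn]
        return "CFUR"  # no known registration, locally or remotely


cm1 = Node("CM1")
cm1.on_propagate_register("2001", "CM2")
print(cm1.route("2001"))   # registration learned from CM2
cm1.on_sdl_link_down("CM2")
print(cm1.route("2001"))   # nothing known any more, so CFUR
```

In this model a phone that never registered to CM1 and a phone that CM1 
has forgotten look identical to call routing, which is the point: both 
end up at CFUR.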

<strictly personal comment>
Thanks for the props.  I'm just one of a large pool of people who 
generate the need, concepts, code, product, knowledge, and use cases.  
I've been involved with a few book projects but I'm much more committed 
to wikis, web publishing, and mailing lists.  Trees are much more 
beautiful on a trail than stacked on a shelf.  Technical information 
changes so quickly that it's obsolete before the ink dries.
</strictly personal comment>

Regards,
Wes

On 4/6/2011 5:13 PM, Ovidiu Popa wrote:
>
> And the gifts keep on coming... the registered/unknown puzzle unraveled :)
>
> Now for the final question:
>
> How does the Call Forward Unregistered handle Unknown states? In my 
> case if the branch phone never registered to CM1 since the last CM 
> service restart its state will be unknown. Will an incoming call to 
> the phone DN follow the CFUR ?
>
> Sorry to pester you with questions and thank you again.
>
> PS: If you ever decide to write a book on CUCM please sign me up on 
> the pre-order list.
> PPS : somehow I don't think I am the only one.
>
> Best regards,
> Ovidiu
>
>
>
> On 06/Apr/11 10:49 PM, Wes Sisk wrote:
>> If phone is local to CM1 and registers to CM1 then CM1 sees the phone 
>> as registered.  Because the SDL link is down CM2 will see the phone 
>> as unregistered.
>>
>> If phone is local to CM2 and registers to CM2 then CM2 sees the phone 
>> as registered.  Because the SDL link is down CM1 will see the phone 
>> as unregistered.
>>
>> The appearance of "unregistered" vs "unknown" in the UI is a bit of a 
>> red herring.  Registration status is captured in a shared memory 
>> segment by the ccm process. Processes such as AXL and RIS read that 
>> shared memory segment.
>>
>> When AXL or web service goes to read that shared memory segment it 
>> reads all nodes in the cluster.  With connectivity to CM2 being down 
>> AXL or RIS on CM1 will only be able to query CM1 shared memory. If 
>> the phone in question has never registered to CM1 then status will be 
>> "unknown".
>>
>> Similarly if the ccm process on CM2 restarts then the status will be 
>> "unknown" in the CM2 shared memory segment until the first time the 
>> phone registers with CM2.
>>
>> So, "unknown" and "unregistered" have subtly different meanings.  
>> "unknown" means the shared memory segments currently available to 
>> query have not been updated with a known status for that device 
>> since the last ccm process restart.  "unregistered" means the phone 
>> last transitioned to "unregistered" signaling status with one of the 
>> nodes that is currently accessible from the AXL or RIS instance you 
>> are querying.
>>
>> In ASCII:
>> browser->Tomcat->RIS1->shared_memory1->CM1
>>               |->RIS2->shared_memory2->CM2
>>               |->RIS3->shared_memory3->CM3
>>
>> All contingent on IP connectivity from Tomcat to the various RIS 
>> processes across the cluster.
>>
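
The difference between the two displayed states can be sketched in a few 
lines of Python. This is only a model of the description above: the 
function name, the dict layout, and the use of None for an unreachable 
node are assumptions, and the rule that "registered" wins over 
"unregistered" when nodes disagree is my own simplification.

```python
def displayed_status(device, segments):
    """Return the status a UI might show for a device (hypothetical model).

    segments maps node name -> {device: status}, or None when the RIS
    instance on that node cannot be reached from the querying server.
    """
    known = [
        seg[device]
        for seg in segments.values()
        if seg is not None and device in seg
    ]
    if "registered" in known:
        return "registered"
    if "unregistered" in known:
        return "unregistered"
    # No reachable segment has recorded anything for this device
    # since its ccm process last restarted.
    return "unknown"


# WAN down: CM2 is unreachable and the phone never registered to CM1.
print(displayed_status("SEP001122334455", {"CM1": {}, "CM2": None}))

# Same outage, but the phone had registered to CM1 earlier and then left.
print(displayed_status(
    "SEP001122334455",
    {"CM1": {"SEP001122334455": "unregistered"}, "CM2": None},
))
```

The first call yields "unknown" and the second "unregistered", even 
though the phone is in the same real-world situation in both cases.
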
>> Regards,
>> Wes
>>
>> On 4/6/2011 4:01 PM, Ovidiu Popa wrote:
>>> Hello Wes
>>>
>>> Excellent information. I always had a difficulty understanding the 
>>> insides of the CUCM box. Thank you.
>>>
>>> One question remains:
>>> Imagine a WAN failure for 30 minutes. CM1 and CM2 are working 
>>> without the SDL layer. CM2 (branch) sees the branch phone as 
>>> Registered, CM1 (hq) sees the phone as unregistered or unknown?
>>>
>>> Thanks.
>>> Ovidiu
>>>
>>>
>>> On 06/Apr/11 9:22 PM, Wes Sisk wrote:
>>>> For example's sake:
>>>> CM1: headquarters
>>>> CM2: branch
>>>>
>>>> When IP connectivity exists between CM1 and CM2 SDL TCP sessions 
>>>> are established between the 2 ccm processes. Through SDL each 
>>>> server tells every other server about registered devices.  There is 
>>>> opportunity for duplicate registration and some propagation time so 
>>>> there is a window of convergence involved.  Once things are in sync 
>>>> each CM node tells every other node about all significant local 
>>>> state changes.  Make sense?  This usually helps folks understand 
>>>> why QoS is so critical on SDL links.  SDL links are the vehicle for 
>>>> synchronization for 2 real time processes.  This isn't quite as 
>>>> sensitive as parallel graphics processing but it's not far off.
>>>>
>>>> When SDL link goes down each node forgets about all entities it 
>>>> learned from the remote node.  It literally purges them.  Devices 
>>>> have to register to their local node (even the best network admins 
>>>> miss some especially when it comes to virtual devices like hunt 
>>>> pilots, route lists, and software media resources).  Again there is 
>>>> opportunity for some duplicate registration.  SDL links detect 
>>>> outage on the order of 10 seconds or less.  SCCP devices do 
>>>> keepalives on the order of 30 seconds with allowance for 1-2 missed 
>>>> keepalives.  10 seconds vs 60-90 seconds creates a window of 
>>>> overlap.  If a duplicate registration is detected then CM resets 
>>>> both device processes.  This extends downtime but there really is 
>>>> no way of knowing which is the "right" registration.  This is 
>>>> another window of convergence.
>>>>
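
The arithmetic behind that window can be written out as a short Python 
sketch. The function and parameter names are hypothetical, and it 
assumes the 60-90 second figure is simply the 30 second keepalive 
interval multiplied by one more than the number of missed keepalives 
allowed. Actual timers depend on configuration.

```python
def overlap_window_s(sdl_detect_s=10, keepalive_s=30, missed_allowed=1):
    """Seconds during which the nodes have split but a device may not
    yet have noticed that it lost its node (illustrative only)."""
    device_detect_s = keepalive_s * (missed_allowed + 1)
    return device_detect_s - sdl_detect_s


for missed in (1, 2):
    print(missed, overlap_window_s(missed_allowed=missed))
# prints:
# 1 50
# 2 80
```

So with these round numbers there are roughly 50 to 80 seconds in which 
a duplicate registration can occur.
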
>>>> So, after convergence the device appears unregistered on the remote 
>>>> node.  For an interesting dig into this scenario take a look at
>>>> CSCsc62081    CCM SDL Out of Service / In Service causes Unexpected 
>>>> Unity Failover.
>>>>
>>>> and similarly related to realtime synchronization of state machines:
>>>> CSCsc62073    Locations Out of Bandwidth causes unexpected Unity 
>>>> Failover
>>>>
>>>> It was the same customer who originated both of these.  This 
>>>> customer had truly the worst luck with timing that I have ever seen.
>>>>
>>>> Regards,
>>>> Wes
>>>>
>>>>
>>>> On 4/6/2011 11:51 AM, Ovidiu Popa wrote:
>>>>> Hello everyone
>>>>>
>>>>> Here's an unusual scenario that kind of puzzles me.
>>>>>
>>>>> Here are the details:
>>>>> - 2 CUCM with Clustering over WAN (HQ and Branch)
>>>>> - Centralized PSTN Access at HQ (DID numbers routed to HQ)
>>>>> - 1 phone with the Branch CUCM as primary and the HQ CUCM as secondary
>>>>>
>>>>> Disaster strikes, the WAN link goes down and we have a split brain 
>>>>> condition.
>>>>>
>>>>> What is the state of the phone on HQ CUCM? Will the phone be 
>>>>> Unregistered or state Unknown?
>>>>>
>>>>> And the most important question is will the HQ CUCM follow the 
>>>>> CFUR if the state is Unknown?
>>>>>
>>>>> Thanks for your input.
>>>>>
>>>>> Regards,
>>>>> Ovidiu
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> cisco-voip mailing list
>>>>> cisco-voip at puck.nether.net
>>>>> https://puck.nether.net/mailman/listinfo/cisco-voip
>>>
>