[cisco-voip] Heartbeat Failure & SNRD

Daniel Pagan dpagan at fidelus.com
Tue Jun 10 11:40:46 EDT 2014


Just a quick wrap-up on this one...
The two defects created for this problem are CSCup27726 and CSCup27133.

- Dan

From: Wes Sisk (wsisk) [mailto:wsisk at cisco.com]
Sent: Wednesday, May 21, 2014 2:50 PM
To: Daniel Pagan
Cc: cisco-voip at puck.nether.net
Subject: Re: [cisco-voip] Heartbeat Failure & SNRD

Hi Daniel,

Great find!

For the document:
http://www.cisco.com/c/en/us/support/docs/voice-unified-communications/unified-communications-manager-callmanager/46806-cm-crashes-and-shutdowns.html

The initialization process and timers have changed *significantly* since 4.x. Some examples include:
CSCsj76788    cp-system request to remove initialization timers
"... remove the initialization timers that are started during CUCM initialization.  These timer would previously cause a system restart under certain circumstance..."

Still, there is a global maximum timeout. Individual daemons must report that they have started and initialized successfully within that time.

Historically, behavior like what you describe was triggered by service parameters being missing or having incorrect values. This may stem from a problem with the connection to the database (CSCsc72748) or a problem with the contents of the database. Other causes include another process grabbing one of the TCP or UDP ports required by the ccm process.
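
To illustrate the port-conflict case in general terms, here is a minimal Python sketch that tests whether a TCP port can still be bound on a host. It is a generic check, not something taken from CUCM or its tooling; the function name `tcp_port_is_free` and the loopback default are assumptions made for illustration, and which ports the ccm process needs is deliberately left out.

```python
import socket

def tcp_port_is_free(port, host="127.0.0.1"):
    """Return True if a new TCP socket can bind host:port right now.

    Hypothetical helper for illustration only; it says nothing about
    which process holds the port when the bind fails.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        try:
            s.bind((host, port))
        except OSError:
            # Typically "address already in use": another process owns it.
            return False
        return True

# Demonstration: occupy an ephemeral port, then probe it.
holder = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
holder.bind(("127.0.0.1", 0))      # port 0 lets the OS pick a free port
holder.listen(1)
busy_port = holder.getsockname()[1]
print(busy_port, "free?", tcp_port_is_free(busy_port))
```

A service that needs a fixed port and finds it taken this way cannot finish initializing, which is the failure mode described above.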

ccm had many issues retrieving initialization information from the database in early Linux versions. Refinements to Informix and the in-memory database (IMDB) have helped significantly.

-Wes


On May 21, 2014, at 9:33 AM, Daniel Pagan <dpagan at fidelus.com> wrote:

Folks:

CUCM ES 8.6.2.24122-1 appears to introduce an issue where the CallManager heartbeat fails to increment upon startup, and the condition that triggers it is very specific. On a problematic node, SDL traces show the following error exactly one hour after the start of the CCM service:

AppError  ||||||Local send blocked: SignalName: Start, DestPID: SNRD[1:100:61:1]

This error is followed by the SDL trace printing an error stating CallManager exceeded the permitted time for initialization and will restart the application. The CCM application restarts and additional SDL traces are printed showing the standard creation of critical processes - one hour later the same "Local send blocked" error is printed regarding the SNRD process.
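
For anyone wanting to check their own traces for the same symptom, a minimal Python sketch follows. The regular expression is built only from the single trace line quoted above, so other SDL trace layouts may need adjustments; the names `BLOCKED_RE` and `find_blocked_sends` are made up for this example.

```python
import re

# Pattern derived from the one sample line in this thread (an assumption:
# real SDL traces may carry extra fields or different spacing).
BLOCKED_RE = re.compile(
    r"Local send blocked:\s*SignalName:\s*(?P<signal>\w+),\s*"
    r"DestPID:\s*(?P<process>\w+)\[(?P<pid>[\d:]+)\]"
)

def find_blocked_sends(lines):
    """Yield (signal, process, pid) for each 'Local send blocked' line."""
    for line in lines:
        match = BLOCKED_RE.search(line)
        if match:
            yield match.group("signal"), match.group("process"), match.group("pid")

sample = [
    "AppError  ||||||Local send blocked: SignalName: Start, DestPID: SNRD[1:100:61:1]",
]
for signal, process, pid in find_blocked_sends(sample):
    print(f"send of {signal} to {process}[{pid}] was blocked")
```

To scan a real file, pass an open file object instead of `sample`, since the function only needs an iterable of lines.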

I saw the DestPID: SNRD error, went to a completely different, non-problematic lab environment where 8.6.2.24122-1 is installed, created a single Remote Destination Profile, and then restarted the standalone node in order to force the creation of SNRD. CallManager heartbeats are now failing to increment in that environment, and I found another "Local send blocked" error regarding SNRD there. Removing the single Remote Destination Profile from the standalone environment and rebooting the node resolves the problem. Re-adding it and rebooting recreates the problem, making SNRD the obvious culprit here.

I currently have a TAC case open where they're attempting to recreate the problem. It seems no public-facing defects have been created for this yet. Just wanted to give you folks a heads up.

Related to this, can someone tell me if this document, specifically the section describing MMManInit and process creation, is still accurate? If so, then what I fail to see in SDL traces is an InitDone signal from SNRD to MMManInit during the 60 minutes between CCM startup and initialization timeout.

- Daniel

_______________________________________________
cisco-voip mailing list
cisco-voip at puck.nether.net
https://puck.nether.net/mailman/listinfo/cisco-voip


