[cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

Ryan Ratliff (rratliff) rratliff at cisco.com
Fri Jan 22 15:44:14 EST 2016


How about CPU usage leading up to the cores? 
If you have CCM traces from before are there gaps in the timestamps between line numbers?  

If you are virtualized were those two nodes on the same host?  If so did other VMs on that host experience any odd issues at the time?
-Ryan

On Jan 22, 2016, at 3:29 PM, Erick Bergquist <erickbee at gmail.com> wrote:

Ryan,

I'm seeing a small spike in IOWait but not for that long, No gaps in
the Perfmon logs.  Only the ccm service core dumped on both the call
manager nodes on this cluster at the same time.

Would the system shutdown the CCM and CTI Manager service itself if
these timers are to long, like that document from CCM 3.x/4.x days
eludes to with the 16 second SDL Timer Thread and 20 Second Router
Thread threshold?


The subscriber had a different backtrace, Intentional Abort..

====================================
backtrace - CUCM
===================================
#0  0xf776c430 in __kernel_vsyscall ()
#1  0xf697b871 in raise () from /lib/libc.so.6
#2  0xf697d14a in abort () from /lib/libc.so.6
#3  0x0837609e in IntentionalAbort () at ProcessCMProcMon.cpp:88
#4  CMProcMon::verifySdlRouterServices () at ProcessCMProcMon.cpp:748
#5  0x083762ca in CMProcMon::callManagerMonitorThread
(cmProcMon=0xefc73038) at ProcessCMProcMon.cpp:429
#6  0xf6c9c5a8 in ACE_OS_Thread_Adapter::invoke (this=0xe6a5d818) at
OS_Thread_Adapter.cpp:103
#7  0xf6c5c5f1 in ace_thread_adapter (args=0xe6a5d818) at
Base_Thread_Adapter.cpp:126
#8  0xf6932b39 in start_thread () from /lib/libpthread.so.0
#9  0xf6a33c1e in clone () from /lib/libc.so.6
====================================
info registers
====================================


Thanks,
Erick


On Fri, Jan 22, 2016 at 1:23 PM, Ryan Ratliff (rratliff)
<rratliff at cisco.com> wrote:
> That backtrace is usually accompanied by CTIManager crashing due to intentional abort.
> 
>> 09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
>> - Test Timer exceeded minimum timer latency threshold of 1000
>> milliseconds, Actual latency: 54360 milliseconds
> 
> That is quite spectacularly bad.  The SDLTimer thread is responsible for _every_ timer the ccm process uses. It has a one-second self-check to make sure it’s getting enough cpu time and in your case it didn’t get to execute it’s one-second check for 54 seconds.
> 
> This was definitely performance related so check CPU usage and IOWait at the time (+- a minute or so). Gaps in the RISDC perfmon logs would also indicate performance issues, though leave you with little evidence unless it happened to catch some process at the beginning of a spike.
> 
> -Ryan
> 
> On Jan 22, 2016, at 1:13 PM, Erick Bergquist <erickbee at gmail.com> wrote:
> 
> Brian,
> 
> Looked over those and not seeing any processes consuming lots of
> memory, memory used was 70-75% during this time.
> 
> 
> On Fri, Jan 22, 2016 at 10:54 AM, Brian Meade <bmeade90 at vt.edu> wrote:
>> Did you download Perfmon logs?  Might just be high memory in general causing
>> the issue.  I'd use the Perfmon logs to check what processes were using up
>> all the memory at that time.
>> 
>> On Fri, Jan 22, 2016 at 10:45 AM, Erick Bergquist <erickbee at gmail.com>
>> wrote:
>>> 
>>> This happened on CUCM 10.5.2.10000-7 and I'm not finding a bug for
>>> this backtrace info in bug toolkit or the 10.5.2 SU1, SU2, SU3 release
>>> notes.
>>> 
>>> 
>>> From the core analysis,
>>> 
>>> Reading symbols from /lib/libnss_files.so.2...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /lib/libnss_files.so.2
>>> Reading symbols from /lib/libnss_dns.so.2...(no debugging symbols
>>> found)...done.
>>> Loaded symbols for /lib/libnss_dns.so.2
>>> Core was generated by `/usr/local/cm/bin/ccm'.
>>> Program terminated with signal 11, Segmentation fault.
>>> #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>> 
>>> ====================================
>>> backtrace - CUCM
>>> ===================================
>>> #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>>> #1  0xf6a29d9e in malloc () from /lib/libc.so.6
>>> #2  0x09be122c in AlternateSyslogAlarmMonitor::manageLogs
>>> (this=0xd7d74f8, severity=6, strBuf=<value optimized out>) at
>>> AlternateSyslogAlarmMonitor.cpp:162
>>> #3  0x09be13f6 in AlternateSyslogAlarmMonitor::AlarmThreadProc
>>> (pParameter=0xd7d74f8) at AlternateSyslogAlarmMonitor.cpp:61
>>> #4  0xf6d005a8 in ACE_OS_Thread_Adapter::invoke (this=0xd7d8dc8) at
>>> OS_Thread_Adapter.cpp:103
>>> #5  0xf6cc05f1 in ace_thread_adapter (args=0xd7d8dc8) at
>>> Base_Thread_Adapter.cpp:126
>>> #6  0xf6996b39 in start_thread () from /lib/libpthread.so.0
>>> #7  0xf6a97c1e in clone () from /lib/libc.so.6
>>> ====================================
>>> info registers
>>> ====================================
>>> eax            0x198408
>>> ecx            0xf6b483d0-155941936
>>> edx            0xf6b483a0-155941984
>>> ebx            0xf6b46ff4-155947020
>>> esp            0xf5d0ab100xf5d0ab10
>>> ebp            0xf5d0ab980xf5d0ab98
>>> esi            0xf4bc0b8256622776
>>> edi            0x00
>>> eip            0xf6a28b550xf6a28b55 <_int_malloc+725>
>>> eflags         0x10212[ AF IF RF ]
>>> cs             0x2335
>>> ss             0x2b43
>>> ds             0x2b43
>>> es             0x2b43
>>> fs             0x00
>>> gs             0x6399
>>> ====================================
>>> info threads
>>> ====================================
>>>  47 Thread 0xe750bb70 (LWP 30608)  0xf6a237f4 in _IO_flush_all_lockp
>>> () from /lib/libc.so.6
>>> 46 Thread 0xf5effb70 (LWP 29077)  0xf77d0430 in __kernel_vsyscall ()
>>> 45 Thread 0xf63ffb70 (LWP 29052)  0xf77d0430 in __kernel_vsyscall ()
>>> 44 Thread 0xf67d0b70 (LWP 29048)  0xf77d0430 in __kernel_vsyscall ()
>>> 43 Thread 0xf67e0710 (LWP 25984)  0xf77d0430 in __kernel_vsyscall ()
>>> 42 Thread 0xe6f2fb70 (LWP 30611)  0xf77d0430 in __kernel_vsyscall ()
>>> 
>>> 
>>> No errors/etc in the Syslog at the time other then errors on the call
>>> manager service failing.
>>> 
>>> However, the SDL CCM traces show of a timer expiring and appears it
>>> shuts down services due to the timer issue.
>>> 
>>> From the following cisco doc,
>>> 
>>> http://www.cisco.com/c/en/us/support/docs/voice-unified-communications/unified-communications-manager-callmanager/46806-cm-crashes-and-shutdowns.html
>>> it looks like the SDL timer thread stopped and service shutdown?
>>> This doc is from CCM 3.x and 4.x time frame.  Are the timers mentioned
>>> in this doc still true for 10.5 - where it shuts down at 16 seconds
>>> for SDL timer thread or 20 seconds for SDL router thread.
>>> 
>>> 
>>> 09465137.000 |23:55:15.416 |Stopping |
>>>     |                               |SdlTCPConnection(1,100,14,37)
>>> |SdlTCPConnection(1,100,14,37)    |
>>>     |NumOfCurrentInstances: 39
>>> 
>>> 09465138.000 |23:55:15.448 |SdlSig   |SdlVerificationTimer
>>>     |monitoring                     |CMProcMon(1,100,234,1)
>>> |SdlTimerService(1,100,3,1)       |1,100,234,1.3139616^*^*
>>>     |*TraceFlagOverrode
>>> 
>>> 09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
>>> - Test Timer exceeded minimum timer latency threshold of 1000
>>> milliseconds, Actual latency: 54360 milliseconds
>>> 
>>> 09465139.000 |23:55:16.135 |AppInfo  |CallManagerFailure - Indicates
>>> an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host
>>> Node IP Address:192.168.1.5 Reason code:4 Additional Text: App
>>> ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>>> 
>>> 09465140.000 |23:55:16.135 |AlarmErr |AlarmClass: CallManager,
>>> AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage:
>>> , AlarmDescription: Indicates an internal failure in Unified CM,
>>> AlarmParameters:  HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
>>> Text:, AppID:Cisco CallManager, ClusterID:StandAloneCluster,
>>> NodeID:CUCM1
>>> 
>>> 09465141.000 |23:55:17.465 |AppInfo  |CallManagerFailure - Indicates
>>> an internal failure in Unified CM Host Name of Hosting Node:CUCM1 Host
>>> Node IP Address:192.168.1.5 Reason code:4 Additional Text:CCM
>>> Intentional Abort: SignalName: SdlVerificationTimer, DestPID:
>>> CMProcMon[1:100:234:1] App ID:Cisco CallManager Cluster
>>> ID:StandAloneCluster Node ID:CUCM1
>>> 
>>> 09465142.000 |23:55:17.465 |AlarmErr |AlarmClass: CallManager,
>>> AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage:
>>> , AlarmDescription: Indicates an internal failure in Unified CM,
>>> AlarmParameters:  HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
>>> Text:CCM Intentional Abort: SignalName: SdlVerificationTimer, DestPID:
>>> CMProcMon[1:100:234:1], AppID:Cisco CallManager,
>>> ClusterID:StandAloneCluster, NodeID:CUCM1,
>>> 
>>> 09465138.002 |23:55:17.566 |AppInfo  |TimerThreadSlowed - Timer thread
>>> has slowed beyond acceptable limits App ID:Cisco CallManager Cluster
>>> ID:StandAloneCluster Node ID:CUCM1
>>> 
>>> 09465138.003 |23:55:17.566 |AlarmErr |AlarmClass: CallManager,
>>> AlarmName: TimerThreadSlowed, AlarmSeverity: Critical, AlarmMessage: ,
>>> AlarmDescription: Timer thread has slowed beyond acceptable limits,
>>> AlarmParameters:  AppID:Cisco CallManager,
>>> ClusterID:StandAloneCluster, NodeID:CUCM1,
>>> 
>>> 09465143.000 |23:55:17.566 |SdlSig   |ReapOldTokenRegistrationsTimer
>>>     |wait                           |SIPStationInit(1,100,72,1)
>>> |SdlTimerService(1,100,3,1)       |1,100,146,1.1^*^*
>>>     |[R:H-H:36,N:55,L:1,V:1,Z:0,D:0]
>>> _______________________________________________
>>> cisco-voip mailing list
>>> cisco-voip at puck.nether.net
>>> https://puck.nether.net/mailman/listinfo/cisco-voip
>> 
>> 
> _______________________________________________
> cisco-voip mailing list
> cisco-voip at puck.nether.net
> https://puck.nether.net/mailman/listinfo/cisco-voip
> 



More information about the cisco-voip mailing list