[cisco-voip] CUCM 10.5 core question - AlternateSyslogAlarm malloc backtrace
Erick Bergquist
erickbee at gmail.com
Fri Jan 29 18:15:03 EST 2016
I'll unicast you the details, Ryan.
On Fri, Jan 29, 2016 at 8:28 AM, Ryan Ratliff (rratliff)
<rratliff at cisco.com> wrote:
> Until you identify what specifically spiked, you don't know what to
> prevent.
>
> Do you have a TAC case open yet?
>
> -Ryan
>
> On Jan 28, 2016, at 7:42 PM, Erick Bergquist <erickbee at gmail.com> wrote:
>
> Thanks Ryan. Nothing really standing out on the ESXi side.
>
> Any suggestions to prevent this in future?
>
> The VMware KB articles suggest a higher timeout on the application side so
> that transient issues and conditions on the VM side of things don't
> negatively impact the application.
>
> On Friday, January 22, 2016, Ryan Ratliff (rratliff) <rratliff at cisco.com>
> wrote:
>>
>> That's your culprit right there. That 16-second SdlRouter thread timer is
>> one of the few things I bet hasn't changed since 4.x.
>>
>> Can you look at ESXi stats for that time period to see what the host
>> thought was going on?
>>
>> -Ryan
>>
>> > On Jan 22, 2016, at 4:45 PM, Erick Bergquist <erickbee at gmail.com> wrote:
>> >
>> > Normal CPU usage up to core, then a CPU spike right after core when
>> > things are starting back up. Different hosts, same storage.
>> >
>> > The only gap in the trace file line entries is 23:54:59.957 to
>> > 23:55:18.402, but maybe there was simply no activity?
>> >
>> > 11574290.000 |23:54:59.957 |SdlStat |Period: 6s #Lines: 19 #Bytes:
>> > 3701 Total Number of Buffers: 10000 Free LWM: 9989 Free LWM(total):
>> > 9192
>> > 11574291.000 |23:55:18.402 |AppInfo
>> > |SIPSocketProtocol(2,100,14,137)::handleReadComplete send SdlReadRsp:
>> > size 370
>> > 11574292.000 |23:55:18.402 |SdlStat |Period: 19s #Lines: 1 #Bytes:
>> > 117 Total Number of Buffers: 10000 Free LWM: 9998 Free LWM(total):
>> > 9192
>> > 11574293.000 |23:55:18.959 |AppInfo |CMProcMon - ERROR !!!!!!!! SDL
>> > Router Services Dead: timeBeforeSleep: 804509608, currentMontiorTime:
>> > 804530608, deltaTime: 21000, LoopCount= 19
>> >
>> > The deltaTime is 21000 ms (804530608 - 804509608), which would be more
>> > than the 20 seconds for the router thread in the previously mentioned
>> > document covering 3.x/4.x.
>> >
>> > Do these services still shut down and restart themselves in 10.x if they
>> > exceed 16 seconds (SDL Timer) or 20 seconds (Router Thread), at the same
>> > thresholds that document states?
>> >
>> > The VMware KB suggests the SCSI timeout should be increased to handle
>> > transient conditions. I would think these timers should be able to handle
>> > transient issues without failing the service? Maybe the thresholds need
>> > to be increased to cover these periods.
>> >
>> >
>> > http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465
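>> >
>> > For reference, if I'm reading KB 1009465 right, it is about the guest
>> > OS SCSI disk command timeout rather than anything CUCM-specific, and I
>> > believe VMware Tools normally sets it on Linux guests. A rough Python
>> > sketch (assuming a Linux guest with sysfs, and the 180-second value the
>> > KB suggests) of how you could check the current values - illustration
>> > only, since the CUCM appliance doesn't hand you a root shell to run it:
>> >
>> > ====================================
>> > check SCSI disk timeouts (sketch)
>> > ====================================
>> > import glob
>> >
>> > # Report each block device's SCSI command timeout and flag anything
>> > # below the 180 s the VMware KB suggests for Linux guests. Assumes a
>> > # Linux guest with sysfs mounted.
>> > SUGGESTED = 180
>> >
>> > for path in sorted(glob.glob("/sys/block/*/device/timeout")):
>> >     dev = path.split("/")[3]
>> >     with open(path) as f:
>> >         timeout = int(f.read().strip())
>> >     note = "" if timeout >= SUGGESTED else "  <-- below suggested value"
>> >     print(f"{dev}: {timeout}s{note}")
>> > ====================================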
>> >
>> > I've only seen this with the ccm and ctimanager service. None of the
>> > other services.
>> >
>> > Thanks.
>> >
>> >
>> > On Fri, Jan 22, 2016 at 2:44 PM, Ryan Ratliff (rratliff)
>> > <rratliff at cisco.com> wrote:
>> >> How about CPU usage leading up to the cores?
>> >> If you have CCM traces from before, are there gaps in the timestamps
>> >> between line numbers?
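>> >>
>> >> Something along these lines (a rough Python sketch, assuming the usual
>> >> pipe-delimited SDL trace layout with the timestamp in the second field,
>> >> as in the excerpts in this thread) is enough to spot gaps between
>> >> consecutive lines:
>> >>
>> >> ====================================
>> >> find timestamp gaps (sketch)
>> >> ====================================
>> >> import sys
>> >> from datetime import datetime, timedelta
>> >>
>> >> # Flag gaps of more than 5 seconds between consecutive trace lines.
>> >> # Assumes the second |-delimited field is HH:MM:SS.mmm; midnight
>> >> # rollover is ignored to keep the sketch short.
>> >> GAP = timedelta(seconds=5)
>> >> prev = None
>> >> for line in open(sys.argv[1]):
>> >>     fields = line.split("|")
>> >>     if len(fields) < 2:
>> >>         continue
>> >>     try:
>> >>         ts = datetime.strptime(fields[1].strip(), "%H:%M:%S.%f")
>> >>     except ValueError:
>> >>         continue
>> >>     if prev is not None and ts - prev > GAP:
>> >>         print(f"{(ts - prev).total_seconds():.1f}s gap ending at {fields[1].strip()}")
>> >>     prev = ts
>> >> ====================================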
>> >>
>> >> If you are virtualized, were those two nodes on the same host? If so,
>> >> did other VMs on that host experience any odd issues at the time?
>> >> -Ryan
>> >>
>> >> On Jan 22, 2016, at 3:29 PM, Erick Bergquist <erickbee at gmail.com>
>> >> wrote:
>> >>
>> >> Ryan,
>> >>
>> >> I'm seeing a small spike in IOWait, but not for that long. No gaps in
>> >> the Perfmon logs. Only the ccm service core dumped, on both call
>> >> manager nodes in this cluster at the same time.
>> >>
>> >> Would the system shut down the CCM and CTI Manager services itself if
>> >> these timers are too long, like that document from the CCM 3.x/4.x days
>> >> alludes to with the 16-second SDL Timer Thread and 20-second Router
>> >> Thread thresholds?
>> >>
>> >>
>> >> The subscriber had a different backtrace, an Intentional Abort:
>> >>
>> >> ====================================
>> >> backtrace - CUCM
>> >> ===================================
>> >> #0 0xf776c430 in __kernel_vsyscall ()
>> >> #1 0xf697b871 in raise () from /lib/libc.so.6
>> >> #2 0xf697d14a in abort () from /lib/libc.so.6
>> >> #3 0x0837609e in IntentionalAbort () at ProcessCMProcMon.cpp:88
>> >> #4 CMProcMon::verifySdlRouterServices () at ProcessCMProcMon.cpp:748
>> >> #5 0x083762ca in CMProcMon::callManagerMonitorThread
>> >> (cmProcMon=0xefc73038) at ProcessCMProcMon.cpp:429
>> >> #6 0xf6c9c5a8 in ACE_OS_Thread_Adapter::invoke (this=0xe6a5d818) at
>> >> OS_Thread_Adapter.cpp:103
>> >> #7 0xf6c5c5f1 in ace_thread_adapter (args=0xe6a5d818) at
>> >> Base_Thread_Adapter.cpp:126
>> >> #8 0xf6932b39 in start_thread () from /lib/libpthread.so.0
>> >> #9 0xf6a33c1e in clone () from /lib/libc.so.6
>> >> ====================================
>> >> info registers
>> >> ====================================
>> >>
>> >>
>> >> Thanks,
>> >> Erick
>> >>
>> >>
>> >> On Fri, Jan 22, 2016 at 1:23 PM, Ryan Ratliff (rratliff)
>> >> <rratliff at cisco.com> wrote:
>> >>> That backtrace is usually accompanied by CTIManager crashing due to
>> >>> intentional abort.
>> >>>
>> >>>> 09465138.001 |23:55:15.448 |AppInfo |CMProcMon::star_sdlVerification
>> >>>> - Test Timer exceeded minimum timer latency threshold of 1000
>> >>>> milliseconds, Actual latency: 54360 milliseconds
>> >>>
>> >>> That is quite spectacularly bad. The SDLTimer thread is responsible
>> >>> for _every_ timer the ccm process uses. It has a one-second self-check to
>> >>> make sure it's getting enough CPU time, and in your case it didn't get to
>> >>> execute its one-second check for 54 seconds.
>> >>>
>> >>> This was definitely performance related, so check CPU usage and IOWait
>> >>> at the time (+/- a minute or so). Gaps in the RISDC perfmon logs would also
>> >>> indicate performance issues, though they leave you with little evidence unless
>> >>> the collector happened to catch some process at the beginning of a spike.
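>> >>>
>> >>> If it helps, here is a rough Python sketch of pulling a +/- one-minute
>> >>> window out of a CSV-style perfmon export. It assumes a parseable
>> >>> timestamp in the first column with the format shown in the usage
>> >>> comment; adjust both to whatever the RISDC CSV actually contains.
>> >>>
>> >>> ====================================
>> >>> perfmon window filter (sketch)
>> >>> ====================================
>> >>> import csv
>> >>> import sys
>> >>> from datetime import datetime, timedelta
>> >>>
>> >>> # Usage: python window.py perfmon.csv "01/21/2016 23:55:15"
>> >>> # Prints every row whose first-column timestamp falls within +/- 60 s
>> >>> # of the event time. Column index and timestamp format are guesses.
>> >>> event = datetime.strptime(sys.argv[2], "%m/%d/%Y %H:%M:%S")
>> >>> window = timedelta(seconds=60)
>> >>>
>> >>> with open(sys.argv[1]) as f:
>> >>>     for row in csv.reader(f):
>> >>>         try:
>> >>>             ts = datetime.strptime(row[0], "%m/%d/%Y %H:%M:%S")
>> >>>         except (ValueError, IndexError):
>> >>>             continue
>> >>>         if abs(ts - event) <= window:
>> >>>             print(",".join(row))
>> >>> ====================================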
>> >>>
>> >>> -Ryan
>> >>>
>> >>> On Jan 22, 2016, at 1:13 PM, Erick Bergquist <erickbee at gmail.com>
>> >>> wrote:
>> >>>
>> >>> Brian,
>> >>>
>> >>> I looked over those and am not seeing any processes consuming lots of
>> >>> memory; memory used was 70-75% during this time.
>> >>>
>> >>>
>> >>>> On Fri, Jan 22, 2016 at 10:54 AM, Brian Meade <bmeade90 at vt.edu>
>> >>>> wrote:
>> >>>> Did you download Perfmon logs? Might just be high memory in general
>> >>>> causing
>> >>>> the issue. I'd use the Perfmon logs to check what processes were
>> >>>> using up
>> >>>> all the memory at that time.
>> >>>>
>> >>>> On Fri, Jan 22, 2016 at 10:45 AM, Erick Bergquist
>> >>>> <erickbee at gmail.com>
>> >>>> wrote:
>> >>>>>
>> >>>>> This happened on CUCM 10.5.2.10000-7 and I'm not finding a bug for
>> >>>>> this backtrace info in bug toolkit or the 10.5.2 SU1, SU2, SU3
>> >>>>> release
>> >>>>> notes.
>> >>>>>
>> >>>>>
>> >>>>> From the core analysis,
>> >>>>>
>> >>>>> Reading symbols from /lib/libnss_files.so.2...(no debugging symbols
>> >>>>> found)...done.
>> >>>>> Loaded symbols for /lib/libnss_files.so.2
>> >>>>> Reading symbols from /lib/libnss_dns.so.2...(no debugging symbols
>> >>>>> found)...done.
>> >>>>> Loaded symbols for /lib/libnss_dns.so.2
>> >>>>> Core was generated by `/usr/local/cm/bin/ccm'.
>> >>>>> Program terminated with signal 11, Segmentation fault.
>> >>>>> #0 0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>> >>>>>
>> >>>>> ====================================
>> >>>>> backtrace - CUCM
>> >>>>> ===================================
>> >>>>> #0 0xf6a28b55 in _int_malloc () from /lib/libc.so.6
>> >>>>> #1 0xf6a29d9e in malloc () from /lib/libc.so.6
>> >>>>> #2 0x09be122c in AlternateSyslogAlarmMonitor::manageLogs
>> >>>>> (this=0xd7d74f8, severity=6, strBuf=<value optimized out>) at
>> >>>>> AlternateSyslogAlarmMonitor.cpp:162
>> >>>>> #3 0x09be13f6 in AlternateSyslogAlarmMonitor::AlarmThreadProc
>> >>>>> (pParameter=0xd7d74f8) at AlternateSyslogAlarmMonitor.cpp:61
>> >>>>> #4 0xf6d005a8 in ACE_OS_Thread_Adapter::invoke (this=0xd7d8dc8) at
>> >>>>> OS_Thread_Adapter.cpp:103
>> >>>>> #5 0xf6cc05f1 in ace_thread_adapter (args=0xd7d8dc8) at
>> >>>>> Base_Thread_Adapter.cpp:126
>> >>>>> #6 0xf6996b39 in start_thread () from /lib/libpthread.so.0
>> >>>>> #7 0xf6a97c1e in clone () from /lib/libc.so.6
>> >>>>> ====================================
>> >>>>> info registers
>> >>>>> ====================================
>> >>>>> eax            0x198408
>> >>>>> ecx            0xf6b483d0   -155941936
>> >>>>> edx            0xf6b483a0   -155941984
>> >>>>> ebx            0xf6b46ff4   -155947020
>> >>>>> esp            0xf5d0ab10   0xf5d0ab10
>> >>>>> ebp            0xf5d0ab98   0xf5d0ab98
>> >>>>> esi            0xf4bc0b8    256622776
>> >>>>> edi            0x0          0
>> >>>>> eip            0xf6a28b55   0xf6a28b55 <_int_malloc+725>
>> >>>>> eflags         0x10212      [ AF IF RF ]
>> >>>>> cs             0x23         35
>> >>>>> ss             0x2b         43
>> >>>>> ds             0x2b         43
>> >>>>> es             0x2b         43
>> >>>>> fs             0x0          0
>> >>>>> gs             0x63         99
>> >>>>> ====================================
>> >>>>> info threads
>> >>>>> ====================================
>> >>>>> 47 Thread 0xe750bb70 (LWP 30608) 0xf6a237f4 in _IO_flush_all_lockp
>> >>>>> () from /lib/libc.so.6
>> >>>>> 46 Thread 0xf5effb70 (LWP 29077) 0xf77d0430 in __kernel_vsyscall ()
>> >>>>> 45 Thread 0xf63ffb70 (LWP 29052) 0xf77d0430 in __kernel_vsyscall ()
>> >>>>> 44 Thread 0xf67d0b70 (LWP 29048) 0xf77d0430 in __kernel_vsyscall ()
>> >>>>> 43 Thread 0xf67e0710 (LWP 25984) 0xf77d0430 in __kernel_vsyscall ()
>> >>>>> 42 Thread 0xe6f2fb70 (LWP 30611) 0xf77d0430 in __kernel_vsyscall ()
>> >>>>>
>> >>>>>
>> >>>>> No errors, etc. in the Syslog at the time other than errors about the
>> >>>>> call manager service failing.
>> >>>>>
>> >>>>> However, the SDL CCM traces show a timer expiring, and it appears the
>> >>>>> services were shut down due to the timer issue.
>> >>>>>
>> >>>>> From the following cisco doc,
>> >>>>>
>> >>>>>
>> >>>>> http://www.cisco.com/c/en/us/support/docs/voice-unified-communications/unified-communications-manager-callmanager/46806-cm-crashes-and-shutdowns.html
>> >>>>> it looks like the SDL timer thread stopped and the service shut down?
>> >>>>> This doc is from the CCM 3.x and 4.x time frame. Are the timers
>> >>>>> mentioned in this doc still true for 10.5 - where it shuts down at
>> >>>>> 16 seconds for the SDL timer thread or 20 seconds for the SDL router
>> >>>>> thread?
>> >>>>>
>> >>>>>
>> >>>>> 09465137.000 |23:55:15.416 |Stopping |
>> >>>>> | |SdlTCPConnection(1,100,14,37)
>> >>>>> |SdlTCPConnection(1,100,14,37) |
>> >>>>> |NumOfCurrentInstances: 39
>> >>>>>
>> >>>>> 09465138.000 |23:55:15.448 |SdlSig |SdlVerificationTimer
>> >>>>> |monitoring |CMProcMon(1,100,234,1)
>> >>>>> |SdlTimerService(1,100,3,1) |1,100,234,1.3139616^*^*
>> >>>>> |*TraceFlagOverrode
>> >>>>>
>> >>>>> 09465138.001 |23:55:15.448 |AppInfo
>> >>>>> |CMProcMon::star_sdlVerification
>> >>>>> - Test Timer exceeded minimum timer latency threshold of 1000
>> >>>>> milliseconds, Actual latency: 54360 milliseconds
>> >>>>>
>> >>>>> 09465139.000 |23:55:16.135 |AppInfo |CallManagerFailure - Indicates
>> >>>>> an internal failure in Unified CM Host Name of Hosting Node:CUCM1
>> >>>>> Host
>> >>>>> Node IP Address:192.168.1.5 Reason code:4 Additional Text: App
>> >>>>> ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
>> >>>>>
>> >>>>> 09465140.000 |23:55:16.135 |AlarmErr |AlarmClass: CallManager,
>> >>>>> AlarmName: CallManagerFailure, AlarmSeverity: Critical,
>> >>>>> AlarmMessage:
>> >>>>> , AlarmDescription: Indicates an internal failure in Unified CM,
>> >>>>> AlarmParameters: HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
>> >>>>> Text:, AppID:Cisco CallManager, ClusterID:StandAloneCluster,
>> >>>>> NodeID:CUCM1
>> >>>>>
>> >>>>> 09465141.000 |23:55:17.465 |AppInfo |CallManagerFailure - Indicates
>> >>>>> an internal failure in Unified CM Host Name of Hosting Node:CUCM1
>> >>>>> Host
>> >>>>> Node IP Address:192.168.1.5 Reason code:4 Additional Text:CCM
>> >>>>> Intentional Abort: SignalName: SdlVerificationTimer, DestPID:
>> >>>>> CMProcMon[1:100:234:1] App ID:Cisco CallManager Cluster
>> >>>>> ID:StandAloneCluster Node ID:CUCM1
>> >>>>>
>> >>>>> 09465142.000 |23:55:17.465 |AlarmErr |AlarmClass: CallManager,
>> >>>>> AlarmName: CallManagerFailure, AlarmSeverity: Critical,
>> >>>>> AlarmMessage:
>> >>>>> , AlarmDescription: Indicates an internal failure in Unified CM,
>> >>>>> AlarmParameters: HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
>> >>>>> Text:CCM Intentional Abort: SignalName: SdlVerificationTimer,
>> >>>>> DestPID:
>> >>>>> CMProcMon[1:100:234:1], AppID:Cisco CallManager,
>> >>>>> ClusterID:StandAloneCluster, NodeID:CUCM1,
>> >>>>>
>> >>>>> 09465138.002 |23:55:17.566 |AppInfo |TimerThreadSlowed - Timer
>> >>>>> thread
>> >>>>> has slowed beyond acceptable limits App ID:Cisco CallManager Cluster
>> >>>>> ID:StandAloneCluster Node ID:CUCM1
>> >>>>>
>> >>>>> 09465138.003 |23:55:17.566 |AlarmErr |AlarmClass: CallManager,
>> >>>>> AlarmName: TimerThreadSlowed, AlarmSeverity: Critical, AlarmMessage:
>> >>>>> ,
>> >>>>> AlarmDescription: Timer thread has slowed beyond acceptable limits,
>> >>>>> AlarmParameters: AppID:Cisco CallManager,
>> >>>>> ClusterID:StandAloneCluster, NodeID:CUCM1,
>> >>>>>
>> >>>>> 09465143.000 |23:55:17.566 |SdlSig |ReapOldTokenRegistrationsTimer
>> >>>>> |wait |SIPStationInit(1,100,72,1)
>> >>>>> |SdlTimerService(1,100,3,1) |1,100,146,1.1^*^*
>> >>>>> |[R:H-H:36,N:55,L:1,V:1,Z:0,D:0]
>> >>
>
>