[cisco-voip] CUCM 10.5 core question - AlterateSyslogAlarm malloc backtrace

Thu Jan 28 19:42:37 EST 2016

Thanks Ryan. Nothing really standing out on the ESXi side.

Any suggestions to prevent this in future?

The VMware KB articles suggest a higher timeout on application side so
transient issues and conditions on the VM side of things don't negatively
impact the application.

On Friday, January 22, 2016, Ryan Ratliff (rratliff) <rratliff at cisco.com>
wrote:

> That's your culprit right there. That 16 second sdlrouter thread timer is
> one of the few things I bet hadn't changed since 4.x.
>
> Can you look at esxi stats for that time period to see what the host
> thought was going on?
>
> -Ryan
>
> > On Jan 22, 2016, at 4:45 PM, Erick Bergquist <erickbee at gmail.com
> <javascript:;>> wrote:
> >
> > Normal CPU usage up to core, then a CPU spike right after core when
> > things are starting back up. Different hosts, same storage.
> >
> > Only gap in trace file line entries is 23:54:59.957 to 23:55:18.402
> > but maybe no activities?
> >
> > 11574290.000 |23:54:59.957 |SdlStat  |Period: 6s #Lines: 19 #Bytes:
> > 3701 Total Number of Buffers: 10000 Free LWM: 9989 Free LWM(total):
> > 9192
> > 11574291.000 |23:55:18.402 |AppInfo
> > |SIPSocketProtocol(2,100,14,137)::handleReadComplete send SdlReadRsp:
> > size 370
> > 11574292.000 |23:55:18.402 |SdlStat  |Period: 19s #Lines: 1 #Bytes:
> > 117 Total Number of Buffers: 10000 Free LWM: 9998 Free LWM(total):
> > 9192
> > 11574293.000 |23:55:18.959 |AppInfo  |CMProcMon - ERROR  !!!!!!!! SDL
> > Router Services Dead: timeBeforeSleep: 804509608, currentMontiorTime:
> > 804530608, deltaTime: 21000, LoopCount= 19
> >
> > The deltaTime is 21000 which would be more then the 20 seconds for the
> > router thread in the previously mentioned document covering 3.x/4.x.
> >
> > Do these services shutdown and restart themselves if over 16 seconds
> > (SDL Timer) or 20 seconds (Router Thread) still in 10.x like that
> > document states at those same thresholds?
> >
> > VMWare KB suggests the SCSI timeout should be increased to handle
> > transient conditions. I would think these timers should be able to
> > handle transient issues without failing the service?  Maybe the
> > thresholds need to be increased to handle these periods.
> >
> >
> http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=1009465
> >
> > I've only seen this with the ccm and ctimanager service. None of the
> > other services.
> >
> > Thanks.
> >
> >
> > On Fri, Jan 22, 2016 at 2:44 PM, Ryan Ratliff (rratliff)
> > <rratliff at cisco.com <javascript:;>> wrote:
> >> How about CPU usage leading up to the cores?
> >> If you have CCM traces from before are there gaps in the timestamps
> between line numbers?
> >>
> >> If you are virtualized were those two nodes on the same host?  If so
> did other VMs on that host experience any odd issues at the time?
> >> -Ryan
> >>
> >> On Jan 22, 2016, at 3:29 PM, Erick Bergquist <erickbee at gmail.com
> <javascript:;>> wrote:
> >>
> >> Ryan,
> >>
> >> I'm seeing a small spike in IOWait but not for that long, No gaps in
> >> the Perfmon logs.  Only the ccm service core dumped on both the call
> >> manager nodes on this cluster at the same time.
> >>
> >> Would the system shutdown the CCM and CTI Manager service itself if
> >> these timers are to long, like that document from CCM 3.x/4.x days
> >> eludes to with the 16 second SDL Timer Thread and 20 Second Router
> >> Thread threshold?
> >>
> >>
> >> The subscriber had a different backtrace, Intentional Abort..
> >>
> >> ====================================
> >> backtrace - CUCM
> >> ===================================
> >> #0  0xf776c430 in __kernel_vsyscall ()
> >> #1  0xf697b871 in raise () from /lib/libc.so.6
> >> #2  0xf697d14a in abort () from /lib/libc.so.6
> >> #3  0x0837609e in IntentionalAbort () at ProcessCMProcMon.cpp:88
> >> #4  CMProcMon::verifySdlRouterServices () at ProcessCMProcMon.cpp:748
> >> #5  0x083762ca in CMProcMon::callManagerMonitorThread
> >> (cmProcMon=0xefc73038) at ProcessCMProcMon.cpp:429
> >> #6  0xf6c9c5a8 in ACE_OS_Thread_Adapter::invoke (this=0xe6a5d818) at
> >> OS_Thread_Adapter.cpp:103
> >> #7  0xf6c5c5f1 in ace_thread_adapter (args=0xe6a5d818) at
> >> Base_Thread_Adapter.cpp:126
> >> #8  0xf6932b39 in start_thread () from /lib/libpthread.so.0
> >> #9  0xf6a33c1e in clone () from /lib/libc.so.6
> >> ====================================
> >> info registers
> >> ====================================
> >>
> >>
> >> Thanks,
> >> Erick
> >>
> >>
> >> On Fri, Jan 22, 2016 at 1:23 PM, Ryan Ratliff (rratliff)
> >> <rratliff at cisco.com <javascript:;>> wrote:
> >>> That backtrace is usually accompanied by CTIManager crashing due to
> intentional abort.
> >>>
> >>>> 09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
> >>>> - Test Timer exceeded minimum timer latency threshold of 1000
> >>>> milliseconds, Actual latency: 54360 milliseconds
> >>>
> >>> That is quite spectacularly bad.  The SDLTimer thread is responsible
> for _every_ timer the ccm process uses. It has a one-second self-check to
> make sure it’s getting enough cpu time and in your case it didn’t get to
> execute it’s one-second check for 54 seconds.
> >>>
> >>> This was definitely performance related so check CPU usage and IOWait
> at the time (+- a minute or so). Gaps in the RISDC perfmon logs would also
> indicate performance issues, though leave you with little evidence unless
> it happened to catch some process at the beginning of a spike.
> >>>
> >>> -Ryan
> >>>
> >>> On Jan 22, 2016, at 1:13 PM, Erick Bergquist <erickbee at gmail.com
> <javascript:;>> wrote:
> >>>
> >>> Brian,
> >>>
> >>> Looked over those and not seeing any processes consuming lots of
> >>> memory, memory used was 70-75% during this time.
> >>>
> >>>
> >>>> On Fri, Jan 22, 2016 at 10:54 AM, Brian Meade <bmeade90 at vt.edu
> <javascript:;>> wrote:
> >>>> Did you download Perfmon logs?  Might just be high memory in general
> causing
> >>>> the issue.  I'd use the Perfmon logs to check what processes were
> using up
> >>>> all the memory at that time.
> >>>>
> >>>> On Fri, Jan 22, 2016 at 10:45 AM, Erick Bergquist <erickbee at gmail.com
> <javascript:;>>
> >>>> wrote:
> >>>>>
> >>>>> This happened on CUCM 10.5.2.10000-7 and I'm not finding a bug for
> >>>>> this backtrace info in bug toolkit or the 10.5.2 SU1, SU2, SU3
> release
> >>>>> notes.
> >>>>>
> >>>>>
> >>>>> From the core analysis,
> >>>>>
> >>>>> Reading symbols from /lib/libnss_files.so.2...(no debugging symbols
> >>>>> found)...done.
> >>>>> Loaded symbols for /lib/libnss_files.so.2
> >>>>> Reading symbols from /lib/libnss_dns.so.2...(no debugging symbols
> >>>>> found)...done.
> >>>>> Loaded symbols for /lib/libnss_dns.so.2
> >>>>> Core was generated by `/usr/local/cm/bin/ccm'.
> >>>>> Program terminated with signal 11, Segmentation fault.
> >>>>> #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
> >>>>>
> >>>>> ====================================
> >>>>> backtrace - CUCM
> >>>>> ===================================
> >>>>> #0  0xf6a28b55 in _int_malloc () from /lib/libc.so.6
> >>>>> #1  0xf6a29d9e in malloc () from /lib/libc.so.6
> >>>>> #2  0x09be122c in AlternateSyslogAlarmMonitor::manageLogs
> >>>>> (this=0xd7d74f8, severity=6, strBuf=<value optimized out>) at
> >>>>> AlternateSyslogAlarmMonitor.cpp:162
> >>>>> #3  0x09be13f6 in AlternateSyslogAlarmMonitor::AlarmThreadProc
> >>>>> (pParameter=0xd7d74f8) at AlternateSyslogAlarmMonitor.cpp:61
> >>>>> #4  0xf6d005a8 in ACE_OS_Thread_Adapter::invoke (this=0xd7d8dc8) at
> >>>>> OS_Thread_Adapter.cpp:103
> >>>>> #5  0xf6cc05f1 in ace_thread_adapter (args=0xd7d8dc8) at
> >>>>> Base_Thread_Adapter.cpp:126
> >>>>> #6  0xf6996b39 in start_thread () from /lib/libpthread.so.0
> >>>>> #7  0xf6a97c1e in clone () from /lib/libc.so.6
> >>>>> ====================================
> >>>>> info registers
> >>>>> ====================================
> >>>>> eax            0x198408
> >>>>> ecx            0xf6b483d0-155941936
> >>>>> edx            0xf6b483a0-155941984
> >>>>> ebx            0xf6b46ff4-155947020
> >>>>> esp            0xf5d0ab100xf5d0ab10
> >>>>> ebp            0xf5d0ab980xf5d0ab98
> >>>>> esi            0xf4bc0b8256622776
> >>>>> edi            0x00
> >>>>> eip            0xf6a28b550xf6a28b55 <_int_malloc+725>
> >>>>> eflags         0x10212[ AF IF RF ]
> >>>>> cs             0x2335
> >>>>> ss             0x2b43
> >>>>> ds             0x2b43
> >>>>> es             0x2b43
> >>>>> fs             0x00
> >>>>> gs             0x6399
> >>>>> ====================================
> >>>>> info threads
> >>>>> ====================================
> >>>>> 47 Thread 0xe750bb70 (LWP 30608)  0xf6a237f4 in _IO_flush_all_lockp
> >>>>> () from /lib/libc.so.6
> >>>>> 46 Thread 0xf5effb70 (LWP 29077)  0xf77d0430 in __kernel_vsyscall ()
> >>>>> 45 Thread 0xf63ffb70 (LWP 29052)  0xf77d0430 in __kernel_vsyscall ()
> >>>>> 44 Thread 0xf67d0b70 (LWP 29048)  0xf77d0430 in __kernel_vsyscall ()
> >>>>> 43 Thread 0xf67e0710 (LWP 25984)  0xf77d0430 in __kernel_vsyscall ()
> >>>>> 42 Thread 0xe6f2fb70 (LWP 30611)  0xf77d0430 in __kernel_vsyscall ()
> >>>>>
> >>>>>
> >>>>> No errors/etc in the Syslog at the time other then errors on the call
> >>>>> manager service failing.
> >>>>>
> >>>>> However, the SDL CCM traces show of a timer expiring and appears it
> >>>>> shuts down services due to the timer issue.
> >>>>>
> >>>>> From the following cisco doc,
> >>>>>
> >>>>>
> http://www.cisco.com/c/en/us/support/docs/voice-unified-communications/unified-communications-manager-callmanager/46806-cm-crashes-and-shutdowns.html
> >>>>> it looks like the SDL timer thread stopped and service shutdown?
> >>>>> This doc is from CCM 3.x and 4.x time frame.  Are the timers
> mentioned
> >>>>> in this doc still true for 10.5 - where it shuts down at 16 seconds
> >>>>> for SDL timer thread or 20 seconds for SDL router thread.
> >>>>>
> >>>>>
> >>>>> 09465137.000 |23:55:15.416 |Stopping |
> >>>>>    |                               |SdlTCPConnection(1,100,14,37)
> >>>>> |SdlTCPConnection(1,100,14,37)    |
> >>>>>    |NumOfCurrentInstances: 39
> >>>>>
> >>>>> 09465138.000 |23:55:15.448 |SdlSig   |SdlVerificationTimer
> >>>>>    |monitoring                     |CMProcMon(1,100,234,1)
> >>>>> |SdlTimerService(1,100,3,1)       |1,100,234,1.3139616^*^*
> >>>>>    |*TraceFlagOverrode
> >>>>>
> >>>>> 09465138.001 |23:55:15.448 |AppInfo  |CMProcMon::star_sdlVerification
> >>>>> - Test Timer exceeded minimum timer latency threshold of 1000
> >>>>> milliseconds, Actual latency: 54360 milliseconds
> >>>>>
> >>>>> 09465139.000 |23:55:16.135 |AppInfo  |CallManagerFailure - Indicates
> >>>>> an internal failure in Unified CM Host Name of Hosting Node:CUCM1
> Host
> >>>>> Node IP Address:192.168.1.5 Reason code:4 Additional Text: App
> >>>>> ID:Cisco CallManager Cluster ID:StandAloneCluster Node ID:CUCM1
> >>>>>
> >>>>> 09465140.000 |23:55:16.135 |AlarmErr |AlarmClass: CallManager,
> >>>>> AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage:
> >>>>> , AlarmDescription: Indicates an internal failure in Unified CM,
> >>>>> AlarmParameters:  HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
> >>>>> Text:, AppID:Cisco CallManager, ClusterID:StandAloneCluster,
> >>>>> NodeID:CUCM1
> >>>>>
> >>>>> 09465141.000 |23:55:17.465 |AppInfo  |CallManagerFailure - Indicates
> >>>>> an internal failure in Unified CM Host Name of Hosting Node:CUCM1
> Host
> >>>>> Node IP Address:192.168.1.5 Reason code:4 Additional Text:CCM
> >>>>> Intentional Abort: SignalName: SdlVerificationTimer, DestPID:
> >>>>> CMProcMon[1:100:234:1] App ID:Cisco CallManager Cluster
> >>>>> ID:StandAloneCluster Node ID:CUCM1
> >>>>>
> >>>>> 09465142.000 |23:55:17.465 |AlarmErr |AlarmClass: CallManager,
> >>>>> AlarmName: CallManagerFailure, AlarmSeverity: Critical, AlarmMessage:
> >>>>> , AlarmDescription: Indicates an internal failure in Unified CM,
> >>>>> AlarmParameters:  HostName:CUCM1, IPAddress:192.168.1.5, Reason:4,
> >>>>> Text:CCM Intentional Abort: SignalName: SdlVerificationTimer,
> DestPID:
> >>>>> CMProcMon[1:100:234:1], AppID:Cisco CallManager,
> >>>>> ClusterID:StandAloneCluster, NodeID:CUCM1,
> >>>>>
> >>>>> 09465138.002 |23:55:17.566 |AppInfo  |TimerThreadSlowed - Timer
> thread
> >>>>> has slowed beyond acceptable limits App ID:Cisco CallManager Cluster
> >>>>> ID:StandAloneCluster Node ID:CUCM1
> >>>>>
> >>>>> 09465138.003 |23:55:17.566 |AlarmErr |AlarmClass: CallManager,
> >>>>> AlarmName: TimerThreadSlowed, AlarmSeverity: Critical, AlarmMessage:
> ,
> >>>>> AlarmDescription: Timer thread has slowed beyond acceptable limits,
> >>>>> AlarmParameters:  AppID:Cisco CallManager,
> >>>>> ClusterID:StandAloneCluster, NodeID:CUCM1,
> >>>>>
> >>>>> 09465143.000 |23:55:17.566 |SdlSig   |ReapOldTokenRegistrationsTimer
> >>>>>    |wait                           |SIPStationInit(1,100,72,1)
> >>>>> |SdlTimerService(1,100,3,1)       |1,100,146,1.1^*^*
> >>>>>    |[R:H-H:36,N:55,L:1,V:1,Z:0,D:0]
> >>>>> _______________________________________________
> >>>>> cisco-voip mailing list
> >>>>> cisco-voip at puck.nether.net <javascript:;>
> >>>>> https://puck.nether.net/mailman/listinfo/cisco-voip
> >>> _______________________________________________
> >>> cisco-voip mailing list
> >>> cisco-voip at puck.nether.net <javascript:;>
> >>> https://puck.nether.net/mailman/listinfo/cisco-voip
> >>
>
-------------- next part --------------
An HTML attachment was scrubbed...
URL: <https://puck.nether.net/pipermail/cisco-voip/attachments/20160128/7f76bfcd/attachment.html>