[cisco-voip] Database Layer Change Notification

Mon Apr 27 16:55:04 EDT 2009

Hi Robert,

You can get details on each of the mentioned defects in the Cisco Bug 
Toolkit on Cisco.com. CSCsf31622 specifically applies to 'change 
notification load'.  What does that mean?  Enterprise-wide extension 
mobility login/logout and enterprise-wide MWI refresh are the two 
biggest change notification generators that come to mind.  My point here 
was not load but rather backup.  Once change notification (CN) begins to 
back up, performance degrades rapidly.  If aupair gets stopped for 
some reason so that the queue gets backed up, then aupair will have a 
very difficult time catching up once it is started again.

CM5.x and 6.x offer perfmon counters to monitor for CN backups.  I am 
not aware of a good way to monitor in CM4.x.

One thing we have used successfully in the past is a set of 
'TraceCollection' scripts authored by Jim Cardon at Cisco.  These 
periodically pull traces from defined CM nodes and parse the traces for 
specified strings.  You could use this to look for aupair performing 
normally ("Select TOP 499 * from DBLCNQueueHead AS H JOIN DBLCNQueueOld 
AS O ON O.ref=H.ref JOIN DBLCNQueueNew AS N ON N.ref=H.ref order by 
H.seq").  If the string is not found, then alert.
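
To illustrate the idea (this is not Jim's script, just a minimal sketch 
of the same check), something like the following Python would do, 
assuming the traces have already been copied somewhere as plain text 
files.  The function names are made up for this example, and it matches 
only on the leading part of the query so that line wrapping and case 
differences in the trace do not matter:

```python
import re
from pathlib import Path
from typing import Iterable

# Leading part of the aupair query quoted above. Matching on a prefix is an
# assumption made here so that wrapped trace lines still match.
AUPAIR_MARKER = "Select TOP 499 * from DBLCNQueueHead"


def _normalize(text: str) -> str:
    """Collapse all whitespace runs to single spaces and lowercase the text."""
    return re.sub(r"\s+", " ", text).lower()


def marker_seen(text: str, marker: str = AUPAIR_MARKER) -> bool:
    """Return True if the marker occurs in the text, ignoring case and wrapping."""
    return _normalize(marker) in _normalize(text)


def files_missing_marker(paths: Iterable[Path],
                         marker: str = AUPAIR_MARKER) -> list[Path]:
    """Return the trace files in which the marker was NOT found."""
    missing = []
    for path in paths:
        # errors="replace" keeps odd bytes in a trace from aborting the scan.
        text = Path(path).read_text(errors="replace")
        if not marker_seen(text, marker):
            missing.append(Path(path))
    return missing
```

Run it on a schedule against the most recent trace files from each node 
and raise an alert for every file that files_missing_marker() returns.  
How you collect the traces and how you alert is up to you.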

CM5.x and 6.x have this trace collection and trace monitoring built in 
by default; they just need to be configured.

/Wes

On Monday, April 27, 2009 4:14:21 PM, Robert Singleton 
<rsingleton at morsco.com> wrote:
> Wes Sisk wrote:
>
>> Just one question - what CM version?
>
> Sorry, I skip that all the time...  4.1(3)sr4d
>
>> Point 1:
>> CSCsf31622    change notification performance degrades rapidly under 
>> load
>
> load where? Generally, my CallManagers are not especially heavily 
> loaded, CPU wise. That was not always the case. Once upon a time, just 
> leaving traces running was enough to affect call handling, but we 
> upgraded hardware a couple or three years ago and it works much better 
> now.
>
>> Point 2:
>> CSCse41788    Change Notification Fails - DBLCNQueue Counts Rise - 
>> DBL Ptr Corruption
>
> Assuming traces aren't rolled out, I will look for this...
>
>> Point 3:
>> Aupair grabs change notifications from those tables and sends them 
>> out to processes running on nodes in the cluster via TCP.  We have 
>> seen those TCP sessions get aborted and otherwise hung causing aupair 
>> to hang:
>> CSCsa64684    Change Notify stops working due to bug in TcpLib
>>
>> Are your servers separated by a WAN, firewalls, or any TCP inspection 
>> device that may interrupt TCP sessions?  Are any processes in your 
>> cluster flapping or crashing so they might not consume their change 
>> notifications?
>
> None that I am aware of. Both servers are in the same colo. I cannot 
> say at this writing whether or not they are on the same switch, but 
> it's fairly likely that they may be in adjacent racks and thus on 
> different 35XX switches. If so, these switches are probably daisy 
> chained with 100Mb copper connections. I will investigate.
>
>> Point 4:
>
>> When you find the tables backing up, grab these outputs for your 
>> CCM03xx database using SQL Query Analyzer:
>> sp_who2
>> sp_lock
>>
>> In the sp_who2 output you can see if any process is 'waiting' on a 
>> lock.  You can also see who has locks.
>> sp_lock shows specifically who has what locks.
>
> Interesting... Since the issue will have to be ongoing for these to 
> apply, do you have any specific recommendation for detecting the 
> backup before the world comes crashing down?
>
>
>> CSCsl21023    Change notification broken after CM 
>> deactivated,DBLCNQueue* tbls filling
>
> This definitely seems to fit my secondary symptom, DBCN being broken 
> after a reboot to attempt to blindly fix the other issue.
>
> Thanks for the info and reply!
>
> Robert