[cisco-voip] Database Layer Change Notification

Wes Sisk wsisk at cisco.com
Mon Apr 27 14:36:14 EDT 2009


Robert,

Great investigation on your part.  Just one question - what CM version?

Given the Tables head/tail/new, I know it's some 4.x version.

Looking forward:  CM4.x is approaching end of sale and end of software maintenance:
http://www.cisco.com/en/US/products/sw/voicesw/ps556/prod_eol_notices_list.html

This entire architecture was rewritten in CM5, then again in CM6.  CM6 
and 7 are very similar, and both have been pretty stable in this area so 
far, much more so than CM4.x.

There have been many incarnations of this problem.  It is the 
responsibility of the "database layer monitor service", aupair.exe, on the 
publisher to read entries out of those tables 500 at a time and dispatch 
them to active processes (executables) on nodes within the cluster.  If 
those tables are getting backed up, the first question to ask is whether 
database layer monitor is running on the publisher.  If aupair.exe is 
intermittently stopping or hanging on your server, that only exacerbates 
the condition.  Originally those tables did not have indexes on them, so 
the more data in the table, the slower things ran.  You can see how this 
becomes a funnel: more->slower->even more->even slower ....  See
Point 1:
CSCsf31622    change notification performance degrades rapidly under load
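
To make the funnel concrete, here is a toy model in Python.  It is not a 
model of aupair.exe internals: the only number taken from the description 
above is the batch size of 500.  The per-read costs, the tick budget, and 
the function name simulate_backlog are all assumptions made up for 
illustration.  The idea is simply that an unindexed read costs time 
proportional to the rows already in the table, while an indexed read 
costs a fixed amount.

```python
BATCH_SIZE = 500        # from the description above: entries are read 500 at a time
FIXED_READ_COST = 0.01  # assumed cost of one indexed batch read (arbitrary units)
TICK_BUDGET = 1.0       # assumed time the dispatcher gets per tick (arbitrary units)


def simulate_backlog(arrivals_per_tick, ticks, indexed,
                     scan_cost_per_row=1e-4, start_rows=0):
    """Return the number of queued rows left at the end of each tick.

    Hypothetical cost model: without an index every batch read scans the
    whole table, so its cost grows with the current row count.
    """
    rows = start_rows
    history = []
    for _ in range(ticks):
        rows += arrivals_per_tick
        budget = TICK_BUDGET
        while rows > 0:
            if indexed:
                cost = FIXED_READ_COST
            else:
                cost = max(FIXED_READ_COST, scan_cost_per_row * rows)
            if cost > budget:
                break  # out of time this tick; the backlog carries over
            budget -= cost
            rows -= min(BATCH_SIZE, rows)
        history.append(rows)
    return history
```

With these made-up constants, a steady 2000 rows per tick drains fine 
either way.  Start from a 20000-row burst, though, and the indexed case 
recovers within one tick while the unindexed case cannot afford even one 
read and only grows from there.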

BUT... aupair.exe should never be stopped, so entries in the tables 
should never back up.  If aupair.exe remains running but appears 
dysfunctional, I would suspect
Point 2:
CSCse41788    Change Notification Fails - DBLCNQueue Counts Rise - DBL 
Ptr Corruption

This is indicated in the DBL_AUPAIR*.txt traces by something like:
DBL_AUPAIR00004655.txt
09/26/2006 23:52:14.726 DBLAUP|   CConnectionHolder::OpenODBCConnection 
Odbc Connect OdbcDSN [ìÃèw,oft][ODBC SQL Server Driver][SQL 
Server]Changed database context to 'CCM0302'.] GlassHouse 
[ìÃèw,oft][ODBC SQL Server Driver][SQL Server]Changed database context 
to 'CCM0302'.]  m_StaticLatestDSN 
[DSN=CiscoCallManager;SERVER=CM2;DATABASE=CCM0302;Trusted_Connection=yes]

Note the garbage in the GlassHouse string.  That should be a well-formed 
DSN similar to the 'DSN=.*' later in the string.
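
If you want to scan a pile of DBL_AUPAIR*.txt traces for this signature, 
a rough Python sketch follows.  It assumes each trace entry carries the 
"GlassHouse [...]  m_StaticLatestDSN [...]" layout shown above and treats 
any GlassHouse value that does not begin with "DSN=" as suspect.  The 
function names are mine, not anything from the product, and the check is 
only a heuristic based on this one sample.

```python
import re
from typing import Optional

# Layout assumed from the single sample trace entry quoted above.
_GLASSHOUSE = re.compile(r"GlassHouse \[(.*?)\] m_StaticLatestDSN \[")


def glasshouse_value(trace_entry: str) -> Optional[str]:
    """Return the text inside 'GlassHouse [...]', or None if the field is absent."""
    # Collapse line wraps and runs of spaces so a wrapped entry still matches.
    flat = " ".join(trace_entry.split())
    match = _GLASSHOUSE.search(flat)
    return match.group(1) if match else None


def glasshouse_looks_corrupt(trace_entry: str) -> Optional[bool]:
    """True if the GlassHouse value does not look like a DSN string.

    Returns None when the entry has no GlassHouse field at all.
    """
    value = glasshouse_value(trace_entry)
    if value is None:
        return None
    return not value.startswith("DSN=")
```

Run it over each entry in the trace files and look at anything that 
comes back True.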

This defect occurred about twice a year for us, so it was relatively 
rare but still a concern.  We never got to 100% root cause but put in 
some defensive code.  Even after this fix there were subsequent reports.  
There was another defect open for this which I cannot currently locate.


Point 3:
Aupair grabs change notifications from those tables and sends them out 
to processes running on nodes in the cluster via TCP.  We have seen 
those TCP sessions get aborted and otherwise hung causing aupair to hang:
CSCsa64684    Change Notify stops working due to bug in TcpLib

Are your servers separated by a WAN, firewalls, or any TCP inspection 
device that may interrupt TCP sessions?  Are any processes in your 
cluster flapping or crashing so they might not consume their change 
notifications?

Point 4:
Aupair grabs change notifications from those tables and sends them out 
to processes running on nodes in the cluster via TCP.  Once they are 
acknowledged, aupair deletes those change notifications from the SQL 
tables.  If a process holds locks on the database schema, or specifically 
on these tables in the database, then the deletes will not be allowed.  
Monitoring tools such as NetIQ and Prognosis are somewhat notorious for 
grabbing and holding locks in SQL.  When you find the tables backing up, 
grab these outputs for your CCM03xx database using SQL Query Analyzer:
sp_who2
sp_lock

In the sp_who2 output you can see whether any process is 'waiting' on a 
lock.  You can also see who has locks.
sp_lock shows specifically who holds which locks.
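
Once you have the sp_who2 output saved, something like the Python sketch 
below can pick out who is blocking whom.  It assumes you have already 
turned the output into one dict per row keyed by the sp_who2 column 
names (SPID, BlkBy, ProgramName, ...), and that BlkBy holds the blocking 
SPID or just "." when the session is not blocked.  The function names 
and the sample rows are made up for illustration.

```python
from collections import defaultdict


def blocking_chains(rows):
    """Map each blocking SPID to the list of SPIDs waiting on it."""
    waiters = defaultdict(list)
    for row in rows:
        blocker = row["BlkBy"].strip()
        if blocker and blocker != ".":
            waiters[int(blocker)].append(int(row["SPID"]))
    return dict(waiters)


def head_blockers(rows):
    """Return SPIDs that block others but are not themselves blocked."""
    chains = blocking_chains(rows)
    blocked = {spid for waiting in chains.values() for spid in waiting}
    return sorted(spid for spid in chains if spid not in blocked)


if __name__ == "__main__":
    # Hypothetical rows, not real output from any cluster.
    sample = [
        {"SPID": "51", "BlkBy": "  .", "ProgramName": "monitoring agent"},
        {"SPID": "55", "BlkBy": "51", "ProgramName": "aupair"},
        {"SPID": "60", "BlkBy": "55", "ProgramName": "some other client"},
    ]
    print(blocking_chains(sample))
    print(head_blockers(sample))
```

The head blockers are the SPIDs to look up in the sp_lock output, since 
everything else in the chain is just waiting behind them.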


Other Random Points:
Other issues that may be related to change notification backups:
CSCsj09236    Unable to clear CFA from IP phone
CSCse70772    DA entries get corrupted due to out of order change 
notification for DN
CSCsl21023    Change notification broken after CM 
deactivated,DBLCNQueue* tbls filling


Regards,
Wes

On Monday, April 27, 2009 1:15:34 PM, Robert Singleton 
<rsingleton at morsco.com> wrote:
> Hello, all!
>
> I've just recovered from the second (known) occurrence of a problem 
> wherein a table in CallManager's database, DBLCNQueueHead, seems to 
> fill up and never empty, eventually bringing database changes to a 
> grinding halt.
>
> Both times, there has been an otherwise inexplicable call handling 
> issue that eventually led to a reboot of the cluster as a 
> last-ditch-finger-crossing-wood-knocking attempt to make it go again. 
> Both times, the original complaint was not resolved and the reboot 
> apparently caused a new error to appear whenever any database change 
> was attempted.
>
> The first time, Call Forwarding was stuck in whatever state a given DN 
> was set to. If a DN was forwarded, the act of removing forwarding 
> appeared to work, but calls to the DN were still forwarded. Likewise, 
> if one forwarded a DN, it would appear to take the command, but the DN 
> would continue to ring locally. Eventually, we tried the reboot (what 
> I unaffectionately call "The Windows Fix") and when I started getting 
> the errors afterward, I opened a TAC SR. I was passed around until I 
> got an engineer who was very comfortable with the database and found 
> that a few tables that were apparently related to database change 
> notification were jam packed with 100's of thousands of records.
>
> Last Friday, I had two locations for which incoming calls did not work 
> correctly. Some telephones at each site appeared to be stuck loading a 
> template, though they appeared to be registered in CallManager. Some 
> switch and routing troubleshooting appeared to point to a UDP problem, 
> but it was eventually discovered that certain telephones in the 
> locations did work, though they were phones *without* the incoming DN 
> on them.
>
> We handle incoming calls at most locations by sending calls to shared 
> DNs on most, if not all, telephones at the locations. Since phones 
> without incoming lines were operating normally, we started by picking 
> one phone, wiping it out and reconfiguring it one line at a time. We 
> found that once we added the lead number of the hunt group, that phone 
> began choking on loading a template. So, we deleted all traces of the 
> DNs associated with incoming calls at that particular location but 
> when we began adding them back, adding that lead DN number would again 
> bring down the affected phones.
>
> At that point, we decided that rebooting the cluster would probably be 
> a good idea. When the system was back up, however, I now began getting 
> errors whenever I tried to make any database changes.
>
> I then reviewed TAC history to find when we'd had similar issues and 
> found where an engineer had determined that we had 200K+ entries in 
> the DBLCNQueueHead table in the CCM0301 database. I looked and I had 
> over 456K rows.
>
> I followed the same procedure, which was basically to truncate the 
> three tables associated with change notification. For 456K rows to 
> truncate takes almost 9 hours. Once that was done, not only could I 
> now make database changes, but the original symptoms went away.
>
> Now when I check properties on those tables, they have either one row 
> or no rows, depending on which table.
>
> I apologize for the exceptionally long introduction, but the real 
> question is: What do these tables do? What makes them "stick" and fill 
> up? How many rows is a critical number; when will it break because 
> this table isn't clearing out?
>
> The three tables are:
>
> DBLCNQueueHead
> DBLCNQueueNew
> DBLCNQueueOld
>
> I have viewed the contents of DBLCNQueueHead while making various 
> database changes and the one row never changes. Color me confused.
>
> Thanks!!!
>
> Robert
