[cisco-voip] Trace files for CUCM/NTP problem

Tue Dec 14 15:26:26 EST 2010

Ed,

One more thing. From all of your debug output here we can see something
different is happening with the "RefID" field.

The "RefID" field shows us the next upstream hop past the server we're
pointing to. Consider something like this:

Note - don't ever point to this server - just using it for example.

time.nist.gov (Stratum 1) <-- Your Local NTP Master (Stratum 2) <-- CUCM Pub
(Stratum 3) <-- CUCM Subs (Stratum 4)

So on your CUCM pub in this instance we would see:

Remote: Local NTP Master
Stratum: 2
RefID: time.nist.gov

This tells us your CUCM server is pointing to your local NTP server. Your
local NTP master is just 1 hop away from the root server, time.nist.gov.
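
To make the Remote / Stratum / RefID relationship concrete, here is a minimal sketch of the example chain above. The function name `peer_view` and the list layout are made up for illustration; nothing here is a CUCM or ntpd API, and real RefID values are usually IP addresses or short codes rather than hostnames.

```python
# Illustrative model of the example chain, ordered from the root downstream.
CHAIN = ["time.nist.gov", "Local NTP Master", "CUCM Pub", "CUCM Subs"]


def peer_view(client, chain=CHAIN, root_stratum=1):
    """Return what `client` would report about the server it points to.

    remote  = the server one hop upstream of the client
    stratum = that server's stratum (root is 1, each hop adds 1)
    refid   = the server one hop upstream of the remote, if any
    """
    i = chain.index(client)
    if i == 0:
        raise ValueError("the root server has no upstream peer")
    remote = chain[i - 1]
    refid = chain[i - 2] if i >= 2 else None
    return {"remote": remote, "stratum": root_stratum + (i - 1), "refid": refid}


if __name__ == "__main__":
    print(peer_view("CUCM Pub"))
    print(peer_view("CUCM Subs"))
```

For "CUCM Pub" this gives the same Remote, Stratum and RefID values listed above.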

In your case though, you see Stratum 16. 16 is like the infinite route in
RIP. NTP is throwing its hands in the air and saying "I have no idea what
the hell time it is".

Further, the RefID for that server shows ".STEP." This is a special
keyword (an NTP "kiss code") that means the time was off by further than NTP
can adjust for gradually, so the clock was stepped in one single shot and
has not yet resynchronized.

This seems to make sense with NTP restarting every 30 minutes. If we're
further out of sync than NTP can correct for, then we need to restart NTP so
the ntponeshot command can run to step the clock forwards or backwards by
several seconds or minutes at a time. Something NTP can't do alone without
the restart.
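
If you want to check for this condition across several nodes, something like the rough sketch below could pull the fields out of one peer line from 'utils ntp status'. The names `Peer`, `parse_peer` and `is_usable` are my own, and the parser assumes the ten-column layout shown in your output with only "*" or a space as the leading tally character. The reach column is an octal 8-bit history of the last eight polls, so it is parsed as base 8.

```python
from dataclasses import dataclass


@dataclass
class Peer:
    selected: bool   # True if the line is marked "*" (current sync source)
    remote: str
    refid: str
    stratum: int
    reach: int       # last eight polls as bits, newest in the lowest bit
    offset_ms: float


def parse_peer(line):
    """Parse one peer line; assumes the ten whitespace-separated columns
    remote, refid, st, t, when, poll, reach, delay, offset, jitter."""
    text = line.strip()
    selected = text.startswith("*")
    fields = text.lstrip("*").split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 columns, got {len(fields)}")
    return Peer(
        selected=selected,
        remote=fields[0],
        refid=fields[1],
        stratum=int(fields[2]),
        reach=int(fields[6], 8),  # reach is printed in octal
        offset_ms=float(fields[8]),
    )


def is_usable(peer):
    """Stratum 16 means unsynchronized, so the peer cannot be used."""
    return peer.stratum < 16


if __name__ == "__main__":
    bad = parse_peer("10.192.20.10    .STEP.          16 u  488  512  376    0.244   16.553 0.052")
    print(bad, is_usable(bad), format(bad.reach, "08b"))
```

Run against the 10.192.20.10 line from your output, it flags the peer as unusable because of the stratum, whatever the RefID says.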

Here is an interesting experiment to find out what's happening. Remove the
NTP server entries from CUCM right at the start of the hour. Wait 4 hours.
At the end of those 4 hours, find out how far off the CUCM clock is
from your watch (or PC clock) via "show status". Then look at the NTP server
and find out if its time still matches what you expect.

What is probably causing this is a hardware clock on the CUCM server
(motherboard) or the NTP server, drifting faster than NTP can correct for.

NTP can correct for errors of 500 parts per million. This works out to about
43 seconds of drift in 24 hours. If after 4 hours your CUCM server clock is
off by more than 7 seconds - then you need a new motherboard. It might be best
to wait 24 hours instead of 4 hours, just to get a more accurate idea of how
fast your clock might be drifting.
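
The arithmetic behind those numbers, as a small sketch. The 500 ppm limit is the figure quoted above; the function names and the 10-second sample reading are only for illustration.

```python
MAX_CORRECTABLE_PPM = 500  # limit quoted above


def max_drift_seconds(hours, ppm=MAX_CORRECTABLE_PPM):
    """Largest drift (in seconds) NTP could still correct over `hours`."""
    return hours * 3600 * ppm / 1_000_000


def observed_ppm(offset_seconds, hours):
    """Drift rate implied by a clock being `offset_seconds` off after `hours`."""
    return abs(offset_seconds) / (hours * 3600) * 1_000_000


if __name__ == "__main__":
    print(max_drift_seconds(24))   # about 43.2 seconds per day
    print(max_drift_seconds(4))    # about 7.2 seconds in 4 hours
    rate = observed_ppm(10, 4)     # hypothetical reading: 10 s off after 4 h
    print(round(rate), rate > MAX_CORRECTABLE_PPM)
```

So a clock that is 10 seconds off after 4 hours is drifting at roughly 694 ppm, which is past what NTP can keep up with.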

I'd be really interested to see what you find. I've replaced a few
motherboards in IBM servers for this exact problem.

-Burns

On Tue, Dec 14, 2010 at 11:23 AM, Ed Leatherman <ealeatherman at gmail.com> wrote:

> I was about to say IOS devices are OK.. but I noticed the poll value
> was 64 on the one I was reviewing... in a stable environment that
> should be 1024 "steady-state", it sounds like. Some mischief is afoot.
>
> Thanks for the NTP tips.
>
> On Mon, Dec 13, 2010 at 7:12 PM, Jason Aarons (US)
> <jason.aarons at us.didata.com> wrote:
> > Have you pointed a different router/switch to your NTP server? Are they
> > getting 16 as well? I recall a high offset/variation from clock can also
> > make it 16.
> >
> >
> >
> >
> >
> > An IOS device initially polls every 64 seconds; as the NTP server and client
> > become better synced and there aren't dropped packets, this number increases
> > to a maximum of 1024.
> >
> > http://www.nil.si/ipcorner/BeOnTime/
> >
> >
> http://www.cisco.com/en/US/products/sw/iosswrel/ps1818/products_tech_note09186a008015bb3a.shtml
> >
> >
> >
> > “while the highest level (stratum 16) usually indicates that the clock is
> > not working or unaccessible”
> >
> >
> >
> >
> >
> > From: cisco-voip-bounces at puck.nether.net
> > [mailto:cisco-voip-bounces at puck.nether.net] On Behalf Of Jason Burns
> > Sent: Monday, December 13, 2010 6:57 PM
> > To: Wes Sisk
> > Cc: Cisco VOIP
> > Subject: Re: [cisco-voip] Trace files for CUCM/NTP problem
> >
> >
> >
> > Ed,
> >
> >
> >
> > CUCM is preferring the local clock, because your NTP reference has a
> > Stratum of 16!
> >
> >
> >
> > 10.192.20.10    .STEP.          16 u  488  512  376    0.244   16.553   0.052
> >
> >
> >
> > Fix your NTP server 10.192.20.10 and you'll fix your CUCM.
> >
> >
> >
> > -Burns
> >
> > On Mon, Dec 13, 2010 at 11:50 AM, Wes Sisk <wsisk at cisco.com> wrote:
> >
> > what version of CM?  Many changes to NTP, especially this one:
> > CSCsk70971    publisher NTP down if configured NTP down or unreliable
> >
> > my interpretation:
> > something on the network NTP source changed
> > now subscribers giving error that pub is unreliable
> >
> > this is expected if pub cannot sync to NTP source. what changes did they
> > make? is it still a viable NTP source for the publisher? if not, publisher
> > will use local clock which makes it an invalid source for all subs.
> >
> >
> http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/srnd/8x/netstruc.html#wpmkr1185636
> >
> >
> >
> > /Wes
> >
> > Ed Leatherman wrote:
> >
> > Hi folks,
> >
> >
> >
> > Our operations team updated the NTP service recently (infoblox), and
> > right after that happened, I started getting syslog errors per below
> > on two different CUCM 7 clusters, both of which use that NTP server.
> >
> >
> >
> > ntpRunningStatus.sh: Primary node NTP server, OWP-PUB, is currently
> > inaccessible or down. Verify the network between the primary and
> > secondary nodes.  Check the status of NTP on both the primary and
> > secondary nodes via CLI 'utils ntp status'.  If the network is fine,
> > try restarting NTP using CLI 'utils ntp restart'.
> >
> >
> >
> > Looking at the status on these servers, the pub looks OK but the subs show:
> >
> > utils ntp status on all secondary nodes comes up with (example):
> >
> >      remote           refid      st t when poll reach   delay   offset  jitter
> > ==============================================================================
> > *127.127.1.0     LOCAL(0)        10 l   32   64  377    0.000    0.000   0.004
> >  10.192.20.10    .STEP.          16 u  488  512  376    0.244   16.553   0.052
> >
> >
> >
> > Restarting NTP on all nodes fixes the problem temporarily (NTP status
> > goes back to normal) but only for a short time.
> >
> > The NTP logs don't show anything other than what appears to be the NTP
> > service restarting every 30 minutes.. is this normal?
> >
> >
> > 11/16/2010 23:00:02 sd_ntp|*********************************************************|<LVL::Info>
> > 11/16/2010 23:00:02 sd_ntp|          Running sd_ntp. Process Id=12302               |<LVL::Info>
> > 11/16/2010 23:00:02 sd_ntp|*********************************************************|<LVL::Info>
> > 11/16/2010 23:00:02 sd_ntp||<LVL::Info>
> > 11/16/2010 23:00:02 sd_ntp|[528] Command Line parameters: -list -s|<LVL::Info>
> > 11/16/2010 23:00:02 sd_ntp|[585] The file /etc/ntp.conf exists|<LVL::Debug>
> > 11/16/2010 23:00:02 sd_ntp|[421] /etc/ntp/drift file is not changed|<LVL::Debug>
> > 11/16/2010 23:00:02 sd_ntp|[603] Listing all the servers|<LVL::Debug>
> > 11/16/2010 23:00:02 sd_ntp|sd_ntp exitinng normally.|<LVL::Info>
> >
> >
> >
> > In both clusters, the pub and most or all of the subs are on the same
> > VLAN and physical switch.
> >
> > What other traces can I look at on CM to troubleshoot this? Anyone
> > know if there is a debug for the process that's generating my syslog
> > errors?
> >
> > I want to make sure it's not an error on my end and hopefully have
> > some better information on whats broke before I go back to the
> > operations group. All the IOS routers using infoblox for NTP appear to
> > be working just fine, so they see no problems :)
> >
> >
> >
> > Thanks in advance!
> >
> >
> >
> >
> >
> > _______________________________________________
> > cisco-voip mailing list
> > cisco-voip at puck.nether.net
> > https://puck.nether.net/mailman/listinfo/cisco-voip
> >
> >
>
>
>
> --
> Ed Leatherman
>