[outages] Java apps around the globe are crashing...

Jeremy Chadwick outages at jdc.parodius.com
Sun Jul 1 05:48:53 EDT 2012


I don't know what operating system you're running there, but I'm going
to assume some Linux distribution.

You also didn't provide the timezones those systems are in, so your
logging messages aren't as useful (to me) as they could be.  I'm going
to guess UTC+1 (keep reading for how I determined that :-) ):

I'll explain a bit more about this, but since I use FreeBSD the system
and behaviour may be a bit different.  I imagine the ntpd and kernel
time bits are similar/identical though.

This year's leap second occurred on 06/30 at 23:59:60 UTC or
thereabouts.  That means it works like so:

2012/06/30 23:59:59 UTC
2012/06/30 23:59:60 UTC -- leap second occurs here
2012/07/01 00:00:00 UTC

My below logs are taken from two FreeBSD systems (I can check more if
need be), which are UTC-7 (PDT) timezone:

Jun 28 06:39:55 sys1 ntpd[1560]: kernel time sync status change 6001
Jun 29 17:24:31 sys1 ntpd[1560]: kernel time sync status change 2011
Jun 30 17:04:57 sys1 ntpd[1560]: kernel time sync status change 2001

Jun 29 06:33:38 sys2 ntpd[77470]: kernel time sync status change 2001
Jun 29 17:14:34 sys2 ntpd[77470]: kernel time sync status change 2011
Jun 30 05:59:24 sys2 ntpd[77470]: kernel time sync status change 6011
Jun 30 06:07:58 sys2 ntpd[77470]: kernel time sync status change 2011
Jun 30 17:01:31 sys2 ntpd[77470]: kernel time sync status change 2001

(Note: bit 14 (0x4000) defines PLL or FLL operation mode (1=FLL).  On
FreeBSD this tends to toggle quite often, and nobody (TMK) has ever
figured out why.  My point is to ignore bit 14 in the above output)

For sys1, 06/29 17:24:31 + 7 hours = 06/30 00:24:31 UTC
For sys2, 06/29 17:14:34 + 7 hours = 06/30 00:14:34 UTC

Based on this we can see that ntpd almost certainly noticed the upcoming
leap second 24 hours in advance.  I imagine that value (24 hours, or
86400 seconds) is not chance -- Wikipedia hints at things like "mean
solar days" and so on.  So I imagine ntpd prepares in advance for this
ordeal.

Remember: ntpd does not blindly set the clock in large increments (like
ntpdate does) -- it works directly with the kernel to adjust the clock
in very, very small amounts (milliseconds or microseconds, I forget
which) so that there is minimal impact to underlying applications.
(I'll never forget people telling me back in the 90s to simply make a
cronjob that calls ntpdate... *shakes head)

Now for the status bits:

On FreeBSD the bits are described in /usr/include/sys/timex.h and are
part of the kernel, not ntpd itself:

#define STA_PLL         0x0001  /* enable PLL updates (rw) */
#define STA_INS         0x0010  /* insert leap (rw) */

These status bits are also used in "time sync error xxxx" messages (yes
I'm looking at the ntpd source).

On FreeBSD, the only condition where "kernel time sync error" can happen
is when ntp_adjtime(2) (a syscall) returns TIME_ERROR.  TIME_ERROR is
defined as "clock not synchronised".  Looking at the FreeBSD kernel
code, ntp_is_time_error() is what causes TIME_ERROR to be returned by
ntp_adjtime(2).  There are only 4 conditions where this can occur:

1. Hardware or software clock error
2. PPS signal lost when time or freq. sync is requested
3. PPS jitter when time sync requested
4. PPS wander exceeded or calibration error when freq. sync requested

So back to your logs:

Jun 30 01:07:05 -- enable STA_INS
Jun 30 14:42:25 -- time sync error (current status: PLL update + STA_INS)
Jun 30 18:10:21 -- time sync error (current status: PLL update + STA_INS)
Jul  1 01:08:22 -- disable STA_INS

Based on this I would say your system is is UTC+1, given how ntpd
behaves (described above).

As for your time sync errors -- not too sure about those.  I'm looking
at kernel mailing lists and general Usenet posts from folks familiar
with the NTP innards, but it's hard for me to determine if the errors
you saw were the result of STA_INS being enabled or if they were the
cause of something like, say, an upstream NTP server returning a time
that results in a very large offset.  It could also be that one or more
of the NTP servers upstream from you, which were active ("*" in ntpq -c
peers) went offline?  Not sure.  It could also be a kernel or ntpd bug.
Just from looking at Google for the phrase "PLL update" there are all
sorts of bugs/issues that have been discussed + fixed, and one of them
pertains to PLL updates needing to be done slightly differently for
situations where network connectivity to a server becomes erratic.

In general I wouldn't worry about it.  It's obvious that by 07/01 at
01:08 UTC+1 your system was back to normal.  :-)

-- 
| Jeremy Chadwick                                   jdc at koitsu.org |
| UNIX Systems Administrator                http://jdc.koitsu.org/ |
| Mountain View, CA, US                                            |
| Making life hard for others since 1977.             PGP 4BD6C0CB |

On Sun, Jul 01, 2012 at 09:09:36AM +0100, Colin Johnston wrote:
> Did however see some weird ntp error messages though
> 
> 
> Jun 30 01:07:05 [192.168.0.1.128.94] 2012:06:30-01:07:05 ntpd[29409]: kernel time sync status change 0011
> Jun 30 14:42:25 [192.168.0.1.128.94] 2012:06:30-14:42:25 ntpd[29409]: kernel time sync error 0011
> Jun 30 18:10:21 [192.168.0.1.128.94] 2012:06:30-18:10:21 ntpd[29409]: kernel time sync error 0011
> Jul  1 01:08:22 [192.168.0.1.128.95] 2012:07:01-01:08:22 ntpd[29409]: kernel time sync status change 0001
> 
> 
> On 1 Jul 2012, at 04:31, Colin Johnston wrote:
> 
> > all seemed to be ok via astaro vm
> > 2012:07:01-00:59:59 VM406 kernel: Clock: inserting leap second 23:59:60 UTC
> > 
> > 
> > Colin
> > 
> > On 1 Jul 2012, at 04:23, Jason Hellenthal wrote:
> > 
> >> 
> >> Would have replied to a later message but none looked to be just perfect
> >> so Ill reply here.
> >> 
> >> clock.nyc.he.net LOCAL(0)
> >> clock.fmt.he.net LOCAL(0)
> >> clock.sjc.he.net LOCAL(0)
> >> 
> >> All went to a LOCAL(0) refid sometime recently after being in .CDMA. for
> >> the longest time.
> >> 
> >> Maybe this is a small part of it but thought I would share the note.
> >> 
> >> On Sat, Jun 30, 2012 at 06:53:53PM -0700, Wil Schultz wrote:
> >>> Leap second bug. *sigh*
> >>> 
> >> 
> >> 
> >> -- 
> >> 
> >> - (2^(N-1))
> >> _______________________________________________
> >> Outages mailing list
> >> Outages at outages.org
> >> https://puck.nether.net/mailman/listinfo/outages
> > 
> 
> 
> _______________________________________________
> Outages mailing list
> Outages at outages.org
> https://puck.nether.net/mailman/listinfo/outages



More information about the Outages mailing list