[ednog] Building a cheap NTP stratum-2 network with retired Cisco gear

Wed Jun 22 18:06:15 EDT 2005

I've had some conversations with others who are on this list about an
NTP stratum-2 implementation and it seemed worthy of posting my notes
here for others to comment on or possibly take advantage of.  Note, I
did something very similar to this at the institution I was at
prior to Northwestern and someone from there is listening and I'm sure
will readily pipe up and say "we removed that, don't do it!" if this
is not a good idea.  Maybe they will at least chime in with some stats
since I no longer have access to that infrastructure.  Otherwise, in my
experience this seems to work relatively well and talking it over with
a couple of others seems to suggest this is a relatively decent way to
build an NTP stratum-2 network cheaply.

The basic approach is described at the following:

  <http://www.ntp.org/ntpfaq/NTP-s-config-adv.htm#AEN2912>

In my particular case, we're using cheap cisco gear that is no longer
in service on the production network.  Previously I used some 2600
routers and now at NU I'm using cat 3524 switches.  Using at least
four 3524's, we'll have something that will look like this (where the
3524's are ntp1, ntp2, ntp3 and ntp4 in the diagram - all peered to
each other):

   st-1a  st-1b   st-1c  st-1d   st-1e  st-1f   st-1g  st-1h
      \    /         \    /         \    /         \    /
       ntp1 --------- ntp2 --------- ntp3 --------- ntp4
        ^              ^              ^              ^
       / \            / \            / \            / \
     fan-out        fan-out        fan-out        fan-out

  st-1?    unique upstream stratum-1 server outside of NU
  ntp?     internal stratum-2 server that peers with other internal-2's
  fan-out  are end system NTP clients

We have enough 3524's that we can have a supply of readily available
hot spares and since these devices have a lifetime warranty, we can
easily get them replaced with something if the need arises.

Note, ideally the clients will get time from at least three of our
stratum 2's, but most clients tend to just use one.  To help balance
the load in the latter case we will use DNS round robin with a
name such as time.northwestern.edu.  Anycast NTP could also be used,
but not with 3524's unless they ran routing code (which we won't be
doing).  Even then I don't expect this to really buy us a whole lot
and the little bit of added complexity (and potential for NTP breakage,
however unlikely - 4 packets are used in NTP) just doesn't seem like
it is worth it today.
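
To make the round-robin idea concrete, here is a minimal Python sketch of
how a client or a monitoring script could list every address published
behind one name and then pick up to three of them.  The function names
are mine, the sketch assumes the name carries several address records,
and time.northwestern.edu is only the example name from above.

```python
import socket


def unique_addresses(addrinfo):
    """Return the distinct IP addresses in a getaddrinfo() result, in order."""
    seen = []
    for _family, _type, _proto, _canon, sockaddr in addrinfo:
        ip = sockaddr[0]
        if ip not in seen:
            seen.append(ip)
    return seen


def resolve_time_servers(name, port=123, limit=3):
    """Look up a round-robin name and return at most `limit` addresses."""
    info = socket.getaddrinfo(name, port, type=socket.SOCK_DGRAM)
    return unique_addresses(info)[:limit]


# Hypothetical usage (needs working DNS):
#   resolve_time_servers("time.northwestern.edu")
```

A client that only ever takes the first address still gets spread across
the servers, as long as the DNS server rotates the record order.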

Internally peered stratum-2's will use MD5 auth.  MD5 will not be
available for general clients, since it cannot be securely or easily
managed for large numbers of clients.  A shared MD5 hash passphrase
can be used for centrally controlled devices that support it, such
as other switches, routers and servers.  The use of MD5 may not
buy us a whole lot, but anything that can raise the bar without
too much trouble is probably a good idea in my opinion.  Note however,
that I know of only two stratum-1's that will support MD5 to stratum-2
downstreams and I've asked quite a fair number of operators.
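
For anyone wondering what MD5 auth adds on the wire, here is a rough
Python sketch of the symmetric-key scheme as I understand it: the sender
appends a 4-byte key ID plus an MD5 digest computed over the shared key
followed by the NTP packet, and the receiver recomputes the digest using
the key it trusts under that ID.  The key ID, the key string and the
function names are made up for illustration.  This is not cisco's code.

```python
import hashlib
import hmac
import struct


def add_mac(packet, key_id, key):
    """Append a 20-byte MAC: 4-byte key ID followed by MD5(key + packet)."""
    digest = hashlib.md5(key + packet).digest()
    return packet + struct.pack("!I", key_id) + digest


def check_mac(message, trusted_keys):
    """Verify the trailing MAC against a {key_id: key} table of trusted keys."""
    packet, mac = message[:-20], message[-20:]
    (key_id,) = struct.unpack("!I", mac[:4])
    key = trusted_keys.get(key_id)
    if key is None:
        return False  # key ID is not in the trusted set
    expected = hashlib.md5(key + packet).digest()
    return hmac.compare_digest(expected, mac[4:])
```

Note that this only authenticates the packet.  Nothing is encrypted, and
anyone who holds the shared key can forge packets, which is why handing
one key to a large number of clients does not buy much.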

Hardening the boxes should be relatively easy if you're familiar with
cisco gear.  You can apply interface packet filters on the switch
interface and upstream router interface.  You can also do some basic
NTP server specific filters (for the types of NTP service you're willing
to allow and to/from whom).  Here is an example:

  ntp access-group query-only 1
  ntp access-group serve 1
  ntp access-group peer 2
  ntp access-group serve-only 3
  ntp authenticate
  ntp authentication-key <key-id> md5 <hash>
  ntp trusted-key <key-id>
  ntp peer <horizontal-stratum2> key <key-id>
  ntp server <upstream-stratum1>

Create an 'ntp authentication-key' for each horizontal/upstream.  You
can probably use the same key for all horizontals if they are all yours.
Also create an 'ntp trusted-key' line for each key ID you've configured.
The 'peer' option is for peers and upstreams.  The 'serve-only' option
is for clients getting time from you.  The others you don't need, and
they can be blocked entirely.  ACLs would look like this:

  access-list 1 deny any

  access-list 2 permit <horizontal-stratum2>
  access-list 2 permit <upstream-stratum1>
  access-list 2 deny any

  access-list 3 permit any

You can restrict ACL 3 to whatever clients you want to allow to get
time from you.  Often you have to open this up for clients that are
going to be mobile and sync from off campus.

Here are other options that you may be interested in:

  ntp source <interface>
  ntp update-calendar
  ntp max-associations <limit>
  ntp master <stratum>
  ntp multicast
  ntp multicast client

If you only have a single interface on the cisco box you don't need
'ntp source'.  'ntp update-calendar' is only available on certain
hardware platforms.  I don't use the others.  If you care why, feel
free to ask.

Some monitoring commands:

  show ntp status              # shows synchronization status
  show ntp associations        # shows ntp peer/server associations
  show ntp associations detail # like above, but with more detail
  debug ntp ?                  # logs detailed ntp messages/packets

Some references:

  NTP Overview by Stanislav Shalunov, Internet2
  <http://www.internet2.edu/%7Eshalunov/talks/20050322-Atlanta-PerformanceWorkshop-NTP.pdf>

  Hardening Cisco Routers: Chapter 10: NTP
  <http://www.oreilly.com/catalog/hardcisco/chapter/ch10.html>

  IOS configuration and reference documentation
  <http://www.cisco.com>

  NTP home
  <http://www.ntp.org>

I found a note from David Mills from a few years back that said it
was Dave Katz who put the NTP code in cisco devices and that he
basically used ntpd's v3 code.  David Mills reported that it worked
well.  That seems good enough for me.  The processor in a 3524 is
a PowerPC403.  I don't have a good sense of how this box will do
under load yet, but one well informed colleague believes a 3524
should handle a few hundred clients without too much trouble.

I haven't done a full blown stress test on the 3524's, but I have sent
a flood ping to a box, while running 'debug ntp events' and
having another box sync to it and it syncing to an upstream.  The flood
got up to about 600 Kb/s and the 3524 never missed a ping reply nor time
sync event.  CPU was at about 45% as I recall, but still responsive.
I would like to flood it with time sync requests and see how well it
will handle load, but I haven't gotten to that yet.  I'd be interested
in any thoughts or measurements others may have in this area.
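
As a starting point for that kind of test, here is a minimal Python
sketch that sends plain client-mode NTP requests to one server and counts
the replies.  It is a serial probe, not a flood: it waits for each reply
before sending the next request, so it measures responsiveness rather
than peak capacity.  The function names and defaults are my own, and it
assumes an IPv4 server that answers unauthenticated version 3 requests.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 and 1970-01-01


def build_request():
    """48-byte client request: LI=0, version 3, mode 3 (client)."""
    return bytes([0x1B]) + bytes(47)


def transmit_time(reply):
    """Server transmit timestamp (bytes 40-47) as Unix seconds."""
    seconds, fraction = struct.unpack("!II", reply[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def probe(server, count=100, timeout=1.0):
    """Send `count` requests one at a time; return (replies, elapsed seconds)."""
    replies = 0
    start = time.monotonic()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        for _ in range(count):
            sock.sendto(build_request(), (server, 123))
            try:
                data, _addr = sock.recvfrom(512)
            except socket.timeout:
                continue  # count it as a missed reply and move on
            if len(data) >= 48:
                replies += 1
    return replies, time.monotonic() - start
```

Running something like probe("ntp1.example.edu", count=1000) (hypothetical
host name) from several machines at once, while watching CPU on the
switch, would get closer to a real load test.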

Based on what I've seen thus far, they should handle our current load,
while giving us a more robust time sync infrastructure than what is
currently deployed.

Hopefully this is interesting to someone.  I'd especially be interested
in any feedback that points out potential problems or suggestions for
improvement.  ...and if any of you are in the midwest region and are
running stratum-1 servers that accept downstream stratum-2 server clients,
and I haven't bugged you yet, please let me know who you are so I can bug
you.  :-)

John