[ednog] anycast DNS

Tue Apr 5 23:30:38 EDT 2005

John Kristoff wrote:
> On Tue, 5 Apr 2005 13:40:29 -0500 (CDT)
> Jay Ford <jay-ford at uiowa.edu> wrote:
> 
> 
>>Does anybody have any kernels of wisdom which I could factor in as I continue
>>down this road?  Specifically, does anybody have a slick script to predicate
>>the announcement of routes for the service addresses on the availability of
>>DNS on the service addresses?
> 
> 
> Jay,
> 
> Do you mean like the example presented in Appendix D here?
> 
>   <http://www.isc.org/pubs/tn/index.pl?tn=isc-tn-2004-1.txt>

[Note: I started to compose this earlier in the day, and I realize that
some of this has been addressed in more recent additions to this thread,
but I am too lazy to edit myself now.  I think most of it is still
relevant.]

That script is more or less useful depending on how you define
"availability of DNS on the service addresses."  In the sense that the
named process is running, the script works fine.  Where I have been
bitten before is if the nameserver wedges in some way or if it gets
REALLY slow.  For example, it can get really slow if you run out of disk
space on your logging drive, and similar situations. That's a good
reason to monitor disk space, but it also means that it's useful to have
a script that actually monitors the responsiveness of the server to
queries.  I have been experimenting with hacked-up versions of nanny.pl
(in the 'contrib' directory of the BIND 9.3.x--and probably
others--distribution).  The script is very straightforward, so you can
easily see how it can be modified to kill the routing daemon or if-down
the service interface and take other actions if the nameserver becomes
unresponsive.
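
As a rough illustration of checking responsiveness rather than just
process existence, here is a minimal Python sketch.  It is not nanny.pl
and not the ISC script; the function names, the probe name "localhost",
the 2-second timeout, and the three-failures threshold are all
assumptions chosen for the example.  It sends one UDP query to a service
address and treats a timeout or a mismatched reply as a failure.

```python
import socket
import struct


def build_query(name, qid=0x1234):
    """Build a minimal DNS query (QTYPE=A, QCLASS=IN, recursion desired)."""
    # Header: ID, flags (RD=1), QDCOUNT=1, ANCOUNT=0, NSCOUNT=0, ARCOUNT=0
    header = struct.pack("!HHHHHH", qid, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.split(".")
        if label
    )
    question = labels + b"\x00" + struct.pack("!HH", 1, 1)
    return header + question


def is_responsive(server, name="localhost", port=53, timeout=2.0, qid=0x1234):
    """Return True if `server` answers the query within `timeout` seconds."""
    query = build_query(name, qid)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:
            # Covers timeouts as well as ICMP-induced errors.
            return False
    if len(reply) < 12:
        return False
    rid, flags = struct.unpack("!HH", reply[:4])
    # Accept any reply that echoes our ID and has the QR (response) bit set.
    return rid == qid and bool(flags & 0x8000)


def should_withdraw(results, threshold=3):
    """True if the last `threshold` probes (newest last) all failed."""
    recent = results[-threshold:]
    return len(recent) == threshold and not any(recent)
```

A wrapper would call is_responsive() against each service address every
few seconds, keep the results in a list, and when should_withdraw()
returns True take whatever action suits your setup (kill the routing
daemon, if-down the service interface, page somebody).  Requiring
several consecutive failures is a design choice to avoid flapping the
route on a single dropped packet.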

Other words of wisdumb (I have been running anycast for 5+ years now):

1. If you have two or more "well-known" service addresses for your
caching nameservers that you distribute to static clients and/or put in
your DHCP config, don't announce both addresses on each anycast server.
Instead, divide your anycast cluster up and announce one set of
addresses on one set of machines and another set on the other machines.
Then, if there is some rare failure where your scripts don't withdraw
the routes but clients aren't getting service, the client failover will
still work as they walk through the well-known addresses in their local
resolvers.  (This point was also made in a NANOG presentation by Bill
Woodcock, but I have lost the URL for that.)

2. I am not sure how quickly the if-down mechanism works at withdrawing
the routes relative to when the nameserver stops responding on that
service address.  If the nameserver breaks, then the if-down mechanism
is the fastest way to withdraw the route.  But if you are just trying to
take down servers for maintenance and/or bring new servers into
production, the if-down may leave gaps in your service where users can't
get to your DNS server(s) for several seconds.  In such a situation, I
simply kill ospfd and zebra.  This takes longer to converge, since the
OSPF neighbor has to wait for the dead timer to expire, but it basically
ensures that named is still responding on the service address until the
traffic goes away.
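
The split described in point 1 can be sketched in a few lines of Python.
This is only an illustration under assumed inputs: the server names and
the 192.0.2.x addresses are placeholders, and a simple round-robin is
one of several reasonable ways to divide the cluster.

```python
def assign_service_addresses(servers, addresses):
    """Give each server exactly one service address, round-robin.

    No server announces more than one address, so a wedged server that
    fails to withdraw its route affects only one of the well-known
    addresses and client failover to the next address still helps.
    """
    if not addresses:
        raise ValueError("need at least one service address")
    if len(servers) < len(addresses):
        raise ValueError("need at least one server per service address")
    return {
        server: addresses[i % len(addresses)]
        for i, server in enumerate(servers)
    }


# Placeholder names and documentation-range addresses, not real hosts.
plan = assign_service_addresses(
    ["ns-a", "ns-b", "ns-c", "ns-d"],
    ["192.0.2.1", "192.0.2.2"],
)
```

Here ns-a and ns-c would announce 192.0.2.1, while ns-b and ns-d would
announce 192.0.2.2.  In practice you would probably also want each group
spread across sites rather than assigned purely by list order.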

michael