[Outages-discussion] Idea to help with the S/N on outages...

Jeremy Chadwick outages at jdc.parodius.com
Wed Aug 17 17:50:40 EDT 2011


I'll chime in here, as someone who works in an enterprise-level NOC.

1) isitdownorjustme.com, and other "is it down" sites are only useful as
a binary "sanity" indicator; literally they're intended for generic joe
schmoes who don't know anything about the Internet, transport, IP
networking, or web servers and "just want to know if something works".
All it results in is people saying "the site [blahblah.com] also says
its down".  It tells the person nothing about *why* it's down.  Anyone
who thinks the Internet ISN'T broken 24x7x365 has their head in the
sand.

2) Smokeping is practically worthless.  It's for people who enjoy
staring at gradients all day, equating pretty pictures with useful data.  I
often refer to the tool as "smokeandmirrorsping".  I get much more
conclusive, useful real-world data from mtr.
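
For comparison, the kind of mtr run I mean is just something along these
lines (the hostname is purely a placeholder):

  # report mode, 10 probe cycles, skip reverse DNS on the hops
  mtr -r -c 10 -n www.example.com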

3) I would prefer people not bring Nagios or Zabbix into the discussion
either.  Those may be "okay frameworks" to use, but many of the underlying
utilities/checks don't provide the level of granularity needed.

What's needed is basically:

a) Combination of traceroute (both UDP and ICMP-based methods) and mtr
(at present only ICMP) output which can be obtained in real-time.
(NOTE: this is only partially useful because asymmetric routing is
common on the Internet these days; seeing only half the path means
you're only getting half the picture.  I'd say 80% of the time if I ask
a customer/client/provider for a traceroute they can't provide one due
to network ACLs or VPNs getting in the way -- welcome to reality)
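
As a rough sketch of the traceroute half (hostname is just an example;
ICMP mode typically needs root):

  # UDP-based probes (the traditional default)
  traceroute www.example.com
  # ICMP echo probes instead
  traceroute -I www.example.com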

b) URL monitors for HTTP services that include response time
measurements on all levels (DNS lookup time, TCP connect time, delay
time between GET/POST and response, payload delivery time, full HTTP
results (including headers), and total amount of time it took to
complete the entire HTTP transaction).  curl can provide all of this:
look at the --write-out portion of the documentation, specifically
things like %{time_total}, %{time_namelookup}, %{time_connect},
%{time_pretransfer}, %{time_starttransfer}, %{speed_download}, etc...
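
A rough one-liner sketch of what I mean (the URL is purely a
placeholder; add -D - if you also want the response headers dumped):

  curl -sS -o /dev/null \
    -w 'dns=%{time_namelookup} tcp=%{time_connect} pre=%{time_pretransfer} ttfb=%{time_starttransfer} total=%{time_total} bytes/sec=%{speed_download}\n' \
    http://www.example.com/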

c) Recursive DNS analysis done in real-time.  I've yet to see any sort
of resource that does this reliably/accurately; I can find "online
DNS tracers" (sigh...) that are basically just dig +trace, which is
helpful but simultaneously useless in intermittent outages (e.g. a
single root server having issues).
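
In other words, something beyond a canned trace like the following
(hostname and server choice are just examples):

  # walk the delegation chain down from the roots
  dig +trace www.example.com A
  # or poke an individual root/TLD server directly when chasing an
  # intermittent failure
  dig @a.root-servers.net +norecurse com. NS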

d) BGP and/or OSPF (if applicable) analysis done in real-time.  This is
extremely tricky and I've yet to see anyone implement it.  There's
BGPlay and equivalents, which are great but sometimes not real-time
enough (and you have to know how to use the tool effectively to find
what you want; most end-users won't know what they're looking at, or
even what CIDR is).  "show route" from Junipers and "show ip route" from
Ciscos would be most useful, and again, would need to be done in
real-time.  And yes I am aware of route-views.routeviews.org.
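
For those who haven't used it, route-views gives you roughly this kind
of view (login details vary per collector; the prefix below is just a
placeholder):

  telnet route-views.routeviews.org
  # then, on the route server:
  show ip bgp 192.0.2.0/24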

e) All of the above needs to be done from multiple geographic locations,
with multiple peering providers (and the ability for the analyst to
choose which provider the tests should utilise).  The last part of this
paragraph is key, particularly when used in combination with (a).

I'm sure I'm missing a couple key items (I'm rushed to get out the door
for work), but those are the main ones.

I'm of the opinion outages.org *should not* be held "responsible" for
implementing something like this.  It's a big undertaking, and
one that often takes large enterprise-level companies 10+ years to
implement, tune, and adjust over time.  It IS NOT something you can just
"click and deploy".

And finally, however/wherever this gets implemented needs to be done for
free (and not funded by advertising).  Turning it into a commercial
service immediately kills your demographic, even if that commercialism
happens gradually/over time.  Do not tell me "this cannot be done for
free" -- I'm a hosting provider who provides free services and has done
so for the past 15 years, at around US$700/month out of my own pocket.
TL;DR version is "if someone implements this, please don't be a dick and
commercialise it".

I won't be replying to this thread past this point.  The above is really
all that needs to be said; if you build it they will come.

-- 
| Jeremy Chadwick                                jdc at parodius.com |
| Parodius Networking                       http://www.parodius.com/ |
| UNIX Systems Administrator                   Mountain View, CA, US |
| Making life hard for others since 1977.               PGP 4BD6C0CB |

On Wed, Aug 17, 2011 at 03:19:54PM -0500, Haworth, Michael A. wrote:
> Just adding my humble suggestion - hoping not to get flamed -
> smokeping (low resource usage) on a box sitting in 'Somewhere, US'
> managed by (insert poor unlucky soul here) with the results page
> posted to the outages.org website...
> 
> I could be wrong - it could be a pain to do, but it would provide
> basic latency/availability data for any node requested - licensing is
> priced right and resource usage is fairly minimal for any connection
> that would host the service.
> 
> I'm not sure if it could be 'clustered' to show results from different
> geographical locations, but if (possibly) three or four were running
> identical configs with their results links posted on the outages.org
> website, it could serve the same purpose for all involved?

