[c-nsp] service monitoring on a small scale?

Thu Sep 27 11:21:16 EDT 2007

On Wed, Sep 26, 2007 at 12:58:42PM -0500, neal rauhauser wrote:
>     Yesterday we cooked a POS card in a 7507 and the customer has just had
> it with stuff breaking at 0200 and learning about it at 0900 via fifty angry
> customer messages.
> 
>      The failure modes we see are not simple link up/down things that could
> be caught with syslog or Nagios. We've had a steady flow of things that
> cause loss or latency over the last year without having any sort of outright
> failure.
> 
>      We want to be able to ensure quality of customer experience and we know
> this has to go down to the level of TCP segment loss across sessions.

You need either a way to measure the customer's experience directly, or
automated tests that measure variables that approximate the customer
experience.

For example, the Smokeping monitoring system measures and graphs latency
and packet loss.  You could have it send an alert into your alarm system
when packet loss or latency gets too high on a customer link.
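
To make the threshold idea concrete, here is a rough sketch in Python of
the decision such an alert boils down to.  This is not Smokeping's own
alert configuration, just an illustration: the function names and the
default thresholds (5% loss, 150 ms median) are assumptions picked for
the example, and the probe results are assumed to arrive as a list of
round-trip times with None standing in for a lost probe.

```python
import statistics


def summarize_probes(rtts_ms):
    """Return (loss_pct, median_ms) for one round of probes.

    rtts_ms is a list of round-trip times in milliseconds, with None
    for every probe that got no reply (assumed input format).
    """
    if not rtts_ms:
        raise ValueError("no probes to summarize")
    answered = [r for r in rtts_ms if r is not None]
    loss_pct = 100.0 * (len(rtts_ms) - len(answered)) / len(rtts_ms)
    # With 100% loss there is no latency to report.
    median_ms = statistics.median(answered) if answered else None
    return loss_pct, median_ms


def should_alert(rtts_ms, max_loss_pct=5.0, max_median_ms=150.0):
    """True if loss or median latency exceeds the (hypothetical) limits."""
    loss_pct, median_ms = summarize_probes(rtts_ms)
    if median_ms is None:
        return True  # every probe was lost
    return loss_pct > max_loss_pct or median_ms > max_median_ms
```

A link that answers every probe quickly stays quiet, while one that
drops 2 probes out of 20 or whose median latency climbs past the limit
would raise an alarm.  Sensible limits depend on the link, so they are
best set per customer.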

If you have your own equipment on the other side of the communications
link, or if the customer is willing to run your software agent to help
with the monitoring, then you can measure a good deal more.

For example, a script running on the customer side that tries to download
a Web page or two, or access a mail server, and then reports the times
that it took to do that to your monitoring system, would give you a
realistic view of what the customer is experiencing.  I think this can
be done with a Cisco router using IP SLA (formerly SAA), but I haven't
tried it yet.
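
A customer-side script of that kind could look roughly like the Python
sketch below.  It only times the checks; how the results get back to the
monitoring system is left out, since that depends on what you run.  The
helper names (timed, fetch_page, connect_tcp, run_checks) are made up
for this example, and the mail check is simply a TCP connect to the
server's port rather than a full mail transaction.

```python
import socket
import time
import urllib.request


def timed(check):
    """Run check() and return (ok, seconds_elapsed)."""
    start = time.monotonic()
    try:
        check()
        ok = True
    except Exception:
        # Any failure (timeout, refused connection, HTTP error) counts
        # as a failed check; the elapsed time is still recorded.
        ok = False
    return ok, time.monotonic() - start


def fetch_page(url, timeout=10):
    """Download a page completely, as a browser would have to."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        resp.read()


def connect_tcp(host, port, timeout=10):
    """Open and close a TCP connection, e.g. to a mail server."""
    with socket.create_connection((host, port), timeout=timeout):
        pass


def run_checks(checks):
    """checks maps a name to a zero-argument callable.

    Returns a dict of name -> (ok, seconds_elapsed).
    """
    return {name: timed(check) for name, check in checks.items()}
```

To use it you would pass something like
{"web": lambda: fetch_page(url), "smtp": lambda: connect_tcp(host, 25)}
to run_checks() from cron every few minutes, with url and host set to
whatever the customer actually uses, and ship the resulting times to
the monitoring system.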