[c-nsp] OT Solarwinds Alternatives

Thu Aug 3 03:42:36 EDT 2017

On 27 July 2017 at 19:56, Nick Griffin <nick.jon.griffin at gmail.com> wrote:
> Sorry for the off-topic post. I'm looking for input on network management
> solutions other than solarwinds, unbiased opinions. We will need all things
> network related, monitoring, alerts, reporting, configuration management,
> and other tools that might be handy for a NOC. If this takes multiple tools
> then that is fine. Just looking for some ideas from the guys in the
> trenches. Thanks!

tl;dr > It takes a lot more time to roll your own with something like
Cacti (or InfluxDB and Grafana, this is actually where I want to be
going and would recommend for a greenfield deployment now) but I think
in the long run it’s worth it.

SolarShite is just that really in my opinion. We have tens of
thousands of devices being monitored in SolarWinds and I hate it:
- We have multiple deployments of SolarWinds to break up the "load";
with all of that in one deployment it would probably grind to a halt
unless you want to buy very, very expensive boxes (each of our SQL
servers has 256GB RAM, 64 cores and SSDs, it’s still not very fast, and
we’ve had SW support guys “tune” the install).
- In a similar vein to the first point, I don't believe it is well
written at all, based on how poorly it scales (and we've hit plenty of
bugs).
- It does support multiple pollers, which we use, but we also have
Cacti with a single poller and it is still much faster at polling the
same number of devices or data points in a like-for-like comparison
with SolarShite.
- It is supposed to support standard MIBs out of the box but Juniper
wasn't supported until very recently, we couldn't even poll interface
stats, and they were telling us it was suitable for SPs.
- It does have a lot of features (EoL checking, threshold alerts,
config backups, config reports for compliance, NetFlow collecting, IP
SLA reports etc) however they seem more enterprise driven than SP
driven and all of those features are available in a mixture of Cacti,
Observium, Oxidized/RANCID, nfdump etc for free (the cost of hiring a
good sys admin to put the leg work in will be way cheaper than the SW
licenses!).

We also have Cacti as I mentioned because we can graph "anything" in
Cacti that is a number basically and it's lightweight and fast. So we
use it for additional bespoke graphing needs and also for accuracy; 5
minutes is a standard industry polling time, some of our SW stats are
polled less frequently than that and some more frequently, but either
way the graphs always look shit. With Cacti we can poll at 5 minutes
(or pretty much any time we like) and we have altered our RRD
templates to store 2 years’ worth of data samples before they start to
consolidate data points (but this could be any value we like).
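
To put a number on the "2 years before consolidation" point, here is a
rough Python sketch of the arithmetic. The function names are made up
for illustration, and the AVERAGE/0.5 values are just common rrdtool
defaults, not necessarily what our templates use. In Cacti you would
enter the equivalent steps and rows values in its RRA settings rather
than building the string by hand.

```python
# Sketch: how many rows an RRA needs in order to keep every raw sample
# for a given retention period. The inputs mirror the post: 5-minute
# polling and 2 years of unconsolidated data. Helper names are
# hypothetical and exist only for this example.

SECONDS_PER_DAY = 24 * 60 * 60


def rra_rows(step_seconds: int, retention_days: int) -> int:
    """Number of raw samples (rows) stored over the retention period."""
    return (retention_days * SECONDS_PER_DAY) // step_seconds


def rra_definition(step_seconds: int, retention_days: int,
                   cf: str = "AVERAGE", xff: float = 0.5) -> str:
    """Build an rrdtool RRA spec with 1 step per row (no consolidation)."""
    rows = rra_rows(step_seconds, retention_days)
    return f"RRA:{cf}:{xff}:1:{rows}"


if __name__ == "__main__":
    # 5-minute step, 2 x 365 days of raw samples
    print(rra_definition(300, 2 * 365))
```

The trade-off is disk space: every data source in the RRD file carries
that many rows, so the files get noticeably bigger than with the stock
templates.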

Someone also mentioned the lack of multiple pollers for scaling in
Cacti and Observium; Cacti has recently been under heavy development
and I believe it has this feature now natively (it’s on my to-do list
to check it out) but for both Cacti and Observium people have been
implementing this themselves quite easily. On the Observium mailing
list a few years ago some guy had a blade chassis just for Observium to
monitor a large deployment. He/they wrote a simple wrapper script that
calls the main poller script that would only fetch ¼ of the nodes in
the database instead of all of them, and had 4 servers each fetching a
different ¼ result thus each polling a separate ¼ of the node set.
People also do things like storing the RRD files on an NFS share
sitting on SSDs so that each poller can write to the same path etc.
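
I don't have the original wrapper script, but the idea is roughly the
following Python sketch. Everything in it is an assumption for
illustration: the integer device IDs, the modulo split and the
poll_device callback stand in for whatever the real database query and
poller script look like.

```python
# Sketch of the "each server polls its own quarter of the nodes" idea.
# Device IDs, the modulo split and the poll_device callback are all
# hypothetical; a real wrapper would query the NMS database and call
# the actual poller script for each device.
from typing import Callable, Iterable, List


def devices_for_poller(device_ids: Iterable[int], poller_index: int,
                       poller_count: int) -> List[int]:
    """Return the devices this poller instance is responsible for."""
    if not 0 <= poller_index < poller_count:
        raise ValueError("poller_index must be in range(poller_count)")
    # Splitting on ID modulo poller_count gives each poller a disjoint
    # share, and every device lands on exactly one poller.
    return [d for d in device_ids if d % poller_count == poller_index]


def run_poller(device_ids: Iterable[int], poller_index: int,
               poller_count: int,
               poll_device: Callable[[int], None]) -> int:
    """Poll only this instance's share; return how many devices were polled."""
    share = devices_for_poller(device_ids, poller_index, poller_count)
    for device_id in share:
        poll_device(device_id)
    return len(share)


if __name__ == "__main__":
    all_ids = range(1, 13)  # pretend the database holds 12 devices
    for index in range(4):  # 4 servers, as in the Observium example
        print(index, devices_for_poller(all_ids, index, 4))
```

Each server would run the same wrapper from cron with a different
poller_index. A split like this only balances well if the devices take
a broadly similar time to poll.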

Cheers,
James.