I think this is one I can speak to, at least for running Observium on NFS.
We have a commercial product for Rancid, Syslog-ng is handled by another
team, and we don't use Smokeping.
I use Observium to monitor about 700 remote locations and a few data
centers: ~7k devices resulting in 300k ports. This is mostly Cisco
gear; another team monitors server-level metrics. Right off the bat
I would recommend keeping the environment as small as possible if you
can. As long as you can poll all of your devices in 5 minutes, I would
not break it out. If you are not finishing in 5 minutes, then consider
a path like the one below. It greatly increases the complexity of the
environment and will put you in an unsupported configuration. Given
our network is spread all over the country, our biggest problem is
latency, not raw hardware.
We queue all checks through RabbitMQ. Cron jobs determine which
devices need to be polled or discovered and write them to a queue;
a rough sketch of that producer step follows.
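Purely for illustration (the queue name, SQL, and use of rabbitmqadmin
are assumptions about how you might wire it up, not a dump of our
scripts), the cron'd producer could look something like this:

  #!/bin/bash
  # Hypothetical sketch: pull device IDs from the Observium DB and publish
  # one message per device to a RabbitMQ work queue that the pollers consume.
  # Queue name, credentials, and column names are placeholders.
  QUEUE="observium.poller"
  mysql -N -u observium -p"$DB_PASS" observium \
    -e "SELECT device_id FROM devices WHERE disabled = 0;" |
  while read -r device_id; do
    rabbitmqadmin publish exchange=amq.default \
      routing_key="$QUEUE" payload="$device_id"
  done

Each poller then runs a small consumer that pops a device ID off the
queue and kicks off the poller for that device.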
We're running 100% on VMware on a Cisco UCS stack. The OS is CentOS 6.5.
1x DB server (4 cores, 16 GB of RAM)
1x NFS server (4 cores, 32 GB of RAM, Fusion-io disk, XFS filesystem
tuned for heavy write cache)
1x Web server (4 cores, 16 GB of RAM, RabbitMQ, cron tasks to populate RabbitMQ)
16x Pollers (2 cores, 8 GB of RAM)
1x Nagios server (specific to this infrastructure)
We also have a Nagios environment with about 100k checks that are
generated from the Observium database based on IOS feature types,
etc.
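A heavily stripped-down version of that generation step might look like
the sketch below; the table/column names and the Nagios host template
are placeholders, and the real logic also keys off OS and feature
columns to decide which service checks to attach:

  #!/bin/bash
  # Hypothetical sketch: emit Nagios host definitions for every enabled
  # device Observium knows about. Column names are assumptions.
  mysql -N -u observium -p"$DB_PASS" observium \
    -e "SELECT hostname FROM devices WHERE disabled = 0;" |
  while read -r host; do
    cat <<EOF
  define host {
      use        generic-host
      host_name  $host
      address    $host
  }
  EOF
  done > /etc/nagios/conf.d/observium_hosts.cfg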
Observations:
1.) Pollers spend an insane amount of IO loading the MIBs. We were
seeing 4-10k NFS operations a second per poller when loading them from
NFS. We have a job that syncs the MIBs from NFS to a cache directory on
each poller's local disk, and we override mibdir in the config to point
at that directory (a sketch of the sync job is below). This dropped our
operations per second by 90%. We noticed that even though these files
were in cache, each access still resulted in a stat call to NFS.
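The sync job is basically just an rsync on a cron; paths here are
illustrative, and the config key for the MIB directory is
$config['mib_dir'] if memory serves, so verify against your install:

  #!/bin/bash
  # Hypothetical sketch of the MIB cache job run on each poller.
  # Copies MIBs from the NFS export to local disk; paths are placeholders.
  rsync -a --delete /mnt/observium/mibs/ /var/cache/observium-mibs/
  # Then in config.php on each poller, point Observium at the local copy, e.g.:
  #   $config['mib_dir'] = '/var/cache/observium-mibs';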
2.) You have to monitor the hell out of this. We watch for stale
processes, number of processes, last write times, and network
availability.
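As one example, a crude freshness check on RRD writes is the kind of
thing we wrap in Nagios; the path and threshold below are illustrative,
not our exact values:

  #!/bin/bash
  # Hypothetical sketch: alert if nothing under the RRD directory has been
  # written in the last 10 minutes, which usually means pollers are stuck
  # or the queue has stalled. Path and threshold are placeholders.
  RRD_DIR="/mnt/observium/rrd"
  if [ -z "$(find "$RRD_DIR" -name '*.rrd' -mmin -10 -print -quit)" ]; then
    echo "CRITICAL: no RRD writes in the last 10 minutes"
    exit 2
  fi
  echo "OK: recent RRD writes found"
  exit 0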
3.) The NFS box is going to be 98% of your tuning effort. Our database
stayed around 8 GB as long as we truncated the discovery and poller log
tables (per Adam they are not used for anything); they grew by a couple
million rows a day, so we truncate them hourly. We give MySQL 12 GB of
RAM and leave 4 GB to the OS. This seems to work fine.
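The hourly truncation is just a cron'd SQL statement along these lines;
the table names are from memory, so confirm the actual discovery/poller
log table names in your schema before blindly truncating anything:

  # /etc/cron.d/observium-truncate -- hypothetical; verify table names first.
  0 * * * * root mysql observium -e "TRUNCATE TABLE poller_log; TRUNCATE TABLE discovery_log;"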
4.) Mount options for the pollers are your friend. We used this:
mount -o 'rw,async,noatime,nodiratime,noacl,rsize=32768,wsize=32768'
nfsbox:/mount/path /mnt/something
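If you'd rather not script the mount, the equivalent /etc/fstab entry
(same options as above; the server name and paths are just the example
placeholders from the mount command) would be:

  # /etc/fstab -- same options as the mount command above
  nfsbox:/mount/path  /mnt/something  nfs  rw,async,noatime,nodiratime,noacl,rsize=32768,wsize=32768  0 0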
5.) Consistent configuration is huge. Use Puppet or Chef.
6.) When the database has 300k ports, page loads are bad. Sometimes we
saw 15-30 second page load times, and a PHP profiler dump of the splash
page came out to 100 MB. :) All of the port-level access checks cause
hundreds of thousands of calls to the database. We patched the code to
drop port-level security; we don't have a use for it.
On Tue, Feb 4, 2014 at 9:27 AM, Chip Pleasants <wpleasants@gmail.com> wrote:
> Anyone have suggestions or real world experiences they could share scaling
> Observium with Rancid, Smokeping, and Syslog-ng. I have about 1000
> router/switches in my network that equate to about 30k of interfaces. Half
> the interfaces will probably be down, although I'm not sure that makes a
> difference.
>
> Basically I'm wondering if I should break out my tools to individual VMs and
> use something like NFS for Observium to be able to see Rancid,Smokeping, and
> Syslog-ng data. Feel free to shoot me a private note. I appreciate any
> feedback.
>
> -Chip
_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium