This brings up an interesting question. Are there any plans to implement a master/slave configuration like Opsview or Nagios uses?
Thanks, Scott Brawner
-----Original Message-----
From: observium [mailto:observium-bounces@observium.org] On Behalf Of Jeremy Custenborder
Sent: Thursday, February 06, 2014 12:35 PM
To: Observium Network Observation System
Subject: Re: [Observium] Scaling Observium with Rancid, Smokeping, and Syslog-ng
I think this is one I can speak to, at least for running Observium on NFS. We have a commercial product for Rancid, Syslog-ng belongs to another team, and we don't use Smokeping.
I use Observium to monitor about 700 remote locations and a few data centers, ~7k devices resulting in 300k ports. These are mostly Cisco devices; another team monitors server-level metrics. Right off the bat I would recommend keeping the environment as small as possible if you can. As long as you can poll all of your devices in 5 minutes, I would not break it out. If you are not finishing in 5 minutes, then consider a path like this, but know that it greatly increases the complexity of the environment and will put you in an unsupported configuration. Given our network is spread all over the country, our biggest problem is latency, not raw hardware.
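If you want a quick way to see whether you're inside the 5-minute budget, you can ask the database directly. Something like this works for me; the column name is from my schema, so check yours:

    # Which devices blew the 5-minute polling budget on their last run?
    # Assumes the devices table has a last_polled_timetaken column;
    # verify against your schema version.
    mysql -N observium -e "SELECT hostname, last_polled_timetaken
      FROM devices
      WHERE last_polled_timetaken > 300
      ORDER BY last_polled_timetaken DESC;"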
We queue all checks through RabbitMQ. Cron jobs determine which devices need to be polled or discovered and write them to a queue.
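The cron side of that is nothing fancy. Roughly like this, using rabbitmqadmin from the management plugin (the queue name is illustrative, and adjust the device query to your schema):

    #!/bin/bash
    # Cron, every 5 minutes: enqueue every enabled device for polling.
    # Each poller runs a worker that pops a hostname off this queue
    # and runs ./poller.php -h <hostname>.
    mysql -N observium -e "SELECT hostname FROM devices WHERE disabled = 0" |
    while read -r host; do
        rabbitmqadmin publish exchange=amq.default \
            routing_key=observium.poll payload="$host"
    done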
We're running 100% on VMware on a Cisco UCS stack. The OS is CentOS 6.5.

1x DB server (4 cores, 16 GB RAM)
1x NFS server (4 cores, 32 GB RAM, Fusion-io disk, XFS filesystem tuned for heavy write cache)
1x web server (4 cores, 16 GB RAM; also runs RabbitMQ and the cron tasks that populate it)
16x pollers (2 cores, 8 GB RAM each)
1x Nagios server (specific to this infrastructure)
We also have a Nagios environment with about 100k checks against this infrastructure that are generated from the Observium database based on IOS feature types, etc.
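The generation is just queries against the Observium tables plus templating. A stripped-down sketch of the idea (the real script keys off IOS feature sets; the paths and template name here are illustrative):

    #!/bin/bash
    # Emit a Nagios host definition for every enabled IOS device known
    # to Observium. The real version picks check commands per feature set.
    mysql -N observium -e \
        "SELECT hostname FROM devices WHERE os = 'ios' AND disabled = 0" |
    while read -r host; do
        printf 'define host {\n    use generic-switch\n    host_name %s\n    address %s\n}\n\n' \
            "$host" "$host"
    done > /etc/nagios/conf.d/observium_hosts.cfg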
Observations:
1.) Pollers spend an insane amount of I/O reading the MIBs. We were seeing 4-10k NFS operations a second per poller when loading from NFS; even though these files were in cache, every read still resulted in a stat against NFS. We have a job that syncs the MIBs from NFS to a cache directory on each poller's local disk, and we override mibdir in the config to point at that directory (rsync sketch after this list). This dropped our operations per second by 90%.
2.) You have to monitor the hell out of this. We watch for stale processes, the number of processes, last write times, and network availability (example checks after this list).
3.) The NFS box is going to be 98% of your tuning effort. Our database stayed around 8 GB as long as we truncated the discovery and poller log tables (per Adam they are not used for anything); they grew by a couple million rows a day, so we truncate them hourly (cron entry after this list). We give MySQL 12 GB of RAM and leave 4 GB to the OS. This seems to work fine.
4.) Mount options for the pollers are your friend. We used this:

    mount -o 'rw,async,noatime,nodiratime,noacl,rsize=32768,wsize=32768' nfsbox:/mount/path /mnt/something
5.) Consistent configuration is huge. Use Puppet or Chef.
6.) When the database has 300k ports, page loads are bad. Sometimes we saw 15-30 second page load times; a PHP profiler trace of the splash page was 100 MB. :) All of the port-level access checks cause hundreds of thousands of calls to the database. We patched the code to skip port-level security, since we don't have a use for it.
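To put some meat on observation 1, the sync job is just rsync on a cron plus the config override (the paths are whatever you pick):

    # Cron on each poller: mirror the MIBs from NFS to local disk.
    rsync -a --delete /mnt/observium/mibs/ /var/cache/observium/mibs/

    # Then point Observium at the local copy in config.php:
    #   $config['mibdir'] = "/var/cache/observium/mibs";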
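For observation 2, most of our watchdogs boil down to one-liners like these (thresholds and paths are whatever makes sense for your environment):

    # How many pollers are actually running right now?
    pgrep -fc poller.php

    # Alert if nothing has written an RRD in the last 10 minutes.
    find /opt/observium/rrd -name '*.rrd' -mmin -10 | grep -q . \
        || echo "WARNING: no RRD updates in the last 10 minutes"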
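And the hourly truncate from observation 3 is a single cron entry. I won't swear to the table names from memory, so confirm them in your schema before you blow anything away:

    # Hourly: clear the poller/discovery log tables (per Adam, unused).
    # Table names are placeholders -- check your schema first.
    0 * * * * mysql observium -e "TRUNCATE TABLE poller_log; TRUNCATE TABLE discovery_log"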
On Tue, Feb 4, 2014 at 9:27 AM, Chip Pleasants <wpleasants@gmail.com> wrote:
Does anyone have suggestions or real-world experiences they could share about scaling Observium with Rancid, Smokeping, and Syslog-ng? I have about 1000 routers/switches in my network, which equates to about 30k interfaces. Half the interfaces will probably be down, although I'm not sure that makes a difference.
Basically I'm wondering if I should break out my tools to individual VMs and use something like NFS so Observium can see the Rancid, Smokeping, and Syslog-ng data. Feel free to shoot me a private note. I appreciate any feedback.
-Chip
_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium