You are correct: we have a dedicated NFS VM that is Fusion IO backed for storage, plus several smaller VMs that remote-mount the NFS export. We override $config['mibdir'] to /var/cache/observium/mibs, which is a local copy of the MIBs.
The page load times seem to climb with the number of ports. There is a port security check, so anything that displays a lot of ports results in a large number of calls to the database. They are primary key lookups, but each one is still a remote call. You might want to try keeping the web server and DB server on the same box. We eliminated a lot of these port check calls by modifying the code; we don't need port-level ACLs and would prefer the data-hippy approach of letting everybody see everything.
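If you want to put a number on it, here is a rough sketch (not something we run in production; the hostname is a placeholder, and you need a MySQL account that can SET GLOBAL) that flips on MySQL's general query log, loads a page, and counts the queries:

    mysql -e "SET GLOBAL general_log_file='/tmp/obs-queries.log'; SET GLOBAL general_log='ON';"
    curl -s http://observium.example.com/ > /dev/null
    mysql -e "SET GLOBAL general_log='OFF';"
    grep -c Query /tmp/obs-queries.log

Multiply that count by the round-trip time between your web and DB boxes and you get a feel for why co-locating them (or patching out the per-port checks) matters. Don't leave the general log on; it grows fast.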
On Fri, Feb 7, 2014 at 5:16 AM, Chip Pleasants wpleasants@gmail.com wrote:
Jeremy,
Thank you very much for the detailed message. That sounds like an awesome setup. I don't have the time, nor do I want to make my environment crazy complex, believe me. Sounds like pollers, then the database, would be the first break-out items. Just to make sure I understand: the pollers run on a dedicated VM with an NFS mount to run the scripts, and have a local copy of the MIBs? The database server is also a dedicated VM, so it's a remote TCP connection (3306 for MySQL) from the main Observium web server? In a smaller environment, does having the database on a different VM add long page loads or other issues? Thanks again for the feedback, I really appreciate it!
-Chip
On Thu, Feb 6, 2014 at 12:34 PM, Jeremy Custenborder jcustenborder@gmail.com wrote:
I think this is one I can speak to, at least the part about running Observium on NFS. We use a commercial product in place of Rancid, syslog-ng belongs to another team, and we don't use Smokeping.
I use Observium to monitor about 700 remote locations and a few data centers with ~7k devices, resulting in 300k ports. This is mostly Cisco devices; another team monitors server-level metrics. Right off the bat I would recommend keeping the environment as small as possible if you can. As long as you can poll all of your devices in 5 minutes, I would not break it out. If you are not finishing in 5 minutes, then consider a path like this. It greatly increases the complexity of the environment and will put you in an unsupported configuration. Given our network is spread all over the country, our biggest problem is latency, not raw hardware.
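A quick way to sanity-check that 5-minute window is just to time a full polling run. For example (paths assume a stock install under /opt/observium and the standard poller wrapper; the 8 threads is an arbitrary number, adjust to whatever you actually run):

    cd /opt/observium
    time ./poller-wrapper.py 8

If the real time is creeping toward 300 seconds, that's when splitting pollers out starts to make sense.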
We queue all checks through RabbitMQ. Cron jobs determine which devices need to be polled or discovered and write them to a queue.
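The cron side is roughly like this. This is a simplified sketch, not our exact job: it assumes the rabbitmqadmin CLI, a queue named poll, and the stock devices table with a disabled column.

    # every 5 minutes: push each enabled device id onto the poll queue
    mysql -N -B -e "SELECT device_id FROM devices WHERE disabled = 0" observium |
    while read id; do
      rabbitmqadmin publish exchange=amq.default routing_key=poll payload="$id"
    done

Each poller then consumes device ids off that queue and runs the poller against them.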
We're running 100% on VMware on a Cisco UCS stack. The OS is CentOS 6.5.
1x DB server (4 cores, 16 GB of RAM)
1x NFS server (4 cores, 32 GB of RAM, Fusion IO disk, XFS filesystem tuned for heavy write cache)
1x web server (4 cores, 16 GB of RAM, RabbitMQ, cron tasks to populate RabbitMQ)
16x pollers (2 cores, 8 GB of RAM)
1x Nagios server (specific to this infrastructure)
We also have a Nagios environment with about 100k checks that are generated from the Observium database based on IOS feature types, etc.
Observations:
1.) Pollers spend an insane amount of IO checking the MIBs. We were seeing 4-10k NFS operations a second per poller when loading from NFS. We have a job that syncs the MIBs from NFS to a cache directory on the poller's local disk, and we override mibdir in the config to point to this directory (a sketch of the sync job is below, after this list). This dropped our operations per second by 90%. We were noticing that even though these files were in the client's cache, it still resulted in a stat call to NFS.
2.) You have to monitor the hell out of this. We watch for stale processes, number of processes, last write times, and network availability (examples of these checks are sketched below).
3.) The NFS box is going to be 98% of your tuning effort. Our database stayed around 8 GB as long as we truncated the discovery and poller log tables (per Adam, they are not used for anything). They grew by a couple million rows a day, so we truncate them hourly (cron sketch below). We give MySQL 12 GB of RAM and leave 4 GB to the OS. This seems to work fine.
4.) Mount options for the pollers are your friend. We used this (the fstab equivalent is below):

    mount -o 'rw,async,noatime,nodiratime,noacl,rsize=32768,wsize=32768' nfsbox:/mount/path /mnt/something
5.) Consistent configuration is huge. Use puppet or chef.
6.) When the database has 300k ports, page loads are bad. Sometimes we saw 15-30 second page load times; a PHP profiler trace of the splash page was 100 megs. :) All of the port-level access checks cause hundreds of thousands of calls to the database. We patched the code to remove port-level security; we don't have a use for it.
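For 1.), the sync job is nothing fancy. Something like this cron entry on each poller works, reusing the placeholder mount point from 4.); the 6-hour interval is arbitrary:

    # pull the MIBs from the NFS copy down to local disk every 6 hours
    0 */6 * * * rsync -a --delete /mnt/something/mibs/ /var/cache/observium/mibs/

config.php then points $config['mibdir'] at /var/cache/observium/mibs.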
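For 2.), hand-wavy examples of the kind of checks we mean; thresholds and paths are illustrative, not our exact monitoring:

    # list poller processes with how long they have been running; anything stuck for ages is suspect
    ps -eo pid,etime,args | grep '[p]oller.php'
    # how many poller processes are running right now
    pgrep -f poller.php | wc -l
    # RRDs that have not been written to in the last 10 minutes
    find /opt/observium/rrd -name '*.rrd' -mmin +10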
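For 3.), the hourly truncate is just a cron job. The table names here are placeholders; check SHOW TABLES in your schema for the actual discovery/poller log tables before copying this:

    # hourly: throw away the discovery/poller log tables (not used for anything, per Adam)
    0 * * * * mysql -e "TRUNCATE TABLE discovery_log; TRUNCATE TABLE poller_log;" observium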
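For 4.), if you want that mount to be persistent, the /etc/fstab equivalent is roughly (same placeholder paths as the mount command above):

    nfsbox:/mount/path  /mnt/something  nfs  rw,async,noatime,nodiratime,noacl,rsize=32768,wsize=32768  0 0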
On Tue, Feb 4, 2014 at 9:27 AM, Chip Pleasants wpleasants@gmail.com wrote:
Anyone have suggestions or real-world experiences they could share about scaling Observium with Rancid, Smokeping, and Syslog-ng? I have about 1,000 routers/switches in my network that equate to about 30k interfaces. Half the interfaces will probably be down, although I'm not sure that makes a difference.
Basically I'm wondering if I should break my tools out to individual VMs and use something like NFS so Observium can see the Rancid, Smokeping, and Syslog-ng data. Feel free to shoot me a private note. I appreciate any feedback.
-Chip
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium