You are correct: we have a dedicated NFS VM that is Fusion IO backed for storage, plus several smaller VMs that remote-mount the NFS export. We override $config['mibdir'] to /var/cache/observium/mibs, which is a local copy of the MIBs.
The page load times seem to climb with the number of ports. There is a port security check, so anything that displays a lot of ports results in a large number of calls to the database. They are primary key lookups, but each one is still a remote call. You might want to try keeping the web server and DB server on the same box. We eliminated a lot of these port check calls by modifying the code; we don't need port-level ACLs and would prefer the data-hippy approach of letting everybody see everything.
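If you want to put a number on it, here is a rough sketch (not something we run in production; the hostname is a placeholder, and you need a MySQL account that can SET GLOBAL) that flips on MySQL's general query log, loads a page, and counts the queries:

    mysql -e "SET GLOBAL general_log_file='/tmp/obs-queries.log'; SET GLOBAL general_log='ON';"
    curl -s http://observium.example.com/ > /dev/null
    mysql -e "SET GLOBAL general_log='OFF';"
    grep -c Query /tmp/obs-queries.log

Multiply that count by the round-trip time between your web and DB boxes and you get a feel for why co-locating them (or patching out the per-port checks) matters. Don't leave the general log on; it grows fast.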
On Fri, Feb 7, 2014 at 5:16 AM, Chip Pleasants wpleasants@gmail.com wrote:
Jeremy,
Thank you very much for the detailed message. That sounds like an awesome setup. I don't have the time, nor do I want to make my environment crazy complex, believe me. Sounds like pollers, then the database, would be the first break-out items. Just to make sure I understand: the pollers run on a dedicated VM with an NFS mount to run the scripts, and have a local copy of the MIBs? The database server is also a dedicated VM, so it's a remote TCP connection (3306 for MySQL) from the main Observium web server? In a smaller environment, does having the database on a different VM add long page loads or other issues? Thanks again for the feedback, I really appreciate it!
-Chip
On Thu, Feb 6, 2014 at 12:34 PM, Jeremy Custenborder jcustenborder@gmail.com wrote:
I think this is one I can speak to, at least the part about running Observium on NFS. We use a commercial product in place of Rancid, syslog-ng belongs to another team, and we don't use Smokeping.
I use Observium to monitor about 700 remote locations and a few data centers with ~7k devices, resulting in 300k ports. This is mostly Cisco devices; another team monitors server-level metrics. Right off the bat I would recommend keeping the environment as small as possible if you can. As long as you can poll all of your devices in 5 minutes, I would not break it out. If you are not finishing in 5 minutes, then consider a path like this. It greatly increases the complexity of the environment and will put you in an unsupported configuration. Given our network is spread all over the country, our biggest problem is latency, not raw hardware.
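A quick way to sanity-check that 5-minute window is just to time a full polling run. For example (paths assume a stock install under /opt/observium and the standard poller wrapper; the 8 threads is an arbitrary number, adjust to whatever you actually run):

    cd /opt/observium
    time ./poller-wrapper.py 8

If the real time is creeping toward 300 seconds, that's when splitting pollers out starts to make sense.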
We queue all checks through RabbitMQ. Cron jobs determine which devices need to be polled or discovered and write them to a queue.
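The cron side is roughly like this. This is a simplified sketch, not our exact job: it assumes the rabbitmqadmin CLI, a queue named poll, and the stock devices table with a disabled column.

    # every 5 minutes: push each enabled device id onto the poll queue
    mysql -N -B -e "SELECT device_id FROM devices WHERE disabled = 0" observium |
    while read id; do
      rabbitmqadmin publish exchange=amq.default routing_key=poll payload="$id"
    done

Each poller then consumes device ids off that queue and runs the poller against them.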
We're running 100% on VMware on a Cisco UCS stack. The OS is CentOS 6.5.
1x DB server (4 cores, 16 GB of RAM)
1x NFS server (4 cores, 32 GB of RAM, Fusion IO disk, XFS filesystem tuned for heavy write cache)
1x web server (4 cores, 16 GB of RAM, RabbitMQ, cron tasks to populate RabbitMQ)
16x pollers (2 cores, 8 GB of RAM)
1x Nagios server (specific to this infrastructure)
We also have a Nagios environment with about 100k checks that are generated from the Observium database based on IOS feature types, etc.
Observations:
1.) Pollers spend an insane amount of IO checking the MIBs. We were seeing 4-10k NFS operations a second per poller when loading from NFS. We have a job that syncs the MIBs from NFS to a cache directory on the poller's local disk, and we override mibdir in the config to point to this directory (a sketch of the sync job is below, after this list). This dropped our operations per second by 90%. We were noticing that even though these files were in the client's cache, it still resulted in a stat call to NFS.
2.) You have to monitor the hell out of this. We watch for stale processes, number of processes, last write times, and network availability (examples of these checks are sketched below).
3.) The NFS box is going to be 98% of your tuning effort. Our database stayed around 8 GB as long as we truncated the discovery and poller log tables (per Adam, they are not used for anything). They grew by a couple million rows a day, so we truncate them hourly (cron sketch below). We give MySQL 12 GB of RAM and leave 4 GB to the OS. This seems to work fine.
4.) Mount options for the pollers are your friend. We used this (the fstab equivalent is below):

    mount -o 'rw,async,noatime,nodiratime,noacl,rsize=32768,wsize=32768' nfsbox:/mount/path /mnt/something
5.) Consistent configuration is huge. Use puppet or chef.
6.) When the database has 300k ports, page loads are bad. Sometimes we saw 15-30 second page load times; a PHP profiler trace of the splash page was 100 megs. :) All of the port-level access checks cause hundreds of thousands of calls to the database. We patched the code to remove port-level security; we don't have a use for it.
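For 1.), the sync job is nothing fancy. Something like this cron entry on each poller works, reusing the placeholder mount point from 4.); the 6-hour interval is arbitrary:

    # pull the MIBs from the NFS copy down to local disk every 6 hours
    0 */6 * * * rsync -a --delete /mnt/something/mibs/ /var/cache/observium/mibs/

config.php then points $config['mibdir'] at /var/cache/observium/mibs.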
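For 2.), hand-wavy examples of the kind of checks we mean; thresholds and paths are illustrative, not our exact monitoring:

    # list poller processes with how long they have been running; anything stuck for ages is suspect
    ps -eo pid,etime,args | grep '[p]oller.php'
    # how many poller processes are running right now
    pgrep -f poller.php | wc -l
    # RRDs that have not been written to in the last 10 minutes
    find /opt/observium/rrd -name '*.rrd' -mmin +10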
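For 3.), the hourly truncate is just a cron job. The table names here are placeholders; check SHOW TABLES in your schema for the actual discovery/poller log tables before copying this:

    # hourly: throw away the discovery/poller log tables (not used for anything, per Adam)
    0 * * * * mysql -e "TRUNCATE TABLE discovery_log; TRUNCATE TABLE poller_log;" observium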
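For 4.), if you want that mount to be persistent, the /etc/fstab equivalent is roughly (same placeholder paths as the mount command above):

    nfsbox:/mount/path  /mnt/something  nfs  rw,async,noatime,nodiratime,noacl,rsize=32768,wsize=32768  0 0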
On Tue, Feb 4, 2014 at 9:27 AM, Chip Pleasants wpleasants@gmail.com wrote:
Anyone have suggestions or real-world experiences they could share about scaling Observium with Rancid, Smokeping, and Syslog-ng? I have about 1,000 routers/switches in my network that equate to about 30k interfaces. Half the interfaces will probably be down, although I'm not sure that makes a difference.
Basically I'm wondering if I should break my tools out to individual VMs and use something like NFS so Observium can see the Rancid, Smokeping, and Syslog-ng data. Feel free to shoot me a private note. I appreciate any feedback.
-Chip
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium