Re: [Observium] Multiple Pollers Guidance

9 Apr 2014


Hi Adam,
Thanks for taking the time to provide feedback. We love Observium. I hope
to break out the database to another VM very soon, as well as double the
cores on my current VM to 8. We currently have about 20k ports across 218
devices, but once all our devices are added I'm guessing we'll have about
60k ports in total. Does it help reduce CPU/polling load if the ports are
admin down or down/down versus shutdown? Right now I only have VMs at my
disposal, so I'm attempting to maximize the environment.
-Chip
On Tue, Apr 8, 2014 at 1:06 PM, Adam Armstrong <adama@memetic.org> wrote:
...
Hi Chip,
The number of devices isn't so important; the number of ports is generally
a better gauge of requirements (100 Linux servers will probably poll faster
than 10 Cisco 6509s).
We have no clean mechanism for running multiple poller systems. You always
need to share storage and MySQL, which tends to remove most of the benefits,
especially as it's not terribly difficult to get 24-core systems these days.
You will likely have four main scalability issues:
i)   MySQL throughput
ii)  Disk I/O throughput
iii) CPU time, mostly to run PHP and parse MIBs
iv)  SNMP response latency
There are a few strategies you can use to help mitigate these, and they're
probably effective beyond the point that 99% of users will be trying to
scale to:
i)   Move MySQL to separate hardware; ask a MySQL guy how to make MySQL go
     superwooshfast.
ii)  - Move RRD storage to a RAM disk.
     - Reduce the amount of averaging in the RRD structure to reduce I/O at
       the expense of disk space.
     - Try rrdcached (it never seems to make much difference for us, tbh).
     - Add more (faster) spindles to your system. More disks means more I/O
       capacity.
     - Move RRD storage to a separate system with more, faster disks and
       better caching (adds latency).
iii) Move to a faster system. We don't think it's worth the hassle to move
     polling to multiple systems until you're past the point where a single
     system can comfortably scale. 12-core systems are easy.
iv)  Run multiple pollers in parallel. You can relatively easily scale
     through 4, 8, 12, 16 cores. Once we have more than a single user who
     needs 24 cores, maybe then it's worth looking at separate pollers.
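
As a rough way to put a number on "run multiple pollers in parallel", here is a back-of-the-envelope sketch in Python. It assumes (these are not Observium figures) that every device must be polled once per 300-second cycle and that poll time is mostly spent waiting on the device, so the result is only a lower bound before CPU, MySQL or disk I/O limits kick in. The function name and the 20-second average poll time are hypothetical.

```python
import math

POLL_INTERVAL_S = 300  # 5-minute polling cycle, as described in the thread


def min_parallel_pollers(device_count, avg_poll_seconds, interval_s=POLL_INTERVAL_S):
    """Lower bound on concurrent poller processes needed to poll every
    device once per interval, ignoring CPU, MySQL and disk I/O limits."""
    if device_count < 0 or avg_poll_seconds < 0:
        raise ValueError("inputs must be non-negative")
    total_work_s = device_count * avg_poll_seconds
    return max(1, math.ceil(total_work_s / interval_s))


if __name__ == "__main__":
    # 218 devices today, roughly 800 more planned (figures from the thread);
    # the 20-second average poll time is an illustrative assumption.
    for devices in (218, 218 + 800):
        print(devices, "devices ->", min_parallel_pollers(devices, 20), "pollers")
```

Treat the result as a starting point and measure actual per-device poll times before sizing anything.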
A lot of people get hung up on how long it takes to finish polling a
device. This isn't so important, so long as each part of each poller run is
~5 minutes apart. It's generally a good idea to run as many poller
instances as your system will accommodate, accounting for CPU, MySQL and I/O.
If a system takes 15 minutes to poll because it's far away or replies
slowly, that's no problem: the pollers are started 5 minutes apart, so the
CPU module will be run 5 minutes apart, the ports module will be run 5
minutes apart, etc.
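
The point about spacing can be illustrated with a small Python sketch. It assumes a new poller run starts against the device every 300 seconds, that runs may overlap, and that a given module begins at roughly the same offset into each run; the offsets and function names are hypothetical.

```python
POLL_INTERVAL_S = 300  # a new poller run starts every 5 minutes


def module_sample_times(module_offset_s, runs=4, interval_s=POLL_INTERVAL_S):
    """Times (seconds) at which one module runs against one device, when a
    new poller run starts every interval and the module always begins
    module_offset_s seconds into the run."""
    return [run * interval_s + module_offset_s for run in range(runs)]


def sample_gaps(times):
    """Spacing between consecutive samples."""
    return [later - earlier for earlier, later in zip(times, times[1:])]


if __name__ == "__main__":
    # Hypothetical slow device whose full poll takes 15 minutes: the CPU
    # module starts 30 s into each run and the ports module 600 s in.
    for name, offset in (("cpu", 30), ("ports", 600)):
        times = module_sample_times(offset)
        print(name, times, sample_gaps(times))
```

Even though each run finishes long after the next one has started, every module still produces samples 300 seconds apart.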
adam.
On 2014-04-04 05:45, Chip Pleasants wrote:
...
Thank you for the ping tuning. I've added those to the config.
However, as soon as I turn on the external poller I get SNMP down
alerts from the poller that lives on the all-in-one server.
I'm assuming the alert emails get sent locally by the poller? I'm not sure
where to go from here if I want to use Observium for 800 more
devices. I plan to break out the database, which shouldn't be
difficult and should give the server some relief for polling, but if I
can't get the external poller working it may be a show stopper for me.
If it is possible, I'd really like to know more about the
multiple pollers. Any suggestions and time are appreciated.
Thanks,
Chip
On Wed, Apr 2, 2014 at 6:47 PM, Mike Stupalov <mike@observium.org> wrote:
In includes/defaults.inc.php:
$config['fping']          = "/usr/bin/fping";
$config['fping6']         = "/usr/bin/fping6";
// PING Settings - Retries/Timeouts
#$config['ping']['retries'] = 3;    // How many times to retry ping (1 - 10)
#$config['ping']['timeout'] = 500;  // Timeout in milliseconds (50 - 2000)
On Thu, Apr 3, 2014 at 1:50 AM, Chip Pleasants <wpleasants@gmail.com> wrote:
When there are multiple pollers via NFS, how do they pick the devices to
poll? Basically, how do they avoid stepping all over each other by polling
the same nodes? I'm seeing SNMP and ping alerts, 15 or so every hour, that
didn't come in when it was a single-server solution. It does seem to
be the same 20 or so devices. These particular devices typically take
around 60 sec to poll. Looking at the devices that generated alerts, their
CPU, SNMP response time, and ping time were all over the place: CPU
doubled (10% to 20%), ping times went up from 10ms to 200ms, and average
SNMP response time went from 50ms to 556ms. I'm wondering if I was
polling these devices multiple times within 5 minutes? Would this be
related to NFS I/O issues? I've reverted to a single-server solution for now.
Any assistance is greatly appreciated.
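
On the question of how several pollers can avoid polling the same nodes, one generic approach is to split the device list deterministically, for example by device ID modulo the number of pollers. The Python sketch below illustrates that idea only; it is not a description of how Observium assigns devices, and the function name and device IDs are hypothetical.

```python
def devices_for_poller(device_ids, poller_index, poller_count):
    """Return the device IDs owned by one poller under modulo partitioning.
    Each ID lands on exactly one poller, so no device is polled twice."""
    if poller_count < 1 or not 0 <= poller_index < poller_count:
        raise ValueError("poller_index must be in range(poller_count)")
    return [d for d in device_ids if d % poller_count == poller_index]


if __name__ == "__main__":
    device_ids = list(range(1, 11))  # hypothetical device IDs 1..10
    shares = [devices_for_poller(device_ids, i, 2) for i in range(2)]
    print(shares)
    # Sanity check: the shares are disjoint and together cover every device.
    assert sorted(shares[0] + shares[1]) == device_ids
```

If two pollers each work through the full device list instead, every device gets queried twice per cycle, which would be consistent with the extra load described above.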
-Chip
Config snippet
$config['alerts']['email']['enable']       = TRUE;
$config['poller-wrapper']['alerter']       = TRUE;
$config['snmp']['timeout'] = 6;
$config['snmp']['retries'] = 3;
$config['snmp']['max-rep'] = 10;
$config['fping']          = "/usr/sbin/fping -t2000";
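
To put the SNMP settings above in perspective, this Python sketch estimates how long a single unanswered request can block a poller. It assumes the values are handed to Net-SNMP as a timeout in seconds and a retry count, where retries are extra attempts after the first request; the function name is hypothetical.

```python
def snmp_worst_case_wait(timeout_s, retries):
    """Seconds spent on one request that is never answered, assuming one
    initial attempt plus `retries` further attempts, each waiting timeout_s."""
    if timeout_s < 0 or retries < 0:
        raise ValueError("inputs must be non-negative")
    return timeout_s * (retries + 1)


if __name__ == "__main__":
    # Values from the config snippet: timeout = 6, retries = 3.
    print(snmp_worst_case_wait(6, 3))  # 24
```

Under those assumptions, a device that stops answering can hold a poller for 24 seconds per request, which adds up quickly across many requests.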

observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium [1]
--
Mike Stupalov
http://observium.org/ [2]

Links:
[1] http://postman.memetic.org/cgi-bin/mailman/listinfo/observium
[2] http://observium.org/