On Tue, Apr 8, 2014 at 1:06 PM, Adam Armstrong <adama@memetic.org> wrote:

Hi Chip,

The number of devices isn't so important; the number of ports is generally a more valid way of gauging requirements (100 Linux servers will probably poll faster than 10 Cisco 6509s).

We have no clean mechanism for running multiple poller systems. You always need to share storage and MySQL, which tends to remove most of the benefits, especially as it's not terribly difficult to get 24-core systems these days.

You will likely have four main scalability issues:

i) MySQL throughput
ii) Disk I/O throughput
iii) CPU time, mostly to run PHP and parse MIBs
iv) SNMP response latency

There are a few strategies you can use to help mitigate these, and they're probably effective beyond the point that 99% of users will be trying to scale to:

i) Move MySQL to separate hardware, and ask a MySQL guy how to make MySQL go superwooshfast.
ii) Move RRD storage to a RAM disk.
    Reduce the amount of averaging in the RRD structure to reduce I/O at the expense of disk space.
    Try rrdcached (it never seems to make much difference for us though, tbh).
    Add more (faster) spindles to your system. More disks means more I/O capacity.
    Move RRD storage to a separate system with more, faster disks and better caching (adds latency).
iii) Move to a faster system. We don't think it's worth the hassle to move polling to multiple systems until
you're past the point where a single system can comfortably scale. 12-core systems are easy.
iv) Run multiple pollers in parallel. You can relatively easily scale through 4, 8, 12, 16 cores. Once we have
more than a single user who needs 24 cores, maybe then it's worth looking at separate pollers.
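
As a back-of-the-envelope illustration of point iv), the sketch below estimates a lower bound on how many poller instances would have to run in parallel for the summed poll time to fit into one 5-minute cycle. It is not Observium code: the function name, the fleet sizes and the per-device poll times are all made-up assumptions, and it ignores the MySQL, disk I/O and CPU limits listed above.

```python
import math

POLL_INTERVAL = 300  # seconds in one 5-minute polling cycle


def min_parallel_pollers(poll_seconds):
    """Lower bound on parallel poller instances so the summed poll time fits in one cycle."""
    total = sum(poll_seconds)
    return max(1, math.ceil(total / POLL_INTERVAL))


if __name__ == "__main__":
    # Hypothetical fleet: 400 devices at 5 s each plus 20 slow devices at 60 s each.
    fleet = [5] * 400 + [60] * 20
    print(min_parallel_pollers(fleet))
```

The real number depends on how much headroom the system has for CPU, MySQL and I/O, so treat the result as a starting point to tune from rather than a target.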

A lot of people get hung up on how long it takes to finish polling a device. This isn't so important, so long as each part of each poller run is ~5 minutes apart. It's generally a good idea to run as many poller instances as your system will accommodate, accounting for CPU, MySQL and I/O. If a device takes 15 minutes to poll because it's far away or replies slowly, that's no problem: the pollers are started 5 minutes apart, so the CPU module will be run 5 minutes apart, the ports module will be run 5 minutes apart, etc.
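
The timing argument can be sketched in a few lines of Python. This is only an illustration, not how the poller is implemented: the module names and their offsets into a slow 15-minute run are hypothetical, and it assumes each module starts at the same offset in every run.

```python
POLL_INTERVAL = 300  # seconds between poller starts (5 minutes)

# Hypothetical modules and the offset (in seconds) at which each one
# starts inside a slow run that takes 15 minutes in total.
MODULE_OFFSETS = {"cpu": 0, "ports": 300, "sensors": 600}


def module_start_times(runs):
    """Return {module: [absolute start times]} for consecutive, overlapping poller runs."""
    times = {name: [] for name in MODULE_OFFSETS}
    for run in range(runs):
        run_start = run * POLL_INTERVAL
        for name, offset in MODULE_OFFSETS.items():
            times[name].append(run_start + offset)
    return times


def gaps(samples):
    """Differences between consecutive sample times."""
    return [later - earlier for earlier, later in zip(samples, samples[1:])]


if __name__ == "__main__":
    for name, samples in module_start_times(4).items():
        print(name, samples, gaps(samples))
```

Under those assumptions every module is sampled exactly 300 seconds apart, even though three runs against the same device overlap at any given moment.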

adam.

On 2014-04-04 05:45, Chip Pleasants wrote:

Thank you for the ping tuning. I've added those to the config.
However, as soon as I turn on the external poller I get snmp down
alerts from the poller that lives on the all-in-one server solution.
I'm assuming the alert emails get sent local to the poller? Not sure
really where to go from here if I want to use Observium for 800 more
devices. I plan to break out the database, which shouldn't be
difficult and should give the server some relief for polling, but if I
can't get the external poller working it may be a show stopper for me.
If it is possible, I'd really like to know more about the
multiple pollers. Any suggestions and time are appreciated.

Thanks,
Chip

On Wed, Apr 2, 2014 at 6:47 PM, Mike Stupalov <mike@observium.org> wrote:

in includes/defaults.inc.php:

$config['fping']          = "/usr/bin/fping";
$config['fping6']         = "/usr/bin/fping6";

// PING Settings - Retries/Timeouts
#$config['ping']['retries'] = 3;    // How many times to retry ping (1 - 10)
#$config['ping']['timeout'] = 500; // Timeout in milliseconds (50 - 2000)

On Thu, Apr 3, 2014 at 1:50 AM, Chip Pleasants <wpleasants@gmail.com> wrote:

When there are multiple pollers via NFS, how do they pick the devices to poll? Basically, how do they not step all over each other polling the same nodes? I'm seeing snmp and ping alerts, 15 or so every hour, that didn't come in when it was a single server solution. It does seem to be the same 20 or so devices. These particular devices do typically take around 60 sec to poll. Looking at the devices that generated alerts, their cpu, snmp response time, and ping time were all over the place. Meaning cpu doubled (10% to 20%), ping times went up to 200ms from 10ms, and average snmp response time went from 50ms to 556ms. I'm wondering if I was polling these devices multiple times within 5 minutes? Would this be related to NFS IO issues? I reverted back to a single solution for now. Any assistance is greatly appreciated.

-Chip


Config Snippet

$config['alerts']['email']['enable'] = TRUE;
$config['poller-wrapper']['alerter'] = TRUE;
$config['snmp']['timeout'] = 6;

$config['snmp']['retries'] = 3;
$config['snmp']['max-rep'] = 10;
$config['fping'] = "/usr/sbin/fping -t2000";

_______________________________________________
observium mailing list
observium@observium.org

http://postman.memetic.org/cgi-bin/mailman/listinfo/observium [1]

--
Mike Stupalov
http://observium.org/ [2]

Links:
------
[1] http://postman.memetic.org/cgi-bin/mailman/listinfo/observium
[2] http://observium.org/
