We are having problems with NFS performance, yes.  We are also having problems getting the polling nodes to be better utilized.

Nothing is working very hard, but throughput (measured in devices that complete polling per second) is extremely low.

We are going to try some of the RRD filesystem tuning recommended in the article - that is an EXCELLENT read.  Giving the NFS server more RAM and tuning the cache should help - unless its use as an NFS server obviates that improvement.

What's the largest Observium deployment that you know of?  6,000 devices and 300,000 ports seems MASSIVE!


On Sun, Jul 28, 2013 at 11:29 PM, Peter Childs <pchilds@staff.iinet.net.au> wrote:

So given what you have done so far, what issues are you currently experiencing?   I would assume your IO/NFS has just gone to shit?

I have had a few random thoughts about horizontal scaling, but I haven't gone any further than thinking about it…

Some thoughts (not in any particular order):
  1. Distributed pollers/RRD storage – someone wrote an interesting patch at one stage that would allow the decoupling of the RRD location from the aggregation/presentation layer (ie your WEB-UI nodes could pull in data from all your cluster nodes) -- https://lists.oetiker.ch/pipermail/rrd-developers/2008-May/002203.html
  2. Most of the RRD access stuff seems pretty well encapsulated - it may be possible to replace this with a backend on something like OpenTSDB and a front end in JavaScript (flot?)...
  3. There is an interesting article which suggests the actual limits of a single-node solution could potentially be dealt with by some uber tuning -- http://code.google.com/p/epicnms/wiki/Scaling
  4. Quasi-similar to (1), have a look at using a distributed filesystem (gluster?)
  5. Cache in code, potentially in a layer so that single-node instances just use an in-PHP cache and multi-node instances hit something like redis - for things you don't really want to persist in your database but only want to keep around for a short lifetime (rough sketch below)
Just some random thoughts
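
For (5), something like the sketch below is roughly what I had in mind - written in Python purely for brevity (the real thing would obviously live in the PHP layer), and the class and key names are made up rather than anything that exists in Observium:

# Rough sketch of thought (5), in Python purely for brevity -- the real thing
# would live in the PHP layer, and the class/key names here are made up.
import time

import redis  # pip install redis; only needed for the multi-node backend


class LocalCache(object):
    """Single-node case: a process-local cache with per-key expiry."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires = self._store.get(key, (None, 0))
        return value if expires > time.time() else None

    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.time() + ttl)


class RedisCache(object):
    """Multi-node case: the same interface backed by a shared redis instance."""
    def __init__(self, host='127.0.0.1', port=6379):
        self._conn = redis.StrictRedis(host=host, port=port)

    def get(self, key):
        return self._conn.get(key)

    def set(self, key, value, ttl=60):
        self._conn.setex(key, ttl, value)


# Single-node example; swapping in RedisCache() needs no caller changes.
cache = LocalCache()
cache.set('ports:total', 300000, ttl=300)
print(cache.get('ports:total'))

The point being that callers only ever see get()/set(), and whether that is backed by a process-local array or a shared redis instance becomes a per-install config decision.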


From: Joe Hoh <jhoh@costco.com>
Reply-To: Observium <observium@observium.org>
Date: Monday, 29 July 2013 12:00 PM
To: Observium <observium@observium.org>
Subject: [Observium] Very Large Environment

Thanks in advance for anyone who can help.

We are deploying Observium in a very large environment.  We are at upwards of 6,000 devices and 300,000 ports.

I don't think we are going to be able to poll this in 5 minutes.  This is what we have done so far:
  • Distributed polling
    • Polling - 8 VMs - 4core, 8GB RAM each
    • MySQL - 4 core, 8GB RAM
    • Web Interface - 4 core, 8GB RAM
    • NFS Server (for a centralized /opt/observium - except for the MIBs directory, which is copied to each server) - moving to an EMC VNX in 2 weeks.
  • rrdcached being implemented (any assistance here is helpful)
  • Modified poller-wrapper.py to distribute the devices that poll within 120s across multiple instances of poller-wrapper.py running on multiple hosts.  Devices that take more than 120s to poll are polled on separate servers at 15-minute intervals.
  • poller-wrapper has been modified to allow for multiple instances, just like poller.php with MODULUS distribution (a rough sketch of the selection logic follows this list)
    • Each instance of poller-wrapper.py gets an instance number and the total number of instances.
    • Each device with a last poll time < 120 seconds is assigned by taking device_id MOD total_instances and comparing the result to the instance number: device_id MOD total_instances = this_instance
    • Tuning threads in each poller-wrapper.py - currently at 16 threads and 2 instances on each 4-vCPU server, for 32 threads running at once, or 8 threads per core
  • The DC is on the west coast, and that presents latency problems.  We may need to address this with distributed polling.
  • We are at 1024MB in php.ini
  • We are using xcache (tuning help is appreciated - or should we just turn it off?)
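
For reference, the selection logic mentioned above is roughly as follows.  This is just an illustration of the MODULUS scheme rather than our actual patch, and the example numbers are made up:

# Illustrative sketch of the instance-selection logic, not our actual patch.
# Each poller-wrapper.py instance is handed its instance number and the total
# instance count, and keeps only the devices whose id maps to it under
# MODULUS distribution; slow devices are left to the 15-minute pollers.

FAST_POLL_CUTOFF = 120  # seconds


def devices_for_instance(devices, instance, total_instances):
    """devices is a list of (device_id, last_poll_seconds) tuples."""
    return [device_id for device_id, poll_time in devices
            if poll_time <= FAST_POLL_CUTOFF
            and device_id % total_instances == instance]


# Example with 2 instances; in our setup total_instances is 16 (8 VMs x 2).
example = [(101, 45.2), (102, 300.0), (103, 80.1), (104, 12.9)]
print(devices_for_instance(example, 1, 2))   # -> [101, 103]
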
My questions are:
  • How can I change the default RRD behavior to use 15-minute intervals instead of 5-minute intervals?
    • 15 minutes x 2 weeks
    • 2 hours x 5 weeks
    • 4 hours x 12 months
    • We want to keep the max/min/average/87.5th percentile (since there are only 8 measurements per 2 hours)
    • I don't see the configuration items for that.  (A sketch of the intended retention at the rrdtool level follows the questions below.)
  • Would we be better off with a few big boxes rather than small VMs?
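
To illustrate what we are after, here is roughly what that retention would look like at the raw rrdtool level with a 900-second step.  The filename and DS are placeholders rather than Observium's own definitions, and note that rrdtool RRAs only support AVERAGE/MIN/MAX/LAST consolidation, so the 87.5th percentile would have to be computed at read time:

# Sketch only: the retention we want, expressed as raw rrdtool RRAs with a
# 900-second step.  Filename and DS are placeholders, not Observium's own
# definitions (those live in its PHP code).
import subprocess

STEP = 900  # 15-minute base interval

rras = []
for cf in ('AVERAGE', 'MIN', 'MAX'):
    # rrdtool has no percentile consolidation function, so the 87.5th
    # percentile would have to be derived at graph/read time instead.
    rras += [
        'RRA:%s:0.5:1:1344' % cf,    # 15 min x 2 weeks   (14 d x 96 rows/day)
        'RRA:%s:0.5:8:420' % cf,     # 2 h    x 5 weeks   (35 d x 12 rows/day)
        'RRA:%s:0.5:16:2190' % cf,   # 4 h    x 12 months (365 d x 6 rows/day)
    ]

subprocess.check_call(
    ['rrdtool', 'create', 'example-port.rrd', '--step', str(STEP),
     'DS:INOCTETS:DERIVE:%d:0:U' % (STEP * 2)] + rras)

If this is workable, I assume the poller cron interval would also need to move to 15 minutes, and existing RRD files would have to be recreated, since the step of an existing RRD cannot simply be changed in place.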




_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium