So given what you have done so far, what issues are you currently experiencing? I would assume your IO/NFS has just gone to shit?
I have had a few random thoughts about horizontal scaling, but I haven't gone any further than thinking about them…
Some thoughts (in no particular order):
- Distributed pollers/RRD storage – someone wrote an interesting patch at one stage that would decouple the RRD location from the aggregation/presentation layer (i.e. your web-UI nodes could pull in data from all your cluster nodes) -- https://lists.oetiker.ch/pipermail/rrd-developers/2008-May/002203.html
- Most of the RRD access code seems pretty well encapsulated; it may be possible to replace it with a backend in something like OpenTSDB and a front end in JavaScript (flot?)
- There is an interesting article suggesting that the actual limits of a single-node solution could potentially be pushed back with some uber tuning -- http://code.google.com/p/epicnms/wiki/Scaling
- Quasi-similar to the first point: have a look at using a distributed filesystem (GlusterFS?)
- Cache in code, potentially in a layer, so that single-node instances use a simple in-PHP cache and multi-node instances hit something like Redis, for things you don't want to persist in your database but do want to keep around for a short time (quick sketch below)
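A very rough sketch of the shape I mean, in Python purely for brevity (in Observium this would obviously be PHP); the class names and TTLs are made up:

import time

class LocalCache(object):
    """Single-node installs: a plain in-process cache, nothing to deploy."""
    def __init__(self):
        self._store = {}
    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.time() + ttl)
    def get(self, key):
        value, expires = self._store.get(key, (None, 0))
        return value if time.time() < expires else None

class RedisCache(object):
    """Multi-node installs: the same interface, shared via Redis."""
    def __init__(self, host="localhost"):
        import redis                    # needs the redis-py package
        self._r = redis.StrictRedis(host=host)
    def set(self, key, value, ttl=60):
        self._r.setex(key, ttl, value)  # Redis expires the key by itself
    def get(self, key):
        return self._r.get(key)

The calling code only ever sees get()/set(); which backend gets instantiated is a config decision, so single-node installs pay no extra deployment cost.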
Just some random thoughts.
From: Joe Hoh <jhoh@costco.com>
Reply-To: Observium <observium@observium.org>
Date: Monday, 29 July 2013 12:00 PM
To: Observium <observium@observium.org>
Subject: [Observium] Very Large Environment
Thanks in advance for anyone who can help.
We are deploying Observium in a very large environment. We are at upwards of 6,000 devices and 300,000 ports.
I don't think we are going to be able to poll this in 5 minutes. This is what we have done so far:
- Distributed polling
- Polling: 8 VMs, 4 cores and 8 GB RAM each
- MySQL: 4 cores, 8 GB RAM
- Web interface: 4 cores, 8 GB RAM
- NFS server (for a centralized /opt/observium, except for the MIBs directory, which is copied to each server) - moving to an EMC VNX in 2 weeks.
- rrdcached being implemented (any assistance here is helpful)
- Modified poller-wrapper.py to distribute the devices that poll within 120 s across multiple instances of poller-wrapper.py running on multiple hosts. Devices that take longer than 120 s to poll are handled on separate servers at 15-minute intervals.
- poller-wrapper has been modified to allow for multiple instances just like poller.php with MODULUS distribution
- Each instance of poller-wrapper.py gets an instance number and the total number of instances.
- For every device with a last poll time < 120 seconds, the device_id is taken modulo the total number of instances and compared to the instance number: device_id MOD total_instances = this_instance (see the sketch after this list)
- Tuning threads in each poller-wrapper.py: currently 16 threads and 2 instances on each 4-vCPU server, i.e. 32 threads running at once, or 8 threads per core
- The DC is on the west coast, which presents latency problems; we may need to address this with distributed polling
- We have memory_limit at 1024 MB in php.ini
- We are using xcache (tuning help is appreciated, or should we just turn it off?)
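For illustration, the assignment logic looks roughly like this (a sketch rather than our production code; the devices.last_polled_timetaken column name is from memory and the connection details are placeholders):

import MySQLdb

def devices_for_instance(instance, total_instances, cutoff=120):
    """Return the device_ids this poller-wrapper instance should poll."""
    db = MySQLdb.connect(host="db-host", user="observium",
                         passwd="xxx", db="observium")
    cur = db.cursor()
    # Only the "fast" devices; anything slower than the cutoff is
    # polled elsewhere at 15-minute intervals.
    cur.execute("SELECT device_id FROM devices "
                "WHERE disabled = 0 AND last_polled_timetaken < %s",
                (cutoff,))
    # device_id MOD total_instances = this_instance
    return [d for (d,) in cur.fetchall() if d % total_instances == instance]

# e.g. instance 3 of 8:
# print devices_for_instance(3, 8)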
My questions are:
- How can I change the default RRD behavior to use 15-minute intervals instead of 5-minute intervals?
- 15 minutes x 2 weeks
- 2 hours x 5 weeks
- 4 hours x 12 months
- We want to keep the max/min/average/87.5th percentile (since there are only 8 measurements per 2 hours)
- I don't see the configuration items for that (a rough sketch of what we're after follows below).
- Would we be better off with a few big boxes rather than many small VMs?
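To make the retention concrete, here is a rough generic-rrdtool sketch of what we are after (file and DS names are placeholders, not Observium's layout). Note that stock RRDtool only offers the AVERAGE/MIN/MAX/LAST consolidation functions, so an 87.5th-percentile RRA can't be created directly; percentiles would have to be computed at graph time (e.g. a VDEF using PERCENT) or in the presentation layer:

import subprocess

STEP = 900                      # 15-minute polling interval
HEARTBEAT = STEP * 2            # tolerate one missed poll

rras = [
    "RRA:AVERAGE:0.5:1:1344",   # 15 min for 2 weeks   (14 d * 96 rows/day)
    "RRA:AVERAGE:0.5:8:420",    # 2 h for 5 weeks      (8 PDPs/row, 35 d * 12 rows/day)
    "RRA:AVERAGE:0.5:16:2190",  # 4 h for 12 months    (16 PDPs/row, 365 d * 6 rows/day)
    "RRA:MIN:0.5:8:420", "RRA:MAX:0.5:8:420",
    "RRA:MIN:0.5:16:2190", "RRA:MAX:0.5:16:2190",
]

subprocess.check_call(
    ["rrdtool", "create", "example-port.rrd",
     "--step", str(STEP),
     "DS:traffic_in:COUNTER:%d:0:U" % HEARTBEAT] + rras)

(The 0.5 xff means a consolidated row is still written as long as no more than half of its PDPs are unknown.)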