Thanks in advance to anyone who can help.
We are deploying Observium in a very large environment: upwards of 6,000 devices and 300,000 ports.
I don't think we are going to be able to poll all of this within the default 5 minute interval. This is what we have done so far:
- Distributed polling
- Polling - 8 VMs, 4 cores and 8GB RAM each
- MySQL - 4 core, 8GB RAM
- Web Interface - 4 core, 8GB RAM
- NFS Server (for centralized /opt/observium - except for the MIBS directory, which is copied across each server) - moving to an EMC VNX in 2 weeks.
- rrdcached being implemented (any assistance here is helpful)
- Modified poller-wrapper.py to distribute the devices that finish polling within 120s across multiple instances of poller-wrapper.py running on multiple hosts. Devices that take more than 120s to poll are polled on separate servers at 15 minute intervals.
- poller-wrapper.py has been modified to allow for multiple instances, just like poller.php, with MODULUS distribution
- Each instance of poller-wrapper.py gets an instance number and the total number of instances.
- For every device with a last poll time < 120 seconds, device_id is MOD'ed with the total number of instances and the result is compared to the instance number. An instance polls the device when device_id MOD total_instances = this_instance
- Tuning threads in each poller-wrapper.py - currently at 16 threads and 2 instances on each 4 vCPU server, for 32 threads running at once, or 8 threads per core
- The DC is on the west coast and that presents latency problems. We may need to address this with geographically distributed pollers
- We are at 1024MB in php.ini
- We are using xcache (tuning help is appreciated, or should we just turn it off?)
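To make the MODULUS distribution described above concrete, here is a minimal sketch of the selection rule in Python. It is not the actual change to poller-wrapper.py: the function name, the (device_id, last_poll_seconds) tuple layout and the zero-based instance numbering are assumptions made for illustration.

```python
# Hypothetical sketch of the modulus split; not the real poller-wrapper.py code.
FAST_POLL_LIMIT = 120  # seconds; slower devices go to the 15 minute pollers


def devices_for_instance(devices, this_instance, total_instances,
                         limit=FAST_POLL_LIMIT):
    """Return the device_ids this instance should poll.

    devices: iterable of (device_id, last_poll_seconds) tuples (assumed layout).
    this_instance: zero-based instance number (assumption).
    """
    if total_instances < 1:
        raise ValueError("total_instances must be at least 1")
    if not 0 <= this_instance < total_instances:
        raise ValueError("this_instance must be in range(total_instances)")
    return [
        device_id
        for device_id, last_poll_seconds in devices
        # Only fast devices take part in the modulus split.
        if last_poll_seconds < limit
        and device_id % total_instances == this_instance
    ]


if __name__ == "__main__":
    sample = [(1, 30), (2, 45), (3, 200), (4, 10), (5, 119), (6, 120)]
    for instance in range(2):
        print(instance, devices_for_instance(sample, instance, 2))
```

Because every fast device_id maps to exactly one remainder, no device is polled twice and none is skipped, as long as all instances agree on total_instances.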
My questions are:
- How can I change the default RRD behavior to use 15 minute intervals instead of 5 minute intervals?
- 15 minutes x 2 weeks
- 2 hours x 5 weeks
- 4 hours x 12 months
- We want to keep the max/min/average/87.5th percentile (since only 8 measurements per 2 hours)
- I don't see the configuration items for that.
- Would we be better off with a few big boxes rather than many small VMs?
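For reference, the retention tiers in the first question work out as follows when expressed in rrdtool's RRA:CF:xff:steps:rows form. This is only a sketch of the arithmetic: the function name, the 0.5 xff value and the reading of "12 months" as 365 days are assumptions, and it does not show where Observium reads these definitions from. As far as I know, rrdtool's standard consolidation functions are AVERAGE, MIN, MAX and LAST, so the 87.5th percentile is left out here and would probably have to be computed at graph time instead.

```python
# Sketch of the RRA arithmetic for a 15 minute step; names are illustrative.
MINUTE = 60
HOUR = 60 * MINUTE
DAY = 24 * HOUR

STEP_SECONDS = 15 * MINUTE

# (resolution of one row, how long to keep rows at that resolution)
TIERS = [
    (15 * MINUTE, 14 * DAY),   # 15 minutes x 2 weeks
    (2 * HOUR, 35 * DAY),      # 2 hours x 5 weeks
    (4 * HOUR, 365 * DAY),     # 4 hours x 12 months (assumed 365 days)
]


def rra_definitions(step_seconds=STEP_SECONDS, tiers=TIERS,
                    cfs=("AVERAGE", "MIN", "MAX"), xff=0.5):
    """Build rrdtool RRA strings for each tier and consolidation function."""
    definitions = []
    for resolution, retention in tiers:
        steps = resolution // step_seconds  # primary data points per row
        rows = retention // resolution      # rows needed to cover retention
        for cf in cfs:
            definitions.append(f"RRA:{cf}:{xff}:{steps}:{rows}")
    return definitions


if __name__ == "__main__":
    for line in rra_definitions():
        print(line)
```

With a 900 second step this gives 1 step x 1344 rows, 8 steps x 420 rows and 16 steps x 2190 rows per consolidation function.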