Thanks in advance to anyone who can help.

We are deploying Observium in a very large environment: upwards of 6,000 devices and 300,000 ports.

I don't think we are going to be able to poll this in 5 minutes.  This is what we have done so far:
  • Distributed polling
    • Polling - 8 VMs - 4 cores, 8GB RAM each
    • MySQL - 4 core, 8GB RAM
    • Web Interface - 4 core, 8GB RAM
    • NFS server (for a centralized /opt/observium, except for the MIBs directory, which is copied to each server) - moving to an EMC VNX in 2 weeks.
  • rrdcached being implemented (any assistance here is helpful)
  • Modified poller-wrapper.py to distribute the devices that poll within 120s across multiple instances of poller-wrapper.py running on multiple hosts (rough sketch after this list). Devices that take more than 120s to poll are handled by separate servers at 15-minute intervals.
  • poller-wrapper.py has been modified to allow multiple instances, just like poller.php, using MODULUS distribution:
    • Each instance of poller-wrapper.py gets an instance number and the total number of instances.
    • For every device whose last poll time is < 120 seconds, the device_id is taken MOD the total number of instances and compared to the instance number; a device belongs to this instance when device_id MOD total_instances = this_instance.
    • Tuning threads in each poller-wrapper.py - currently 16 threads and 2 instances on each 4 vCPU server, for 32 concurrent threads, or 8 threads per core.
  • The DC is on the west coast, which presents latency problems. We may need to address this with distributed polling.
  • We have PHP's memory limit at 1024MB in php.ini.
  • We are using xcache (tuning help is appreciated - or should we just turn it off?).
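
For reference, this is roughly what the MODULUS split in our modified poller-wrapper.py looks like. It is only a minimal sketch: the real script still does the threading, DB access, and logging of the stock wrapper, the argument order and names here are illustrative rather than exactly what we run, and the sample data is made up.

#!/usr/bin/env python
# Minimal sketch of the MODULUS split in our modified poller-wrapper.py.
# Illustrative only: each instance is assumed to be started as
#   poller-wrapper.py <instance_number> <total_instances>
# and the (device_id, last_poll_seconds) pairs stand in for the real DB query.

import sys

FAST_POLL_CUTOFF = 120  # seconds; slower devices go to the 15-minute pollers


def devices_for_instance(devices, instance_number, total_instances):
    """Return the device_ids this wrapper instance should poll.

    A device belongs to this instance when its last poll finished within
    FAST_POLL_CUTOFF seconds and device_id MOD total_instances == instance_number.
    """
    return [device_id
            for device_id, last_poll_seconds in devices
            if last_poll_seconds < FAST_POLL_CUTOFF
            and device_id % total_instances == instance_number]


if __name__ == "__main__":
    instance_number = int(sys.argv[1]) if len(sys.argv) > 2 else 0
    total_instances = int(sys.argv[2]) if len(sys.argv) > 2 else 2
    # Made-up sample data in place of the real devices table.
    sample = [(1, 35.2), (2, 180.4), (3, 90.0), (4, 12.7), (5, 240.9), (6, 61.3)]
    print(devices_for_instance(sample, instance_number, total_instances))
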
My questions are:
  • How can I change the default RRD behavior to use 15-minute intervals instead of 5-minute intervals?
    • 15 minutes x 2 weeks
    • 2 hours x 5 weeks
    • 4 hours x 12 months
    • We want to keep max/min/average and the 87.5th percentile (since there are only 8 measurements per 2 hours).
    • I don't see the configuration items for that; the raw rrdtool version of what we're after is sketched below.
  • Would we be better off with a few big boxes rather than many small VMs?
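
To make the first question concrete, here is what that retention works out to as raw "rrdtool create" arguments with a 900-second step. The filename, data source name, and DS type below are placeholders, not Observium's actual per-type definitions - we just can't find where to plug the equivalent into Observium. Also, as far as I know RRAs only support AVERAGE/MIN/MAX/LAST consolidation, so the 87.5th percentile would presumably have to be computed at graph time rather than stored.

#!/usr/bin/env python
# Sketch only: the raw rrdtool arguments our retention scheme works out to
# with a 900-second (15-minute) step. Observium's per-type DS definitions
# would have to be kept; "ds0" and "example.rrd" are placeholders.

STEP = 900              # 15-minute polling interval
HEARTBEAT = STEP * 2    # tolerate one missed poll before the DS goes unknown

# (label, resolution in seconds, retention in seconds)
RETENTION = [
    ("15 min x 2 weeks",  900,   14 * 86400),
    ("2 hours x 5 weeks", 7200,  35 * 86400),
    ("4 hours x 12 mo",   14400, 365 * 86400),
]


def rra_args():
    args = []
    for _label, resolution, keep in RETENTION:
        steps = resolution // STEP   # primary data points per row
        rows = keep // resolution    # rows to keep at this resolution
        # RRD consolidation functions are limited to AVERAGE/MIN/MAX/LAST,
        # so no stored percentile RRA here.
        for cf in ("AVERAGE", "MIN", "MAX"):
            args.append("RRA:%s:0.5:%d:%d" % (cf, steps, rows))
    return args


if __name__ == "__main__":
    create = (["rrdtool", "create", "example.rrd", "--step", str(STEP),
               "DS:ds0:COUNTER:%d:0:U" % HEARTBEAT]  # placeholder data source
              + rra_args())
    print(" ".join(create))

Running it prints three RRAs per consolidation function: 1344 rows at 1 step (15 min x 2 weeks), 420 rows at 8 steps (2 hours x 5 weeks), and 2190 rows at 16 steps (4 hours x 12 months).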