We are having problems with NFS performance, yes.  We are also having problems getting the polling nodes to be better utilized.

Nothing is working very hard, but throughput (measured in devices that complete polling per second) is extremely low.

We are going to try some of the RRD filesystem tuning recommended in the article - that is an EXCELLENT read.  Giving the NFS server more RAM and tuning the cache should help - unless its use as an NFS server obviates that improvement.

What's the largest Observium deployment that you know of?  6,000 devices and 300,000 ports seems MASSIVE!


On Sun, Jul 28, 2013 at 11:29 PM, Peter Childs <pchilds@staff.iinet.net.au> wrote:

So given what you have done so far, what issues are you currently experiencing?   I would assume your IO/NFS has just gone to shit?

I have had a few random thoughts about horizontal scaling, but I haven't gone any further than thinking about it…

Some thoughts (not in any particular order):
  1. Distributed pollers/RRD storage – someone wrote an interesting patch at one stage that would allow the decoupling of the RRD location from the aggregation/presentation layer (ie your WEB-UI nodes could pull in data from all your cluster nodes) -- https://lists.oetiker.ch/pipermail/rrd-developers/2008-May/002203.html
  2. Most of the RRD access stuff seems pretty well encapsulated - it may be possible to replace this with a backend on something like OpenTSDB and a front end in JavaScript (flot?)...
  3. There is an interesting article which suggests the actual limits of a single-node solution could potentially be dealt with by some uber tuning -- http://code.google.com/p/epicnms/wiki/Scaling
  4. Quasi-similar to (1), have a look at using a distributed filesystem (gluster?)
  5. Cache in code, potentially in a layer so that single-node instances just use an in-PHP cache and multi-node instances hit something like redis - for things you don't really want to persist in your database but only want to keep around for a short lifetime (rough sketch below)
Just some random thoughts
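
For (5), something like the sketch below is roughly what I had in mind - written in Python purely for brevity (the real thing would obviously live in the PHP layer), and the class and key names are made up rather than anything that exists in Observium:

# Rough sketch of thought (5), in Python purely for brevity -- the real thing
# would live in the PHP layer, and the class/key names here are made up.
import time

import redis  # pip install redis; only needed for the multi-node backend


class LocalCache(object):
    """Single-node case: a process-local cache with per-key expiry."""
    def __init__(self):
        self._store = {}

    def get(self, key):
        value, expires = self._store.get(key, (None, 0))
        return value if expires > time.time() else None

    def set(self, key, value, ttl=60):
        self._store[key] = (value, time.time() + ttl)


class RedisCache(object):
    """Multi-node case: the same interface backed by a shared redis instance."""
    def __init__(self, host='127.0.0.1', port=6379):
        self._conn = redis.StrictRedis(host=host, port=port)

    def get(self, key):
        return self._conn.get(key)

    def set(self, key, value, ttl=60):
        self._conn.setex(key, ttl, value)


# Single-node example; swapping in RedisCache() needs no caller changes.
cache = LocalCache()
cache.set('ports:total', 300000, ttl=300)
print(cache.get('ports:total'))

The point being that callers only ever see get()/set(), and whether that is backed by a process-local array or a shared redis instance becomes a per-install config decision.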


From: Joe Hoh <jhoh@costco.com>
Reply-To: Observium <observium@observium.org>
Date: Monday, 29 July 2013 12:00 PM
To: Observium <observium@observium.org>
Subject: [Observium] Very Large Environment

Thanks in advance for anyone who can help.

We are deploying Observium in a very large environment.  We are at upwards of 6,000 devices and 300,000 ports.

I don't think we are going to be able to poll this in 5 minutes.  This is what we have done so far:
  • Distributed polling
    • Polling - 8 VMs - 4core, 8GB RAM each
    • MySQL - 4 core, 8GB RAM
    • Web Interface - 4 core, 8GB RAM
    • NFS Server (for a centralized /opt/observium - except for the MIBs directory, which is copied to each server) - moving to an EMC VNX in 2 weeks.
  • rrdcached being implemented (any assistance here is helpful)
  • Modified poller-wrapper.py to distribute the devices that poll within 120s across multiple instances of poller-wrapper.py running on multiple hosts.  Devices that take more than 120s to poll are polled on separate servers at 15-minute intervals.
  • poller-wrapper has been modified to allow for multiple instances, just like poller.php with MODULUS distribution (a rough sketch of the selection logic follows this list)
    • Each instance of poller-wrapper.py gets an instance number and the total number of instances.
    • Each device with a last poll time < 120 seconds is assigned by taking device_id MOD total_instances and comparing the result to the instance number: device_id MOD total_instances = this_instance
    • Tuning threads in each poller-wrapper.py - currently at 16 threads and 2 instances on each 4-vCPU server, for 32 threads running at once, or 8 threads per core
  • The DC is on the west coast, and that presents latency problems.  We may need to address this with distributed polling.
  • We are at 1024MB in php.ini
  • We are using xcache (tuning help is appreciated - or should we just turn it off?)
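
For reference, the selection logic mentioned above is roughly as follows.  This is just an illustration of the MODULUS scheme rather than our actual patch, and the example numbers are made up:

# Illustrative sketch of the instance-selection logic, not our actual patch.
# Each poller-wrapper.py instance is handed its instance number and the total
# instance count, and keeps only the devices whose id maps to it under
# MODULUS distribution; slow devices are left to the 15-minute pollers.

FAST_POLL_CUTOFF = 120  # seconds


def devices_for_instance(devices, instance, total_instances):
    """devices is a list of (device_id, last_poll_seconds) tuples."""
    return [device_id for device_id, poll_time in devices
            if poll_time <= FAST_POLL_CUTOFF
            and device_id % total_instances == instance]


# Example with 2 instances; in our setup total_instances is 16 (8 VMs x 2).
example = [(101, 45.2), (102, 300.0), (103, 80.1), (104, 12.9)]
print(devices_for_instance(example, 1, 2))   # -> [101, 103]
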
My questions are:
  • How can I change the default RRD behavior to use 15-minute intervals instead of 5-minute intervals?
    • 15 minutes x 2 weeks
    • 2 hours x 5 weeks
    • 4 hours x 12 months
    • We want to keep the max/min/average/87.5th percentile (since there are only 8 measurements per 2 hours)
    • I don't see the configuration items for that.  (A sketch of the intended retention at the rrdtool level follows the questions below.)
  • Would we be better off with a few big boxes rather than small VMs?
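
To illustrate what we are after, here is roughly what that retention would look like at the raw rrdtool level with a 900-second step.  The filename and DS are placeholders rather than Observium's own definitions, and note that rrdtool RRAs only support AVERAGE/MIN/MAX/LAST consolidation, so the 87.5th percentile would have to be computed at read time:

# Sketch only: the retention we want, expressed as raw rrdtool RRAs with a
# 900-second step.  Filename and DS are placeholders, not Observium's own
# definitions (those live in its PHP code).
import subprocess

STEP = 900  # 15-minute base interval

rras = []
for cf in ('AVERAGE', 'MIN', 'MAX'):
    # rrdtool has no percentile consolidation function, so the 87.5th
    # percentile would have to be derived at graph/read time instead.
    rras += [
        'RRA:%s:0.5:1:1344' % cf,    # 15 min x 2 weeks   (14 d x 96 rows/day)
        'RRA:%s:0.5:8:420' % cf,     # 2 h    x 5 weeks   (35 d x 12 rows/day)
        'RRA:%s:0.5:16:2190' % cf,   # 4 h    x 12 months (365 d x 6 rows/day)
    ]

subprocess.check_call(
    ['rrdtool', 'create', 'example-port.rrd', '--step', str(STEP),
     'DS:INOCTETS:DERIVE:%d:0:U' % (STEP * 2)] + rras)

If this is workable, I assume the poller cron interval would also need to move to 15 minutes, and existing RRD files would have to be recreated, since the step of an existing RRD cannot simply be changed in place.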




_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium