On 2013-07-29 03:30, Joe Hoh wrote:
Thanks in advance to anyone who can help.
We are deploying Observium in a very large environment. We are upwards of 6,000 devices and 300,000 ports.
You seem to be the largest live installation that I know about.
I don't think we are going to be able to poll this in 5 minutes. This is what we have done so far:
Distributed polling
- Polling - 8 VMs - 4 core, 8GB RAM each
I would not do this in VMs. You're losing a few percent of CPU performance and making your RAM much less efficient. I assume all of these VMs are on two or more hosts running nothing else? Why not run the pollers directly on the hosts themselves?
VMware seems to have brainwashed the entire planet into thinking things get faster inside VMs, or that they make more efficient use of resources. Very odd!
I'd be trying quite hard to size a single poller server to accommodate the entire platform, both to remove complexity and to reduce the likelihood that at some point we'll change something which collapses your house of cards of custom modifications.
The poller needs aggregate throughput rather than single-core speed, so you would want to look for dual or quad socket 6/8/12-core systems.
- MySQL - 4 core, 8GB RAM
MySQL is the easiest place to make performance gains by offloading it to another server; removing I/O and CPU contention makes a lot of difference.
- Web Interface - 4 core, 8GB RAM
Normally I'd put the webui on the same device as the pollers, but in your case you're going to have a couple of really slow-to-render pages which will benefit from a small number of very fast cores. You want the fastest single-threaded CPU you can get in here; a high-clock i7 would do nicely.
Probably just mounting it over NFS directly from the poller host would work.
- NFS Server (for centralized /opt/observium - except for the MIBS
directory, which is copied across each server) - moving to an EMC VNX in 2 weeks.
NFS adds a fair bit of overhead to the entire process. I would very much be trying to work out ways of fitting the whole storage subsystem directly on to the polling server. Lots of 2.5" SAS disks in the poller host might suffice, or some form of high-throughput, low-latency external storage medium.
I assume you already have some idea of how your existing NFS server copes with the load; do you think you could scale it up to 300k interfaces?
You could look at running off decent quality SSDs. I know of a few large installs which do this. Below I mention RRD structure options to minimise writes, which may be useful if you drop the whole install across a few very fast, large SSDs.
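To put very rough numbers on the spindle question (everything below is a guess: roughly one RRD file per port, a few physical I/Os per update once caches stop helping, and ~175 random IOPS from a single 10k SAS disk):

ports          = 300000
poll_interval  = 300        # seconds, the standard 5-minute cycle
ios_per_update = 4          # guess: header read + data write + consolidation
sas_iops       = 175        # rough figure for one 10k SAS spindle

updates_per_sec = ports / float(poll_interval)        # ~1000 RRD updates/s
iops_needed     = updates_per_sec * ios_per_update     # ~4000 random IOPS
spindles        = iops_needed / sas_iops               # ~23 disks, before RAID overhead

print("~%.0f updates/s, ~%.0f IOPS, ~%.0f spindles" %
      (updates_per_sec, iops_needed, spindles))

Even if my guesses are off by a factor of two in either direction, that's a lot of mechanical disk, which is why SSDs start to look attractive at your scale.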
- rrdcached being implemented (any assistance here is helpful)
We found that RRDcached didn't add very much. I'm also not sure how safe it is to use in a multi-poller environment.
If you can fit them all on to a single host, you might gain a bit of i/o throughput by using it, though.
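If you do end up testing it, one way to see whether it's actually doing anything useful is to ask the daemon for its statistics over its control socket. This is generic rrdcached behaviour, nothing Observium-specific, and the socket path below is just an example - use whatever you pass to -l:

import socket

SOCK_PATH = "/var/run/rrdcached.sock"   # example path, match your -l option

s = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
s.connect(SOCK_PATH)
s.sendall(b"STATS\n")

f = s.makefile()
header = f.readline()                    # e.g. "9 Statistics follow"
count = int(header.split()[0])
stats = dict(f.readline().strip().split(": ", 1) for _ in range(count))
s.sendall(b"QUIT\n")
s.close()

# If QueueLength sits near zero and DataSetsWritten tracks UpdatesReceived
# one-for-one, the daemon isn't coalescing writes and you're gaining little.
for key in ("QueueLength", "UpdatesReceived", "DataSetsWritten"):
    print("%s: %s" % (key, stats.get(key, "?")))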
- Modified poller-wrapper.py to distribute the devices that poll
within 120s across multiple instances of poller-wrapper.py running on multiple hosts. Devices that poll in more than 120s are polled on separate servers at 15 minute intervals.
Why? I'm assuming it's related to the order that devices are polled in. We should remove the ordering by poller time and poll in device-added order; this would provide a more even load.
- poller-wrapper has been modified to allow for multiple instances
just like poller.php with MODULUS distribution
If you can build a single server large enough, you'd run a single poller-wrapper with 128 instances :)
- Each instance of poller-wrapper.py gets an instance number and the
total number of instances.
- All of the devices with a last poll time < 120 seconds are selected by taking the device_id MOD the total number of instances and comparing the result to the instance number: device_id MOD total_instances = this_instance
- Tuning threads in each poller-wrapper.py - currently at 16 threads
and 2 instances on each 4 vCPU server for 32 threads running at once or 8 threads per core
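For what it's worth, here's roughly how I read your selection logic - a minimal sketch only, not your actual poller-wrapper.py changes, and the column names are from memory so they may not match the schema exactly:

import sys
import MySQLdb   # the same driver poller-wrapper.py uses

this_instance   = int(sys.argv[1])    # 0 .. total_instances - 1
total_instances = int(sys.argv[2])

db = MySQLdb.connect(host="localhost", user="observium",
                     passwd="secret", db="observium")
cursor = db.cursor()

# "Fast" devices only: last completed poll took under 120 seconds.
cursor.execute("""SELECT device_id FROM devices
                  WHERE disabled = 0 AND last_polled_timetaken < 120""")

my_devices = [device_id for (device_id,) in cursor.fetchall()
              if device_id % total_instances == this_instance]

print("instance %d of %d will poll %d devices"
      % (this_instance, total_instances, len(my_devices)))

The obvious weakness of a plain modulus is that it knows nothing about how expensive each device is to poll, which is presumably why you had to split the slow devices out onto separate servers.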
- The DC is on the west coast and that presents latency problems. We
may need to address with distributed polling
The issues we find with long-distance polling tend to come from network stability rather than latency. Some devices can be so slow to respond that we end up overlapping polling of a single host, though (think a fully-loaded 6500 that's 300ms away).
We do have some ideas about how to help solve the UDP-over-huge-distance problem involving HTTP-based proxying of requests, but that's quite a big job to rewrite our code to handle, so isn't something we're likely to get done in the near future.
- We are at 1024MB in php.ini
Seems a little excessive to need this much RAM for a PHP process. You may have reached the point at which our in-PHP sorting system becomes unusable. This stuff might need to be rewritten for your size of install.
- We are using xcache (tuning help is appreciated - or should we just
turn it off)?
Oh god don't turn it off! Your web interface would become unusable! :)
My questions are:
- How can I change the default RRD behavior to use 15 minute intervals instead of 5 minute intervals?
- 15 minutes x 2 weeks
- 2 hours x 5 weeks
- 4 hours x 12 months
- We want to keep the max/min/average/87.5th percentile (since only 8
measurements per 2 hours)
At the moment this isn't possible as the poller frequency is hard-coded in places around the code, but we could perhaps change that in the future.
I'm not sure what you mean by keeping a percentile, but in any case this isn't at all possible due to RRD's limitations.
It should be noted here that if you're storing your data on a rotational medium, where space isn't an issue, you might be better off aggregating as little as possible. A large amount of RRD's i/o load comes from aggregating high-resolution data into low-resolution data.
If you can afford the disk space, it might help you to store, say, 6 months of 5 minute data and then aggregate to 1 year of 2 hour data. This means that you're only generating a single aggregated data point every 2 hours, if you see what I mean?
Our RRDs are sized at the moment for my preference to run observium out of a RAM disk. You have long since passed the point where this is viable and are going to have to use a *lot* of spindles to get enough IOPS capacity, so perhaps removing the aggregation would work for you.
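To make that concrete, here's a rough sketch of what the "6 months at 5 minutes, 1 year at 2 hours" layout looks like in plain rrdtool terms. This is illustrative only - it is not how Observium actually builds its RRDs, and the DS definitions are made up:

import subprocess

STEP      = 300               # 5-minute polling interval
ROWS_5MIN = 288 * 183         # ~6 months of full-resolution rows
ROWS_2H   = 12 * 365          # 1 year of 2-hour consolidated rows

# RRD only offers AVERAGE/MIN/MAX/LAST consolidation functions,
# which is why storing a percentile isn't possible.
rras = []
for cf in ("AVERAGE", "MIN", "MAX"):
    rras.append("RRA:%s:0.5:1:%d" % (cf, ROWS_5MIN))    # 1 step   = 5 minutes
    rras.append("RRA:%s:0.5:24:%d" % (cf, ROWS_2H))     # 24 steps = 2 hours

subprocess.check_call(
    ["rrdtool", "create", "example-port.rrd", "--step", str(STEP),
     "DS:INOCTETS:DERIVE:600:0:U", "DS:OUTOCTETS:DERIVE:600:0:U"]
    + rras)

The second RRA in each pair is the only place aggregation work happens, and it only produces one new row every 2 hours.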
- I don't see the configuration items for that.
There aren't any yet :)
- Would we be better with a few big boxes rather than small VMs?
I think you'd be better off trying to size a single poller with a *lot* of cores and fast I/O, and supplementing this with a fast external MySQL server and a very high clock-speed webui server.
As with all free software projects, we listen to suggestions and requests, but at the end of the day, all that ever really gets implemented is what the individual development team members want for their own instances. Most of our installs are probably around the 10k ports mark, so we rarely work on things that would help the platform to scale to your size of installation.
We have special arrangements with a few large organisations where they sponsor development to add features and make changes specifically for their requirements; this might be useful for you too. :)
adam.