On Fri, 11 Nov 2011 12:22:42 -0500, Berant Lemmenes berant@lemmenes.com wrote:
Hello everyone,
I just wanted to introduce myself to the list and compliment the authors on a fantastic tool!
As a little background on our deployment: I work for a Midwestern US ISP/NSP, and I found Observium while looking for a replacement for our severely aging 95th percentile burstable billing system. However, I was blown away by Observium once I got it running. We're now looking at replacing several systems with it, focusing on interface/data polling; we have a separate system that we will continue to maintain for SNMP trap handling.
Right now we have just shy of 16k interfaces across 30 nodes, with another dozen or two to add to complete our Cisco L3/L2 devices. I'm interested in adding new device types for our Cisco 15454 SONET and MSTP systems (and have read the Developing/NewOS document). However, if these devices were to be added it would take our device count north of 600, with a massive increase in interfaces, so I want to make sure I've got things set up well before going down that road.
While performance is great thus far, I'm wondering what I can do to scale the system. Currently the system is a Xen VM with 4 cores and 4GB of RAM, and the load average is staying right around 4 with 40% average CPU usage.
Firstly, don't run it in a VM. Observium scales primarily on I/O throughput. If you have /very/ fast disks you might get away with something that size in a VM.
I tend to put large deployments into a ramdisk that is synced periodically to a physical disk (as a .tar.Z, to reduce write time).
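As a rough illustration of that periodic sync, here is a minimal Python sketch that snapshots a ramdisk directory into a compressed tarball on a physical disk. It is not how Observium or any particular deployment does it: the function name `snapshot_rrds` and both paths are hypothetical, and it uses gzip (via the standard `tarfile` module) rather than the `.tar.Z` compress format mentioned above, since `tarfile` cannot write `.Z` archives.

```python
import os
import tarfile
import tempfile
from pathlib import Path


def snapshot_rrds(ramdisk_dir, backup_path):
    """Archive ramdisk_dir into a gzip-compressed tarball at backup_path.

    The archive is written to a temporary file next to backup_path and then
    renamed into place, so a crash mid-write never leaves a truncated backup
    where the previous good one used to be.
    """
    ramdisk_dir = Path(ramdisk_dir)
    backup_path = Path(backup_path)

    # Temp file lives on the same filesystem so os.replace() is atomic.
    fd, tmp_name = tempfile.mkstemp(dir=str(backup_path.parent), suffix=".tmp")
    os.close(fd)
    try:
        with tarfile.open(tmp_name, "w:gz") as tar:
            # Store entries relative to the directory name, not the full path.
            tar.add(str(ramdisk_dir), arcname=ramdisk_dir.name)
        os.replace(tmp_name, str(backup_path))
    except BaseException:
        # Clean up the partial archive before re-raising.
        if os.path.exists(tmp_name):
            os.remove(tmp_name)
        raise
    return backup_path
```

Something like this would be run from cron at whatever interval of data loss is acceptable, e.g. `snapshot_rrds("/mnt/ramdisk/rrd", "/var/backups/rrd.tar.gz")` (both paths are placeholders). The matching restore step at boot, unpacking the tarball back into the ramdisk, is not shown.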
Since we're not interested in alerting for interface/node up/down with Observium, I've configured each device to ignore and disable alerting. I've also taken out the various menu items and poller modules for things that we don't need. I also need to look into interface names that can be ignored, etc.
Please, please, please do *not* make any changes that won't be committed back into the SVN. Observium is designed to be updated frequently from SVN and has database and other update scripts to make this work. Code changes will break this mechanism. They'll also mean you never update...
I've not yet tried rrdcached but I'd like to see what impact that has on existing load.
In practice it makes relatively little difference. Far more impact can be had from increasing disk I/O performance or moving the RRDs into a large ramdisk (I've even had instances where the ramdisk has been on another host, because that was where the RAM was available!).
Does anyone else have any recommendations or additional best practices?
Splitting RRD/MySQL across two different disks can help. I've not scaled much past 10k interfaces on any deployments I've done. But really it's just a function of how quickly you can write RRDs (and run MySQL queries whilst the disk isn't being eaten by RRD).
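To put a rough number on "how quickly you can write RRDs", here is a back-of-envelope sketch in Python. The function name is hypothetical, and the defaults are assumptions rather than measured values: one RRD file per interface and a 300-second polling interval. Real deployments may write more than one RRD per interface, so treat the result as a lower bound.

```python
def rrd_updates_per_second(interfaces, rrds_per_interface=1, poll_interval_s=300):
    """Average RRD file updates per second the disk must sustain.

    Assumes every RRD is updated exactly once per polling interval and that
    updates are spread evenly across the interval (in practice they arrive
    in bursts, so peak load is higher than this average).
    """
    if poll_interval_s <= 0:
        raise ValueError("poll_interval_s must be positive")
    return interfaces * rrds_per_interface / poll_interval_s


# About 16,000 interfaces at the assumed defaults: roughly 53 updates/s.
print(round(rrd_updates_per_second(16000), 1))
```

Each update is a small random write, so the figure to compare against is the random-write IOPS of the disk holding the RRDs, not its sequential throughput.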
I have some thoughts and ideas on potential features that could be useful for other ISP users as well; however, I don't want to flood the list right off the bat with an even more rambling email.
One of the major things we're missing is good ideas on how to present information. We're especially interested in ideas from other SPs.
adam.