On 10/10/2013 11:36 PM, F.Reenders@utwente.nl wrote:
I've got 2 x Intel(R) Xeon(R) CPU E5-2650 0 @ 2.00GHz, 64 GB RAM and 2 x RAID 1 SAS arrays (4 disks in total, 10k RPM).
It is a HP DL380G8.
Load is now 200. :) but that's because the 5 minutes is not enough.

I will try the noatime also.

I'm running on a pretty similar system; a Dell R720 with the same CPUs & RAM, and 2 x 7 15K SAS in RAID 10.  We also V2P-ed our server when we upgraded to this machine, and I was pretty underwhelmed at the performance improvement considering it's a bit overkill for our environment (details below).

Then I started digging into the poller stats and found that some of my remote Linux servers (which run the PPPoE for the branch's ADSL connection) were taking around 200-300 seconds per poll.  When I ran the poller in debug mode I found that the interfaces poll was taking a huge proportion of the poller's run time, even though those servers only have 3 NICs, plus ppp0 for ADSL and tun0 for OpenVPN.  The cause: the kernel gives both ppp0 and tun0 a new interface index every time the connection goes down & up again, so net-snmp was reporting each reconnect as a new interface (duly noting a warning in syslog).  Over time we were ending up with hundreds of interfaces per server, and net-snmp seems to be particularly inefficient at reporting them (or perhaps Observium is trying to poll too much data from non-existent ports?).

Regardless, I rolled out a script with puppet to restart net-snmp every time ppp0 or tun0 comes up.  Now we have poll times for all those hosts < 20 secs and the load on our server doesn't go over about 1.5, even during polls with 32 concurrent pollers.
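
For anyone wanting to try the same workaround, here is a minimal sketch of what such a hook could look like. It is not the script actually deployed here, only an illustration under some assumptions: the SNMP agent is managed as a service named "snmpd", the hook is invoked with the interface name as its first argument (as pppd's ip-up and OpenVPN's up scripts normally do), and the names WATCHED_INTERFACES, should_restart and handle_interface_up are made up for this example.

```python
#!/usr/bin/env python3
"""Restart snmpd when a dynamically numbered interface comes up (sketch)."""
import subprocess
import sys

# Interfaces that get a new interface index on every reconnect.
WATCHED_INTERFACES = {"ppp0", "tun0"}

# Assumption: the net-snmp agent runs as a service called "snmpd".
RESTART_COMMAND = ["service", "snmpd", "restart"]


def should_restart(ifname):
    """Return True if the interface is one of the watched ones."""
    return ifname in WATCHED_INTERFACES


def handle_interface_up(ifname, runner=subprocess.call):
    """Restart snmpd if ifname is watched; return True when a restart ran.

    runner is injectable so the logic can be exercised without root.
    """
    if not should_restart(ifname):
        return False
    runner(RESTART_COMMAND)
    return True


if __name__ == "__main__":
    # Assumption: the calling hook passes the interface name as argv[1].
    if len(sys.argv) > 1:
        handle_interface_up(sys.argv[1])
```

Restarting the agent makes it forget the stale interface entries, which is the whole point of the workaround; the script itself would still need to be wired into whatever hook mechanism your distribution uses for ppp and OpenVPN.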

Regards,
Paul

P.S. Device/port counts from our Observium installation:


          Total    Up  Down  Ignored  Disabled
Devices     187   165     3        5        14
Ports      2614  1044    12     1275       173 (shutdown)