Re: [Observium] Many false positives

I'm following back up to the list, in hopes that others will find this and it will help them.
Part of figuring this out was tracking utilization on the server (the host Observium runs on): disk I/O, CPU, network I/O, swap, etc.
In short, we had to upgrade our server to handle the higher CPU load.
We had been running Cacti on the same hardware platform with about 3x the devices, polling every minute, with no issues. With Cacti we also used the C-based poller (spine).
Observium requires more CPU.
We also increased our SNMP timeout value in the config file. This helped mitigate the issue, but at the end of the day, more CPU is what fixed it.
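As a rough illustration only (the exact setting is not given in this thread), an SNMP timeout change in the config file might look like the fragment below. The key name and the value are assumptions; check the default config shipped with your Observium version before copying anything.

```php
// Hypothetical config excerpt: key name and value are assumed, not taken from this thread.
$config['snmp']['timeout'] = 3; // assumed: SNMP timeout in seconds
```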
Now the SNR (signal-to-noise ratio) has improved, and the alerts and data from Observium are considered to reflect the devices' actual status.
Once we stabilize and go to production, I'll post our hardware / OS configuration along with other relevant data for others to reference.
Cheers, and thanks to those who provided usable feedback on-list and off-list.
On Tue, Dec 23, 2014 at 6:06 AM, John Brown john@citylinkfiber.com wrote:
What command-line options do you send to fping? How do you call fping from within the code? And what are you parsing from fping's output to determine up or down?
Thanks
On Tue, Dec 23, 2014 at 12:10 AM, Adam Armstrong adama@memetic.org wrote:
In my previous testing (which was on a MikroTik, though I didn't find out whether it was the end device or an intermediate device), the issue occurred because the device stopped replying to pings for 10 seconds every few minutes.
This will produce frequent false alerts; even 1-second timeouts and multiple retries will not mask it.
Personally, I think we're quite justified in marking a device down if it doesn't reply to ping after 5 seconds :)
adam.
------ Original Message ------
From: "John Brown" john@citylinkfiber.com
To: "Observium Network Observation System" observium@observium.org
Sent: 12/22/2014 5:12:18 PM
Subject: Re: [Observium] Many false positives
Thanks.
I did crank the timeout up to 1000 ms and the retries up to 5 prior to my original post. That didn't seem to help much.
I now have the debug value set and will watch the debug log for any hints.
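For anyone following along, the settings described in this message would look roughly like the fragment below. The key names are the ones quoted further down this thread; placing the overrides in config.php is an assumption.

```php
// Sketch of the ping overrides described in this message (config.php location assumed).
$config['ping']['retries'] = 5;    // raised from the default of 3
$config['ping']['timeout'] = 1000; // milliseconds, raised from the default of 500
$config['ping']['debug']   = TRUE; // store ping errors in logs/debug.log
```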
Thanks
On Mon, Dec 22, 2014 at 4:04 PM, Tom Laermans <tom.laermans@powersource.cx> wrote:
There is no time frame to report; it didn't return a ping (in time) at the exact moment the message is logged.
These are the default ping config settings:
#$config['ping']['retries'] = 3; // How many times to retry ping (1 - 10)
#$config['ping']['timeout'] = 500; // Timeout in milliseconds (50 - 2000)
So, 3 pings that were either missed or took longer than 500 ms, it would seem.
There are currently no device-specific ping settings; this is on my to-do list somewhere.
You can try to enable this:
$config['ping']['debug'] = TRUE; // If TRUE store ping errors into logs/debug.log file
and check the debug log.
Tom
On 22/12/2014 23:55, John Brown wrote:
Yes, I see the "Device Status changed to Down (PING)" in the log.
The problem I have with this is that it doesn't provide any more detailed information: how many pings failed, the time frame, etc.
I am running tcpdump on a monitor/SPAN port that the ONOS host is connected to, and I see ICMP packets going out to devices and their reply packets coming back.
Over a 15-minute period a host will be reported as DOWN, yet the ICMP packet flow shows echo_request / echo_reply pairs without undue delay.
Other machines on the same LAN subnet as the ONOS host also show no dropped ICMP packets.
Hence my question about additional debugging tools within ONOS.
Thanks
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium