Re: [Observium] Many false positives

2 Jan 2015


I'm following up on the list in the hope that others will find this and
that it will help them.
Part of figuring this out was tracking utilization on the server (the host
Observium is running on): disk I/O, CPU, network I/O, swap, etc.
In general, we had to upgrade our server to handle a higher CPU load.
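A minimal sketch of the kind of check this tracking boils down to, assuming a Unix-like poller host with Python 3. It compares the 1-minute load average with the number of CPU cores. The function name `cpu_saturated` and the threshold of 1.0 load per core are illustrative assumptions, not anything Observium provides.

```python
import os


def cpu_saturated(load_1min: float, cpu_count: int, threshold: float = 1.0) -> bool:
    """Return True when the 1-minute load per core exceeds the threshold.

    The default of 1.0 runnable process per core is an assumed rule of
    thumb, not an Observium setting.
    """
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load_1min / cpu_count > threshold


if __name__ == "__main__":
    # os.getloadavg() is only available on Unix-like systems.
    load_1min, _, _ = os.getloadavg()
    cores = os.cpu_count() or 1
    state = cpu_saturated(load_1min, cores)
    print(f"load={load_1min:.2f} cores={cores} saturated={state}")
```

Run from cron alongside the poller, a check like this would show whether the load stays above the core count while polling is in progress.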
We had been running Cacti on the same hardware platform with about 3x the
devices, polling every minute, with no issues.  When running Cacti we also
used the C-based poller (spine).
Observium requires more CPU.
We also increased our SNMP timeout value in the config file.  This helped
mitigate the issue, but at the end of the day, more CPU is what fixed it.
Now the SNR (signal-to-noise ratio) has improved, and alerts and data from
Observium are considered reflective of the devices' status.
Once we stabilize and go to production, I'll post our hardware / OS
configuration along with other relevant data for others to reference.
Cheers, and thanks to those who provided usable feedback on-list and
off-list.
On Tue, Dec 23, 2014 at 6:06 AM, John Brown john@citylinkfiber.com wrote:
...
What command-line options do you pass to fping?
How do you call fping from within the code?
And what do you parse in fping's output to determine up or down?
Thanks
On Tue, Dec 23, 2014 at 12:10 AM, Adam Armstrong adama@memetic.org
wrote:
...
In my previous testing (which was on a MikroTik, though I didn't find out
whether it was the end device or an intermediate device), the issue occurred
because the device stopped replying to pings for 10 seconds every few
minutes.
This will produce frequent false alerts that even 1-second timeouts and
multiple retries will not mask.
Personally, I think we're quite justified in marking a device down if it
doesn't reply to ping after 5 seconds :)
adam.
------ Original Message ------
From: "John Brown" john@citylinkfiber.com
To: "Observium Network Observation System" observium@observium.org
Sent: 12/22/2014 5:12:18 PM
Subject: Re: [Observium] Many false positives
Thanks.
I did crank the timeout and retries up to 1000ms and 5 retries, prior to
my original post.
That didn't seem to help much.
I now have the debug value set and will watch the debug log for any hints.
Thanks
On Mon, Dec 22, 2014 at 4:04 PM, Tom Laermans <tom.laermans@powersource.cx> wrote:
...
There is no time frame to report; it didn't return a ping (in time) at
the exact moment the message is logged.
These are the default ping config settings:
#$config['ping']['retries'] = 3;    // How many times to retry ping (1 - 10)
#$config['ping']['timeout'] = 500;  // Timeout in milliseconds (50 - 2000)
So, 3 pings that were missed or took longer than 500ms, it would seem.
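As a rough worked example of that arithmetic, here is a sketch under a simplifying assumption: each of the `retries` attempts waits a flat `timeout` with no backoff between attempts. How Observium actually invokes fping is not stated in this thread, so the real window may be longer. The function names are hypothetical.

```python
def ping_window_ms(retries: int, timeout_ms: int) -> int:
    """Lower bound on how long a device must stay silent to be marked down.

    Assumes `retries` attempts, each waiting a flat `timeout_ms`; any backoff
    between attempts would make the real window longer.
    """
    return retries * timeout_ms


def outage_can_trip_alert(outage_ms: int, retries: int, timeout_ms: int) -> bool:
    """True when an outage is long enough to swallow every attempt.

    The alert only fires if a ping check happens to start during the outage.
    """
    return outage_ms >= ping_window_ms(retries, timeout_ms)


print(ping_window_ms(3, 500))                   # default settings: 1500
print(ping_window_ms(5, 1000))                  # 1000ms and 5 retries: 5000
print(outage_can_trip_alert(10_000, 5, 1000))   # 10-second silence: True
```

Under this assumption, a device that goes silent for 10 seconds can still be marked down with the timeout raised to 1000ms and 5 retries, because the whole window is only 5 seconds.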
There are currently no device-specific ping settings; this is on my
to-do list somewhere.
You can try to enable this:
$config['ping']['debug']        = TRUE;    // If TRUE store ping errors into logs/debug.log file
and check the debug log.
Tom
On 22/12/2014 23:55, John Brown wrote:
Yes, I see the "Device Status changed to Down (PING)" in the log.
The problem I have with this is that it doesn't provide any more detailed
information:
how many pings failed, over what time frame, etc.
I am running tcpdump on a monitor/span port that the ONOS host is connected
to, and I see ICMP packets going out to devices and I see their reply
packets come back.
Over a 15-minute period a host will be reported as DOWN, yet the
ICMP packet flow shows echo_request / echo_reply pairs without undue delay.
Other machines on the same LAN subnet as the ONOS host also show no
dropped ICMP packets.
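One way to make that capture check repeatable is sketched below. It assumes the capture has already been parsed into simple records; the `(timestamp, kind, id, seq)` tuple layout and the function name `slow_or_missing` are hypothetical, and this is not tcpdump's output format. The 0.5 second default mirrors the 500ms ping timeout quoted elsewhere in this thread.

```python
from typing import Dict, Iterable, List, Tuple

# Assumed record layout: (timestamp_seconds, kind, icmp_id, icmp_seq),
# where kind is "request" or "reply".
Packet = Tuple[float, str, int, int]
Key = Tuple[int, int]


def slow_or_missing(packets: Iterable[Packet], max_rtt: float = 0.5) -> List[Key]:
    """Return (id, seq) keys whose reply is missing or slower than max_rtt."""
    sent: Dict[Key, float] = {}
    rtt: Dict[Key, float] = {}
    for ts, kind, icmp_id, seq in packets:
        key = (icmp_id, seq)
        if kind == "request":
            sent[key] = ts
        elif kind == "reply" and key in sent and key not in rtt:
            # Only the first reply to a request counts toward its round trip.
            rtt[key] = ts - sent[key]
    return [key for key in sent if key not in rtt or rtt[key] > max_rtt]


capture = [
    (0.00, "request", 1, 1), (0.01, "reply", 1, 1),   # healthy pair
    (1.00, "request", 1, 2), (1.90, "reply", 1, 2),   # reply too slow
    (2.00, "request", 1, 3),                          # no reply at all
]
print(slow_or_missing(capture))
```

An empty result over the period in which a host was reported DOWN would support the observation that the packets themselves are not being dropped or delayed on the wire.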
That is why I'm asking about additional debugging tools within ONOS.
Thanks

observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium
