Re: [Observium] Many false positives

24 Dec 2014

I've seen similar issues (not necessarily via Observium) with overloaded 
MikroTik routerboards - especially on the low-end/older boards.

Given RouterOS's powerful feature set vs the relatively low CPU/memory 
available, it is very common to find a board sitting at 100% CPU for 
hours on end or running out of memory every few minutes/hours. The only 
saving grace, typically, is that the board either recovers periodically 
or that the watchdog timers force it to reboot.

Of course, as you (John) are using Observium, you should be able to tell 
if the MikroTik is being overloaded or not. ;)

One alternative to attempt to confirm the behaviour (assuming you're not 
catching it yourself) is to monitor the problematic hosts with a 
Smokeping installation. Smokeping does one thing and does it well - ping 
services and print pretty latency/packet-loss graphs. :)

On 2014/12/23 09:10, Adam Armstrong wrote:
...
In my previous testing (which was on a MikroTik, but I didn't find out 
if it was the end device or an intermediate device), the issue occurred 
because the device stopped replying to pings for 10 seconds every few 
minutes.
This will produce frequent false alerts, because even 1 second timeouts 
and multiple retries are not enough to ride out a gap that long.
Personally, I think we're quite justified in marking a device down if 
it doesn't reply to ping after 5 seconds :)
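To put rough numbers on that, here is a sketch of the settings John 
describes (1000 ms timeout, 5 retries), written in the same format as the 
defaults quoted further down. It assumes the keys are simply uncommented 
in config.php and that each attempt waits the full timeout with no 
backoff; neither assumption is confirmed in this thread.

```php
// Assumed config.php fragment; the values are the ones John reports trying.
$config['ping']['retries'] = 5;     // How many times to retry ping (1 - 10)
$config['ping']['timeout'] = 1000;  // Timeout in milliseconds (50 - 2000)
```

Under those assumptions the device is declared down after roughly 
5 x 1000 ms = 5 seconds without a reply, so a 10 second gap in replies 
would still trigger an alert.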
adam.
------ Original Message ------
From: "John Brown" <john@citylinkfiber.com>
To: "Observium Network Observation System" <observium@observium.org>
Sent: 12/22/2014 5:12:18 PM
Subject: Re: [Observium] Many false positives
...
Thanks.
I did crank the timeout and retries up to 1000ms and 5 retries, prior 
to my original post.
That didn't seem to help much.
I now have the debug value set and will watch the debug log for any 
hints.
Thanks
On Mon, Dec 22, 2014 at 4:04 PM, Tom Laermans 
<tom.laermans@powersource.cx> wrote:
There is no time frame to report; it didn't return a ping (in
time) at the exact moment the message is logged.

These are the default ping config settings:
#$config['ping']['retries'] = 3;    // How many times to retry ping (1 - 10)
#$config['ping']['timeout'] = 500;  // Timeout in milliseconds (50 - 2000)

So, 3 pings that are either missed or slower than 500ms, it would seem.

There are currently no device-specific ping settings; this is on
my to-do list somewhere.

You can try to enable this:
$config['ping']['debug']        = TRUE;    // If TRUE store ping errors into logs/debug.log file

and check the debug log.

Tom

On 22/12/2014 23:55, John Brown wrote:

...
Yes, I see the "Device Status changed to Down (PING)" in the log.

The problem I have with this is that it doesn't provide any more
detailed information:
how many pings failed, over what time frame, etc.

I am running tcpdump on a monitor/span port that the ONOS is
connected to and I see ICMP packets going out to devices and I
see their reply packets come back.

Over a 15 minute period of time a host will be reported as DOWN,
yet the ICMP packet flow shows echo_request / echo_reply pairs
without undue delay.

Other machines on the same LAN subnet as the ONOS host also show
no dropped ICMP packets.

Hence my question about additional debugging tools within ONOS.

Thanks

-- 
__________
Brendan Hide
http://swiftspirit.co.za/
http://www.webafrica.co.za/?AFF1E97