Many false positives

John Brown

22 Dec 2014

11:05 p.m.

I'm trying to troubleshoot the many false positives we are receiving from OB.

The system will report a host as down, yet our legacy Nagios and out-of-band Pingdom do not show the host as down.

It doesn't appear that OB records in the log what specifically is making OB think the host is down.

I've increased the SNMP timeout value to 3 seconds (which seems very long), and that has helped with some hosts, mostly MikroTiks.

But I doubt that our Juniper MX480s (which are lightly loaded) should need such a long time frame to respond.

How can I get OB to report what is the actual trigger for its "Host Down" alerts ??

Are there tweaks for performance monitoring / testing ??

Thank you in advance..

Tom Laermans

22 Dec

11:39 p.m.

Observium... Bonitoring(?) does tell you why it's down. It doesn't receive a reply either over ICMP echo or over SNMP; this is noted in the event log when the host goes down.

Tom

observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium

John Brown

11:55 p.m.

Yes, I see the "Device Status changed to Down (PING)" in the log.

The problem I have with this is that it doesn't provide any more detailed information: how many pings failed, over what time frame, etc.

I am running tcpdump on a monitor/span port that the ONOS is connected to, and I see ICMP packets going out to devices and I see their reply packets come back.

Over a 15-minute period a host will be reported as DOWN, yet the ICMP packet flow shows echo_request / echo_reply pairs without undue delay.

Other machines on the same LAN subnet as the ONOS host also show no dropped ICMP packets.

Hence why I'm asking about additional debugging tools within ONOS..

Thanks

Tom Laermans

23 Dec

12:04 a.m.

There is no time frame to report; it didn't return a ping (in time) at the exact moment the message is logged.

These are the default ping config settings:

#$config['ping']['retries'] = 3;    // How many times to retry ping (1 - 10)
#$config['ping']['timeout'] = 500;  // Timeout in milliseconds (50 - 2000)

So, it would seem a host is marked down after 3 pings that are either missed or take longer than 500 ms.
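As a rough illustration of what such settings mean in wall-clock time, the sketch below adds up how long one check could wait before giving up on a host. It is only a back-of-the-envelope model: the function name `probe_window_ms` is made up for this example, and the optional `backoff` factor is an assumption borrowed from fping's default mode (its `-B` option, default 1.5). Nothing in this thread confirms how Observium actually passes these values to fping, or whether the retry count includes the first probe.

```python
def probe_window_ms(timeout_ms: float, attempts: int, backoff: float = 1.0) -> float:
    """Total time spent waiting for replies before a host is given up on.

    timeout_ms: wait for the first probe, in milliseconds.
    attempts:   total number of probes sent (if retries come on top of the
                first probe, pass retries + 1 here).
    backoff:    factor applied to the wait on each successive probe
                (1.0 means every probe waits the same time; assumption).
    """
    return sum(timeout_ms * backoff ** i for i in range(attempts))
```

Under the reading above (3 pings at 500 ms each), `probe_window_ms(500, 3)` gives 1500 ms with no backoff and `probe_window_ms(500, 3, backoff=1.5)` gives 2375 ms. A device that stops answering for longer than that window during a check would be reported as down, even if it answers again a few seconds later.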

There are currently no device-specific ping settings; this is on my to-do list somewhere.

You can try to enable this:

$config['ping']['debug'] = TRUE;    // If TRUE store ping errors into logs/debug.log file

and check the debug log.

Tom

John Brown

12:12 a.m.

Thanks.

I did crank the timeout up to 1000 ms and the retries up to 5 prior to my original post. That didn't seem to help much.

I now have the debug value set and will watch the debug log for any hints.

Thanks

Adam Armstrong

23 Dec

In my previous testing (which was on a MikroTik, but I didn't find out if it was the end device or an intermediate device), the issue occurred because the device stopped replying to pings for 10 seconds every few minutes.

This will produce frequent false alerts even with 1-second timeouts and multiple retries.

Personally, I think we're quite justified in marking a device down if it doesn't reply to ping after 5 seconds :)

adam.

John Brown

2:06 p.m.

What command-line options do you send to fping? How do you call fping from within the code? And what are you parsing from fping's output to determine up or down?

thanks

Brendan Hide

24 Dec

2:58 p.m.

I've seen similar issues (not necessarily via Observium) with overloaded MikroTik routerboards - especially on the low-end/older boards.

Given RouterOS's powerful featureset vs the relatively low CPU/memory available, it is very common to find a board sitting at 100% for hours on end or running out of memory every few minutes/hours. The only saving grace, typically, is that the board either recovers periodically or that the watchdog timers force it to reboot.

Of course, as you (John) are using Observium, you should be able to tell if the MikroTik is being overloaded or not. ;)

One alternative to attempt to confirm the behaviour (assuming you're not catching it yourself) is to monitor the problematic hosts with a Smokeping installation. Smokeping does one thing and does it well: it pings services and prints pretty latency/packet-loss graphs. :)

-- __________ Brendan Hide http://swiftspirit.co.za/ http://www.webafrica.co.za/?AFF1E97

Adam Armstrong

23 Dec

8:06 a.m.

We run fping against the remote host. fping tells us "alive" or not.

We don't generate ICMP packets, we don't listen for replies. We run fping and it does all of that.

The last host I had with an issue like this, I just ran fping against it repeatedly until it failed, and it failed because pings weren't being returned. I never found out why, but after that it wasn't my problem anymore :)

Things that stop pings are usually control plane policing (when pinging large routers and switches) and firewalls.

adam.
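The "run fping repeatedly until it fails" test described above can be scripted. The following is a minimal sketch, not Observium's own code: it assumes fping is installed and on the PATH, relies on fping exiting with status 0 when the target answered, and uses fping's `-q` (quiet), `-t` (timeout in ms) and `-r` (retries) options. The helper names `fping_alive` and `watch`, and the idea of recording timestamps rather than stopping at the first failure, are choices made for this example.

```python
import subprocess
import time
from datetime import datetime
from typing import Callable, List


def fping_alive(host: str, timeout_ms: int = 500, retries: int = 3) -> bool:
    """Run fping once against host; True if fping reports it reachable."""
    result = subprocess.run(
        ["fping", "-q", "-t", str(timeout_ms), "-r", str(retries), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    # fping exits with 0 only when every listed target answered.
    return result.returncode == 0


def watch(
    host: str,
    check: Callable[[str], bool] = fping_alive,
    rounds: int = 100,
    pause_s: float = 1.0,
) -> List[str]:
    """Probe host `rounds` times and return the timestamps of failed checks."""
    failures: List[str] = []
    for _ in range(rounds):
        if not check(host):
            stamp = datetime.now().isoformat(timespec="seconds")
            failures.append(stamp)
            print(f"{stamp} {host} did not answer")
        time.sleep(pause_s)
    return failures

# Example (hypothetical address): watch("192.0.2.1", rounds=600, pause_s=1.0)
```

The recorded timestamps can then be lined up against the "Device Status changed to Down (PING)" entries in the event log, or against a packet capture, to see whether the monitoring host itself observes the missed replies.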
