Many false positives
Hi
I'm trying to troubleshoot the many false positives we are receiving from OB.
The system will report a host as down, yet our legacy Nagios and out-of-band Pingdom do not show the host as down.
It doesn't appear that OB records in the log what specifically is making OB think the host is down.
I've increased the SNMP timeout value to 3 seconds (which seems very long) and that has helped with some hosts, mostly MikroTiks.
But I doubt that our Juniper MX480s (which are lightly loaded) should need such a long time frame to respond.
How can I get OB to report what the actual trigger is for its "Host Down" alerts?
Are there tweaks for performance monitoring / testing?
Thank you in advance.
Observium... Bonitoring(?) does tell you why it's down. It doesn't receive a reply either over ICMP echo or over SNMP; this is noted in the event log when the host goes down.
Tom
On 22/12/2014 23:05, John Brown wrote:
> How can I get OB to report what is the actual trigger for its "Host Down" alerts?
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium
Yes, I see the "Device Status changed to Down (PING)" in the log.
The problem I have with this is that it doesn't provide any more detailed information: how many pings failed, over what time frame, etc.
I am running tcpdump on a monitor/span port that the ONOS host is connected to, and I see ICMP packets going out to devices and their reply packets coming back.
Over a 15-minute period a host will be reported as DOWN, yet the ICMP packet flow shows echo_request / echo_reply pairs without undue delay.
Other machines on the same LAN subnet as the ONOS host also show no dropped ICMP packets.
Hence why I'm asking about additional debugging tools within ONOS.
Thanks
On Mon, Dec 22, 2014 at 3:39 PM, Tom Laermans <tom.laermans@powersource.cx> wrote:
> It doesn't receive a reply either over ICMP echo or over SNMP; this is noted in the event log when the host goes down.
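One way to double-check a capture like the one described above is to pair each echo request with its reply and flag anything slow or unanswered. The sketch below is only an illustration under stated assumptions: it expects tcpdump's default one-line text output for IPv4 ICMP (for example from `tcpdump -n icmp`), the function name `slow_or_missing` is made up for this example, and the 500 ms threshold is just an example value. It is not how Observium itself decides that a host is down.

```python
import re
from datetime import datetime

# Assumed input: tcpdump's default one-line ICMP output, e.g.
# 12:00:00.000100 IP 10.0.0.5 > 10.0.0.1: ICMP echo request, id 7, seq 1, length 64
LINE = re.compile(
    r"^(?P<ts>\d{2}:\d{2}:\d{2}\.\d+) IP (?P<src>\S+) > (?P<dst>[^\s:]+): "
    r"ICMP echo (?P<kind>request|reply), id (?P<id>\d+), seq (?P<seq>\d+)"
)


def slow_or_missing(lines, timeout_ms=500):
    """Return (target, id, seq, rtt_ms) tuples for late or unanswered requests.

    rtt_ms is None when no reply was seen at all. Captures that cross
    midnight are not handled, because tcpdump's default timestamps carry
    no date.
    """
    pending = {}   # (target, id, seq) -> timestamp of the echo request
    problems = []
    for line in lines:
        m = LINE.match(line.strip())
        if not m:
            continue  # not an ICMP echo line, ignore it
        ts = datetime.strptime(m["ts"], "%H:%M:%S.%f")
        if m["kind"] == "request":
            pending[(m["dst"], m["id"], m["seq"])] = ts
        else:
            key = (m["src"], m["id"], m["seq"])
            started = pending.pop(key, None)
            if started is None:
                continue  # reply without a captured request
            rtt_ms = (ts - started).total_seconds() * 1000.0
            if rtt_ms > timeout_ms:
                problems.append((key[0], int(key[1]), int(key[2]), rtt_ms))
    # Whatever is still pending never received a reply in this capture.
    problems.extend((h, int(i), int(s), None) for (h, i, s) in pending)
    return problems
```

Feeding it the saved text of a capture (for instance `slow_or_missing(open("capture.txt"))`, where `capture.txt` is a hypothetical file name) gives a list that can be compared against the times of the "Down (PING)" events.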
There is no time frame to report; it didn't return a ping (in time) at the exact moment the message is logged.
These are the default ping config settings:
#$config['ping']['retries'] = 3; // How many times to retry ping (1 - 10)
#$config['ping']['timeout'] = 500; // Timeout in milliseconds (50 - 2000)
So, 3 pings that are missed or take longer than 500 ms, it would seem.
There are currently no device-specific ping settings; this is on my to-do list somewhere.
You can try to enable this:
$config['ping']['debug'] = TRUE; // If TRUE store ping errors into logs/debug.log file
and check the debug log.
Tom
On 22/12/2014 23:55, John Brown wrote:
> Hence why I'm asking about additional debugging tools within ONOS.
Thanks.
I did crank the timeout up to 1000 ms and the retries up to 5 prior to my original post. That didn't seem to help much.
I now have the debug value set and will watch the debug log for any hints.
Thanks
On Mon, Dec 22, 2014 at 4:04 PM, Tom Laermans <tom.laermans@powersource.cx> wrote:
> You can try to enable this: $config['ping']['debug'] = TRUE;
> and check the debug log.
In my previous testing (which was on a MikroTik, though I didn't find out whether it was the end device or an intermediate device), the issue occurred because the device stopped replying to pings for 10 seconds every few minutes.
This will produce frequent false alerts even with 1-second timeouts and multiple retries.
Personally, I think we're quite justified in marking a device down if it doesn't reply to ping after 5 seconds :)
adam.
------ Original Message ------
From: "John Brown" <john@citylinkfiber.com>
To: "Observium Network Observation System" <observium@observium.org>
Sent: 12/22/2014 5:12:18 PM
Subject: Re: [Observium] Many false positives
> I did crank the timeout and retries up to 1000ms and 5 retries, prior to my original post. That didn't seem to help much.
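The arithmetic behind the last few messages can be made explicit with a small sketch. It uses a deliberately simplified model that is an assumption, not Observium's or fping's actual behaviour: a check makes a fixed number of attempts and waits the full timeout for each, with no backoff. Under that model the defaults tolerate 3 x 500 = 1500 ms of silence and the raised settings tolerate 5 x 1000 = 5000 ms, so a device that goes quiet for 10 seconds exceeds both. The function names and the `period_ms` parameter are hypothetical; the thread only says "every few minutes".

```python
def ping_budget_ms(attempts, timeout_ms):
    """Longest silence a single check tolerates in this simplified model."""
    return attempts * timeout_ms


def alert_fraction(silence_ms, period_ms, attempts, timeout_ms):
    """Fraction of checks that see nothing but silence.

    Assumes the device goes quiet for `silence_ms` once every `period_ms`
    and that a check starts at a uniformly random moment in that period.
    A check fails only if it starts early enough in the quiet window that
    the whole retry budget is used up before the device answers again.
    """
    exposed_ms = max(0, silence_ms - ping_budget_ms(attempts, timeout_ms))
    return exposed_ms / period_ms


if __name__ == "__main__":
    # Hypothetical period: one 10-second silence every 3 minutes.
    for attempts, timeout_ms in [(3, 500), (5, 1000)]:
        share = alert_fraction(10_000, 180_000, attempts, timeout_ms)
        print(f"{attempts} x {timeout_ms} ms: {share:.1%} of checks would fail")
```

Raising the timeout and retries shrinks the exposed window but cannot close it while the silence is longer than the budget, which is consistent with the report that 1000 ms and 5 retries did not help much.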
What command-line options do you pass to fping? How do you call fping from within the code? And what do you parse from fping's output to determine up or down?
Thanks
On Tue, Dec 23, 2014 at 12:10 AM, Adam Armstrong <adama@memetic.org> wrote:
> Personally, I think we're quite justified in marking a device down if it doesn't reply to ping after 5 seconds :)
I've seen similar issues (not necessarily via Observium) with overloaded MikroTik routerboards - especially on the low-end/older boards.
Given RouterOS's powerful featureset vs the relatively low CPU/memory available, it is very common to find a board sitting at 100% for hours on end or running out of memory every few minutes/hours. The only saving grace, typically, is that the board either recovers periodically or that the watchdog timers force it to reboot.
Of course, as you (John) are using Observium, you should be able to tell if the MikroTik is being overloaded or not. ;)
One alternative to confirm the behaviour (assuming you're not catching it yourself) is to monitor the problematic hosts with a SmokePing installation. SmokePing does one thing and does it well: ping services and print pretty latency/packet-loss graphs. :)
On 2014/12/23 09:10, Adam Armstrong wrote:
> In my previous testing (which was on a mikrotik, but I didn't find out if it was the end device or an intermediate device), the issue occurred because the device stopped replying to pings for 10 seconds every few minutes.
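For anyone who wants a quick look at the numbers before setting up SmokePing, the kind of summary it graphs (packet loss and typical latency) can be computed from any series of probe results. The following is a rough sketch, not SmokePing's own method: the function name `summarize` and the input format (a list of round-trip times in milliseconds, with `None` for a lost probe) are assumptions made for this example. The longest run of consecutive losses is included because a device that goes silent for several seconds shows up there even when the overall loss percentage looks small.

```python
from statistics import median


def summarize(rtts_ms):
    """Summarize probe results given as RTTs in ms, with None for a lost probe."""
    answered = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(answered)

    # Longest streak of consecutive lost probes.
    longest = current = 0
    for r in rtts_ms:
        current = current + 1 if r is None else 0
        longest = max(longest, current)

    return {
        "sent": len(rtts_ms),
        "loss_pct": 100.0 * lost / len(rtts_ms) if rtts_ms else 0.0,
        "median_ms": median(answered) if answered else None,
        "longest_loss_run": longest,
    }
```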
We run fping against the remote host. fping tells us "alive" or not.
We don't generate ICMP packets, we don't listen for replies. We run fping and it does all of that.
The last host I had with an issue like this, I just ran fping against it repeatedly until it failed, and it failed because pings weren't being returned. I never found out why, but after that it wasn't my problem anymore :)
Things that stop pings are usually control plane policing (when pinging large routers and switches) and firewalls.
adam.
------ Original Message ------
From: "John Brown" <john@citylinkfiber.com>
To: "Observium Network Observation System" <observium@observium.org>
Sent: 12/22/2014 4:55:03 PM
Subject: Re: [Observium] Many false positives
> Hence why I'm asking about additional debugging tools within ONOS.
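The "run fping repeatedly until it fails" approach can be scripted. This is a minimal sketch under stated assumptions, not the command Observium runs: the thread does not say which options Observium passes to fping. The flags used here are standard fping options (`-q` quiet, `-r` retries after the first attempt, `-t` per-attempt timeout in milliseconds), and an exit status of 0 means fping considered the host reachable. The function names and the injectable `run`, `probe` and `sleep` parameters are made up for this example.

```python
import subprocess
import time
from datetime import datetime


def fping_once(host, retries=3, timeout_ms=500, run=subprocess.run):
    """Run fping once; return True if it exits with status 0 (host alive)."""
    cmd = ["fping", "-q", "-r", str(retries), "-t", str(timeout_ms), host]
    result = run(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


def probe_until_failure(host, interval_s=1.0, max_probes=None,
                        probe=fping_once, sleep=time.sleep):
    """Probe until the first failure.

    Returns (probe_number, timestamp) for the first failed probe, or None
    if max_probes probes all succeeded.
    """
    count = 0
    while max_probes is None or count < max_probes:
        count += 1
        if not probe(host):
            return count, datetime.now()
        sleep(interval_s)
    return None


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; substitute the problem host.
    print(probe_until_failure("192.0.2.1", max_probes=600))
```

The timestamp of the first failure can then be lined up with the event log entry and with a packet capture taken at the same time.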
participants (4)
- Adam Armstrong
- Brendan Hide
- John Brown
- Tom Laermans