I've seen similar issues (not necessarily via Observium) with overloaded MikroTik routerboards - especially on the low-end/older boards.
Given RouterOS's powerful featureset vs the relatively low CPU/memory available, it is very common to find a board sitting at 100% for hours on end or running out of memory every few minutes/hours. The only saving grace, typically, is that the board either recovers periodically or that the watchdog timers force it to reboot.
Of course, as you (John) are using Observium, you should be able to tell if the MikroTik is being overloaded or not. ;)
One way to confirm the behaviour (assuming you're not catching it yourself) is to monitor the problematic hosts with a smokeping installation. Smokeping does one thing and does it well - it pings services and prints pretty latency/packet-loss graphs. :)
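For anyone without a smokeping box handy, the same idea can be roughed out in a few lines. This is only a sketch of the bookkeeping, not smokeping itself: the function names `summarize` and `longest_gap` are made up for illustration, and it assumes you already collect one round-trip time per probe (in ms, with `None` for a lost probe) at a fixed interval.

```python
from statistics import median


def summarize(samples):
    """Summarize probe results.

    samples: list of round-trip times in ms, with None for a lost probe.
    """
    replies = [s for s in samples if s is not None]
    lost = len(samples) - len(replies)
    loss_pct = 100.0 * lost / len(samples) if samples else 0.0
    return {
        "sent": len(samples),
        "loss_pct": loss_pct,
        "median_ms": median(replies) if replies else None,
    }


def longest_gap(samples, interval_s):
    """Length in seconds of the longest run of consecutive lost probes."""
    longest = run = 0
    for s in samples:
        run = run + 1 if s is None else 0
        longest = max(longest, run)
    return longest * interval_s


# Hypothetical data: one probe per second, three consecutive losses.
samples = [1.2, 1.4, None, None, None, 1.3]
print(summarize(samples))
print(longest_gap(samples, 1))
```

A long "longest gap" with low overall loss would point at the device going quiet in bursts rather than at general packet loss.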
On 2014/12/23 09:10, Adam Armstrong wrote:
In my previous testing (which was on a MikroTik, though I didn't find out whether it was the end device or an intermediate device), the issue occurred because the device stopped replying to pings for 10 seconds every few minutes. This will produce frequent false alerts even with 1 second timeouts and multiple retries.
Personally, I think we're quite justified in marking a device down if it doesn't reply to ping after 5 seconds :)
adam.
------ Original Message ------
From: "John Brown" <john@citylinkfiber.com>
To: "Observium Network Observation System" <observium@observium.org>
Sent: 12/22/2014 5:12:18 PM
Subject: Re: [Observium] Many false positives
Thanks.
I did crank the timeout and retries up to 1000ms and 5 retries prior to my original post. That didn't seem to help much.
I now have the debug value set and will watch the debug log for any hints.
Thanks
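A rough back-of-the-envelope check shows why raising the values to 1000 ms and 5 retries may not help against the 10-second silences described earlier in the thread. The sketch below is a simplified model, not Observium's actual logic: it assumes each attempt waits the full timeout with no backoff, and that `attempts` is the total number of pings sent per check. Both function names are hypothetical.

```python
def detection_window_ms(attempts, timeout_ms):
    """Worst-case time spent waiting before a host is declared down.

    Assumption: every attempt waits the full timeout, with no backoff.
    """
    return attempts * timeout_ms


def would_alert(outage_ms, attempts, timeout_ms):
    """True if an outage covering a whole check would outlast every attempt.

    Assumption: the check starts at the moment the device goes silent.
    """
    return outage_ms >= detection_window_ms(attempts, timeout_ms)


# Default-style settings: 3 attempts x 500 ms = 1500 ms window.
print(detection_window_ms(3, 500))
# Raised settings: 5 attempts x 1000 ms = 5000 ms window, still shorter
# than a 10-second silence, so the check would still fail.
print(would_alert(10_000, 5, 1000))
```

Under these assumptions the window would have to exceed 10 seconds before such a silence stopped triggering a "down" event.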
On Mon, Dec 22, 2014 at 4:04 PM, Tom Laermans <tom.laermans@powersource.cx> wrote:
There is no time frame to report; it didn't return a ping (in time) at the exact moment the message is logged.
These are the default ping config settings:
#$config['ping']['retries'] = 3; // How many times to retry ping (1 - 10)
#$config['ping']['timeout'] = 500; // Timeout in milliseconds (50 - 2000)
So, 3 missed-or->500ms pings, it would seem.
There are currently no device-specific ping settings, this is on my to-do somewhere.
You can try to enable this:
$config['ping']['debug'] = TRUE; // If TRUE store ping errors into logs/debug.log file
and check the debug log.
Tom
On 22/12/2014 23:55, John Brown wrote:
Yes, I see the "Device Status changed to Down (PING)" in the log. The conflict I have with this is that it doesn't provide any more detailed information: how many pings failed, time frame, etc.
I am running TCPDUMP on a monitor/span port that the ONOS is connected to, and I see ICMP packets going out to devices and I see their reply packets come back. Over a 15 minute period of time a host will be reported as DOWN, yet the ICMP packet flow shows echo_request / echo_reply pairs without undue delay. Other machines on the same LAN subnet as the ONOS host also show no dropped ICMP packets.
Hence why I'm asking about additional debugging tools within ONOS.
Thanks
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium