Capture gaps when retrieving interface traffic data …
Hello! We use Observium to poll port data via SNMP from a Mikrotik router (CCR2004-16G-2S+) running RouterOS 7.14.1 (Level 6), covering 4 WAN interfaces, with polling and alerts. However, we keep getting gaps in the data. Every now and then 5-10 minutes of data are missing, so that vertical stripes without data appear in the RRD graphs. Of course, this only becomes visible in the 6h overview.
We are surprised because we have the impression that it only started on this device (and others) almost a year ago. We therefore replaced the hardware, reinstalled the system (Debian 12) and polled only this router (we also deleted it from the old Observium to avoid duplicate parallel polling).
Looking at https://.../pollerlog/, we see our patient there: Device 10.X.X.1, Last Polled 100%, 108.67s (...), while the other devices at this site show 1-19%. Why 100%? Which figures are the relevant ones here?
We would be happy if someone could help us adjust the Observium "Poller Wrapper" parameters if necessary. Is there room for tuning there?
Greetings from Frankfurt/Main, Stefan
This is usually caused by one of three things:
1. Poller hardware is insufficient to poll all devices every 5 minutes consistently. This is usually easy to figure out from the massive load on the server.
2. Poller has insufficient threads configured to poll all devices every 5 minutes consistently, despite having enough resources. You can tell if this is the case because many poller-wrapper.py processes will be running, but the load on the server won't be critical.
3. Some intervening device on the network is throttling/filtering traffic intermittently, causing periodic failed snmpwalks and missing data for the odd poller cycle.
The poller wrapper is given a number of threads to start (either as an argument in the cron job or in the web config), and it runs that many poller.php processes. When a poller wrapper process starts, it checks how many instances of itself are already running; if there are too many, it dies. These numbers are in the poller config section.
If you have insufficient threads or the server is too slow, the devices aren't polled in time, so another poller-wrapper starts before the previous one has finished. This isn't always bad, since some devices just take a REALLY long time to poll, but when runs start overlapping multiple times, that's usually pretty dire.
We prevent more than a certain number of wrapper processes from running, so if you already have x running because of some slow devices or insufficient threads or whatever, it'll refuse to start a process for that period, and you'll lose one period of polling data, causing a gap.
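To make the overlap mechanism concrete, here is a minimal Python sketch of it. This is not Observium's actual code: the function name, the 300-second interval and the limit of 2 concurrent wrappers are assumptions chosen for illustration, and the real limit comes from the poller config section.

```python
def simulate_wrapper_cycles(run_durations, interval=300, max_wrappers=2):
    """Return the indices of poll cycles that get skipped.

    run_durations[i] is how long (in seconds) a wrapper started at
    cycle i would run. A cycle is skipped when max_wrappers wrappers
    from earlier cycles are still running at its start time.
    All parameter values here are hypothetical.
    """
    running_until = []  # end times of wrappers that were actually started
    skipped = []
    for cycle, duration in enumerate(run_durations):
        start = cycle * interval
        # Forget wrappers that have already finished by this start time.
        running_until = [end for end in running_until if end > start]
        if len(running_until) >= max_wrappers:
            skipped.append(cycle)  # wrapper refuses to start: one gap
        else:
            running_until.append(start + duration)
    return skipped


# Two slow runs of 700 s each overlap, so the third cycle is refused.
print(simulate_wrapper_cycles([700, 700, 200, 200]))
```

With runs that finish inside the interval (for example 200 s each) no cycle is skipped; gaps only appear once enough slow runs pile up to hit the limit.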
The network-caused stuff is much harder to diagnose, because no one ever wants to admit that their pet firewall platform is useless. :D
adam.
(In reply to Stefan Schmidt's message of 2024-03-18 10:33.)
_______________________________________________ observium mailing list -- observium@lists.observium.org To unsubscribe send an email to observium-leave@lists.observium.org
Hello Admin, thank you for the 3 points. We would like to respond to each of them to narrow down the problem:
1. The poller hardware currently polls only this one Mikrotik router plus 19 other devices behind it. There is some load, and this one router is displayed with “Last Polled” 100% and 204.19s. So is the server overloaded?
2. We have a total of 20 devices, 16 threads and 1 wrapper. How can we increase the number of processes? "Minimum allowed wrapper process" is set to “4”. We only see one at a time.
3. In parallel to the router in question, other routers, switches and APs (19) are polled in the same network, and these problems do not occur there. Why?
On this one router we see traffic of 10-100 Mbit/s on each of the 4 ports (interfaces). Is it possible that the Mikrotik hardware processes the SNMP queries too slowly? Our SNMP access goes via a 5th, otherwise unused line that is not affected by internet traffic. Has anyone had similar experiences with Mikrotik and the CCR2004-16G-2S+ model?
If this problem persists: is there a way to change the polling interval to 10 or 20 minutes for individual devices? That could be OK for us ...
Regards, Stefan
In the SNMP config for that specific device, try tuning the "Maxrep" setting, then poll manually and compare the times.
This often needs tweaking to a device-specific value for best performance. I even have multiple devices of the exact same make/model/etc. which each need different maxrep values.
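As a rough illustration of why maxrep affects poll time: with SNMP GETBULK, each request can return up to max-repetitions values, so a walk needs about ceil(OIDs / maxrep) round trips. The sketch below assumes every response is completely filled and ignores the device's own processing time and response size limits, which is presumably why the best value differs per device. The OID count and the 50 ms round-trip time are hypothetical numbers.

```python
import math


def estimate_walk(num_oids, max_rep, rtt_ms):
    """Rough lower bound for a GETBULK walk.

    Returns (number_of_requests, total_milliseconds), assuming each
    response carries exactly max_rep values and costs one round trip.
    """
    requests = math.ceil(num_oids / max_rep)
    return requests, requests * rtt_ms


# Hypothetical device: 2000 OIDs to walk, 50 ms per round trip.
for max_rep in (10, 20, 40, 80):
    requests, total_ms = estimate_walk(2000, max_rep, 50)
    print(f"maxrep={max_rep:3d}: {requests:4d} requests, ~{total_ms} ms")
```

A device that is slow per request benefits most from a larger value, up to the point where it starts truncating or dropping responses.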
Regards, James Tandy TandyUK Servers Limited
(In reply to Stefan Schmidt's message of 18/03/2024 15:37.)
Thanks James Tandy, that was the solution!
We selected the device in question, chose "Properties", then the "SNMP" tab, then "Basic Configuration", and set "Max Repetitions" to "80". It is working better now; poller times went down to 50 seconds ...
Regards, Stefan
You have a RouterOS device which polls slowly enough that a higher max-rep helps?
How many ports does it have?
In general you can try running snmpbulkwalks on the device with progressively larger max-rep values to find out what it can handle.
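One way to script that comparison is sketched below in Python. It relies on the Net-SNMP `snmpbulkwalk` tool and its `-Cr<NUM>` option for max-repetitions; the function names, the list of values to try, and the choice of ifXTable (1.3.6.1.2.1.31.1.1) as the table to walk are assumptions for illustration, not anything Observium itself runs.

```python
import subprocess
import time


def build_bulkwalk_cmd(host, community, max_rep, oid="1.3.6.1.2.1.31.1.1"):
    """Build a Net-SNMP snmpbulkwalk command line using -Cr<max_rep>."""
    return ["snmpbulkwalk", "-v2c", "-c", community,
            f"-Cr{max_rep}", host, oid]


def time_bulkwalks(host, community, max_reps=(10, 20, 40, 80),
                   run=subprocess.run):
    """Run one walk per max-rep value; return {max_rep: (seconds, ok)}.

    `run` is injectable so the function can be exercised without a
    real device; by default it calls subprocess.run.
    """
    results = {}
    for max_rep in max_reps:
        cmd = build_bulkwalk_cmd(host, community, max_rep)
        started = time.monotonic()
        proc = run(cmd, capture_output=True, text=True)
        elapsed = time.monotonic() - started
        results[max_rep] = (elapsed, proc.returncode == 0)
    return results


# Example call (host and community are placeholders):
#   for rep, (secs, ok) in time_bulkwalks("10.X.X.1", "public").items():
#       print(rep, f"{secs:.2f}s", "ok" if ok else "FAILED")
```

A value at which the walk starts failing or the time stops improving is a reasonable upper bound for that device.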
I'm not really sure why this helped though. You're running almost as many threads as you have devices, so things shouldn't really overlap unless you have very, very slow polling devices, which you don't seem to have.
oO
adam.
(In reply to Stefan Schmidt's message of 2024-03-19 13:46.)
Participants (3): Adam Armstrong, James Tandy, Stefan Schmidt