This is usually caused by one of three things:
1. Poller hardware is insufficient to poll all devices every 5 minutes consistently. This is usually easy to figure out from the massive load on the server.
2. Poller has insufficient threads configured to poll all devices every 5 minutes consistently, despite having enough resources. You can tell if this is the case because many poller-wrapper.py processes will be running, but the load on the server won't be critical.
3. Some intervening device on the network is throttling/filtering traffic intermittently, causing periodic failed snmpwalks and missing data for the odd poller cycle.
The poller wrapper is given a number of threads to start (either as an argument in the cron job or in the web config) and runs that many poller.php processes. When a poller wrapper process starts, it checks how many copies of itself are already running; if there are too many, it dies. These numbers are in the poller config section.
If you have insufficient threads or the server is too slow, the devices aren't all polled in time, so another poller-wrapper starts before the previous one has finished. This isn't always bad, since some devices just take a REALLY long time to poll, but when the runs start overlapping multiple times, that's usually pretty dire.
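As a rough way to check whether the thread count is the bottleneck, you can estimate how long one full run would take from the per-device poll times shown in the poller log. This is only a sketch, not the wrapper's actual scheduling: the function name is made up, and it assumes each thread simply picks up the next-longest device as soon as it is free.

```python
import heapq

def estimate_cycle_seconds(device_seconds, threads):
    """Estimate wall-clock time for one poll run (hypothetical helper).

    device_seconds: per-device poll durations in seconds.
    threads: number of parallel poller processes (must be >= 1).
    Assumption: longest devices are started first, each on the
    thread that becomes free earliest.
    """
    if threads < 1:
        raise ValueError("threads must be at least 1")
    # Each heap entry is the time at which one thread becomes free.
    free_at = [0.0] * threads
    heapq.heapify(free_at)
    for seconds in sorted(device_seconds, reverse=True):
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + seconds)
    # The run ends when the last thread finishes.
    return max(free_at)

if __name__ == "__main__":
    durations = [108.67, 20.0, 20.0, 20.0]  # example values only
    for n in (1, 2, 4):
        total = estimate_cycle_seconds(durations, n)
        verdict = "fits" if total < 300 else "overruns"
        print(f"{n} thread(s): ~{total:.0f}s, {verdict} a 5-minute cycle")
```

If the estimate is comfortably under 300 seconds with the configured thread count and the server load is low, threads and hardware are probably not the cause.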
We prevent more than a certain number of wrapper processes running, so if you already have x running because of some slow devices or insufficient threads or whatever, it'll refuse to start a process for that period, and you'll lose one period of polling data, causing a gap.
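To make the gap mechanism concrete, here is a small simulation of that guard. It is a sketch under assumptions, not Observium's code: the function name, the limit of 2 concurrent wrappers and the run times are all invented for illustration.

```python
def simulate_cycles(run_seconds, interval=300, max_wrappers=2):
    """Return the cycle indices where a new wrapper would refuse to start.

    run_seconds[i]: how long the run started in cycle i would take.
    interval: seconds between scheduled starts (5 minutes here).
    max_wrappers: hypothetical limit on concurrent wrapper processes.
    """
    running_until = []  # end times of wrappers that are still running
    skipped = []
    for cycle, duration in enumerate(run_seconds):
        start = cycle * interval
        # Forget wrappers that have already finished by this start time.
        running_until = [end for end in running_until if end > start]
        if len(running_until) >= max_wrappers:
            # Too many copies already running: this cycle is lost.
            skipped.append(cycle)
            continue
        running_until.append(start + duration)
    return skipped

if __name__ == "__main__":
    # Three slow runs of 700s each, then a normal one.
    print(simulate_cycles([700, 700, 700, 100]))  # [2]
```

In this example the third cycle never starts because two earlier runs are still going, which is exactly one missing 5-minute sample in the graphs.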
The network caused stuff is much harder to diagnose, because no one ever wants to admit that their pet firewall platform is useless. :D
adam.
Stefan Schmidt via observium wrote on 2024-03-18 10:33:
Hello! We use Observium to poll port data via SNMP from 4 WAN interfaces on a Mikrotik router (CCR2004-16G-2S+) running RouterOS 7.14.1 (Level 6), with polling and alerts. However, we always have gaps in the data. Every now and then 5-10 minutes of data are missing, so that vertical stripes without data appear in the RRD graphs; of course this only becomes clear in the 6h overview.
We are surprised because we have the impression that it only started on this device (and others) almost a year ago. We therefore replaced the hardware, reinstalled the system (Debian 12) and polled only this router (we also removed it from the old Observium to avoid parallel double polling).
We looked at https://.../pollerlog/ and see our patient there: Device Last Polled 10.X.X.1 100% 108.67s (...), and the other devices at the location at 1-19%. Why 100%? Which figures are the relevant ones here?
We would be happy if someone could help us adjust the Observium "Poller Wrapper" parameters if necessary. Is there room for tuning there?
Greetings from Frankfurt/Main, Stefan