System Uptime (sysUpTime) OID incorrect for some systems
I'm observing a problem with some systems being reported as having rebooted at odd times that are completely unrelated to reality. The GUI claims the system uptime is derived from the OID sysUpTime; however, I can't find this OID in any MIB. Maybe it is an internal label in Observium?
In actuality Observium seems to be using the 1st OID below, but as can be seen from the 2nd OID, the two seem totally unrelated. In the case of this particular system, the 2nd OID value reflects when it was rebooted; the 1st one seems to have no correlation to the reboot time.
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (376806) 1:02:48.06
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (13560578) 1 day, 13:40:05.78
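For readers comparing the two values: SNMP Timeticks count hundredths of a second, so the number in parentheses and the formatted duration are the same quantity. A minimal Python sketch of that conversion follows; the function name `format_timeticks` is made up for illustration and the layout simply mimics the output shown above.

```python
def format_timeticks(ticks: int) -> str:
    """Render SNMP Timeticks (hundredths of a second) as a duration string."""
    total_seconds, hundredths = divmod(ticks, 100)
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    days, hours = divmod(hours, 24)
    clock = f"{hours}:{minutes:02d}:{seconds:02d}.{hundredths:02d}"
    if days == 0:
        return clock
    unit = "day" if days == 1 else "days"
    return f"{days} {unit}, {clock}"


print(format_timeticks(376806))    # sysUpTimeInstance value from above
print(format_timeticks(13560578))  # hrSystemUptime.0 value from above
```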
Is there any way to change, on a per-device basis, the OID that is used to display and monitor System Uptime?
Observium CE 18.9.9420 (5th September 2018)
OS Linux 4.9.0-8-amd64 [amd64] (Debian 9 (stretch))
Apache 2.4.25 (Debian)
PHP 7.0.33-0+deb9u3 (OPcache: ENABLED)
Python 2.7.13
MySQL 10.1.37-MariaDB-0+deb9u1 (extension: mysqli 5.0.12-dev)
SNMP NET-SNMP 5.7.3
RRDtool 1.6.0
Fping 3.15 (IPv4 and IPv6)
Regards
Chris Macneill Web: www.cmit.nz
sysUpTime means SNMPv2-MIB::sysUpTime.0, which is exactly the same OID as DISMAN-EVENT-MIB::sysUpTimeInstance:
$ snmptranslate DISMAN-EVENT-MIB::sysUpTimeInstance -On
.1.3.6.1.2.1.1.3.0
$ snmptranslate SNMPv2-MIB::sysUpTime.0 -On
.1.3.6.1.2.1.1.3.0
Observium always uses the maximum of the following uptime values (whichever are available on the device):
1. SNMPv2-MIB::sysUpTime.0
2. HOST-RESOURCES-MIB::hrSystemUptime.0
3. SNMP-FRAMEWORK-MIB::snmpEngineTime.0
4. unix-agent or OS-specific OID (from definitions).
In your case hrSystemUptime.0 will be used:
hrSystemUptime.0 = Timeticks: (13560578) 1 day, 13:40:05.78
The reboot trigger fires if, at the next poll run, this value is lower than the previous one by more than 300 seconds.
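The selection and trigger rules described in this reply can be sketched roughly as follows. This is an illustrative Python sketch, not Observium's actual code (which is PHP); the function names and the dictionary layout are assumptions, and uptimes are taken to be in seconds.

```python
from typing import Dict, Optional, Tuple

REBOOT_THRESHOLD_SECONDS = 300  # threshold quoted in the reply above


def pick_uptime(sources: Dict[str, Optional[int]]) -> Tuple[str, int]:
    """Return (source name, seconds) for the largest uptime the device reported.

    Sources the device does not answer for are passed as None and skipped.
    """
    available = {name: secs for name, secs in sources.items() if secs is not None}
    if not available:
        raise ValueError("device reported no uptime source")
    name = max(available, key=available.get)
    return name, available[name]


def rebooted(previous: int, current: int,
             threshold: int = REBOOT_THRESHOLD_SECONDS) -> bool:
    """Flag a reboot when uptime fell by more than `threshold` seconds."""
    return previous - current > threshold


# Values from the original report, converted from Timeticks to whole seconds.
chosen = pick_uptime({
    "sysUpTime": 3768,
    "hrSystemUptime": 135605,
    "snmpEngineTime": None,  # assumed unavailable, for illustration only
})
print(chosen)
```

Under these rules a small backwards jitter between polls does not count as a reboot; only a drop larger than the threshold does.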
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium
Thanks for the quick response, but the Observium CE GUI is displaying the result of SNMPv2-MIB::sysUpTime.0 even when HOST-RESOURCES-MIB::hrSystemUptime.0 is available AND has the larger value.
Currently I have three devices displaying the "wrong" System Uptime, 2 x Windows Server 2012R2 and a Tycon Voltage monitor.
The Windows Servers definitely return both values as noted in my original email, but the SNMPv2-MIB::sysUpTime.0 value is displayed. The Windows Servers are Virtual Machines running under Hyper-V.
I was doing some maintenance on these servers earlier this evening and rebooted them, and now the two OIDs return uptime values that are less than 1 minute apart. I guess in a day or so we'll see what happens.
The Tycon monitor has a poor SNMP implementation and only returns SNMPv2-MIB::sysUpTime.0. Observium periodically detects a reboot. I'll try monitoring the OID via a script to determine whether the device is providing spurious data or whether there is some issue in Observium. I'm not overly concerned about this device as it'll get replaced soon, so I may just turn off the alerts. It's more of an irritation factor than a big problem.
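A monitoring script of the kind mentioned here could look roughly like the sketch below. It assumes Net-SNMP's `snmpget` is installed, that the device answers SNMP v2c, and that the community string is known; the host, community, polling interval and all function names are placeholders, not details from this thread.

```python
import subprocess
import time
from typing import List, Tuple

SYS_UPTIME_OID = ".1.3.6.1.2.1.1.3.0"  # SNMPv2-MIB::sysUpTime.0


def read_sysuptime_ticks(host: str, community: str) -> int:
    """Fetch sysUpTime as raw Timeticks via snmpget (-Oqvt prints only the integer)."""
    result = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqvt", host, SYS_UPTIME_OID],
        check=True, capture_output=True, text=True,
    )
    return int(result.stdout.strip())


def find_resets(samples: List[Tuple[float, int]]) -> List[Tuple[float, int, int]]:
    """Return (timestamp, previous_ticks, current_ticks) wherever the counter went backwards."""
    resets = []
    for (_, prev), (ts, cur) in zip(samples, samples[1:]):
        if cur < prev:
            resets.append((ts, prev, cur))
    return resets


def watch(host: str, community: str, interval: float = 60.0,
          polls: int = 60) -> List[Tuple[float, int, int]]:
    """Poll the device `polls` times and report every backwards jump seen."""
    samples = []
    for _ in range(polls):
        samples.append((time.time(), read_sysuptime_ticks(host, community)))
        time.sleep(interval)
    return find_resets(samples)
```

Calling `watch("device.example", "public")` (both placeholders) would collect an hour of samples at one-minute spacing. Resets that show up here independently of Observium's polls would point at the device rather than at the poller.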
It seems to me that for the Windows Servers this may be a bug, so I guess I should report it through the bug report channel? I'll monitor for a few days and then report or not if it continues to be a problem.
Regards
Chris Macneill Web: www.cmit.nz
Chris Macneill via observium wrote on 24/05/2019 15:46:
Thanks for the quick response, but Observium CE GUI is displaying result of SNMPv2-MIB::sysUpTime.0 even when HOST-RESOURCES-MIB::hrSystemUptime.0 is available AND is larger value.
Currently I have three devices displaying the "wrong" System Uptime, 2 x Windows Server 2012R2 and a Tycon Voltage monitor.
Since the last CE release we have made some improvements regarding uptime, but the main logic (using the maximum uptime) has been in use for years.
Probably something else is wrong in the SNMP requests. Please attach the poller debug output for the problem devices:
./poller.php -r -d -m os -h <device>
The Windows Servers definitely return both values as noted in my original email, but the SNMPv2-MIB::sysUpTime.0 value is displayed. The Windows Servers are Virtual Machines running under Hyper-V.
I was doing some maintenance on these servers earlier this evening and rebooted them, and now the two OIDs return uptime values that are less than 1 minute apart. I guess in a day or so we'll see what happens.
The Tycon monitor has a poor SNMP implementation and only returns SNMPv2-MIB::sysUpTime.0. Observium periodically detects a reboot. I'll try monitoring the OID via a script to determine whether the device is providing spurious data or whether there is some issue in Observium. I'm not overly concerned about this device as it'll get replaced soon, so I may just turn off the alerts. It's more of an irritation factor than a big problem.
Which OS is detected on the Tycon monitor, and around which uptime value is the reboot triggered for this device? We have the ability to exclude some "black hole" uptimes from the "rebooted" trigger.
OK, I've worked out what the problem was with Windows Server 2012R2. It's simple really, and the problem is easily and reliably repeatable.
On Windows Server:-
SNMPv2-MIB::sysUpTime.0 is reset when the SNMP service restarts.
HOST-RESOURCES-MIB::hrSystemUptime.0 is reset when the server hardware is rebooted.
So, if you're playing with the SNMP service as I was yesterday and repeatedly restarting the SNMP Service, the two OIDs return completely different values and Observium reports "System Rebooted" each time the SNMP service is restarted. In normal circumstances you wouldn't likely need to restart the SNMP service too often, so this issue wouldn't reveal itself.
When you reboot the system, then the OIDs will return values that are very close, but not identical. HOST-RESOURCES-MIB::hrSystemUptime.0 is always greater than SNMPv2-MIB::sysUpTime.0 as one would expect, as the SNMP service starts some small delay after the server hardware.
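The behaviour described above suggests a way to tell an SNMP service restart from a real reboot when both counters are available. The Python sketch below only illustrates that observation; it is not how Observium works, the function name and labels are invented, and the 300-second threshold is borrowed from Mike's earlier reply. Note also that a later reply in this thread explains why hrSystemUptime is not trusted on Windows.

```python
def classify_uptime_drop(prev_sys: int, cur_sys: int,
                         prev_hr: int, cur_hr: int,
                         threshold: int = 300) -> str:
    """Compare two consecutive polls of both counters (all values in seconds)."""
    sys_dropped = prev_sys - cur_sys > threshold  # sysUpTime went backwards
    hr_dropped = prev_hr - cur_hr > threshold     # hrSystemUptime went backwards
    if sys_dropped and hr_dropped:
        return "host reboot"
    if sys_dropped:
        return "snmp service restart"
    if hr_dropped:
        return "hrSystemUptime anomaly"
    return "no change"


# sysUpTime fell back to about 26 minutes while hrSystemUptime kept counting.
print(classify_uptime_drop(prev_sys=64000, cur_sys=1557,
                           prev_hr=64613, cur_hr=64913))
```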
Despite what Mike said yesterday, at least for the two Windows Servers I am monitoring, Observium CE is definitely displaying the Uptime and Last Reboot based on SNMPv2-MIB::sysUpTime.0 and doesn't appear to take the larger of the two, see below for evidence.
Looks to me like the algorithm that compares the 4 values Mike mentioned in his original reply is failing somehow. I tried the poller.php debug output Mike suggested. It produces a lot of data. I can post this, but as I'm new to this mailing list, please advise the best way to do it: inline, as an attachment, or as a link to Dropbox/Google Drive?
Currently:-
SNMPv2-MIB::sysUpTime.0 = Timeticks: (155746) 0:25:57.46 (SNMP Service restarted 25 minutes prior to SNMPGET)
HOST-RESOURCES-MIB::hrSystemUptime.0 = Timeticks: (6491382) 18:01:53.82 (Hardware rebooted 18 hours prior to SNMPGET)
Observium CE GUI shows:-
Uptime 22m 24s (difference in time from above values due to 5 minute polling interval)
Last reboot 2019-05-25 16:08:07 (Server was actually last rebooted 2019-05-24 22:06:00 NZST)
Current Localtime = Sat May 25 16:36:45 NZST 2019
There are some slight discrepancies in figures due to the time taken to perform the tests and "copy & paste" the results. Time waits for no one!! :-)
As regards the Tycon Voltage monitor, this is likely just a poor SNMP implementation which is crashing and restarting, but the device itself is OK. As this device only returns one uptime value, there is no way to handle this in Observium.
Regards
Chris Macneill Web: www.cmit.nz
This snippet from the poller.php output supports my assertion that the "wrong" OID is being used.
Using SNMP Agent sysUpTime (5111 sec. => 1h 25m 11s)
[RRD Disabled - create /opt/observium/rrd/device.nz/uptime.rrd]
[RRD Disabled - update /opt/observium/rrd/device.nz/uptime.rrd N:5111]
 o Uptime 1h 25m 11s
 o Last reboot 2019-05-25 16:08:07
/opt/observium/includes/polling/system.inc.php:281
array(
  [use] => string(9) "sysUpTime"
  [sysUpTime] => int(5111)
  [uptime] => int(5111)
  [formatted] => string(10) "1h 25m 11s"
  [message] => string(26) "Using SNMP Agent sysUpTime"
  [previous] => string(4) "4935"
  [diff] => int(-176)
  [last_rebooted] => string(10) "1558757287"
  [rebooted] => int(0)
)
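The fields in this debug array can be cross-checked by hand. The Python sketch below assumes that `diff` is `previous` minus `uptime` (inferred from the numbers shown, not from the Observium source) and that `last_rebooted` is a Unix timestamp rendered in NZST, taken here as a fixed UTC+12 offset; the helper names are invented.

```python
from datetime import datetime, timedelta, timezone

NZST = timezone(timedelta(hours=12), "NZST")  # fixed offset, no daylight saving assumed

# Relevant fields copied from the debug output above.
debug = {"uptime": 5111, "previous": "4935", "last_rebooted": "1558757287"}


def uptime_diff(fields: dict) -> int:
    """Previous uptime minus current uptime; negative while uptime keeps growing."""
    return int(fields["previous"]) - int(fields["uptime"])


def last_reboot_local(fields: dict, tz: timezone = NZST) -> str:
    """Render the last_rebooted epoch value as local wall-clock time."""
    stamp = datetime.fromtimestamp(int(fields["last_rebooted"]), tz=tz)
    return stamp.strftime("%Y-%m-%d %H:%M:%S")


print(uptime_diff(debug))        # matches [diff] in the array
print(last_reboot_local(debug))  # matches "Last reboot" in the GUI
```

Both results agree with the poller output, which is consistent with the displayed values being derived from sysUpTime alone.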
Regards
Chris Macneill Web: www.cmit.nz
I started digging in the PHP code............
/opt/observium/includes/polling/system.inc.php lines 122 & 123
if ($device['os'] != 'windows' && $device['snmp_version'] != 'v1' && is_device_mib($device, 'HOST-RESOURCES-MIB'))
This conditional statement specifically excludes Windows from using HOST-RESOURCES-MIB::hrSystemUptime.0, but I have no idea why.
I guess if I remove the != 'windows' check and it fixes my problem, then that's a solution.
Maybe this OID is only present in some versions of MS Windows? If so, the condition needs to be more granular and only exclude those versions which don't have this OID.
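To make the effect of that condition explicit, here is the same test restated as a small Python sketch. The PHP helper `is_device_mib()` is modelled as simple set membership, which is an assumption about its behaviour, and the function name is invented.

```python
from typing import Set


def may_use_hr_system_uptime(os_name: str, snmp_version: str,
                             device_mibs: Set[str]) -> bool:
    """Mirror of the quoted condition: all three checks must pass."""
    return (
        os_name != "windows"                    # Windows is excluded outright
        and snmp_version != "v1"                # SNMP v1 devices are excluded
        and "HOST-RESOURCES-MIB" in device_mibs  # stand-in for is_device_mib()
    )


print(may_use_hr_system_uptime("windows", "v2c", {"HOST-RESOURCES-MIB"}))
print(may_use_hr_system_uptime("linux", "v2c", {"HOST-RESOURCES-MIB"}))
```

With Windows excluded, hrSystemUptime never enters the comparison for those hosts, which matches the sysUpTime-only result seen in the debug output.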
Regards
Chris Macneill Web: www.cmit.nz
Daft question,
why would you be restarting the SNMP service anyways?
As it only restarts on server reboot anyways?
Regards
Simon
Ah, Windows. Now I remember: this is the very old SNMP uptime issue on Windows.
We completely ignore hrSystemUptime on Windows systems because the uptime it reports there is really incorrect. The best description of this issue can be found here: https://kb.paessler.com/en/topic/61249-why-does-the-snmp-system-uptime-senso...
OK, fair enough, the PRTG document describes the issue nicely. I accept there is a valid reason for excluding Windows Server from using this value if it's going to wrap around so quickly.
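For a sense of scale, the wrap-around period of a fixed-width counter is easy to work out. SNMP Timeticks are 32-bit values in hundredths of a second; the millisecond case below is included only as a comparison, and whether it is the mechanism behind the Windows behaviour is an assumption, since the linked article is not reproduced in this thread. The function name is invented.

```python
SECONDS_PER_DAY = 86_400


def wrap_days(bits: int, ticks_per_second: int) -> float:
    """Days until an unsigned counter of `bits` width overflows."""
    return (2 ** bits) / ticks_per_second / SECONDS_PER_DAY


print(round(wrap_days(32, 100), 1))   # 32-bit counter in hundredths of a second
print(round(wrap_days(32, 1000), 1))  # 32-bit counter in milliseconds (assumed case)
```

The first figure comes to roughly 497 days and the second to roughly 49.7 days, so a counter of the second kind would appear to "reboot" about every seven weeks.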
It might be worth putting a comment in the code explaining why "Windows" devices are excluded; that would prevent other people from reporting this "bug" again.
I guess we have to live with the fact that Microsoft won't support Open Standards and does things its own proprietary way. I guess there may be something in WMI that resolves this, but that's another new "can of worms".
Regards
Chris Macneill Web: www.cmit.nz
It seems I have found a more correct way to detect the uptime on Windows; the fix is committed in r9918 (of course, this is currently in the Pro edition only).
Thanks Mike, we're hopefully upgrading to the Pro version very soon.
Regards
Chris Macneill Web: www.cmit.nz
participants (3)
- Chris Macneill
- Mike Stupalov
- Simon Mousey Smith