Hello,
One of the devices I'm polling began misbehaving (its SNMP stack couldn't cope with uptime > 4294967296 seconds). This caused the Observium poller to hang indefinitely.
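As a side note on the figure above: 4294967296 is 2^32, the point at which a 32-bit counter wraps. The short calculation below is only a sketch. It assumes the counter involved is a 32-bit TimeTicks value (hundredths of a second, as used by SNMPv2-MIB sysUpTime), which the message does not confirm; the function and variable names are made up for illustration.

```python
# Assumption: the value that overflowed is a 32-bit counter. The message says
# "seconds"; SNMP TimeTicks (e.g. sysUpTime) count hundredths of a second.
WRAP = 2 ** 32  # 4294967296
SECONDS_PER_DAY = 86400


def days_until_wrap(units_per_second: int) -> float:
    """Days until a 32-bit counter ticking at the given rate wraps around."""
    return WRAP / units_per_second / SECONDS_PER_DAY


days_if_timeticks = days_until_wrap(100)        # hundredths of a second
years_if_seconds = days_until_wrap(1) / 365.25  # whole seconds

print(f"as TimeTicks: {days_if_timeticks:.1f} days")
print(f"as seconds:   {years_if_seconds:.1f} years")
```

Under that assumption the wrap happens after roughly 497 days of uptime, which is a far more plausible trigger than 2^32 whole seconds (well over a century).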
The underlying cause seems to be an issue with net-snmp; however, the hang stopped other devices from being polled and eventually exhausted the file descriptor limit (by which point several thousand php processes were running).
I'm not sure of the best way to protect Observium from net-snmp failing; as a very dirty hack I've replaced the following line in the config:
- $config['snmpget'] = "/usr/bin/snmpget";
+ $config['snmpget'] = "/usr/bin/timeout 1m /usr/bin/snmpget";
I also tried setting snmpget's -t option, but this did not have an effect.
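Since the client's own -t option did not help, the deadline has to be enforced from outside the process, which is what the timeout(1) prefix does. The same idea can be sketched in Python as follows. This is an illustration only, not how Observium calls snmpget: run_with_deadline is a hypothetical helper, and the hung command is simulated with a sleeping child process rather than a real snmpget.

```python
import subprocess
import sys


def run_with_deadline(argv, deadline_s):
    """Run argv and return its stdout, or None if it exceeds deadline_s seconds.

    subprocess.run() kills the child and waits for it when the timeout
    expires, so a hung command does not linger as a stray process.
    """
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=deadline_s)
    except subprocess.TimeoutExpired:
        return None
    return done.stdout


# Stand-ins for a hung and a healthy snmpget (hypothetical commands).
hung = [sys.executable, "-c", "import time; time.sleep(30)"]
quick = [sys.executable, "-c", "print('ok')"]

print(run_with_deadline(hung, 0.5))
print(run_with_deadline(quick, 10))
```

The caller sees a missing result for the hung command and can mark the device unreachable, much as the "With timeout" trace does after 60 seconds.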
This is on the community version, but I believe the subscription version would be affected too (I've not seen anything in the change logs that would indicate otherwise).
For completeness, a not-very-interesting poller trace follows.
Hope this is useful,
Adam Bishop
gpg: 0x6609D460
Janet, the UK's research and education network.
Without timeout:
DEBUG! Observium v0.13.10.4586 Poller
Starting polling run:
SQL[SELECT `device_id` FROM `devices` WHERE `disabled` = 0 AND `hostname` LIKE 'atlas_1r' ORDER BY `device_id` ASC]
SQL[SELECT * FROM `devices` WHERE `device_id` = '11']
SQL[SELECT * FROM devices_attribs WHERE `device_id` = '11']
SQL[SELECT * FROM `alert_tests` WHERE 1]
SQL[SELECT *,`alert_table`.`alert_table_id` AS `alert_table_id` FROM `alert_table` LEFT JOIN `alert_table-state` ON `alert_table`.`alert_table_id` = `alert_table-state`.`alert_table_id` WHERE `device_id` = '11']
Array ( )
Array ( )
atlas_1r 11 apc
DEBUG: SNMP Auth options = -v3 -l 'authNoPriv' -n "" -a 'MD5' -A '<snip>' -u 'observium'
/usr/bin/snmpget -v3 -l 'authNoPriv' -n "" -a 'MD5' -A '<snip>' -u 'observium' -Oqv -m SNMPv2-MIB -M /opt/observium/mibs 'udp':'atlas_1r':'161' sysObjectID.0
With timeout:
DEBUG! Observium v0.13.10.4586 Poller
Starting polling run:
SQL[SELECT `device_id` FROM `devices` WHERE `disabled` = 0 AND `hostname` LIKE 'atlas_1r' ORDER BY `device_id` ASC]
SQL[SELECT * FROM `devices` WHERE `device_id` = '11']
SQL[SELECT * FROM devices_attribs WHERE `device_id` = '11']
SQL[SELECT * FROM `alert_tests` WHERE 1]
SQL[SELECT *,`alert_table`.`alert_table_id` AS `alert_table_id` FROM `alert_table` LEFT JOIN `alert_table-state` ON `alert_table`.`alert_table_id` = `alert_table-state`.`alert_table_id` WHERE `device_id` = '11']
Array ( )
Array ( )
atlas_1r 11 apc
DEBUG: SNMP Auth options = -v3 -l 'authNoPriv' -n "" -a 'MD5' -A '<snip>' -u 'observium'
/usr/bin/snmpget -v3 -l 'authNoPriv' -n "" -a 'MD5' -A '<snip>' -u 'observium' -Oqv -m SNMPv2-MIB -M /opt/observium/mibs 'udp':'atlas_1r':'161' sysObjectID.0
SNMP Unreachable
SQL[UPDATE `devices` set `status` ='0' WHERE device_id='11']
SQL[INSERT INTO `alerts` (`importance`,`device_id`,`message`) VALUES ('0','11','Device is down')]
SQL[INSERT INTO `eventlog` (`device_id`,`reference`,`type`,`timestamp`,`message`) VALUES ('11','NULL','system',NOW(),'Device status changed to Down (snmp)')]
RRD[cmd[update /opt/observium/rrd/atlas_1r/status.rrd N:0] stdout[] stderr[]]
RRD[cmd[update /opt/observium/rrd/atlas_1r/ping.rrd N:6.02] stdout[] stderr[]]
RRD[cmd[update /opt/observium/rrd/atlas_1r/ping_snmp.rrd N:U] stdout[] stderr[]]
SQL[INSERT INTO `perf_times` (`type`,`doing`,`start`,`duration`,`devices`) VALUES ('poll','atlas_1r','1389715032.3017','60.58','1')]
/opt/observium/poller.php atlas_1r January 14, 2014, 15:58 - 1 devices polled in 60.58 secs
MySQL: Cell[0/0s] Row[1/0s] Rows[3/0s] Column[0/0s] Update[1/0.03s] Insert[3/0.03s] Delete[0/0s]
Janet(UK) is a trading name of Jisc Collections and Janet Limited, a not-for-profit company which is registered in England under No. 2881024 and whose Registered Office is at Lumen House, Library Avenue, Harwell Oxford, Didcot, Oxfordshire. OX11 0SG. VAT No. 614944238
Hi Adam,
Please use the poller wrapper, it will at least make this not cause issues for your other devices.
Other than that, I believe this is either a net-snmp bug, or you've set custom timeouts insanely high (ie ms vs s in units)...
Tom
On 14 Jan 2014, at 16:54, Tom Laermans <tom.laermans@powersource.cx> wrote:
> Please use the poller wrapper, it will at least make this not cause issues for your other devices.
Thanks for the pointer, I've switched over to that - it still leaves processes hanging around but other devices are unaffected. If I have time this week I'll extend the wrapper to cull the affected processes to avoid resource exhaustion.
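The culling idea mentioned above could look roughly like the sketch below: poll devices in parallel with a bounded number of workers, and give each poll a hard deadline so that a hung child is killed instead of accumulating. This is not the actual poller wrapper; poll_one, poll_all, the device names and the stand-in commands are all hypothetical.

```python
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def poll_one(device, argv, deadline_s):
    """Poll a single device and return (device, status)."""
    try:
        subprocess.run(argv, capture_output=True, timeout=deadline_s, check=True)
    except subprocess.TimeoutExpired:
        # subprocess.run() has already killed and reaped the hung child.
        return device, "timeout"
    except subprocess.CalledProcessError:
        return device, "failed"
    return device, "ok"


def poll_all(jobs, workers=4, deadline_s=60):
    """jobs maps device name -> argv. Returns {device: status}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(poll_one, device, argv, deadline_s)
                   for device, argv in jobs.items()]
        return dict(future.result() for future in futures)


# Hypothetical stand-ins for "poller.php <hostname>" on two devices.
jobs = {
    "healthy": [sys.executable, "-c", "pass"],
    "hung": [sys.executable, "-c", "import time; time.sleep(30)"],
}
results = poll_all(jobs, workers=2, deadline_s=2)
print(results)
```

One caveat with this approach: only the direct child is killed. If that child has spawned its own helpers (as poller.php does with snmpget), they may be left behind unless the child is started in its own process group and the whole group is signalled.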
> Other than that, I believe this is either a net-snmp bug, or you've set custom timeouts insanely high (ie ms vs s in units)...
Almost certainly a net-snmp bug - it occurs when calling snmpget against the host independent of observium.
Thanks,
Adam Bishop
gpg: 0x6609D460
Janet, the UK's research and education network.