Ahh, that makes sense. I thought it was really odd that it wasn’t able to ping a local RRD database or MySQL instance.
As I’ve done more testing, the problem seems to be isolated to pinging APC NMS2 cards, since all of our other hosts (and older UPSes with NMS1 cards) work just fine.
I’ve set the following in config.php, but it doesn’t seem to take effect: the retry count is generous, yet when I brute-force it, a host only comes back after the 2nd or 3rd run of poller.php.
// PING Settings - Retries/Timeouts
$config['ping']['retries'] = 6; // How many times to retry ping
$config['ping']['timeout'] = 1500; // Timeout in milliseconds
This is what I’m seeing in the ping log. I have no experience with fping, so I can’t tell if the ping settings are taking effect or not.
2014-05-28 07:10:34 | PING ERROR: itups04-01.net.internal (1) | FPING OUT: 10.1.4.4 : xmt/rcv/%loss = 1/0/100%
MTR OUT: HOST: observium   Loss%  Snt  Last   Avg  Best  Wrst StDev
  1.|-- 10.1.101.1          0.0%    5   5.5   4.3   0.3   7.8   3.4
  2.|-- 10.1.4.4            0.0%    5   1.3   1.1   1.0   1.3   0.1

2014-05-28 07:11:05 | PING ERROR: itups-mpoe-01.net.internal (1) | FPING OUT: 10.1.22.97 : xmt/rcv/%loss = 1/0/100%
MTR OUT: HOST: observium   Loss%  Snt  Last   Avg  Best  Wrst StDev
  1.|-- 10.1.101.1          0.0%    5   8.2   5.7   0.3  10.5   5.0
  2.|-- 10.1.22.97          0.0%    5   1.1   1.3   1.1   1.9   0.4

2014-05-28 07:15:08 | PING ERROR: itups03-01.net.internal (1) | FPING OUT: 10.1.3.4 : xmt/rcv/%loss = 1/0/100%
MTR OUT: HOST: observium   Loss%  Snt  Last   Avg  Best  Wrst StDev
  1.|-- 10.1.101.1          0.0%    5   0.3   1.4   0.2   3.6   1.6
  2.|-- 10.1.3.4            0.0%    5  67.4  14.5   1.0  67.4  29.6

2014-05-28 07:15:16 | PING ERROR: itups03-02.net.internal (1) | FPING OUT: 10.1.13.6 : xmt/rcv/%loss = 1/0/100%
MTR OUT: HOST: observium   Loss%  Snt  Last   Avg  Best  Wrst StDev
  1.|-- 10.1.101.1          0.0%    5   8.2   1.8   0.2   8.2   3.5
  2.|-- 10.1.13.6           0.0%    5   0.9  25.7   0.9  62.6  33.7

2014-05-28 07:15:24 | PING ERROR: itups04-02.net.internal (1) | FPING OUT: 10.1.14.4 : xmt/rcv/%loss = 1/0/100%
2014-05-28 07:15:27 | PING ERROR: itups-mpoe-02.net.internal (1) | FPING OUT: 10.1.22.98 : xmt/rcv/%loss = 1/0/100%
MTR OUT: HOST: observium   Loss%  Snt  Last   Avg  Best  Wrst StDev
  1.|-- 10.1.101.1          0.0%    5   0.2   2.2   0.2   9.3   4.0
  2.|-- 10.1.14.4           0.0%    5   1.1   1.1   1.0   1.3   0.1
MTR OUT: HOST: observium   Loss%  Snt  Last   Avg  Best  Wrst StDev
  1.|-- 10.1.101.1          0.0%    5   2.6   2.3   0.2   4.8   2.1
  2.|-- 10.1.22.98          0.0%    5   1.2   1.2   1.0   1.3   0.1

2014-05-28 07:15:34 | PING ERROR: itups04-01.net.internal (1) | FPING OUT: 10.1.4.4 : xmt/rcv/%loss = 1/0/100%
MTR OUT: HOST: observium   Loss%  Snt  Last   Avg  Best  Wrst StDev
  1.|-- 10.1.101.1          0.0%    5   0.2   1.6   0.2   6.9   3.0
  2.|-- 10.1.4.4            0.0%    5   1.1  29.1   1.1  71.9  37.5

2014-05-28 07:16:07 | PING ERROR: itups-mpoe-01.net.internal (1) | FPING OUT: 10.1.22.97 : xmt/rcv/%loss = 1/0/100%
MTR OUT: HOST: observium   Loss%  Snt  Last   Avg  Best  Wrst StDev
  1.|-- 10.1.101.1          0.0%    5   3.9   2.2   0.3   6.3   2.7
  2.|-- 10.1.22.97          0.0%    5   1.1   1.2   1.1   1.3   0.1
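
To make these entries easier to compare, a small script along the following lines could pull out the fping counters. This is only a sketch: the function name parse_fping_errors and the regular expression are my own and not part of Observium, and it assumes the log lines look exactly like the ones above. The xmt field is the number of probes fping transmitted, so sent=1 on every entry would suggest that each failing check sent a single probe. Whether the retries setting is supposed to raise that number is an assumption I have not verified.

```python
import re

# Matches the fping summary inside a PING ERROR log entry.
# The pattern is an assumption based on the log lines quoted above.
FPING_SUMMARY = re.compile(
    r"PING ERROR: (?P<host>\S+) \(\d+\) \| FPING OUT: (?P<ip>\S+) : "
    r"xmt/rcv/%loss = (?P<xmt>\d+)/(?P<rcv>\d+)/(?P<loss>\d+)%"
)


def parse_fping_errors(log_text):
    """Return one dict per PING ERROR entry found in log_text."""
    entries = []
    for match in FPING_SUMMARY.finditer(log_text):
        entries.append({
            "host": match["host"],
            "ip": match["ip"],
            "sent": int(match["xmt"]),          # probes transmitted
            "received": int(match["rcv"]),      # replies received
            "loss_percent": int(match["loss"]),
        })
    return entries


# Two entries copied from the ping log above.
sample = (
    "2014-05-28 07:10:34 | PING ERROR: itups04-01.net.internal (1) | "
    "FPING OUT: 10.1.4.4 : xmt/rcv/%loss = 1/0/100%\n"
    "2014-05-28 07:11:05 | PING ERROR: itups-mpoe-01.net.internal (1) | "
    "FPING OUT: 10.1.22.97 : xmt/rcv/%loss = 1/0/100%\n"
)

for entry in parse_fping_errors(sample):
    print(f"{entry['host']}: sent={entry['sent']} received={entry['received']}")
```

Running it over the whole ping log would show at a glance whether any failing check ever sent more than one probe.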
Is there anything else I’m missing, or any additional debugging output or logging I can provide?
As far as I can tell, something changed in how pinging works between the .13 and .14 Observium updates. Most of this troubleshooting came from the earlier thread from March 27th from Mark Nellmeann (http://comments.gmane.org/gmane.network.observium.general/2064). I just don’t know enough about the back-end systems, and I don’t want to risk losing our historical data.
Thanks for the help, and for this excellent system!
Andrew Davis
IT Systems
J. David Gladstone Institutes
(415) 734-2549
andrew.davis@gladstone.ucsf.edu

On May 27, 2014, at 11:41 PM, Tom Laermans <tom.laermans@powersource.cx> wrote:
It's actually just "Unpingable."
The other output is debug output: RRD[blah] and SQL[blah].
Either way, your host is unpingable.
Tom
On 28/05/2014 05:38, Andrew Davis wrote:
All,
We’ve been running Observium on Ubuntu 12.04 for a little over a year now, and after applying the latest community update, we seem to have devices periodically reporting as down.
The error I’m getting is bouncing between UnpingableMySQL and UnpingableRRD.
With debug enabled, this is an example of what we’re seeing:
/poller.php -h itups03-02.net.internal
Observium v0.14.4.5229 Poller

Starting polling run:

SQL[SELECT `device_id` FROM `devices` WHERE `disabled` = 0 AND `hostname` LIKE 'itups03-02.net.internal' ORDER BY `device_id` ASC]
SQL[SELECT * FROM `devices` WHERE `device_id` = '40']
SQL[SELECT * FROM devices_attribs WHERE `device_id` = '40']
itups03-02.net.internal 40 apc UnpingableRRD[cmd[update /opt/observium/rrd/itups03-02.net.internal/status.rrd N:0] stdout[OK u:0.00 s:0.00 r:0.05] stderr[]]
RRD[cmd[update /opt/observium/rrd/itups03-02.net.internal/ping.rrd N:U] stdout[OK u:0.00 s:0.00 r:0.06] stderr[]]
RRD[cmd[update /opt/observium/rrd/itups03-02.net.internal/ping_snmp.rrd N:U] stdout[OK u:0.00 s:0.00 r:0.06] stderr[]]
SQL[INSERT INTO `perf_times` (`type`,`doing`,`start`,`duration`,`devices`) VALUES ('poll','itups03-02.net.internal','1401246542.664','0.066','1')]
./poller.php itups03-02.net.internal May 27, 2014, 20:09 - 1 devices polled in 0.066 secs
MySQL: Cell[0/0s] Row[1/0s] Rows[1/0s] Column[0/0s] Update[0/0s] Insert[1/0s] Delete[0/0s]
Any advice on where we can look? It seems like the only way to get them back up is to keep running the poller for the particular host until a run succeeds.
Thanks!
Andrew Davis
IT Systems
J. David Gladstone Institutes
_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium