Hi,
I updated to the latest svn version yesterday, after which I immediately had issues with the poller. I updated from a release about two weeks older. Now I get up/down events for different devices, sometimes complete BGP neighbours down on others, device-down alerts on snmp, etc.
I increased the number of poller processes in the crontab and increased the snmp timeout values, etc. However, I still have graphs with large parts missing because of this. I checked poller times and kept an eye on the running processes and the logs. Some devices take 120s to poll completely, but all polling is done in about 2 minutes. Still I get these errors; other than the upgrade, no changes have been made to firewalls, policies, etc.
Any clues? Do I need to upgrade php?
Regards, Rutger
On 21.10.2014 14:03, Rutger Bevaart wrote:
Show your crontab. How much memory does the Observium host have?
Send debug output for one device (with a long polling time): ./poller.php -d -h some_device > /tmp/debug_poller
(Do not send this output to the list) ;)
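For comparison, a typical Observium cron setup looks roughly like this; the paths, user, and poller process count here are illustrative only and vary per install:

```
# /etc/cron.d/observium -- illustrative example, not your actual crontab
*/5 *   * * *   root   /opt/observium/poller-wrapper.py 8 >> /dev/null 2>&1
33  */6 * * *   root   /opt/observium/discovery.php -h all >> /dev/null 2>&1
```

The first argument to poller-wrapper.py is the number of concurrent poller processes, which is the value you would raise when polling falls behind.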
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium
I believe I have been experiencing the same issue with the latest stable update, and have narrowed the problem down to the changes made in revision 5884. Can you try svn up -r 5883 (make sure to switch to the current branch if you are on stable right now) and see if everything works, and if so, svn up -r 5884 to see if the behaviour returns?
-Joe
Joseph,
Same questions: do you get this issue on the latest stable (r5889) or the latest trunk?
If yes, then send me debug output for any device with a high polling time: ./poller.php -d -h your_device
The changes in r5884 can't cause such problems.
We have the issue on the latest stable (we even tried the just-committed r5898), and then switched to the trunk branch to try incremental updates until the issue appeared. Everything is OK up to revision 5883. However, when I ran a poll with debug, everything worked OK (even on the latest revisions). We use the poller-wrapper script in our crontab, and when I edited it to pass the debug argument to the poller.php script, the issue was resolved. So, strangely, only non-debug polling via the poller-wrapper script seems (partially) broken.
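To show where the edit went, here is a hypothetical sketch of how a wrapper might build the poller command line; the real poller-wrapper.py code differs, this only illustrates where the added debug flag lands:

```python
# Hypothetical sketch only -- not the actual poller-wrapper.py code.
# Shows where a forced "-d" (debug) flag slots into the command line.
def build_poller_cmd(device_id, debug=False):
    cmd = ["./poller.php", "-h", str(device_id)]
    if debug:
        cmd.insert(1, "-d")  # becomes: ./poller.php -d -h <device_id>
    return cmd
```

With debug=True every wrapper-spawned poll runs the same way as a manual ./poller.php -d -h run, which is why the problem disappears.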
I can provide the full debug and non-debug polling logs, if required, but there isn't really anything useful there, since everything works as expected with the debug turned on, and pretty much nothing gets output on the non-debug runs that fail. Here are the first 15 lines of the no-debug poll run log, with a host "xxx1" that failed at the top:
Observium v0.14.10.5898 Poller
Starting polling run:
xxx1nyi.net 26 ios (cisco) Observium v0.14.10.5898 Poller
Starting polling run:
xxx2.nyi.net 9 ios (cisco) Observium v0.14.10.5898 Poller
Also, when the non-debug poll runs and fails, it only takes < 1 second to finish the poll. Whatever is failing is happening very early in the polling process, before it runs any SQL. Here is a snippet of some failures from the poll log:
[2014/10/21 14:35:00 -0400] discovery.php(84391): /usr/local/www/apache24/observium/discovery.php: new - 0 devices discovered in 0.002 secs
[2014/10/21 14:35:01 -0400] poller.php(84397): /usr/local/www/apache24/observium/poller.php: xxx1.nyi.net - 1 devices polled in 0.294 secs
[2014/10/21 14:35:01 -0400] poller.php(84402): /usr/local/www/apache24/observium/poller.php: xxx2.nyi.net - 1 devices polled in 0.516 secs
[2014/10/21 14:35:08 -0400] poller.php(84601): /usr/local/www/apache24/observium/poller.php: xxx3.nyi.net - 1 devices polled in 0.405 secs
[2014/10/21 14:35:11 -0400] poller.php(84689): /usr/local/www/apache24/observium/poller.php: xxx4.nyi.net - 1 devices polled in 0.259 secs
[2014/10/21 14:35:12 -0400] poller.php(84711): /usr/local/www/apache24/observium/poller.php: xxx5.nyi.net - 1 devices polled in 0.334 secs
[2014/10/21 14:35:17 -0400] poller.php(84855): /usr/local/www/apache24/observium/poller.php: xxx6.nyi.net - 1 devices polled in 0.170 secs
[2014/10/21 14:35:17 -0400] poller.php(84871): /usr/local/www/apache24/observium/poller.php: xxx7.nyi.net - 1 devices polled in 0.270 secs
These servers normally take about 15 - 30 seconds each to poll. Here is the start of the successful debug-enabled poll run for the same host that failed first above (xxx1):
xxx1.nyi.net 26 ios (cisco) Observium v0.14.10.5898 Poller
Starting polling run:
SQL[SELECT `device_id` FROM `devices` WHERE `disabled` = 0 AND `device_id` = '9' ORDER BY `device_id` ASC]
SQL[SELECT * FROM `devices` WHERE `device_id` = '9']
And the poll log times:
[2014/10/21 14:37:58 -0400] poller.php(86759): /usr/local/www/apache24/observium/poller.php: xxx1.nyi.net - 1 devices polled in 21.44 secs
[2014/10/21 14:38:00 -0400] poller.php(86755): /usr/local/www/apache24/observium/poller.php: xxx2.nyi.net - 1 devices polled in 23.24 secs
[2014/10/21 14:38:04 -0400] poller.php(86753): /usr/local/www/apache24/observium/poller.php: xxx3.nyi.net - 1 devices polled in 27.00 secs
[2014/10/21 14:38:05 -0400] poller.php(87587): /usr/local/www/apache24/observium/poller.php: xxx4.nyi.net - 1 devices polled in 6.771 secs
[2014/10/21 14:38:07 -0400] poller.php(87645): /usr/local/www/apache24/observium/poller.php: xxx5.nyi.net - 1 devices polled in 7.678 secs
[2014/10/21 14:38:07 -0400] poller.php(86760): /usr/local/www/apache24/observium/poller.php: xxx6.nyi.net - 1 devices polled in 30.35 secs
[2014/10/21 14:38:16 -0400] poller.php(87247): /usr/local/www/apache24/observium/poller.php: xxx7.nyi.net - 1 devices polled in 27.67 secs
Mystic :)
Do I understand correctly that the problem is not in port polling (or any other individual module) but in incorrectly detecting whether the device is up or down? (Of course, if a device is incorrectly detected as down, then all other modules will show wrong data.)
As far as I can see in the debug output sent by Rutger, everything is fine.
But then, on the latest trunk, enable debug for ping by adding this to your config: $config['ping']['debug'] = TRUE;
and wait for the next false down, then send the content of /opt/observium/logs/debug.log
This log should store the output from the fping and mtr commands when a device is detected as down.
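Spelled out, the addition to the config file would look like this (the file path and surrounding lines are illustrative; only the debug line itself is from the instructions above):

```php
<?php
// /opt/observium/config.php -- existing settings above, then add:
$config['ping']['debug'] = TRUE;  // log fping/mtr output when a device is detected as down
```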
What entries do you see in the eventlog:
"Device status changed to Down (ping)" or "Device status changed to Down (snmp)"?
I set the ping debug config option, but nothing gets logged. The eventlog shows "Device status changed to Down (snmp)", so it looks like isSNMPable() is returning 0 on the failed runs?
I ran snmpget directly on the command line for an affected device, with the OID and MIB that isSNMPable() uses, and it returns properly:
# snmpget -v 2c -c *** -m DISMAN-EVENT-MIB xxx1.nyi.net .1.3.6.1.2.1.1.3.0
DISMAN-EVENT-MIB::sysUpTimeInstance = Timeticks: (1132194804) 131 days, 0:59:08.04

# snmpget -v 2c -c *** -Oqtn -m DISMAN-EVENT-MIB xxx1.nyi.net .1.3.6.1.2.1.1.3.0
.1.3.6.1.2.1.1.3.0 1132195361
I dug deeper and think the issue must be with get_device_mibs(). The MIBs array defined in defaults.inc.php does not have DISMAN-EVENT-MIB in it, and the debug flag causes get_device_mibs() to include all MIBs. So I need to populate that array or keep the debug flag added to poller-wrapper, but I am left wondering when this array first became the source of allowed polling MIBs when no debug flag is set, and how it is supposed to be populated. Perhaps Rutger has the same issue?
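To illustrate the behaviour I am describing, here is a hypothetical model (not Observium's actual PHP code, and the MIB names beyond DISMAN-EVENT-MIB are placeholders) of how a MIB whitelist plus a debug override could produce exactly this failure mode:

```python
# Hypothetical model of the suspected get_device_mibs() behaviour:
# with debug off, only MIBs present in the configured whitelist are
# returned, so an empty whitelist leaves isSNMPable() with no MIB
# to query sysUpTime with -- and the device gets marked down.
ALL_MIBS = ["DISMAN-EVENT-MIB", "SNMPv2-MIB", "HOST-RESOURCES-MIB"]

def get_device_mibs(whitelist, debug=False):
    if debug:
        return list(ALL_MIBS)  # debug run: include every known MIB
    return [m for m in ALL_MIBS if m in whitelist]
```

Under this model an empty whitelist yields no MIBs on normal runs but everything on debug runs, matching the "works only with -d" symptom above.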
Thanks for your help, and sorry for the delayed info.
participants (4)
- Joseph Zancocchio
- joseph@nyi.net
- Mike Stupalov
- Rutger Bevaart