Hello,

To clarify, my suggestion wasn't to remove any modules, in case it read that way. But if the ports module alone takes longer than 5 minutes, that's a separate problem this approach can't solve.

This still polls all of the modules, just in separate cron jobs. ports was the longest for me too, so I have one cron entry for -m ports and another for -m with everything else, per device ID, and those devices are excluded from the main polling instance with '-e 22'.
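For reference, a full cron.d entry looks something like this (the */5 schedule and the observium user column here are just examples, and device 742 is from my earlier mail below; adjust all of those to your setup):

*/5 * * * * observium /home/observium/poller.php -h 742 -m ports >> /dev/null 2>&1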

If you have a Juniper, there are some configuration commands to filter unnecessary interfaces out of SNMP, which helps save some ports polling time too. One command covers most of them, and you can manually add more filters if you want to shave the ports time down further.

set snmp filter-interfaces all-internal-interfaces
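The manual additions go under the same stanza and take interface-name regexes; the patterns below are only examples, not necessarily ones you'd want:

set snmp filter-interfaces interfaces "lsi.*"
set snmp filter-interfaces interfaces "em1.*"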

Mine is a PTX that takes forever on ports, plus it's halfway around the world, so there's a lot of latency between SNMP requests and responses on top of that.

Thanks,
Chris

On Sun, Apr 11, 2021 at 5:45 PM Damien Gardner <damien.gardner@serversaustralia.com.au> wrote:
Thanks guys,

Unfortunately in this case disabling modules won't help, as it's the ports module that is taking all the time, and we can't really live without that :)

Adam's reply got me thinking, so I went on a deep dive into RRD internals and found the answer: heartbeat! By default it's 10 minutes, so with 15-minute polling every sample was landing outside the heartbeat window and being stored as unknown. I've pushed it out to 30 minutes for the affected devices, and they're graphing now, while only being polled every 15 minutes!

In case it's useful, here's what I ran in each device's rrd directory:

for rrdfilename in *.rrd; do
  # pull each data source name out of the ds[name].minimal_heartbeat lines
  rrdinfo "$rrdfilename" | grep 'ds.*heartbeat' | cut -d'[' -f 2 | cut -d']' -f 1 | while read -r dsname; do
    # raise the heartbeat to 1800s so 15-minute updates are still accepted
    rrdtool tune "$rrdfilename" --heartbeat "$dsname:1800"
  done
done
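You can sanity-check any of the files afterwards (somefile.rrd is just a placeholder):

rrdinfo somefile.rrd | grep minimal_heartbeat

Every ds[...].minimal_heartbeat line should now read 1800.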

And of course, don't forget to shut down rrdcached before, and start it back up after ;) 
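If rrdcached runs under systemd, that's along these lines (the unit name may vary by distro):

systemctl stop rrdcached
# ... run the loop above ...
systemctl start rrdcached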

The graphs for those devices aren't quite as pretty with 15-minute polling, but it's better than no graphs at all, so we'll live with it until the vendor figures out what on earth is going on :D 

Oh, btw, Adam - I found an issue with poller-wrapper.py: the -d debug option doesn't actually work, because it fails to log to the file (the file handle needs to be passed to subprocess.check_call, not the name of the file). I'm not really a python dev, so this might not be the nicest way to fix it, but this got it working:

--- poller-wrapper.py (revision 11224)
+++ poller-wrapper.py (working copy)
@@ -844,7 +844,7 @@
             command_args.extend(command_list)
             command_args.append(device_id)
             if debug:
-                command_out = temp_path + '/observium_' + process + '_' + str(device_id) + '.debug'
+                command_out = open(temp_path + '/observium_' + process + '_' + str(device_id) + '.debug', 'w')
             #print(command_args) #debug
             subprocess.check_call(map(str, command_args), stdout=command_out, stderr=subprocess.STDOUT)
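If it's worth tidying further, the handle could also be closed after the call; something along these lines (untested, and assuming command_out is already set for the non-debug case, as in the surrounding code):

             if debug:
                 debug_path = temp_path + '/observium_' + process + '_' + str(device_id) + '.debug'
                 command_out = open(debug_path, 'w')
             try:
                 subprocess.check_call(map(str, command_args), stdout=command_out, stderr=subprocess.STDOUT)
             finally:
                 if debug:
                     # close the debug log handle so the fd isn't leaked across devices
                     command_out.close()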

Thanks,

Damien




Damien Gardner Jnr

Senior Dev Ops
P: (02) 8115 8812 
4 Amy Close, Wyong, NSW 2259
Need assistance? We are here 24/7 +61 2 8115 8888
Sent from Front
On April 9, 2021, 1:28 AM GMT+10 observium@observium.org wrote:

I've had a similar problem and put together a custom solution. I used some features of the polling wrapper to poll the device in multiple threads, specifying the modules on the command line and excluding that group of devices from the main polling command.

Here are the changes to my cron.d
#-e to exclude group 22
/home/observium/observium-wrapper poller -e 22 >> /home/observium/logs/poller_wrapper.log 2>&1
#the 2 devices that take longer than 5 minutes to poll. split the ports module separately
/home/observium/poller.php -h 742 -m ports  >> /dev/null 2>&1
/home/observium/poller.php -h 740 -m ports  >> /dev/null 2>&1
/home/observium/poller.php -h 742 -m bgp-peers,graphs,sensors,storage,ospf,processors,netstats,system,mempools,mac-accounting,fdb-table,ipSystemStats,ucd-mib,status,os,wifi  >> /dev/null 2>&1
/home/observium/poller.php -h 740 -m bgp-peers,graphs,sensors,storage,ospf,processors,netstats,system,mempools,mac-accounting,fdb-table,ipSystemStats,ucd-mib,status,os,wifi  >> /dev/null 2>&1

This dropped my polling time down significantly. I found which module was taking the most time through the device performance tab. I've since added more devices and more threads (around that time I also started using observium-wrapper); I attached a screenshot. I think the remaining high poll time is still down to these 2 devices. This approach could help keep 5-minute polling on your devices if you know which modules are slow and poll them separately - or, if they aren't important, just disable them.

Thanks,
Chris Laffin

On Wed, Apr 7, 2021 at 10:58 PM Adam Armstrong via observium <observium@observium.org> wrote:

It’s not really possible to do this because of the RRD data format. RRD stores the difference between two values, and usually allows a maximum distance of about 1.5 datapoints between values. You may have ruined the RRDs by trying to change the step size; I’m not sure what RRD does if you do that on existing data. Changing the step size should have worked for data after the change, but it would then take half an hour to actually get a datapoint!
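For example, if a file has a 5-minute step and a 10-minute heartbeat, any update arriving more than 10 minutes after the previous one is stored as unknown rather than as a rate, so polling every 15 minutes yields nothing but gaps.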

 

The correct solution here is just to disable poller modules until the devices can complete a polling cycle. You can see which modules use the most time in the performance data section, under the right-hand cog menu on the device navbar.

 

Some devices may behave better with the separate_walk ports poller mode, too. That mode also lets the “disable” port setting actually prevent the interface from being polled at all (the usual mode walks the entire ports table regardless).

 

Don’t mess with the step size.

 

Adam.

 

From: observium <observium-bounces@observium.org> On Behalf Of Damien Gardner via observium
Sent: 08 April 2021 04:20
To: Observium <observium@observium.org>
Cc: Damien Gardner <damien.gardner@serversaustralia.com.au>
Subject: [Observium] Changing some devices to only poll every 10 or 15 minutes temporarily?

 

Hi all,

 

I have a bit of an oddball one; hoping it's not horrible to do... We have a few devices which have recently started taking > 5 mins to complete polling. This is leaving large gaps in our logs, and is also affecting discovery (the polling runs are piling up on top of each other and pushing load on those devices up heavily). For now I've just turned off polling on those devices, but that's not ideal either...

 

Is there a way to just poll those manually at, say, 10 or 15 minute intervals? I shoved them into a group and excluded that group from our polls, then ran poller.php -h deviceid for one device from cron every 15 mins to see how it would go, but we just get no graphs... As a test I also tried rrdtool tune --step 900 to change the step size to 15 mins, but that was no go either (didn't really expect it to work...)

 

Is there a not-horrible way (or if it has to be, I'm ok with horrible in the short term, until the vendor can figure out what on earth is going on) to get these polled on a 15-minute interval and still going into RRDs? (We're on the current paid subscription version, if that helps.)

 

Thanks,

 

Damien

 


Damien Gardner Jnr

Senior Dev Ops
P: (02) 8115 8812 
4 Amy Close, Wyong, NSW 2259
Need assistance? We are here 24/7 +61 2 8115 8888

Sent from Front

_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium