Sorry yeah, I got that - the ports module alone is taking 500-700 seconds on a few of these devices (while other identical units on the same firmware and patch level, with mostly the same traffic passing through, still complete a full polling run in the same ~80 seconds these devices took up until January), so polling individual modules separately isn't going to work :( It was a really good idea though - I didn't realise we could poll individual modules separately, so I'll definitely keep that in mind :)
Regards,
Damien
Damien Gardner Jnr | Senior Dev Ops | P: (02) 8115 8812 | 4 Amy Close, Wyong, NSW 2259
On April 12, 2021, 11:54 AM GMT+10 chris.laffin@automattic.com wrote:
Hello,
I just want to clarify that my suggestion wasn't to remove any modules, in case you read it that way. But if the ports module by itself takes longer than 5 minutes, that's a separate problem this approach can't solve.
This still polls all of the modules, just in separate cron jobs. ports was the longest for me too, so I have one cron entry for '-m ports' and another for '-m <everything else>' per device ID, and those devices are excluded from the main polling instance with '-e 22'.
If you have a Juniper, there are some config commands you can set to filter out unnecessary interfaces and save some polling time on ports too. There is one command that covers most of them, but you can manually add more filters if you want to reduce the ports time further (an example of adding manual filters follows the command below).
set snmp filter-interfaces all-internal-interfaces
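You can also filter specific interfaces by hand on top of that - the patterns below are just examples, match them to whatever you don't need graphed:

  set snmp filter-interfaces interfaces "lsi\..*"
  set snmp filter-interfaces interfaces "ge-0/0/9"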
Mine is a PTX that is taking forever on ports, plus it's halfway around the world, so there's a lot of latency between SNMP requests and responses too.
Thanks, Chris
On Sun, Apr 11, 2021 at 5:45 PM Damien Gardner <damien.gardner@serversaustralia.com.au> wrote:
Thanks guys,
Unfortunately in this case disabling modules won't help, as it's the ports module that is taking all the time, and we can't really live without that :)
Adam's reply got me thinking, so I went on a deep dive of RRD internals and found the answer - heartbeat! By default it's 10 minutes; I've pushed it out to 30 minutes for the affected devices, and they're graphing now even though they're only being polled every 15 minutes!
In case it's useful, I did the following in each device's rrd directory:
for rrdfilename in *.rrd; do
  rrdinfo "$rrdfilename" | grep 'ds.*heartbeat' | cut -d'[' -f 2 | cut -d']' -f 1 | while read dsname; do
    rrdtool tune --heartbeat "$dsname":1800 "$rrdfilename"
  done
done
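(To double-check it took effect, the new value shows up in rrdinfo - the filename here is just an example:

  rrdinfo port-id1234.rrd | grep minimal_heartbeat

which should report 1800 for each DS once the tune has run.)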
And of course, don't forget to shut down rrdcached before, and start it back up after ;)
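(On a systemd box that's just something like the below - adjust the service name to however you run rrdcached:

  systemctl stop rrdcached
  # ...run the tune loop above...
  systemctl start rrdcached
)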
The graphs for those devices aren't quite as pretty with 15-minute polling, but it's better than no graphs at all, so we'll live with it until the vendor figures out what on earth is going on :D
Oh, btw, Adam - I found an issue with poller-wrapper.py - the -d debug option does not actually work, as it's not able to log to the file (the file handle needs to be passed to subprocess.check_call, not the name of the file). I'm not really a python dev, so this might not be the nicest way to fix it, but this got it working:
--- poller-wrapper.py (revision 11224)
+++ poller-wrapper.py (working copy)
@@ -844,7 +844,7 @@
     command_args.extend(command_list)
     command_args.append(device_id)
     if debug:
-        command_out = temp_path + '/observium_' + process + '_' + str(device_id) + '.debug'
+        command_out = open(temp_path + '/observium_' + process + '_' + str(device_id) + '.debug', 'w')
     #print(command_args) #debug
     subprocess.check_call(map(str, command_args), stdout=command_out, stderr=subprocess.STDOUT)
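For what it's worth, a slightly tidier sketch of the same fix (untested, just the lines above rearranged) would also close the debug handle once the poller process finishes:

  if debug:
      # same path as above; open for writing so the child's output lands in the .debug file
      command_out = open(temp_path + '/observium_' + process + '_' + str(device_id) + '.debug', 'w')
  subprocess.check_call(map(str, command_args), stdout=command_out, stderr=subprocess.STDOUT)
  if debug:
      # close the debug log handle now that the poller run has finished
      command_out.close()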
Thanks,
Damien
Damien Gardner Jnr | Senior Dev Ops | P: (02) 8115 8812 | 4 Amy Close, Wyong, NSW 2259
On April 9, 2021, 1:28 AM GMT+10 observium@observium.org wrote:
I've had a similar problem and put together a custom solution. I used some features of the polling wrapper to poll the device in multiple threads, by specifying the modules on the command line and excluding the group of devices from the main polling command.
Here are the changes to my cron.d:

# -e to exclude group 22
/home/observium/observium-wrapper poller -e 22 >> /home/observium/logs/poller_wrapper.log 2>&1
# the 2 devices that take longer than 5 minutes to poll; split the ports module out separately
/home/observium/poller.php -h 742 -m ports >> /dev/null 2>&1
/home/observium/poller.php -h 740 -m ports >> /dev/null 2>&1
/home/observium/poller.php -h 742 -m bgp-peers,graphs,sensors,storage,ospf,processors,netstats,system,mempools,mac-accounting,fdb-table,ipSystemStats,ucd-mib,status,os,wifi >> /dev/null 2>&1
/home/observium/poller.php -h 740 -m bgp-peers,graphs,sensors,storage,ospf,processors,netstats,system,mempools,mac-accounting,fdb-table,ipSystemStats,ucd-mib,status,os,wifi >> /dev/null 2>&1
This dropped my polling time significantly. I found which module was taking the most time through the device performance tab. I've since added more devices and more threads (around this time I actually switched to observium-wrapper). I attached a screenshot. I think the remaining high time is still down to these 2 devices. But this could help keep 5-minute polling on your devices if you know which modules are slow and poll them separately. Or, if they aren't important, disable them.
Thanks, Chris Laffin
On Wed, Apr 7, 2021 at 10:58 PM Adam Armstrong via observium <observium@observium.org> wrote:
It’s not really possible to do this because of the RRD data format. RRD stores the difference between two values, and usually has a maximum distance of about 1.5 datapoints between each value. You may have ruined the RRDs trying to change the step size; I’m not sure what RRD does if you do that on existing data. Changing the step size should have worked for data after the change, though. It would take half an hour to actually get a datapoint, though!
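For illustration only (not the exact DS layout Observium creates), an RRD defined like the below has a 300-second step and a 600-second heartbeat, so any gap of more than 600 seconds between updates is stored as unknown rather than being interpolated:

  rrdtool create example.rrd --step 300 DS:octets:DERIVE:600:0:U RRA:AVERAGE:0.5:1:2016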
The correct solution here is just to disable poller modules until the devices can complete a polling cycle. You can see which modules use the most time in the performance data section, under the right-hand cog menu on the device's navbar.
Some devices may behave better with the separate_walk ports poller mode, too. This also allows the “disable” port setting to actually prevent the device from polling the interface at all (the usual mode walks the entire ports table).
Don’t mess with the RRDs directly.
Adam.
From: observium <observium-bounces@observium.org> On Behalf Of Damien Gardner via observium
Sent: 08 April 2021 04:20
To: Observium <observium@observium.org>
Cc: Damien Gardner <damien.gardner@serversaustralia.com.au>
Subject: [Observium] Changing some devices to only poll every 10 or 15 minutes temporarily?
Hi all,
I have a bit of an oddball one, hoping it's not horrible to do... We have a few devices which have recently started taking > 5 mins to complete polling. This is leaving large gaps in our logs, and is also affecting discovery (the polling runs for those devices are piling up on top of each other and pushing load on those devices up heavily). For now I've just turned off polling on those devices, but that's not ideal either..
Is there a way to just poll those manually at, say, 10 or 15 minute intervals? I shoved them into a group and excluded that group from our polls, then ran poller.php -h deviceid for one device from cron every 15 mins to see how it would go (roughly the cron line below) - but we just get no graphs.. As a test I also tried rrdtool tune --step 900 to change the step size to 15 mins, but no go either (didn't really expect it to work..)
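(Roughly the cron entry I was testing with - the device id, path and user field here are just placeholders:

  */15 * * * * observium /home/observium/poller.php -h deviceid >> /dev/null 2>&1
)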
Is there a not-horrible (or if it has to be, I'm OK with horrible in the short term until the vendor can figure out what on earth is going on..) way to get these polled on a 15-minute interval and still going into RRDs? (We're on the current paid subscription version, if that helps)
Thanks,
Damien
Damien Gardner Jnr | Senior Dev Ops | P: (02) 8115 8812 | 4 Amy Close, Wyong, NSW 2259
_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium