On 10/07/2013 05:11 PM, Niklas Larsson wrote:
Tom Laermans wrote 2013-10-07 15:48:
Hi Sweden!
These are the points from the Belgian vote:
You have issues with your monitored machines and would best fix them; your SNMP is timing out.
I'll start with "we haven't changed anything...", but in this case it's true - this has happened on both XenServer and pure CentOS boxes. All of them lost the values at the same time, and some of these boxes have not been updated or rebooted in a while (> 6 months). The only thing that changed is Observium.
That doesn't mean things don't break behind your back though - I have a lot of Supermicro machines that lost their sensors, and not only in Observium: sdt on the command line lost them too :[ Maybe they don't like being polled every 5 minutes :) But since you mentioned you could still get them via sd on the command line, I knew this wasn't the case.
Then I started looking again, and not all of them have lost fans etc. - I still have some that are showing sensor values. They are all newer boxes.
So I started searching through commits and testing a bit. At last I found the timeout value for snmpbulkwalk (I was looking at the -Cr value first), tested with -t20 and it worked. I set the timeout for the device to 20 and did a new discovery - success: fans, voltages, temperatures and storage are back!
So the older SuperMicro boxes didn't handle the short timeout.
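For anyone who wants to repeat the manual test described above, here is a rough sketch of how the snmpbulkwalk command line could be put together. It only builds the argument list; it does not run anything. The host address, community string and OID are placeholders, and the function name is made up for illustration. Per the Net-SNMP snmpcmd man page, -t is the per-request timeout in seconds and -r the number of retries; -Cr is the snmpbulkwalk max-repetitions option.

```python
import shlex


def build_bulkwalk_cmd(host, oid, community="public", timeout_s=20,
                       retries=None, max_repetitions=None):
    """Build an snmpbulkwalk argument list (hypothetical helper, not Observium code).

    timeout_s       -> Net-SNMP -t (seconds to wait for each reply)
    retries         -> Net-SNMP -r (left at the tool's default when None)
    max_repetitions -> snmpbulkwalk -Cr<NUM> (left at the default when None)
    """
    cmd = ["snmpbulkwalk", "-v2c", "-c", community, "-t", str(timeout_s)]
    if retries is not None:
        cmd += ["-r", str(retries)]
    if max_repetitions is not None:
        cmd.append(f"-Cr{max_repetitions}")
    cmd += [host, oid]
    return cmd


if __name__ == "__main__":
    # 192.0.2.10 is a documentation address and "public" an assumed community;
    # 1.3.6.1.2.1.1 is the standard "system" subtree, used here only as an example.
    print(shlex.join(build_bulkwalk_cmd("192.0.2.10", "1.3.6.1.2.1.1")))
```

Running the printed command by hand against one of the affected boxes, with and without the longer -t value, should show whether the timeout is what makes the walk fail.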
Hmm. Your timeout is set to 20 microseconds? Or is it 20 seconds? God, the Net-SNMP documentation is such shit. If 20 seconds is not enough to get that entire tree walked, you will indeed see this issue. As far as I know we don't set timeouts unless you do so yourself; did you? Or is the Net-SNMP default not long enough?
Glad you got it sorted!
Tom