Re: [Observium] TP-Link switches freeze with new Observium CE
![](https://secure.gravatar.com/avatar/3bbbd945c333b8013d0dfa23058f65b9.jpg?s=120&d=mm&r=g)
Hrm,
on all the TP-Link switches I tested, this feature improved polling speed. That is why the separate_walk feature is enabled for this OS.
But such problems are certainly unacceptable, so I will disable it for this OS.
Anyway, if you don't have many devices, go to the device edit page -> Modules -> disable the 'separate_walk' option.
On 7 September 2018 at 19:31, Attila Nagy <tylla_at_memetic.org@tylla.hu> wrote:
Hi List!
Firstly, our appreciation goes to the team for the new CE version. Thanks a lot!
Now the somewhat bitter part: we updated to the latest CE version yesterday evening, and this morning we found that a few TP-Link switches had frozen. To be more precise, only the management part was frozen, so each switch was degraded to a dumb, non-manageable switch. After a reboot the switches return to full function, only to freeze again on the next SNMP poll.
The switches in question are all T1600G-28TS or T1600G-28PS (poor-man's manageable gigabit switches :-)). There is no newer firmware available for them. They worked perfectly with the previous CE version up until the update.
We have other TP-Link switches which are not affected by the problem: TL-SG2424, TL-SG2216, TL-SG2008.
We tried to narrow down the problem. Discovery seems to work as it should; the freeze occurs during polling, more precisely when the ports module runs. According to the debug log, the following command triggers the freeze:
/usr/bin/snmpget -v2c -c 'public' -Pu -OQUs -m IF-MIB -M /opt/observium/mibs/rfc:/opt/observium/mibs/net-snmp 'udp':'switch-test':'161' ifInMulticastPkts.1 ifOutMulticastPkts.1 ifInBroadcastPkts.1 ifOutBroadcastPkts.1 ifHCInOctets.1 ifHCOutOctets.1 ifHCInUcastPkts.1 ifHCOutUcastPkts.1 ifHCInMulticastPkts.1 ifHCOutMulticastPkts.1 ifHCInBroadcastPkts.1 ifHCOutBroadcastPkts.1 ifInOctets.1 ifOutOctets.1 ifInUcastPkts.1 ifOutUcastPkts.1 ifInNUcastPkts.1 ifOutNUcastPkts.1 ifInDiscards.1 ifOutDiscards.1 ifInErrors.1 ifOutErrors.1 ifInUnknownProtos.1 ifMtu.1 ifSpeed.1 ifPhysAddress.1 ifAdminStatus.1 ifLastChange.1 ifPromiscuousMode.1 ifConnectorPresent.1
From that point on there is no SNMP response from the switch, the SYS LED stops blinking and stays in whichever state it was in, and there is no network connectivity to the switch (it does not respond to ping and the management interface is inaccessible). The dumb switch hardware kept working for some time, but inconsistently: some switches froze to a complete halt a while later, while others kept working as dumb switches. As these switches are under constant load, we didn't have time to investigate them more thoroughly; we restored their functionality and disabled them in Observium. We have one spare unit that we can put online for testing if needed.
My questions would be:
- Should I include a full debug log, or is the above info enough?
- Has anyone met the same problem or found a solution to it?
- Can we find out what changed in the above command line, so that we can disable that single feature as a workaround?
- I suspect this is a firmware issue, but I am not sure (that's why I'm asking here first; I have a much better chance of getting a meaningful answer here than from TP-Link's support). Maybe somebody can help me narrow down the firmware bug, if it really is one.
Thank you guys in advance.
Best regards, Tylla
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium
![](https://secure.gravatar.com/avatar/4fad748f042c73a9d01d1ff340dbced4.jpg?s=120&d=mm&r=g)
Thanks Mike,
disabling separate_walk solved the issue for the test device. We are gonna go through all the sensitive switches and disable/test them one-by-one, and report back with the final outcome.
Is there anything we can do to further narrow down the problem?
As I mentioned, we have test hardware, so we can run some tests (or we can even put it online if you want direct access to it).
Regards, Tylla
On 2018-09-07 23:53, Mike Stupalov wrote:
Hrm,
on all the TP-Link switches I tested, this feature improved polling speed. That is why the separate_walk feature is enabled for this OS.
But such problems are certainly unacceptable, so I will disable it for this OS.
Anyway, if you don't have many devices, go to the device edit page -> Modules -> disable the 'separate_walk' option.
-- Mike Stupalov https://stupalov.com
-- Mike Stupalov Observium Limited, http://observium.org
![](https://secure.gravatar.com/avatar/4fad748f042c73a9d01d1ff340dbced4.jpg?s=120&d=mm&r=g)
Further info about this bug: from my testing it seems that using snmpget with lots of OIDs makes the switch freeze. More precisely, the command line used by Observium (quoted in my original mail) has a limit of 24: the 25th OID causes the switch to freeze, while 24 OIDs produce consistent, valid output. Interestingly, when the same OID is repeated (e.g. ifOutOctets.1 ifOutOctets.1 ifOutOctets.1 ..., or ifMtu.1 ifMtu.1 ifMtu.1 ...), 12 is the magic limit. Hmmm...
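For anyone who wants to repeat this test on spare hardware, here is a minimal sketch that rebuilds the snmpget command line from the original mail with a chosen number of OIDs. The function name build_snmpget and its parameters are hypothetical helpers for illustration; the options and OID names are copied from the quoted debug log. It only builds the command and does not send anything. Running the result with more than 24 OIDs against an affected T1600G is expected to freeze its management plane, so use a test unit.

```python
import shlex

# OID names in the order they appear in the snmpget command from the debug log.
PORT_OIDS = [
    "ifInMulticastPkts", "ifOutMulticastPkts", "ifInBroadcastPkts",
    "ifOutBroadcastPkts", "ifHCInOctets", "ifHCOutOctets",
    "ifHCInUcastPkts", "ifHCOutUcastPkts", "ifHCInMulticastPkts",
    "ifHCOutMulticastPkts", "ifHCInBroadcastPkts", "ifHCOutBroadcastPkts",
    "ifInOctets", "ifOutOctets", "ifInUcastPkts", "ifOutUcastPkts",
    "ifInNUcastPkts", "ifOutNUcastPkts", "ifInDiscards", "ifOutDiscards",
    "ifInErrors", "ifOutErrors", "ifInUnknownProtos", "ifMtu", "ifSpeed",
    "ifPhysAddress", "ifAdminStatus", "ifLastChange", "ifPromiscuousMode",
    "ifConnectorPresent",
]

# MIB directories as they appear in the quoted command line.
MIB_DIRS = "/opt/observium/mibs/rfc:/opt/observium/mibs/net-snmp"


def build_snmpget(host, community, if_index, oid_count):
    """Return an snmpget argument list requesting the first oid_count OIDs."""
    if not 1 <= oid_count <= len(PORT_OIDS):
        raise ValueError(f"oid_count must be between 1 and {len(PORT_OIDS)}")
    oids = [f"{name}.{if_index}" for name in PORT_OIDS[:oid_count]]
    return [
        "/usr/bin/snmpget", "-v2c", "-c", community, "-Pu", "-OQUs",
        "-m", "IF-MIB", "-M", MIB_DIRS, f"udp:{host}:161", *oids,
    ]


if __name__ == "__main__":
    # Print the 24-OID variant, which was reported to return valid output.
    print(shlex.join(build_snmpget("switch-test", "public", 1, 24)))
```

Stepping oid_count up from 1 and pasting each printed command into a shell gives the same kind of threshold test as described above, without editing the long command line by hand.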
On 2018-09-10 10:49, Attila Nagy wrote:
Thanks Mike,
disabling separate_walk solved the issue for the test device. We are gonna go through all the sensitive switches and disable/test them one-by-one, and report back with the final outcome.
Is there anything we can do to further narrow down the problem?
As I mentioned, we have test hardware, so we can run some tests (or we can even put it online if you want direct access to it).
Regards, Tylla
![](https://secure.gravatar.com/avatar/21caf0a08d095be7196a1648d20942be.jpg?s=120&d=mm&r=g)
I wonder if using a max number of gets would still increase performance over walking... :-)
On 9/10/2018 1:45 PM, Attila Nagy wrote:
Further info about this bug: from my testing it seems that using snmpget with lots of OIDs makes the switch freeze. More precisely, the command line used by Observium (quoted in my original mail) has a limit of 24: the 25th OID causes the switch to freeze, while 24 OIDs produce consistent, valid output. Interestingly, when the same OID is repeated (e.g. ifOutOctets.1 ifOutOctets.1 ifOutOctets.1 ..., or ifMtu.1 ifMtu.1 ifMtu.1 ...), 12 is the magic limit. Hmmm...
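The idea of capping the number of OIDs per get can be sketched as follows. This is only an illustration of the batching idea, not how Observium implements polling; chunk_oids and max_per_get are hypothetical names, and the default of 24 is simply the threshold reported in this thread for the T1600G switches. Whether several capped gets would still beat a walk is exactly the open question and would have to be measured.

```python
from typing import List, Sequence


def chunk_oids(oids: Sequence[str], max_per_get: int = 24) -> List[List[str]]:
    """Split a list of OIDs into batches of at most max_per_get entries each."""
    if max_per_get < 1:
        raise ValueError("max_per_get must be at least 1")
    return [
        list(oids[start:start + max_per_get])
        for start in range(0, len(oids), max_per_get)
    ]


if __name__ == "__main__":
    # 30 placeholder OIDs, the same count as in the reported command line.
    example_oids = [f"exampleOid{n}.1" for n in range(1, 31)]
    for batch in chunk_oids(example_oids):
        # Each batch would become the OID list of one snmpget request.
        print(len(batch), batch[0], "...", batch[-1])
```

With 30 OIDs and a cap of 24, this gives two requests per port (24 and 6 OIDs) instead of one.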
![](https://secure.gravatar.com/avatar/0fa97865a0e1ab36152b6b2299eedb49.jpg?s=120&d=mm&r=g)
Presumably this is related to the size of the return packet and some buffer overrun somewhere.
Adam.
participants (4)
- Adam Armstrong
- Attila Nagy
- Mike Stupalov
- Tom Laermans