On 2018-09-07 17:38:12, Attila Nagy <tylla_at_memetic.org@tylla.hu> wrote:
Hi List!
First of all, our thanks go to the team for the new CE version. Thanks a lot!
Now the somewhat bitter part: we updated to the latest CE version yesterday evening, and this morning we found that a few TP-Link switches had frozen. To be more precise, only the management part had frozen, so each switch was degraded to a dumb, unmanaged switch. After a reboot the switches return to full function, only to freeze again on the next SNMP poll.
The switches in question are all T1600G-28TS or T1600G-28PS (poor man's managed gigabit switches :-)).
No newer firmware is available for them.
They worked perfectly with the previous CE version, right up until the update.
We have other TP-Link switches which are not affected by the problem: TL-SG2424, TL-SG2216, TL-SG2008.
We tried to narrow down the problem. Discovery seems to work as it should; the freeze occurs during polling, more precisely when the ports module runs. According to the debug log, the following command triggers the freeze:
/usr/bin/snmpget -v2c -c 'public' -Pu -OQUs -m IF-MIB -M /opt/observium/mibs/rfc:/opt/observium/mibs/net-snmp 'udp':'switch-test':'161' ifInMulticastPkts.1 ifOutMulticastPkts.1 ifInBroadcastPkts.1 ifOutBroadcastPkts.1 ifHCInOctets.1 ifHCOutOctets.1 ifHCInUcastPkts.1 ifHCOutUcastPkts.1 ifHCInMulticastPkts.1 ifHCOutMulticastPkts.1 ifHCInBroadcastPkts.1 ifHCOutBroadcastPkts.1 ifInOctets.1 ifOutOctets.1 ifInUcastPkts.1 ifOutUcastPkts.1 ifInNUcastPkts.1 ifOutNUcastPkts.1 ifInDiscards.1 ifOutDiscards.1 ifInErrors.1 ifOutErrors.1 ifInUnknownProtos.1 ifMtu.1 ifSpeed.1 ifPhysAddress.1 ifAdminStatus.1 ifLastChange.1 ifPromiscuousMode.1 ifConnectorPresent.1
From this point on there is no SNMP response from the switch, the SYS LED stops blinking and stays in whatever state it was in, and there is no network connectivity to the management side (it does not respond to ping and the management interface is inaccessible). The plain switching hardware kept working for some time, but not consistently: some switches froze to a complete halt a while later, while others kept working as dumb switches. As these switches are under constant load, we did not have time to investigate more thoroughly; we restored their functionality and disabled them in Observium.
We have one spare unit that we can put online for testing if needed.
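In case it helps anyone reproduce this, the single big request above can be broken up so that each counter is fetched on its own. The following is only a rough sketch, not something taken from Observium: the function name probe_oids and the SNMPGET variable are made up for illustration, and the 2-second timeout with no retries (-t 2 -r 0) is an arbitrary choice. It stops at the first OID that gets no answer.

```shell
#!/bin/sh
# Hypothetical helper, not part of Observium: fetch each OID in its own
# request so the first one that gets no answer is the likely trigger.
# SNMPGET is an assumed override point; adjust the path to your system.
SNMPGET="${SNMPGET:-/usr/bin/snmpget}"

probe_oids() {
    host="$1"; community="$2"; shift 2
    for oid in "$@"; do
        # -t 2 -r 0: two-second timeout, no retries, so a frozen agent
        # is detected quickly and is not hit again.
        if "$SNMPGET" -v2c -c "$community" -t 2 -r 0 -OQUs -m IF-MIB \
               "$host" "$oid" >/dev/null 2>&1; then
            echo "ok   $oid"
        else
            echo "FAIL $oid"
            return 1
        fi
    done
}
```

It could be called, for example, as `probe_oids switch-test public ifInMulticastPkts.1 ifHCInOctets.1 ifMtu.1`, with the host and community from the command above. A FAIL line only means that snmpget exited non-zero, which is usually a timeout but could be another error, so it is a hint rather than proof. The switch will most likely need a reboot afterwards.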
My questions would be:
- Should I include a full debug log, or is the above info enough?
- Has anyone run into the same problem, or does anyone have a solution to it?
- Can we find out what changed in the above command line between versions, so that we could perhaps disable that single feature as a workaround?
- I suspect this is a firmware issue, but I am not sure. That is why I am asking here first: I have a much better chance of getting a meaningful answer here than from TP-Link's support. Maybe somebody can help me narrow down the firmware bug, if it really is one.
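One thing we have not checked yet is whether the freeze depends on a particular OID or simply on how many OIDs are packed into one request. That the request size matters is purely a guess on my part. A rough sketch for testing it on the spare unit follows; probe_batch_sizes and SNMPGET are hypothetical names, and the timeout values are arbitrary. It sends the first 1, 2, 3, ... OIDs in a single snmpget and reports the size at which the agent stops answering.

```shell
#!/bin/sh
# Hypothetical helper, not part of Observium: grow the request one OID
# at a time and stop at the first request that gets no answer.
SNMPGET="${SNMPGET:-/usr/bin/snmpget}"

probe_batch_sizes() {
    host="$1"; community="$2"; shift 2
    n=0
    batch=""
    for oid in "$@"; do
        n=$((n + 1))
        batch="$batch $oid"
        # $batch is deliberately unquoted so that it splits into one
        # argument per OID.
        if "$SNMPGET" -v2c -c "$community" -t 2 -r 0 -OQUs -m IF-MIB \
               "$host" $batch >/dev/null 2>&1; then
            echo "ok   $n OIDs"
        else
            echo "FAIL $n OIDs (last added: $oid)"
            return 1
        fi
    done
}
```

If the failure always appears at the same count regardless of the order of the OIDs, the request size would be the more likely culprit; if it follows one OID around, that counter would be. Either way the switch probably has to be rebooted between runs.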
Thank you guys in advance.
Best regards,
Tylla