Hi List!
First of all, our thanks go to the team for the new CE version.
Thanks a lot!
Now the somewhat bitter part: we updated to the latest CE version
yesterday evening, and this morning we found that a few TP-Link
switches had frozen. To be more precise, only the management part
was frozen, so each switch was degraded to a dumb, unmanaged
switch. After a reboot the switches return to full function, only
to freeze again on the next SNMP poll.
The switches in question are all T1600G-28TS or T1600G-28PS (poor-man's
manageable gigabit switches :-))
There is no newer firmware update available for them.
They worked perfectly with the previous CE version up until the
update.
We have other TP-Link switches which are not affected by the same
problem: TL-SG2424, TL-SG2216, TL-SG2008.
We tried to narrow down the problem. Discovery seems to work as it
should; the freeze occurs during polling, more precisely when the
ports module runs. According to the debug log, the following
command triggers the freeze:
/usr/bin/snmpget -v2c -c 'public' -Pu -OQUs -m IF-MIB -M
/opt/observium/mibs/rfc:/opt/observium/mibs/net-snmp
'udp':'switch-test':'161' ifInMulticastPkts.1 ifOutMulticastPkts.1
ifInBroadcastPkts.1 ifOutBroadcastPkts.1 ifHCInOctets.1
ifHCOutOctets.1 ifHCInUcastPkts.1 ifHCOutUcastPkts.1
ifHCInMulticastPkts.1 ifHCOutMulticastPkts.1 ifHCInBroadcastPkts.1
ifHCOutBroadcastPkts.1 ifInOctets.1 ifOutOctets.1 ifInUcastPkts.1
ifOutUcastPkts.1 ifInNUcastPkts.1 ifOutNUcastPkts.1 ifInDiscards.1
ifOutDiscards.1 ifInErrors.1 ifOutErrors.1 ifInUnknownProtos.1
ifMtu.1 ifSpeed.1 ifPhysAddress.1 ifAdminStatus.1 ifLastChange.1
ifPromiscuousMode.1 ifConnectorPresent.1
From this point on there is no SNMP response from the switch, the
SYS LED stops blinking and stays in whichever state it was in, and
there is no network connectivity to the switch (it does not
respond to ping and the management interface is inaccessible). The
switching hardware kept working for some time, but inconsistently:
some switches froze completely a while later, while others kept
working as dumb switches. As these are under constant load we
didn't have time to investigate more thoroughly; we restored their
functionality and disabled them in Observium.
We have one spare unit that we can put online for testing if
needed.
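If it helps, one way to narrow this down on the test switch would be
to request each OID from the command above separately and note after
which one the switch stops answering. A rough sketch, not yet run
against an affected switch: the function name probe_oids and the
variables SWITCH_HOST, COMMUNITY, SNMPGET and DELAY are our own
placeholders, and the "-t 2 -r 0" options (2 second timeout, no
retries) are just an assumption about sensible values.

```shell
#!/bin/sh
# Hypothetical helper: send every OID of the failing snmpget as its own
# request, so the last "querying" line shows the suspect OID.
SWITCH_HOST="${SWITCH_HOST:-switch-test}"   # test switch, as in the command above
COMMUNITY="${COMMUNITY:-public}"
SNMPGET="${SNMPGET:-/usr/bin/snmpget}"      # set to "echo" for a dry run
DELAY="${DELAY:-1}"                         # seconds to wait between requests
MIBDIRS=/opt/observium/mibs/rfc:/opt/observium/mibs/net-snmp

# Same OIDs, in the same order, as in the command from the debug log.
OIDS="ifInMulticastPkts ifOutMulticastPkts ifInBroadcastPkts ifOutBroadcastPkts
ifHCInOctets ifHCOutOctets ifHCInUcastPkts ifHCOutUcastPkts
ifHCInMulticastPkts ifHCOutMulticastPkts ifHCInBroadcastPkts ifHCOutBroadcastPkts
ifInOctets ifOutOctets ifInUcastPkts ifOutUcastPkts
ifInNUcastPkts ifOutNUcastPkts ifInDiscards ifOutDiscards
ifInErrors ifOutErrors ifInUnknownProtos
ifMtu ifSpeed ifPhysAddress ifAdminStatus ifLastChange
ifPromiscuousMode ifConnectorPresent"

probe_oids() {
    for oid in $OIDS; do
        echo "querying $oid.1"
        # Stop at the first request that fails or times out.
        "$SNMPGET" -v2c -c "$COMMUNITY" -Pu -OQUs -m IF-MIB -M "$MIBDIRS" \
            -t 2 -r 0 "udp:$SWITCH_HOST:161" "$oid.1" || {
            echo "no answer for $oid.1, stopping"
            return 1
        }
        sleep "$DELAY"
    done
}
```

Calling probe_oids after sourcing the file runs the loop; with
SNMPGET=echo it only prints the commands it would send. If no single
OID triggers the freeze, the cause may instead be the size of the
combined request, which this loop would not reveal.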
My questions would be:
- Should I include a full debug log, or is the above info enough?
- Has anyone run into the same problem, or found a solution to it?
- Can we find out what changed in the above command line, so that
as a workaround we could disable that single feature?
- I suspect this is a firmware issue, but I am not sure (that's why
I'm asking here first; I have a far better chance of getting
meaningful answers here than from TP-Link's support). Maybe
somebody can help me narrow down the firmware bug, if it really
is one.
Thank you guys in advance.
Best regards,
Tylla