Discovery of large stack causes hardware to become unresponsive
Hello, I am testing Observium in a new environment after previous success stories elsewhere. We have noted that when we run device discovery on a stack of 5 Cat3750 (12.2-50-SE1), the "device" becomes unresponsive to any other queries, pings, or SSH for approximately 5-6 minutes (during the discovery cycle).
Once the discovery is complete, the poller, pings, and other operations return to normal. Note that traffic flow does not appear to be affected across the switch.
Total ports on the stack are 278 (no argument that it is a large number of ports to poll from a single device).
Regular polling of the device does not appear to have any consequences; the problem is specific to the discovery module.
Does anyone have any insight into this?
Thanks Patrick
I have noticed this too, and after doing basic snmpwalks and then snmpgets on a device (bladecenter switch in my case) it appears that there are specific GET requests that seem to lock up the switch for 5 (or so) minutes.
I wrote it off as a Cisco quirk.
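One way to narrow down which GET request triggers the lock-up is to query OIDs one at a time and check after each whether the agent still answers something trivial. The sketch below is only an illustration: it assumes the net-snmp `snmpget` command is installed, and the function names, host, and community string are made up for the example.

```python
import subprocess
from typing import Callable, Iterable, Optional

SYS_UPTIME = "1.3.6.1.2.1.1.3.0"  # sysUpTime.0, used here as a cheap health check


def snmpget(host: str, community: str, oid: str, timeout_s: int = 5) -> bool:
    """Return True if a single net-snmp `snmpget` answers within the timeout."""
    cmd = ["snmpget", "-v2c", "-c", community,
           "-t", str(timeout_s), "-r", "0", host, oid]
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=timeout_s + 2)
    except (subprocess.TimeoutExpired, FileNotFoundError):
        return False
    return result.returncode == 0


def first_oid_that_stalls(oids: Iterable[str],
                          get: Callable[[str], bool]) -> Optional[str]:
    """Query OIDs in order; return the first one after which the agent goes quiet."""
    for oid in oids:
        get(oid)                  # result ignored: a missing OID is not a stall
        if not get(SYS_UPTIME):   # agent no longer answers a trivial request
            return oid
    return None

# Example (hypothetical host and community):
#   culprit = first_oid_that_stalls(
#       oids, lambda oid: snmpget("192.0.2.10", "public", oid))
```

Passing the query function in as `get` keeps the search logic separate from the SNMP call, so it can be tried against a fake agent before pointing it at a production switch.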
On Mon, Mar 4, 2013 at 8:02 AM, Patrick Zaloum pzaloum@gmail.com wrote:
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium
It's probably related to either ARP or FDB tables.
You can run discovery by hand as Tom suggests and then disable the module causing the issues.
Next time you might think twice about buying shitty 3750s (and using the pointless "stacking" crap, which basically has resulted in your 300 ports having a single, massively underpowered control-plane CPU) :)
adam.
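Running discovery by hand, one module at a time, could be scripted roughly as below. This is a sketch under assumptions: it relies on `discovery.php` accepting `-h <hostname>` and `-m <module>`, and the module names and hostname in the comment are placeholders, so check them against your own install.

```python
import time
from typing import Callable, Dict, List, Sequence


def discovery_command(hostname: str, module: str,
                      script: str = "./discovery.php") -> List[str]:
    """Build the argv for one single-module discovery run."""
    return [script, "-h", hostname, "-m", module]


def time_modules(hostname: str, modules: Sequence[str],
                 run: Callable[[List[str]], None]) -> Dict[str, float]:
    """Run each module on its own and record how long it took, in seconds."""
    timings: Dict[str, float] = {}
    for module in modules:
        start = time.monotonic()
        run(discovery_command(hostname, module))
        timings[module] = time.monotonic() - start
    return timings


def slowest(timings: Dict[str, float]) -> str:
    """Return the name of the module that took the longest."""
    return max(timings, key=timings.get)

# Example (hypothetical hostname and module names):
#   import subprocess
#   t = time_modules("sw-stack-1", ["ports", "vlans"],
#                    lambda argv: subprocess.run(argv, check=False))
#   print(slowest(t))
```

A module that takes minutes while the switch stops answering is the candidate to disable for that device.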
OK, first things first: I didn't choose the hardware or setup, I've just walked into the place :)
To clarify the observations, SNMP stops working, other services are OK. Watching the CPU on the switch, CPU isn't very high (about 35%) and SNMP server is using about 10%.
It kicks into high gear and stops responding during the dot1dStpPortEntry and dot1dBasePortEntry polling once it hits a few VLANs that exist in the VLAN table but are not in use. Our active VLANs pass without any issues. Once it hits this wall, no further queries get responses.
Polling these same values on the unused VLANs from other switches in the same VTP domain causes no issues (various models and versions).
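To reproduce this outside Observium, the per-VLAN bridge tables can be walked by hand using Cisco's per-VLAN community string indexing (`community@vlan`). The sketch below only builds the `snmpwalk` command lines and skips VLANs with no ports; the function name and the port-count input are assumptions for illustration, and skipping unused VLANs is just one idea to test, not a fix confirmed here.

```python
from typing import Dict, List

DOT1D_BASE_PORT_ENTRY = "1.3.6.1.2.1.17.1.4.1"   # BRIDGE-MIB dot1dBasePortEntry
DOT1D_STP_PORT_ENTRY = "1.3.6.1.2.1.17.2.15.1"   # BRIDGE-MIB dot1dStpPortEntry


def walk_plan(host: str, community: str,
              vlan_port_counts: Dict[int, int]) -> List[List[str]]:
    """Return snmpwalk argv lists for every VLAN that has at least one port."""
    plan: List[List[str]] = []
    for vlan in sorted(vlan_port_counts):
        if vlan_port_counts[vlan] == 0:
            continue  # VLAN exists in the VLAN table but is unused: skip it
        indexed = f"{community}@{vlan}"  # Cisco per-VLAN community indexing
        for oid in (DOT1D_BASE_PORT_ENTRY, DOT1D_STP_PORT_ENTRY):
            plan.append(["snmpwalk", "-v2c", "-c", indexed, host, oid])
    return plan

# Example (hypothetical data): VLAN 999 is defined but has no ports.
#   for argv in walk_plan("192.0.2.10", "public", {10: 24, 20: 3, 999: 0}):
#       print(" ".join(argv))
```

Running the skipped VLANs separately, one at a time, would show whether a single unused VLAN is enough to stall the SNMP agent.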
On Mon, Mar 4, 2013 at 12:26 PM, Adam Armstrong adama@memetic.org wrote:
This is likely because the control plane is going to the individual device's forwarding ASICs to get information. The code doing that is often a bit inefficient or buggy.
Witness the amazing QC on code from Cisco which is aimed at "enterprise" applications :)
The answer to "3750 stack?" is almost always "6500" (sometimes 4500) :)
adam.
Or you can just go with a nice Juniper VC and forget Cisco altogether ;)
Participants (4): Adam Armstrong, Alex Pressé, Morgan McLean, Patrick Zaloum