Discovery of large stack causes hardware to become unresponsive

Patrick Zaloum

4 Mar 2013

4:02 p.m.

Hello, we are testing Observium in a new environment after previous success stories elsewhere. We have noted that when we run device discovery on a stack of 5 Cat3750 (12.2-50-SE1), the "device" becomes unresponsive to any other queries / pings / ssh for approximately 5-6 minutes (during the discovery cycle).

Once the discovery is complete, the poller, pings, and other operations return to normal. Note that traffic flow does not appear to be affected across the switch.

Total ports on the stack are 278 (no argument that it is a large number of ports to poll from a single device).

Regular polling of the device does not appear to have any consequences; the problem is specific to the discovery module.

Does anyone have any insight into this?

Thanks Patrick

Alex Pressé

4 Mar 2013

4:26 p.m.

I have noticed this too. After running basic snmpwalks and then snmpgets against a device (a BladeCenter switch in my case), it appears that specific GET requests lock up the switch for 5 (or so) minutes.

I wrote it off as a Cisco quirk.

observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium

-- Alex Presse "How much net work could a network work if a network could net work?"

Adam Armstrong

6:26 p.m.

It's probably related to either ARP or FDB tables.

You can run discovery by hand as Tom suggests and then disable the module causing the issues.

Next time you might think twice about buying shitty 3750s (and using the pointless "stacking" crap, which basically has resulted in your 300 ports having a single, massively underpowered control plane CPU) :)

adam.
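
Editor's sketch of the "run discovery by hand" suggestion above: the snippet below times one discovery module at a time so the slow one stands out. It assumes that Observium's discovery.php accepts `-h <host>` and `-m <module>`; the script path, host name, and module names are placeholders, not values taken from this thread.

```python
import subprocess
import time


def discovery_command(host, module, script="./discovery.php"):
    """Build the command line for a single discovery module.

    Assumption: discovery.php takes -h <host> and -m <module>.
    """
    return [script, "-h", host, "-m", module]


def time_modules(host, modules, run=subprocess.run):
    """Run each module in turn and return {module: elapsed seconds}."""
    timings = {}
    for module in modules:
        start = time.monotonic()
        run(discovery_command(host, module), check=False)
        timings[module] = time.monotonic() - start
    return timings
```

Calling `time_modules("stack1.example.net", ["ports", "arp-table"])` (hypothetical host and module names) from the Observium install directory would return one timing per module. A module that takes minutes, or coincides with the switch going quiet, is the candidate to disable.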

Patrick Zaloum

9:13 p.m.

OK, first things first: I didn't choose the hardware or the setup, I've just walked into the place :)

To clarify the observations, SNMP stops working, other services are OK. Watching the CPU on the switch, CPU isn't very high (about 35%) and SNMP server is using about 10%.

It kicks into high gear and stops responding during the dot1dStpPortEntry / dot1dBasePortEntry polling once it hits a few VLANs that exist in the VLAN table but are not used. Our active VLANs pass without any issues. Once it hits this wall, no further queries get responses.

Polling these same values on the unused VLANs from other switches in the same VTP domain causes no issues (various models and versions).
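
Editor's sketch: one way to reproduce the observation above outside Observium is to walk the two BRIDGE-MIB tables one VLAN at a time and note which VLAN contexts fail or stall. The snippet assumes net-snmp's `snmpwalk` is installed and that the switch uses Cisco's `community@vlan` community string indexing. The host, community, VLAN list, and threshold are placeholders. Because the walk may itself trigger the lockup, it is best tried in a maintenance window.

```python
import subprocess
import time

# Standard BRIDGE-MIB table entries named in the message above.
OIDS = {
    "dot1dBasePortEntry": "1.3.6.1.2.1.17.1.4.1",
    "dot1dStpPortEntry": "1.3.6.1.2.1.17.2.15.1",
}


def walk_command(host, community, vlan, oid, timeout=5):
    """Build an snmpwalk for one VLAN's bridge context (community@vlan)."""
    return ["snmpwalk", "-v2c", "-c", f"{community}@{vlan}",
            "-t", str(timeout), "-r", "0", host, oid]


def slow_vlans(host, community, vlans, threshold=5.0, run=subprocess.run):
    """Return (vlan, table) pairs whose walk failed or took >= threshold seconds."""
    slow = []
    for vlan in vlans:
        for name, oid in OIDS.items():
            start = time.monotonic()
            result = run(walk_command(host, community, vlan, oid),
                         capture_output=True, text=True)
            elapsed = time.monotonic() - start
            if result.returncode != 0 or elapsed >= threshold:
                slow.append((vlan, name))
    return slow
```

For example, `slow_vlans("stack1.example.net", "public", [10, 20, 999])` (all hypothetical values) would list the VLAN and table combinations that timed out, which can then be compared against the list of unused VLANs.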

Adam Armstrong

8:33 p.m.

This is likely because the control plane is going to the individual device's forwarding ASICs to get information. The code doing that is often a bit inefficient or buggy.

Witness the amazing QC on code from Cisco which is aimed at "enterprise" applications :)

The answer to "3750 stack?" is almost always "6500" (sometimes 4500) :)

adam.

Morgan McLean

5 Mar 2013

12:38 a.m.

Or you can just go with a nice Juniper VC and forget Cisco altogether ;)

-- Thanks, Morgan
