Hi,
I'm having problems with the performance of the poller. We have 200 switches and almost 16000 ports included, and the poller is having trouble getting them all checked within 5 minutes. With 20 threads it runs with a load of 15. When I add more threads it takes longer to get the data from the switches; with fewer threads it also doesn't check all switches within 5 minutes. I'm using new, fast hardware.
I've implemented all the performance tuning tips on the Observium site. I also disabled the checking of fdb-table, arp-table and mac-accounting. When I debug the poller, the ports check takes the longest: about 90% of the total time. One switch takes about 8 to 10 seconds now.
Is there a way to speed it up? Maybe I can extend the check cycle time to 10 minutes? Or use distributed poller instances?
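For reference, per-module timings like this can be checked by running the poller in debug mode against a single device; a minimal sketch, assuming the standard poller.php options, the default /opt/observium install path and a placeholder hostname:

```bash
# Poll one device in debug mode, running only the ports module,
# so the time spent on ports polling is visible in isolation.
cd /opt/observium
./poller.php -h switch01.example.net -d -m ports
```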
Regards,
Frederik Reenders | ICTS Operation Centre | University of Twente | ICT Service Centre | ICTS Server Operation | P.O.Box 217, 7500 AE Enschede The Netherlands | Drienerlolaan 5, 7522 NB Enschede | Campus building: Spiegel, room 416 | T: +31 53 489 2653/6723 | f.reenders@utwente.nl | www.utwente.nl/icts/en/
We are running poller-wrapper with 64 threads, and our 250 devices and 20000 ports get polled in about 2 minutes.
We are running this on SSD drives for performance.
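The thread count is simply the argument passed to poller-wrapper.py; a sketch of the kind of cron entry involved, assuming the stock /opt/observium layout from the install docs:

```bash
# /etc/cron.d/observium -- run the poller wrapper every 5 minutes with 64 threads
*/5 * * * * root /opt/observium/poller-wrapper.py 64 >> /dev/null 2>&1
```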
/Peter
Hi Peter,
Do you have all the modules enabled?
When I set it to 64 threads the load goes up to 60, but it completes the cycle in 3 minutes. I don't see much network traffic, and not very much disk access either.
What load is your system seeing?
Regards,
Frederik
In ~observium/observium.log, per device, and from the time difference between the last checked host and the start of the new discovery.php process.
Regards,
Frederik
On 2013-10-09 14:05, Laurens Vets wrote:
Where can I quickly see this?
On 2013-10-09 14:11, F.Reenders@utwente.nl wrote:
In ~observium/observium.log, per device, and from the time difference between the last checked host and the start of the new discovery.php process.
Oh nice, we're doing all our polling in 2 minutes with 150 switches and approx. 16000 ports on 5+ year old hardware (no SSDs)...
Your bottleneck is almost certainly I/O. We write a *lot* of data.
The solution is either a RAM disk or an SSD.
20k ports should fit in 48GB of RAM. There are instructions on how to do this properly on the wiki. With a RAM disk you don't have to worry about I/O at all, only CPU and network performance.
Alternatively you can use an SSD, which has much higher throughput than a hard disk (still not unlimited, and write speeds can be a bit slow).
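A minimal sketch of the RAM-disk approach, assuming a tmpfs over the default /opt/observium/rrd path (the wiki recipe linked later in the thread adds the persistence/sync-to-disk part, which this sketch omits):

```bash
# Mount a 48 GB tmpfs over the RRD directory (size it to fit your RRD set)
mount -t tmpfs -o size=48g tmpfs /opt/observium/rrd

# Equivalent /etc/fstab entry:
# tmpfs  /opt/observium/rrd  tmpfs  size=48g  0  0
#
# tmpfs contents are lost on reboot, so the RRDs must be copied to real disk
# on shutdown and restored on boot -- that is what the "persistent RAM disk"
# wiki page covers.
```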
adam.
OK, thanks. I'll have a look at the RAM disk then.
It's a new page on the wiki? I didn't see it last week.
Regards,
Frederik
On 2013-10-09 13:02, F.Reenders@utwente.nl wrote:
OK, thanks. I'll have a look at the RAM disk then.
It's a new page on the wiki? I didn't see it last week.
It's been there since 2009 :)
http://www.observium.org/wiki/Persistent_RAM_disk_RRD_storage
It's been there for over a year.
But do you see "load of 60" as being a problem? As long as the polls finish in 3 minutes, I don't really see an issue, as far as data collection goes.
Web interface may be sluggish due to I/O load though.
Tom
I must have missed it. I will try it today or tomorrow and share the results.
The system is still responsive, so it's not that big a deal.
Regards,
Frederik
Hi,
I tried the RAM disk: 48 GB, and I only have 64 GB of RAM. It slowed my system down compared to the RRDs on disk. Maybe there wasn't enough memory left for all the other processes.
Then I added some more switches and my RAM disk wasn't big enough anymore. Now I have 280 switches/routers and 25000 ports, a system load of 100+, and it's not completing one cycle in 5 minutes. I tried 80, 100 and 120 threads, but when I go past 80 threads the time to check one switch increases too much. Some routers take up to 200 seconds to check. I could set up a second system to check the routers...
A distributed option is a solution, I think, as I still need to add more switches.
Regards,
Frederik
Post your hardware specs.
We do 236 devices / 14000 ports, with /opt/observium/rrd on an XFS SSD mounted with noatime and other tweaks.
16-core E5620 @ 2.4 GHz, 16 GB RAM
Load = 12:38:56 up 2:28, 8 users, load average: 0.11, 0.18, 0.19
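For illustration, a mount along those lines in /etc/fstab (the device name and the extra nodiratime option are assumptions):

```bash
# /etc/fstab -- RRD volume on XFS without access-time updates
/dev/sdb1  /opt/observium/rrd  xfs  noatime,nodiratime  0  2
```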
Maarten
-- Maarten Moerman | Mgr, Network Engineering | eBay Classifieds | +31-655122247 | mmoerman@ebay.com
I've got 2 x Intel(R) Xeon(R) E5-2650 @ 2.00 GHz, 64 GB of RAM and 2 x RAID 1 SAS arrays (4 disks) at 10k RPM. It is an HP DL380 G8. Load is now 200 :) but that's because 5 minutes is not enough.
I will also try noatime.
Frederik
I'm running on a pretty similar system; a Dell R720 with the same CPUs & RAM, and 2 x 7 15K SAS in RAID 10. We also V2P-ed our server when we upgraded to this machine, and I was pretty underwhelmed at the performance improvement considering it's a bit overkill for our environment (details below).
Then I started digging into the poller stats and found that some of my remote Linux servers (which run the PPPoE for the branch's ADSL connection) were taking around 200-300 seconds per poll. When I ran the poller in debug mode I found that the interfaces poll was taking a huge proportion of the poller's run time, even though they only have 3 NICs, plus ppp0 for ADSL and tun0 for OpenVPN. Because the kernel gives both ppp0 and tun0 a new interface id every time the connection goes down and comes back up, net-snmp was reporting each one as a new interface (duly noting a warning in syslog), so we were ending up with hundreds of interfaces per server over time, and net-snmp seems to be particularly inefficient at reporting them (or perhaps Observium is trying to poll too much data from non-existent ports?).
Regardless, I rolled out a script with puppet to restart net-snmp every time ppp0 or tun0 comes up. Now we have poll times for all those hosts < 20 secs and the load on our server doesn't go over about 1.5, even during polls with 32 concurrent pollers.
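A minimal sketch of such a hook for a Debian-style system, where pppd runs scripts from /etc/ppp/ip-up.d/ (Paul distributed his version with puppet; the paths and service name here are assumptions, and OpenVPN would need an equivalent `up` script for tun0):

```bash
#!/bin/sh
# /etc/ppp/ip-up.d/restart-snmpd (illustrative)
# Restart net-snmp when the PPP link comes back up, so snmpd re-reads the
# interface table instead of accumulating stale ppp0 instances.
if [ "$PPP_IFACE" = "ppp0" ]; then
    /etc/init.d/snmpd restart
fi
```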
Regards, Paul
P.S. Device/port counts from our Observium installation:
| | Total | Up | Down | Ignored | Disabled |
|---|---|---|---|---|---|
| [Devices](http://observium.buq.org.au/devices/) | [187](http://observium.buq.org.au/devices/) | [165 up](http://observium.buq.org.au/devices/status=1/) | [3 down](http://observium.buq.org.au/devices/status=0/) | [5 ignored](http://observium.buq.org.au/devices/ignore=1/) | [14 disabled](http://observium.buq.org.au/devices/disabled=1/) |
| [Ports](http://observium.buq.org.au/ports/) | [2614](http://observium.buq.org.au/ports/) | [1044 up](http://observium.buq.org.au/ports/state=up/) | [12 down](http://observium.buq.org.au/ports/state=down/) | [1275 ignored](http://observium.buq.org.au/ports/ignore=1/) | [173 shutdown](http://observium.buq.org.au/ports/state=admindown/) |
Hi,
I solved my problem with the slow performance!
The problem was the PHP CLI. This is a known problem that was already an issue in 2007, as I noticed when searching on Google. You can spot it by running `php -i` on the command line: if it takes a while to display anything and waits about 5 seconds before giving the prompt back, you have this problem. It can be solved by removing PHP modules one by one to find which one is causing it on your system. In my case it was the snmp.so module. With that removed, a simple command like `php -i` outputs everything in about a second.
When I run the poller now all my 412 devices with 32000 ports are checked in under 3 minutes.
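A quick way to reproduce that check; the config paths below are distribution-dependent and only illustrative:

```bash
# Time the PHP CLI startup; several extra seconds here points at a slow extension.
time php -i > /dev/null

# Find where snmp.so is being loaded (location varies by distribution):
grep -r "snmp.so" /etc/php5/ /etc/php.d/ 2>/dev/null

# Comment out the matching "extension=snmp.so" line (or remove that ini file),
# then re-run the timing above to confirm the startup delay is gone.
```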
Thanks for all the ideas for solving this.
Regards,
Frederik
Hi Frederik,
Great news! I wonder what makes the snmp module load so slow...
We don't use it, so unless you're sharing the system with other things which use it, it's indeed best to load as few modules as possible.
Tom
Hi Tom,
The bug was a combination of snmp.so with mysql.so and other modules.
Observium is indeed running fine without snmp.so.
Maybe you should remove it from the dependency list in the install doc? :)
Regards,
Frederik
Just to throw in my two pence worth;
We have 185 devices being polled and it happens in about 1 minute max. The server is 2x 15k SAS drives in HW RAID 1, 2x 2.5 GHz quad-core Xeon, 2x 2 GB RAM. Fairly low-end server, I'd say!
Our Cacti server polls 10k data sources from 180 devices in about 1.5 - 2 minutes (although that uses the Boost plugin, which throws everything into memory and flushes it to disk later; with Observium we are going straight to RRDs on disk).
If you can't manage the 5-minute poll with a RAM disk, I'd still continue to investigate your server setup and performance tweaks.
James.
On 10 October 2013 13:49, James Bensley <jwbensley@gmail.com> wrote:
2x 2.5 GHz quad-core Xeon
Should be 1x ^ quad core
Our Cacti server polls 10k data sources from 180 devices in about 1.5 - 2 minutes...
What I missed off there to make that more relevant is that the Cacti server is a bit slower!
James.
The performance tweaks are exactly what I'm trying to find. I also monitor my server with Munin for basic performance stats, but I see no real problems there.
Frederik
Disable the fdb-table module globally.
adam.
I disabled it now. Last time I disabled this I noticed the biggest bottleneck was the port module. That takes about 85% of the total time.
Frederik
That's because ports accounts for 85% of total I/O :)
You can see how much I/O is being caused by running iostat or iotop. You are almost certainly running far more poller processes than your I/O system can sustain.
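For example, assuming the sysstat and iotop packages are installed (device names will differ):

```bash
# Extended per-device I/O statistics every 5 seconds; watch %util and await
# for the volume holding the RRDs while a poll is running.
iostat -x 5

# Per-process I/O, showing only processes actually doing I/O, with accumulated
# totals; rrdtool/rrdcached and the poller processes should dominate if the
# disks are the bottleneck.
iotop -a -o
```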
The fdb-table module is a lot of SNMP, due to Cisco being lazy and not implementing a sensible place to get MAC address tables from.
You are definitely running too many threads.
adam.
Hi Adam,
With 50 threads it almost makes the 5 minutes for one cycle, and with a lower load of 45. iotop shows 6 rrdcached processes that are active. I should try running 100 threads to see what the I/O impact is.
I have a different question. In the menu under ports I see alerts, with 14000 of them. Will the alerts go away by themselves, or is there an option to clear them?
And a second question regarding the fdb-table: if I disable it in config.php, the web page still remains and so does the data in MySQL. Is there a way to delete this data, or will it go away by itself?
Regards,
Frederik
I found that setting vm.dirty_writeback_centisecs in /etc/sysctl.conf to a value higher than the poll interval gave a lot of the benefits of the RAM disk without the overhead of managing it. I figure the risk of losing power or the system crashing is fairly low and the worst that can happen is the loss of a poll or two.
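A sketch of that tuning, assuming a 5-minute poll interval; the exact values are illustrative, and vm.dirty_expire_centisecs is typically raised alongside it so dirty pages aren't flushed out early anyway:

```bash
# /etc/sysctl.conf -- keep dirty RRD pages in the page cache longer than one
# poll cycle (values are in hundredths of a second; 60000 = 10 minutes)
vm.dirty_writeback_centisecs = 60000
vm.dirty_expire_centisecs = 60000

# Apply without a reboot:
#   sysctl -p
```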
Paul
participants (8)
- Adam Armstrong
- F.Reenders@utwente.nl
- James Bensley
- Laurens Vets
- Moerman, Maarten
- Paul Gear
- Peter Persson
- Tom Laermans