Hi,
I'm having problems with the performance of the poller. We have 200 switches and almost 16000 ports included, and the poller is having trouble getting them all checked within 5 minutes. With 20 threads it runs with a load of 15. When I add more threads it takes longer to get the data from the switches; with fewer threads it also doesn't check all switches within 5 minutes. I'm using new, fast hardware.
I've implemented all the performance tuning tips on the Observium site. I also disabled the checking of fdb-table, arp-table and mac-accounting. When I debug the poller, the ports check takes the longest: about 90% of the total time. One switch takes about 8 to 10 seconds now.
Is there a way to speed it up? Maybe I can extend the check cycle time to 10 minutes? Or use distributed poller instances?
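For reference, per-module timings like this can be checked by running the poller in debug mode against a single device; a minimal sketch, assuming the standard poller.php options, the default /opt/observium install path and a placeholder hostname:

```bash
# Poll one device in debug mode, running only the ports module,
# so the time spent on ports polling is visible in isolation.
cd /opt/observium
./poller.php -h switch01.example.net -d -m ports
```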
Regards,
Frederik Reenders | ICTS Operation Centre | University of Twente | ICT Service Centre | ICTS Server Operation | P.O.Box 217, 7500 AE Enschede The Netherlands | Drienerlolaan 5, 7522 NB Enschede | Campus building: Spiegel, room 416 | T: +31 53 489 2653/6723 | f.reenders@utwente.nl | www.utwente.nl/icts/en/
We are running poller-wrapper with 64 threads, and our 250 devices and 20000 ports get polled in about 2 minutes.
We are running this on SSD drives for performance.
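The thread count is simply the argument passed to poller-wrapper.py; a sketch of the kind of cron entry involved, assuming the stock /opt/observium layout from the install docs:

```bash
# /etc/cron.d/observium -- run the poller wrapper every 5 minutes with 64 threads
*/5 * * * * root /opt/observium/poller-wrapper.py 64 >> /dev/null 2>&1
```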
/Peter
Hi Peter,
Do you have all the modules enabled?
When I set it to 64 threads the load goes up to 60, but it completes the cycle in 3 minutes. I don't see much network traffic, and not very much disk access either.
What load is your system seeing?
Regards,
Frederik
In ~observium/observium.log, per device, and from the time difference between the last checked host and the start of the new discovery.php process.
Regards,
Frederik
On 2013-10-09 14:05, Laurens Vets wrote:
Where can I quickly see this?
On 2013-10-09 14:11, F.Reenders@utwente.nl wrote:
In ~observium/observium.log, per device, and from the time difference between the last checked host and the start of the new discovery.php process.
Oh nice, we're doing all our polling in 2 minutes with 150 switches and approx. 16000 ports on 5+ year old hardware (no SSDs)...
Your bottleneck is almost certainly I/O. We write a *lot* of data.
The solution is either a RAM disk or an SSD.
20k ports should fit in 48GB of RAM. There are instructions on how to do this properly on the wiki. With a RAM disk you don't have to worry about I/O at all, only CPU and network performance.
Alternatively you can use an SSD, which has much higher throughput than a hard disk (still not unlimited, and write speeds can be a bit slow).
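A minimal sketch of the RAM-disk approach, assuming a tmpfs over the default /opt/observium/rrd path (the wiki recipe linked later in the thread adds the persistence/sync-to-disk part, which this sketch omits):

```bash
# Mount a 48 GB tmpfs over the RRD directory (size it to fit your RRD set)
mount -t tmpfs -o size=48g tmpfs /opt/observium/rrd

# Equivalent /etc/fstab entry:
# tmpfs  /opt/observium/rrd  tmpfs  size=48g  0  0
#
# tmpfs contents are lost on reboot, so the RRDs must be copied to real disk
# on shutdown and restored on boot -- that is what the "persistent RAM disk"
# wiki page covers.
```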
adam.
OK, thanks. I'll have a look at the RAM disk then.
It's a new page on the wiki? I didn't see it last week.
Regards,
Frederik
On 2013-10-09 13:02, F.Reenders@utwente.nl wrote:
OK, thanks. I'll have a look at the RAM disk then.
It's a new page on the wiki? I didn't see it last week.
It's been there since 2009 :)
http://www.observium.org/wiki/Persistent_RAM_disk_RRD_storage
It's been there for over a year.
But do you see "load of 60" as being a problem? As long as the polls finish in 3 minutes, I don't really see an issue, as far as data collection goes.
Web interface may be sluggish due to I/O load though.
Tom
I must have missed it. I will try it today or tomorrow and share the results.
The system is still responsive, so it's not that big a deal.
Regards,
Frederik
Hi,
I tried the RAM disk: 48 GB, and I only have 64 GB of RAM. It slowed my system down compared to the RRDs on disk. Maybe there wasn't enough memory left for all the other processes.
Then I added some more switches and my RAM disk wasn't big enough anymore. Now I have 280 switches/routers and 25000 ports, a system load of 100+, and it's not completing one cycle in 5 minutes. I tried 80, 100 and 120 threads, but when I go past 80 threads the time to check one switch increases too much. Some routers take up to 200 seconds to check. I could set up a second system to check the routers...
A distributed option is a solution, I think, as I still need to add more switches.
Regards,
Frederik
Post your hardware specs.
We do 236 devices / 14000 ports, with /opt/observium/rrd on an XFS SSD mounted with noatime and other tweaks.
16-core E5620 @ 2.4 GHz, 16 GB RAM
Load = 12:38:56 up 2:28, 8 users, load average: 0.11, 0.18, 0.19
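For illustration, a mount along those lines in /etc/fstab (the device name and the extra nodiratime option are assumptions):

```bash
# /etc/fstab -- RRD volume on XFS without access-time updates
/dev/sdb1  /opt/observium/rrd  xfs  noatime,nodiratime  0  2
```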
Maarten
-- Maarten Moerman | Mgr, Network Engineering | eBay Classifieds | +31-655122247 | mmoerman@ebay.com
I've got 2 x Intel(R) Xeon(R) E5-2650 @ 2.00 GHz, 64 GB of RAM and 2 x RAID 1 SAS arrays (4 disks) at 10k RPM. It is an HP DL380 G8. Load is now 200 :) but that's because 5 minutes is not enough.
I will also try noatime.
Frederik
I'm running on a pretty similar system; a Dell R720 with the same CPUs & RAM, and 2 x 7 15K SAS in RAID 10. We also V2P-ed our server when we upgraded to this machine, and I was pretty underwhelmed at the performance improvement considering it's a bit overkill for our environment (details below).
Then I started digging into the poller stats and found that some of my remote Linux servers (which run the PPPoE for the branch's ADSL connection) were taking around 200-300 seconds per poll. When I ran the poller in debug mode I found that the interfaces poll was taking a huge proportion of the poller's run time, even though they only have 3 NICs, plus ppp0 for ADSL and tun0 for OpenVPN. Because the kernel gives both ppp0 and tun0 a new interface id every time the connection goes down and comes back up, net-snmp was reporting each one as a new interface (duly noting a warning in syslog), so we were ending up with hundreds of interfaces per server over time, and net-snmp seems to be particularly inefficient at reporting them (or perhaps Observium is trying to poll too much data from non-existent ports?).
Regardless, I rolled out a script with puppet to restart net-snmp every time ppp0 or tun0 comes up. Now we have poll times for all those hosts < 20 secs and the load on our server doesn't go over about 1.5, even during polls with 32 concurrent pollers.
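A minimal sketch of such a hook for a Debian-style system, where pppd runs scripts from /etc/ppp/ip-up.d/ (Paul distributed his version with puppet; the paths and service name here are assumptions, and OpenVPN would need an equivalent `up` script for tun0):

```bash
#!/bin/sh
# /etc/ppp/ip-up.d/restart-snmpd (illustrative)
# Restart net-snmp when the PPP link comes back up, so snmpd re-reads the
# interface table instead of accumulating stale ppp0 instances.
if [ "$PPP_IFACE" = "ppp0" ]; then
    /etc/init.d/snmpd restart
fi
```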
Regards, Paul
P.S. Device/port counts from our Observium installation:
| | Total | Up | Down | Ignored | Disabled |
|---|---|---|---|---|---|
| [Devices](http://observium.buq.org.au/devices/) | [187](http://observium.buq.org.au/devices/) | [165 up](http://observium.buq.org.au/devices/status=1/) | [3 down](http://observium.buq.org.au/devices/status=0/) | [5 ignored](http://observium.buq.org.au/devices/ignore=1/) | [14 disabled](http://observium.buq.org.au/devices/disabled=1/) |
| [Ports](http://observium.buq.org.au/ports/) | [2614](http://observium.buq.org.au/ports/) | [1044 up](http://observium.buq.org.au/ports/state=up/) | [12 down](http://observium.buq.org.au/ports/state=down/) | [1275 ignored](http://observium.buq.org.au/ports/ignore=1/) | [173 shutdown](http://observium.buq.org.au/ports/state=admindown/) |
Hi,
I solved my problem with the slow performance!
The problem was the PHP CLI. This is a known problem that was already an issue in 2007, as I noticed when searching on Google. You can spot it by running `php -i` on the command line: if it takes a while to display anything and waits about 5 seconds before giving the prompt back, you have this problem. It can be solved by removing PHP modules one by one to find which one is causing it on your system. In my case it was the snmp.so module. With that removed, a simple command like `php -i` outputs everything in about a second.
When I run the poller now all my 412 devices with 32000 ports are checked in under 3 minutes.
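A quick way to reproduce that check; the config paths below are distribution-dependent and only illustrative:

```bash
# Time the PHP CLI startup; several extra seconds here points at a slow extension.
time php -i > /dev/null

# Find where snmp.so is being loaded (location varies by distribution):
grep -r "snmp.so" /etc/php5/ /etc/php.d/ 2>/dev/null

# Comment out the matching "extension=snmp.so" line (or remove that ini file),
# then re-run the timing above to confirm the startup delay is gone.
```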
Thanks for all the ideas for solving this.
Regards,
Frederik
Hi Frederik,
Great news! I wonder what makes the snmp module load so slow...
We don't use it, so unless you're sharing the system with other things which use it, it's indeed best to load as few modules as possible.
Tom
Hi Tom,
The bug was a combination of snmp.so with mysql.so and other modules.
Observium is indeed running fine without snmp.so.
Maybe you should remove it from the dependency list in the install doc? :)
Regards,
Frederik
Just to throw in my two pence worth;
We have 185 devices being polled and it happens in about 1 minute max. The server is 2x 15k SAS drives in HW RAID 1, 2x 2.5 GHz quad-core Xeon, 2x 2 GB RAM. Fairly low-end server, I'd say!
Our Cacti server polls 10k data sources from 180 devices in about 1.5 - 2 minutes (although that uses the Boost plugin, which throws everything into memory and flushes it to disk later; with Observium we are going straight to RRDs on disk).
If you can't manage the 5-minute poll with a RAM disk, I'd still continue to investigate your server setup and performance tweaks.
James.
On 10 October 2013 13:49, James Bensley <jwbensley@gmail.com> wrote:
2x 2.5 GHz quad-core Xeon
Should be 1x ^ quad core
Our Cacti server polls 10k data sources from 180 devices in about 1.5 - 2 minutes...
What I missed off there to make that more relevant is that the Cacti server is a bit slower!
James.
The performance tweaks are exactly what I'm trying to find. I also monitor my server with Munin for basic performance stats, but I see no real problems there.
Frederik
Disable the fdb-table module globally.
adam.
I disabled it now. Last time I disabled this I noticed the biggest bottleneck was the port module. That takes about 85% of the total time.
Frederik
That's because ports accounts for 85% of total I/O :)
You can see how much I/O is being caused by running iostat or iotop. You are almost certainly running far more poller processes than your I/O system can sustain.
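For example, assuming the sysstat and iotop packages are installed (device names will differ):

```bash
# Extended per-device I/O statistics every 5 seconds; watch %util and await
# for the volume holding the RRDs while a poll is running.
iostat -x 5

# Per-process I/O, showing only processes actually doing I/O, with accumulated
# totals; rrdtool/rrdcached and the poller processes should dominate if the
# disks are the bottleneck.
iotop -a -o
```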
The fdb-table module is a lot of SNMP, due to Cisco being lazy and not implementing a sensible place to get MAC address tables from.
You are definitely running too many threads.
adam.
Hi Adam,
With 50 threads it almost makes the 5 minutes for one cycle, and with a lower load of 45. iotop shows 6 rrdcached processes that are active. I should try running 100 threads to see what the I/O impact is.
I have a different question. In the menu under ports I see alerts, with 14000 of them. Will the alerts go away by themselves, or is there an option to clear them?
And a second question regarding the fdb-table: if I disable it in config.php, the web page still remains and so does the data in MySQL. Is there a way to delete this data, or will it go away by itself?
Regards,
Frederik
I found that setting vm.dirty_writeback_centisecs in /etc/sysctl.conf to a value higher than the poll interval gave a lot of the benefits of the RAM disk without the overhead of managing it. I figure the risk of losing power or the system crashing is fairly low and the worst that can happen is the loss of a poll or two.
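A sketch of that tuning, assuming a 5-minute poll interval; the exact values are illustrative, and vm.dirty_expire_centisecs is typically raised alongside it so dirty pages aren't flushed out early anyway:

```bash
# /etc/sysctl.conf -- keep dirty RRD pages in the page cache longer than one
# poll cycle (values are in hundredths of a second; 60000 = 10 minutes)
vm.dirty_writeback_centisecs = 60000
vm.dirty_expire_centisecs = 60000

# Apply without a reboot:
#   sysctl -p
```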
Paul
participants (8)
- Adam Armstrong
- F.Reenders@utwente.nl
- James Bensley
- Laurens Vets
- Moerman, Maarten
- Paul Gear
- Peter Persson
- Tom Laermans