Horizontal Scalability

Ciro Iriarte

23 Aug 2013

7:20 p.m.

Hi! Has anybody considered scaling by adding more servers instead of moving to a bigger one? Using something like Open Grid Scheduler, for example.

Currently I'm maxing out our server (2 x Xeon E5-2630, 16 cores) with 37,500 ports and 548 devices:

cplanning:~ # uptime
 13:19pm  up 151 days  3:03,  2 users,  load average: 43.37, 43.00, 42.60

Regards, CI.-
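For context, a load average is easier to judge relative to the number of cores. A minimal Python sketch using the figures quoted in the message above (the function name `load_per_core` is only illustrative):

```python
def load_per_core(load_average: float, cores: int) -> float:
    """Return the load average normalised by the number of cores."""
    if cores <= 0:
        raise ValueError("cores must be positive")
    return load_average / cores


# Figures quoted in the message: 1-minute load 43.37 on 16 cores.
ratio = load_per_core(43.37, 16)
print(f"load per core: {ratio:.2f}")
```

A value well above 1.0 means more runnable (or uninterruptibly waiting) tasks than cores. On Linux this number alone does not say whether the bottleneck is CPU or I/O.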

Tom Laermans

23 Aug

7:27 p.m.

Maxing out I/O I presume, not CPU?

Tom

Ciro Iriarte

8:20 p.m.

Some stats:

procs -----------memory---------- ---swap-- -----io---- -system-- -----cpu------
 r  b  swpd    free    buff    cache si so  bi    bo    in    cs us sy id wa st
27  0 29200 1312344 1270820 58582484  0  0   0 19822 20043 30668 54 38  7  0  0
22  1 29200 1330072 1270820 58582496  0  0   0 17254 18590 30620 56 39  5  0  0
27  0 29200 1314016 1270820 58582500  0  0   0  7118 16939 27376 57 41  2  0  0
35  0 29200 1389532 1270820 58582504  0  0   0 29470 20672 29516 61 39  0  0  0
47  1 29200 1427376 1270820 58582568  0  0  39 27003 19989 27377 68 32  0  0  0
49  0 29200 1457816 1270820 58583952  0  0 241 12389 16634 25146 69 31  0  0  0
52  0 29200 1434356 1270820 58584000  0  0   0  8750 16340 25067 68 32  0  0  0
46  0 29200 1476840 1270820 58584104  0  0   1 17590 17662 25817 69 31  0  0  0

Hmm, now that I check, the system has been under additional load since Aug 5. I'll have a look at that.
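The vmstat samples above can be summarised to address the CPU versus I/O question. A minimal Python sketch, assuming the standard 17-column vmstat layout shown in the header; the helper names `parse_rows` and `mean` are only illustrative:

```python
# The eight vmstat samples quoted above, one row per line.
VMSTAT_ROWS = """\
27 0 29200 1312344 1270820 58582484 0 0 0 19822 20043 30668 54 38 7 0 0
22 1 29200 1330072 1270820 58582496 0 0 0 17254 18590 30620 56 39 5 0 0
27 0 29200 1314016 1270820 58582500 0 0 0 7118 16939 27376 57 41 2 0 0
35 0 29200 1389532 1270820 58582504 0 0 0 29470 20672 29516 61 39 0 0 0
47 1 29200 1427376 1270820 58582568 0 0 39 27003 19989 27377 68 32 0 0 0
49 0 29200 1457816 1270820 58583952 0 0 241 12389 16634 25146 69 31 0 0 0
52 0 29200 1434356 1270820 58584000 0 0 0 8750 16340 25067 68 32 0 0 0
46 0 29200 1476840 1270820 58584104 0 0 1 17590 17662 25817 69 31 0 0 0
"""

COLUMNS = "r b swpd free buff cache si so bi bo in cs us sy id wa st".split()


def parse_rows(text):
    """Turn whitespace-separated vmstat rows into a list of dicts."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        rows.append(dict(zip(COLUMNS, map(int, fields))))
    return rows


def mean(rows, column):
    """Average one column over all samples."""
    return sum(row[column] for row in rows) / len(rows)


rows = parse_rows(VMSTAT_ROWS)
busy = mean(rows, "us") + mean(rows, "sy")
print(f"samples: {len(rows)}")
print(f"avg run queue (r): {mean(rows, 'r'):.1f}")
print(f"avg CPU busy (us+sy): {busy:.1f}%")
print(f"avg I/O wait (wa): {mean(rows, 'wa'):.1f}%")
```

For these eight samples the averages come out at roughly 98% CPU busy, zero I/O wait, and a run queue of about 38, which points to CPU contention rather than disk I/O. This is only a snapshot, so a longer capture would be needed to be sure.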

Regards,

-- Ciro Iriarte http://cyruspy.wordpress.com --

Moerman, Maarten

7:34 p.m.

Have you tried the poller wrapper as well?

But it looks like you're likely limited by I/O.

Have you tried RRD storage on SSD yet?

Ciro Iriarte

8:38 p.m.

Yes, we're using the Python poller wrapper. Unfortunately, SSD is not available. CPU I/O wait is near 0, and IOPS seem to top out at about 2k, so I/O shouldn't be the issue...

Moerman, Maarten

7:36 p.m.

Btw, do you have issues? Load is unlikely to be an issue, unless you don't like high numbers...

Ciro Iriarte

8:40 p.m.

Well, a coworker reported some broken graphs, which I'm still investigating. I'm also being proactive and looking for the best way to scale.

Regards, CI.-

Paul Gear

24 Aug

12:26 a.m.

That has definitely not been my experience. Load is the main issue on our server, and it seems to be largely due to contention during polls. I've raised vm.dirty_writeback_centisecs to 5000 (the default is 500) so that dirty pages are written back less often, and I/O is relatively low.

But we monitor a lot of systems across bad ADSL links, which means we have to run a lot of pollers in parallel to get a poll done in 5 minutes (some of our devices actually take more than 300 seconds to poll; they're disabled at the moment). This pushes up the load (we run 5 pollers per core) and makes the system very sluggish.

I've ordered a new server to replace this one this week, and I ended up going for 2 x 8 cores for a much smaller install than Ciro's:

Devices: 157 (139 up, 1 down, 4 ignored, 13 disabled)
Ports: 9150 (918 up, 6 down, 1083 ignored, 7006 shutdown)
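The constraint described above (every device has to be polled within the 5-minute cycle, so slow devices force more pollers to run in parallel) can be put into rough numbers. A minimal Python sketch; the per-device poll times are made-up illustrative values, not measurements from this install, and `min_pollers` is a hypothetical helper:

```python
import math

POLL_INTERVAL = 300  # seconds; the 5-minute polling cycle mentioned above


def min_pollers(poll_times, interval=POLL_INTERVAL):
    """Lower bound on parallel pollers needed to finish one cycle in time.

    Returns (pollers, too_slow): `too_slow` lists devices whose own poll
    time exceeds the interval, so they cannot fit however many pollers run.
    """
    too_slow = [name for name, secs in poll_times.items() if secs > interval]
    total = sum(secs for secs in poll_times.values() if secs <= interval)
    return math.ceil(total / interval), too_slow


# Hypothetical per-device poll durations in seconds (illustrative only).
poll_times = {
    "core-sw1": 20,
    "adsl-site-a": 240,
    "adsl-site-b": 280,
    "adsl-site-c": 320,
}

pollers, too_slow = min_pollers(poll_times)
print(f"at least {pollers} pollers; cannot fit: {too_slow}")

# Assumption: the same 5-pollers-per-core ratio applied to a 2 x 8 core server.
print(f"5 pollers/core x 16 cores = {5 * 16} pollers")
```

This is only a lower bound: real poll times vary between cycles, and packing devices onto pollers is rarely perfect, which is presumably why many more pollers than the minimum end up running.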

Paul
