Almost all recent changes have been web-interface related.

We did change the loop around the rrdtool binary, but that shouldn't affect polling.

Detecting up/down doesn't really have any prerequisites. It does an ICMP ping and a UDP snmpget. If either of those fails to return, and the host isn't already down, it changes the value in the database to down and skips the rest of the poll.
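
As a rough illustration of the decision described above, here is a minimal Python sketch. It is not the actual Observium poller code (which is PHP); the names `PollDecision` and `decide` are made up for this example, and the behaviour for a host that is already down or that recovers is an assumption, since the description above only covers the transition to down.

```python
from dataclasses import dataclass


@dataclass
class PollDecision:
    status: str          # "up" or "down"
    update_db: bool      # whether the stored status needs to change
    continue_poll: bool  # whether the rest of the poll should run


def decide(ping_ok: bool, snmp_ok: bool, was_down: bool) -> PollDecision:
    """Decide a host's status from one ICMP ping and one SNMP get.

    If either check failed to return, the host counts as down and the
    rest of the poll is skipped. The database is only written when the
    status actually changes.
    """
    if not (ping_ok and snmp_ok):
        # Assumption: a host that was already down is left untouched.
        return PollDecision("down", update_db=not was_down, continue_poll=False)
    # Assumption: a host that was down and now answers is flipped back to up.
    return PollDecision("up", update_db=was_down, continue_poll=True)
```

The point of the sketch is that there are only two inputs. A late or lost reply to either check is enough to produce a "down" result, regardless of anything else the poller does.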

There is nothing really code-related that could cause the effects you're seeing, only the return packets failing to reach the PHP process in time. This kind of thing isn't even observed on very, very heavily loaded systems.

Is this a virtualised system?

adam.

On 26/11/2012 13:27, Robert Williams wrote:

Hi, well any ignoring you think I’m doing is certainly not the case – you said to check for other cron jobs and I did. There aren’t any (literally, this system has only Observium on it). I also checked for any processes running at the point of the ‘host down’ events as was also suggested; the only one is the poller / php process itself. Which takes more CPU on every 6th poll, but nothing more interesting than that. If there were other suggestions then I’m really sorry I’ve obviously missed them (which is quite likely as I’ve noticed before some postings from this list have been dropped by our filtering system). Either way, I’m certainly not ignoring you, I think it would be a bit stupid to ignore any advice from the creator of the product which you are having trouble with :)

 

I always suspect that the last change made is what broke it – so since five or six hosts were added last week, my initial suspicion was that the poller might be running out of time. The higher-numbered hosts failing, in order, suggested to me (assuming it polls them in numerical order) that it may be running out of time before completing. This was slightly confirmed when I removed hosts at random and the problem suddenly went away.
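
To make the "running out of time" hypothesis concrete, here is a small Python sketch of how it could be checked. It only models the assumption stated above (devices polled one after another in ID order within a 5-minute interval); it says nothing about how the real poller behaves. The function name and the per-device durations are hypothetical, and real durations would have to be measured.

```python
def devices_starting_late(durations, budget_seconds=300.0):
    """Return the IDs of devices whose poll would start after the budget.

    durations: list of (device_id, seconds) pairs in polling order.
    budget_seconds: length of one polling interval (5 minutes here).
    """
    elapsed = 0.0
    late = []
    for device_id, seconds in durations:
        if elapsed >= budget_seconds:
            late.append(device_id)
        elapsed += seconds
    return late


# Hypothetical figures: 92 devices taking 3.5 seconds each.
late = devices_starting_late([(i, 3.5) for i in range(1, 93)])
print(late)  # [87, 88, 89, 90, 91, 92]
```

If the measured durations never add up to more than the interval, this hypothesis can be ruled out.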

 

I’m just trying to fault-find at the same time as asking for pointers and I’m happy to check anything that anyone on here may suggest.

 

Incidentally I’m just deploying Job’s poller-wrapper script which I’ve been meaning to do for a while but never got around to it. I’ll know in 30 minutes or so if it’s helped at all and I’ll let you know.

 

In the meantime, please do let me know if I’ve missed or ignored something, as I’ve most certainly not intended to – cheers!

 


Robert Williams

Custodian DataCentre

email: Robert@CustodianDC.com


From: observium-bounces@observium.org [mailto:observium-bounces@observium.org] On Behalf Of Adam Armstrong
Sent: 26 November 2012 18:59
To: Observium Network Observation System
Subject: Re: [Observium] 30 minute poll issue

 

You're looking in completely the wrong place, and you've totally ignored everything I've said, so I give up.

have fun.


On 26/11/2012 12:24, Robert Williams wrote:

Right – I’ve disabled a few of the recent additions and each time I disable one the number of hosts which are ‘down’ decreases. Now, with 3 hosts disabled I only have 2 hosts which are failing.

 

Also, the hosts which fail are always numbered sequentially and are always high-numbered, say above host ID 90.

 

Is it possible the poller is simply running out of time? Can the time be extended? Either way, why only on exactly every 6th poll / 30 minutes? Weird…

 

 

Robert Williams

Custodian DataCentre

email: Robert@CustodianDC.com

 

From: observium-bounces@observium.org [mailto:observium-bounces@observium.org] On Behalf Of Adam Armstrong
Sent: 26 November 2012 16:58
To: Observium Network Observation System
Subject: Re: [Observium] 30 minute poll issue

 

Btw, when one single installation out of thousands starts doing something like this, it's almost never related to the code, and almost always related to the system it's installed on.

adam.

On 26/11/2012 10:38, Robert Williams wrote:

Hi – no other cron jobs on that box, it’s purely running Observium and the only jobs are the poller itself and a selection of jobs which run at weekly or daily intervals for various system functions. There is also a weekly SVN pull for Observium :)

 

As a test we have removed the most recently added device to see if that helps, but I could do with some way of recording what is happening on that particular poll. It’s like clockwork but I can’t see anything that would cause that on the network side, and everything definitely responds (from the Observium console) 100% during the poll itself.
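
One possible way of recording what happens during that poll is to run an independent, timestamped reachability check alongside it and compare the log with the 'host down' events. The Python sketch below is only an illustration: `record_probes` and `ping_once` are names invented here, and the `ping -c 1 -W 1` flags assume a Linux iputils `ping`.

```python
import subprocess
import time


def ping_once(host):
    """Send a single ICMP echo with a 1 second timeout (Linux iputils ping assumed)."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def record_probes(hosts, probe=ping_once, clock=time.time):
    """Probe each host once and return one row per host.

    Each row holds the start time, the host, whether the probe succeeded,
    and how long the probe took in seconds.
    """
    rows = []
    for host in hosts:
        started = clock()
        ok = probe(host)
        finished = clock()
        rows.append(
            {"time": started, "host": host, "ok": ok, "seconds": finished - started}
        )
    return rows
```

Running this from cron a minute or so either side of the failing poll, and appending the rows to a file, would show whether replies are being lost or merely arriving late at that moment.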

 

Cheers!

 

 

Robert Williams

Custodian DataCentre

email: Robert@CustodianDC.com

 

From: observium-bounces@observium.org [mailto:observium-bounces@observium.org] On Behalf Of Adam Armstrong
Sent: 26 November 2012 16:07
To: Observium Network Observation System
Subject: Re: [Observium] 30 minute poll issue

 

Look for other, newly added cron jobs. Discovery is only run once every 6 hours.

It's pretty difficult to make the poller believe that something is offline without network issues, as all it does to decide is a ping and an snmp get.

adam.

On 26/11/2012 08:14, Robert Williams wrote:

Hi Guys,

 

Got a weird issue which has just started, seemingly by itself but I imagine there was a cause (I just don’t know it yet!).

 

In short, every 30 minutes the poller decides that approximately a third of all devices (92 in total) are ‘Offline’. They then magically recover on the next poller interval.

 

The devices are not offline, and the Observium host does not lose connectivity (I’ve tested with numerous pings etc. during this predictable failure period).

 

Interestingly, the host on which Observium runs does record these CPU and RAM metrics during that particular polling run:

[CPU and RAM graphs from the original message are not available]

Now, I’m guessing that there is maybe a more substantial ‘discovery’ run or similar every 30 minutes. For some reason, this more intensive run seems to be resulting in a load of devices going allegedly offline.

 

The problem started on Friday around 11pm and has repeated like clockwork since. We are running the latest SVN.

 

I’m a bit uncertain where to start with this one as although it’s predictable I can’t really see anything which would cause it to happen. Pointers for diagnosing further very welcome :)

 

Cheers as always!

 

Robert Williams

Custodian DataCentre

email: Robert@CustodianDC.com

 






_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium

 




