If we had a set of rules to detect when something is broken, we could return "NA" to rrdtool, which signals that something is wrong and that the value shouldn't be used for calculation.
If we send zero, we get the issue that the *next* polling session will return a correct value, yielding twice the bps rate.
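Roughly, something like this (a sketch only; the function and file names are made up, and note that rrdtool actually spells the unknown value "U" on the command line):

    import subprocess

    def update_port_rrd(rrd_file, polled_value):
        # Feed a polled octet counter to rrdtool; when the poll failed,
        # substitute "U" (rrdtool's unknown marker) so the sample is
        # discarded rather than recorded as a drop to zero.
        sample = "U" if polled_value is None else str(polled_value)
        subprocess.run(["rrdtool", "update", rrd_file, "N:" + sample],
                       check=True)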
I actually see this problem more often on the netstats graphs in the graphing tab. I suspect those rrds are being built with less sane defaults.
adam.
On 05/09/2011 16:50, Robert Williams wrote:
Just an idea, but could you check for something like "if polled-value > 0 then OK, otherwise use previous value"?
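Something like this, as a rough sketch (the names are illustrative, not actual Observium code):

    last_good = {}  # last known-good reading per counter

    def sanitize(counter_id, polled):
        # Accept a positive polled value; otherwise fall back to the
        # previous good reading (None on the very first poll).
        if polled is not None and polled > 0:
            last_good[counter_id] = polled
            return polled
        return last_good.get(counter_id)

With a COUNTER data source, re-submitting the last raw counter value makes rrdtool compute a zero rate for that interval, so you'd get a flat dip rather than a giant spike.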
Robert Williams
Backline / Operations Team
Custodian DataCentre
tel: +44 (0)1622 230382
email: Robert@CustodianDC.com
http://www.custodiandc.com/disclaimer.txt
-----Original Message-----
From: observium-bounces@observium.org [mailto:observium-bounces@observium.org] On Behalf Of Adam Armstrong
Sent: 05 September 2011 16:43
To: Observium Network Observation System
Subject: Re: [Observium] huge peaks on interfaces 10-100Gig
On 05/09/2011 08:48, Nikolay Shopik wrote:
We're seeing, once in a while, really huge peaks on some devices. In the past we saw these when installing a new device and/or doing maintenance on it, so we just ignored them, because they only broke the graphs occasionally.
Is anyone else seeing this? Basically it looks like a huge peak (10-100G) lasting about 5-10 minutes (2 scans).
I've seen this on different devices, but all of them are Cisco afaik.
This is caused when Observium fails to get data for a specific counter for whatever reason and records zero. On the next poll it records the real number again, and comes up with the ridiculous conclusion that the counter's entire accumulated traffic was transferred in the past 5 minutes.
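To put numbers on it (the counter readings below are made up):

    # Made-up ifHCInOctets readings, one per 300-second poll:
    good, failed, recovered = 4_000_000_000_000, 0, 4_000_018_750_000

    real_rate  = (recovered - good) * 8 / 300    # ~500 Mbit/s of actual traffic
    spike_rate = (recovered - failed) * 8 / 300  # ~107 Gbit/s graphed "peak"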
It's very annoying, and usually caused by an SNMP query failing. You can often make it go away by fiddling with the SNMP timeout/retry values.
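For what it's worth, with the net-snmp tools those knobs are -t (timeout in seconds) and -r (retries); the host and community below are placeholders:

    import subprocess

    # net-snmp's snmpget: -t sets the timeout in seconds, -r the retries.
    subprocess.run(["snmpget", "-v2c", "-c", "public",
                    "-t", "5", "-r", "3",
                    "router.example.com", "IF-MIB::ifHCInOctets.1"],
                   check=True)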
I rarely have this issue (except on reboots), as my infrastructure is usually high-speed, but if you're monitoring anything across the internet or on the end of broadband connections, you'll probably see the issue more.
We need to come up with a way of filtering out these incorrect values (and work out the correct way to build the RRD in the first place so that it's resistant to this issue). The normal method is to tell RRD that a measurement can't change by more than a certain amount (say, the maximum amount a 10Gbit link can push in 5 minutes). Unfortunately we can't limit the RRDs to 10Gbit, because there are people who go faster. We have users with multiple 10Gbit trunks, and presumably we already have users with 40Gbit or 100Gbit links, possibly even trunks.
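For reference, that cap is the max field of the DS definition. A sketch sized for a 10Gbit link (the file and DS names are illustrative):

    import subprocess

    # 10 Gbit/s = 1,250,000,000 octets/s; rates computed above the DS
    # max are stored as unknown instead of being graphed as a spike.
    subprocess.run(["rrdtool", "create", "port-1234.rrd",
                    "--step", "300",
                    "DS:INOCTETS:COUNTER:600:0:1250000000",
                    "DS:OUTOCTETS:COUNTER:600:0:1250000000",
                    "RRA:AVERAGE:0.5:1:2016"],  # ~a week of 5-min samples
                   check=True)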
Trying to build the RRDs with the correct maximum value to begin with might work for most people, but then we have the problem of interfaces that don't correctly report their speed, and of what happens when the speed changes (i.e. a trunk that starts with 2 links but ends up with 16, or a trunk that begins as 2x1Gbit but ends up as 16x10Gbit over time).
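At least the max isn't set in stone; rrdtool tune can raise it on an existing file without rebuilding it, e.g. for a trunk that has grown to 16x10Gbit (same illustrative names as above):

    import subprocess

    # 16 x 10 Gbit/s = 160 Gbit/s = 20,000,000,000 octets/s
    subprocess.run(["rrdtool", "tune", "port-1234.rrd",
                    "--maximum", "INOCTETS:20000000000",
                    "--maximum", "OUTOCTETS:20000000000"],
                   check=True)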
It's annoying, for sure.
adam.