We're seeing really huge peaks on some devices once in a while. In the past we saw these when installing a new device and/or doing maintenance on it, so we just ignored them, because they only broke the graphs occasionally.
Is anyone else seeing this too? Basically it looks like a huge peak (10-100G) lasting about 5-10 minutes (2 polling cycles).
I've seen this on different devices, but all of them are Cisco, AFAIK.
I think this is a common issue with rrdtool and not specific to Observium. When you reboot a device, all the packet counters reset; since rrdtool graphs the differences between the counters at each polling interval, you get a huge spike.
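A minimal sketch of the mechanism with plain rrdtool commands (the file name, timestamps and counter values are invented): a COUNTER data source treats any decrease as a wrap rather than a reset, so a reboot mid-interval turns into a phantom burst on the graph.

  # Reproduce the reboot spike: COUNTER assumes a decrease means the
  # 32-bit counter wrapped, not that it was reset to zero.
  rrdtool create demo.rrd --step 300 \
    DS:octets:COUNTER:600:0:U \
    RRA:AVERAGE:0.5:1:288

  rrdtool update demo.rrd 1315200000:1000000000  # normal sample
  rrdtool update demo.rrd 1315200300:1001000000  # +1e6 octets / 300 s ~ 3.3 kB/s
  rrdtool update demo.rrd 1315200600:500         # device rebooted, counter reset
  # rrdtool computes (2^32 - 1001000000 + 500) / 300 ~ 11 MB/s (~88 Mbit/s):
  # a phantom spike where traffic was actually near zero.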
Check out removespikes.(pl|php) in the scripts directory; that will help get rid of them.
Cheers,
Tim.
Thanks Tim,
That clears up the issue somewhat, but right now I'm seeing this on a device (a Cat3560X, to be exact) which hasn't been rebooted in weeks. So this is something different in my case.
Hi,
I have seen this odd behaviour before; I actually still have those spikes on some graphs older than a month. It would be nice to have a button with fields for setting the various removespikes parameters from within the Graphs view page under Ports, in order to remove these spikes from the UI rather than the command line. Something like a "Remove spike(s) from this graph" button would be a wonderful addition.
Regards, Bruce
Hi,
I believe the same can occur if someone clears the interface statistics on the switch, or resets SNMP.
Also, you can see the same thing if you poll the non-64-bit counters (I'm not sure which ones Observium uses, 32 or 64) and traffic has caused the counter to roll over during the polling period.
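As a rough sanity check (the line rates are just examples), the wrap time of a 32-bit octet counter is 2^32 divided by the byte rate, so even a sub-gigabit link can wrap within a 5-minute poll:

  echo $(( (1 << 32) / (1000000000 / 8) ))  # 1 Gbit/s: wraps every ~34 s, many times per poll
  echo $(( (1 << 32) / (100000000 / 8) ))   # 100 Mbit/s: ~343 s, roughly once per 300 s poll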
Best,
Robert Williams
Backline / Operations Team
Thanks Robert,
Yes, we have used "clear counters" on some interfaces lately; I didn't know this also affects the SNMP counters, but it sounds logical. Anyway, the last peak was just yesterday, and nobody cleared any counters :(.
This is caused when Observium fails to get data for a specific counter for whatever reason, and records zero. On the next poll it records the real number again, and comes up with the ridiculous calculation of assuming that the total traffic was transferred in the past 5 minutes.
It's very annoying, and usually caused by an SNMP query failing. You can often make it go away by fiddling with the SNMP timeout/retry values.
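To put hypothetical numbers on that calculation (the counter value is invented, but it's a realistic size for a busy port's 64-bit counter): a failed poll stores 0, the next poll stores the real counter again, and the whole thing is graphed as one 300-second step.

  echo $(( 3750000000000 / 300 * 8 ))  # 0 -> 3.75e12 octets in one 300 s step
                                       # = 100,000,000,000 bit/s: a phantom 100G peak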
I rarely have this issue (except on reboots), as my infrastructure is usually high-speed, but if you're monitoring anything across the internet or on the end of broadband connections, you'll probably see the issue more.
We need to come up with a way of filtering out these incorrect values (and work out the correct way to build the RRD in the first place, so that it's resistant to this issue). The normal method is to tell RRD that a measurement can't change by more than a certain amount (say, the maximum amount a 10Gbit link can push in 5 minutes). Unfortunately we can't limit the RRDs to 10Gbit, because there are people who go faster. We have users with multiple 10Gbit trunks, and presumably we already have users with 40Gbit or 100Gbit links, possibly even trunks.
Trying to build the RRDs with the correct maximum value to begin with might work for most people, but then we have the problem of interfaces that don't correctly report their speed, and of what happens when the speed changes (e.g. a trunk that starts with 2 links but ends up with 16, or a trunk that begins as 2x1Gbit but ends up as 16x10Gbit over time).
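For illustration, capping a port RRD in the way described above might look like this; it's a sketch, not Observium's actual schema (the DS names, step and RRA layout are assumptions). Using DERIVE with a minimum of 0 has the side benefit that the negative step from a counter reset becomes unknown rather than a wrap-sized spike:

  # Cap both data sources at 10 Gbit/s worth of octets (1.25e9 bytes/s);
  # any computed rate above the max is stored as unknown, not graphed.
  rrdtool create port.rrd --step 300 \
    DS:INOCTETS:DERIVE:600:0:1250000000 \
    DS:OUTOCTETS:DERIVE:600:0:1250000000 \
    RRA:AVERAGE:0.5:1:2016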
It's annoying, for sure.
adam.
Just an idea, but could you check for something like "if polled-value > 0 then OK, otherwise use the previous value"?
Robert Williams, Backline / Operations Team, Custodian DataCentre
If we get a set of rules to detect when something is broken, we can return "NA" to rrdtool, which tells it that something is wrong and that the value shouldn't be used for calculation.
If we send zero, we get the issue that the *next* polling session will return a correct value, yielding twice the bps rate.
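In plain rrdtool terms (the file and DS layout are assumed, as in the sketch above), the difference is:

  rrdtool update port.rrd N:U:U  # poll failed: unknown -> a gap in the graph
  rrdtool update port.rrd N:0:0  # poll failed: zero -> the next real sample
                                 # is measured against 0, inflating the rate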
I actually see this problem more often on the netstats graphs in the graphing tab. I suspect those RRDs are being built with less sane defaults.
adam.
On 05/09/11 19:52, Adam Armstrong wrote:
If we get a set of rules to detect when something is broken, we can return "NA" to rrdtool, which tells it that something is wrong and that the value shouldn't be used for calculation.
I think this is the best approach, and the safest: if something goes wrong, always write "NA" to rrdtool. That would prevent these peaks in 99% of cases.
It's virtually impossible to build such a set of rules, though, which is why the problem is so hard.
It's difficult to use ifSpeed to 'guess' the speed of the interface, as it's not always correctly reported, particularly on virtual interfaces.
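For reference (the host, community and ifIndex are placeholders): ifSpeed is a 32-bit gauge in bit/s, so it saturates at 4294967295 on anything faster than about 4.3 Gbit/s, while ifHighSpeed reports Mbit/s instead; but virtual interfaces often return 0 or nonsense on both.

  snmpget -v2c -c public 192.0.2.1 IF-MIB::ifSpeed.4 IF-MIB::ifHighSpeed.4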
People with 16x10G trunks would be very annoyed if we started returning NAN for measurements above 10G!
adam.
What I mean is not building rules to limit the "big guys". I'm talking about not writing a 0 value to the RRD when SNMP can't get data. Maybe I'm missing your point, but why do we save a 0 value to rrdtool in the first place, instead of NA?
See if that's any better. I've changed the ports poller to return U on broken values instead of 0.
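Roughly what that change amounts to, sketched in shell rather than Observium's actual PHP poller (the host, OID and single-DS RRD are placeholders):

  # On a failed or non-numeric SNMP result, write U (unknown) instead of 0,
  # so the bad poll becomes a gap rather than a spike.
  val=$(snmpget -v2c -c public -Oqv 192.0.2.1 IF-MIB::ifHCInOctets.4 2>/dev/null)
  case "$val" in
    ''|*[!0-9]*) val=U ;;
  esac
  rrdtool update in_octets.rrd "N:${val}"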
adam.
Thanks. I'd be fine seeing an empty gap instead of a fake peak, even if the peak isn't that big. Sometimes you have to monitor a device over a slow/unstable/congested link and can't place Observium closer to the device.
Cisco seems to treat some interfaces weirdly.
I have a 7200 that polls perfectly, except for Null0, which randomly loses datapoints for no apparent reason.
IOS XR seems to be particularly badly hit by whatever causes that.
adam.
What processor do you have in your 7200? We have NPE-400, NPE-G1 and a 7201 with the G2, and have had no problems with the G2 or G1 so far. This probably depends on the IOS version; as you've already seen, I've been working with Cisco TAC to fix some SNMP issues :).
I've heard the same applies to IOS XE: still having troubles, but solvable with newer releases.
participants (5)
- Adam Armstrong
- Bruce-Young Majola
- GOLLSCHEWSKY, Tim
- Nikolay Shopik
- Robert Williams