Re: [Observium] Cool Output of APC in-row cooler, random enormous spikes.

10 Jun 2016


Ah. I haven't used email alerts for more than a week, so I haven't paid attention to this until now.
Looking through the alerts now, I can see that they're all wrong. What's happening is that the 'time of recovery' is taken from this sensor's/device's PREVIOUS recovery. I can see that by comparing the timestamps on the emails with the timestamp of the previous recovery. And for the very first recovery of a device/sensor the duration is listed as "Unknown".
Looking at alerts.inc.php, I assume the "Unknown" comes from here:
'DURATION' => ($entry['alert_status'] == '1'
    ? ($entry['last_recovered'] > 0
        ? formatUptime(time() - $entry['last_recovered'])." (".format_unixtime($entry['last_recovered']).")"
        : "Unknown")
    : ($entry['last_ok'] > 0
        ? formatUptime(time() - $entry['last_ok'])." (".format_unixtime($entry['last_ok']).")"
        : "Unknown")),
I obviously have no idea, but 'last_recovered' and 'last_ok' sound like the previous time it recovered. For a proper duration it feels like it should be the time of the CURRENT recovery...? Or...?
So yeah, at least in my installation the 'duration' is most often based on the previous recovery and not the current one. This goes for all my alerts since I started email alerting. Going through my Slack alerts I see the same behaviour for most of them; a few actually look OK, but the majority don't. Not strange, since I guess they all use the same alert mechanism.
But all that's a sidetrack to the main discussion here. Still interesting, though. =)
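To make the suspected mix-up concrete, here is a small shell sketch with made-up epoch timestamps. It is not Observium code; the variable names and the helper fmt_duration are hypothetical, and it only illustrates how measuring from the previous recovery inflates the reported duration compared with measuring from the start of the current alert.

```bash
#!/bin/bash
# Hypothetical timestamps (seconds since the epoch), for illustration only.
prev_recovered=1465500000   # the previous recovery, about 19 hours earlier
alert_started=1465568400    # moment the current alert fired
now=1465568700              # moment the recovery mail is generated

# Format a number of seconds as "Xd Xh Xm Xs".
fmt_duration() {
    local s=$1
    printf '%dd %dh %dm %ds\n' \
        $((s / 86400)) $((s % 86400 / 3600)) $((s % 3600 / 60)) $((s % 60))
}

fmt_duration $((now - prev_recovered))   # measured from the previous recovery
fmt_duration $((now - alert_started))    # measured from the start of this alert
```

With these numbers the first line reports about 19 hours, while the second reports the 5 minutes the alert actually lasted.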
--
Henrik Cednert
cto | compositor
Filmlance International | www.filmlance.se
mobile [ + 46 (0)704 71 89 54 ]
skype  [ cednert ]
From: observium <observium-bounces@observium.org> on behalf of Spencer Ryan <sryan@arbor.net>
Reply-To: Observium Network Observation System <observium@observium.org>
Date: Friday 10 June 2016 at 16:24
To: Observium Network Observation System <observium@observium.org>
Subject: Re: [Observium] Cool Output of APC in-row cooler, random enormous spikes.
I've brought up the duration being wrong in alert emails to adam before and he claims "it works fine". I've never seen it work fine.
Spencer Ryan | Senior Systems Administrator | sryan@arbor.net
Arbor Networks
+1.734.794.5033 (d) | +1.734.846.2053 (m)
www.arbornetworks.com
On Fri, Jun 10, 2016 at 10:22 AM, Henrik Cednert (Filmlance) <henrik.cednert@filmlance.se> wrote:
I'll be damned.
That was probably just pure luck but it alerted faster than expected.
Please see the screenshot of the alert mails, because they do look a tad odd: https://www.dropbox.com/s/5ynr67ior7at3e3/mailalert.png
The alert lasted for 5 minutes, or actually just a brief moment within the polling period, but look at the duration in the recovery email. Really odd. By the way, are there templates where we can configure the layout of the email alerts (subject, what is displayed in the body, and more)?
Here is a screenshot of the graph created from the data collected by the previous command: https://www.dropbox.com/s/nw47m75tj601gmh/csv_graph.png
So yes, there are spikes in the data, and I guess they don't show up in Data Center Expert because it runs some filter on the data or just ignores random peaks or something.
The value of the peaks in the newly collected data is ALWAYS 20480, which feels significant since it's constant and looks like some sort of unit max value. I'm not sure whether there are other, higher peaks as well, because what I see in Observium is higher, or whether the scale of the axis there is off. Here's another screenshot: https://www.dropbox.com/s/0whwzf7wf34j8uu/observium_graph.png
But it feels like an APC issue then, so I guess I'll have the pleasure of once again dealing with their support. =/
Cheers and thanks for the help Adam.
--
Henrik Cednert
cto | compositor
Filmlance International | www.filmlance.se
mobile [ + 46 (0)704 71 89 54 ]
skype  [ cednert ]
From: observium <observium-bounces@observium.org> on behalf of Henrik Cednert <henrik.cednert@filmlance.se>
Reply-To: Observium Network Observation System <observium@observium.org>
Date: Friday 10 June 2016 at 15:27
To: Observium Network Observation System <observium@observium.org>
Subject: Re: [Observium] Cool Output of APC in-row cooler, random enormous spikes.
Thanks Adam
Polling it every second now and storing to a log file for investigation with this.
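#!/bin/bash
# Poll the cooler's analog value once per second and append "time,value" to a log file.
logfile=/home/neo/Documents/acrc02_coolingUnitStatusAnalogValue.1.10.txt
while true; do
        sleep 1
        timestamp=$(date +"%T")
        sensor_value=$(/usr/bin/snmpbulkwalk -v2c -c 'public' -Pu -OQUs -m PowerNet-MIB -M /opt/observium/mibs/rfc:/opt/observium/mibs/net-snmp:/opt/observium/mibs/apc 'udp':'acrc02':'161' coolingUnitStatusAnalogValue.1.10)
        # Pass the values as arguments so a % in the output is not read as a format directive
        printf '%s,%s\n' "$timestamp" "$sensor_value" >>"$logfile"
done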
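Once the log has collected some samples, something along these lines could pull out the readings that look like spikes. This is only a sketch: the function name find_spikes is made up, it assumes every log line ends with the numeric reading (which the "name = value" output of -OQUs should give), and the threshold passed in is an arbitrary cut-off, not a value from APC or Observium.

```bash
#!/bin/bash
# find_spikes THRESHOLD LOGFILE
# Print every log line whose last whitespace-separated field is a number
# greater than THRESHOLD. Assumed line format: "HH:MM:SS,<name> = <value>".
find_spikes() {
    local threshold=$1 logfile=$2
    awk -v limit="$threshold" '
        # $NF is the last field; skip lines where it is not a plain number
        ($NF ~ /^-?[0-9]+(\.[0-9]+)?$/) && (($NF + 0) > (limit + 0)) { print }
    ' "$logfile"
}
```

For example, `find_spikes 10000 /home/neo/Documents/acrc02_coolingUnitStatusAnalogValue.1.10.txt` would list only the samples above 10000, together with their timestamps.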
I'll be back and ask for advice when I get the next alert. =)
Cheers and thanks
--
Henrik Cednert
cto | compositor
Filmlance International | www.filmlance.se
mobile [ + 46 (0)704 71 89 54 ]
skype  [ cednert ]
From: observium <observium-bounces@observium.org> on behalf of Adam Armstrong <adama@memetic.org>
Reply-To: Observium Network Observation System <observium@observium.org>
Date: Friday 10 June 2016 at 14:45
To: "observium@observium.org" <observium@observium.org>
Subject: Re: [Observium] Cool Output of APC in-row cooler, random enormous spikes.
Hi Henrik,
Sensors are all "gauges", so there's no real concept of spikes the way there is for ports (which are counters).
If we're getting a spike, it's either that the device is sending the wrong data, or that the device is sending a number which hasn't been adjusted for the scale the device uses. It'll be the first if the 1.6-1.9M figure has no relation to the "real" number, and the latter if it's just off by a few orders of magnitude.
It's pretty unlikely that we're doing any bad maths; maths is easy, and such bugs would show up much more regularly. These sorts of things are pretty common in many vendors' SNMP implementations. You might be able to see it in action by frequently polling the OID of the sensor in question and logging the value.
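One rough way to tell those two cases apart is to compare a spike with a typical reading and see whether the ratio is close to a whole power of ten. The sketch below does that with awk; the function name classify_spike and the 0.05 tolerance are illustrative assumptions, not anything Observium does.

```bash
#!/bin/bash
# classify_spike SPIKE NORMAL
# Report whether SPIKE / NORMAL is close to a whole power of ten
# (suggesting a missing scale factor) or not (suggesting bad data).
classify_spike() {
    awk -v spike="$1" -v normal="$2" 'BEGIN {
        if (spike <= 0 || normal <= 0) { print "undetermined"; exit }
        e = log(spike / normal) / log(10)   # base-10 logarithm of the ratio
        nearest = int(e + 0.5)              # nearest whole exponent (for e >= 0)
        # 0.05 is an arbitrary tolerance chosen for this illustration
        if (nearest >= 1 && e - nearest < 0.05 && nearest - e < 0.05)
            print "scale factor 10^" nearest
        else
            print "unrelated"
    }'
}
```

For instance, `classify_spike 1800000 1800` reports a factor of 10^3, while a spike that is not a clean multiple of the usual reading comes back as "unrelated".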
adam.
On 10/06/2016 13:40:59, Henrik Cednert (Filmlance) <henrik.cednert@filmlance.se> wrote:
Hello
I'm monitoring two APC in-row coolers. One of them shows some weird 1.6-1.9M watt spikes every now and then, about once a day, without regularity. I have started to monitor the same device in Schneider Data Center Expert just to see whether it's a device or a monitoring issue. There are no spikes there, so I think it's something with Observium and this particular unit. The other unit is fine, with no spikes. The only difference is that the load on this one is smaller and at times down to 0.
I do see that there's an option to log spikes in the config file, but I'm not sure I can monitor a sensor with it, since it says port and wants an ID. Can I debug these spikes in some way?
Cheers and thanks
--
Henrik Cednert
cto | compositor
Filmlance International | www.filmlance.se
mobile [ + 46 (0)704 71 89 54 ]
skype [ cednert ]
_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium