This was the event log at the time of the incident:
When it happens next, is there any value in me running some commands from the CLI to capture debug output?
Thanks
Richard
On 24/06/2019 09:47, Adam Armstrong via observium wrote:
Thresholds for this MIB come from the device itself. It seems likely that the device returned an incorrect threshold setting for some reason.
Does it always change the threshold to the same value? Does that value match the thresholds of any other similar sensors?
We wouldn't autogenerate a threshold where one already exists.
adam.
On 2019-06-24 09:39:59, Richard Savage via observium observium@observium.org wrote:
Hi Adam
I've managed to get the issue to recur. You can see from the screenshot that the alert triggered because the threshold value disappeared at 08:24 on 18 June. The threshold changed to 0V - (∞) 2.76V, which meant that the current value of 3.25 threw an alert.
Later that day, at 14:35 on 18 June, with no action from anyone, the alert cleared and the sensor thresholds returned to the correct values:
This is just one example from one sensor, but at the time I had alerts for pretty much all sensors on the ASR.
Can this please be looked into and fixed?
If you need any more info then please let me know.
Thanks
Richard
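The effect Richard describes can be illustrated with a minimal sketch of a range check. This is not Observium's code: the function name `check_sensor` and the "healthy" limits of 3.0 V and 3.6 V are hypothetical, chosen only for illustration. The incident values (reading 3.25, limits 0 V to 2.76 V) are the ones reported above.

```python
from typing import Optional


def check_sensor(value: float,
                 limit_low: Optional[float],
                 limit_high: Optional[float]) -> str:
    """Return "alert" if the value is outside the limits, else "ok".

    A limit of None means "no limit on that side".
    """
    if limit_low is not None and value < limit_low:
        return "alert"
    if limit_high is not None and value > limit_high:
        return "alert"
    return "ok"


# Limits as reported during the incident: 0 V to 2.76 V.
print(check_sensor(3.25, 0.0, 2.76))  # alert

# Hypothetical healthy limits (assumed, not taken from the thread).
print(check_sensor(3.25, 3.0, 3.6))   # ok
```

The reading itself never changed in this scenario. Only the limits moved, which is enough to flip the result.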
On 31/05/2019 10:39, Klimek, Denis via observium wrote:
Hi all,
I think it's related to the problem that the device is NOT returning any values for the limits, so Observium is creating the limits dynamically based on the current sensor value that the device returned.
If the value changes (e.g. voltage, temperature), we also see a lot of sensor updates in the event logs.
Tested with latest 6.3.3 and 6.5.3 on Cisco ASR.
Kind regards
Stadtwerke Norderstedt
*Denis Klimek*
Professional Network Engineer
IP-Systemtechnik
Tel: 040 / 521 04 – 1049
Mobile: 0151 / 652 219 06
dklimek@stadtwerke-norderstedt.de
www.stadtwerke-norderstedt.de
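Denis's hypothesis can be sketched as follows. This is a rough illustration only: the function `effective_limits` and the 15% margin are assumptions made for the example, not Observium's actual formula. It shows why limits derived from the current reading would shift whenever the reading shifts, while device-supplied limits stay fixed.

```python
from typing import Optional, Tuple


def effective_limits(value: float,
                     device_low: Optional[float],
                     device_high: Optional[float],
                     margin: float = 0.15) -> Tuple[float, float]:
    """Prefer limits supplied by the device; otherwise derive them
    from the current reading using a hypothetical +/- margin."""
    low = device_low if device_low is not None else value - abs(value) * margin
    high = device_high if device_high is not None else value + abs(value) * margin
    return low, high


# Device returns no limits: the derived limits follow each reading,
# so successive calculations would each look like a limits update.
for reading in (3.25, 3.30):
    print(reading, effective_limits(reading, None, None))

# Device returns limits (hypothetical values): they are used unchanged.
print(effective_limits(3.25, 3.0, 3.6))
```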
*From:* observium [mailto:observium-bounces@observium.org] *On behalf of* Adam Armstrong via observium
*Sent:* Thursday, 30 May 2019 11:55
*To:* Richard Savage via observium
*Cc:* Adam Armstrong
*Subject:* Re: [Observium] Sensor limits creating false alerts
These limits are likely being updated because the device has returned a different set of limits.
We'd need to know where these sensors came from to know.
adam.
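Adam's point, that limits are rewritten when the device returns a different set, can be sketched as a comparison of stored and returned limits. This is not Observium's implementation: `describe_limit_changes` and the stored values are hypothetical. The field names and the output format mimic the "Sensor updated (limits)" event log line quoted in the original report further down the thread.

```python
from typing import Dict, Optional

LIMIT_FIELDS = ("limit_high", "limit_high_warn", "limit_low", "limit_low_warn")


def describe_limit_changes(stored: Dict[str, Optional[float]],
                           returned: Dict[str, Optional[float]]) -> str:
    """Return an event-log style line listing every limit that differs,
    or an empty string when nothing changed."""
    changes = []
    for field in LIMIT_FIELDS:
        new = returned.get(field)
        if stored.get(field) != new:
            shown = "NULL" if new is None else str(new)
            changes.append(f'{field} -> "{shown}"')
    if not changes:
        return ""
    return "Sensor updated (limits): " + ", ".join(changes)


# Hypothetical limits stored from an earlier discovery.
stored = {"limit_high": 70, "limit_high_warn": 65,
          "limit_low": -5, "limit_low_warn": 0}
# A later poll where the device returns zeros and missing values.
returned = {"limit_high": 0, "limit_high_warn": None,
            "limit_low": 0, "limit_low_warn": None}

print(describe_limit_changes(stored, returned))
```

Capturing the raw values returned by the device at the moment such a line is logged would show whether the device really sent a different set of limits or sent none at all.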
On 2019-05-30 09:07:45, Richard Savage via observium <observium@observium.org> wrote:
From what I can see, the sensor limits keep being updated on each poll. Surely they should just be a static value and not need to be changed?
Thanks
Richard
On 29/05/2019 20:22, Richard Savage via observium wrote:
I'll try to get a screenshot of the sensors next time it happens. From memory, the sensor value (the actual voltage) is still there, but the threshold limit isn't there at all and just shows 0 or the infinity symbol.
Richard
On 29 May 2019, at 20:16, Richard Savage via observium <observium@observium.org> wrote:
It's a 9006 running 5.3.4 with 2 x A9K-RSP440-SE and some A9K-40GE-L and A9K-16T/8-B line cards in it.
Thanks
Richard
On 29 May 2019, at 19:24, Klimek, Denis <DKlimek@Stadtwerke-Norderstedt.de> wrote:
Hi Richard,
Which ASR9K model do you have and which IOS-XR version are you running?
Regards,
Denis
*From:* observium <observium-bounces@observium.org> *On behalf of* Richard Savage via observium
*Sent:* Wednesday, 29 May 2019 17:16
*To:* Observium <observium@observium.org>
*Cc:* Richard Savage <richard@zananet.com>; Adam Armstrong <adama@memetic.org>
*Subject:* Re: [Observium] Sensor limits creating false alerts
It seems to have become more of an issue on the ASR9Ks in v19.5.9916, which we are running. It's happening more frequently now and is causing lots of false positives.
Is it possible, once the thresholds are polled, to just set them in stone? I can't see any reason why the thresholds need to change. For example, a voltage should always be between two values, and those threshold values shouldn't change. From what I can see, it's the actual limits that disappear and not the values.
Thanks
Richard
On 29/05/2019 14:55, Adam Armstrong via observium wrote:
Ahh ok. I assume this is related to unstable polling. These are all from CISCO-ENTITY-SENSOR-MIB, which is pretty heavy, and Nexus has iffy SNMP.
I'm not sure how we can work around this. It's hard to know if something is not there forever, or just for now, and storing state on that stuff is pretty messy.
Is this a more recent thing, or has it always happened?
Disabling faster polling modes on a device like this would probably make it unpollable.
adam.
On 2019-05-29 14:52:14, Richard Savage via observium <observium@observium.org> wrote:
Hi Adam
This is mainly seen on Cisco ASR9Ks, but we have also seen it on Cisco Nexus switches.
No, this affects voltage, temperature, and dBm levels on SFPs.
Happy to provide debug if required.
Those are a few examples. It seems to be a mass load of sensors at any one time.
Thanks
Richard
On 29/05/2019 14:46, Adam Armstrong via observium wrote:
Hi Richard,
I assume this is because the OIDs from which the limits come sometimes aren't polled?
Is this only a single (os) device? Only a single type of sensor, or multiple? (You should now be able to see the mib/oid on the sensor page.)
adam.
On 2019-05-28 10:24:00, Richard Savage via observium <observium@observium.org> wrote:
Hi All
Just wondering if anyone has any ideas on this? (Mike / Adam)
Thanks
Richard
On 21/05/2019 14:50, Richard Savage via observium wrote:
Hi All
I have found a potential bug with the sensor data output which doesn't seem to be related to any particular version of Observium or device being polled (useful, I know!). However, it looks like some sensor limits are being updated to NULL regularly, which is throwing alerts and scaring staff ;-)
Noticed on both 19.3.9774 and 19.4.9840.
E.g. from a device event log:
2019-05-10 12:36:07 Ethernet1/49 Lane 2 Transceiver Temperature <https://stats.goodwood.com/device/device=170/tab=health/metric=temperature/id=55094/> Sensor updated (limits): limit_high -> "0", limit_high_warn -> "NULL", limit_low -> "0", limit_low_warn -> "NULL"
2019-05-10 13:08:08 Ethernet1/49 Lane 2 Transceiver Transmit Power <https://stats.goodwood.com/device/device=170/tab=health/metric=dbm/id=55096/> Dbm Ethernet1/49 Lane 2 Transceiver Transmit Power above threshold: 0.03 dBm (> 0 dBm)
A re-discovery of the affected device will fix the issue, until the next time it happens.
Happy to provide any further debug required.
Thanks
Richard
_______________________________________________
observium mailing list
observium@observium.org
http://postman.memetic.org/cgi-bin/mailman/listinfo/observium