
On 15 Nov 2017, at 17:15, Robert Williams Robert@CustodianDC.com wrote:
Hi all,
Just noticed a strange problem with reporting alarms on the ASR9001 (XR5.3.4).
A lab router was reloaded, and when it came back up there was a “Major” chassis alarm because one of the two PSUs had failed. However, neither the PSU failure nor the ‘MAJ’ alarm status is detected by Observium.
Even after a full discovery and poll, everything appears green and happy in Observium. In fact, it now shows only one PSU, even though two are installed and one is in a failed state:
<Picture (Device Independent Bitmap) 1.jpg>
(All other 9001 have 2 x “ASR-9001 AC Power Supply” entries here)
Console output confirms the fault and the presence of the major alarm, as well as a few different things (like N+1 resilience lost and ‘Capacity’) which could be used to detect this issue (if they are in Cisco-SNMP-land that is).
#show env power
R/S/I        Modules   Capacity   Status
                       (W)
0/PS0/M0/*   host PM   750        Ok
0/PS0/M1/*   host PM   0          Failed   <<<<<<<<
#show env power
<snip>
N+1 Supply Protected Capacity Available: Not Protected   <<<<<<<<
#show environment leds
R/S/I       Modules   LED              Status
0/RSP0/*    host      Critical-Alarm   Off
            host      Major-Alarm      On    <<<<<<<<
            host      Minor-Alarm      Off
RP/0/RSP0/CPU0:RTR-123(admin)#show env power states
Wed Nov 15 16:59:15.135 GMT
R/S/I        State          MaxPower   Time                   Count
             (1-ON/2-OFF)   (W)        (YY:WK:DD:HH:MIN:SS)
0/PS0/M0/*   1              750        00:00:00:00:59:27      1
             2              0          00:00:00:00:00:00      0
0/PS0/M1/*   1              0          00:00:00:00:00:00      0
             2              0          00:00:00:00:59:27      1    <<<<<< ‘2 = Off’
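The marked rows in the output above are the sort of thing a script could look for. As a rough sketch only, assuming the `show env power` text has already been captured somehow (for example by a config-backup or SSH tool) and that each module sits on its own line as the CLI prints it, something like the following could flag a failed module or lost N+1 protection. The function names and `SAMPLE` text are made up for illustration, and the column layout is inferred from the output quoted here, not from Cisco documentation.

```python
import re

# Matches rows such as "0/PS0/M1/*   host PM   0   Failed".
# Column layout is an assumption inferred from the quoted output.
ROW = re.compile(
    r"^\s*(?P<loc>\d+/PS\d+/M\d+/\*)\s+host\s+PM\s+(?P<watts>\d+)\s+(?P<status>\S+)"
)


def failed_power_modules(show_env_power: str) -> list:
    """Return the locations of power modules whose status is not 'Ok'."""
    failed = []
    for line in show_env_power.splitlines():
        m = ROW.match(line)
        if m and m.group("status").lower() != "ok":
            failed.append(m.group("loc"))
    return failed


def n_plus_1_lost(show_env_power: str) -> bool:
    """True if the N+1 summary line reports 'Not Protected'."""
    return any(
        "N+1" in line and "Not Protected" in line
        for line in show_env_power.splitlines()
    )


# Illustrative text modelled on the output quoted in this thread.
SAMPLE = """\
R/S/I        Modules   Capacity   Status
                       (W)
0/PS0/M0/*   host PM   750        Ok
0/PS0/M1/*   host PM   0          Failed
N+1 Supply Protected Capacity Available: Not Protected
"""

if __name__ == "__main__":
    print(failed_power_modules(SAMPLE))
    print(n_plus_1_lost(SAMPLE))
```

This only parses text that has already been collected; how the output gets off the router is left open.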
So... can anything be done to tune up the detection of such issues? In this case it was pure chance that I noticed the fault after the reload, as I had to be physically in front of the router to patch something in. Otherwise I’d not have known.
Any ideas? Cheers!
Robert Williams Custodian Data Centres https://www.CustodianDC.com
I bet that if you run the poller in debug mode, you will find the device itself is now reporting only one power supply over SNMP, rather than this being an Observium issue specifically.
I actually noticed this same thing recently when I yanked a fan tray out of my lab 9010 doing similar testing.
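One way to check what the box itself reports, independent of Observium, is to walk the standard ENTITY-MIB `entPhysicalClass` column (OID 1.3.6.1.2.1.47.1.1.1.1.5) and count the entries of class powerSupply(6) and fan(7). The sketch below assumes the walk was saved as text from Net-SNMP's `snmpwalk` with `-On` (numeric OIDs); the entity indexes in `SAMPLE_WALK` are invented, and whether a failed ASR 9000 module disappears from this table is exactly the open question here, not something this code establishes.

```python
import re
from collections import Counter

# entPhysicalClass column of ENTITY-MIB; values 6 and 7 are the
# standard PhysicalClass enumerations powerSupply(6) and fan(7).
ENT_PHYSICAL_CLASS = ".1.3.6.1.2.1.47.1.1.1.1.5"
CLASS_NAMES = {6: "powerSupply", 7: "fan"}

# Accepts both "INTEGER: 6" and "INTEGER: powerSupply(6)".
LINE = re.compile(r"^(?P<oid>\.[\d.]+)\s*=\s*INTEGER:\s*(?:\w+\()?(?P<val>\d+)\)?")


def count_classes(walk_text: str) -> Counter:
    """Count power supply and fan entities in saved snmpwalk -On output."""
    counts = Counter()
    for line in walk_text.splitlines():
        m = LINE.match(line.strip())
        if not m or not m.group("oid").startswith(ENT_PHYSICAL_CLASS + "."):
            continue
        name = CLASS_NAMES.get(int(m.group("val")))
        if name:
            counts[name] += 1
    return counts


def shortfalls(counts: Counter, expected: dict) -> dict:
    """Return {class: number missing} for classes below the expected count."""
    return {
        name: want - counts.get(name, 0)
        for name, want in expected.items()
        if counts.get(name, 0) < want
    }


# Hypothetical walk output; indexes and values are illustrative only.
SAMPLE_WALK = """\
.1.3.6.1.2.1.47.1.1.1.1.5.1 = INTEGER: 3
.1.3.6.1.2.1.47.1.1.1.1.5.2 = INTEGER: 6
.1.3.6.1.2.1.47.1.1.1.1.5.3 = INTEGER: 7
"""

if __name__ == "__main__":
    found = count_classes(SAMPLE_WALK)
    print(shortfalls(found, {"powerSupply": 2, "fan": 1}))
```

If the count comes back lower than the number of PSUs physically installed, the entity is missing at the SNMP level and no poller could see it.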
This is probably something you need syslog alerting to capture: if the 9k doesn’t report the power supply or fan tray alarms, or the module at all, through SNMP when it fails, then it’s a Cisco problem rather than an Observium one.
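As a very rough illustration of the syslog approach, the sketch below filters received syslog lines for power, fan tray, and alarm keywords. Everything specific in it is an assumption: the keyword patterns are guesses, and the sample lines are placeholders rather than real IOS XR messages, so the patterns would need to be adjusted against what the router actually logs when a PSU or fan tray fails.

```python
import re

# Guessed keywords, not taken from Cisco documentation. Tune these
# against real messages captured from the device.
PATTERNS = [
    re.compile(p, re.IGNORECASE)
    for p in (
        r"\bpower (supply|module)\b",
        r"\bfan ?tray\b",
        r"\b(major|critical)[- ]alarm\b",
    )
]


def matching_lines(lines):
    """Return the syslog lines that match any of the alert patterns."""
    return [line for line in lines if any(p.search(line) for p in PATTERNS)]


# Placeholder lines for illustration only; not real router output.
SAMPLE_LOG = [
    "Nov 15 16:00:01 rtr-lab placeholder: power module 0/PS0/M1 reported failed",
    "Nov 15 16:00:05 rtr-lab placeholder: interface changed state to up",
    "Nov 15 16:00:09 rtr-lab placeholder: Major-Alarm LED on",
]

if __name__ == "__main__":
    for hit in matching_lines(SAMPLE_LOG):
        print(hit)
```

The same regular expressions could in principle be reused in whatever syslog alerting rules the monitoring system supports.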