On 15 Nov 2017, at 17:15, Robert Williams <Robert@CustodianDC.com> wrote:

Hi all,
 
Just noticed a strange problem with reporting alarms on the ASR9001 (XR5.3.4).
 
A lab router was reloaded, and when it came back up there was a “Major” chassis alarm because one of the two PSUs had failed. However, neither the PSU failure nor the ‘MAJ’ alarm status is detected by Observium.
 
Even after a full discovery and poll, everything appears green and healthy in Observium. In fact, it now shows only one PSU, even though two are installed and one is in a failed state:
 
<Picture (Device Independent Bitmap) 1.jpg>
 
(All other 9001 have 2 x “ASR-9001 AC Power Supply” entries here)
 
Console output confirms the fault and the presence of the major alarm, as well as a few different things (like N+1 resilience lost and ‘Capacity’) which could be used to detect this issue (if they are in Cisco-SNMP-land that is).
 
#show env power
R/S/I   Modules         Capacity        Status
                        (W)
0/PS0/M0/*
        host    PM      750             Ok
0/PS0/M1/*
        host    PM      0               Failed          <<<<<<<<
 
#show env power
<snip>
N+1 Supply Protected Capacity Available: Not Protected  <<<<<<<<
 
#show environment leds
R/S/I   Modules LED             Status
0/RSP0/*
        host    Critical-Alarm  Off
        host    Major-Alarm     On                      <<<<<<<<
        host    Minor-Alarm     Off
 
RP/0/RSP0/CPU0:RTR-123(admin)#show env power states
Wed Nov 15 16:59:15.135 GMT
R/S/I   State           MaxPower        Time                    Count
        (1-ON/2-OFF)    (W)             (YY:WK:DD:HH:MIN:SS)
----------------------------------------------------------------------
0/PS0/M0/*
        1               750             00:00:00:00:59:27       1
        2               0               00:00:00:00:00:00       0
----------------------------------------------------------------------
0/PS0/M1/*
        1               0               00:00:00:00:00:00       0
        2               0               00:00:00:00:59:27       1      <<<<<< ‘2 = Off’
 
So, can anything be done to tune up the detection of such issues? In this case it was pure chance that I noticed the fault after the reload, as I had to be physically in front of the router to patch something in. Otherwise I’d not have known.
 
Any ideas? Cheers!
 
Robert Williams
Custodian Data Centres
https://www.CustodianDC.com

I bet that if you run the poller in debug mode, you’ll find that SNMP on the device is now reporting only one power supply present, rather than this being an Observium issue specifically.

I actually noticed this same thing recently when I yanked a fan tray out of my lab 9010 doing similar testing. 

This is probably something you need syslog alerting to capture. If the 9k doesn’t report the power supply or fan tray alarms, or the module at all, through SNMP when it fails, then it’s a Cisco problem rather than an Observium one.
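
If SNMP really does drop the failed module, another stopgap alongside syslog would be to check the CLI output itself. Below is a minimal sketch in Python, not anything Observium provides: it assumes the `show env power` text has already been collected by some other means (a scheduled SSH job, for example, not shown here) and that it looks like the paste earlier in this thread. The helper name `failed_power_modules` and the `SAMPLE` string are hypothetical, for illustration only.

```python
import re

# Assumption: location lines look like "0/PS0/M1/*" and status lines look like
# "host    PM      0       Failed", as in the console paste in this thread.
LOCATION_RE = re.compile(r"^\s*(\d+/PS\d+/M\d+/\*)\s*$")
STATUS_RE = re.compile(r"^\s*host\s+PM\s+(\d+)\s+(\S+)")


def failed_power_modules(output):
    """Return (location, capacity_w, status) for each module not reporting Ok."""
    failed = []
    location = None
    for line in output.splitlines():
        match = LOCATION_RE.match(line)
        if match:
            # Remember which power module the next status line belongs to.
            location = match.group(1)
            continue
        match = STATUS_RE.match(line)
        if match and location is not None:
            capacity = int(match.group(1))
            status = match.group(2)
            if status.lower() != "ok":
                failed.append((location, capacity, status))
    return failed


# Hypothetical sample, copied from the output quoted above.
SAMPLE = """\
R/S/I   Modules         Capacity        Status
                        (W)
0/PS0/M0/*
        host    PM      750             Ok
0/PS0/M1/*
        host    PM      0               Failed          <<<<<<<<
"""

if __name__ == "__main__":
    for loc, cap, status in failed_power_modules(SAMPLE):
        print(f"{loc}: {status} (capacity {cap} W)")
```

The parsing is tied to the exact column layout shown in this thread, so it would need checking against other XR releases before being relied on.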