Hi all,
 
I've just noticed a strange problem with alarm reporting for the ASR9001 (XR 5.3.4).
 
A lab router was reloaded and, when it came back up, it had a “Major” chassis alarm because one of the two PSUs had failed. However, neither the PSU failure nor the ‘MAJ’ alarm status is detected by Observium.
 
Even after a full discovery and poll, everything shows green in Observium. In fact, it now shows only one PSU, even though two are installed and one is in a failed state:
 
 
(All the other 9001s have 2 x “ASR-9001 AC Power Supply” entries here)
 
Console output confirms the fault and the major alarm. It also shows a few other indicators (such as the loss of N+1 resilience and the ‘Capacity’ value) that could be used to detect this issue, provided they are exposed via SNMP on the Cisco side.
 
#show env power
R/S/I   Modules         Capacity        Status
                        (W)
0/PS0/M0/*
        host    PM      750             Ok
0/PS0/M1/*
        host    PM      0               Failed          <<<<<<<<
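
As a stopgap until this is visible over SNMP, output like the above could be checked by a script. Below is a rough Python sketch, not a tested integration: it assumes the text of "show env power" has already been captured somehow (collection over SSH is not shown) and that the column layout matches the sample above. The function name failed_power_modules is made up for illustration.

```python
import re

# Sample taken from the "show env power" output above.
SAMPLE = """\
R/S/I   Modules         Capacity        Status
                        (W)
0/PS0/M0/*
        host    PM      750             Ok
0/PS0/M1/*
        host    PM      0               Failed          <<<<<<<<
"""


def failed_power_modules(text):
    """Return (location, capacity_w, status) for each power module not reporting Ok.

    Hypothetical helper: assumes the layout shown in SAMPLE.
    """
    failed = []
    location = None
    for line in text.splitlines():
        stripped = line.strip()
        # Location lines look like 0/PS0/M1/*
        if re.fullmatch(r"\d+/\S+/\*", stripped):
            location = stripped
            continue
        # Module lines look like: host PM <capacity> <status>
        match = re.match(r"host\s+PM\s+(\d+)\s+(\S+)", stripped)
        if match and location is not None:
            capacity = int(match.group(1))
            status = match.group(2)
            if status.lower() != "ok":
                failed.append((location, capacity, status))
    return failed


if __name__ == "__main__":
    for loc, cap, status in failed_power_modules(SAMPLE):
        print(f"{loc}: {status} (capacity {cap} W)")
```

Anything returned by the function could then be fed into whatever alerting is already in place.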
 
#show env power
<snip>
N+1 Supply Protected Capacity Available: Not Protected  <<<<<<<<
 
#show environment leds
R/S/I   Modules LED             Status
0/RSP0/*
        host    Critical-Alarm  Off
        host    Major-Alarm     On                      <<<<<<<<
        host    Minor-Alarm     Off
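
The LED output could be scraped in a similar way to catch any chassis alarm regardless of its cause. Another rough Python sketch, again assuming the "show environment leds" text has already been captured and follows the layout above; active_alarm_leds is a made-up name for illustration.

```python
import re

# Sample taken from the "show environment leds" output above.
SAMPLE = """\
R/S/I   Modules LED             Status
0/RSP0/*
        host    Critical-Alarm  Off
        host    Major-Alarm     On                      <<<<<<<<
        host    Minor-Alarm     Off
"""


def active_alarm_leds(text):
    """Return (location, led_name) for each alarm LED whose status is On.

    Hypothetical helper: assumes the layout shown in SAMPLE.
    """
    active = []
    location = None
    for line in text.splitlines():
        stripped = line.strip()
        # Location lines look like 0/RSP0/*
        if re.fullmatch(r"\d+/\S+/\*", stripped):
            location = stripped
            continue
        # LED lines look like: host <Name>-Alarm <On|Off>
        match = re.match(r"host\s+(\S+-Alarm)\s+(\S+)", stripped)
        if match and location is not None and match.group(2).lower() == "on":
            active.append((location, match.group(1)))
    return active


if __name__ == "__main__":
    for loc, led in active_alarm_leds(SAMPLE):
        print(f"{loc}: {led} is On")
```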
 
RP/0/RSP0/CPU0:RTR-123(admin)#show env power states
Wed Nov 15 16:59:15.135 GMT
R/S/I   State           MaxPower        Time                    Count
        (1-ON/2-OFF)    (W)             (YY:WK:DD:HH:MIN:SS)
----------------------------------------------------------------------
0/PS0/M0/*
        1               750             00:00:00:00:59:27       1
        2               0               00:00:00:00:00:00       0
----------------------------------------------------------------------
0/PS0/M1/*
        1               0               00:00:00:00:00:00       0
        2               0               00:00:00:00:59:27       1      <<<<<< ‘2 = Off’
 
So, can anything be done to improve the detection of such issues? In this case it was pure chance that I noticed the fault after the reload, as I had to be physically in front of the router to patch something in. Otherwise I would not have known.
 
Any ideas? Cheers!
 
Robert Williams
Custodian Data Centres
https://www.CustodianDC.com