ASR-9000 PSU Status / Alarms

Hi all,
I've just noticed a strange problem with alarm reporting on the ASR9001 (IOS XR 5.3.4).
A lab router was reloaded, and when it came back up there was a "Major" chassis alarm because one of its two PSUs had failed. However, Observium detects neither the PSU failure nor the 'MAJ' alarm status.
Even after a full discovery and poll, everything seems all green and happy in Observium. In fact, it now shows only one PSU, even though two are installed and one is in a failed state:
(All the other 9001s have two "ASR-9001 AC Power Supply" entries here.)
Console output confirms the fault and the presence of the major alarm. It also shows a few other things, such as the loss of N+1 resilience and the 'Capacity' figure, that could be used to detect this issue (if they are exposed in Cisco-SNMP-land, that is).
#show env power
R/S/I        Modules   Capacity   Status
                       (W)
0/PS0/M0/*   host PM   750        Ok
0/PS0/M1/*   host PM   0          Failed   <<<<<<<<

#show env power
<snip>
N+1 Supply Protected Capacity Available: Not Protected   <<<<<<<<

#show environment leds
R/S/I        Modules   LED              Status
0/RSP0/*     host      Critical-Alarm   Off
             host      Major-Alarm      On    <<<<<<<<
             host      Minor-Alarm      Off

RP/0/RSP0/CPU0:RTR-123(admin)#show env power states
Wed Nov 15 16:59:15.135 GMT
R/S/I        State          MaxPower   Time                   Count
             (1-ON/2-OFF)   (W)        (YY:WK:DD:HH:MIN:SS)
----------------------------------------------------------------------
0/PS0/M0/*   1              750        00:00:00:00:59:27      1
             2              0          00:00:00:00:00:00      0
----------------------------------------------------------------------
0/PS0/M1/*   1              0          00:00:00:00:00:00      0
             2              0          00:00:00:00:59:27      1    <<<<<< '2 = Off'
So... can anything be done to tune up the detection of issues like this? In this case it was pure chance that I noticed the fault after the reload, as I had to be physically in front of the router to patch something in. Otherwise I wouldn't have known.
Any ideas? Cheers!
Robert Williams Custodian Data Centres https://www.CustodianDC.com
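A minimal Python sketch of how the three console signals marked above (the Failed module row, the N+1 "Not Protected" line and the lit Major-Alarm LED) could be picked out of captured CLI text, for example by a script that logs in and runs these commands. The regular expressions assume exactly the column layouts pasted in this thread; `find_faults` and `SAMPLE` are hypothetical names, and this is not something Observium does itself.

```python
import re

# Hypothetical helper, not part of Observium or IOS XR. The patterns assume the
# column layouts pasted in this thread (ASR9001, IOS XR 5.3.4).
MODULE_ROW = re.compile(
    r"^(?P<rsi>\d+/PS\d+/M\d+/\*)\s+host\s+PM\s+(?P<watts>\d+)\s+(?P<status>\S+)"
)
LED_ROW = re.compile(r"^(?:\S+\s+)?host\s+(?P<led>\S+-Alarm)\s+(?P<state>On|Off)\b")
NPLUS1 = re.compile(
    r"N\+1 Supply Protected Capacity Available:\s*(?P<value>.+?)\s*(?:<+.*)?$"
)


def find_faults(cli_text: str) -> list[str]:
    """Return a description of each fault signal found in the CLI text."""
    faults = []
    for raw in cli_text.splitlines():
        line = raw.strip()
        # "show env power" module rows: any status other than "Ok" is a fault.
        m = MODULE_ROW.match(line)
        if m:
            if m.group("status") != "Ok":
                faults.append(
                    f"PSU {m.group('rsi')} status {m.group('status')} ({m.group('watts')} W)"
                )
            continue
        # "show environment leds" rows: any *-Alarm LED that is lit.
        m = LED_ROW.match(line)
        if m:
            if m.group("state") == "On":
                faults.append(f"{m.group('led')} LED is On")
            continue
        # "show env power" summary: "Not Protected" is the only value seen here.
        m = NPLUS1.search(line)
        if m and m.group("value") == "Not Protected":
            faults.append("N+1 protection lost")
    return faults


# Trimmed copy of the output quoted above.
SAMPLE = """\
0/PS0/M0/*   host PM   750        Ok
0/PS0/M1/*   host PM   0          Failed
N+1 Supply Protected Capacity Available: Not Protected
0/RSP0/*     host      Critical-Alarm   Off
             host      Major-Alarm      On
             host      Minor-Alarm      Off
"""

for fault in find_faults(SAMPLE):
    print(fault)
```

Scraping the CLI like this is brittle (any change in column layout breaks the patterns), so it is only a stop-gap until the same information is available via SNMP.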

On 15 Nov 2017, at 17:15, Robert Williams Robert@CustodianDC.com wrote:
<snip>
I bet if you run the poller in debug mode, you'll see that SNMP from the device is now reporting only one power supply present, rather than this being an Observium issue specifically.
I actually noticed this same thing recently when I yanked a fan tray out of my lab 9010 while doing similar testing.
This is probably something you need Syslog alerting to capture: if the 9k doesn't report the power supply or fan tray alarms (or the module at all) through SNMP when it fails, then it's a Cisco problem rather than an Observium one.

Hi Robert,
Please attach the debug log from a discovery run for this device:
./discovery.php -d -m sensors -h <device>
Robert Williams wrote:
<snip>
observium mailing list observium@observium.org http://postman.memetic.org/cgi-bin/mailman/listinfo/observium

Hi Mike,
Please find the output attached. Let me know if there's anything else I can do.
Ultimately, I guess the ability to detect the MAJ and CRIT chassis alarms would be useful, as these will always be raised for any hardware issue.
Even though the PSU was dead during the reload and SNMP never started showing it, it is still present and in a 'failed' state according to the CLI, so you'd think that would be reported differently from 'not present'. Even by Cisco standards...
Cheers!
Robert Williams Custodian Data Centres https://www.CustodianDC.com
-----Original Message-----
From: observium [mailto:observium-bounces@observium.org] On Behalf Of Mike Stupalov
Sent: 16 November 2017 09:46
To: Observium observium@observium.org
Subject: Re: [Observium] ASR-9000 PSU Status / Alarms
<snip>
-- Mike Stupalov Observium Limited, http://observium.org

As I can see in the debug output, these sensors are discovered via CISCO-ENTITY-FRU-CONTROL-MIB::cefcFRUPowerStatusEntry:
cefcFRUPowerAdminStatus.10906792 = on
cefcFRUPowerAdminStatus.31415605 = on
cefcFRUPowerAdminStatus.59316821 = on
cefcFRUPowerAdminStatus.66531208 = on
cefcFRUPowerOperStatus.10906792 = on
cefcFRUPowerOperStatus.31415605 = on
cefcFRUPowerOperStatus.59316821 = on
cefcFRUPowerOperStatus.66531208 = on
cefcFRUCurrent.10906792 = 0
cefcFRUCurrent.31415605 = 0
cefcFRUCurrent.59316821 = 0
cefcFRUCurrent.66531208 = 750
And the corresponding entPhysical entries:
entPhysicalDescr.10906792 = 10GBASE-LR SFP+ Module for SMF
entPhysicalName.10906792 = module mau 0/0/2/3
entPhysicalClass.10906792 = module

entPhysicalDescr.31415605 = 10GBASE-LR SFP+ Module for SMF
entPhysicalName.31415605 = module mau 0/0/2/2
entPhysicalClass.31415605 = module

entPhysicalDescr.59316821 = ASR-9001 Fan Tray
entPhysicalName.59316821 = fantray 0/FT0/SP
entPhysicalClass.59316821 = fan

entPhysicalDescr.66531208 = ASR-9001 AC Power Supply
entPhysicalName.66531208 = power-module 0/PS0/M0/SP
entPhysicalClass.66531208 = powerSupply
Here you can see that the device reports only one power supply and one fan tray via SNMP (as displayed in your screenshot).
Nothing surprising for Cisco devices :)
1. Try updating the firmware.
2. If the firmware update doesn't fix it, write to Cisco TAC.
Robert Williams wrote:
<snip>
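Because the failed PSU simply disappears from ENTITY-MIB instead of being reported with a non-"on" status, a per-sensor status check alone would never fire here. A minimal Python sketch of an alternative, assuming you know how many PSUs each chassis should have (two for these ASR-9001s): group the walk output by index and compare the number of `powerSupply` entities against that expected count. `parse_walk`, `check_power_supplies` and the `expected` parameter are hypothetical names; this is not how Observium's sensors module works.

```python
import re
from collections import defaultdict

# Hypothetical helper, not an Observium feature. Parses snmpwalk-style
# "object.index = value" lines like the debug output quoted in this thread.
WALK_LINE = re.compile(r"^(?P<obj>\w+)\.(?P<index>\d+)\s*=\s*(?P<value>.*)$")


def parse_walk(text: str) -> dict[str, dict[str, str]]:
    """Group values by entPhysicalIndex: {index: {object_name: value}}."""
    table = defaultdict(dict)
    for line in text.splitlines():
        m = WALK_LINE.match(line.strip())
        if m:
            table[m.group("index")][m.group("obj")] = m.group("value").strip()
    return dict(table)


def check_power_supplies(table: dict[str, dict[str, str]], expected: int) -> list[str]:
    """Flag missing PSUs (fewer than `expected`) and any PSU whose oper status is not "on"."""
    psus = {
        idx: row
        for idx, row in table.items()
        if row.get("entPhysicalClass") == "powerSupply"
    }
    problems = []
    if len(psus) < expected:
        problems.append(f"only {len(psus)} of {expected} power supplies visible via SNMP")
    for idx, row in sorted(psus.items()):
        oper = row.get("cefcFRUPowerOperStatus")
        if oper is not None and oper != "on":
            problems.append(f"{row.get('entPhysicalName', idx)}: oper status {oper}")
    return problems


# Subset of the walk quoted above (fan tray and power supply only).
SAMPLE = """\
cefcFRUPowerOperStatus.59316821 = on
cefcFRUPowerOperStatus.66531208 = on
entPhysicalName.59316821 = fantray 0/FT0/SP
entPhysicalClass.59316821 = fan
entPhysicalDescr.66531208 = ASR-9001 AC Power Supply
entPhysicalName.66531208 = power-module 0/PS0/M0/SP
entPhysicalClass.66531208 = powerSupply
"""

print(check_power_supplies(parse_walk(SAMPLE), expected=2))
```

Run against the walk above, this reports only one of two expected supplies. The expected count deliberately comes from outside SNMP (inventory or config), because in this case the device itself no longer admits that the second PSU exists.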

Hi Mike,
Thanks for taking the time to look at this. Shame they don't seem to export the fact that there is a PSU and that it is in a failed state...!
I'll poke TAC and see if anything can be done. In the meantime, is there a way to get the chassis 'status' LEDs? Or are those not exported either?
Cheers!
Robert Williams Custodian Data Centres https://www.CustodianDC.com
-----Original Message-----
From: observium [mailto:observium-bounces@observium.org] On Behalf Of Mike Stupalov
Sent: 16 November 2017 16:38
To: Observium observium@observium.org
Subject: Re: [Observium] ASR-9000 PSU Status / Alarms
<snip>
participants (3)
- Mike Stupalov
- Robert Williams
- Tim Cooper