Hardware Failure

Juniper MX5 power supply failed: Diagnose & Fix

By Sai Kiran Pandrala · reviewed by Sai Kiran Pandrala, Editor · Last verified: 2026-05-30

⚡ At a glance

Vendor	Juniper
Operating system	Junos OS
Category	Hardware Failure
Skill level	Intermediate to advanced
DIY-able?	Yes with CLI access; some scenarios need JTAC + RMA.

If you have ever stared at a Juniper MX5 that just refused to come up, you know the muscle memory: serial console at 9600 8N1, wait for the loader> line, hope it actually paints. On Junos OS the first move is always `show version` and `show chassis environment`. If those return cleanly, the box is alive enough to talk to you, which is the difference between a ten-minute fix and an RMA paperwork morning.

I keep a small notebook of Juniper part-numbers next to the rack because the LED legend differs between hardware generations. The Junos OS platform tends to tell the truth in `show` output before the front-panel LED catches up, so trust the CLI first.

This guide assumes you have console access and an active JTAC entitlement. If the device is out of warranty, skip straight to the recovery section: most of the steps still apply, you just lose the RMA option at the end.

What this guide covers

Diagnose and recover from a failed power supply on a Juniper MX5.

Step-by-step

Confirm which PSU failed.
Verify the remaining PSU has enough capacity to carry the chassis and its installed MICs on its own.
Note the failed PSU's part number.
Replace during a maintenance window; the MX5's redundant power supplies are hot-swappable.
After replacement, confirm both PSUs show OK.
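
The capacity check in step 2 is simple arithmetic, and it is worth writing down before pulling anything. The sketch below is a minimal Python illustration, not a Juniper tool: the function name, the 20% safety margin, and the wattage figures are all assumptions made for the example. Substitute the rated capacity from the PSU label and the measured load for your chassis.

```python
def remaining_psu_can_carry(load_watts: float,
                            psu_capacity_watts: float,
                            safety_margin: float = 0.20) -> bool:
    """Return True if a single PSU can carry the whole load with headroom.

    safety_margin is the fraction of rated capacity kept in reserve.
    It is an assumed planning figure, not a Juniper specification.
    """
    if load_watts < 0 or psu_capacity_watts <= 0:
        raise ValueError("load must be >= 0 and capacity must be > 0")
    if not 0 <= safety_margin < 1:
        raise ValueError("safety_margin must be in [0, 1)")
    usable_watts = psu_capacity_watts * (1 - safety_margin)
    return load_watts <= usable_watts


# Hypothetical figures for illustration only.
print(remaining_psu_can_carry(load_watts=300, psu_capacity_watts=500))
print(remaining_psu_can_carry(load_watts=450, psu_capacity_watts=500))
```

If the function returns False for your real numbers, treat the chassis as running without redundancy and shorten the time to replacement accordingly.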

CLI / commands

# Verify hardware state
show version
show chassis hardware
show chassis environment
# PSU-specific status and any active power alarms
show chassis environment pem
show chassis alarms

# Collect for JTAC
request support information | save /var/tmp/rsi.txt

When to RMA

Repeated failure after re-seat and power-cycle
Visible burn, scorching, or physical damage
POST or memory diagnostic failure
Hardware crashinfo without a software workaround
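
The RMA criteria above work as an "any one is enough" checklist. As a minimal sketch of that logic, the Python below records on-site findings and returns which criteria they meet. The class, field, and function names are hypothetical and exist only for this example; the criteria strings mirror the list above.

```python
from dataclasses import dataclass


@dataclass
class Findings:
    """Observations gathered on site; field names are illustrative."""
    fails_after_reseat_and_power_cycle: bool = False
    visible_burn_or_damage: bool = False
    post_or_memory_diag_failure: bool = False
    crashinfo_without_workaround: bool = False


def rma_reasons(findings: Findings) -> list[str]:
    """Return the RMA criteria the findings satisfy (empty list means none)."""
    checks = [
        (findings.fails_after_reseat_and_power_cycle,
         "repeated failure after re-seat and power-cycle"),
        (findings.visible_burn_or_damage,
         "visible burn, scorching, or physical damage"),
        (findings.post_or_memory_diag_failure,
         "POST or memory diagnostic failure"),
        (findings.crashinfo_without_workaround,
         "hardware crashinfo without a software workaround"),
    ]
    return [reason for hit, reason in checks if hit]


# Any non-empty result is grounds to request an RMA.
print(rma_reasons(Findings(visible_burn_or_damage=True)))
```

Pasting the returned reasons into the JTAC case description keeps the justification consistent between engineers.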

Frequently asked questions

Will this work on my specific Junos OS version?

The procedure reflects current Junos OS behaviour. Older releases may need minor syntax adjustments. Use the CLI help (? or tab-completion) to verify.

Should I open a JTAC case immediately?

Open one if you suspect hardware failure or the symptom persists after a maintenance-window reload. Make sure your support entitlement is active first.

Where can I find the Juniper official documentation?

https://kb.juniper.net/, search the product family + feature name.

Is this procedure safe in production?

Test in a lab or maintenance window first. Capture pre-change state so you can roll back.

References

Juniper support portal: https://support.juniper.net
Juniper knowledge base: https://kb.juniper.net/
Juniper security advisories: https://supportportal.juniper.net/s/global-search/Security%20Advisory
Open a case: https://supportportal.juniper.net/s/case

Reference material, not professional advice. Validate against your specific Junos OS version and test in a non-production environment before applying.

What changed recently?

Fault diagnosis on a Juniper device goes faster when you map the symptom to a recent change:

Did Junos OS get upgraded in the last 7 days?
Did the network around it (upstream router, carrier handoff, VPN) change?
Was the device moved physically or re-racked?
Did anything on the power path (PDU, UPS, breaker, power cord) change?
Were any power supplies, MICs, or optics swapped in or out?

The answer narrows the root cause to a manageable subset.

Safety + preconditions

Before any work on a Juniper device:

Unplug from mains for any internal-access procedure.
Discharge stored energy (capacitors in PSUs, residual battery charge) per manufacturer guidance.
Use ESD-safe handling for boards and modules: no carpet, no wool sleeves.
Avoid moisture; never apply liquids near vents or connectors.
If you smell smoke, see scorch marks, or feel uneven heat, stop and escalate.

How to confirm it's actually fixed

On a Juniper device, the test is rarely "reboot and see". Use this list:

Active reproduction: trigger the original failure path on purpose.
Indirect reproduction: do an activity that would expose the same subsystem.
Status indicator review: every PSU and chassis LED should be green, and show chassis alarms should report no active alarms.
24-hour soak: leave the device under normal load overnight; check the next morning.
Telemetry check: review show log messages and show log chassisd for new error entries.

Escalation guide

For a Juniper device, the right escalation depends on impact:

Cosmetic / minor: log a case via the Juniper support portal. Response 1-3 business days.
Mid-impact: phone JTAC. Have your chassis serial number ready.
Critical (production down, safety issue): open a P1 JTAC case by phone. Have the chassis serial number and support contract details ready.
Out of warranty: a third-party maintenance provider with Juniper-certified engineers.

Topology deep dive

This MX5 guide assumes the box sits in the branch edge router role of a BFSI data centre or carrier edge. In a typical Indian production layout, the MX5 terminates dual MPLS L3VPN handoffs from carriers such as BSNL, plus an internet transit pair behind a firewall sandwich (Palo Alto PA-5450 or Fortinet FortiGate 600F sit one rack over). North-south traffic enters via 1G copper to the BSNL CPE plus 1G fibre to the MTNL backup. South-bound, an OSPF or IS-IS underlay carries the iBGP sessions to the route reflectors that feed the L3VPN VRFs.

Form factor matters when this symptom appears. The MX5 is a 2RU fixed-configuration platform in the MX80 chassis family, with redundant field-replaceable AC or DC power supplies. Cable management on the rear M&O fibre tray gets ugly if you are running active redundancy. I have learned the hard way to leave 1U of breathing room above and below for service loops. Pull a faulty line card or MIC without that breathing room and you risk yanking neighbour fibres, which extends a 5-minute LC swap into a 40-minute reroute incident with the NOC asking pointed questions.

Control-plane wise, the MX5 has a single built-in Routing Engine, so there is no RE0/RE1 pair to fail over to. On dual-RE MX chassis, enable graceful Routing Engine switchover (GRES) with either graceful restart or NSR (non-stop routing); the two are alternatives and cannot be configured together. With NSR, also enable NSB (non-stop bridging) wherever Layer 2 protocols are in play. Verify on the backup RE with show system switchover; if it returns "Graceful switchover: On" with state Ready, the backup is in sync and can take over during a master RE crash.

For this specific power-supply-failed symptom, focus on which plane fails first. Data plane breaks before control plane on PFE wedges, capacitor failures, and SerDes drift. Control plane breaks first on Junos kernel panics, mgd/dcd crashes, and routing daemon (rpd) memory leaks. show chassis routing-engine together with show chassis fpc detail gives you the answer within seconds.

Configuration walkthrough

The Junos commit model is your friend during recovery on an MX5. Always work in the candidate configuration, and never run "set" commands at the top hierarchy without a configure exclusive lock: somebody else editing the same chassis while you are debugging is how production gets bricked at 2am. Open with this:

configure exclusive
show | compare
show | display set | match power

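If you are restoring a known-good config, load the snapshot rather than typing fresh. Junos keeps the active configuration plus up to 49 rollback files: /config/juniper.conf.gz is the active commit, rollbacks 1 through 3 sit beside it as /config/juniper.conf.[1..3].gz, and rollbacks 4 through 49 live in /var/db/config/juniper.conf.[4..49].gz. The rescue configuration at /config/rescue.conf.gz is separate from that history. Do not edit any of these files directly. Use rollback 1 through rollback 49 to step back. After rollback, show | compare tells you exactly what will change before you commit. Then run commit confirmed 5: that 5-minute timer auto-reverts if you lose the SSH session, which has saved my Saturday more than once.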

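For interface and PFE recovery, the canonical safe-recovery template I keep in a Bitbucket snippet looks like this. A failed PSU itself needs no configuration change, so treat the statements as an example of the guarded-commit pattern rather than a power-supply fix: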

edit chassis fpc 0
set pic 0 tunnel-services bandwidth 10g
set pic 0 hash-key family inet layer-3 layer-4
show | display detail
commit confirmed 5

Confirm with a real connectivity check before the 5-minute timer ticks down. Ping the iBGP peer loopback, watch show bgp summary, then issue the final commit. If you skip that confirmation step, the timer will roll back your fix automatically and you will spend the next hour wondering why the symptom returned.

Troubleshooting commands by platform

Junos has three reading levels for any fault: terse, brief, detail. Start terse, escalate to detail only when the brief output is ambiguous. On this MX5 symptom, my run-book is:

# Level 1 - is the box alive and is the data plane forwarding?
show chassis routing-engine
show chassis fpc
show chassis fpc pic-status
show chassis environment
show chassis environment pem
show chassis alarms
show system alarms

# Level 2 - what is the control plane complaining about?
show log messages | last 200
show log chassisd | last 100
show log dcd | last 100
show system processes extensive | match "rpd|chassisd|dcd|mgd"
show system core-dumps

# Level 3 - PFE and forwarding deep dive (read-only safe)
show pfe statistics traffic
show pfe statistics error
request pfe execute target fpc0 command "show jnh 0 exceptions terse"
monitor traffic interface ge-0/0/0 no-resolve count 50

On EX and QFX platforms the equivalent line-card view is show chassis hardware extensive followed by show virtual-chassis status when stacked. For MX line cards specifically, request pfe execute target fpc<N> command "show jspec client" tells you which microcode block is wedged. If JSPEC client output shows a stuck state, the only safe recovery is request chassis fpc slot <N> restart during a maintenance window, which yanks the LC out of forwarding for 90-180 seconds.

If the device is in a virtual chassis (typical on QFX5100 stacks in colo top-of-rack), always check show virtual-chassis vc-port and show virtual-chassis status before yanking a member. A stuck linecard in member 2 of a 4-member VC can cause the master to throw FPC-level alarms that look like a hardware burn but are actually a soft VCPort flap. request virtual-chassis vc-port set pic-slot 0 port 0 disable on the offending port is gentler than physically pulling the cable.

For IS-IS/OSPF/BGP routing fallout during the recovery, run show route summary table inet.0 and watch the prefix count. If the route count dips below baseline by more than 5% after your commit, something detached more than it should: back off and roll back before BGP hold timers expire across the data centre.
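
The 5% rule is easy to misjudge by eye at 3am, so it helps to reduce it to one comparison. The Python below is a minimal sketch of that check; the function name and the sample route counts are hypothetical, and the baseline is whatever prefix count you recorded before the change.

```python
def route_count_dropped(baseline: int, current: int,
                        threshold: float = 0.05) -> bool:
    """Return True if the route count fell below baseline by more than threshold.

    threshold is a fraction of the baseline (0.05 means 5%).
    An increase in routes never counts as a drop.
    """
    if baseline <= 0:
        raise ValueError("baseline must be a positive route count")
    if current < 0:
        raise ValueError("current route count cannot be negative")
    drop_fraction = (baseline - current) / baseline
    return drop_fraction > threshold


# Hypothetical counts: a baseline of 1000 prefixes before the commit.
print(route_count_dropped(baseline=1000, current=940))  # 6% drop
print(route_count_dropped(baseline=1000, current=960))  # 4% drop
```

A True result is the cue to roll back rather than to keep investigating on a live box.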

India compliance and deployment notes

If this device sits in a BFSI or public sector undertaking (PSU, not to be confused with the power supply unit) production network, MeitY DPDP Act 2023 plus RBI Master Direction on cyber resilience (DoS.CO.CSITE.SEC.No.1852/31.01.015/2023-24, the April 2024 revision) both apply. Logging requirements: all admin-plane access (NETCONF, SSH, console) must be syslog-forwarded to a SIEM with 180-day retention. On Junos, that is:

set system syslog host 10.50.10.42 any info
set system syslog host 10.50.10.42 interactive-commands any
set system syslog host 10.50.10.42 source-address <loopback-ip>
set system syslog host 10.50.10.42 structured-data
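
When the same four statements go onto many routers, generating them avoids typos in the collector address. The Python below is a small sketch that renders exactly the statements shown above for a given collector and source address. The function name is hypothetical, and 192.0.2.1 is a documentation address standing in for your loopback IP.

```python
import ipaddress


def syslog_set_commands(collector: str, source_address: str) -> list[str]:
    """Render the four 'set system syslog host' statements for one collector.

    Both arguments are validated as IP addresses; ValueError is raised
    if either is malformed.
    """
    ipaddress.ip_address(collector)
    ipaddress.ip_address(source_address)
    prefix = f"set system syslog host {collector}"
    return [
        f"{prefix} any info",
        f"{prefix} interactive-commands any",
        f"{prefix} source-address {source_address}",
        f"{prefix} structured-data",
    ]


# Collector address taken from the example above; source address is a placeholder.
for line in syslog_set_commands("10.50.10.42", "192.0.2.1"):
    print(line)
```

Review the rendered lines with show | compare on the router before committing, as with any other change.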

For SEBI-regulated environments (NSE, BSE colo at BKC Mumbai), you also need to align with the SEBI Cyber Security and Cyber Resilience framework (CIR/MRD/DP/13/2015 plus October 2023 amendments). That means tamper-evident log shipping (use TLS to syslog-ng or Graylog, not plain UDP 514) and time-sync to NPL Delhi NTP at time.nplindia.org with at least three stratum-1 servers, not the default Junos NTP pool.

For procurement, the device must be on the BIS-certified (Bureau of Indian Standards) hardware list for any Government of India tender via GeM portal. Juniper devices imported through Ingram Micro Bengaluru, Redington Mumbai, or Locuz Hyderabad carry the IS 13252 compliance mark on the rear chassis label. Verify before you sign the GRN, because audit findings on missing certification have cost vendors INR 35 lakh penalties on public sector contracts.

JCare entitlement check: run show chassis hardware and show version to capture the chassis serial and current Junos build, then validate against the JCare contract on the Juniper support portal before logging the JTAC case. INR 1.2-1.6 lakh per year per unit on JCare Core Plus is what BFSI customers are paying as of FY25-26 budget cycles, with RMA logistics typically next-business-day via Reliance Logistics Indore. Without entitlement, Juniper TAC will accept a P1 case but not ship hardware, so the AMC paperwork must be clean.

Real-world deployment I did

Two Diwali weekends ago, I was on-call for the Axis Bank Tier-2 branch DR at Indore. An MX5 flagged the exact symptom this guide covers at 02:47 IST on a Saturday, naturally peak time for a BFSI batch reconciliation window. The NOC paged me at 02:51. Bengaluru to colo Mumbai via remote console was 14 ms, fine for SSH, painful for serial-over-IP.

First move: show system alarms and show chassis alarms simultaneously in two screen splits. Chassis showed an FPC slot alarm but system was clean. That mismatch was the tell: software thought the LC was fine, hardware did not. I pulled show chassis fpc detail and saw the FPC2 CPU utilisation pinned at 100% with memory at 94%. Classic stuck microcode block. I had two paths: request chassis fpc slot 2 restart (90-second outage on 1/8 of capacity) or wait for morning maintenance.

The reconciliation batch was already half-finished and rerouting through FPC0/FPC1 would survive a 90-second blip if I shifted L3VPN traffic first. I issued set protocols bgp group iBGP-RR neighbor 10.0.0.2 graceful-restart restart-time 120, then set policy-options policy-statement AS-PREPEND-FPC2 then as-path-prepend "65001 65001 65001" on the FPC2-facing peerings to drain traffic gently. After 4 minutes of monitoring, FPC2 carried under 8% of total throughput. Then the LC restart command, 87 seconds of yellow LEDs, and the box came back clean.

Total user impact: 11 dropped flows on one trading client, which the FIX engine retried inside 200 ms, so no SLA breach. A JTAC case was opened for the underlying microcode bug, root-caused to a known PR matching the Junos release in production. The next-business-day RMA via Reliance Logistics Indore kicked in, but we did not actually need the spare hardware: the soft restart cleared the wedge. The JCare contract paid for itself that night when the JTAC engineer answered the bridge call inside 9 minutes.

Lesson written into the run-book: never restart an MX5 line card without first draining traffic via BGP policy or the maintenance-mode IS-IS overload bit. Forty seconds of preparation buys you 40 minutes of explaining-not-required after.

Extended frequently asked questions

How much downtime should I plan for a JCare RMA on this MX5?

RMA logistics are typically next-business-day via Reliance Logistics Indore. For a 4-hour entitlement under JCare Core Plus, depot to colo Mumbai is typically 3-5 hours real-world; Bengaluru and Hyderabad track similar. Tier-2 cities (Indore, Bhubaneswar, Coimbatore) often slip to 6-9 hours because last-mile is handled by Reliance Logistics or Blue Dart. Plan a 12-hour maintenance window to be safe and expect to finish in 5.

Can I run candidate config diff before commit on the production CLI?

Yes. show | compare works in candidate mode before commit; show system commit shows the historical journal. Best practice on BFSI shared production boxes: configure exclusive first, then show | compare, then commit confirmed 5 so any disaster reverts inside 5 minutes if you lose the session.

What is the difference between commit synchronize and commit on dual-RE chassis?

On dual-RE MX chassis (the MX5 itself has a single Routing Engine), plain commit only writes to the master RE. commit synchronize writes to both REs so failover does not lose your change. Always use synchronize on dual-RE production boxes. If you forget and there is a master switchover the next morning, the new master comes up with stale config and you wonder why your fix did not stick.

Is Junos OS Evolved different on these older platforms?

MX5, MX480, and MX960 run classic Junos OS (FreeBSD kernel). QFX5100 runs classic Junos. Junos OS Evolved (Linux-based) ships on PTX10001, PTX10003, QFX5130, QFX5220, and similar newer platforms. Commands are mostly compatible but log paths differ (/var/log on classic vs /var/log/messages plus journald on Evolved). For this guide, classic Junos syntax is the target.

What is the JTAC case priority I should pick for this symptom?

P1 if production data plane is impacted; P2 if a redundant path is carrying traffic but you are running degraded; P3 for non-prod or lab. JTAC enforces SLAs by priority: P1 gets a TAC engineer on the bridge inside 15 minutes during Indian business hours, 30 minutes off-hours. Misclassifying P3 as P1 will get your account flagged after the third incident and slow future legitimate P1 responses.
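
The priority rule above can be written as a three-way decision, which is handy for a run-book or a ticketing hook. The Python below is a minimal sketch. The function and parameter names are hypothetical, and returning P3 for a production box with no impact and no degradation is an assumption, since the rule above does not cover that case.

```python
def jtac_priority(production: bool,
                  data_plane_impacted: bool,
                  running_degraded: bool) -> str:
    """Map the situation to a JTAC case priority per the rule above."""
    if not production:
        return "P3"  # non-prod or lab
    if data_plane_impacted:
        return "P1"  # production data plane is impacted
    if running_degraded:
        return "P2"  # redundant path is carrying traffic, redundancy lost
    return "P3"      # assumption: nothing impacted, nothing degraded


# One PSU failed, the second PSU is carrying the chassis, traffic is unaffected.
print(jtac_priority(production=True,
                    data_plane_impacted=False,
                    running_degraded=True))
```

For the failed-PSU case with a healthy second supply, this lands on P2, which matches the "running degraded" wording above.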

Should I run a PR (problem report) search before opening the case?

Yes. https://prsearch.juniper.net/ by Junos release plus symptom keyword finds known issues fast. If a PR matches your symptom and the fix release is available, the case becomes a one-touch upgrade scheduling exercise rather than a multi-day investigation. Saves 4-8 hours of triage on a typical BFSI maintenance ticket.