Nvidia (Mellanox) SN3700 power supply failed: Diagnose & Fix
By Sai Kiran Pandrala · reviewed by Sai Kiran Pandrala, Editor · Last verified: 2026-05-30
| Vendor | Nvidia (Mellanox) |
|---|---|
| Operating system | Cumulus Linux / NVOS / SONiC |
| Category | Hardware Failure |
| Skill level | Intermediate to advanced |
| DIY-able? | Yes with CLI access; some scenarios need Nvidia Enterprise Support + RMA. |
Hardware-class faults on Nvidia (Mellanox) kit fall into a tidy little matrix once you have seen a few. On NVUE-based releases (Cumulus Linux, NVOS) the building blocks are `nv show system` and `nv show platform environment`; on SONiC the closest equivalent is `show platform psustatus`. The rest is pattern matching. The SN3700 platform shows up in these reports often only because the install base is large.
Do not skip the visible-and-audible inspection. Burnt-PCB smell and fan-tray rattle are diagnostic signals that no command will ever surface. I have caught more dying PSUs by ear than by `nv show platform environment`.
If the chassis is dark and the console is silent, jump straight to the PSU/cable substitution path before opening an Nvidia Enterprise Support ticket; it eliminates the most common cause in under five minutes.
What this guide covers
Diagnose and recover from a failed power supply on an Nvidia (Mellanox) SN3700.
Step-by-step
- Confirm which PSU failed.
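- Verify the remaining PSU can carry the full load on its own. The SN3700 is a fixed-configuration switch, so the load is the switch plus its installed optics; there are no line cards or PoE budget to account for.
- Note the failed PSU's part number so the replacement matches it (AC or DC input, airflow direction).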
- Replace during a maintenance window: most enterprise PSUs are hot-swappable.
- After replacement, confirm both PSUs show OK.
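The capacity check in the steps above comes down to simple arithmetic. Here is a minimal sketch, assuming you read the PSU's rated wattage from its label or datasheet and the current draw from your environment output. The function name `psu_headroom` and the default 20% safety margin are illustrative choices, not Nvidia figures.

```shell
#!/usr/bin/env bash
# psu_headroom RATED_W DRAW_W [MARGIN_PCT]
# Prints "ok" when one PSU rated RATED_W can carry DRAW_W while leaving
# MARGIN_PCT percent of its rating unused, otherwise "insufficient".
# All wattages are supplied by the operator; nothing here is an SN3700 spec.
psu_headroom() {
  local rated_w=$1 draw_w=$2 margin_pct=${3:-20}
  # Usable capacity after reserving the safety margin (integer arithmetic).
  local usable_w=$(( rated_w * (100 - margin_pct) / 100 ))
  if (( draw_w <= usable_w )); then
    echo "ok"
  else
    echo "insufficient"
  fi
}
```

For example, `psu_headroom 1100 600` treats 880 W as usable and prints `ok`. If the result is `insufficient`, plan the swap for a window where you can shed load first.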
CLI / commands
# Verify hardware state
nv show system
nv show platform inventory
nv show platform environment
# Collect for Nvidia Enterprise Support
sudo cl-support        # Cumulus Linux
show techsupport       # SONiC
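It helps to save the output of the verification commands before touching the hardware, so you have something to compare against afterwards. The helper below is a sketch, not a vendor tool: `capture_state` is a hypothetical name, and it simply runs whatever command strings you pass it and writes each result to its own file.

```shell
#!/usr/bin/env bash
# capture_state OUT_DIR CMD...
# Runs each command string and saves stdout and stderr to
# OUT_DIR/<command with non-alphanumerics replaced by "_">.txt
capture_state() {
  local out_dir=$1; shift
  mkdir -p "$out_dir" || return 1
  local cmd file
  for cmd in "$@"; do
    # "nv show system" becomes "nv_show_system.txt".
    file="$out_dir/$(printf '%s' "$cmd" | tr -c 'A-Za-z0-9' '_').txt"
    # Keep going if a command fails or is missing on this OS;
    # record the exit status in the file instead of aborting.
    bash -c "$cmd" >"$file" 2>&1 || echo "[exit $?] $cmd" >>"$file"
  done
}
```

A typical call on an NVUE-based system might be `capture_state "/tmp/psu-before" "nv show system" "nv show platform inventory" "nv show platform environment"`. Run it again into a second directory after the swap and compare the two with `diff -r`.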
When to RMA
- Repeated failure after re-seat and power-cycle
- Visible burn, scorching, or physical damage
- POST or memory diagnostic failure
- Hardware crashinfo without a software workaround
Frequently asked questions
Will this work on my specific Cumulus Linux / NVOS / SONiC version?
The procedure reflects current Cumulus Linux / NVOS / SONiC behaviour. Older releases may need minor syntax adjustments; use the CLI help (? or tab-completion) to verify.
Should I open an Nvidia Enterprise Support case immediately?
Open one if you suspect hardware failure or the symptom persists after a maintenance-window reload. Make sure your support entitlement is active first.
Where can I find the official Nvidia (Mellanox) documentation?
At https://docs.nvidia.com/networking/. Search for the product family plus the feature name.
Is this procedure safe in production?
Test in a lab or maintenance window first. Capture pre-change state so you can roll back.
Related guides
Related guides worth a look while you sort this one out:
- Nvidia (Mellanox) SN2010 power supply failed: Diagnose & Fix
- Nvidia (Mellanox) SN2100 power supply failed: Diagnose & Fix
- Nvidia (Mellanox) SN2410 power supply failed: Diagnose & Fix
- Nvidia (Mellanox) SN2700 power supply failed: Diagnose & Fix
- Nvidia (Mellanox) SN3420 power supply failed: Diagnose & Fix
- Nvidia (Mellanox) SN3700 fan tray failed: Diagnose & Fix
References
- Nvidia (Mellanox) support portal: https://enterprise-support.nvidia.com/
- Nvidia (Mellanox) knowledge base: https://docs.nvidia.com/networking/
- Nvidia (Mellanox) security advisories: https://www.nvidia.com/en-us/security/
- Open a case: https://enterprise-support.nvidia.com/s/createcase
Reference material, not professional advice. Validate against your specific Cumulus Linux / NVOS / SONiC version and test in a non-production environment before applying.
Common patterns we see
When this symptom shows up on an Nvidia device, three patterns repeat:
1. Recent firmware update changed behavior: the symptom started within a week of a firmware or OS upgrade. Roll back or wait for the hotfix.
2. Environmental trigger: temperature, humidity, line voltage, or network changes. Look at what changed in the environment.
3. Cumulative wear: components such as fans and PSUs degrade over time. Replace the worn part rather than chasing a software fix.
Knowing which pattern applies saves time on the wrong fix.
Before you start
A few things to confirm so the Nvidia device fix goes cleanly:
- Latest firmware downloaded if you're going to update.
- Warranty + support contract status checked. Opening sealed parts may void it.
- Backup of current configuration (where applicable) taken.
- Spare parts on hand if you anticipate replacement.
- Adequate workspace, lighting, and time; rushing causes regressions.
How to confirm it's actually fixed
On an Nvidia device, the test is rarely "reboot and see". Use this list:
- Active reproduction: trigger the original failure path on purpose.
- Indirect reproduction: do an activity that would expose the same subsystem.
- Status indicator review: every PSU, fan, and system status LED should be green.
- 24-hour soak: leave the device under normal load overnight; check the next morning.
- Telemetry check: review the device or app's diagnostic log for new error entries.
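The soak and telemetry checks in the list above can be automated with a small polling loop. This is a sketch under stated assumptions: `soak` is a hypothetical helper, and the check command is whatever you trust to exit non-zero when a PSU is unhealthy on your OS and release.

```shell
#!/usr/bin/env bash
# soak CHECK_CMD INTERVAL_S COUNT
# Runs CHECK_CMD COUNT times, INTERVAL_S seconds apart.
# Returns 0 if every run exits 0; returns 1 at the first failing run.
soak() {
  local check_cmd=$1 interval_s=$2 count=$3 i
  for (( i = 1; i <= count; i++ )); do
    if ! bash -c "$check_cmd" >/dev/null 2>&1; then
      echo "check failed on run $i of $count"
      return 1
    fi
    # No need to wait after the final run.
    if (( i < count )); then
      sleep "$interval_s"
    fi
  done
  echo "all $count checks passed"
}
```

For a roughly 24-hour soak at five-minute intervals you would call `soak "<your check>" 300 288`. On SONiC, one possible check is `! show platform psustatus | grep -q NOT`, which assumes a failed or missing PSU is reported with a status containing "NOT"; confirm that against your release's output before relying on it.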
Escalation guide
For an Nvidia device, the right escalation depends on impact:
- Cosmetic / minor: log a ticket via the Nvidia Enterprise Support web portal. Response 1-3 business days.
- Mid-impact: phone support. Have your serial number ready.
- Critical (production down, safety issue): open a high-severity case with Nvidia Enterprise Support and have the serial number and support bundle ready.
- Out of warranty: third-party repair shop with manufacturer-certified technicians.
More frequently asked questions
Should I update firmware first or last?
Update firmware first if a release note specifically mentions your symptom. Otherwise, finish the troubleshooting flow first, then update; that way you can isolate whether the update or the underlying fix solved it.
Is it safe to apply during business hours?
If the device is in production use, apply during a scheduled maintenance window. Most procedures need 2-15 minutes of downtime. Capture pre-change state so you can roll back if needed.
How long does this fix usually take?
Most users complete the steps in 20-45 minutes the first time, and 5-10 minutes on subsequent runs once the commands are familiar.
Are there safer alternatives for non-technical users?
Consumer-style guided troubleshooter apps do not cover data-center switches like this one. If you're not comfortable on the CLI, note the PSU LED state and the chassis serial number, then hand off to your network team or open a support case.
Will the procedure work on the international variant?
Hardware variants differ by part number (for example AC or DC input and airflow direction), so check the model spec sheet to confirm the replacement PSU matches your variant. If you're outside the US/EU, look for the regional support portal.