Nvidia (Mellanox) SN3700 power supply failed: Diagnose & Fix

By Sai Kiran Pandrala · Reviewed by Sai Kiran Pandrala, Editor · Last verified: 2026-05-30

⚡ At a glance
Vendor: Nvidia (Mellanox)
Operating system: Cumulus Linux / NVOS / SONiC
Category: Hardware Failure
Skill level: Intermediate to advanced
DIY-able? Yes, with CLI access; some scenarios need Nvidia Enterprise Support and an RMA.

Hardware-class faults on Nvidia (Mellanox) kit fall into a tidy little matrix once you have seen a few. On releases that ship the `nv` CLI, `nv show system` and `nv show platform environment` give you the building blocks; SONiC uses its own commands instead, such as `show platform psustatus`. The rest is pattern matching. The SN3700 platform is one of the more common offenders only because the install base is large.

Do not skip the visible-and-audible inspection. Burnt-PCB smell and fan-tray rattle are diagnostic signals that no command will ever surface. I have caught more dying PSUs by ear than by `nv show platform environment`.

If the chassis is dark and the console is silent, jump straight to the PSU/cable substitution path before opening an Nvidia Enterprise Support ticket; it eliminates the most common cause in under five minutes.

What this guide covers

Diagnose and recover from a failed power supply on an Nvidia (Mellanox) SN3700.

Step-by-step

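  1. Confirm which PSU failed, using `nv show platform environment` and the PSU status LEDs.
  2. Verify the remaining PSU can carry the whole switch on its own, including installed transceivers. The SN3700 is a fixed-configuration switch, so there is no line-card or PoE budget to account for.
  3. Note the failed PSU's part number so the replacement matches, including airflow direction.
  4. Replace during a maintenance window: most enterprise PSUs, including the SN3700's, are hot-swappable.
  5. After replacement, confirm both PSUs show OK.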
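
The capacity check in step 2 is simple arithmetic, and a minimal Python sketch can make it explicit. The function name `psu_headroom_watts`, the 80% derating factor, and the wattages below are illustrative assumptions, not values from Nvidia documentation. Take the real rating from the PSU label or datasheet and the real load from your own readings.

```python
def psu_headroom_watts(psu_rating_w: float, measured_load_w: float,
                       derate: float = 0.8) -> float:
    """Return spare watts on a single PSU after applying a derating factor.

    A negative result means one PSU alone cannot carry the load
    within the chosen safety margin.
    """
    if not 0 < derate <= 1:
        raise ValueError("derate must be in (0, 1]")
    return psu_rating_w * derate - measured_load_w


# Placeholder numbers only (assumption): substitute the rating printed on
# your PSU and the load you actually measure on the switch.
headroom = psu_headroom_watts(psu_rating_w=1000.0, measured_load_w=450.0)
print(f"Headroom on one PSU: {headroom:.0f} W")
```

If the result is negative or close to zero, treat the replacement as urgent rather than something that can wait for the next convenient window.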

CLI / commands

# Verify hardware state
nv show system
nv show platform inventory
nv show platform environment

# Collect for Nvidia Enterprise Support
sudo cl-support      # Cumulus Linux
show techsupport     # SONiC
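
If you capture the environment output as JSON (the `nv` CLI accepts `-o json`), a short script can flag unhealthy PSUs for you. The sketch below assumes a hypothetical shape: a mapping of PSU name to an object with a `state` field. The field names, the sample data, and the set of healthy strings are assumptions, so compare them against what your release actually prints before relying on this.

```python
import json

# Assumption: adjust to the state strings your release actually reports.
HEALTHY_STATES = {"ok"}


def unhealthy_psus(env_json: str) -> list[str]:
    """Return the names of PSUs whose 'state' is not in HEALTHY_STATES."""
    psus = json.loads(env_json)
    return sorted(
        name
        for name, details in psus.items()
        if str(details.get("state", "")).lower() not in HEALTHY_STATES
    )


# Hypothetical sample, not captured from a real switch.
sample = '{"PSU1": {"state": "ok"}, "PSU2": {"state": "bad"}}'
print(unhealthy_psus(sample))
```

A PSU entry with no `state` field is reported as unhealthy, which errs on the side of a manual check.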

When to RMA

Frequently asked questions

Will this work on my specific Cumulus Linux / NVOS / SONiC version?

The procedure reflects current Cumulus Linux / NVOS / SONiC behaviour. Older releases may need minor syntax adjustments; use the CLI help (? or tab-completion) to verify.

Should I open an Nvidia Enterprise Support case immediately?

Open one if you suspect hardware failure or the symptom persists after a maintenance-window reload. Make sure your support entitlement is active first.

Where can I find the official Nvidia (Mellanox) documentation?

Start at https://docs.nvidia.com/networking/ and search for the product family plus the feature name.

Is this procedure safe in production?

Test in a lab or maintenance window first. Capture pre-change state so you can roll back.
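
One way to make "capture pre-change state" concrete is to save the command output before and after the swap and diff the two. The following is a minimal Python sketch using the standard `difflib` module; the helper name `state_diff` and the sample snapshots are hypothetical and do not represent real switch output.

```python
import difflib


def state_diff(before: str, after: str) -> list[str]:
    """Return unified-diff lines between two saved command outputs."""
    return list(
        difflib.unified_diff(
            before.splitlines(),
            after.splitlines(),
            fromfile="before",
            tofile="after",
            lineterm="",
        )
    )


# Hypothetical snapshots, not real switch output.
before = "PSU1 ok\nPSU2 bad"
after = "PSU1 ok\nPSU2 ok"
for line in state_diff(before, after):
    print(line)
```

An empty diff means nothing changed between the two captures, so after a PSU replacement you would expect to see only the PSU lines differ.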


Reference material, not professional advice. Validate against your specific Cumulus Linux / NVOS / SONiC version and test in a non-production environment before applying.

Common patterns we see

When this symptom shows up on an Nvidia device, three patterns repeat:

  1. Recent firmware update changed behavior: the symptom started within a week of a firmware or OS upgrade. Roll back or wait for the hotfix.
  2. Environmental trigger: temperature, humidity, line voltage, network changes. Look at what changed in the environment.
  3. Cumulative wear: components such as fans and power supplies degrade over time. Replace the worn part rather than chasing a software fix.

Knowing which pattern applies keeps you from spending time on the wrong fix.

Before you start

A few things to confirm so the fix on the Nvidia device goes cleanly:

How to confirm it's actually fixed

On an Nvidia device, the test is rarely "reboot and see". Use this list:

Escalation guide

For an Nvidia device, the right escalation depends on impact:

More frequently asked questions

Should I update firmware first or last?

Update firmware first if a release note specifically mentions your symptom. Otherwise, finish the troubleshooting flow first, then update; that way you can isolate whether the update or the underlying fix solved it.

Is it safe to apply during business hours?

If the device is in production use, apply during a scheduled maintenance window. Most procedures need 2-15 minutes of downtime. Capture pre-change state so you can roll back if needed.

How long does this fix usually take?

Most users complete the steps in 20-45 minutes the first time, and 5-10 minutes on subsequent runs once the commands are familiar.
