Nvidia (Mellanox) SN3420 stack member missing: Diagnose & Fix

By Sai Kiran Pandrala · Reviewed by Sai Kiran Pandrala, Editor · Last verified: 2026-05-30

⚡ At a glance
Vendor: Nvidia (Mellanox)
Operating system: Cumulus Linux / NVOS / SONiC
Category: Hardware Failure
Skill level: Intermediate to advanced
DIY-able? Yes, with CLI access; some scenarios need Nvidia Enterprise Support and an RMA.

Hardware-class faults on Nvidia (Mellanox) kit fall into a tidy little matrix once you have seen a few. On releases that ship the NVUE CLI (Cumulus Linux, for example), `nv show system` and `nv show platform environment` give you the building blocks; SONiC uses its own `show platform` commands instead. The rest is pattern matching. The SN3420 platform is one of the more common offenders only because the install base is large.

Do not skip the visible-and-audible inspection. Burnt-PCB smell and fan-tray rattle are diagnostic signals that no command will ever surface. I have caught more dying PSUs by ear than by `nv show platform environment`.

If the chassis is dark and the console is silent, jump straight to the PSU/cable substitution path before opening an Nvidia Enterprise Support ticket. It eliminates the most common cause in under five minutes.

What this guide covers

Diagnose and recover from a missing stack member on an Nvidia (Mellanox) SN3420.

Step-by-step

  1. Run the stack / chassis status command to see member states.
  2. Inspect the stack cables and re-seat both ends.
  3. Try replacing one stack cable at a time to identify a bad cable.
  4. Power-cycle the affected member if cables are good.
  5. If the member still doesn't rejoin, RMA it.
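After the power cycle in step 4 you usually want to wait for the member to come back rather than re-run checks by hand. The sketch below is a generic polling helper, not a vendor tool: `wait_until` is a hypothetical function name, and the check command you pass to it is your choice. The ping target in the usage comment is a placeholder address, not taken from this guide.

```shell
#!/usr/bin/env bash
# Poll a check command until it succeeds or a timeout expires.
# Usage: wait_until <timeout_seconds> <interval_seconds> <command> [args...]
# Returns 0 as soon as the command succeeds, 1 if the timeout is reached.
wait_until() {
    local timeout="$1" interval="$2"
    shift 2
    local deadline=$(( SECONDS + timeout ))
    while true; do
        # Run the caller-supplied check; its output is discarded.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        if (( SECONDS >= deadline )); then
            return 1
        fi
        sleep "$interval"
    done
}

# Example (placeholder address): wait up to 10 minutes for the member's
# management interface to answer a single ping.
# wait_until 600 10 ping -c 1 -W 2 192.0.2.10 && echo "member reachable"
```

A reachable management interface only shows that the unit booted. Still confirm the member state with the status command from step 1.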

CLI / commands

# Verify hardware state
nv show system
nv show platform inventory
nv show platform environment

# Collect for Nvidia Enterprise Support
sudo cl-support        # Cumulus Linux
show techsupport       # SONiC
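It can help to save the output of the verification commands to files, so you have a record to attach to a support case or to compare against later. The following is a minimal sketch under stated assumptions: `snapshot_state` is a hypothetical helper, the directory layout is arbitrary, and the commands are passed in as strings so nothing about their output format is assumed.

```shell
#!/usr/bin/env bash
# Save the output of each diagnostic command to its own file in a new
# timestamped directory, then print the directory name.
# Usage: snapshot_state <base_dir> <command-string>...
snapshot_state() {
    local base="$1"
    shift
    local dir
    dir="$base/snapshot-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$dir" || return 1
    local cmd name
    for cmd in "$@"; do
        # Turn "nv show system" into the file name "nv_show_system.txt".
        name="$(printf '%s' "$cmd" | tr -c 'A-Za-z0-9' '_')"
        # Record a failure in the file instead of aborting the whole run.
        if ! bash -c "$cmd" >"$dir/$name.txt" 2>&1; then
            echo "[command exited non-zero]" >>"$dir/$name.txt"
        fi
    done
    printf '%s\n' "$dir"
}

# Example, using the three verification commands from this section
# (the base directory is a placeholder):
# snapshot_state "$HOME/sn3420-case" \
#     "nv show system" "nv show platform inventory" "nv show platform environment"
```

Passing commands as strings keeps the helper usable on any of the listed operating systems: swap in whichever show commands your release provides.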

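When to RMA

RMA the member if it still does not rejoin after you have re-seated and swapped the stack cables and power-cycled it.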

Frequently asked questions

Will this work on my specific Cumulus Linux / NVOS / SONiC version?

The procedure reflects current Cumulus Linux / NVOS / SONiC behaviour. Older releases may need minor syntax adjustments; use the CLI help (? or tab-completion) to verify.

Should I open an Nvidia Enterprise Support case immediately?

Open one if you suspect hardware failure or the symptom persists after a maintenance-window reload. Make sure your support entitlement is active first.

Where can I find the Nvidia (Mellanox) official documentation?

Start at https://docs.nvidia.com/networking/ and search for the product family plus the feature name.

Is this procedure safe in production?

Test in a lab or maintenance window first. Capture pre-change state so you can roll back.

Reference material, not professional advice. Validate against your specific Cumulus Linux / NVOS / SONiC version and test in a non-production environment before applying.

What changed recently?

Fault diagnosis on an Nvidia device goes faster when you map the symptom to a recent change. Knowing what changed narrows the root cause to a manageable subset.
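One rough way to answer "what changed recently?" from a Linux shell on the switch is to list files modified in the last few days. This is only a sketch: `recent_changes` is a hypothetical helper, and scanning `/etc` in the usage comment is an assumption about where configuration lives on your release, not something this guide specifies.

```shell
#!/usr/bin/env bash
# List regular files under a directory that were modified within the
# last N days, newest first.
# Usage: recent_changes <dir> <days>
recent_changes() {
    local dir="$1" days="$2"
    # -mtime -N matches files modified less than N*24 hours ago.
    # Permission errors are silenced so unreadable subtrees are skipped.
    find "$dir" -type f -mtime "-$days" -exec ls -1t {} + 2>/dev/null
}

# Example (path is an assumption): files changed in the last 7 days.
# recent_changes /etc 7
```

File timestamps only show that something was written, not who changed it or why, so treat the list as a prompt for questions rather than a verdict.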

Before you start

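A few things to confirm so the Nvidia device fix goes cleanly: you have CLI access to the switch, your support entitlement is active in case an RMA is needed, and you have captured the pre-change state so you can roll back.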

Quick verification

Before you walk away from an Nvidia device fix, run through:

  1. Reproduce the original trigger. Does the issue reappear?
  2. Check the device's status / health screen for any new alerts.
  3. Confirm that paired devices (for a switch, the other stack members and any controller) have reconnected.
  4. Save / commit any configuration changes per the device's normal workflow.
  5. Note the change in your maintenance log with the date and firmware version.
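If you saved command output to one directory before the fix and another afterwards, a recursive diff gives a quick first check for new alerts or missing components. The helper below is a sketch under that assumption: `compare_snapshots` is a hypothetical name, and it only reports which files differ, not what the differences mean.

```shell
#!/usr/bin/env bash
# Compare two directories of saved command output and report which
# files differ or exist on one side only.
# Usage: compare_snapshots <before_dir> <after_dir>
# Returns 0 when nothing differs, 1 otherwise.
compare_snapshots() {
    local before="$1" after="$2"
    # diff -rq prints one line per differing or unmatched file.
    if diff -rq "$before" "$after"; then
        echo "no differences"
        return 0
    fi
    return 1
}

# Example (placeholder directory names):
# compare_snapshots ./before-fix ./after-fix
```

Sensor readings such as temperatures and fan speeds change between runs, so a reported difference is a cue to read the file, not proof of a fault.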

When to call Nvidia support instead

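Escalate if you suspect hardware failure, or if the member still does not rejoin after cable replacement and a power cycle.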

More frequently asked questions

What if my model isn't exactly the same revision?

Cross-check the model code on the rating plate against the manufacturer support page. Major firmware generations sometimes shift the menu path; the option is usually under a similarly-named section.

Is it safe to apply during business hours?

If the device is in production use, apply during a scheduled maintenance window. Most procedures need 2-15 minutes of downtime. Capture pre-change state so you can roll back if needed.

How long does this fix usually take?

Most users complete the steps in 20-45 minutes the first time, and 5-10 minutes on subsequent runs once the menu paths are familiar.

Are there safer alternatives for non-technical users?

If the manufacturer offers a guided self-service troubleshooter for your product, it usually walks through the same steps in a guided UI; use that first if you're not comfortable with the CLI. Otherwise, hand the procedure to someone with CLI access or open a support case.

Should I update firmware first or last?

Update firmware first if a release note specifically mentions your symptom. Otherwise, finish the troubleshooting flow first, then update; that way you can isolate whether the update or the underlying fix solved it.