Nvidia (Mellanox) SN3420 stack member missing: Diagnose & Fix
By Sai Kiran Pandrala · reviewed by Sai Kiran Pandrala, Editor · Last verified: 2026-05-30
| Vendor | Nvidia (Mellanox) |
|---|---|
| Operating system | Cumulus Linux / NVOS / SONiC |
| Category | Hardware Failure |
| Skill level | Intermediate to advanced |
| DIY-able? | Yes with CLI access; some scenarios need Nvidia Enterprise Support + RMA. |
Hardware-class faults on Nvidia (Mellanox) kit fall into a tidy little matrix once you have seen a few. Cumulus Linux / NVOS / SONiC gives you the building blocks via `nv show system` and `nv show platform environment`; the rest is pattern matching. The SN3420 platform is one of the more common offenders only because the install base is large.
Do not skip the visible-and-audible inspection. Burnt-PCB smell and fan-tray rattle are diagnostic signals that no command will ever surface. I have caught more dying PSUs by ear than by `nv show platform environment`.
If the chassis is dark and the console is silent, jump straight to the PSU/cable substitution path before opening an Nvidia Enterprise Support ticket. It eliminates the most common cause in under five minutes.
What this guide covers
Diagnose and recover from a missing stack member on an Nvidia (Mellanox) SN3420.
Step-by-step
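- Run the stack / chassis status command to see member states. On Cumulus Linux, where switches are typically paired with MLAG rather than stacked, `nv show mlag` reports the peer state.
- Inspect the stack cables and re-seat both ends.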
- Try replacing one stack cable at a time to identify a bad cable.
- Power-cycle the affected member if cables are good.
- If the member still doesn't rejoin, RMA it.
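One way to make the re-seat and power-cycle steps above verifiable is to snapshot the hardware state before and after each action and compare the two. The sketch below is a minimal illustration, not a vendor tool: the function names `snapshot_state` and `compare_snapshots` are hypothetical, and it assumes the NVUE `nv` CLI from Cumulus Linux is on the path (on a host without it, a placeholder note is written instead).

```shell
# snapshot_state DIR
# Save the output of each diagnostic command into DIR, one file per command.
# Assumption: the NVUE `nv` CLI is available (Cumulus Linux).
snapshot_state() {
    snap_dir="$1"
    mkdir -p "$snap_dir" || return 1
    for snap_cmd in "nv show system" "nv show platform inventory" "nv show platform environment"; do
        # Build a file name such as nv_show_system.txt
        snap_file="$snap_dir/$(printf '%s' "$snap_cmd" | tr ' ' '_').txt"
        if command -v nv >/dev/null 2>&1; then
            # Intentionally unquoted so the string splits into command + arguments.
            $snap_cmd >"$snap_file" 2>&1 || echo "command failed: $snap_cmd" >>"$snap_file"
        else
            echo "nv CLI not found on this host" >"$snap_file"
        fi
    done
}

# compare_snapshots BEFORE_DIR AFTER_DIR
# Print a unified diff of the two snapshots; exit status 0 means nothing changed.
compare_snapshots() {
    diff -ru "$1" "$2"
}
```

A typical run would be `snapshot_state /tmp/before`, then the re-seat or power-cycle, then `snapshot_state /tmp/after` and `compare_snapshots /tmp/before /tmp/after`. Any diff output is also useful evidence to attach to a support case.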
CLI / commands
# Verify hardware state
nv show system
nv show platform inventory
nv show platform environment
# Collect for Nvidia Enterprise Support
cl-support          # Cumulus Linux
show techsupport    # SONiC
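Because the collection command differs by operating system, a small wrapper can pick whichever one is present. This is a rough sketch under stated assumptions: `pick_support_cmd` is a hypothetical helper name, and it treats any `show` executable on the path as the SONiC CLI, which may not hold on every system.

```shell
# pick_support_cmd
# Print the support-bundle command available on this host, or return 1 if none is found.
pick_support_cmd() {
    if command -v cl-support >/dev/null 2>&1; then
        echo "cl-support"          # Cumulus Linux
    elif command -v show >/dev/null 2>&1; then
        # Assumption: a `show` command on the path is the SONiC CLI.
        echo "show techsupport"    # SONiC
    else
        echo "no known support-bundle command found" >&2
        return 1
    fi
}
```

The function only prints the command rather than running it, so you can review it first; both collectors normally need root privileges and can take several minutes.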
When to RMA
- Repeated failure after re-seat and power-cycle
- Visible burn, scorching, or physical damage
- POST or memory diagnostic failure
- Hardware crashinfo without a software workaround
Frequently asked questions
Will this work on my specific Cumulus Linux / NVOS / SONiC version?
The procedure reflects current Cumulus Linux / NVOS / SONiC behaviour. Older releases may need minor syntax adjustments; use the CLI help (? or tab-completion) to verify.
Should I open an Nvidia Enterprise Support case immediately?
Open one if you suspect hardware failure or the symptom persists after a maintenance-window reload. Make sure your support entitlement is active first.
Where can I find the Nvidia (Mellanox) official documentation?
At https://docs.nvidia.com/networking/. Search for the product family plus the feature name.
Is this procedure safe in production?
Test in a lab or maintenance window first. Capture pre-change state so you can roll back.
Related guides
Related guides worth a look while you sort this one out:
- Nvidia (Mellanox) SN2010 stack member missing: Diagnose & Fix
- Nvidia (Mellanox) SN2100 stack member missing: Diagnose & Fix
- Nvidia (Mellanox) SN2410 stack member missing: Diagnose & Fix
- Nvidia (Mellanox) SN2700 stack member missing: Diagnose & Fix
- Nvidia (Mellanox) SN3700 stack member missing: Diagnose & Fix
- Nvidia (Mellanox) SN3420 all ports dead: Diagnose & Fix
References
- Nvidia (Mellanox) support portal: https://enterprise-support.nvidia.com/
- Nvidia (Mellanox) knowledge base: https://docs.nvidia.com/networking/
- Nvidia (Mellanox) security advisories: https://www.nvidia.com/en-us/security/
- Open a case: https://enterprise-support.nvidia.com/s/createcase
Reference material, not professional advice. Validate against your specific Cumulus Linux / NVOS / SONiC version and test in a non-production environment before applying.
What changed recently?
Fault diagnosis on an Nvidia device goes faster when you map the symptom to a recent change:
- Was firmware or the operating system updated in the last 7 days?
- Did the surrounding network change (uplinks, peer switch configuration)?
- Was the device moved physically?
- Did peer devices (the other stack member, controllers, connected hosts) update?
- Were any cables, transceivers, or other parts swapped in or out?
The answer narrows the root cause to a manageable subset.
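The "last 7 days" question can be answered from the package log rather than from memory. The sketch below assumes a Debian-based system such as Cumulus Linux, where package changes are recorded in `/var/log/dpkg.log` with an ISO date at the start of each line; the helper name `recent_changes` is hypothetical, and other operating systems may log upgrades elsewhere.

```shell
# recent_changes CUTOFF [LOGFILE]
# Print package-log lines dated on or after CUTOFF (format YYYY-MM-DD).
# Assumption: LOGFILE uses the dpkg log layout, where field 1 is the date.
recent_changes() {
    # ISO dates sort correctly as plain strings, so a string comparison is enough.
    awk -v cutoff="$1" '$1 >= cutoff' "${2:-/var/log/dpkg.log}"
}
```

With GNU `date`, the cutoff can be computed as `recent_changes "$(date -d '7 days ago' +%F)"`. An empty result suggests no package-level change in that window, which points back toward cabling or hardware.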
Before you start
A few things to confirm so the fix on the Nvidia device goes cleanly:
- Latest firmware downloaded if you're going to update.
- Warranty + support contract status checked; opening sealed parts may void it.
- Backup of current configuration (where applicable) taken.
- Spare parts on hand if you anticipate replacement.
- Adequate workspace, lighting, and time; rushing causes regressions.
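The configuration backup in the checklist above can be as simple as copying the config files to a dated directory. The following is a minimal sketch: `backup_config` is a hypothetical helper, and the file paths in the usage note are assumptions about where your operating system keeps its startup configuration, so confirm them on your release.

```shell
# backup_config DEST FILE...
# Copy each FILE that exists into DEST and print how many files were copied.
backup_config() {
    bk_dest="$1"
    shift
    mkdir -p "$bk_dest" || return 1
    bk_count=0
    for bk_src in "$@"; do
        # Skip paths that do not exist on this operating system.
        if [ -f "$bk_src" ] && cp -p "$bk_src" "$bk_dest/"; then
            bk_count=$((bk_count + 1))
        fi
    done
    echo "$bk_count"
}
```

For example, `backup_config "$HOME/backup-$(date +%F)" /etc/nvue.d/startup.yaml /etc/sonic/config_db.json` copies whichever of the two files is present (the first is assumed for Cumulus Linux with NVUE, the second for SONiC). A printed count of 0 means nothing was backed up and the paths need checking.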
Quick verification
Before you walk away from a fix on an Nvidia device, run through:
1. Reproduce the original trigger. Does the issue reappear?
2. Check the device's status / health output for any new alerts.
3. Confirm the affected member and any peer devices or controllers reconnected.
4. Save / commit any configuration changes per the device's normal workflow.
5. Note the change in your maintenance log with date + firmware version.
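For the health check in step 2, a quick keyword filter over the command output can highlight lines worth a closer look. This is only a sketch: `flag_alerts` is a hypothetical helper, and the default keyword list is an assumption rather than the exact state names your release prints, so adjust it to what your output actually shows.

```shell
# flag_alerts [PATTERN]
# Read status output on stdin and print numbered lines that match PATTERN.
# Exit status 0 means at least one suspicious line was found.
# Assumption: the default keywords are illustrative, not an official list of states.
flag_alerts() {
    grep -inE "${1:-fail|fault|absent|bad|critical}"
}
```

On Cumulus Linux this could be used as `nv show platform environment | flag_alerts`. No output is a good sign, but it does not replace reading the full status once.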
When to call Nvidia support instead
Escalate if:
- The same symptom returns within 24 hours of a clean fix.
- You see physical damage (burn marks, swollen battery, cracked PCB).
- The device is in warranty and a hardware replacement is the cheaper outcome.
- Repair requires specialised tools you don't own (alignment jigs, calibration software).
Following the official support path also keeps the warranty intact, which matters more than the time spent.
More frequently asked questions
What if my model isn't exactly the same revision?
Cross-check the model code on the rating plate against the manufacturer support page. Major firmware generations sometimes shift the menu path; the option is usually under a similarly-named section.
Is it safe to apply during business hours?
If the device is in production use, apply during a scheduled maintenance window. Most procedures need 2-15 minutes of downtime. Capture pre-change state so you can roll back if needed.
How long does this fix usually take?
Most users complete the steps in 20-45 minutes the first time, and 5-10 minutes on subsequent runs once the commands are familiar.
Are there safer alternatives for non-technical users?
Not really. Consumer-style guided troubleshooters do not apply to a data-centre switch like this one. If you're not comfortable at the CLI, hand the steps to a network engineer or open an Nvidia Enterprise Support case.
Should I update firmware first or last?
Update firmware first if a release note specifically mentions your symptom. Otherwise, finish the troubleshooting flow first, then update; that way you can isolate whether the update or the underlying fix solved it.