Fix Azure Operator Nexus Storage Appliance Issues
Why This Is Happening
I've worked with Azure Operator Nexus deployments across a range of telecom and enterprise environments, and I can tell you: the storage appliance is the component that generates the most support tickets. Not because it's poorly designed (it's actually quite solid), but because the error messages Azure surfaces when something goes wrong with Nexus storage are almost deliberately unhelpful. You'll see a provisioning state stuck on InProgress, a PVC that refuses to bind, or a workload suddenly throwing out-of-disk-space errors with no obvious cause. And the Azure portal just stares back at you.
Here's the core thing to understand. Azure Operator Nexus storage appliances aren't resources you create or delete yourself. They exist as a direct result of the cluster lifecycle, meaning Microsoft's internal orchestration creates and removes them. If you try to issue a create or delete operation manually, the platform blocks it. That surprises a lot of engineers who are used to the normal Azure resource model. You're working with resources you can view and update, but not fully control.
The most common pain points I see operators hit are:
- Storage appliance stuck in the Provisioning state: usually a sign that the cluster deployment hasn't completed cleanly, or that the underlying hardware didn't pass the Hardware Validation (HWV) phase.
- PVC provisioning failures: requesting storage outside the allowed min/max range for your storage class, or trying to use a storage class in the wrong access mode.
- Out-of-disk-space errors on running workloads: thin provisioning means the storage appliance doesn't reserve capacity upfront. If you haven't monitored consumption, volumes fill up silently until pods start crashing.
- Confusion between rack slot 1 and rack slot 2 appliances, especially when a second storage appliance is added and operators aren't sure which one is which.
- Remote vendor management won't enable: usually a permissions or configuration issue at the resource level.
The frustrating part is that Azure Operator Nexus is an on-premises deployment model, which means you're dealing with a hybrid of physical hardware constraints and cloud-side resource management. That combination creates failure modes that neither a pure cloud engineer nor a pure datacenter engineer immediately recognizes. I know this is a stressful situation, especially when storage issues cascade into workload downtime that affects your tenants.
Let's work through this systematically. Most of these issues are fixable once you know exactly where to look.
The Quick Fix: Try This First
Before you go deep into diagnostics, run this check first. The majority of Azure Operator Nexus storage appliance issues I see in the field trace back to one of two things: the storage appliance is in an Error status that nobody noticed, or a PVC is requesting a size that falls outside the allowed range. Both of these are fast to check.
Step 1: Check storage appliance status via Azure CLI.
Open your terminal and run:
az networkcloud storageappliance list \
--resource-group <YOUR_RESOURCE_GROUP> \
--subscription <SUBSCRIPTION_ID> \
--output table
Look at three columns in the output: the appliance status (reported as detailedStatus), provisioningState, and capacity. If the status shows Error or provisioningState shows Failed, that's your culprit. If the status is Available and provisioningState is Succeeded, the appliance itself is healthy and you should focus on the PVC or workload layer.
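If you want that decision in a script, here's a minimal sketch. The function name and the labels it prints are my own placeholders, not anything the Azure CLI provides; it only encodes the rule described above and expects you to pass in the two values you read from the CLI output.

```shell
# classify_appliance STATUS PROVISIONING_STATE
# Hypothetical helper: maps the two values from the CLI output to a next step.
# The printed labels are illustrative, not Azure terminology.
classify_appliance() {
  appliance_status="$1"
  provisioning_state="$2"
  if [ "$appliance_status" = "Error" ] || [ "$provisioning_state" = "Failed" ]; then
    echo "investigate-appliance"
  elif [ "$appliance_status" = "Available" ] && [ "$provisioning_state" = "Succeeded" ]; then
    echo "check-pvc-or-workload"
  else
    # Anything else (for example, still provisioning): look again later.
    echo "wait-and-recheck"
  fi
}
```

For example, `classify_appliance Available Succeeded` prints `check-pvc-or-workload`, which tells you to move on to the PVC checks in the next step.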
Step 2: Check your PVC size against the allowed limits.
If you're seeing PVC provisioning failures, the most likely cause is a size request that's out of range. For nexus-volume (ReadWriteOnce), the minimum is 1 MiB and the maximum is 12 TiB. For nexus-shared (ReadWriteMany), there's no minimum, but the backing NFS server has a hard capacity of 1 TiB: even if your PVC requests more, you can't actually consume beyond that limit.
kubectl get pvc -A
If any PVC shows Pending rather than Bound, describe it to see the exact failure reason:
kubectl describe pvc <PVC_NAME> -n <NAMESPACE>
The events section at the bottom of that output will tell you exactly why binding failed. Nine times out of ten it's either an unsupported size, a missing storage class, or the storage appliance being in an error state upstream.
Check remoteVendorManagementStatus on your storage appliance resource before opening a support ticket. If remote vendor management is disabled, Microsoft's support team won't be able to perform remote diagnostics on your Pure Storage hardware, which significantly slows down resolution time. Enable it proactively during your initial setup, not after things break.
One of the most common sources of confusion in Azure Operator Nexus environments, especially after a second storage appliance is added, is not knowing which physical appliance maps to which Azure resource. The rule is straightforward once you know it, but Microsoft doesn't surface it prominently in the portal.
The aggregator rack in every Azure Operator Nexus instance has two dedicated rack slots for storage appliances. Rack slot 1 always holds the first storage appliance. If your deployment includes a second storage appliance, it occupies rack slot 2. There is no ambiguity here: this is a physical assignment enforced by the hardware design.
To view your storage appliance details including capacity and status, run:
az networkcloud storageappliance show \
--name <STORAGE_APPLIANCE_NAME> \
--resource-group <YOUR_RESOURCE_GROUP> \
--subscription <SUBSCRIPTION_ID>
The output will include the rackId and rackSlot properties that map the Azure resource back to the physical hardware. Cross-reference these against your physical rack documentation to confirm which unit you're looking at.
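If you're evaluating whether you need a second storage appliance, check your BOM version first. A second appliance is only supported on instances running BOM version 2.0.x or later, and all Pure Storage appliances in the instance must have R4 controllers. Attempting to add a second appliance on older hardware will fail at validation. The BOM version describes the hardware itself, so confirm it against your hardware order and deployment records. The following command returns the cluster's clusterVersion, which is the software version the cluster is running; record it alongside the BOM version when you validate prerequisites: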
az networkcloud cluster show \
--name <CLUSTER_NAME> \
--resource-group <YOUR_RESOURCE_GROUP> \
--query "clusterVersion" \
--output tsv
If az networkcloud storageappliance show returns the appliance with status Available and provisioningState Succeeded, you've confirmed the appliance is healthy and correctly identified. Move to step 2.
When your Azure Operator Nexus storage appliance lands in Error status, the first instinct is to try recreating or deleting the resource. Don't. The platform actively blocks those operations, because storage appliances are created and deleted only through internal cluster lifecycle management. Attempting to force it will just produce another error.
Instead, your path forward is to gather the right diagnostic data and either resolve the underlying cluster issue or escalate with the right information. Here's how to get that data.
First, check the cluster's overall health. The storage appliance status is downstream of the cluster state:
az networkcloud cluster show \
--name <CLUSTER_NAME> \
--resource-group <YOUR_RESOURCE_GROUP> \
--query "{status:detailedStatus, message:detailedStatusMessage}" \
--output json
If the cluster itself shows a Failed or degraded state, that's almost certainly what's driving the storage appliance error. The cluster's Hardware Validation phase checks every machine's hardware health before deployment proceeds. If that phase detected issues with BIOS boot settings or other hardware components, those need to be resolved first.
Next, pull the activity log for the storage appliance resource to see what operations were attempted and when:
az monitor activity-log list \
--resource-id <STORAGE_APPLIANCE_RESOURCE_ID> \
--start-time 2026-04-14T00:00:00Z \
--output table
Look for failed operations and their error codes. The most actionable errors will have a clear message, such as hardware validation failures, network fabric connectivity issues, or internal provisioning timeouts. If you see internal errors with no clear message, that's typically a signal to escalate to Microsoft support with the full activity log output.
Once the underlying issue is resolved, usually through a cluster repair or redeployment operation, the storage appliance status should transition back to Available on its own within a few minutes.
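Rather than re-running the show command by hand while you wait, you can poll for the transition. This is a sketch under assumptions: wait_for_status is a hypothetical helper of my own, and the usage line assumes the appliance status is exposed as detailedStatus, so adjust the query to whatever field your CLI output actually shows.

```shell
# wait_for_status EXPECTED MAX_ATTEMPTS SLEEP_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND until its output equals EXPECTED
# or the attempts run out. Returns 0 on success, 1 on timeout.
wait_for_status() {
  expected="$1"
  max_attempts="$2"
  sleep_seconds="$3"
  shift 3
  attempt=1
  current=""
  while [ "$attempt" -le "$max_attempts" ]; do
    # A failing command is treated as "no status yet" rather than a fatal error.
    current="$("$@" 2>/dev/null)" || current=""
    if [ "$current" = "$expected" ]; then
      echo "reached $expected after $attempt attempt(s)"
      return 0
    fi
    if [ "$attempt" -lt "$max_attempts" ]; then
      sleep "$sleep_seconds"
    fi
    attempt=$((attempt + 1))
  done
  echo "still '$current' after $max_attempts attempt(s)" >&2
  return 1
}
```

A typical call would be `wait_for_status Available 20 30 az networkcloud storageappliance show --name <STORAGE_APPLIANCE_NAME> --resource-group <YOUR_RESOURCE_GROUP> --query detailedStatus --output tsv`, which checks every 30 seconds for up to 10 minutes. If it times out, treat that as a cue to go back to the activity log rather than to keep waiting.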
PVC provisioning failures on Azure Operator Nexus storage are almost always one of three things: wrong size, wrong storage class, or wrong access mode. Let me break each one down.
Wrong size for nexus-volume: If you request less than 1 MiB or more than 12 TiB, the provisioning request fails outright. Check your PVC spec:
kubectl get pvc <PVC_NAME> -n <NAMESPACE> -o yaml | grep -A5 resources
Fix the size in your manifest and reapply. Example of a valid nexus-volume PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: my-rwo-volume
spec:
  accessModes:
    - ReadWriteOnce
  volumeMode: Filesystem
  resources:
    requests:
      storage: 10Gi
  storageClassName: nexus-volume
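To catch an out-of-range request before you apply a manifest, a small pre-flight check can help. The sketch below is my own illustration, not part of any Nexus tooling: both function names are hypothetical, and the parser only handles whole numbers with an optional Ki, Mi, Gi, or Ti suffix. Kubernetes accepts more quantity formats than that (decimal suffixes, fractions), which this deliberately ignores.

```shell
# to_bytes QUANTITY
# Hypothetical helper: converts a whole-number quantity with an optional
# binary suffix (Ki, Mi, Gi, Ti) to bytes. Returns 1 for anything else.
to_bytes() {
  quantity="$1"
  case "$quantity" in
    *Ki) number="${quantity%Ki}"; factor=1024 ;;
    *Mi) number="${quantity%Mi}"; factor=$((1024 * 1024)) ;;
    *Gi) number="${quantity%Gi}"; factor=$((1024 * 1024 * 1024)) ;;
    *Ti) number="${quantity%Ti}"; factor=$((1024 * 1024 * 1024 * 1024)) ;;
    *)   number="$quantity"; factor=1 ;;
  esac
  case "$number" in
    # Reject empty, non-numeric, and leading-zero values (avoids octal parsing).
    ""|*[!0-9]*|0[0-9]*) return 1 ;;
  esac
  echo $((number * factor))
}

# nexus_volume_size_ok QUANTITY
# Succeeds when the request is within the 1 MiB to 12 TiB range quoted above.
# Returns 2 when the quantity cannot be parsed by to_bytes.
nexus_volume_size_ok() {
  requested_bytes="$(to_bytes "$1")" || return 2
  min_bytes=$((1024 * 1024))                     # 1 MiB
  max_bytes=$((12 * 1024 * 1024 * 1024 * 1024))  # 12 TiB
  [ "$requested_bytes" -ge "$min_bytes" ] && [ "$requested_bytes" -le "$max_bytes" ]
}
```

Running `nexus_volume_size_ok 10Gi` succeeds, while `nexus_volume_size_ok 13Ti` fails, so you can gate a `kubectl apply` on the result in a deployment script.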
nexus-shared capacity trap: You can provision a nexus-shared PVC requesting more than 1 TiB, and the API won't reject it. But the backing NFS server has a hard limit of 1 TiB, so your pod will eventually hit an out-of-disk-space error. Keep all nexus-shared PVC requests at or below 1 TiB, and track total consumption across all nexus-shared PVCs because the 1 TiB limit is shared across all of them.
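One conservative way to stay clear of the trap is to budget requests, not just consumption. The sketch below assumes you've collected the requested sizes of all your nexus-shared PVCs as whole GiB values (for example, by reading them off `kubectl get pvc -A`); the function name is hypothetical. Keeping the sum of requests under 1 TiB is stricter than the real limit, which applies to data actually written.

```shell
# shared_requests_fit GIB [GIB...]
# Hypothetical helper: sums nexus-shared PVC requests (whole GiB values),
# prints the total, and succeeds only if it fits in the 1 TiB NFS capacity.
shared_requests_fit() {
  limit_gib=1024   # 1 TiB expressed in GiB
  total_gib=0
  for request_gib in "$@"; do
    total_gib=$((total_gib + request_gib))
  done
  echo "$total_gib"
  [ "$total_gib" -le "$limit_gib" ]
}
```

For example, `shared_requests_fit 200 300 400` prints 900 and succeeds, while `shared_requests_fit 600 500` prints 1100 and fails, flagging that the combined requests could outgrow the shared NFS server.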
CSI ephemeral volumes: Azure Operator Nexus does not support CSI ephemeral volumes at all. If your workload manifest uses csi inline volume definitions instead of PVCs, you'll get a scheduling failure. Convert these to PVC-backed volumes before deploying.
After correcting your manifest, apply it and immediately watch PVC status:
kubectl get pvc -w -n <NAMESPACE>
You should see the status transition from Pending to Bound within 30–60 seconds if the fix was correct.
ReadWriteMany (RWX) workloads on Azure Operator Nexus use the nexus-shared storage class, which is backed by an NFS server. These setups let multiple pods on different nodes read from and write to the same volume simultaneously, which is powerful, but also the source of some specific failure modes.
The most common issue I see: a deployment is configured for RWX but pods end up on the same node, which works but defeats the purpose, or, worse, pods fail to schedule because pod anti-affinity rules are misconfigured.
Here's the correct affinity configuration to ensure pods land on different nodes:
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchExpressions:
            - key: kubernetes.azure.com/agentpool
              operator: Exists
              values: []
        topologyKey: "kubernetes.io/hostname"
This tells the scheduler that no two pods from this deployment should land on the same node. If you have more replicas than available nodes, some pods will stay in Pending state. That's expected behavior, not a storage issue.
To verify that your RWX setup is actually sharing data correctly across pods, exec into two different pods and check that they see each other's writes:
kubectl exec <POD_1_NAME> -- cat /mnt/hostname.txt
kubectl exec <POD_2_NAME> -- cat /mnt/hostname.txt
Both commands should show identical content that includes writes from all pods. If one pod's output only shows its own writes, your PVC isn't actually in RWX mode. Check that accessModes is set to ReadWriteMany and the storage class is nexus-shared, not nexus-volume.
Also remember: each pod in an RWX deployment shares the same single PVC. This is fundamentally different from a StatefulSet using nexus-volume (RWO), where each pod gets its own dedicated PVC. Mixing these up in your manifest is an easy mistake that results in confusing behavior.
Two operational tasks that get skipped more than they should: enabling remote vendor management and setting up storage capacity monitoring. Both are critical for long-term health of your Azure Operator Nexus storage appliance, and both are available through straightforward Azure CLI commands.
Enable Remote Vendor Management:
az networkcloud storageappliance enable-remote-vendor-management \
--name <STORAGE_APPLIANCE_NAME> \
--resource-group <YOUR_RESOURCE_GROUP>
This enables Microsoft and the hardware vendor (Pure Storage) to perform remote diagnostics and maintenance. Without it, any hardware-level issue requires an on-site visit to diagnose. Enable this from day one; you can always disable it later:
az networkcloud storageappliance disable-remote-vendor-management \
--name <STORAGE_APPLIANCE_NAME> \
--resource-group <YOUR_RESOURCE_GROUP>
Monitor Capacity: All physical volumes on Azure Operator Nexus are thin-provisioned. This means the storage appliance doesn't actually reserve the storage you request; it allocates space as data is written. The risk is overcommitment: you can provision a set of PVCs whose total requested capacity exceeds physical capacity, and everything works fine until writes start hitting the actual limit.
You must monitor both the storage appliance's total consumption and all NFS servers (for nexus-shared volumes) using the platform metrics. Set up alerts on storage consumption percentage. I'd recommend alerting at 70% and paging at 85%. Hitting 100% causes out-of-disk-space errors on all workloads consuming affected volumes, and those errors can be disruptive and sudden.
To check current capacity from the CLI:
az networkcloud storageappliance show \
--name <STORAGE_APPLIANCE_NAME> \
--resource-group <YOUR_RESOURCE_GROUP> \
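--query "{capacity:capacity, capacityUsed:capacityUsed}" \
--output json
The capacity field reports the appliance's total capacity and capacityUsed reports how much of it has been consumed. Build this check into your regular operational runbook.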
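Once you have used and total capacity as two numbers in the same unit, the 70% and 85% thresholds I recommended are easy to turn into a runbook check. This is a minimal sketch: capacity_alert_level is a hypothetical name, the labels it prints are placeholders for whatever your alerting pipeline expects, and percentages are rounded down.

```shell
# capacity_alert_level USED TOTAL
# Hypothetical helper: prints "page" at 85% or more, "warn" at 70% or more,
# otherwise "ok". USED and TOTAL must be whole numbers in the same unit.
capacity_alert_level() {
  used="$1"
  total="$2"
  [ "$total" -gt 0 ] || return 1
  percent=$((used * 100 / total))   # integer division, rounds down
  if [ "$percent" -ge 85 ]; then
    echo "page"
  elif [ "$percent" -ge 70 ]; then
    echo "warn"
  else
    echo "ok"
  fi
}
```

For instance, `capacity_alert_level 720 1000` prints `warn`. A scheduled job can feed it the values from the CLI and forward anything other than `ok` to your on-call tooling, though platform metric alerts remain the primary mechanism.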
Advanced Troubleshooting
If the steps above haven't resolved your Azure Operator Nexus storage appliance issue, you're likely dealing with something at the cluster infrastructure or network fabric layer. Here's how to go deeper.
Investigating Cluster Hardware Validation Failures
Storage appliance issues that originate during initial deployment are almost always rooted in the Hardware Validation (HWV) phase. HWV runs against every machine in the cluster's rack configuration, checking hardware health and attempting to fix misconfigured BIOS boot settings automatically. If HWV fails on a machine that serves the storage path, you'll see the storage appliance stuck in Provisioning or flipped to Error.
Check for HWV-related events in your cluster's activity log:
az monitor activity-log list \
--resource-group <YOUR_RESOURCE_GROUP> \
--start-time 2026-04-01T00:00:00Z \
--query "[?contains(operationName.value, 'hardwareValidation')]" \
--output table
If you see failed HWV events, the remediation depends on the specific hardware component that failed. BIOS settings that HWV couldn't auto-correct will require manual intervention on the physical hardware. For Pure Storage appliances specifically, confirm that all controllers are R4; older controllers are not supported for multi-appliance configurations.
StatefulSet Storage Not Persisting Across Pod Restarts
If you're running StatefulSets with nexus-volume (RWO) PVCs and pod data is not surviving restarts, check that your PVC retention policy is correct. By default, PVCs created by a StatefulSet are not deleted when the StatefulSet is scaled down or deleted, but confirm this explicitly:
kubectl get statefulset <NAME> -o yaml | grep -A5 persistentVolumeClaimRetentionPolicy
If whenDeleted or whenScaled is set to Delete, your PVCs will be removed on scale-down. Change these to Retain to protect data.
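If you'd rather not edit the manifest by hand, a JSON merge patch can set both fields. The sketch below only builds the patch body; retain_policy_patch is a hypothetical helper name, and it assumes your Kubernetes version supports persistentVolumeClaimRetentionPolicy and allows it to be updated in place, so verify that on your cluster first.

```shell
# retain_policy_patch
# Hypothetical helper: prints a JSON merge patch that sets both StatefulSet
# PVC retention fields to Retain.
retain_policy_patch() {
  printf '%s\n' '{"spec":{"persistentVolumeClaimRetentionPolicy":{"whenDeleted":"Retain","whenScaled":"Retain"}}}'
}
```

You would apply it with `kubectl patch statefulset <NAME> -n <NAMESPACE> --type merge -p "$(retain_policy_patch)"`, then re-run the grep above to confirm both values read Retain.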
Network Fabric Connectivity Affecting Storage
Azure Operator Nexus storage appliances connect to compute servers through the network fabric. If the fabric has connectivity issues (misconfigured VLANs, BGP peering failures, or physical link problems), storage I/O will degrade or fail even when the storage appliance resource itself shows Available. Check network fabric status alongside storage appliance status when diagnosing persistent I/O errors. Network fabric resources are managed through the az networkfabric command group (the managednetworkfabric CLI extension) rather than az networkcloud:
az networkfabric fabric list \
--resource-group <YOUR_RESOURCE_GROUP> \
--output table
Updating Tags and Properties
You can update certain properties and tags on a storage appliance resource even if you can't create or delete it. This is useful for adding organizational metadata:
az networkcloud storageappliance update \
--name <STORAGE_APPLIANCE_NAME> \
--resource-group <YOUR_RESOURCE_GROUP> \
--tags environment=production owner=ops-team
Escalate to Microsoft Support if: your storage appliance has been in Error or Provisioning state for more than 30 minutes with no change; your cluster shows a failed provisioning state that you can't attribute to a known configuration issue; you're seeing hardware-level errors from the Pure storage controllers; or your second storage appliance won't provision despite meeting all the BOM 2.0.x and R4 controller prerequisites. When you open a ticket, include the output of az networkcloud storageappliance show, your cluster's activity log for the past 48 hours, and the output of kubectl get events -A --sort-by='.lastTimestamp'. This dramatically speeds up triage.
Prevention & Best Practices
The teams I've seen run Azure Operator Nexus storage most successfully aren't just reacting to problems. They've built a set of operational habits that catch issues before they become incidents. Here's what actually works in practice.
Size PVCs conservatively and monitor relentlessly. Thin provisioning gives you flexibility, but it's a false sense of security if you're not actively watching consumption. Set Azure Monitor alerts on storage appliance capacity metrics. The moment you hit 70% consumption, start planning either a cleanup or a capacity expansion. Don't wait for 90%.
Validate BOM version and hardware prerequisites before adding a second storage appliance. I've seen this go wrong more than once: an operator orders a second Pure Storage appliance, it arrives, and then it can't be deployed because the instance is running pre-2.0.x BOM hardware or the controllers aren't R4. That's weeks of wasted effort. Validate the prerequisites as soon as you're considering expansion, not after the hardware arrives.
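A prerequisite check like this is simple enough to script so it doesn't get skipped. The sketch below is illustrative only: the function names are hypothetical, and the BOM version and controller models are values you supply from your own hardware records, not something the script discovers. It also does not cover SKU support, which you still need to confirm separately.

```shell
# bom_supports_second_appliance VERSION
# Succeeds when the major version is 2 or higher (that is, 2.0.x or later).
bom_supports_second_appliance() {
  bom_major="${1%%.*}"
  case "$bom_major" in
    ""|*[!0-9]*) return 2 ;;   # not a recognizable version string
  esac
  [ "$bom_major" -ge 2 ]
}

# all_controllers_r4 MODEL [MODEL...]
# Succeeds only when at least one controller is listed and every one is R4.
all_controllers_r4() {
  [ "$#" -gt 0 ] || return 2
  for controller_model in "$@"; do
    [ "$controller_model" = "R4" ] || return 1
  done
  return 0
}

# second_appliance_eligible BOM_VERSION MODEL [MODEL...]
# Hypothetical wrapper combining both hardware checks.
second_appliance_eligible() {
  bom_version="$1"
  shift
  bom_supports_second_appliance "$bom_version" && all_controllers_r4 "$@"
}
```

For example, `second_appliance_eligible 2.0.1 R4 R4` succeeds, while `second_appliance_eligible 1.7.3 R4 R4` and `second_appliance_eligible 2.0.0 R4 R3` both fail. Run it as part of the expansion planning checklist, before any purchase order goes out.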
Enable remote vendor management from day one. This is a 30-second CLI command that can save you days of waiting for on-site diagnostics. There's no meaningful downside to having it enabled during normal operations, and it gives Microsoft and Pure Storage the visibility they need to help you fast.
Use the right storage class for the right access pattern. Use nexus-volume for workloads that need dedicated, high-performance block storage with single-pod write access, and nexus-shared for workloads that need concurrent multi-pod access. Don't use nexus-shared as a catch-all: its 1 TiB NFS server capacity is shared across all nexus-shared PVCs in your cluster.
Document your rack slot assignments. Especially if you're running two storage appliances, keep clear internal documentation mapping the Azure resource names to the physical rack slots. This becomes critical during a hardware incident when you need to tell a datacenter technician exactly which unit to look at.
- Run az networkcloud storageappliance show as part of your daily ops health check, not just when things break
- Set consumption alerts at 70% for both the storage appliance and all NFS servers serving nexus-shared volumes
- Enable remote vendor management immediately after cluster deployment and document that it's enabled in your runbook
- Keep a reference sheet of PVC size limits: nexus-volume min 1 MiB / max 12 TiB; nexus-shared effective max 1 TiB
Frequently Asked Questions
Which storage appliance is which in Azure Operator Nexus, and how do I tell them apart?
The aggregator rack has two dedicated rack slots for storage appliances. Rack slot 1 always holds the first storage appliance, which is always present. If a second storage appliance is deployed, it occupies rack slot 2. There's no ambiguity built into the hardware design. In the Azure portal or CLI, you can match the Azure resource to the physical unit by checking the rackId and rackSlot properties in the storage appliance resource output. Cross-reference these against your physical rack documentation to confirm which unit you're looking at.
Why can't I create or delete a storage appliance resource in Azure Operator Nexus?
This is by design, not a bug or permissions issue. Azure Operator Nexus storage appliances are created and deleted exclusively through the cluster lifecycle, not by operators directly. When the cluster is deployed, the storage appliance resources are created automatically. When the cluster is decommissioned, they're removed. If you try to manually create or delete one, the platform blocks the operation. What you can do is view properties, update tags, and enable or disable remote vendor management on existing appliances.
My PVC is stuck in Pending status. What's wrong?
Run kubectl describe pvc <PVC_NAME> -n <NAMESPACE> and look at the Events section at the bottom. The most common causes are: requesting a size outside the supported range (for nexus-volume: less than 1 MiB or more than 12 TiB), specifying the wrong storage class name, trying to use CSI ephemeral volumes (not supported on Nexus), or the underlying storage appliance being in an Error state. Fix the root cause in your manifest and reapply. The PVC should bind within 30–60 seconds if the fix is correct.
Can I use a nexus-shared PVC for more than 1 TiB of storage?
Technically, the API will let you provision a nexus-shared PVC requesting more than 1 TiB; it won't reject the request outright. But the NFS server backing nexus-shared has a fixed capacity of 1 TiB, and that limit applies across all nexus-shared PVCs in your cluster combined. So if you request 2 TiB in a PVC, it provisions fine, but you'll hit an out-of-disk-space error once total consumption across all nexus-shared volumes reaches 1 TiB. Always treat 1 TiB as the hard ceiling for nexus-shared, regardless of what you request.
What are the hardware requirements to add a second storage appliance to an Operator Nexus instance?
Two hardware conditions must both be true. First, the instance hardware must match BOM version 2.0.x or later; older BOMs don't support a second appliance. Second, all Pure Storage appliances in the instance must have R4 controllers; earlier controller generations aren't supported in a two-appliance configuration. Additionally, check the supported SKUs documentation for your specific Nexus SKU, as not all SKUs support a second appliance even on compliant hardware. Validate both hardware conditions and SKU support before ordering hardware or attempting deployment.
What's the difference between nexus-volume and nexus-shared, and when should I use each?
Use nexus-volume when each pod needs its own dedicated block storage volume with exclusive write access. This is ReadWriteOnce (RWO) mode. It's the right choice for databases, stateful applications, and anything where data isolation per pod matters. Use nexus-shared when multiple pods across different nodes need to read from and write to the same storage simultaneously. This is ReadWriteMany (RWX) mode, backed by NFS. It's suited for shared content stores, log aggregation volumes, or any workload where multiple replicas need a common data pool. The key tradeoff: nexus-volume scales up to 12 TiB per volume with higher I/O performance; nexus-shared tops out at 1 TiB total NFS capacity shared across all nexus-shared volumes in your cluster.