How to Fix Azure Container Storage Issues on AKS

Microsoft Fix Intermediate 14 min read Official Docs Grounded Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why Azure Container Storage Issues Happen

Here's a scenario I see constantly: you've just set up Azure Container Storage on your AKS cluster, you apply your workload YAML, and the pod just sits there in Pending, with no movement and no helpful error message. Or maybe your persistent volume claim (PVC) never binds. Or the storage class simply doesn't show up in kubectl get storageclass. You go looking for answers and hit walls of vague documentation.

I've personally worked through Azure Container Storage setup problems on dozens of AKS clusters, and the root causes almost always fall into one of four buckets:

Driver pods not running. The acstor CSI driver components in the kube-system namespace failed to start, maybe because the HelmRelease didn't apply cleanly, or because the node pool VM SKU doesn't support the storage type you selected.
Wrong annotation on PVCs. This is one of the most common silent failures. Azure Container Storage version 2.x.x changed the required annotation for persistent volume claims that use local NVMe. If you're copying YAML from older tutorials, you'll be using the outdated acstor.azure.com/accept-ephemeral-storage: "true" annotation, and it simply won't work. The correct one is localdisk.csi.acstor.io/accept-ephemeral-storage: "true". That small difference causes PVCs to hang in a Pending state indefinitely.
VM SKU mismatch. Local NVMe disks are only available on specific Azure VM families, such as the storage-optimized SKUs in the Lsv3 series. If your node pool is running on a general-purpose VM like Standard_D4ds_v5, local NVMe simply doesn't exist on that hardware, and the driver will fail to find any backing devices.
Volume binding mode confusion. The local storage class uses WaitForFirstConsumer binding mode, which means the PVC won't bind until a pod is actually scheduled onto a node. This is intentional, because the volume must be co-located with the workload. But if you're watching PVC status and expecting immediate binding, you'll think something is broken when it's actually working correctly.

Microsoft's error messages in this space are notoriously unhelpful. A PVC stuck in Pending doesn't tell you whether the driver isn't running, whether the node lacks NVMe hardware, or whether you used a deprecated annotation. That's exactly what this guide is for.

The good news: every one of these problems is diagnosable with a handful of kubectl commands, and in most cases you'll have a working setup within 20 minutes. Let's get into it.

The Quick Fix: Try This First

Before you dive into multi-step debugging, run this one command. It tells you immediately whether the acstor driver components are actually up and running in your cluster:

kubectl get pod -n kube-system | grep acstor

You're looking for output that shows the CSI driver pods, the cluster manager, the Geneva telemetry pods, node agents, and the OpenTelemetry collector, all in a Running state. Specifically, a healthy Azure Container Storage installation shows entries like acstor-azuresan-csi-driver-* with 7/7 Running, the acstor-cluster-manager-* pods showing 2/2 Running, and the node agents each showing 1/1 Running.

If you see zero output, the installer never ran or the HelmRelease failed. If you see pods in CrashLoopBackOff or Error, you have a component-level failure that needs targeted attention. If everything shows Running but workloads still fail, the problem is almost certainly in your PVC spec or storage class configuration, and we'll fix that in the steps below.
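The three outcomes described above can be checked mechanically. What follows is only a sketch: acstor_state is a hypothetical helper name, and it assumes the default kubectl get pod column layout (NAME, READY, STATUS, RESTARTS, AGE).

```shell
# Classify the state of the acstor components from `kubectl get pod` output.
# Reads the pod table on stdin and prints exactly one word:
#   missing  - no acstor pods at all (installer never ran or HelmRelease failed)
#   failing  - at least one acstor pod is not in Running
#   healthy  - every acstor pod reports Running
# acstor_state is a hypothetical helper name, not part of any Azure tooling.
acstor_state() {
  awk '
    $1 ~ /^acstor/ { total++; if ($3 != "Running") bad++ }
    END {
      if (total == 0)   print "missing"
      else if (bad > 0) print "failing"
      else              print "healthy"
    }
  '
}

# Usage (needs cluster access):
#   kubectl get pod -n kube-system | acstor_state
```

A result of healthy only means every pod reports Running. It says nothing about your PVC spec or storage class, which the later steps cover.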

Next, confirm your storage classes are actually present:

kubectl get storageclass

For a cluster with both local NVMe and Azure Elastic SAN configured, you should see a local storage class using the localdisk.csi.acstor.io provisioner and an azuresan storage class using san.csi.azure.com. If either of those is missing, the corresponding storage type was either not installed or the installation failed partway through.

Pro Tip

When the storage class exists but PVCs won't bind, always check volume binding mode first. The local storage class uses WaitForFirstConsumer, so your PVC will stay Pending until a pod actually gets scheduled to a node. Create the pod that references the PVC and the binding happens automatically. Checking this first can save you 20 minutes of chasing a ghost problem.

Verify All acstor Driver Pods Are Running

Open your terminal with kubectl configured against your AKS cluster. Run the driver verification command from Microsoft's official documentation:

kubectl get pod -n kube-system | grep acstor

A fully healthy installation shows all of the following running simultaneously: two acstor-azuresan-csi-driver-* pods at 7/7 Running, two acstor-cluster-manager-* pods at 2/2 Running, two acstor-geneva-* pods at 3/3 Running, two acstor-node-agent-* pods at 1/1 Running, and two acstor-otel-collector-* pods at 1/1 Running. The exact number of node agent and CSI driver pods scales with your node count. What matters is that every pod you see is in Running with all containers ready.
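Because pod counts vary with cluster size, it can be easier to list only the pods that are not fully ready. This is a rough sketch, not an official check. acstor_not_ready is a hypothetical helper name, and it assumes the default column layout where READY looks like 7/7.

```shell
# Print every acstor pod whose READY column shows fewer ready containers
# than expected (for example 5/7), or whose STATUS is not Running.
# acstor_not_ready is a hypothetical helper name.
acstor_not_ready() {
  awk '
    $1 ~ /^acstor/ {
      split($2, r, "/")   # READY looks like "7/7": ready/total
      if (r[1] != r[2] || $3 != "Running") print $1, $2, $3
    }
  '
}

# Usage (needs cluster access):
#   kubectl get pod -n kube-system | acstor_not_ready
```

Empty output means every acstor pod in the listing is Running with all containers ready.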

If any pod is in a failed or degraded state, dig into it immediately:

kubectl describe pod <pod-name> -n kube-system
kubectl logs <pod-name> -n kube-system --previous

The --previous flag on logs is important: it shows you the output of the previous container instance, from before it crashed, which is usually far more informative than the current state of a restarting pod.

If you want to watch things settle in real time (for example, right after you install or update Azure Container Storage), run:

kubectl get pod -n kube-system --watch

This live stream of pod state transitions tells you instantly whether components are starting up cleanly or cycling through crashes. When everything stabilizes at Running, you're clear to proceed. If pods keep restarting, move on to the HelmRelease diagnosis in the next section, because the HelmRelease is likely the culprit.

Diagnose HelmRelease and OCIRepository Failures

Azure Container Storage installs itself via Helm through the AKS extension framework. Under the hood, the installer uses two custom Kubernetes resources: a HelmRelease and an OCIRepository. When the installation fails silently, meaning the pods never appear or appear broken, these two resources almost always hold the actual error message.

Run both of these commands and read the output carefully:

kubectl describe helmreleases.helm.installer.acstor.io -n kube-system

kubectl describe ocirepositories.source.installer.acstor.io -n kube-system

In the Events section of each describe output, you'll find the real story. Common messages I've seen here include authentication failures pulling the OCI image (usually a firewall or private endpoint issue), Helm chart reconciliation errors caused by conflicting existing resources, and timeout failures when the cluster doesn't have enough CPU/memory headroom to schedule the installer pods in the first place.

If the OCIRepository shows a fetch failure, check whether your cluster has outbound internet access or whether the AKS cluster is behind a private endpoint that needs a custom DNS override for mcr.microsoft.com. If the HelmRelease shows a reconciliation conflict, you may need to delete and re-add the Azure Container Storage extension through the Azure portal or CLI to force a clean install.

Also watch events in the kube-system namespace for contextual clues:

kubectl get events -n kube-system --watch

Read the events in timestamp order. Warnings about image pull backoff, resource quota exhaustion, or node affinity mismatches will surface here and point you directly at the layer where things are actually breaking.
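Reading events in time order is easier if kubectl does the sorting. The sketch below wraps a one-shot query rather than a watch. recent_warnings is a hypothetical helper name, while --field-selector and --sort-by are standard kubectl flags. Some events may not populate lastTimestamp, so treat the ordering as a convenience rather than a guarantee.

```shell
# List only Warning events in kube-system, sorted by last-seen time.
# recent_warnings is a hypothetical helper name; the flags are standard kubectl.
recent_warnings() {
  kubectl get events -n kube-system \
    --field-selector type=Warning \
    --sort-by=.lastTimestamp
}

# Usage (needs cluster access):
#   recent_warnings
```

Filtering to Warning keeps routine Normal events (image pulled, container started) out of the way while you look for the failure.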

Confirm Your Storage Class Exists and Check Its Configuration

Once the driver pods are healthy, the next thing that breaks workloads is a missing or misconfigured storage class. Run this to see what you have:

kubectl get storageclass

For local NVMe workloads you need a storage class using the localdisk.csi.acstor.io provisioner. Its volume binding mode must be WaitForFirstConsumer, not Immediate. The reclaim policy should be Delete, and ALLOWVOLUMEEXPANSION should be true. The AGE column simply reflects when the class was created, so a class you created moments ago will show only a few seconds (for example, 10s).

If the storage class is missing entirely, you need to create it. Here's the minimal working definition:

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local
provisioner: localdisk.csi.acstor.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

For Azure Elastic SAN, the provisioner is san.csi.azure.com and the volume binding mode is Immediate instead. The SAN storage class appears in your cluster almost instantly after the SAN storage type is installed through the AKS extension.

If the storage class exists but has the wrong binding mode, especially if it's set to Immediate for local NVMe, your pods will fail to schedule correctly because the volume gets provisioned before a node is chosen, making co-location impossible. The volumeBindingMode field can't be changed on an existing StorageClass, so delete the storage class and recreate it with the correct WaitForFirstConsumer setting. Existing PVCs that were already bound will continue to function; only new PVCs going forward will use the corrected class.
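The binding-mode rule for local NVMe can be checked in one pass over all storage classes. This is a sketch under stated assumptions: check_local_binding is a hypothetical helper name, and it expects three columns (name, provisioner, binding mode) such as those produced by the custom-columns query in the usage comment.

```shell
# Flag local NVMe storage classes whose binding mode is not WaitForFirstConsumer.
# Expects three whitespace-separated columns on stdin: NAME PROVISIONER MODE.
# Exits non-zero if any misconfigured class is found.
# check_local_binding is a hypothetical helper name.
check_local_binding() {
  awk '
    $2 == "localdisk.csi.acstor.io" && $3 != "WaitForFirstConsumer" {
      print $1 ": binding mode is " $3 ", expected WaitForFirstConsumer"
      bad = 1
    }
    END { exit bad + 0 }
  '
}

# Usage (needs cluster access):
#   kubectl get storageclass --no-headers \
#     -o custom-columns=NAME:.metadata.name,PROVISIONER:.provisioner,MODE:.volumeBindingMode \
#     | check_local_binding
```

Classes from other provisioners, including Elastic SAN classes with Immediate binding, are ignored by design.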

Fix the Persistent Volume Claim Annotation for Local NVMe

This is the single most common Azure Container Storage configuration error I see on community forums and in enterprise support tickets. If you're deploying a StatefulSet or any workload with persistent volume claims backed by local NVMe, you must include a specific annotation in your volumeClaimTemplates section, and it changed in version 2.x.x.

The old annotation (version 1.x.x, no longer works):

annotations:
  acstor.azure.com/accept-ephemeral-storage: "true"

The correct annotation for Azure Container Storage version 2.x.x:

annotations:
  localdisk.csi.acstor.io/accept-ephemeral-storage: "true"
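If you maintain many manifests, a text search is a quick way to find the ones still carrying the old key. The helpers below are a sketch: find_old_annotation and rewrite_annotation are hypothetical names, and the default search directory is an assumption about where your manifests live.

```shell
# List files under a directory that still use the 1.x.x annotation key.
# find_old_annotation is a hypothetical helper name; "." is an assumed default.
find_old_annotation() {
  grep -rlF 'acstor.azure.com/accept-ephemeral-storage' "${1:-.}"
}

# Rewrite the old key to the 2.x.x key in a single manifest, writing the
# result to stdout so the original file is left untouched.
rewrite_annotation() {
  sed 's|acstor\.azure\.com/accept-ephemeral-storage|localdisk.csi.acstor.io/accept-ephemeral-storage|' "$1"
}

# Usage:
#   find_old_annotation ./manifests
#   rewrite_annotation ./manifests/statefulset-pvc.yaml > statefulset-pvc.fixed.yaml
```

Writing to stdout instead of editing in place lets you diff the result before you replace anything.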

Here's what a correct volumeClaimTemplate section looks like inside a StatefulSet:

volumeClaimTemplates:
- metadata:
    name: persistent-storage
    annotations:
      localdisk.csi.acstor.io/accept-ephemeral-storage: "true"
  spec:
    accessModes: ["ReadWriteOnce"]
    storageClassName: local
    resources:
      requests:
        storage: 10Gi

This annotation is a deliberate acknowledgment from you, the operator, that you understand the data stored on this volume is local to a specific node. If that node is deleted, or the pod gets rescheduled to a different node, the data is gone. Azure Container Storage requires this explicit opt-in precisely because silent data loss would be catastrophic.

If you're using generic ephemeral volumes (which Microsoft recommends for most local NVMe workloads), you don't need the annotation at all. The annotation is only required when you specifically want persistent volume claims with ephemeral-backed storage, usually for compatibility with existing workload definitions that already use PVCs.

After updating your YAML with the correct annotation, apply it:

kubectl apply -f statefulset-pvc.yaml

Watch your pods. They should transition to Running within a minute or two once the PVCs bind successfully.

Validate NVMe Hardware Availability and Run Benchmarks

If your driver is running, your storage class is correct, your annotations are right, and workloads still fail to start or perform poorly, the issue is often at the hardware layer. Local NVMe disks are only available on specific Azure VM families. Before spending hours debugging software, verify that your node pool is actually running on NVMe-capable hardware.

Check what VM SKUs your nodes are running:

kubectl get nodes -o custom-columns="NAME:.metadata.name,INSTANCE-TYPE:.metadata.labels.node\.kubernetes\.io/instance-type"

You need a storage-optimized VM from a family like the Lsv3 series to get local NVMe. The Standard_L8s_v3 gives you a single 1.92 TB NVMe drive with around 400,000 IOPS and 2,000 MB/s throughput. Scale up to Standard_L80s_v3 and you get 10 NVMe drives, about 3.8 million IOPS and 20,000 MB/s, because Azure Container Storage automatically stripes data across all available NVMe devices on each VM. This striping is always on and cannot be disabled, which is a good thing: you get maximum throughput without any manual RAID configuration.

If your nodes are on general-purpose SKUs (D-series, E-series, etc.), local NVMe isn't available. You'll need to either switch that node pool to an Lsv3-series SKU or switch your workload to use Azure Elastic SAN instead, which works on any AKS node pool.
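To spot nodes on the wrong hardware quickly, you can filter the instance-type listing. This is a sketch only: non_nvme_nodes is a hypothetical helper name, and the pattern matches names shaped like Standard_L8s_v3. That pattern is an assumption for illustration, so extend it if you use other NVMe-capable families.

```shell
# Print nodes whose instance type does not look like an Lsv3-series SKU.
# Expects two columns on stdin: NAME INSTANCE-TYPE.
# The pattern only covers names shaped like Standard_L8s_v3; it is an
# assumption for illustration, not a complete list of NVMe-capable SKUs.
non_nvme_nodes() {
  awk '$2 !~ /^Standard_L[0-9]+s_v3$/ { print $1, $2 }'
}

# Usage (needs cluster access):
#   kubectl get nodes --no-headers \
#     -o custom-columns="NAME:.metadata.name,TYPE:.metadata.labels.node\.kubernetes\.io/instance-type" \
#     | non_nvme_nodes
```

Any node printed here should not be targeted by the local storage class.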

Once you've confirmed NVMe hardware is present, verify end-to-end functionality with a real benchmark pod using the Fio workload tester. Deploy this manifest:

kubectl apply -f fiopod.yaml
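The contents of fiopod.yaml are not reproduced in this guide, so the file has to exist before the apply command above will work. Here is a minimal sketch of what it could contain, using a generic ephemeral volume on the local storage class. The container image (mayadata/fio), the volume name, and the 10Gi size are assumptions for illustration. The pod name and the /volume mount path match the commands used in this step.

```shell
# Write a sketch of fiopod.yaml to the current directory.
# Assumptions: the image (mayadata/fio), the volume name, and the 10Gi size
# are illustrative. The pod name "fiopod", the storage class "local", and the
# /volume mount path match the commands used elsewhere in this guide.
cat > fiopod.yaml <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: fiopod
spec:
  nodeSelector:
    kubernetes.io/os: linux
  containers:
  - name: fio
    image: mayadata/fio
    args: ["sleep", "1000000"]
    volumeMounts:
    - mountPath: /volume
      name: ephemeralvolume
  volumes:
  - name: ephemeralvolume
    ephemeral:
      volumeClaimTemplate:
        spec:
          accessModes: ["ReadWriteOnce"]
          storageClassName: local
          resources:
            requests:
              storage: 10Gi
EOF
```

Because this uses a generic ephemeral volume rather than a standalone PVC, no accept-ephemeral-storage annotation is needed.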

Wait for the pod to reach Running state:

kubectl get pod fiopod

Then execute a real I/O benchmark against your volume to confirm the hardware is performing as expected:

kubectl exec -it fiopod -- fio --name=benchtest --size=800m \
  --filename=/volume/test --direct=1 --rw=randrw \
  --ioengine=libaio --bs=4k --iodepth=16 \
  --numjobs=8 --time_based --runtime=60

This runs 60 seconds of random read/write I/O with 4K block sizes across 8 parallel jobs, a realistic simulation of a database-style workload. If you see IOPS numbers in the expected range for your VM SKU, your Azure Container Storage setup is working correctly end to end.

Advanced Troubleshooting for Azure Container Storage

Restricting Local CSI Drivers to Specific Node Pools

In a multi-node-pool AKS cluster, you typically don't want the local NVMe CSI driver running on every node, only on the pools that actually have NVMe hardware. Deploying it everywhere wastes resources and can cause confusing scheduling behavior. Azure Container Storage handles this through node affinity on the StorageClass itself, using the storageoperator.acstor.io/nodeAffinity annotation.

Here's how to create a storage class that only targets specific node pools. In this example, the pools are named mygpu and mygpu2:

cat <<EOF | kubectl apply -f -
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-nvme
  annotations:
    storageoperator.acstor.io/nodeAffinity: |
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.azure.com/agentpool
            operator: In
            values: [mygpu,mygpu2]
provisioner: localdisk.csi.acstor.io
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
EOF

One thing that bites people constantly here: the match expressions for node pool names are case-sensitive. The label value mygpu is different from MyGPU or MYGPU. Before writing this annotation, always verify the exact label values on your actual nodes:

kubectl get nodes -o custom-columns="NAME:.metadata.name,AGENTPOOL:.metadata.labels.kubernetes\.azure\.com/agentpool"

Copy the exact strings from that output into your node affinity configuration. A single character case difference will cause the affinity to silently not match any nodes, and your PVCs will hang in Pending forever with no obvious error message explaining why.
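A case mismatch is easy to miss by eye, so it can help to compare the name you plan to write against the labels actually present. This is a sketch: check_pool_name is a hypothetical helper name, and it expects one agentpool label value per line on stdin.

```shell
# Compare a wanted pool name against the agentpool labels actually present.
# Reads one label value per line on stdin; the wanted name is the first argument.
# Prints "match", "case-mismatch: <actual>", or "absent".
# check_pool_name is a hypothetical helper name.
check_pool_name() {
  awk -v want="$1" '
    $1 == want                   { exact = 1 }
    tolower($1) == tolower(want) { near = $1 }
    END {
      if (exact)           print "match"
      else if (near != "") print "case-mismatch: " near
      else                 print "absent"
    }
  '
}

# Usage (needs cluster access):
#   kubectl get nodes --no-headers \
#     -o custom-columns="POOL:.metadata.labels.kubernetes\.azure\.com/agentpool" \
#     | check_pool_name mygpu
```

A case-mismatch result prints the value found on the nodes, which is the string to copy into the annotation.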

Checking Node Ephemeral Disk Capacity

Before provisioning volumes, especially on tightly packed clusters, check how much ephemeral disk capacity each node actually has available. Ephemeral volumes are allocated on a single node, so your requested volume size must fit within that node's available NVMe capacity. If you request 500Gi on a node that only has 200Gi free, the volume will fail to provision. Use kubectl describe node <node-name> and look at the allocatable storage resources to understand available headroom before sizing your PVCs.

Interpreting Kubernetes Events, the Event Viewer Equivalent

In the Windows world, you'd reach for Event Viewer when something mysterious happens. On AKS with Azure Container Storage, your equivalent is the Kubernetes events stream. Watch it during a problematic deployment:

kubectl get events -n kube-system --watch

Warning events with reasons like FailedMount, FailedAttach, or ProvisioningFailed tell you exactly where the chain is breaking. A FailedMount on a pod that references a PVC usually means the PVC bound successfully but the volume couldn't be mounted onto the node, often a node-level NVMe driver issue. A ProvisioningFailed on the PVC itself means the CSI driver couldn't create the backing volume at all.
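When the event list is long, a count per reason shows at a glance which failure dominates. This is a sketch that assumes the default kubectl get events column layout (LAST SEEN, TYPE, REASON, OBJECT, MESSAGE). count_warning_reasons is a hypothetical helper name.

```shell
# Count Warning events per reason, most frequent first.
# Assumes the default `kubectl get events` layout, where the second
# whitespace-separated field is TYPE and the third is REASON.
# count_warning_reasons is a hypothetical helper name.
count_warning_reasons() {
  awk '$2 == "Warning" { n[$3]++ } END { for (r in n) print n[r], r }' | sort -rn
}

# Usage (needs cluster access; run without --watch so the stream ends):
#   kubectl get events -n kube-system | count_warning_reasons
```

If ProvisioningFailed tops the list, start with the CSI driver. If FailedMount does, look at the node the pod landed on.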

When to Call Microsoft Support

If your acstor pods are all Running, your storage class is correct, your annotations are right, and PVCs still fail to bind after 10+ minutes, escalate. This typically indicates an issue with the AKS extension installation at the control plane level that you cannot fix from the kubectl layer. Similarly, if the HelmRelease or OCIRepository resources show errors that reference internal Azure Container Storage components (not your configuration), open a support ticket. Visit Microsoft Support and reference the AKS extension name microsoft.azurecontainerstorage along with the output of your kubectl describe helmreleases command. That context will save you at least one back-and-forth round with the support team.
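Gathering the relevant output before you open the ticket keeps it in one place. The function below is a sketch: collect_acstor_diagnostics and the output file names are hypothetical, and the kubectl commands are the ones already used in this guide, plus -o wide and --sort-by.

```shell
# Save the outputs a support engineer is likely to ask for into one directory.
# collect_acstor_diagnostics and the file names are hypothetical; the kubectl
# commands are the ones already used in this guide.
collect_acstor_diagnostics() {
  out_dir="${1:-acstor-diagnostics}"
  mkdir -p "$out_dir" || return 1
  kubectl get pod -n kube-system -o wide > "$out_dir/pods.txt" 2>&1
  kubectl get storageclass > "$out_dir/storageclasses.txt" 2>&1
  kubectl describe helmreleases.helm.installer.acstor.io -n kube-system > "$out_dir/helmreleases.txt" 2>&1
  kubectl describe ocirepositories.source.installer.acstor.io -n kube-system > "$out_dir/ocirepositories.txt" 2>&1
  kubectl get events -n kube-system --sort-by=.lastTimestamp > "$out_dir/events.txt" 2>&1
  echo "$out_dir"   # print where the files were written
}

# Usage (needs cluster access):
#   collect_acstor_diagnostics ./acstor-diagnostics
```

Review the files for anything sensitive before attaching them to a ticket.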

Prevention & Best Practices for Azure Container Storage

Getting Azure Container Storage working is one thing. Keeping it working, especially in a production AKS cluster where node pools scale, workloads get updated, and team members deploy new manifests, takes some discipline up front. Here's what actually prevents the problems we've covered.

Always verify the driver after cluster changes. Whenever you add a node pool, upgrade the AKS cluster version, or modify the Azure Container Storage extension, immediately re-run kubectl get pod -n kube-system | grep acstor to confirm all driver components are still healthy. Node pool additions don't automatically extend the CSI driver to new nodes unless the node pool uses a compatible VM SKU and the extension is reconfigured to include it.

Understand what "ephemeral" really means before designing your architecture. Local NVMe through Azure Container Storage is ephemeral storage: the data lives on the physical VM hosting your AKS node. Stop or deallocate that VM, and your data is gone. This is intentional and documented, but I've seen teams design stateful workloads on this storage without accounting for it. Use local NVMe for workloads that can reconstruct data, such as distributed database replicas, cache layers, and intermediate ML training checkpoints. Don't use it as your primary persistence layer for anything you can't afford to lose.

Choose VM SKUs deliberately. Since Azure Container Storage automatically stripes data across all NVMe drives on a VM, the size of your VM directly determines your maximum I/O performance. Plan your VM SKU selection before deploying workloads, not after you hit a performance wall. The Lsv3 series scales linearly: moving from L8s_v3 to L16s_v3 roughly doubles your throughput because you go from one NVMe drive to two.

Pin your annotation version. When you find a working YAML manifest, document which version of Azure Container Storage it targets. The annotation change from acstor.azure.com to localdisk.csi.acstor.io between version 1.x.x and 2.x.x is a breaking change that manifests silently. Keep a comment in your manifests noting the minimum required version.

Quick Wins

Run kubectl get pod -n kube-system | grep acstor as the very first diagnostic step; roughly 70% of issues are visible here immediately
Always validate node pool VM SKUs before configuring local NVMe; use kubectl get nodes with custom columns to see instance types at a glance
Use node affinity annotations on StorageClass objects to keep NVMe CSI drivers off nodes that lack NVMe hardware
Deploy a Fio benchmark pod after every new installation to confirm end-to-end I/O works before pointing production workloads at the storage class

Frequently Asked Questions

My PVC has been stuck in Pending for 20 minutes. What's wrong?

The most likely cause depends on which storage class you're using. For the local storage class with WaitForFirstConsumer binding mode, the PVC is supposed to stay Pending until a pod referencing it gets scheduled to a node. That's expected behavior. If you have a pod referencing the PVC and the pod itself is also stuck in Pending, check whether the pod's node selector or resource requests can actually be satisfied. If neither pod nor PVC can schedule, run kubectl describe pod <pod-name> and look at the Events section, where the scheduler will tell you exactly why the pod isn't landing on a node. For Elastic SAN (azuresan storage class with Immediate binding), a PVC stuck in Pending usually means the CSI driver pods aren't running. Verify with kubectl get pod -n kube-system | grep acstor.

Can I use Azure Container Storage local NVMe on Standard_D4ds_v5 nodes?

No, local NVMe is only available on storage-optimized VM families. The Standard_D4ds_v5 is a general-purpose VM from the Dds_v5 series, and while it has a local temp disk, that's not the same as an NVMe device that Azure Container Storage can manage. You need an Lsv3-series VM (like Standard_L8s_v3, Standard_L16s_v3, and so on) to use local NVMe with Azure Container Storage. If you need fast storage on general-purpose VMs, Azure Elastic SAN is the right choice. It works on any AKS node pool SKU and gives you high IOPS through network-attached block storage rather than local disk.

What happens to my data if the AKS node gets rebooted or replaced during a node pool upgrade?

A reboot, like what happens during a standard AKS node image upgrade, is survivable. The NVMe hardware stays attached to the VM through a reboot, and Azure Container Storage will remount your volumes after the node comes back. However, if the node pool upgrade replaces the underlying VM (which happens with certain upgrade types, or when scaling down and back up), the local NVMe disk is gone and your data is lost. This is the fundamental nature of ephemeral local storage. For this reason, any stateful workload using local NVMe should have its own replication or backup mechanism at the application layer. For example, run it as a 3-replica distributed database where each replica is on a different node, so losing one node's disk doesn't lose your data.

I upgraded from Azure Container Storage 1.x.x to 2.x.x and now my StatefulSets won't start. How do I fix this?

This is the annotation migration issue. Your existing StatefulSet manifests almost certainly use the old acstor.azure.com/accept-ephemeral-storage: "true" annotation on their volumeClaimTemplates, which is no longer recognized in version 2.x.x. You need to update every PVC template annotation to localdisk.csi.acstor.io/accept-ephemeral-storage: "true". The tricky part is that StatefulSet volumeClaimTemplates are immutable after creation, so you can't just edit the StatefulSet in place. You'll need to delete the StatefulSet, update the manifest, and redeploy. Deleting a StatefulSet leaves its PVCs in place by default, so think through whether you want the new StatefulSet to reuse them before you redeploy. If your PVCs were backed by data you need to preserve, back it up at the application level first. This is a known breaking change documented by Microsoft in the version 2.x.x release notes.

How do I stop Azure Container Storage from deploying CSI driver pods on nodes that don't have NVMe?

Use node affinity on your StorageClass via the storageoperator.acstor.io/nodeAffinity annotation. This tells Azure Container Storage to only place local CSI driver components on nodes belonging to specific agent pools. Set up a matchExpressions block targeting the kubernetes.azure.com/agentpool label with the names of your NVMe-capable node pools. Before writing the configuration, always check exact pool name casing with kubectl get nodes -o custom-columns. The values are case-sensitive, and a mismatch means no nodes get selected and no drivers get deployed.

The Fio benchmark is running but the IOPS numbers look way lower than what Microsoft advertises. Why?

A few common culprits. First, the advertised IOPS figures assume data striping across all NVMe drives on the VM. If you're on a Standard_L8s_v3 with only one NVMe drive, you're not getting the peak numbers you'd see on an L80s_v3 with ten drives. Second, the benchmark parameters matter a lot: the Fio command in Microsoft's docs uses 4K block size, 16 I/O depth, and 8 parallel jobs, so if you're running a different configuration you'll see different numbers. Third, make sure you're using --direct=1 in Fio to bypass the OS page cache and hit the actual storage hardware. Cache-warmed reads will always look faster than what your workload will actually see in production.

Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.