Troubleshoot Azure Kubernetes CrashLoopBackOff, ImagePullBackOff & Pod Fix Guide
Why This Is Happening
I know how disorienting it is to open your AKS dashboard and see a wall of pods all sitting in Pending state: coredns not running, konnectivity-agent frozen, metrics-server going nowhere. Your whole cluster feels dead in the water, and the error messages from Kubernetes are spectacularly unhelpful. "No nodes available to schedule pods." Okay. But why are there no nodes? That's the part it never tells you.
Azure Kubernetes Service pod errors like CrashLoopBackOff, ImagePullBackOff, and the dreaded Pending status aren't random. They almost always trace back to one of four root causes: node pool exhaustion after a failed upgrade, a PodDisruptionBudget blocking drain operations, memory pressure introduced by the jump to cgroup v2 in Kubernetes 1.25, or a misconfigured workload crashing on startup because the runtime environment changed under it.
Here's the thing most guides don't explain: when you upgrade an AKS cluster, every node in the pool gets cordoned and drained one at a time. If a PodDisruptionBudget (one of those safety rules that say "at least N replicas must stay alive") has zero allowed disruptions, Kubernetes will refuse to evict the pods it protects, so the node never finishes draining. It just sits there, blocked. Your upgrade never finishes. And until those nodes come back online, system pods like coredns and konnectivity-agent have nowhere to go, so they queue up in Pending indefinitely.
The Kubernetes 1.25 upgrade path adds another layer of pain. AKS moved to Ubuntu 22.04 at that release, which defaults to cgroup version 2, a fundamentally different kernel interface for managing resource limits. Java apps running on JRE versions older than 11.0.16 or 1.8.0_372 (jdk8u372), .NET versions before 5.0, and some Node.js workloads all read memory limits from the old cgroup v1 API paths. When those paths disappear, the JVM or runtime falls back to the host's total memory, sizes its heap accordingly, and the kernel kills it for exceeding the container limit. That's your OOM error. That's your CrashLoopBackOff.
The error messages from kubectl get pods are blunt: a status with no explanation attached. You have to go deeper with kubectl describe pod to see the events that actually explain what broke. I'll show you exactly what to look for and how to read those events like a senior engineer would.
Troubleshoot Azure Kubernetes Pod Errors: The Quick Fix
If your AKS pods are stuck in Pending right after an upgrade attempt, the fastest single diagnostic is a two-command sequence. First, check whether you even have nodes available:
kubectl get nodes
If that returns an empty list or all nodes show NotReady or SchedulingDisabled, your upgrade stalled mid-drain. The second command tells you whether a PodDisruptionBudget is the culprit:
kubectl get events | grep -i drain
Scan the output for the phrase "Eviction blocked by Too Many Requests (usually a pdb)". If you see that, you've found your problem: a PDB with zero allowed disruptions is holding the entire node drain hostage. The fix I'll walk you through in Step 2 handles this in under five minutes for most clusters.
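To turn the node check above into a quick yes/no answer, here is a minimal sketch. It assumes the default table layout of kubectl get nodes --no-headers (node name in column 1, status in column 2); the function name count_schedulable_nodes is hypothetical and not part of kubectl.

```shell
# Count nodes that are Ready and not cordoned.
# Reads `kubectl get nodes --no-headers` output on stdin.
# Assumption: STATUS is the second column, for example "Ready",
# "NotReady", or "Ready,SchedulingDisabled" for a cordoned node.
count_schedulable_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Usage against a live cluster (needs kubectl access):
#   kubectl get nodes --no-headers | count_schedulable_nodes
```

A result of 0 matches the stalled mid-drain situation described above: nodes exist, but none of them can accept pods.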
If your nodes are online but pods are still crashing with CrashLoopBackOff, run:
kubectl describe pod <pod-name> -n <namespace>
Scroll to the Events section at the bottom. You're looking for FailedScheduling, BackOff, or ErrImagePull entries, plus OOMKilled as the Reason under Last State in the container status. The reason code tells you which path to follow.
For ImagePullBackOff specifically, the container runtime is failing to pull the image from its registry. The most common culprits are: the image tag doesn't exist, the registry requires authentication and the secret isn't attached to the pod's service account, or there's a network policy blocking egress to the registry endpoint. Check the imagePullSecrets in your deployment spec first. Nine times out of ten, that's it.
Run kubectl get pdb --all-namespaces before touching anything else. Look at the ALLOWED DISRUPTIONS column: if any row shows 0 and your cluster upgrade stalled, that PDB is almost certainly your blocker. You can fix that in minutes instead of spending hours looking in the wrong direction.
Step 1: Confirm Which System Pods Are Pending
Start here, every single time. Run kubectl get pods -n kube-system to get the names of system pods showing 0/1 Pending. You might see coredns, coredns-autoscaler, konnectivity-agent, and metrics-server all pending simultaneously. That pattern is a strong signal the entire node pool is unavailable, not that individual pods are broken.
kubectl get pods -n kube-system
# Look for rows where STATUS = Pending and READY = 0/1
kubectl describe pod coredns-845757d86-7xjqb -n kube-system
In the describe output, find the Node field near the top. If it shows <none>, the scheduler never assigned this pod to a node because it couldn't find one. Scroll to the Events section at the bottom. You'll see Warning FailedScheduling entries like:
Warning FailedScheduling 29m (x57 over 84m) default-scheduler
no nodes available to schedule pods
The x57 multiplier tells you the scheduler tried 57 times and failed every attempt. This is not a transient blip; it's a sustained infrastructure problem. The fix is upstream: get nodes back online.
Also note the Priority Class Name field. System pods like coredns carry system-node-critical priority, meaning Kubernetes would preempt nearly anything to place them if there were a node to place them on. The fact that even critical-priority pods are stuck confirms no schedulable nodes exist.
If your describe output shows OOMKilled in the Last State section, skip ahead to Step 4. That's the cgroup v2 memory issue, not a node availability problem.
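If you'd rather not eyeball the pod list, the pattern described above can be pulled out with a small filter. This is only a sketch: it assumes the standard column order of kubectl get pods --no-headers (NAME READY STATUS RESTARTS AGE), and pending_pods is a hypothetical helper name.

```shell
# Print the names of pods whose STATUS column is Pending.
# Reads `kubectl get pods -n kube-system --no-headers` output on stdin.
# Assumption: STATUS is the third column.
pending_pods() {
  awk '$3 == "Pending" { print $1 }'
}

# Usage against a live cluster:
#   kubectl get pods -n kube-system --no-headers | pending_pods
```

If the output lists coredns, konnectivity-agent, and metrics-server together, treat it as a node pool problem rather than three separate pod problems.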
Step 2: Find and Fix the Blocking PodDisruptionBudget
PodDisruptionBudgets are designed to protect your application availability during node maintenance. A PDB saying "keep at least 2 replicas running" is a good thing in production. But when you have exactly 2 replicas and you're trying to drain a node, the math doesn't work: you can't evict either pod without dropping below the minimum. Allowed disruptions hits zero. The drain stalls.
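The arithmetic behind that stall is simple enough to sketch. The helper below is hypothetical and covers only the case of a PDB with an integer minAvailable: allowed disruptions is the number of healthy pods minus minAvailable, floored at zero.

```shell
# Allowed disruptions for a PDB that uses an integer minAvailable:
# healthy pods minus minAvailable, never below zero.
allowed_disruptions() {
  healthy=$1
  min_available=$2
  result=$((healthy - min_available))
  if [ "$result" -lt 0 ]; then
    result=0
  fi
  echo "$result"
}

allowed_disruptions 2 2   # prints 0: two replicas, minAvailable 2, drain blocked
allowed_disruptions 3 2   # prints 1: one extra replica frees one eviction
```

This is why adding a single replica is often enough to unblock a drain without touching the PDB itself.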
# See all PDBs and their current allowed disruptions
kubectl get pdb --all-namespaces
# Example output showing the problem:
# NAME        MIN AVAILABLE   MAX UNAVAILABLE   ALLOWED DISRUPTIONS   AGE
# nginx-pdb   2               N/A               0                     24s
When ALLOWED DISRUPTIONS is 0, that PDB is actively blocking any eviction. You have three paths forward; choose based on your situation:
Option A: Increase your replica count. If you can scale the affected deployment up, do it. Adding one more replica pushes Allowed Disruptions to 1 and unblocks the drain without changing any safety configuration.
kubectl scale deployment/<name> --replicas=3 -n <namespace>
Option B: Temporarily relax the PDB. Edit the PDB to reduce minAvailable by 1, let the drain finish, then restore it.
Option C: Back up and delete the PDB. This is the fastest path during a stuck upgrade. Back it up first:
kubectl get pdb <pdb-name> -n <namespace> -o yaml > pdb-backup.yaml
kubectl delete pdb <pdb-name> -n <namespace>
Remember to restore it with kubectl apply -f pdb-backup.yaml after the upgrade finishes. Once the PDB is out of the way, re-trigger the upgrade; this kicks off a reconciliation that gets the cluster moving again:
az aks upgrade --name <aksName> --resource-group <resourceGroupName>
When prompted about upgrading an already-failed cluster state, answer y. That's expected behavior.
Step 3: Scale the Blocking Workload to Zero
Sometimes you can't or don't want to modify the PDB itself. Maybe it's owned by another team, or it's protecting a stateful workload where you genuinely can't afford to relax the minimum. In that case, you can drain the node by removing the workload itself temporarily.
First, identify which pods are causing the block. The drain events tell you exactly:
kubectl get events | grep -i drain
# Look for: "Eviction blocked by Too Many Requests (usually a pdb): <pod-name>"
Once you have the pod name, find whether it's owned by a Deployment or StatefulSet:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.ownerReferences}'
If it's owned by a ReplicaSet, the parent is a Deployment. Back it up before touching anything:
kubectl get deployment.apps <name> -n <namespace> -o yaml > deployment-backup.yaml
Then scale it to zero:
kubectl scale --replicas=0 deployment.apps/<name> -n <namespace>
With zero pods running, the PDB has nothing to protect and the eviction goes through. The node drains, the upgrade proceeds, and once the new node comes up you scale back to your normal replica count:
kubectl scale --replicas=2 deployment.apps/<name> -n <namespace>
Re-trigger the upgrade immediately after scaling down; don't let the cluster sit in a degraded state longer than necessary. Run az aks upgrade again to start the reconciliation process. You should see node drain progress within a few minutes once the blocking pods are gone.
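When the ownerReferences lookup above returns a ReplicaSet, you still need the Deployment name for the backup and scale commands. A rough shortcut, assuming the usual naming convention where a Deployment-managed ReplicaSet is called <deployment-name>-<pod-template-hash>, is to drop the last dash-separated segment. The function name is hypothetical, and this is a heuristic: the authoritative answer is the ReplicaSet's own ownerReferences.

```shell
# Heuristic: derive the Deployment name from a ReplicaSet name by
# removing the trailing "-<pod-template-hash>" segment.
# Assumption: the ReplicaSet was created by a Deployment.
deployment_from_replicaset() {
  echo "${1%-*}"
}

deployment_from_replicaset coredns-845757d86   # prints: coredns
```

Confirm the result with kubectl get deployment before scaling anything; a ReplicaSet created by hand will not follow this pattern.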
Step 4: Fix cgroup v2 Memory Errors After the 1.25 Upgrade
This one catches a lot of teams off guard. You upgrade your AKS cluster to 1.25 or later, everything looks fine for a day, then Java services start crashing with OOM errors, .NET apps get CPU-throttled, and pods cycle through CrashLoopBackOff over and over. Your resource limits haven't changed. Your code hasn't changed. What happened?
The answer is cgroup v2. AKS on Kubernetes 1.25+ uses Ubuntu 22.04, which ships with cgroup version 2 as the default kernel control group interface. The old cgroup v1 filesystem paths that Java runtimes and .NET used to read memory limits from are gone or behave differently. Old JVM versions see the host machine's total RAM instead of the container limit and allocate a heap to match. The kernel then kills the container for exceeding its actual limit. That's the OOM cycle.
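Before blaming cgroup v2, it's worth confirming that the node really runs it. One common check, described in the Kubernetes documentation, is the filesystem type mounted at /sys/fs/cgroup. The sketch below wraps that check; the helper name cgroup_version_from_fstype is hypothetical.

```shell
# Map the filesystem type of /sys/fs/cgroup to a cgroup version.
# cgroup2fs means cgroup v2; tmpfs means the cgroup v1 hierarchy.
cgroup_version_from_fstype() {
  case "$1" in
    cgroup2fs) echo "v2" ;;
    tmpfs)     echo "v1" ;;
    *)         echo "unknown" ;;
  esac
}

# On a node or inside a container:
#   cgroup_version_from_fstype "$(stat -fc %T /sys/fs/cgroup/)"
```

If this reports v1, the OOM kills you're seeing have a different cause and the runtime upgrade discussed here is unlikely to help.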
The symptoms are specific. You'll see these in kubectl describe pod:
Last State: Terminated
Reason: OOMKilled
Exit Code: 137
Exit code 137 is 128 + 9: the process received SIGKILL (signal 9) from the kernel's out-of-memory handler. Exit code 1 on a Java app after a 1.25 upgrade is also suspicious and worth investigating for cgroup v2 issues.
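The same 128 + signal arithmetic applies to other exit codes you'll meet in describe output. Here is a small sketch that decodes them; describe_exit_code is a hypothetical helper, and it relies on the shell's kill -l to translate a signal number into its name.

```shell
# Decode a container exit code. Codes above 128 mean the process was
# terminated by a signal, and the signal number is (code - 128).
describe_exit_code() {
  code=$1
  if [ "$code" -gt 128 ]; then
    sig=$((code - 128))
    echo "exit $code = 128 + $sig (SIG$(kill -l "$sig"))"
  else
    echo "exit $code (application exit status, not a signal)"
  fi
}

describe_exit_code 137   # exit 137 = 128 + 9 (SIGKILL)
```

Exit code 143 decodes the same way to SIGTERM, which usually means a normal shutdown request rather than an OOM kill.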
The fix depends on your runtime. For Java, upgrade to a version with full cgroup v2 support. The minimum versions you need are: OpenJDK/HotSpot jdk8u372, 11.0.16, or 15+. IBM Semeru users need 8.0.382.0, 11.0.20.0, or 17.0.8.0+. For Azure customers, Microsoft officially backs Eclipse Temurin (Java 8) and Microsoft Build of OpenJDK (Java 11+); use those. For .NET, anything below version 5.0 needs to be upgraded to 5.0 or later. No patch exists for older .NET versions; you have to upgrade.
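If you have many services to audit, the OpenJDK/HotSpot thresholds above can be encoded in a small check. This is a sketch under stated assumptions: it only understands version strings shaped like 1.8.0_372, 11.0.16, or 17.0.8 (the formats java -version typically prints), it does not cover the IBM Semeru numbering, and java_supports_cgroup_v2 is a hypothetical name.

```shell
# Return 0 if an OpenJDK/HotSpot version string meets the minimums
# quoted in the text: jdk8u372, 11.0.16, or 15 and later.
# Assumption: input looks like "1.8.0_372", "11.0.16", or "17.0.8".
java_supports_cgroup_v2() {
  v=$1
  case "$v" in
    1.8.0_*)
      update=${v#1.8.0_}
      update=${update%%[!0-9]*}      # strip suffixes such as "-b07"
      [ "${update:-0}" -ge 372 ]
      ;;
    11.0.*)
      patch=${v#11.0.}
      patch=${patch%%[!0-9]*}        # keep only the third component
      [ "${patch:-0}" -ge 16 ]
      ;;
    *)
      major=${v%%[!0-9]*}            # leading digits are the major version
      [ "${major:-0}" -ge 15 ]
      ;;
  esac
}

# Usage:
#   if java_supports_cgroup_v2 "11.0.15"; then echo ok; else echo upgrade; fi
```

Anything the function rejects is a candidate for the runtime upgrade; anything it accepts still deserves a test run on a 1.25+ cluster.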
As an interim measure while you upgrade runtimes, increase your pod memory limits and requests. Higher limits give the runtime more headroom before the kernel's OOM killer steps in, which reduces how often containers get killed while you roll out the runtime fixes.
Step 5: Resolve ImagePullBackOff
ImagePullBackOff means the container runtime on your AKS node tried to pull the image and failed. It backs off and retries on an exponential timer; that's why you see it cycling rather than staying in a fixed error state. The kubectl describe output will show you the exact pull failure reason under Events:
kubectl describe pod <pod-name> -n <namespace>
# Events section will show something like:
Warning Failed 2m kubelet Failed to pull image
"myregistry.azurecr.io/myapp:v2.1": rpc error:
code = Unknown desc = failed to pull and unpack image:
unauthorized: authentication required
The unauthorized: authentication required message points directly to a missing or misconfigured image pull secret. Check your deployment spec:
kubectl get deployment <name> -n <namespace> -o yaml | grep -A5 imagePullSecrets
If imagePullSecrets is absent or points to a secret that doesn't exist, create it from your Azure Container Registry credentials:
kubectl create secret docker-registry acr-secret \
--docker-server=<registry-name>.azurecr.io \
--docker-username=<service-principal-id> \
--docker-password=<service-principal-password> \
-n <namespace>
Then patch the deployment to use it:
kubectl patch deployment <name> -n <namespace> -p \
'{"spec":{"template":{"spec":{"imagePullSecrets":[{"name":"acr-secret"}]}}}}'
If the image tag simply doesn't exist (for example, the deployment references :v2.1 but the image was pushed as :v2.1.0), the error will read not found. Check your registry directly. If you're using AKS with the ACR attachment feature (az aks update --attach-acr), verify the managed identity has the AcrPull role on the registry. Missing role assignments are a common AKS ImagePullBackOff cause that doesn't produce an obvious "permission denied" message.
Once you've fixed the pull secret or tag, delete the stuck pod; Kubernetes will recreate it from the deployment spec and the fresh pull attempt should succeed.
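The pull failure messages discussed in this step map fairly cleanly to causes, so a rough triage helper can save some reading. In the sketch below, the "unauthorized" and "not found" patterns come from the messages quoted above; the "i/o timeout" and "no such host" patterns are my assumption about how network failures typically surface, and classify_pull_error is a hypothetical name.

```shell
# Map an image pull error message to the likely cause.
# The network patterns are assumptions, not an exhaustive list.
classify_pull_error() {
  case "$1" in
    *unauthorized*|*"authentication required"*)
      echo "auth: check imagePullSecrets or the AcrPull role assignment" ;;
    *"not found"*)
      echo "tag: the image name or tag does not exist in the registry" ;;
    *"i/o timeout"*|*"no such host"*)
      echo "network: check egress rules and DNS from the node" ;;
    *)
      echo "unknown: read the full event message" ;;
  esac
}

# Usage: paste the message from the Events section of kubectl describe pod
#   classify_pull_error "unauthorized: authentication required"
```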
Advanced Troubleshooting for Azure Kubernetes Pod Errors
When the standard fixes don't resolve your AKS CrashLoopBackOff or Pending pod issues, you need to go deeper. These are the techniques I use when the obvious answers have already been ruled out.
Analyze Kubernetes Events Cluster-Wide
Don't just look at events on a single pod. Get the full picture across the cluster, filtered by warning severity:
kubectl get events --all-namespaces --field-selector type=Warning \
--sort-by='.lastTimestamp'
Sort by timestamp so the most recent failures appear last. Look for repeated warnings on the same object; that repetition count tells you how long the problem has been active. A FailedScheduling x57 means the scheduler tried 57 times over 84 minutes and gave up each time. That's not a transient issue.
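On a busy cluster the raw warning list gets long, so it can help to tally warnings by reason first. The following is a sketch, not a polished tool: count_warning_reasons is a hypothetical helper that expects one reason per line, and it counts event objects rather than the per-event repetition counter.

```shell
# Count Warning events per reason, most frequent first.
# Reads one reason per line on stdin and ignores blank lines.
count_warning_reasons() {
  awk 'NF { c[$1]++ } END { for (r in c) print c[r], r }' | sort -rn
}

# Usage against a live cluster:
#   kubectl get events --all-namespaces --field-selector type=Warning \
#     -o custom-columns=REASON:.reason --no-headers | count_warning_reasons
```

A list dominated by FailedScheduling points back at node availability, while one dominated by BackOff or Failed points at the workloads themselves.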
Third-Party Monitoring Agents and cgroup v2 Incompatibility
This is one of the most underdiagnosed causes of post-upgrade instability. If you have Datadog, Dynatrace, New Relic, Falco, or similar security/monitoring agents running as DaemonSets on your AKS nodes, they may directly access the cgroup filesystem. After the cgroup v2 migration on Kubernetes 1.25+, agents that read /sys/fs/cgroup/memory/ paths (the v1 hierarchy) will either fail silently, report garbage metrics, or crash entirely, which can cascade into pod failures on affected nodes.
Check your DaemonSet agents:
kubectl get daemonset --all-namespaces
kubectl describe daemonset <agent-name> -n <namespace>
Look for DaemonSets whose READY count stays below DESIRED, and for agent pods with high restart counts (the RESTARTS column in kubectl get pods). Update each affected agent to a version that explicitly supports cgroup v2. Check the vendor's release notes for "cgroup v2 support"; most major vendors added it between 2022 and 2023.
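To spot struggling agents without reading every row, you can filter the DaemonSet list for ones that aren't fully ready. This sketch assumes the default column order of kubectl get daemonset --all-namespaces --no-headers (NAMESPACE NAME DESIRED CURRENT READY ...); unready_daemonsets is a hypothetical helper name.

```shell
# Print DaemonSets whose READY count is below DESIRED.
# Reads `kubectl get daemonset --all-namespaces --no-headers` on stdin.
# Assumption: DESIRED is column 3 and READY is column 5.
unready_daemonsets() {
  awk '($5 + 0) < ($3 + 0) { print $1 "/" $2 ": " $5 "/" $3 " ready" }'
}

# Usage against a live cluster:
#   kubectl get daemonset --all-namespaces --no-headers | unready_daemonsets
```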
Node Resource Pressure Debugging
If nodes exist but pods still won't schedule, the nodes may be under memory or disk pressure, causing Kubernetes to mark them as unschedulable:
kubectl describe node <node-name> | grep -A8 "Conditions:"
# Look for MemoryPressure, DiskPressure, or PIDPressure with Status True
Memory pressure on a node triggers the kubelet eviction manager, which will terminate lower-priority pods to reclaim memory. If your workloads don't have resource requests set, Kubernetes can't make good scheduling decisions and will pack too many pods onto a single node, leading to cascading pressure.
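Checking nodes one at a time gets tedious on larger pools. As a sketch, the filter below reads one line per node in the form <node>:Type=Status;Type=Status; and prints the nodes reporting any pressure condition. The helper name nodes_under_pressure and that line format are my own choices; the jsonpath template in the usage comment is one way to produce it.

```shell
# Print nodes that report memory, disk, or PID pressure.
# Reads lines shaped like "node-1:MemoryPressure=False;Ready=True;" on stdin.
nodes_under_pressure() {
  grep -E '(Memory|Disk|PID)Pressure=True' | cut -d: -f1
}

# Usage against a live cluster:
#   kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}:{range .status.conditions[*]}{.type}={.status};{end}{"\n"}{end}' \
#     | nodes_under_pressure
```

Empty output means no node is reporting pressure, so the scheduling problem lies elsewhere.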
Checking for Node Pool Scaling Issues
If your cluster uses the cluster autoscaler, verify it's actually working and not stuck itself. On AKS the autoscaler runs in the managed control plane, so you generally won't find it as a pod in kube-system; check the status ConfigMap it publishes and the scale-up events instead:
kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml
kubectl get events --all-namespaces | grep -i scaleup
Look for scale-up failure messages. Quota exhaustion in your Azure subscription, unavailable VM SKUs in the region, and subnet IP exhaustion all prevent the autoscaler from adding nodes even when pods are pending.
If the cluster stops responding to kubectl commands (indicating a management plane issue, not a workload issue), escalate to Microsoft Support. For billing and subscription-level problems preventing resource creation, Microsoft Support is the only path; these can't be self-resolved.
Prevention & Best Practices for Azure Kubernetes Stability
I've seen the same AKS pod errors hit teams repeatedly because they fixed the symptom without addressing the underlying configuration gaps. Here's how to make sure you don't end up back here next quarter.
Always audit PodDisruptionBudgets before an upgrade. Run kubectl get pdb --all-namespaces and check every single PDB's Allowed Disruptions value. If any are sitting at zero with no headroom, you will hit a drain blockage. Fix it before you start the upgrade, not after. Either scale up replicas to create headroom or document a PDB relaxation procedure as part of your upgrade runbook.
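That audit is easy to script so it can run as part of an upgrade checklist. The sketch below is one possible shape: zero_disruption_pdbs is a hypothetical helper that reads namespace, name, and allowed disruptions per line, and the usage comment shows a custom-columns query that produces those three fields from the PDB's status.disruptionsAllowed value.

```shell
# Print PDBs that currently allow zero disruptions.
# Reads "<namespace> <name> <allowed-disruptions>" lines on stdin.
zero_disruption_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Usage against a live cluster:
#   kubectl get pdb --all-namespaces --no-headers \
#     -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#     | zero_disruption_pdbs
```

Any name this prints is a PDB that will block a drain today, so resolve it before starting the upgrade.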
Test your workloads against cgroup v2 before upgrading to Kubernetes 1.25+. Spin up a dev cluster on 1.25, deploy your applications, and watch for OOMKilled events or unexpected memory spikes over 48 hours. Java and .NET workloads that pass all tests on 1.24 can behave completely differently under cgroup v2. Catching it in dev costs you an afternoon; catching it in production costs you an incident.
Set resource requests and limits on every workload. Pods without resource requests are scheduled unpredictably because the scheduler has no basis for placement decisions. Pods without limits can consume unlimited node memory, triggering OOM conditions that evict neighbors. This is the single most common AKS configuration gap I've seen across enterprise clusters.
Use maintenance windows and planned upgrade slots. AKS lets you configure maintenance windows via az aks maintenanceconfiguration add. Scheduling upgrades during low-traffic windows gives you time to catch PDB and scheduling issues before they affect users.
Watch the Kubernetes version support calendar. AKS moves quickly through minor versions; unsupported versions stop receiving security patches and you lose Microsoft support coverage. Staying within one or two minor versions of the latest GA release keeps your upgrade paths short and your options open.
- Run kubectl get pdb --all-namespaces weekly; zero-disruption PDBs are a ticking clock during any upgrade
- Set up Azure Monitor alerts on node NotReady status so you catch node pool failures before pods start piling up in Pending
- Pin all production container images to specific SHA digests, not floating tags; this eliminates an entire class of ImagePullBackOff errors from unexpected tag changes
- Keep all third-party DaemonSet agents (monitoring, security) updated on the same schedule as your Kubernetes version; stale agents are one of the most common sources of post-upgrade instability
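The digest-pinning item in the checklist above is easy to verify mechanically. Here is a minimal sketch: is_digest_pinned is a hypothetical helper that treats any image reference containing @sha256: as pinned and everything else as a floating tag.

```shell
# Return 0 if an image reference is pinned to a sha256 digest,
# 1 if it relies on a tag (or on the implicit "latest").
is_digest_pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;
    *)          return 1 ;;
  esac
}

# Usage: report unpinned images across the cluster
#   kubectl get pods --all-namespaces \
#     -o jsonpath='{range .items[*]}{range .spec.containers[*]}{.image}{"\n"}{end}{end}' \
#     | sort -u | while read -r img; do
#         is_digest_pinned "$img" || echo "not pinned: $img"
#       done
```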
Frequently Asked Questions
Why are all my pods stuck in Pending after an AKS cluster upgrade?
When every system pod (coredns, konnectivity-agent, metrics-server) shows Pending simultaneously, it almost always means the node pool is offline. Run kubectl get nodes to confirm, then run kubectl get events | grep -i drain to check for "Eviction blocked by Too Many Requests (usually a pdb)" messages. A PodDisruptionBudget with zero allowed disruptions is the most common cause: it prevents node drain from completing during the upgrade, leaving nodes cordoned with no available schedulable targets. Fix the PDB and re-run az aks upgrade to trigger reconciliation and restore the cluster.
Where do I find information about debugging Kubernetes problems on AKS?
Your first stop should be the official Kubernetes troubleshooting guide at kubernetes.io/docs/tasks/debug; it covers pod, node, and cluster-level debugging systematically. Microsoft also publishes an AKS-specific troubleshooting guide written by their engineers that addresses Azure-specific edge cases. The kubectl describe command is your best diagnostic tool in practice: run it against any failing pod and read the Events section carefully, because the reason codes and messages there contain far more useful information than the status column in kubectl get pods. For AKS-specific issues involving the control plane or infrastructure layer, the Azure portal's AKS diagnostics blade (under the cluster resource → Diagnose and solve problems) runs automated checks and often identifies the root cause directly.
Can I move my AKS cluster to a different Azure subscription or tenant?
No, AKS does not support moving a cluster to a different subscription or a subscription to a different tenant. The cluster's identity (managed identity or service principal) has permissions scoped to the original subscription and tenant. After a move, those identity bindings break and the cluster stops functioning. Microsoft explicitly doesn't support this operation. If you need your workloads in a different subscription, the correct path is to provision a new AKS cluster in the target subscription, migrate your workloads using GitOps or Flux, and decommission the old cluster. It's more work upfront but it avoids a broken cluster in an unsupported state.
My pods are crashing with OOMKilled after upgrading to Kubernetes 1.25. What changed?
Kubernetes 1.25 on AKS moved to Ubuntu 22.04, which uses cgroup version 2 by default. This is a different kernel interface than cgroup v1, and runtimes that directly read memory limit information from the old cgroup v1 filesystem paths now read incorrect values, often the entire host's RAM. Java versions older than jdk8u372 or 11.0.16 are especially affected, as are .NET versions below 5.0. The JVM allocates a heap based on what it thinks the memory limit is, the allocation far exceeds the actual container limit, and the kernel sends SIGKILL (exit code 137). Fix this by upgrading your Java runtime to a cgroup v2-compatible version, or .NET to 5.0+. As a short-term bridge, increase your pod memory limits and requests to reduce eviction frequency while you roll out the runtime updates.
Can I enable Kubernetes RBAC on an existing AKS cluster?
No, enabling Kubernetes role-based access control on an already-running AKS cluster is not supported. RBAC can only be configured at cluster creation time. When you create a cluster via the Azure CLI, the Azure portal, or any API version newer than 2020-03-01, Kubernetes RBAC is enabled by default. If you need RBAC on a cluster that was created without it, you'll need to provision a new cluster with RBAC enabled and migrate your workloads. This is one of those "plan ahead" decisions: if you're spinning up a new cluster today, RBAC is on by default so you don't need to do anything special.
How do I fix the "Eviction blocked by Too Many Requests (usually a pdb)" error during an AKS upgrade?
That message appears in Kubernetes drain events when a PodDisruptionBudget's allowed disruptions count hits zero, meaning no pods on that node can be evicted without violating the PDB's minimum availability guarantee. You have three options: scale up your deployment replicas to create eviction headroom, temporarily back up and delete the blocking PDB (kubectl get pdb <name> -n <ns> -o yaml > backup.yaml && kubectl delete pdb <name> -n <ns>), or scale the blocked workload down to zero replicas before the drain. After resolving the blockage, re-run az aks upgrade against the same version; this triggers a cluster reconciliation that resumes the stalled upgrade and brings nodes back online.