Fix Azure Kubernetes Service Troubleshooting Errors
Why This Is Happening
I've watched experienced Azure engineers spend the better part of a morning staring at terminal output that shows nothing but Pending, over and over, with no idea where to start. That feeling is exactly what Azure Kubernetes Service troubleshooting looks like at 2 AM when a production deployment has gone sideways. The error messages are terse, the logs are buried, and the path to resolution isn't obvious. You're not doing anything wrong. AKS just doesn't make it easy.
The three most common failure modes I see engineers hit are: pods permanently stuck in Pending state because there are no schedulable nodes, cluster upgrades blocked by a PodDisruptionBudget (PDB) refusing to allow evictions, and a wave of out-of-memory (OOM) crashes that appear a few hours after upgrading to Kubernetes 1.25 or later. Each one looks completely different on the surface, but they share a common thread: the default error messages give you almost nothing actionable to work with.
The Pending pod problem almost always means your cluster has no available nodes to schedule work onto. This can happen right after a cluster creation that didn't complete correctly, after a failed scale operation, or after an upgrade left node pools in a degraded state. The scheduler tried, repeatedly, as many as 57 times in cases I've seen documented, and gave up each time with FailedScheduling: no nodes available.
The PDB issue is subtler. A PodDisruptionBudget is a Kubernetes resource that protects your workloads during voluntary disruptions, like node drains. It's a good thing to have. But when your minAvailable setting means zero disruptions are allowed and AKS tries to drain a node during an upgrade, the entire process halts. The cluster gets stuck in a failed upgrade state, and every subsequent operation compounds the problem.
The cgroup v2 memory issue is the most insidious because it's a consequence of something that looks like a straightforward OS upgrade. Starting with Kubernetes 1.25, AKS moved to Ubuntu 22.04, which defaults to cgroup version 2. Older Java runtimes (before 8u372 or 11.0.16), older .NET versions (before 5.0), and some Node.js environments don't correctly read memory limits from the cgroup v2 API. The result: they see the full host memory instead of their container limit, allocate too much, and get OOM-killed.
None of these are obscure edge cases. They're the most common Azure Kubernetes Service troubleshooting scenarios that show up in support queues week after week. The good news is that every one of them has a documented, reproducible fix, and that's exactly what this guide walks through.
The Quick Fix: Try This First
Before you go deep into diagnostics, run this one command. It tells you immediately whether you have a node availability problem or something else entirely:
kubectl get pods -n kube-system
If you see output like this, with every system pod showing 0/1 Pending, you have a node scheduling problem, not an application problem:
NAME READY STATUS RESTARTS AGE
coredns-845757d86-7xjqb 0/1 Pending 0 78m
coredns-autoscaler-5f85dc856b-mxkrj 0/1 Pending 0 77m
konnectivity-agent-67f7f5554f-nsw2g 0/1 Pending 0 77m
metrics-server-6bc97b47f7-dfhbr 0/1 Pending 0 77m
When every pod in kube-system is pending, not just your application pods, the cluster itself is broken, not your deployment. Don't waste time rebuilding your manifests. The fix starts at the node pool level, not the workload level.
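As a rough way to script that check, the sketch below counts Pending pods in the table that kubectl prints. The function name all_system_pods_pending is hypothetical, and the code assumes the default column layout shown above (a header row, with STATUS in the third column).

```shell
# Reads `kubectl get pods` table output on stdin.
# Prints "<pending>/<total> pending" and exits 0 only if every pod is Pending.
# Assumes a header row is present (do not combine with --no-headers).
all_system_pods_pending() {
  awk 'NR > 1 { total++; if ($3 == "Pending") pending++ }
       END {
         printf "%d/%d pending\n", pending, total
         exit !(total > 0 && pending == total)
       }'
}
```

One possible use is to pipe the live output through it: kubectl get pods -n kube-system | all_system_pods_pending. An exit status of 0 suggests a cluster-level problem rather than an application problem.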
Next, drill into one of the pending pods to confirm the root cause:
kubectl describe pod coredns-845757d86-7xjqb -n kube-system
Scroll to the Events section at the bottom. If you see Warning FailedScheduling ... no nodes available to schedule pods repeated multiple times, you've confirmed the diagnosis. The fix is to restore node availability, either by checking why the node pool didn't provision, manually scaling the node pool back up via the Azure portal or CLI, or re-triggering the upgrade to force reconciliation.
For the fastest path when the cluster is in a failed upgrade state, re-run the upgrade targeting the same version. This is intentional: it triggers AKS to attempt reconciliation.
az aks upgrade --name <aksName> --resource-group <resourceGroupName>
When prompted, confirm with y. If the cluster is already showing as failed, the CLI will tell you it's proceeding to resolve the failed state. That's expected behavior, not an error.
Tip: Run kubectl get pods -n kube-system before looking at your application namespace. If system pods are pending, nothing in your app namespace will work either, and you'll burn time debugging the wrong layer. System pods pending means the cluster infrastructure itself needs attention first.
Step 1: Identify the Failure Pattern
Azure Kubernetes Service troubleshooting starts with triage. The three main failure patterns each need a different approach, so identifying which one you have before taking any action saves you from making things worse.
Run these three commands in sequence. The outputs together will tell you what's actually broken:
# Check overall cluster pod health
kubectl get pods -n kube-system
# Check for PDB-related events
kubectl get events | grep -i drain
# Check node pool status
az aks nodepool list --cluster-name <aksName> --resource-group <resourceGroupName> -o table
Here's how to read what you get back:
- All kube-system pods show Pending: Node availability problem. Go to Step 2.
- Events show "Eviction blocked by Too Many Requests (usually a pdb)": PDB is blocking node drain. Go to Step 3.
- Pods are running but crashing with OOMKilled, or you upgraded recently to K8s 1.25+: cgroup v2 memory issue. Go to Step 4.
- Node pools show failed provisioning state in the CLI: Retry the upgrade as described in Step 5.
One thing I see people miss: the kubectl get events | grep -i drain command is only useful during or shortly after an upgrade attempt. Events have a relatively short retention window in Kubernetes. If you're troubleshooting an upgrade that failed hours ago and the events have rolled off, check the Azure Activity Log in the portal under your AKS resource; it retains operation history much longer.
You should also note the AGE column from kubectl get pods. Pods that have been pending for over an hour (like the 78-minute example above) indicate a persistent infrastructure problem, not a transient scheduling hiccup that will self-resolve.
Step 2: Restore Node Availability
When kubectl describe pod confirms the FailedScheduling: no nodes available error, your node pool either didn't provision correctly or got into a bad state during an upgrade or scale operation. Here's the methodical way to fix it.
First, check your node pool state directly:
az aks nodepool show \
--cluster-name <aksName> \
--resource-group <resourceGroupName> \
--name <nodepoolName> \
--query "provisioningState"
If it returns Failed or Canceled, you need to trigger a reconciliation. The fastest way is to re-run the upgrade to the same version that's currently deployed. AKS will detect the failed state and attempt to resolve it:
az aks upgrade \
--name <aksName> \
--resource-group <resourceGroupName>
If the node pool provisioned but no nodes are ready because the scale count is zero, which can happen if something automated scaled the pool down, manually scale it back up:
az aks nodepool scale \
--cluster-name <aksName> \
--resource-group <resourceGroupName> \
--name <nodepoolName> \
--node-count 3
After either operation, give it 5–10 minutes and then re-run kubectl get pods -n kube-system. You should see pods transition from Pending to ContainerCreating and then Running. If they stay in Pending after nodes come up, the nodes may not have enough free capacity. Check with kubectl describe nodes to see allocated vs. available CPU and memory on each node.
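If the cluster has several node pools, checking them one at a time gets tedious. The following is a minimal sketch that filters a name/state listing down to the pools that are not healthy. The function name unhealthy_nodepools is hypothetical, and it assumes tab-separated input with the pool name in the first column and the provisioning state in the second.

```shell
# Reads "name<TAB>provisioningState" lines on stdin and prints every pool
# whose state is anything other than Succeeded.
# Exits 0 if at least one unhealthy pool was found, 1 otherwise.
unhealthy_nodepools() {
  awk -F '\t' 'NF >= 2 && $2 != "Succeeded" { print $1 ": " $2; found = 1 }
               END { exit !found }'
}
```

Input in that shape can be produced with az aks nodepool list --cluster-name <aksName> --resource-group <resourceGroupName> --query "[].[name, provisioningState]" --output tsv, piped into unhealthy_nodepools.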
Step 3: Resolve the PDB Blocking the Upgrade
This one catches a lot of teams off guard because the PDB was doing exactly what it was designed to do: protecting workload availability. The problem is that it's set too aggressively for the cluster's current state, meaning the Allowed Disruptions value is zero.
Check your PDBs and their current allowed disruption count:
kubectl get pdb --all-namespaces
Output that shows trouble looks like this:
NAME MIN AVAILABLE MAX UNAVAILABLE ALLOWED DISRUPTIONS AGE
nginx-pdb 2 N/A 0 24s
When ALLOWED DISRUPTIONS is 0, no pods can be evicted from the node AKS is trying to drain, and the upgrade halts. You have three options depending on your risk tolerance:
Option A: Adjust the PDB (preferred for production). Edit the PDB to allow at least one disruption. Either reduce minAvailable or increase your replica count so the math works out to at least 1 allowed disruption. Then re-trigger the upgrade.
Option B: Back up, delete, and redeploy the PDB. This is the cleanest approach if you need the upgrade to complete right now and you're comfortable rebuilding the PDB afterward:
# Back up the PDB first
kubectl get pdb <pdb-name> -n <namespace> -o yaml > pdb-backup.yaml
# Delete it
kubectl delete pdb <pdb-name> -n <namespace>
# Run the upgrade
az aks upgrade --name <aksName> --resource-group <resourceGroupName>
# Restore afterward
kubectl apply -f pdb-backup.yaml
Option C: Scale down the workload to zero before upgrading. If the workload can tolerate downtime and you want a clean drain:
kubectl scale --replicas=0 deployment.apps/<name> -n <namespace>
After the upgrade completes, scale back up to your normal replica count. In all three cases, after resolving the PDB blocker, re-run az aks upgrade targeting the same version. This triggers AKS to reconcile the failed upgrade state.
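Before re-running the upgrade, it can help to confirm that no PDB still reports zero allowed disruptions. The sketch below scans the table printed by kubectl get pdb for such rows. The function name blocking_pdbs is hypothetical, and the code assumes the default table layout, where ALLOWED DISRUPTIONS is the second-to-last column and AGE is the last.

```shell
# Reads `kubectl get pdb` table output on stdin (with or without a leading
# NAMESPACE column) and prints each row whose ALLOWED DISRUPTIONS is 0.
# Exits 0 if at least one blocking PDB was found, 1 otherwise.
blocking_pdbs() {
  awk 'NR > 1 && NF >= 2 && $(NF - 1) == 0 { print; found = 1 }
       END { exit !found }'
}
```

For example: kubectl get pdb --all-namespaces | blocking_pdbs. Any row it prints is a PDB that would still halt a node drain.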
Step 4: Fix cgroup v2 Memory Issues
If your pods started crashing with OOMKilled after upgrading to Kubernetes 1.25 or you're seeing dramatically higher memory usage than before the upgrade, cgroup v2 is almost certainly the cause. This is not a Kubernetes bug; it's a runtime compatibility issue. Kubernetes 1.25 moved to Ubuntu 22.04 on AKS, and Ubuntu 22.04 uses cgroup v2 by default. Older runtimes don't know how to read memory limits from the cgroup v2 API correctly.
Here's what's actually happening under the hood: your Java or .NET process starts up, asks the OS "how much memory do I have available?", and instead of reading its container memory limit from cgroup v2, it reads the host machine's total RAM. A container with a 512 MB limit now thinks it has 16 GB to work with. It allocates accordingly, hits the actual container limit, and gets killed.
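To see which cgroup version a container is running under and what limit it has actually been given, a sketch like the following can be run inside the container. It relies on two kernel conventions: a cgroup v2 mount exposes a cgroup.controllers file and stores the limit in memory.max, while cgroup v1 stores it in memory/memory.limit_in_bytes. The function name container_memory_limit is hypothetical, and the sketch assumes the image ships a POSIX shell.

```shell
# Prints the cgroup version and the memory limit visible at a cgroup mount.
# $1 is the mount point to inspect (defaults to /sys/fs/cgroup).
container_memory_limit() {
  root="${1:-/sys/fs/cgroup}"
  if [ -f "$root/cgroup.controllers" ]; then
    # cgroup v2: the limit is in memory.max ("max" means no limit is set)
    if [ -f "$root/memory.max" ]; then
      echo "v2 $(cat "$root/memory.max")"
    else
      echo "v2 no-limit-file"
    fi
  elif [ -f "$root/memory/memory.limit_in_bytes" ]; then
    # cgroup v1: the limit is under the memory controller directory
    echo "v1 $(cat "$root/memory/memory.limit_in_bytes")"
  else
    echo "unknown"
    return 1
  fi
}
```

A container with a 512 MB limit on a cgroup v2 node would be expected to report v2 536870912. A runtime that only knows the v1 path never finds that file and falls back to the host's total memory.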
The fix depends on your runtime:
Java applications: Upgrade to a version that natively supports cgroup v2. The minimum versions are OpenJDK/HotSpot 8u372, 11.0.16, or any version from 15 onward. For IBM Semeru Runtimes, the minimum versions are 8.0.382.0, 11.0.20.0, or 17.0.8.0. For Azure customers, Microsoft officially supports Eclipse Temurin binaries (Java 8) and Microsoft Build of OpenJDK (Java 11+).
.NET applications: Upgrade to .NET 5.0 or later. Anything older than 5.0 does not correctly handle cgroup v2 memory accounting.
Node.js: Upgrade to a recent LTS version. Most modern Node.js versions handle cgroup v2 correctly, but check your base image.
Third-party monitoring or security agents: This is the one people forget. Some APM agents and security tools access the cgroup filesystem directly. Check with your vendor for a cgroup v2-compatible version and update those agents first. They're often the silent cause of elevated memory pressure that shows up as general instability rather than OOMKilled events.
If you absolutely cannot upgrade the runtime immediately, increase your pod memory limits and requests as a temporary mitigation. It won't fix the root cause but will reduce the eviction rate while you work on the proper fix.
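The OpenJDK/HotSpot minimums listed above (8u372, 11.0.16, or 15 and later) can be turned into a quick check for image audits. This is a sketch only: the function name java_supports_cgroup_v2 is hypothetical, it accepts version strings in the forms 1.8.0_372, 8u372, 11.0.16, or 17.0.8, and it does not cover IBM Semeru or other vendors' numbering.

```shell
# Returns 0 if the given OpenJDK/HotSpot version string meets the cgroup v2
# minimums (8u372, 11.0.16, or major version 15+), 1 otherwise.
java_supports_cgroup_v2() {
  v="$1"
  case "$v" in
    1.8.0_*) n="${v#1.8.0_}"; min=372 ;;  # legacy Java 8 format
    8u*)     n="${v#8u}";     min=372 ;;  # short Java 8 format
    11.0.*)  n="${v#11.0.}";  min=16 ;;   # Java 11 update number
    *)       n="$v";          min=15 ;;   # everything else: compare major
  esac
  n="${n%%[!0-9]*}"                       # keep only the leading digits
  [ "${n:-0}" -ge "$min" ]
}
```

The version string itself would come from the image, for instance from the output of java -version. Versions 12 through 14 are reported as unsupported because the documented minimum outside the 8 and 11 lines is 15.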
Step 5: Retry the Upgrade to Reconcile the Failed State
Once you've resolved the underlying issue, whether that was fixing the PDB, restoring node availability, or updating runtimes, the cluster is likely still sitting in a failed upgrade state. AKS won't automatically recover from this. You need to explicitly re-trigger the upgrade to the same version to force a reconciliation pass.
The command is the same one you'd use for a normal upgrade. AKS detects the failed state on its own:
az aks upgrade \
--name <aksName> \
--resource-group <resourceGroupName>
When prompted, confirm with y. You'll see a message like:
Cluster currently in failed state. Proceeding with upgrade to existing version
1.28.3 to attempt resolution of failed cluster state.
That message is telling you exactly what's happening. It's not an error; it's the recovery mechanism working as designed. The upgrade process will proceed through the control plane first, then each node pool.
Monitor the progress with:
az aks show \
--name <aksName> \
--resource-group <resourceGroupName> \
--query "provisioningState"
Expect to see Updating for a while, then Succeeded. If it returns to Failed, the blocker hasn't been fully resolved. Go back and re-check your PDBs and node pool states. One thing to verify: if you had multiple PDBs across different namespaces, make sure you addressed all of them, not just the one that showed up in the first kubectl get events output.
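Rather than re-running that query by hand, the wait can be scripted. The sketch below polls until the state leaves Updating. The names get_state, wait_for_upgrade, AKS_NAME, AKS_RG, POLL_SECONDS, and MAX_POLLS are all hypothetical, and the 30-second interval and 60-attempt cap are arbitrary defaults rather than recommended values.

```shell
# Query the cluster provisioning state.
# AKS_NAME and AKS_RG are placeholders you must set yourself.
get_state() {
  az aks show --name "$AKS_NAME" --resource-group "$AKS_RG" \
    --query provisioningState --output tsv
}

# Poll until the state is terminal.
# Returns 0 on Succeeded, 1 on Failed or Canceled, 2 if polling timed out.
wait_for_upgrade() {
  i=0
  while [ "$i" -lt "${MAX_POLLS:-60}" ]; do
    state="$(get_state)"
    echo "provisioningState: $state"
    case "$state" in
      Succeeded) return 0 ;;
      Failed|Canceled) return 1 ;;
    esac
    sleep "${POLL_SECONDS:-30}"
    i=$((i + 1))
  done
  return 2
}
```

With AKS_NAME and AKS_RG exported, calling wait_for_upgrade prints one line per poll and returns a status a wrapper script can branch on.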
After the upgrade reconciles successfully, run kubectl get pods -n kube-system one more time. All system pods should show Running with a READY value of 1/1. If they do, your cluster is healthy again.
Advanced Troubleshooting
When the standard fixes don't get you there, these deeper diagnostic approaches cover the scenarios that don't have an obvious surface-level symptom.
Diagnosing Node-Level Resource Pressure
If pods are scheduling onto nodes but getting evicted almost immediately, the problem isn't scheduling, it's resource pressure. Run kubectl describe nodes and look for the Conditions section on each node. Conditions like MemoryPressure: True, DiskPressure: True, or PIDPressure: True tell you exactly what the kubelet is reacting to. After a Kubernetes 1.25 upgrade, MemoryPressure appearing on nodes that were previously healthy almost always points back to the cgroup v2 issue: a process is consuming far more memory than its limit suggests it should.
Using kubectl top to Spot Memory Runaway
The metrics-server needs to be running for this to work, and if it's also stuck in Pending, fix that first. Once it's up:
# Node-level resource consumption
kubectl top nodes
# Pod-level resource consumption across all namespaces
kubectl top pods --all-namespaces --sort-by=memory
If a pod's memory usage is dramatically higher than it was before the upgrade and keeps climbing toward its configured limit, suspect the cgroup v2 issue. The runtime thinks it has more headroom than it does, and the pod is about to be OOM-killed.
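To pick the heavy consumers out of a long listing, the output can be filtered against a threshold. This is a sketch under assumptions: the function name pods_over_mi is hypothetical, and it expects the --all-namespaces layout (NAMESPACE, NAME, CPU, MEMORY) with memory reported in Mi. Rows in any other unit are skipped.

```shell
# Reads `kubectl top pods --all-namespaces` output on stdin and prints
# "namespace/pod memory" for each pod using more than $1 Mi of memory.
pods_over_mi() {
  awk -v limit="$1" 'NR > 1 && $4 ~ /Mi$/ {
    mem = $4
    sub(/Mi$/, "", mem)              # strip the unit suffix
    if (mem + 0 > limit + 0) print $1 "/" $2, $4
  }'
}
```

For example, kubectl top pods --all-namespaces | pods_over_mi 512 lists pods currently above 512 Mi. The threshold would normally be chosen to match the memory limit of the workloads being checked.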
Enabling Kubernetes RBAC on Existing Clusters
This is a question that comes up constantly: "I created a cluster without RBAC enabled, can I turn it on now?" The honest answer is no. Enabling Kubernetes RBAC on an existing AKS cluster is not supported. It can only be configured at cluster creation time. If you need RBAC and your cluster doesn't have it, the path forward is to create a new cluster with RBAC enabled (it's on by default when using the Azure CLI or any API version after 2020-03-01), migrate your workloads to it, and decommission the old one. There is no in-place upgrade path for this setting.
Investigating Upgrade History in the Azure Activity Log
When kubectl get events has rolled off and you need to understand what happened during a failed upgrade, go to the Azure portal, navigate to your AKS resource, select Activity log from the left menu, and filter by Operation: Upgrade Kubernetes Cluster. Each operation entry shows the start time, end time, status, and the initiating identity. This is also where you'd look if you suspect an automated process (like Azure Policy or a CI/CD pipeline) triggered an unintended upgrade.
AKS Cluster Moves Between Subscriptions
If someone on your team suggests moving an AKS cluster to a different subscription, either to reorganize billing or because a tenant migration is happening, stop them before they try. AKS does not support moving clusters across subscriptions or across Azure AD tenants. The cluster identity loses its permissions in the move, and the cluster stops functioning. There is no recovery path from this other than recreating the cluster. If you've already done this and your cluster is broken, you're looking at a full rebuild. This is documented behavior, not a bug.
If you've worked through every step here and your cluster is still in a failed state after multiple reconciliation attempts, it's time to escalate. Open a support ticket through Microsoft Support at the Severity B level (business impact without complete outage) or Severity A if production workloads are completely down. Before opening the ticket, collect: your cluster name and resource group, the output of az aks show -n <name> -g <rg>, the output of kubectl get events -n kube-system, and the Azure Activity Log entries from the period of failure. Support engineers can access internal cluster diagnostics that aren't exposed through the public API, which is often what's needed to resolve persistent upgrade failures.
Prevention & Best Practices
Most of the Azure Kubernetes Service troubleshooting scenarios in this guide are preventable. The issues that bring teams to their knees at 2 AM are almost always ones where a little pre-upgrade validation would have caught the problem before it became a production incident.
Before any AKS cluster upgrade, run a pre-flight check on your PodDisruptionBudgets across all namespaces. The key question is: does each PDB currently allow at least one disruption? If the answer is no for any workload, either increase your replica count or adjust the PDB before the upgrade window. A PDB with minAvailable: 2 and only 2 replicas running is a guaranteed upgrade blocker every single time.
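The arithmetic behind that blocker is simple enough to sketch. For a PDB that uses an integer minAvailable, the allowed disruptions are the number of healthy pods minus minAvailable, floored at zero. The function name allowed_disruptions is hypothetical, and the sketch does not handle percentage values or maxUnavailable.

```shell
# Computes allowed disruptions for a PDB with an integer minAvailable.
# $1 = number of currently healthy pods, $2 = minAvailable
allowed_disruptions() {
  healthy="$1"
  min_available="$2"
  n=$((healthy - min_available))
  if [ "$n" -lt 0 ]; then
    n=0                      # the budget never goes negative
  fi
  echo "$n"
}
```

With 2 healthy replicas and minAvailable: 2 the result is 0, so every eviction is refused. Adding a third replica raises it to 1, which is enough for a node drain to proceed one pod at a time.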
For the cgroup v2 memory issue: when you upgrade to Kubernetes 1.25 or later, audit your container base images first. The upgrade to Ubuntu 22.04 is non-negotiable because AKS made this change at the platform level. What is in your control is whether your application runtimes are ready for it. Scan your Dockerfiles and base images for Java versions older than 8u372 or 11.0.16, .NET versions older than 5.0, and any third-party agents you're running as sidecars. Update those before the upgrade, not after.
On resource limits and requests: if you see a higher eviction rate after a cgroup v2-related upgrade, the immediate lever to pull is increasing your memory limits and requests. This gives the scheduler more accurate information and reduces the likelihood of eviction events while you work on a proper runtime fix. Don't skip setting resource requests and limits entirely: a pod with no memory limit set can consume unbounded memory and affect every other workload on the same node.
Finally, test upgrades in a non-production cluster first. This seems obvious but a surprisingly large number of teams run a single AKS cluster and upgrade directly in production. Even a small dev/test cluster running a subset of your workloads will catch PDB conflicts and runtime incompatibilities before they hit your users.
- Run kubectl get pdb --all-namespaces before every planned upgrade and verify ALLOWED DISRUPTIONS is at least 1 for every PDB
- Audit all container base images for Java <8u372, .NET <5.0, and third-party monitoring agents before upgrading to Kubernetes 1.25+
- Enable the AKS cluster auto-upgrade feature on a non-production cluster to catch breaking changes before they reach production
- Set up Azure Monitor alerts on node MemoryPressure and NotReady conditions so you know about node pool problems before users do
Frequently Asked Questions
Where do I find information about debugging Kubernetes problems on AKS?
The best starting point is the official Kubernetes troubleshooting guide, which covers pods, nodes, and cluster-level issues. Microsoft also maintains an AKS-specific troubleshooting guide written by their own engineers that goes deeper on Azure-specific scenarios like node pool failures, AAD integration problems, and storage issues. For issues specific to your cluster, kubectl describe pod <name> -n <namespace> and kubectl get events will surface the most actionable diagnostic information. When those don't give you enough, the Azure Activity Log in the portal retains cluster operation history and is invaluable for understanding what happened during a failed upgrade.
Can I move my AKS cluster to a different subscription or Azure tenant?
No, and this is a hard no, not a soft limitation. Moving an AKS cluster to a different subscription or Azure AD tenant breaks the cluster's identity permissions, and there is no way to restore them after the move. The cluster simply stops working. Microsoft's official position is that AKS does not support cross-subscription or cross-tenant moves. If you need to consolidate billing or reorganize subscriptions, the correct approach is to create a new cluster in the target subscription, migrate your workloads to it, and decommission the original. Plan a migration window and treat it like a new deployment, not a simple move.
Why are all my pods stuck in Pending after a cluster upgrade?
When every pod in kube-system shows Pending simultaneously, it almost always means there are no schedulable nodes available in the cluster. This typically happens when an upgrade left a node pool in a failed provisioning state. Run kubectl describe pod <any-pending-pod> -n kube-system and look at the Events section. If you see FailedScheduling: no nodes available to schedule pods, that confirms the diagnosis. The fix is to re-trigger the upgrade with az aks upgrade targeting the same version, which forces AKS to attempt reconciliation of the failed node pool state.
Why are my pods getting OOMKilled after upgrading to Kubernetes 1.25?
This is the cgroup v2 incompatibility issue. Kubernetes 1.25 on AKS introduced Ubuntu 22.04 as the node OS, which uses cgroup version 2 by default. Older Java runtimes (anything before JDK 8u372 or 11.0.16), .NET versions before 5.0, and some other runtimes don't correctly read container memory limits from the cgroup v2 API. They see the entire host's memory instead of their container limit. The process allocates memory as if it has 16 GB available when its container limit is 512 MB, gets killed when it hits that limit, and shows as OOMKilled. Upgrade your runtime to a cgroup v2-compatible version. As a temporary measure, increase your pod's memory limits and requests to reduce the eviction rate while you work on the runtime update.
What is a PodDisruptionBudget and why is it blocking my AKS upgrade?
A PodDisruptionBudget (PDB) is a Kubernetes resource that limits how many pods of a given workload can be simultaneously unavailable during voluntary disruptions, like AKS draining a node during an upgrade. The intent is to protect workload availability: if you have 3 replicas and set minAvailable: 2, at most 1 pod can be evicted at a time. The problem arises when Allowed Disruptions calculates to zero, meaning the current number of running pods is already at or below minAvailable. When AKS tries to drain the node, the eviction API returns HTTP 429 Too Many Requests, and the upgrade halts. You'll see this in events as "Eviction blocked by Too Many Requests (usually a pdb)". Fix it by increasing your replica count, adjusting the PDB's minAvailable setting, or temporarily deleting the PDB (with a backup) before retrying the upgrade.
Can I enable Kubernetes RBAC on an existing AKS cluster that was created without it?
No. Kubernetes RBAC is a cluster-creation-time setting in AKS and cannot be enabled on an existing cluster after the fact. If your cluster was created without RBAC and you need it, which, given current security requirements, you almost certainly do, the only path is to create a new cluster with RBAC enabled and migrate your workloads. When using the Azure CLI or any API version newer than 2020-03-01, RBAC is enabled by default, so new clusters you create today will have it unless you explicitly opt out. If you're doing this migration, also consider enabling Azure AD integration and Azure RBAC for Kubernetes at the same time, since you're already standing up a new cluster.