How to Troubleshoot Azure Kubernetes Service (AKS)
Why This Is Happening
I've spent years staring at AKS cluster dashboards at 2 AM, watching pods flip between CrashLoopBackOff and Pending while an on-call phone buzzes nonstop. Azure Kubernetes Service troubleshooting is genuinely hard, not because the platform is poorly designed, but because Kubernetes itself is a distributed system with hundreds of moving parts, and AKS adds a cloud infrastructure layer on top of every single one of them.
Here's what most people don't realize: when your AKS workload breaks, the failure rarely has a single root cause. A pod stuck in ImagePullBackOff might look like a container registry problem, but it could actually be a networking issue caused by a misconfigured Network Security Group (NSG), a missing role assignment on your managed identity, or a node that's quietly hit its resource ceiling. Azure's error messages (things like "0/3 nodes are available: 3 Insufficient memory" or "back-off restarting failed container") tell you what happened, not why.
The most common root causes I see across AKS environments are:
- Node pool resource exhaustion: CPU, memory, or ephemeral storage limits hit silently, causing the scheduler to stop placing new pods
- RBAC and Managed Identity misconfigurations: the AKS kubelet or workload identity can't pull images from Azure Container Registry (ACR), read Key Vault secrets, or write to storage
- CNI networking failures: Azure CNI IP exhaustion in the subnet, Calico policy conflicts, or CoreDNS falling over under load
- Control plane API server throttling: especially on Standard tier clusters under heavy kubectl traffic or with poorly configured Horizontal Pod Autoscalers
- Node not ready states: typically from VM SKU quota limits, OS disk pressure, or a failed node image upgrade
- Persistent Volume Claim (PVC) binding failures: when the Azure Disk or Azure Files CSI driver can't provision storage due to quota, permissions, or zone mismatches
What makes AKS troubleshooting especially painful is that the Kubernetes control plane is Microsoft-managed. You can't SSH into the etcd nodes, you don't have direct access to the API server logs, and some failures happen entirely inside Azure's fabric before they ever surface in your cluster's event log. That's why knowing exactly which tools to reach for, and in what order, makes the difference between a 10-minute fix and a 4-hour war with Azure support.
This guide walks through the full diagnostic hierarchy I use in production.
The Quick Fix: Try This First
Before you go deep on any Azure Kubernetes Service troubleshooting path, run this triage sequence. I call it the "four-command health check." It takes about 60 seconds and tells you immediately whether you're dealing with a node problem, a pod problem, a networking problem, or something in the Azure control plane.
Open a terminal where you have kubectl configured against your cluster and run these in order:
# 1. Are your nodes actually ready?
kubectl get nodes -o wide
# 2. What's currently broken in the cluster?
kubectl get pods --all-namespaces | grep -v Running | grep -v Completed
# 3. What does Kubernetes think went wrong?
kubectl describe pod <pod-name> -n <namespace>
# 4. What did the container actually say before it died?
kubectl logs <pod-name> -n <namespace> --previous
Read the output of describe pod from the bottom up. The Events section at the very end is gold. Look for lines like "Failed to pull image", "FailedScheduling", "Readiness probe failed", or "OOMKilled"; each one points to a completely different fix path.
If nodes show NotReady, that's your priority. Jump straight to Step 1. If nodes are all Ready but pods are stuck in Pending, go to Step 2. If pods are running but crashing, Step 3 covers you. If everything looks fine in kubectl but your app is still unreachable, the problem is almost certainly in networking; Step 4 covers that.
For a quick AKS-specific health overview, you can also use the Azure CLI:
az aks show \
--resource-group <your-rg> \
--name <your-cluster-name> \
--query "{provisioningState:provisioningState, powerState:powerState, kubernetesVersion:kubernetesVersion}" \
--output table
If provisioningState shows anything other than Succeeded, the cluster itself is mid-operation or stuck. In that case, check the Azure Activity Log in the portal before doing anything else.
Tip: keep a ~/.kube/config alias for each cluster environment and make kubectl config get-contexts muscle memory. I've seen engineers spend 20 minutes troubleshooting the wrong cluster because they had the wrong context active. Always verify the context first with kubectl config current-context before any diagnostic session.
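The context check can be turned into a guard that a diagnostic script runs before anything else. This is a minimal sketch: the helper name require_context and the context name prod-aks in the usage line are hypothetical, and it assumes kubectl is on your PATH.

```shell
# Abort early if kubectl is pointed at a different cluster than intended.
# require_context is a hypothetical helper name, not part of kubectl.
require_context() {
  expected="$1"
  current="$(kubectl config current-context 2>/dev/null)"
  if [ "$current" != "$expected" ]; then
    echo "Active context is '${current:-none}', expected '$expected'" >&2
    return 1
  fi
  echo "Context OK: $current"
}
```

A typical use would be `require_context prod-aks && kubectl get nodes -o wide`, so the triage commands never run against the wrong cluster.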
Step 1: Recover Nodes in NotReady State
A node in NotReady status means the kubelet on that VM has stopped communicating with the API server. This is one of the most impactful AKS failures you'll face because the Kubernetes scheduler immediately stops placing new pods on that node, and existing pods will be evicted after the node.kubernetes.io/not-ready toleration timeout (default: 5 minutes).
First, get full detail on the affected node:
kubectl describe node <node-name>
Scroll to the Conditions section. Look for MemoryPressure, DiskPressure, or PIDPressure set to True. These are the most common reasons a node goes unresponsive. Also check the Events section for eviction warnings.
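If you would rather not scroll through the full describe output, the same Conditions data can be filtered down to the pressure conditions that are currently True. This is a sketch under the assumption that the node reports the standard condition types; node_pressure is a hypothetical helper name.

```shell
# Print only the pressure conditions that are currently True for a node.
# node_pressure is a hypothetical helper name.
node_pressure() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
    awk -F= '$1 ~ /Pressure$/ && $2 == "True" { print $1 }'
}
```

An empty result means no pressure condition is set, which points the investigation toward the kubelet or the underlying VM instead.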
Then check node resource utilization directly:
kubectl top nodes
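If kubectl top returns an error about the metrics API, check that the metrics-server pods in kube-system are running (kubectl get pods -n kube-system | grep metrics-server). AKS deploys metrics-server by default, so an error here usually means those pods are unhealthy rather than missing.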
For a node that's genuinely stuck, a cordon-and-drain is usually the right call:
# Stop new pods scheduling on this node
kubectl cordon <node-name>
# Safely evict all pods (grace period: 60 seconds)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --grace-period=60
After draining, check in the Azure portal whether the underlying VM is healthy: go to Resource Groups → [your-rg] → Virtual Machine Scale Sets → [node pool VMSS] → Instances. If the VM shows a failed health state, use the Reimage option to rebuild it from the node image. AKS will automatically re-join the reimaged node to the cluster.
If you see that your node pool has hit its VM quota limit (you'll see event code QuotaExceeded in the Azure Activity Log), you'll need to request a quota increase in the Azure portal under Subscriptions → [your-sub] → Usage + quotas before scaling will work.
Step 2: Fix Pods Stuck in Pending or ImagePullBackOff
Pods stuck in Pending haven't been scheduled to a node yet. This almost always means the cluster can't find a node that satisfies the pod's resource requests, node affinity rules, or taints. ImagePullBackOff means the pod was scheduled but the container runtime failed to pull the image.
For Pending, the describe output will tell you exactly what constraint is failing:
kubectl describe pod <pod-name> -n <namespace>
Look for messages like "0/3 nodes are available: 3 Insufficient cpu, 3 Insufficient memory." This means every node is too full. Either scale your node pool or reduce the pod's resource requests in its deployment spec. Check your current requests/limits:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources}'
For ImagePullBackOff on images hosted in Azure Container Registry, the most common cause is a broken link between AKS and ACR. Check the attachment:
az aks check-acr \
--resource-group <your-rg> \
--name <your-cluster> \
--acr <your-acr-name>.azurecr.io
If that returns errors, re-attach ACR to AKS. This grants the kubelet's managed identity the AcrPull role on your registry:
az aks update \
--resource-group <your-rg> \
--name <your-cluster> \
--attach-acr <acr-resource-id>
Once updated, delete the stuck pod (the ReplicaSet will recreate it) and watch it pull successfully. Also verify that your image tag actually exists in the registry. A tag typo produces the same error and trips up more engineers than you'd think.
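The tag check can be scripted so a typo is caught before you start debugging permissions. This is a rough sketch: acr_tag_exists is a hypothetical helper name, it assumes you are logged in with the Azure CLI, and it takes the registry name without the .azurecr.io suffix.

```shell
# Return success only if the exact tag exists in the ACR repository.
# acr_tag_exists is a hypothetical helper name.
acr_tag_exists() {
  registry="$1"; repository="$2"; tag="$3"
  az acr repository show-tags \
    --name "$registry" \
    --repository "$repository" \
    --output tsv | grep -Fxq -- "$tag"
}
```

The -x flag makes grep match whole lines only, so a partial tag such as v1.0 does not pass just because v1.0.0 exists.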
Step 3: Diagnose CrashLoopBackOff
CrashLoopBackOff means the container started, crashed, and Kubernetes is now applying exponential backoff before retrying. The backoff starts at 10 seconds and doubles each time, capping at 5 minutes. Your app is cycling but you're not seeing useful errors; here's how to actually read them.
Pull the logs from the most recent crash:
kubectl logs <pod-name> -n <namespace> --previous
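While you wait for the next restart, it helps to know how long the backoff described earlier will be. The arithmetic (start at 10 seconds, double each time, cap at 300) can be sketched as follows; backoff_schedule is a hypothetical helper name.

```shell
# Print the first N CrashLoopBackOff delays in seconds:
# start at 10, double after each crash, cap at 300 (5 minutes).
backoff_schedule() {
  count="$1"; delay=10; out=""; i=0
  while [ "$i" -lt "$count" ]; do
    out="${out:+$out }$delay"
    delay=$((delay * 2))
    if [ "$delay" -gt 300 ]; then
      delay=300
    fi
    i=$((i + 1))
  done
  echo "$out"
}

backoff_schedule 7   # prints: 10 20 40 80 160 300 300
```

In other words, the cap is reached on the sixth restart, and from then on each retry is five minutes apart.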
If the container crashed before producing output, check the exit code:
kubectl get pod <pod-name> -n <namespace> \
-o jsonpath='{.status.containerStatuses[0].lastState.terminated}'
Exit code 137 means the container was killed with SIGKILL (128 + 9). In most cases that is an OOM kill: the container exceeded its memory limit and the Linux OOM killer terminated it, and the terminated state will show reason OOMKilled. This is one of the most common Azure Kubernetes Service troubleshooting issues I handle. The fix: increase the memory limit in the pod spec, or identify the leak in your application.
Exit code 1 is an application error; read the logs. Exit code 139 is a segmentation fault (SIGSEGV, 128 + 11), usually a bug in native code or a corrupted binary.
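The exit codes above follow a general convention: values above 128 mean the process was killed by signal number (code - 128). A small lookup makes that explicit. This is an illustrative sketch, and explain_exit_code is a hypothetical helper name.

```shell
# Translate a container exit code into a likely cause.
# explain_exit_code is a hypothetical helper name.
explain_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly" ;;
    137) echo "killed by SIGKILL (128+9), often the OOM killer" ;;
    139) echo "killed by SIGSEGV (128+11), segmentation fault" ;;
    *)
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error, read the logs"
      fi ;;
  esac
}
```

For example, `explain_exit_code 143` reports signal 15 (SIGTERM), which usually means the container was asked to stop rather than crashing on its own.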
For OOMKilled, temporarily bump the limit to confirm that's the issue:
kubectl set resources deployment <deployment-name> \
-n <namespace> \
--limits=memory=512Mi \
--requests=memory=256Mi
Also check whether your liveness probe is too aggressive. If it fires before the app is actually ready, the kubelet restarts the container repeatedly, which produces the same CrashLoopBackOff symptom even when the app itself is perfectly healthy. (A failing readiness probe only removes the pod from Service endpoints; it does not restart the container.) Increase the initialDelaySeconds in your probe configuration if your app has a slow startup path.
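One way to relax the probe without editing the full manifest is a small strategic-merge patch that only touches the timing fields. Treat this as a sketch: the container name app, the 30 and 5 second values, and the file name probe-patch.yaml are placeholders, and it assumes the deployment already defines a livenessProbe on that container.

```shell
# Write a patch that only changes liveness probe timing.
# "app" must match the container name in your deployment (placeholder here).
cat > probe-patch.yaml <<'EOF'
spec:
  template:
    spec:
      containers:
        - name: app
          livenessProbe:
            initialDelaySeconds: 30
            timeoutSeconds: 5
EOF

# Apply it against the cluster (left commented so the file can be reviewed first):
# kubectl patch deployment <deployment-name> -n <namespace> --patch-file probe-patch.yaml
```

Because the patch merges by container name, the existing probe handler (HTTP, TCP, or exec) stays as it is and only the delays change.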
Step 4: Debug Networking and DNS Failures
Networking failures in AKS are some of the most frustrating to debug because the symptoms are indirect: your pods are running, but services can't reach each other, external traffic isn't routing in, or DNS lookups silently fail. I've seen entire production outages traced back to a single NSG rule that an infrastructure team added without realizing AKS depended on it.
Start by confirming pod-to-pod connectivity inside the cluster:
# Spin up a temporary debug pod
kubectl run debug-pod --image=busybox:1.36 --restart=Never -it --rm -- sh
# Inside the debug pod, test DNS resolution
nslookup kubernetes.default.svc.cluster.local
# Test connectivity to another service
wget -qO- http://<service-name>.<namespace>.svc.cluster.local
If DNS fails, check CoreDNS health:
kubectl get pods -n kube-system -l k8s-app=kube-dns
kubectl logs -n kube-system -l k8s-app=kube-dns --tail=50
CoreDNS pods in CrashLoopBackOff or OOMKilled state will break name resolution for every workload in the cluster. Restart them if needed:
kubectl rollout restart deployment/coredns -n kube-system
For Azure CNI subnet IP exhaustion, which blocks new pod scheduling entirely, count the IP configurations already allocated in the subnet and compare that number with the subnet's size:
az network vnet subnet show \
--resource-group <vnet-rg> \
--vnet-name <vnet-name> \
--name <subnet-name> \
--query "ipConfigurations | length(@)"
If your subnet is near its IP limit, you'll need to expand the CIDR or migrate to a subnet with more address space. For ingress issues, verify your load balancer's public IP is correctly bound and check NSG rules on the AKS node resource group (the one named MC_<rg>_<cluster>_<region>) aren't blocking traffic on ports 80 and 443.
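To judge whether a subnet is large enough, a quick capacity estimate helps. The sketch below assumes traditional Azure CNI, where every pod takes an address from the subnet and each node reserves one address for itself plus one per potential pod (its max-pods setting); Azure also reserves 5 addresses in every subnet. The function names are hypothetical, and overlay networking modes follow different rules.

```shell
# Usable addresses in an Azure subnet: Azure reserves 5 per subnet.
subnet_usable_ips() {
  prefix="$1"
  echo $(( (1 << (32 - prefix)) - 5 ))
}

# Addresses consumed with traditional Azure CNI: one per node plus one
# per potential pod on each node (the node pool's max-pods setting).
cni_ips_needed() {
  nodes="$1"; max_pods="$2"
  echo $(( nodes * (max_pods + 1) ))
}

subnet_usable_ips 24   # prints: 251
cni_ips_needed 10 30   # prints: 310
```

With these numbers, a /24 subnet cannot hold 10 nodes at 30 pods each, before even counting the extra surge node that an upgrade needs.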
Step 5: Resolve Storage and PVC Failures
Storage failures in AKS show up as PVCs stuck in Pending state, pods unable to start because a volume can't be mounted, or cryptic errors like "Multi-Attach error for volume" when a pod reschedules to a different node. These block deployments completely and require a specific diagnostic path.
Start by checking the PVC and its associated events:
kubectl describe pvc <pvc-name> -n <namespace>
Look at the Events section. Common messages and what they mean:
- "no volume plugin matched": the CSI driver isn't installed or the StorageClass references a missing provisioner
- "ProvisioningFailed: disk.skuName Standard_LRS is not supported": your node VM SKU doesn't support the disk type you requested
- "Topology: No topology key found": you're requesting a zonal disk but your node pool spans multiple zones without zone-aware scheduling
- "VolumeAttachError: Attach volume ... did not finish within allowed timeout": the Azure Disk attach operation timed out, often a transient Azure platform issue
Verify the CSI drivers are healthy:
kubectl get pods -n kube-system | grep csi
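You should see csi-azuredisk-node and csi-azurefile-node pods in Running state on every node. If you installed the drivers yourself, csi-azuredisk-controller and csi-azurefile-controller pods should be Running as well; with the AKS-managed drivers the controller components typically run in the Microsoft-managed control plane and may not appear in this list. If the disk controller is down, PVC provisioning stops entirely.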
For the "Multi-Attach error" on Azure Disk (ReadWriteOnce volumes), this happens when Kubernetes tries to attach the disk to a new node before it's fully detached from the old one. The safe fix:
# Delete the old pod that held the volume
kubectl delete pod <old-pod> -n <namespace> --grace-period=0 --force
# Wait 60-90 seconds for Azure to complete the disk detach
# Then delete and recreate the PVC consumer deployment
If you need shared storage across multiple pods, switch to Azure Files (ReadWriteMany) rather than fighting with Azure Disk's single-attach limitation. Update your StorageClass to use the file.csi.azure.com provisioner (the in-tree kubernetes.io/azure-file provisioner is deprecated in favor of the CSI driver) and ensure the storage account firewall allows access from your AKS subnet.
Advanced Troubleshooting
If the steps above haven't resolved your Azure Kubernetes Service troubleshooting problem, you're likely dealing with something at the infrastructure layer: Azure Resource Manager, the AKS resource provider, the networking fabric, or enterprise policy constraints. Here's where I go when standard kubectl diagnostics come up empty.
Reading AKS Diagnostic Logs in Azure Monitor
AKS ships control plane logs to Azure Monitor if you've enabled Diagnostic Settings. Go to Azure Portal → Kubernetes Service → [your cluster] → Monitoring → Diagnostic settings. Enable these categories: kube-apiserver, kube-controller-manager, kube-scheduler, kube-audit, and cluster-autoscaler.
Once logs are flowing to a Log Analytics workspace, query them with KQL:
// Find API server errors in the last 2 hours
AzureDiagnostics
| where Category == "kube-apiserver"
| where TimeGenerated > ago(2h)
| where log_s contains "error" or log_s contains "failed"
| project TimeGenerated, log_s
| order by TimeGenerated desc
| limit 50
// Check cluster autoscaler decisions
AzureDiagnostics
| where Category == "cluster-autoscaler"
| where TimeGenerated > ago(1h)
| project TimeGenerated, log_s
| order by TimeGenerated desc
Cluster Autoscaler Not Scaling Up
If nodes aren't being added despite pods stuck in Pending, the cluster autoscaler may be blocked by Azure quota, a VM SKU availability issue in that region, or a spot instance eviction rate that's triggered a cooldown period. Check the autoscaler status:
kubectl get configmap cluster-autoscaler-status \
-n kube-system \
-o yaml
The ScaleUp and ScaleDown sections will tell you if scaling is blocked and why.
Azure Policy and Admission Controller Blocks
On enterprise AKS clusters with Azure Policy enabled, pods can silently fail to deploy because an admission webhook rejects them. The namespace event stream is the closest Kubernetes equivalent to an event viewer:
kubectl get events -n <namespace> --sort-by='.lastTimestamp' | tail -20
Look for events with reason FailedCreate and messages containing admission webhook. These are Azure Policy denying your workload based on a policy like "no privileged containers" or "require resource limits." You'll need to work with your Azure Policy administrator to either update the policy or add an exemption for your namespace.
Workload Identity and Key Vault Integration Issues
If your pods use Azure Workload Identity to authenticate to Azure services, a broken federation will cause silent 401 errors inside the container. Validate the service account annotation:
kubectl get serviceaccount <sa-name> -n <namespace> \
-o jsonpath='{.metadata.annotations}'
It should contain azure.workload.identity/client-id. If missing, re-annotate it with the correct managed identity client ID from the Azure portal.
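The annotation check is easy to script so it can run across several service accounts. This is a minimal sketch with hypothetical helper names (wi_client_id, check_wi_annotation); the dots in the annotation key are escaped because kubectl's jsonpath otherwise treats them as path separators.

```shell
# Read the workload identity client ID annotation from a service account.
wi_client_id() {
  kubectl get serviceaccount "$1" -n "$2" \
    -o jsonpath='{.metadata.annotations.azure\.workload\.identity/client-id}'
}

# Fail loudly when the annotation is missing or empty.
check_wi_annotation() {
  client_id="$(wi_client_id "$1" "$2")"
  if [ -z "$client_id" ]; then
    echo "missing azure.workload.identity/client-id on $2/$1" >&2
    return 1
  fi
  echo "$client_id"
}
```

Call it as `check_wi_annotation <sa-name> <namespace>` and compare the printed value against the managed identity's client ID in the Azure portal.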
When to contact Azure support: your cluster's provisioningState has been stuck on Updating or Failed for more than 30 minutes; you're seeing API server availability drops in Azure Service Health (check Portal → Service Health → Health History); a node pool upgrade is stuck mid-way; or you're hitting platform-level Azure networking failures that affect the entire region. Before calling, gather your cluster resource ID, the exact time window of the failure, and the contents of your cluster's Activity Log; this cuts support call resolution time dramatically.
Prevention & Best Practices
The best Azure Kubernetes Service troubleshooting is the kind you never have to do. In my experience, clusters that stay healthy share a few common patterns, and clusters that become fires week after week are missing these same fundamentals.
Set up Azure Monitor for Containers (Container Insights) from day one. It gives you CPU and memory utilization per namespace, per pod, and per node, with alerting built in. Without it, you're blind to resource pressure until pods start crashing. Enable it from the portal under AKS → Monitoring → Insights → Enable.
Always define resource requests and limits on every container. Without requests, the scheduler can over-commit a node. Without limits, a single runaway process can take down every pod on the node by consuming all available memory. A safe starting template: set requests at 50% of what you expect the workload to use under normal load, and limits at 150%.
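The 50% / 150% starting template is simple arithmetic, and it can be wired straight into the flags that kubectl set resources expects. This is a sketch for memory only, using integer math in Mi; size_resources is a hypothetical helper name.

```shell
# Derive memory requests (50%) and limits (150%) from expected usage in Mi.
# size_resources is a hypothetical helper name.
size_resources() {
  expected_mi="$1"
  echo "requests=memory=$((expected_mi / 2))Mi limits=memory=$((expected_mi * 3 / 2))Mi"
}

size_resources 400   # prints: requests=memory=200Mi limits=memory=600Mi
```

The two values map onto the --requests and --limits flags of kubectl set resources; treat them as a starting point and adjust once you have real utilization data from Container Insights.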
Enable node auto-upgrade on your AKS cluster or establish a regular maintenance window. Unpatched node images accumulate known CVEs and OS-level bugs that silently contribute to instability. Use the --auto-upgrade-channel flag set to patch for production clusters; it keeps nodes updated within the same minor version without unexpected control plane version jumps.
Maintain a separate system node pool dedicated to kube-system workloads (CoreDNS, the metrics server, the CSI drivers). Tainting this node pool with CriticalAddonsOnly=true:NoSchedule prevents application workloads from competing with platform components for resources. I've seen CoreDNS get evicted by a poorly configured batch job, and everything in the cluster broke instantly.
Finally, run regular chaos and drain tests on non-production clusters. Drain one node per month and observe how your workloads recover. It consistently surfaces missing PodDisruptionBudgets, pods with no replicas, and services that fail to route traffic during node transitions, all problems you want to find in testing, not production.
- Enable Container Insights and set alerts for node CPU > 80% and memory > 85% sustained for 5 minutes
- Configure PodDisruptionBudgets for every stateful workload so drains don't take down your whole deployment
- Use kubectl get events --watch during deployments to catch scheduling and image pull failures in real time
- Run az aks get-upgrades monthly and plan Kubernetes version upgrades before your current version hits end-of-support (Microsoft supports N-2 minor versions)
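For the PodDisruptionBudget item, kubectl can create a basic budget without a manifest. The sketch below assumes your pods carry an app=<name> label and that keeping at least one replica available is the right floor; both are assumptions to adapt, and protect_with_pdb is a hypothetical helper name.

```shell
# Create a PodDisruptionBudget that keeps at least one pod of an app running
# during voluntary disruptions such as node drains.
# protect_with_pdb is a hypothetical helper name.
protect_with_pdb() {
  app="$1"; namespace="$2"
  kubectl create poddisruptionbudget "${app}-pdb" \
    --namespace "$namespace" \
    --selector "app=${app}" \
    --min-available 1
}
```

Keep in mind that a minimum of 1 on a single-replica workload blocks drains entirely, so pair the budget with at least two replicas.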
Frequently Asked Questions
Why does my AKS pod keep restarting even though the app looks fine in the logs?
The most likely culprit is a liveness probe that's misconfigured. If your probe hits an endpoint that takes longer to respond than the timeoutSeconds setting, Kubernetes marks the probe as failed and restarts the container, even though the application is running perfectly. Check your probe configuration with kubectl describe pod <pod-name> and look at the Liveness line. Try increasing timeoutSeconds from the default (1 second) to something like 5 seconds, and if the endpoint is occasionally slow, raise failureThreshold above its default of 3. Also consider whether your app has a slow cold-start path that needs a higher initialDelaySeconds.
How do I fix "Error from server: etcdserver: request timed out" in AKS?
This error means the AKS API server is under heavy load and etcd (the Kubernetes state store) can't process requests fast enough. Since etcd is Microsoft-managed in AKS, you can't tune it directly. What you can control: reduce the frequency of kubectl polling in any scripts or CI/CD pipelines hitting the cluster, check whether a misbehaving HorizontalPodAutoscaler is generating thousands of patch requests per minute, and if you're on the Free tier, consider upgrading to Standard tier which gives you a higher API server SLA. Check the Azure Service Health portal first to rule out a regional incident.
My AKS upgrade is stuck at "Upgrading". What do I do?
AKS upgrades proceed node by node, cordoning and draining each one before moving to the next. If the upgrade appears stuck, it's usually because a node can't be drained, typically because of a PodDisruptionBudget blocking eviction, a pod with no graceful shutdown handler that hangs during termination, or a DaemonSet pod that can't be evicted. Check the Activity Log in the portal for error details and run kubectl get events --all-namespaces --sort-by='.lastTimestamp' to see what's blocking. If a specific pod is refusing to terminate, you can force-delete it with kubectl delete pod <name> --grace-period=0 --force, but confirm it's safe to do so before pulling that trigger.
Why are my AKS pods getting evicted randomly?
Random evictions almost always mean node-level resource pressure. When a node's available memory drops below the memory.available eviction threshold (default: 100Mi), the kubelet starts evicting pods in order of their QoS class: BestEffort first, then Burstable, and finally Guaranteed. Run kubectl describe node <node-name> and check the Conditions section for MemoryPressure: True. The fix is either to reduce memory usage per pod (tighten limits), increase node pool VM size, scale out the node pool, or set proper resource requests so pods get classified as Guaranteed QoS (requests equal limits).
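The QoS rules are easier to reason about with a small classifier. This sketch covers a single-container pod only and compares the values as plain strings (so 500m and 0.5 would not count as equal); qos_class is a hypothetical helper name. The class Kubernetes actually assigned is recorded on the pod in .status.qosClass.

```shell
# Classify a single-container pod from its CPU and memory settings.
# Pass empty strings for values that are not set.
qos_class() {
  req_cpu="$1"; lim_cpu="$2"; req_mem="$3"; lim_mem="$4"
  if [ -z "$req_cpu$lim_cpu$req_mem$lim_mem" ]; then
    echo "BestEffort"
  elif [ -n "$lim_cpu" ] && [ -n "$lim_mem" ] &&
       [ "${req_cpu:-$lim_cpu}" = "$lim_cpu" ] &&
       [ "${req_mem:-$lim_mem}" = "$lim_mem" ]; then
    # An unset request defaults to the limit, so it still counts as equal.
    echo "Guaranteed"
  else
    echo "Burstable"
  fi
}
```

For example, `qos_class 250m 500m 256Mi 256Mi` yields Burstable because the CPU request is below its limit, which places the pod ahead of Guaranteed pods in the eviction order.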
How do I connect to an AKS pod for live debugging?
If the pod is running, use kubectl exec -it <pod-name> -n <namespace> -- /bin/sh (or /bin/bash if bash is available in the image). If the container's image doesn't have a shell (common in distroless images), use the kubectl debug command to attach a debug sidecar: kubectl debug <pod-name> -it --image=busybox:1.36 --share-processes --copy-to=debug-copy. This creates a copy of the pod with a debug container sharing the same process namespace, so you can inspect the main container's files and network stack without rebuilding your image.
AKS cluster autoscaler isn't scaling down even when nodes are empty. How do I fix it?
The autoscaler won't scale down a node if any pod on it blocks eviction: pods with local storage, pods in the kube-system namespace without a PodDisruptionBudget, pods that don't belong to a controller (standalone pods), or pods that violate their PodDisruptionBudget. Check the autoscaler's reasoning with kubectl get configmap cluster-autoscaler-status -n kube-system -o yaml; the ScaleDown section lists exactly which pods are blocking each node's removal. Also verify the scale-down delay hasn't been set unusually high via the --scale-down-delay-after-add flag in your autoscaler configuration.