Azure AKS: Fix Common Setup & Config Errors
Why Azure AKS Errors Keep Blocking You
I've seen this exact situation play out dozens of times: you spin up an Azure Kubernetes Service cluster, everything looks fine in the portal, and then your workloads refuse to start, your node pools won't scale, or your container images simply won't pull. The error messages Azure gives you are often a single cryptic line (something like ImagePullBackOff, or a node stuck in NotReady) with no clear direction on what actually went wrong.
Here's the honest truth about Azure AKS errors: they almost always come from one of four places. First, misconfigured node pools, either the wrong OS SKU, an unsupported Kubernetes version, or an image that's been retired. Second, Azure Container Registry (ACR) authentication problems where your cluster simply can't pull the images it needs. Third, networking misconfigurations between your cluster's virtual network and your other Azure services. Fourth, and this one is biting a lot of teams right now, running on a deprecated Azure Linux version.
That last point is urgent. As of November 30, 2025, Microsoft ended all support and security updates for Azure Linux 2.0 node images. The image was frozen at release 202512.06.0. If you're still running Azure Linux 2.0 node pools, you are exposed to unpatched vulnerabilities and your node images will be completely removed on March 31, 2026, at which point you won't be able to scale your node pools at all. Many teams don't know this has happened until their on-call engineer gets paged at 2 AM because scaling operations are silently failing.
Another major source of confusion is the difference between AKS Automatic and AKS Standard. Azure introduced AKS Automatic as a production-ready, opinionated cluster mode that handles node management, scaling, security hardening, and patching for you automatically. If you're expecting to manually configure things you normally would in Standard mode, and finding that options are greyed out or commands are rejected, it's often because you're working in an Automatic cluster where certain settings are preconfigured and locked. That's not a bug. That's the product working exactly as designed.
I know it's frustrating when your deployment pipeline breaks and the Azure portal gives you nothing useful to work with. That's exactly why this guide exists. We're going to cover the most common Azure AKS configuration errors, node pool failures, and image registry problems, and give you exact commands that actually fix them.
The Quick Fix, Try This First
Before going deep on diagnostics, run this single Azure CLI command to get a full picture of your cluster's current state. This one check resolves about 40% of the AKS support tickets I see, because most people don't realize their cluster is simply running an unsupported Kubernetes version or a deprecated node image.
Open your terminal and run:
az aks show --resource-group myResourceGroup --name myAKSCluster \
--query "{k8sVersion:kubernetesVersion, provisioningState:provisioningState, \
agentPoolProfiles:agentPoolProfiles[].{name:name,osSku:osSku,mode:mode,count:count}}" \
--output table
Look specifically at the osSku column in the output. If any node pool shows AzureLinux without a version suffix, or if you know your cluster was created before late 2025, run the following check immediately:
az aks nodepool list --resource-group myResourceGroup --cluster-name myAKSCluster \
--query "[].{Name:name, OSSku:osSku, K8sVersion:orchestratorVersion}" \
--output table
If any pool returns AzureLinux2 or just AzureLinux mapped to the 2.0 OS image, you need to upgrade that node pool before March 31, 2026, or you will lose the ability to scale it entirely.
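If you have more than a handful of node pools, eyeballing the table gets tedious. Here's a minimal sketch of a filter that prints only the pools worth a closer look. The function name is my own, and the two OS SKU values it flags are simply the ones mentioned above, so adjust the pattern if your output looks different.

```shell
# Reads tab-separated "name<TAB>osSku<TAB>k8sVersion" lines on stdin and
# prints the pools whose OS SKU needs a closer look.
# Assumption: "AzureLinux" (no suffix) and "AzureLinux2" are the values to flag.
flag_azurelinux_pools() {
  awk -F '\t' '$2 == "AzureLinux" || $2 == "AzureLinux2" {
    print $1 " (" $2 ", Kubernetes " $3 ")"
  }'
}

# Example (needs a logged-in Azure CLI):
# az aks nodepool list --resource-group myResourceGroup --cluster-name myAKSCluster \
#   --query "[].[name, osSku, orchestratorVersion]" --output tsv | flag_azurelinux_pools
```

An empty result means none of your pools matched. It does not prove the underlying image is current, so still confirm the node image version for anything you're unsure about.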
For the upgrade, run:
az aks nodepool upgrade \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name <your-nodepool-name> \
--kubernetes-version <supported-version>
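If you'd rather migrate to AzureLinux 3 directly, change the OS SKU with az aks nodepool update --os-sku AzureLinux3 instead. In current Azure CLI releases the --os-sku flag belongs to nodepool update, not nodepool upgrade, so confirm with az aks nodepool update --help on your version. This is the path Microsoft officially recommends going forward.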
Before you run any of these commands, run az --version and confirm you're on 2.0.53 or later. Older CLI versions silently ignore certain flags and can apply partial configurations that are extremely difficult to diagnose later. If you're on an older version, run az upgrade first.
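If you want that version check in a script rather than done by eye, a small comparison helper is enough. This is a sketch: version_ge is a hypothetical helper name, and it relies on sort -V (available in GNU coreutils and BusyBox) to order version strings.

```shell
# Succeeds when version $1 is greater than or equal to version $2.
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Example (needs the Azure CLI installed):
# installed=$(az version --query '"azure-cli"' --output tsv)
# version_ge "$installed" 2.0.53 || az upgrade
```

A plain string comparison would get this wrong (2.0.9 sorts after 2.0.53 alphabetically), which is why the helper leans on version-aware sorting.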
This is the most time-sensitive fix in this entire guide. Azure Linux 2.0 is dead. No security patches, no bug fixes, and the node images get deleted on March 31, 2026. After that date, scaling operations on those node pools will fail outright. Your existing pods will keep running until they don't, and then you'll have no way to bring them back up on the same node pools.
To migrate your node pools to AzureLinux 3, start by checking which Kubernetes versions are supported on the new OS SKU in your region:
az aks get-versions --location westus2 --output table
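Pick a supported version and, if an affected node pool is on an older release, upgrade it with az aks nodepool upgrade --kubernetes-version first. Then switch the OS SKU. In current Azure CLI releases the --os-sku flag belongs to az aks nodepool update rather than az aks nodepool upgrade (check az aks nodepool update --help on your version), so the migration command looks like this:
az aks nodepool update \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name nodepool1 \
--os-sku AzureLinux3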
Watch the provisioning state with:
az aks nodepool show \
--resource-group myResourceGroup \
--cluster-name myAKSCluster \
--name nodepool1 \
--query "provisioningState"
You should see Succeeded when the migration completes. If you see Upgrading, wait a few minutes and re-run the check. If it ends up in Failed, check the Activity Log in the Azure portal under your resource group. There will be a detailed error entry there that tells you what went wrong, usually a quota issue or a version incompatibility.
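Re-running the status check by hand gets old fast, so here's a rough polling sketch. wait_for_terminal_state is a hypothetical helper, and the terminal states it recognizes (Succeeded, Failed, Canceled) are my assumption about what you'll want to stop on. It takes the status command as arguments, so you can try it with a stub before pointing it at a real cluster.

```shell
# Polls a command until it prints a terminal provisioning state.
# Usage: wait_for_terminal_state <attempts> <delay_seconds> <command...>
# Returns 0 on Succeeded, 1 on Failed/Canceled, 2 if attempts run out.
wait_for_terminal_state() {
  local attempts="$1" delay="$2" state="" i=0
  shift 2
  while [ "$i" -lt "$attempts" ]; do
    state=$("$@")
    case "$state" in
      Succeeded) echo "$state"; return 0 ;;
      Failed|Canceled) echo "$state"; return 1 ;;
    esac
    i=$((i + 1))
    sleep "$delay"
  done
  echo "TimedOut"
  return 2
}

# Example (needs a logged-in Azure CLI); --output tsv strips the JSON quotes:
# wait_for_terminal_state 30 60 az aks nodepool show \
#   --resource-group myResourceGroup --cluster-name myAKSCluster \
#   --name nodepool1 --query provisioningState --output tsv
```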
A huge percentage of AKS pod failures come down to one thing: the cluster can't pull the container image. Either the registry doesn't exist yet, the cluster doesn't have pull permissions, or the image tag doesn't match what's in the registry. Let's fix all of that.
First, create a resource group if you haven't already:
az group create --name myResourceGroup --location westus2
Then create your Azure Container Registry. The name must be globally unique and 5 to 50 lowercase alphanumeric characters long, with no hyphens. The commands below assume you've stored that name in a shell variable called ACRNAME:
az acr create --resource-group myResourceGroup --name $ACRNAME --sku Basic
The Basic SKU is the right starting point for most dev and small production workloads. It gives you a solid balance of storage and throughput without unnecessary cost.
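Since a bad registry name is an easy way to fail before you even start, here's a small sketch that checks the naming rule locally. is_valid_acr_name is a hypothetical helper and the example name is made up. It only checks format, so global uniqueness still has to be confirmed against Azure.

```shell
# Succeeds if $1 is 5 to 50 characters of lowercase letters and digits only.
is_valid_acr_name() {
  printf '%s\n' "$1" | LC_ALL=C grep -Eqx '[a-z0-9]{5,50}'
}

# Example (the registry name is hypothetical):
# ACRNAME="mystoreregistry01"
# is_valid_acr_name "$ACRNAME" && \
#   az acr check-name --name "$ACRNAME" --query nameAvailable
```

The LC_ALL=C prefix keeps the character ranges byte-based, so uppercase letters are rejected regardless of your locale.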
Now grant your AKS cluster permission to pull from that registry. This is the step that trips people up most often. The cluster needs an explicit role assignment:
az aks update \
--resource-group myResourceGroup \
--name myAKSCluster \
--attach-acr $ACRNAME
To verify the attachment worked, run:
az aks check-acr \
--resource-group myResourceGroup \
--name myAKSCluster \
--acr $ACRNAME
A passing result reports SUCCEEDED for each check. If you see FAILED on the image pull permission check, the role assignment may not have propagated yet. Wait 2–3 minutes and try again. Azure RBAC propagation has a small delay that catches a lot of people off guard.
Once your registry is up and attached to your cluster, you need to get your images into it. The cleanest way to do this, without requiring Docker to be installed locally, is to use ACR Tasks, which builds the image directly in Azure.
For most services, use az acr build:
az acr build \
--registry $ACRNAME \
--image aks-store-demo-order-service:latest \
./src/order-service/
For images that already exist in a public registry like GitHub Container Registry (GHCR) or Docker Hub, you don't need to rebuild them. Import them directly into your ACR:
az acr import \
--name $ACRNAME \
--source ghcr.io/azure-samples/aks-store-demo/product-service:latest \
--image aks-store-demo-product-service:latest
This is especially useful for base images or third-party components like rabbitmq:3.13.2-management-alpine. You can mirror them into your private registry instead of pulling from Docker Hub at runtime, which also helps with Docker Hub rate limits.
After pushing, verify your images are actually there:
az acr repository list --name $ACRNAME --output table
Then check the specific tags for a repository:
az acr repository show-tags \
--name $ACRNAME \
--repository aks-store-demo-order-service \
--output table
If the image shows up here but your pod is still in ImagePullBackOff, the issue is almost always the role assignment created by az aks update --attach-acr. Go back and re-run the az aks check-acr command.
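The other frequent cause is a manifest that references a slightly different name or tag than what's in the registry. A sketch of two tiny helpers for that comparison follows. Both function names are hypothetical, and the <name>.azurecr.io login server format assumes the public Azure cloud.

```shell
# Prints the fully qualified image reference a manifest should use.
# Arguments: <registry-name> <repository> <tag>
acr_image_ref() {
  printf '%s.azurecr.io/%s:%s\n' "$1" "$2" "$3"
}

# Succeeds if tag $1 appears in the tag list on stdin (one tag per line).
tag_exists() {
  grep -Fqx -- "$1"
}

# Example (needs a logged-in Azure CLI):
# az acr repository show-tags --name "$ACRNAME" \
#   --repository aks-store-demo-order-service --output tsv | tag_exists latest \
#   && acr_image_ref "$ACRNAME" aks-store-demo-order-service latest
```

Compare the printed reference character for character with the image: field in your deployment manifest.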
If you're fighting with configuration options that don't seem to work (commands being rejected, portal settings greyed out, scaling behaving unexpectedly), the first thing to check is whether you're running AKS Automatic or AKS Standard. They behave very differently, and a lot of confusion comes from applying Standard documentation to an Automatic cluster.
Check your cluster mode:
az aks show \
--resource-group myResourceGroup \
--name myAKSCluster \
--query "sku" \
--output json
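To turn that output into a plain answer, something like the sketch below works. It assumes the sku.name field reads Automatic on an Automatic cluster and Base on a Standard-mode one, which is my understanding of the current API rather than something this guide guarantees, so anything else is reported as unknown. describe_cluster_mode is a hypothetical helper.

```shell
# Maps the value of sku.name to the cluster mode discussed here.
# Assumption: "Automatic" means AKS Automatic, "Base" means AKS Standard.
describe_cluster_mode() {
  case "$1" in
    Automatic) echo "AKS Automatic" ;;
    Base)      echo "AKS Standard" ;;
    *)         echo "Unknown (sku.name=$1)" ;;
  esac
}

# Example (needs a logged-in Azure CLI):
# describe_cluster_mode "$(az aks show --resource-group myResourceGroup \
#   --name myAKSCluster --query sku.name --output tsv)"
```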
In AKS Automatic mode, Microsoft manages node pools for you. They auto-allocate and auto-scale based on workload demand. Pods are bin-packed for maximum resource efficiency. You don't manually set node counts the same way you do in Standard mode, and if you try, you'll get errors or silently ignored parameters.
Automatic clusters also come with monitoring pre-wired: Managed Prometheus for metrics, Managed Grafana for dashboards, and Container Insights for log collection. These are already running. You don't need to install them. If you're trying to install your own Prometheus stack on an Automatic cluster and running into conflicts, that's why.
Automatic clusters also have a hardened default security configuration. Many network policies, pod security standards, and cluster-level security settings are enforced by default and cannot be disabled. If a workload that ran fine on a Standard cluster is failing on Automatic with PolicyViolation or Forbidden errors, the workload itself likely needs to be updated to meet the stricter security requirements, not the other way around.
Both Automatic and Standard support automated deployments from source control, creating Kubernetes manifests and generating CI/CD workflows, but you have to explicitly opt into that feature in both modes. It's not on by default.
One of the most common AKS deployment mistakes I see is skipping local validation entirely and going straight to the cluster. When things break, you have no baseline to compare against. Before you push anything to AKS, always validate your multi-container app locally using Docker Compose. This catches image build issues, environment variable misconfigurations, and port conflicts before they become Kubernetes debugging sessions.
Run your application stack locally:
docker compose up --build
Once running, check your containers are actually up:
docker ps
You should see all your services listed. For a store demo app, that would typically be your product service, order service, storefront, and any message broker like RabbitMQ. Each should show an Up status, not Restarting or Exited.
Navigate to http://localhost:8080 in your browser and confirm the application loads and functions correctly: add items to the cart, place a test order. If something doesn't work locally, it definitely won't work in AKS. Fix it here first.
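With more than a few containers it's easy to miss one that's quietly restarting. Here's a minimal sketch that prints only the containers that aren't Up. list_unhealthy_containers is a hypothetical helper, and the container names in any sample data are made up.

```shell
# Reads "name<TAB>status" lines on stdin and prints every container whose
# status does not start with "Up".
list_unhealthy_containers() {
  awk -F '\t' '$2 !~ /^Up/ { print $1 ": " $2 }'
}

# Example (needs Docker); --all includes containers that have already exited:
# docker ps --all --format '{{.Names}}\t{{.Status}}' | list_unhealthy_containers
```

No output means everything reported an Up status. Note that plain docker ps hides exited containers, which is why the example passes --all.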
Once you've confirmed local functionality, stop and remove the containers, but do not delete the images. You'll need them for the ACR push step:
docker compose down
This workflow (build locally, test locally, then push to ACR and deploy to AKS) eliminates an entire class of "works on my machine" debugging scenarios. Your cluster should see the same image that passed your local test.
Advanced Azure AKS Troubleshooting
Diagnosing Node Pool Failures with kubectl and az aks
When a node pool fails to provision or a node goes NotReady, your first stop should be the node's event log. Run this to see recent events across all nodes:
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -30
For a specific failing pod:
kubectl describe pod <pod-name> -n <namespace>
Look at the Events section at the bottom. Common errors you'll see here: FailedMount (volume issues), FailedScheduling (resource quota exceeded or no node with matching taints/tolerations), and ErrImagePull (registry authentication problem).
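When a cluster is noisy, it helps to see which of those reasons dominates before reading individual events. The sketch below tallies event reasons. count_event_reasons is a hypothetical helper that only counts whatever reasons you feed it.

```shell
# Reads one event reason per line on stdin and prints "count reason",
# most frequent first.
count_event_reasons() {
  sort | uniq -c | sort -rn | awk '{ print $1, $2 }'
}

# Example (needs kubectl access to the cluster); lists warning events only:
# kubectl get events --all-namespaces --field-selector type=Warning \
#   --no-headers -o custom-columns=REASON:.reason | count_event_reasons
```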
Automatic Cluster Upgrade Failures
AKS patches nodes automatically on a maintenance schedule when auto-upgrade is enabled. If an automatic upgrade fails, the cluster's provisioningState will show Failed and you'll find the root cause in the Activity Log. In the Azure portal, go to your resource group → Activity Log → filter by "Upgrade Kubernetes" operations. The failure reason is almost always one of: insufficient quota in the region, an add-on that's incompatible with the target Kubernetes version, or a PodDisruptionBudget that's blocking node draining.
To check current auto-upgrade configuration:
az aks show \
--resource-group myResourceGroup \
--name myAKSCluster \
--query "autoUpgradeProfile"
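Of the three upgrade blockers mentioned above, the PodDisruptionBudget one is the easiest to check ahead of time. A rough sketch: list every PDB with its currently allowed disruptions and flag the ones sitting at zero, since those will stall a node drain. find_blocking_pdbs is a hypothetical helper.

```shell
# Reads "namespace name disruptionsAllowed" lines on stdin and prints the
# PDBs that currently allow zero disruptions.
find_blocking_pdbs() {
  awk '$3 == 0 { print $1 "/" $2 }'
}

# Example (needs kubectl access to the cluster):
# kubectl get pdb --all-namespaces --no-headers \
#   -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name,ALLOWED:.status.disruptionsAllowed \
#   | find_blocking_pdbs
```

A PDB at zero isn't automatically wrong, but it's worth knowing about before the maintenance window rather than after.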
Networking Errors, CNI and DNS
If pods can't communicate with each other or with Azure services, check that your cluster's network plugin is configured correctly. AKS clusters default to Azure CNI in most configurations. A misconfigured IP range, one that overlaps with your virtual network's address space, will cause intermittent connection failures that are extremely hard to trace. Use kubectl exec to run nslookup kubernetes.default inside a pod to verify in-cluster DNS is resolving.
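Overlapping ranges are easy to check mechanically. Below is a minimal sketch of an IPv4 CIDR overlap test in plain shell arithmetic. The helper names are my own, and it doesn't validate its input, so feed it well-formed CIDRs.

```shell
# Converts a dotted IPv4 address to an integer.
ip_to_int() {
  local IFS=.
  set -- $1
  echo "$(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))"
}

# Prints "first last" (as integers) for a CIDR such as 10.0.0.0/16.
cidr_range() {
  local ip="${1%/*}" bits="${1#*/}" base size
  base=$(ip_to_int "$ip")
  size=$(( 1 << (32 - bits) ))
  base=$(( base / size * size ))   # align to the network boundary
  echo "$base $(( base + size - 1 ))"
}

# Succeeds if the two CIDRs share at least one address.
cidrs_overlap() {
  local a b
  a=$(cidr_range "$1")
  b=$(cidr_range "$2")
  [ "${a% *}" -le "${b#* }" ] && [ "${b% *}" -le "${a#* }" ]
}

# Example with hypothetical ranges:
# cidrs_overlap 10.0.0.0/16 10.0.1.0/24 && echo "overlap"
```

You can pull the real values to compare from az network vnet show (addressSpace.addressPrefixes) and az aks show (networkProfile).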
ACR Authentication With Managed Identity
If you're using a custom managed identity for your AKS cluster rather than the system-assigned identity, the --attach-acr flag may not correctly assign the AcrPull role. Verify the assignment exists:
az role assignment list \
--scope $(az acr show --name $ACRNAME --query id -o tsv) \
--query "[?roleDefinitionName=='AcrPull']" \
--output table
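If the assignment is missing, add it manually. The assignee should be the identity your nodes use to pull images (the cluster's kubelet identity):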
az role assignment create \
--assignee <managed-identity-client-id> \
--role AcrPull \
--scope $(az acr show --name $ACRNAME --query id -o tsv)
If your cluster is stuck in a Failed provisioning state for more than 30 minutes, your node pools are not recovering after an upgrade attempt, or you're seeing errors referencing internal Azure infrastructure components (like AGIC, kube-proxy, or managed add-ons) that you can't resolve with CLI commands, stop troubleshooting on your own and open a ticket. Production AKS clusters in a broken state can lose data or SLA time quickly. Contact Microsoft Support and make sure to include your cluster resource ID, the exact provisioning state, and the Activity Log entries from the time the failure started.
Prevention & Best Practices for Azure AKS
The single best thing you can do for your AKS setup is treat it as a living system that needs regular attention, not a set-and-forget infrastructure piece. Most of the disasters I've helped teams recover from were completely predictable and preventable.
First, always check the AKS release notes and retirement notices before you start a new cluster or a migration. The Azure Linux 2.0 end-of-life story was announced well in advance, but teams running on autopilot didn't catch it until their pipelines broke. Subscribe to Azure Service Health alerts for AKS specifically. You'll get retirement notices, breaking change warnings, and region-level incidents pushed directly to you.
Second, use AKS Automatic if you're running production workloads and don't have a dedicated Kubernetes administrator. It's not the right choice for every situation: teams with highly customized networking setups, complex multi-tenant configurations, or deep Kubernetes expertise and a need for precise control will prefer Standard mode. But for most application teams, Automatic gives you production-ready defaults, automatic patching, and built-in monitoring without having to hire a Kubernetes specialist to maintain it.
Third, always maintain your container images in a private ACR rather than pulling from public registries at runtime. Docker Hub rate limits are real, GHCR tokens can expire, and pulling from public registries introduces a network dependency you don't control. Mirror everything you depend on into your ACR and pull from there.
Fourth, run az aks get-upgrades at least monthly to see what Kubernetes versions are available and which ones are approaching end-of-support. Don't wait until a version is deprecated to start testing the upgrade path.
- Enable Azure Service Health alerts for AKS retirement and deprecation notices, so you're never blindsided by an end-of-life deadline again
- Run az aks check-acr as part of your deployment pipeline pre-flight check; it confirms registry connectivity before pods fail in production
- Set planned maintenance windows for your AKS automatic upgrades so patches don't happen during peak traffic hours
- Always test application stacks locally with Docker Compose before touching your AKS cluster; it catches 80% of image and config errors before they reach Kubernetes
Frequently Asked Questions
What is Azure AKS Automatic and how is it different from Standard mode?
AKS Automatic is a cluster mode where Azure handles node management, scaling, security configuration, and patching for you. You deploy your workloads and the cluster dynamically allocates compute resources based on demand, so you don't manually manage node counts or instance sizes the way you do in Standard mode. AKS Automatic comes with Managed Prometheus, Managed Grafana, and Container Insights pre-configured out of the box. Standard mode gives you more control over every layer of the cluster configuration but requires you to actively manage node pools, patching schedules, and monitoring setup yourself. If you're an application team without a dedicated Kubernetes engineer, Automatic is the right starting point.
My AKS pods are stuck in ImagePullBackOff. What do I check first?
Start with kubectl describe pod <pod-name> and read the Events section at the bottom. ImagePullBackOff almost always means either the image tag doesn't exist in the registry, or your cluster doesn't have permission to pull from the registry. Run az aks check-acr --resource-group myResourceGroup --name myAKSCluster --acr $ACRNAME to verify the pull permission is correctly configured. If the check fails on the AcrPull role assignment, re-run az aks update --attach-acr $ACRNAME and wait 2–3 minutes for the RBAC change to propagate. Also double-check your Kubernetes deployment manifest: the image reference must exactly match the repository name and tag in your ACR, including letter case.
What happens if I don't migrate off Azure Linux 2.0 before March 31, 2026?
After March 31, 2026, Microsoft will remove the Azure Linux 2.0 node images entirely. Any node pool still on that OS SKU will lose the ability to scale: you won't be able to add nodes, and failed nodes can't be replaced. Your existing running pods may continue for a while, but the moment a node needs to be replaced or the pool needs to scale, the operation will fail with no recovery path on that pool. The Azure Linux 2.0 node image was frozen at release 202512.06.0 as of November 30, 2025, meaning it has received no security patches since then. The official migration path is to upgrade your node pools to a supported Kubernetes version or migrate them to the AzureLinux3 OS SKU.
How do I push a container image from GitHub Container Registry into my Azure Container Registry?
Use the az acr import command. It pulls the image from the source registry and copies it directly into your ACR without needing Docker installed locally. The command looks like: az acr import --name $ACRNAME --source ghcr.io/azure-samples/aks-store-demo/product-service:latest --image aks-store-demo-product-service:latest. This is the preferred approach for any image you don't build yourself: third-party dependencies, base images, and public Microsoft samples. Keeping all your runtime images in ACR eliminates external registry dependencies at pod startup time and helps avoid Docker Hub rate limit errors.
Can I disable the built-in monitoring tools in an AKS Automatic cluster?
In AKS Automatic, some features are preconfigured and locked, so you cannot disable them or change their settings. Other features are configured with defaults that you can override. The monitoring stack (Managed Prometheus, Managed Grafana, Container Insights) falls into the default category for Automatic clusters, meaning it's set up for you but you can adjust its configuration. However, you cannot completely disable core security hardening settings. Those are preconfigured and enforced as part of what makes Automatic clusters production-ready. If you need full control over which components are installed and how they're configured, AKS Standard mode is the right choice for your architecture.
How do I create an Azure Container Registry and what SKU should I pick?
Create an ACR with the Azure CLI: az acr create --resource-group myResourceGroup --name $ACRNAME --sku Basic. The Basic SKU is the right entry point for development and smaller production workloads because it balances storage capacity with throughput at the lowest cost. The registry name must be globally unique across all of Azure and contain only 5–50 lowercase alphanumeric characters (no hyphens or underscores). For high-throughput production environments, consider the Standard or Premium SKU, which offer higher storage limits, geo-replication, and content trust features. After creation, attach the registry to your AKS cluster with az aks update --attach-acr $ACRNAME so the cluster can pull images without manual secret management.