Azure Container Registry CrashLoopBackOff, ImagePullBackOff & Pod Fix Guide
Why This Is Happening
Picture this: you push a fresh image to your Azure Container Registry, run kubectl apply, and watch your pod immediately go into ImagePullBackOff. Or worse, the pod starts, crashes, and Kubernetes keeps restarting it in a relentless CrashLoopBackOff loop. Your pipeline was working yesterday. Nothing changed. Except something clearly did. I've been on support calls where teams spent four hours staring at Kubernetes logs when the fix was a 30-second credential refresh in Azure. That's why this guide exists.
Both CrashLoopBackOff and ImagePullBackOff are Kubernetes-level symptoms, but with Azure Container Registry they almost always trace back to one of the same three root causes: an authentication failure, a network access restriction, or a misconfigured service principal. The error messages Kubernetes surfaces, things like unauthorized: authentication required or no basic auth credentials, are technically accurate but maddeningly vague. They tell you what broke, not why.
Here's what's actually going on under the hood. When your AKS node (or any Kubernetes worker) needs to pull an image, it reaches out to your Azure Container Registry endpoint, something like myregistry.azurecr.io, and presents credentials. Those credentials might come from a service principal attached to the cluster, from admin user keys baked into an image pull secret, or from a scoped token tied to a repository. If any of those credentials are expired, revoked, misconfigured, or blocked by a firewall rule, the pull fails. Kubernetes retries it, fails again, and after a few attempts labels the pod ImagePullBackOff. If the image does pull but the container exits immediately due to a bad entrypoint or missing environment variable, that's your CrashLoopBackOff path, a different beast entirely, and one I cover in the Advanced section.
The scenarios I see most often in enterprise environments:
- A service principal secret silently expired (Azure does not send reminders by default)
- Someone regenerated the admin user password in the Access keys blade but didn't update the Kubernetes image pull secret
- A network firewall rule or ACR private endpoint was added, blocking the node's egress IP
- A scoped token's password hit its expiration date
- The Docker daemon isn't running on the dev machine where someone is trying to test az acr login
None of these are obvious from a raw kubectl describe pod output. Microsoft's error messages around Azure Container Registry authentication are famously terse: a 401 response with a JSON blob that just says "authentication required" is not going to point you at the expired service principal secret that's been sitting there for 13 months. That gap is what this guide fills.
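As a rough first pass, the split between "credentials" and "network" causes can be sketched as a small shell function. This is only an illustration: classify_pull_error is a hypothetical helper name, and the patterns are limited to the error strings quoted in this guide, so real messages from your container runtime may not match them.

```shell
# classify_pull_error <message>
# Hypothetical helper: maps a pull-failure message to the most likely cause.
# The patterns come from the error strings quoted in this guide; real
# messages vary by container runtime and version.
classify_pull_error() {
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"not allowed access"*|*"timeout"*|*"deadline exceeded"*)
      echo "network" ;;
    *"unauthorized"*|*"authentication required"*|*"no basic auth credentials"*)
      echo "credentials" ;;
    *)
      echo "unknown" ;;
  esac
}

# Example (pod name and namespace are placeholders):
#   classify_pull_error "$(kubectl describe pod <pod-name> -n <namespace>)"
```

A "credentials" result points at Steps 2 through 4 of this guide, and a "network" result points at the firewall and private endpoint checks.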
The Quick Fix: Try This First
Before you go digging into service principals or firewall rules, run the official health check command. This single command does the heavy lifting of diagnosing your Azure Container Registry in one shot and is the first thing Microsoft's own support team runs when you open a ticket.
Open your terminal (Azure Cloud Shell works fine here) and run:
az acr check-health --name <your-acr-name> --ignore-errors --yes
Replace <your-acr-name> with your actual registry name (just the name, not the full .azurecr.io FQDN). The --ignore-errors flag keeps the command running even if it hits a partial failure, so you get the full picture. The --yes flag skips the confirmation prompt so it runs non-interactively, useful in scripts.
What you're looking for in the output: any line that shows an error code rather than a green check. Common ones you might see include DOCKER_COMMAND_ERROR (Docker not installed or daemon not running), CONNECTIVITY_REFRESH_TOKEN_ERROR (token issues or subscription access problems), and DNS resolution failures. Each error code maps directly to a documented fix: the official health check error reference covers them all, and the step-by-step section below addresses the most common ones.
If the health check comes back clean and you're still seeing ImagePullBackOff, the problem is almost certainly in your Kubernetes image pull secret or the service principal permissions assigned to your AKS cluster. Jump straight to Step 3.
If you're on a developer machine and the health check throws DOCKER_COMMAND_ERROR, your Docker daemon is either not installed or not running. Start Docker Desktop, wait for it to fully initialize (the whale icon in the system tray stops animating), and run the health check again before anything else.
One Cloud Shell caveat: if az acr login fails with "this command requires running the docker daemon," don't panic and don't try to install Docker in Cloud Shell; it won't work. Instead, run az acr login -n <acr-name> --expose-token to grab a token directly. Cloud Shell provides the Docker CLI but not the daemon, so the --expose-token path is the supported workaround for this environment specifically.
Step 1: Make Sure the Docker Daemon Is Running
This sounds basic, and I know it feels like the "have you tried turning it off and on again" of container troubleshooting. But the DOCKER_COMMAND_ERROR in Azure Container Registry authentication logs trips up more people than you'd expect, especially on fresh developer machines, CI runners that don't pre-install Docker, or environments that use containerd directly without the Docker CLI.
The az acr login command works by calling docker login behind the scenes and passing it a Microsoft Entra access token. Without both a Docker client and a running Docker daemon, the whole authentication flow breaks before it even reaches ACR.
Check if Docker is running:
docker info
If you get Cannot connect to the Docker daemon at unix:///var/run/docker.sock (Linux/macOS) or a similar named pipe error on Windows, the daemon is down. On Linux, bring it back up with:
sudo systemctl start docker
sudo systemctl enable docker
On Windows with Docker Desktop, open the application from your Start menu. Wait for the engine to initialize fully, which can take 30–60 seconds. When Docker Desktop shows "Docker Desktop is running" in the system tray tooltip, try your az acr login again.
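For scripts and CI runners, the same check can be folded into one function that distinguishes "no CLI" from "CLI present but daemon down". This is a minimal sketch; docker_status and its three output labels are names I made up for illustration.

```shell
# docker_status
# Hypothetical helper: reports whether the Docker CLI exists and whether
# it can reach a daemon. Prints one of: cli-missing, daemon-down, ready.
docker_status() {
  if ! command -v docker >/dev/null 2>&1; then
    echo "cli-missing"      # az acr login cannot work here without --expose-token
  elif ! docker info >/dev/null 2>&1; then
    echo "daemon-down"      # start Docker Desktop or run: sudo systemctl start docker
  else
    echo "ready"
  fi
}

# Example:
#   [ "$(docker_status)" = "ready" ] && az acr login -n <acr-name>
```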
If you're in an environment where Docker simply isn't available (Azure Cloud Shell, a lightweight CI agent, a container that doesn't mount the Docker socket), use the token-only approach instead:
az acr login -n <acr-name> --expose-token
This returns a JSON object with an accessToken field. You can then pass that token to tools that accept bearer tokens directly, bypassing the Docker daemon requirement entirely. You should see a successful token output with a loginServer field matching your registry's FQDN when this works correctly.
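One way to consume that token is to feed it to a registry client's login command, using the all-zero GUID that ACR expects as the username for token logins. The sketch below assumes a client whose login subcommand accepts --username and --password-stdin (docker and podman both do); login_with_token is a hypothetical wrapper name.

```shell
# ACR expects this fixed placeholder username when the password is a token.
ACR_TOKEN_USER="00000000-0000-0000-0000-000000000000"

# login_with_token <client> <login-server>
# Hypothetical wrapper: reads the token on stdin so it never appears in
# the process list or shell history.
login_with_token() {
  client="$1"
  server="$2"
  "$client" login "$server" --username "$ACR_TOKEN_USER" --password-stdin
}

# Example (registry name is a placeholder):
#   TOKEN=$(az acr login -n <acr-name> --expose-token --query accessToken -o tsv)
#   printf '%s' "$TOKEN" | login_with_token podman <acr-name>.azurecr.io
```

Passing the token on stdin rather than with -p is a deliberate choice: the token is a live credential for as long as it is valid.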
Step 2: Fix the unauthorized: authentication required Error
The unauthorized: authentication required error is the most common ACR authentication failure I see. The full error from Docker looks like this:
Error response from daemon: Get "https://<acr-name>.azurecr.io/v2/":
unauthorized: {"errors":[{"code":"UNAUTHORIZED","message":
"authentication required, visit https://aka.ms/acr/authorization
for more information."}]}
This is a 401 response from the registry. The registry received your request, understood it, and rejected the credentials. Three things can cause this: wrong username/password, expired credentials, or a service principal that doesn't have permission to access this registry.
If you're using the admin user: Go to the Azure portal, open your Container Registry resource, and navigate to Settings > Access keys. You'll see two passwords labeled password and password2. Copy one and run:
docker login <acr-name>.azurecr.io -u <admin-username> -p <password>
One thing that catches people: if someone else on your team previously regenerated these passwords (which you can do from that same Access keys blade), any scripts or Kubernetes secrets using the old password will silently fail. The old password stops working the moment it's regenerated, and nothing notifies the systems that still hold it.
If you're using a scoped token: Tokens have an optional expiration date. In the Azure portal, open your registry, go to Repository permissions > Tokens, click your token, and check the Expiration date column. If it's past, generate a new password for that token. Alternatively, run:
az acr token list --registry <acr-name> --resource-group <rg-name> -o table
If you're using a service principal: Check that the principal has at minimum the AcrPull role on the registry. The ACR-specific roles include AcrPull, AcrPush, and AcrDelete, alongside standard Azure RBAC roles like Owner and Contributor. Assign the right role from Access control (IAM) on the registry resource page.
When authentication succeeds, docker login outputs Login Succeeded and you'll be able to pull images without a 401.
Step 3: Recreate the Kubernetes Image Pull Secret
This is where ImagePullBackOff in Kubernetes almost always lands. Your AKS cluster (or any Kubernetes cluster pulling from ACR) needs a way to authenticate at pull time. That authentication is stored in a Kubernetes Secret of type kubernetes.io/dockerconfigjson, commonly called an image pull secret. When the credentials inside that secret expire or get regenerated, every pod referencing it fails the next time a node has to pull the image.
First, check what image pull secrets are attached to your failing pod:
kubectl describe pod <pod-name> -n <namespace>
Look for the Events section at the bottom. An ImagePullBackOff will show something like Failed to pull image "myregistry.azurecr.io/myapp:latest": rpc error: ... unauthorized. Also look at the Image pull secrets field near the top to see which secret it's using.
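Before deleting anything, it can help to confirm that the secret the pod references actually targets your registry. The sketch below decodes the secret payload and does a plain substring check for the login server. secret_mentions_registry is a hypothetical helper, and it assumes a base64 that accepts -d (GNU coreutils; older macOS versions use -D).

```shell
# secret_mentions_registry <login-server>
# Hypothetical helper: reads a base64-encoded .dockerconfigjson payload on
# stdin and succeeds if the decoded JSON mentions the given registry host.
# This is a substring check, not a JSON parser, so treat it as a sanity check.
secret_mentions_registry() {
  base64 -d | grep -F -q "\"$1\""
}

# Example (secret, namespace, and registry names are placeholders):
#   kubectl get secret <secret-name> -n <namespace> \
#     -o jsonpath='{.data.\.dockerconfigjson}' |
#     secret_mentions_registry <acr-name>.azurecr.io && echo "registry matches"
```

A mismatch here means the secret was created for a different registry, and no amount of credential rotation will fix the pull.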
Delete the old secret and recreate it with fresh credentials:
kubectl delete secret <secret-name> -n <namespace>
kubectl create secret docker-registry <secret-name> \
--namespace <namespace> \
--docker-server=<acr-name>.azurecr.io \
--docker-username=<service-principal-id> \
--docker-password=<service-principal-secret>
If you're using AKS with the managed identity integration (the recommended path), you don't need image pull secrets at all. AKS can attach the AcrPull role to the kubelet's managed identity directly. Do that with:
az aks update -n <aks-cluster-name> -g <resource-group> \
--attach-acr <acr-name>
After recreating the secret or attaching the ACR, restart the failing pods:
kubectl rollout restart deployment/<deployment-name> -n <namespace>
Watch the pods come back up with kubectl get pods -n <namespace> -w. You're looking for Running status with 0 restarts. If you still see ImagePullBackOff after this step, move to Step 4 to check service principal secret expiration.
Step 4: Check for an Expired Service Principal Secret
Service principal secrets always carry an expiry date. Secrets created in the Azure portal are capped at 24 months, and az ad sp create-for-rbac issues a one-year secret by default. When that secret expires, every system using it goes dark simultaneously: no warnings, no gradual degradation, just sudden 401s across the board. I've seen this take down production pipelines at 2am more times than I want to count.
To check if your service principal secret is expired, you need the service principal's application (client) ID. Then run:
az ad app credential list --id "<SP_APP_ID>" --query "[].endDateTime" -o tsv
This returns the expiration date for each credential on that app registration. If the date is in the past, that's your problem. If you're a portal person rather than CLI, open Microsoft Entra ID (formerly Azure Active Directory) > App registrations > <your-app> > Certificates & secrets. The Expires column on the Client secrets tab tells you the same thing.
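Comparing those timestamps by eye gets tedious with several credentials, so here is a small sketch that classifies a single endDateTime value. It assumes GNU date (for the -d flag), and secret_status and its 30-day default warning window are illustrative choices rather than anything Azure defines.

```shell
# secret_status <endDateTime> [now] [warn-days]
# Hypothetical helper: prints "expired", "expiring in N days", or "ok".
# Requires GNU date. The optional second argument overrides the current
# time, which makes the function easy to test.
secret_status() {
  end=$(date -u -d "$1" +%s) || return 1
  now=$(date -u -d "${2:-now}" +%s) || return 1
  warn_days="${3:-30}"
  days_left=$(( (end - now) / 86400 ))
  if [ "$end" -le "$now" ]; then
    echo "expired"
  elif [ "$days_left" -lt "$warn_days" ]; then
    echo "expiring in $days_left days"
  else
    echo "ok"
  fi
}

# Example (application ID is a placeholder):
#   az ad app credential list --id "<SP_APP_ID>" --query "[].endDateTime" -o tsv |
#     while IFS= read -r end; do secret_status "$end"; done
```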
To create a new secret:
az ad app credential reset --id "<SP_APP_ID>" --append
The --append flag creates a new secret without removing the existing one, useful if other systems also use this principal and you need time to rotate. Copy the password value from the output immediately. Azure will never show you that value again after this command completes.
Once you have the new secret, update your Kubernetes image pull secret (follow the delete-and-recreate steps from Step 3), update any CI/CD pipeline variables that reference the old secret, and optionally delete the expired credential to keep the app registration clean:
az ad app credential delete --id "<SP_APP_ID>" --key-id "<old-key-id>"
When the new credentials are in place and the image pull secret is recreated, you should see your pods transition from ImagePullBackOff to ContainerCreating to Running within a minute or two.
Step 5: Check Firewall Rules and Private Endpoints
If you've confirmed your credentials are valid and fresh but you're still seeing pull failures, the culprit is often a network-level block. Azure Container Registry supports IP-based access restrictions and private endpoints. When either is configured, any request coming from an IP address not on the allowlist gets a hard denial, often surfaced as a timeout or as the "Client with IP address is not allowed access" error.
In the Azure portal, navigate to your Container Registry resource and go to Settings > Networking. Under the Public access tab, check whether Selected networks is chosen instead of All networks. If it is, your AKS node's outbound IP address needs to be in the Firewall section.
To find your AKS cluster's outbound IPs:
az aks show -n <cluster-name> -g <resource-group> \
--query "networkProfile.loadBalancerProfile.effectiveOutboundIPs[].id" -o tsv
Then for each resource ID returned, get the actual IP:
az network public-ip show --ids <resource-id> --query "ipAddress" -o tsv
Add those IPs to the ACR firewall allowlist from the Networking blade, or via CLI:
az acr network-rule add --name <acr-name> --ip-address <aks-outbound-ip>
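The three commands above chain together naturally. The following sketch loops over every outbound IP resource and adds each address to the allowlist. It assumes the cluster uses a load balancer for outbound traffic (the same assumption the query above makes), and allow_aks_outbound_ips is a hypothetical function name.

```shell
# allow_aks_outbound_ips <cluster-name> <resource-group> <acr-name>
# Hypothetical helper: resolves each effective outbound IP of the cluster
# and adds it to the registry firewall allowlist.
allow_aks_outbound_ips() {
  cluster="$1"
  rg="$2"
  acr="$3"
  az aks show -n "$cluster" -g "$rg" \
    --query "networkProfile.loadBalancerProfile.effectiveOutboundIPs[].id" -o tsv |
  while IFS= read -r ip_id; do
    [ -n "$ip_id" ] || continue
    # </dev/null keeps the inner commands from consuming the loop's input
    ip=$(az network public-ip show --ids "$ip_id" --query "ipAddress" -o tsv </dev/null)
    echo "allowing $ip"
    az acr network-rule add --name "$acr" --ip-address "$ip" >/dev/null </dev/null
  done
}

# Example (all three names are placeholders):
#   allow_aks_outbound_ips <cluster-name> <resource-group> <acr-name>
```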
If you're using private endpoints instead, the issue might be DNS resolution: the registry's FQDN needs to resolve to the private IP, not the public one, from within your VNet. Run a DNS lookup from inside your cluster to verify:
kubectl run -it --rm dns-test --image=busybox --restart=Never -- \
nslookup <acr-name>.azurecr.io
The returned IP should be in a private range (10.x.x.x, 172.16.x.x through 172.31.x.x, or 192.168.x.x), not a public Azure IP. If it's returning a public IP, your private DNS zone isn't linked to the VNet correctly. Check Private DNS zones > privatelink.azurecr.io > Virtual network links in the portal and add the missing link.
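If you want to script that judgment call, the private-range test is simple enough to write by hand. This sketch checks an IPv4 address against the three RFC 1918 ranges; is_private_ip is a hypothetical helper and does not validate that its input is a well-formed address.

```shell
# is_private_ip <ipv4-address>
# Hypothetical helper: prints "private" for RFC 1918 addresses
# (10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16) and "public" otherwise.
is_private_ip() {
  case "$1" in
    10.*|192.168.*)
      echo "private" ;;
    172.*)
      second=${1#172.}          # drop the first octet
      second=${second%%.*}      # keep only the second octet
      if [ "$second" -ge 16 ] 2>/dev/null && [ "$second" -le 31 ]; then
        echo "private"
      else
        echo "public"
      fi ;;
    *)
      echo "public" ;;
  esac
}

# Example: feed it the address that nslookup returned from inside the cluster.
#   is_private_ip 10.240.0.7
```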
After adding the IP rule or fixing DNS, give it 60–90 seconds for the change to propagate, then trigger a fresh pod deployment. The pull should succeed without any credential changes.
Advanced Troubleshooting
When the standard fixes don't stick, or you're in an enterprise environment with stricter controls, you need to go deeper. Here's what I reach for when the obvious stuff hasn't worked.
Diagnosing CrashLoopBackOff Separately from ImagePullBackOff
These two errors are often conflated but they mean very different things. ImagePullBackOff means the container image never even made it onto the node: the pull itself failed. CrashLoopBackOff means the image pulled successfully, the container started, and then it crashed, so Kubernetes keeps restarting it. For CrashLoopBackOff, your Azure Container Registry configuration is usually fine; the problem is inside the container itself.
Check the container's own logs even while it's crashing:
kubectl logs <pod-name> -n <namespace> --previous
The --previous flag pulls logs from the last terminated instance of the container, which gives you the actual crash output rather than an empty log from the new instance that hasn't had time to write anything. Missing environment variables, a bad entrypoint path, or an application-level startup failure will all show up here. This is almost never an ACR issue once the image has pulled.
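When the logs are empty, the exit code of the last terminated container is often the next best clue. The sketch below maps a few conventional exit codes to likely causes. These are general shell and signal conventions rather than anything ACR-specific, and explain_exit_code is a hypothetical helper.

```shell
# explain_exit_code <code>
# Hypothetical helper: translates common container exit codes into a hint.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; the main process finished instead of staying up" ;;
    1)   echo "application error; read the --previous logs" ;;
    126) echo "entrypoint found but not executable" ;;
    127) echo "entrypoint or command not found" ;;
    137) echo "killed with SIGKILL, often an out-of-memory kill" ;;
    143) echo "terminated with SIGTERM" ;;
    *)   echo "application-specific exit code $1" ;;
  esac
}

# Example (pod and namespace are placeholders; index 0 assumes a single container):
#   code=$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```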
Managed Identity vs. Service Principal Authentication
In enterprise AKS deployments, I strongly recommend moving off service principal credentials entirely and onto managed identities. The authentication happens automatically via Azure's internal identity system, with no secrets to rotate and no expiry to track. If you're still on service principals for AKS-to-ACR auth, this is the migration worth doing:
az aks update -n <cluster-name> -g <resource-group> --enable-managed-identity
az aks update -n <cluster-name> -g <resource-group> --attach-acr <acr-name>
Validating Permissions with the AcrPull Role
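Don't assume an identity can pull just because it holds a role somewhere in Azure. The role assignment has to apply at a scope that covers the registry, and for least privilege the pull identity should hold an explicit AcrPull assignment on the registry rather than lean on a broad inherited role such as Contributor. List the assignments made at the registry's scope: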
az role assignment list --scope \
$(az acr show -n <acr-name> --query id -o tsv) \
--query "[].{Principal:principalName,Role:roleDefinitionName}" -o table
Look for your service principal or managed identity in the Principal column with AcrPull in the Role column. If it's missing, assign it:
az role assignment create --assignee <principal-id> \
--role AcrPull \
--scope $(az acr show -n <acr-name> --query id -o tsv)
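The list-then-assign flow can be wrapped in a yes/no check that scripts can branch on. This is a minimal sketch: has_acrpull is a hypothetical name, and it counts only assignments made directly at the registry scope. Adding --include-inherited to the list call should also count assignments inherited from the resource group or subscription.

```shell
# has_acrpull <principal-id> <acr-name>
# Hypothetical helper: returns 0 if the principal holds AcrPull at the
# registry scope, 1 if not, 2 if an az call itself failed.
has_acrpull() {
  principal="$1"
  acr="$2"
  scope=$(az acr show -n "$acr" --query id -o tsv) || return 2
  count=$(az role assignment list --assignee "$principal" --scope "$scope" \
    --query "length([?roleDefinitionName=='AcrPull'])" -o tsv) || return 2
  [ "${count:-0}" -gt 0 ]
}

# Example (both values are placeholders):
#   has_acrpull <principal-id> <acr-name> || echo "AcrPull is missing"
```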
Checking the "Unable to Get Admin User Credentials" Error
The Unable to get admin user credentials error is particularly tricky because it combines a token refresh failure with a subscription-level resource lookup. The full error mentions a 401 from CONNECTIVITY_REFRESH_TOKEN_ERROR followed by a "resource not found in subscription" message. This almost always means the identity you're using doesn't have read access to the registry resource in Azure Resource Manager, separate from actually pulling images.
Run az login again to refresh your Azure CLI session, then verify you're in the correct subscription:
az account show --query "{Subscription:name, ID:id}" -o table
az account set --subscription "<correct-subscription-id>"
If switching subscriptions resolves it, someone may have moved the registry to a different subscription without updating the credentials in all consuming systems.
Escalate to Microsoft Support if: you're seeing 401 errors but all credentials and role assignments are provably correct and fresh; your registry is showing healthy in az acr check-health but pulls still fail intermittently; or you suspect an Azure platform-level incident affecting your registry's availability. Helm or Notary errors from the health check don't indicate an ACR problem. They just mean those tools aren't installed or the CLI version is incompatible with the installed version of either tool. Don't let them distract you during a live incident.
Prevention & Best Practices
Most of the pain I've described in this guide is entirely preventable. The fixes above are reactive; these practices keep you from needing them in the first place.
Set calendar alerts for service principal expiration. Azure won't email you when a secret is about to expire. After you create any secret, immediately open your calendar and set a reminder 30 days before the endDateTime you found with az ad app credential list. This alone prevents the majority of surprise 401 outages I've seen in enterprise environments. Better yet, build a monthly Azure Automation runbook that checks all service principal credentials across your tenant and posts to a Slack channel.
Move to managed identities wherever possible. For AKS clusters, the --attach-acr flag on az aks update eliminates the entire category of "expired service principal secret" failures. The kubelet managed identity authenticates to ACR automatically, the permission is tied to the cluster's lifecycle, and there's nothing to rotate. This is the architecture Microsoft recommends and it's what I'd build from scratch today.
Run az acr check-health as part of your deployment pipeline. Add it as a pre-deployment gate in your CI/CD system. It usually finishes quickly, catches the most common ACR problems before they affect production, and reports named error codes you can match on and alert on.
Keep admin user credentials disabled unless you specifically need them. The admin user is a shared credential: if it leaks or gets regenerated at the wrong time, everything using it breaks at once. Service principals and managed identities scope down to least privilege and are independently revocable. In the Azure portal under Settings > Access keys, the admin user toggle should be off for most production registries.
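A pre-deployment gate along those lines might look like the sketch below. It scans the health check output for tokens ending in _ERROR and ignores the Helm and Notary ones. health_gate is a hypothetical name, and matching on the HELM_ and NOTARY_ prefixes is my assumption about how those codes are spelled, so check it against real output from your CLI version.

```shell
# health_gate
# Hypothetical helper: reads `az acr check-health` output on stdin and
# fails if it contains any error code other than Helm/Notary ones.
health_gate() {
  errors=$(grep -o '[A-Z][A-Z_]*_ERROR' | sort -u |
    grep -v -e '^HELM_' -e '^NOTARY_')
  if [ -n "$errors" ]; then
    echo "blocking:"
    echo "$errors"
    return 1
  fi
  echo "ok"
}

# Example (registry name is a placeholder; 2>&1 captures errors written to stderr):
#   az acr check-health --name <your-acr-name> --ignore-errors --yes 2>&1 | health_gate
```

Returning a non-zero status lets most CI systems fail the stage without any extra wiring.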
- Enable Diagnostic settings on your ACR resource and stream logs to Log Analytics, where authentication failures show up within seconds and are searchable by IP, identity, and repository
- Use --scope-map tokens for CI/CD pipelines instead of admin credentials, and scope them to only the repositories and operations each pipeline actually needs
- Pin image tags in your Kubernetes manifests (myimage:1.4.2 not myimage:latest) so a failed latest push doesn't silently roll out a broken image
- Add ACR secret rotation to your infrastructure-as-code pipeline using Azure Key Vault references, so Kubernetes pull secrets always stay in sync with current credentials automatically
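The tag-pinning rule is easy to enforce mechanically. The sketch below scans a manifest for image: lines and prints any reference that uses :latest or has no tag at all. find_unpinned_images is a hypothetical helper built on sed and a case statement, so it is a lint-style approximation, not a YAML parser.

```shell
# find_unpinned_images
# Hypothetical helper: reads a Kubernetes manifest on stdin and prints
# image references that are not pinned to an explicit tag or digest.
find_unpinned_images() {
  sed -n 's/^[[:space:]-]*image:[[:space:]]*\([^[:space:]]*\).*/\1/p' |
    tr -d "\"'" |
    while IFS= read -r ref; do
      name=${ref##*/}               # last path segment, e.g. myapp:1.4.2
      case "$name" in
        *@*)      ;;                # pinned by digest
        *:latest) echo "$ref" ;;    # floating tag
        *:*)      ;;                # explicit tag
        *)        echo "$ref" ;;    # no tag, which defaults to latest
      esac
    done
}

# Example (file name is a placeholder):
#   find_unpinned_images < deployment.yaml
```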
Frequently Asked Questions
Why does my pod show ImagePullBackOff even though I can docker pull the image manually just fine?
The most common reason is that your manual docker pull used your personal Azure CLI session credentials, while the Kubernetes pod is trying to pull using the cluster's service account or image pull secret, which has different (and probably expired or missing) credentials. Run kubectl describe pod <pod-name> and look at the Image pull secrets field and the error in the Events section. The Kubernetes pull path is entirely separate from your local Docker session, so testing with docker pull locally doesn't validate what the cluster can do.
Can I use az acr login inside Azure Cloud Shell to authenticate against my registry?
Not with the standard az acr login command. Cloud Shell provides the Docker CLI but doesn't run the Docker daemon, and az acr login requires the daemon to complete the authentication handshake. You'll get an error saying "this command requires running the docker daemon, which is not supported in Azure Cloud Shell." The fix is to use az acr login -n <acr-name> --expose-token instead, which returns a token you can use directly without the Docker daemon. This is officially supported and works reliably in Cloud Shell.
How do I find out which service principal is attached to my AKS cluster for ACR pulls?
Run az aks show -n <cluster-name> -g <resource-group> --query "servicePrincipalProfile.clientId" -o tsv. If the output is msi, your cluster is already using a managed identity (the better option). If it returns a GUID, that's the application ID of the service principal. You can then check its credentials with az ad app credential list --id <app-id> to see expiration dates. For clusters using managed identity, check the kubelet identity with az aks show -n <cluster-name> -g <resource-group> --query "identityProfile.kubeletidentity.objectId" -o tsv.
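If you are checking many clusters, the msi-versus-GUID distinction described above can be scripted. This is a small sketch with a hypothetical helper name, identity_kind, that only inspects the shape of the value returned by the servicePrincipalProfile.clientId query.

```shell
# identity_kind <clientId>
# Hypothetical helper: interprets the servicePrincipalProfile.clientId value.
identity_kind() {
  case "$1" in
    msi)
      echo "managed-identity" ;;
    ????????-????-????-????-????????????)
      echo "service-principal" ;;   # GUID-shaped: an application (client) ID
    *)
      echo "unknown" ;;
  esac
}

# Example (cluster and resource group are placeholders):
#   identity_kind "$(az aks show -n <cluster-name> -g <resource-group> \
#     --query "servicePrincipalProfile.clientId" -o tsv)"
```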
My ACR health check shows Helm and Notary errors. Does that mean my container registry is broken?
No, Helm and Notary errors in the az acr check-health output don't indicate anything wrong with your container registry or your network connectivity. They just mean that Helm or Notary either isn't installed on the machine where you're running the command, or the Azure CLI version isn't compatible with the version of those tools that is installed. You can safely ignore these errors if you don't use Helm charts stored in ACR or Notary content trust signing. Focus on any DNS, connectivity, or authentication errors in the output instead.
How do I know if an ACR private endpoint is causing my ImagePullBackOff rather than bad credentials?
The easiest tell is the error type: a private endpoint DNS problem usually manifests as a timeout or context deadline exceeded error rather than a 401 unauthorized. From inside your cluster, run kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup <acr-name>.azurecr.io. If the returned IP is a public Azure IP rather than an address in your VNet's private range, but your registry has public access disabled, DNS isn't resolving through your private DNS zone correctly. Check that the private DNS zone privatelink.azurecr.io is linked to the VNet your AKS nodes live in.
Is it safe to use the ACR admin user for production Kubernetes deployments?
It works, but I'd recommend against it for production. The admin user is a single shared credential with full push/pull access to every repository in the registry, and there's no way to scope it down. If it leaks or if someone regenerates it without notifying all consumers, every system using it breaks at once. For production AKS, use the --attach-acr managed identity path or a dedicated service principal with only the AcrPull role scoped to just the repositories that specific cluster needs. Reserve admin credentials for emergency break-glass access if you use them at all.