Azure Container Registry Troubleshooting: Fix Every Error
Why This Is Happening
Picture this: your CI/CD pipeline has been humming along for months, and then one morning a deployment fails with unauthorized: authentication required or your Docker client throws Error response from daemon: Head "https://yourregistry.azurecr.io/v2/": unauthorized. Your Kubernetes pods are stuck in ImagePullBackOff, Slack is lighting up, and the error message from Azure gives you absolutely nothing useful to go on. I've seen this exact scenario play out across dozens of enterprise environments, and the root cause is almost never what you'd guess from the error text alone.
Azure Container Registry (ACR) sits at the intersection of three complex systems: Azure's identity and access control (Entra ID, service principals, managed identities), your network topology (virtual networks, private endpoints, firewall rules), and Docker's own authentication protocol. When any one of those layers misfires, ACR surfaces a vague error that could mean a dozen different things. That's what makes Azure Container Registry troubleshooting genuinely hard: the same 401 Unauthorized can come from an expired token, a broken service principal, a misconfigured firewall, or even an incorrect registry SKU.
The most common causes I see in real environments break down like this:
- Authentication failures: Expired or revoked service principal credentials, a managed identity that hasn't been assigned the right role, or the admin account being disabled on the registry.
- RBAC misconfiguration: A pipeline identity with Reader instead of AcrPull, or a developer who needs push access but only has AcrPull assigned.
- Network lockdowns: A private endpoint was added to the registry but the CI runner or AKS node pool can't reach it. Or a firewall rule allows your office IP but not your Azure DevOps agent IP range.
- Registry SKU limitations: Trying to use geo-replication or content trust on a Basic-tier registry, or hitting the 10 GB storage cap on Basic without realising it.
- Token and credential cache issues: Docker storing a stale credential in the local credential helper, which silently overrides your fresh az acr login token.
- Repository-scoped token expiry: Repository-scoped access tokens have a configurable lifetime (default 1 hour); automated pipelines that run longer than that will start failing mid-run.
The frustrating part is that Azure's error messages map multiple distinct failure modes to the same surface-level strings. 403 Forbidden could mean your service principal lacks AcrPush, or it could mean a network ACL is blocking the request entirely. You need to know how to read past the Docker client error and go straight to the ACR diagnostic logs to find the real signal. This guide walks you through exactly that.
The Quick Fix: Try This First
Before you dig into RBAC assignments or network policies, run this triage sequence. It catches the three most common ACR problems in under two minutes and resolves the majority of day-to-day failures.
Step 1: Re-authenticate using the Azure CLI token, not Docker credentials:
az login
az acr login --name <yourregistryname>
This exchanges your current Azure CLI session for a short-lived Docker-compatible token and writes it to your local Docker credential store. It bypasses any stale cached credentials. If this alone fixes your docker pull or docker push, the root cause was a cached expired token in your Docker credential helper; see the Prevention section to stop it recurring.
Step 2: Verify your identity has the right role:
az role assignment list \
--assignee <your-object-id-or-service-principal-id> \
--scope $(az acr show --name <yourregistryname> --query id -o tsv) \
--output table
You're looking for AcrPull (for pulling images), AcrPush (for pushing), or AcrDelete (for deleting). The built-in Reader role does not grant registry data-plane access, which surprises a lot of people. If the role assignment table comes back empty or shows only Reader, that's your problem.
Step 3: Check if the admin account is enabled (for non-RBAC setups):
az acr show --name <yourregistryname> --query adminUserEnabled
If it returns false and you're relying on username/password authentication rather than managed identity or service principal, nothing will work. Enable it with az acr update --name <yourregistryname> --admin-enabled true and retrieve credentials with az acr credential show --name <yourregistryname>.
If none of those three steps resolve the issue, move on to the step-by-step section below. You're dealing with a network, policy, or deeper identity issue.
When you run az acr login from inside a CI/CD agent (Azure DevOps, GitHub Actions), pass --expose-token, capture the returned access token in a pipeline variable (for example ACRTOKEN), and hand it to docker login explicitly rather than relying on credential helper injection. Credential helpers behave unpredictably in containerized agents and are the silent culprit behind a large share of intermittent pipeline failures.
The Docker client error message is a starting point, not an answer. Go to the actual ACR diagnostic logs first; they tell you whether the failure is authentication, authorization, or network-level.
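For CI agents, a token-based login might be sketched as follows. The function name acr_docker_login is hypothetical, and the sketch assumes the Azure CLI session is already authenticated and that the login server is <name>.azurecr.io. The all-zero GUID is the placeholder username ACR expects when the password is an access token.

```shell
# Hypothetical helper: log Docker in to ACR with an explicitly exposed token.
# Assumes `az login` (or a pipeline service connection) has already run.
acr_docker_login() {
  local registry="$1"   # registry name without the .azurecr.io suffix
  local token

  # --expose-token prints the access token instead of invoking `docker login` itself.
  token=$(az acr login --name "$registry" --expose-token \
            --query accessToken --output tsv) || return 1

  # Pass the token over stdin so it never appears in the process list or shell history.
  printf '%s' "$token" | docker login "$registry.azurecr.io" \
    --username 00000000-0000-0000-0000-000000000000 \
    --password-stdin
}
```

Call it as acr_docker_login <yourregistryname> before any docker pull or docker push step. The token is short-lived, so long-running jobs may need to call it again.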
In the Azure Portal, navigate to your Container Registry resource. In the left-hand menu under Monitoring, click Diagnostic settings. If you don't have a diagnostic setting configured yet, click Add diagnostic setting, check ContainerRegistryLoginEvents and ContainerRegistryRepositoryEvents, and send them to a Log Analytics workspace. Give it two minutes for new log events to flow through.
Once logs are flowing, go to Logs in the left menu and run this query:
ContainerRegistryLoginEvents
| where TimeGenerated > ago(1h)
| where ResultDescription != "200"
| project TimeGenerated, Identity, CallerIpAddress, ResultDescription, OperationName
| order by TimeGenerated desc
The ResultDescription field is where the real information lives. 401 means the token was invalid or absent entirely. 403 means the token was valid but the identity lacks permission or a firewall is blocking it. 429 means you've hit rate limits (common on Basic SKU registries under heavy pull load). The CallerIpAddress field helps you confirm whether the request is even reaching the registry or being blocked before it arrives.
If you see your expected client IP in the logs with a 401, the problem is authentication. If the IP doesn't appear at all, the request is being dropped at the network layer before it ever reaches ACR.
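The reading of those status codes can be captured in a small lookup, which is handy when scripting over exported log rows. This is only a sketch of the mapping described above; the function name classify_acr_status and the wording of each label are illustrative, not anything ACR emits.

```shell
# Hypothetical helper: map an HTTP status seen in the ACR logs to a likely failure layer.
# An empty argument stands for "no log entry was found for this client".
classify_acr_status() {
  case "$1" in
    200) echo "ok" ;;
    401) echo "authentication: token invalid, expired, or missing" ;;
    403) echo "authorization or firewall: permission missing or network rule blocking" ;;
    429) echo "throttling: registry rate limit reached" ;;
    "")  echo "network: request likely dropped before reaching ACR" ;;
    *)   echo "unclassified: inspect the full log record" ;;
  esac
}
```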
If your logs show 401 Unauthorized from a known IP, work through authentication sources in order. For service principals, the most common failure is that the client secret has expired. Azure doesn't notify you when this happens; it just starts returning 401s.
Check service principal credential expiry:
az ad sp credential list --id <service-principal-app-id> --output table
Look at the endDateTime column. If it's in the past, create a new secret immediately:
az ad sp credential reset --id <service-principal-app-id> \
--years 1 \
--query password -o tsv
Store that new secret in your pipeline's secret store and update the Docker login step.
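To catch an expired secret without reading the table by eye, a comparison along these lines may help. It is a sketch that assumes the endDateTime values are UTC timestamps beginning with YYYY-MM-DDTHH:MM:SS, so plain string comparison matches chronological order. The function name is_expired is hypothetical.

```shell
# Hypothetical helper: succeed (exit 0) if a UTC ISO 8601 timestamp lies in the past.
# $1 = expiry timestamp, $2 = optional "now" override (defaults to the current UTC time).
is_expired() {
  local end now
  # Keep only YYYY-MM-DDTHH:MM:SS so fractional seconds and zone suffixes are ignored.
  end=$(printf '%s' "$1" | cut -c1-19)
  now=$(printf '%s' "${2:-$(date -u +%Y-%m-%dT%H:%M:%S)}" | cut -c1-19)
  [ "$end" \< "$now" ]
}

# Example wiring (not executed here): list expired secrets on a service principal.
# az ad sp credential list --id <service-principal-app-id> --query "[].endDateTime" -o tsv |
#   while read -r end; do
#     if is_expired "$end"; then echo "expired: $end"; fi
#   done
```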
For managed identity scenarios, which is the pattern you should be using for AKS, App Service, and Azure Container Instances, the failure is almost always a missing role assignment rather than an expired credential. Managed identities don't have passwords that expire, but they do need explicit RBAC assignments at the registry scope. If you recently reassigned a node pool's managed identity or changed the registry, the role assignment won't follow automatically.
# Get the principal ID of the managed identity
PRINCIPAL_ID=$(az identity show \
--name <identity-name> \
--resource-group <rg-name> \
--query principalId -o tsv)
# Assign AcrPull at registry scope
az role assignment create \
--assignee $PRINCIPAL_ID \
--role AcrPull \
--scope $(az acr show --name <registry-name> --query id -o tsv)
Role assignments take 1–5 minutes to propagate across Azure's authorization system. If you assigned the role and still get 401s immediately, wait five minutes and retry before assuming the assignment failed.
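In a pipeline, that wait can be expressed as a bounded retry instead of a fixed sleep. The helper below is a generic sketch; the name retry and the attempt and delay values in the example are assumptions, not ACR defaults.

```shell
# Hypothetical helper: run a command up to N times, pausing between failed attempts.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    # Stop as soon as the command succeeds.
    "$@" && return 0
    # Sleep only if another attempt is still to come.
    if [ "$i" -lt "$attempts" ]; then sleep "$delay"; fi
    i=$((i + 1))
  done
  return 1
}

# Example (not executed here): allow roughly five minutes for a new AcrPull assignment.
# retry 10 30 docker pull <registry-name>.azurecr.io/<image>:<tag>
```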
This is the most under-diagnosed category. Your credentials are perfect, your RBAC is correct, but requests still fail, because the firewall is silently dropping them. This is the scenario where the caller IP doesn't appear in ACR diagnostic logs at all.
First, identify whether your registry has network restrictions enabled:
az acr show --name <registry-name> \
--query networkRuleSet \
--output json
If defaultAction is Deny, only explicitly allowed IP ranges or virtual network subnets can reach the registry. Check whether your CI/CD agent, AKS node, or developer machine IP falls within the allowed set. For Azure DevOps hosted agents, Microsoft publishes IP ranges weekly at the Azure IP Ranges download page. Because these ranges change every week, many teams hit this problem suddenly after months of working fine.
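Checking whether a client address falls inside an allowed range is simple integer arithmetic. The following sketch handles IPv4 only; the function names ip_to_int and ip_in_cidr are hypothetical, and a rule without a /prefix is treated as a single address.

```shell
# Hypothetical helpers: test whether an IPv4 address is covered by a CIDR rule.
ip_to_int() {
  local IFS=.
  # Split the dotted quad into positional parameters $1..$4.
  set -- $1
  echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
}

ip_in_cidr() {
  local ip net bits mask
  ip=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  # A bare address (no slash) is treated as a /32 rule.
  case "$2" in
    */*) bits="${2#*/}" ;;
    *)   bits=32 ;;
  esac
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( ip & mask )) -eq $(( net & mask )) ]
}
```

For example, ip_in_cidr 10.1.2.3 10.0.0.0/8 succeeds, while ip_in_cidr 20.37.160.1 20.37.158.0/23 fails.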
To temporarily test whether the firewall is the culprit, allow all networks (do this in a non-production registry only):
az acr update --name <registry-name> --default-action Allow
If that resolves the 401/403 errors, the firewall was blocking you. For a proper fix, add the specific IP range rather than opening to the world:
az acr network-rule add \
--name <registry-name> \
--ip-address <your-cidr-range>
If you're using private endpoints, confirm DNS resolution is working correctly. From inside the VNet, run nslookup yourregistry.azurecr.io. It must resolve to the private IP (typically in the 10.x.x.x range), not the public IP. If it resolves to the public IP, your private DNS zone (privatelink.azurecr.io) is either missing or not linked to the VNet.
When Kubernetes pods are stuck in ImagePullBackOff with an ACR image, the problem is almost always one of three things: the AKS cluster doesn't have ACR pull permission, the image tag doesn't exist, or a private endpoint configuration is incomplete.
The fastest way to check AKS–ACR attachment:
az aks check-acr \
--name <aks-cluster-name> \
--resource-group <rg-name> \
--acr <registry-name>.azurecr.io
This command runs end-to-end connectivity and auth checks from within the cluster's node pool. It will tell you directly whether the node pool's managed identity has AcrPull assigned and whether DNS/network path is working. If the command flags a missing role, attach ACR to AKS in one step:
az aks update \
--name <aks-cluster-name> \
--resource-group <rg-name> \
--attach-acr <registry-name>
For a more detailed failure message, describe the failing pod and look at the events section:
kubectl describe pod <pod-name> -n <namespace>
In the Events output, look for the exact HTTP error. Failed to pull image "yourregistry.azurecr.io/image:tag": rpc error: code = Unknown desc = failed to pull and unpack image: ... 401 Unauthorized points to RBAC. If you see dial tcp: lookup yourregistry.azurecr.io: no such host, it's a DNS resolution failure from within the cluster, usually a private endpoint DNS linkage issue. And if you see manifest unknown, the image tag simply doesn't exist in the registry; check az acr repository show-tags --name <registry-name> --repository <image-name> to confirm what tags are actually present.
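When many pods fail at once, sorting the event messages by cause is easier with a small classifier. This sketch matches only the three message fragments discussed above; the function name classify_pull_error and its output labels are illustrative.

```shell
# Hypothetical helper: map a kubelet image-pull event message to a probable cause.
classify_pull_error() {
  case "$1" in
    *"401 Unauthorized"*) echo "rbac" ;;         # identity lacks pull permission
    *"no such host"*)     echo "dns" ;;          # registry name did not resolve in-cluster
    *"manifest unknown"*) echo "missing-tag" ;;  # tag is absent from the repository
    *)                    echo "unknown" ;;
  esac
}

# Example (not executed here): classify the latest events of one pod.
# kubectl describe pod <pod-name> -n <namespace> | grep "Failed to pull" |
#   while read -r line; do classify_pull_error "$line"; done
```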
Push failures have a different failure profile than pull failures. If docker push starts uploading layers but then fails partway through, storage quota is the most likely culprit. The Basic ACR SKU has a 10 GB storage limit. Standard is 100 GB. Premium is 500 GB. You can check current usage:
az acr show-usage --name <registry-name> --output table
If StorageBytes usage is near or at the limit, you have two options: clean up untagged images and old manifests, or upgrade the SKU. For cleanup:
# Purge tags older than 30 days (keeping the 10 newest per repository) and all untagged manifests
az acr run \
--registry <registry-name> \
--cmd "acr purge --filter '.*:.*' --untagged --keep 10 --ago 30d" \
/dev/null
The --keep 10 flag retains the 10 most recent tagged images per repository. The --ago 30d flag limits deletion to images last updated more than 30 days ago. Adjust these values to match your retention policy. To upgrade SKU instead:
az acr update --name <registry-name> --sku Standard
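To decide between cleanup and an upgrade, it helps to turn the raw byte counts into a percentage. The sketch below assumes you pass it the current value and limit that az acr show-usage reports for storage; the function name usage_percent and the 80 percent threshold in the example are assumptions.

```shell
# Hypothetical helper: integer percentage of a quota that is in use.
# $1 = current usage in bytes, $2 = limit in bytes.
usage_percent() {
  local current="$1" limit="$2"
  # Refuse to divide by zero when no limit is reported.
  [ "$limit" -gt 0 ] || return 1
  echo $(( current * 100 / limit ))
}

# Example (not executed here): warn once storage passes 80 percent.
# pct=$(usage_percent "$current_bytes" "$limit_bytes")
# if [ "$pct" -ge 80 ]; then echo "registry storage at ${pct}%"; fi
```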
For repository-scoped push failures specifically, check whether you're using a repository-scoped token rather than a full registry identity. Repository-scoped tokens in ACR can have content/read and content/write actions configured separately. If the token was provisioned with only content/read, push operations will return 403 DENIED even though pulls work fine. Review token permissions in the portal under Repository permissions > Tokens or via:
az acr token show \
--name <token-name> \
--registry <registry-name> \
--output json
Check the scopeMapProperties section; it lists exactly which actions are permitted per repository. Add content/write if push is required; content/delete is only needed when the token must also remove images.
Advanced Troubleshooting
If the step-by-step fixes above haven't resolved your Azure Container Registry troubleshooting issue, you're likely dealing with a policy-level block, a geo-replication inconsistency, or an enterprise network architecture problem. Here's where to dig deeper.
Azure Policy Enforcement
In enterprise environments, Azure Policy assignments can silently block ACR operations. A common one is a policy requiring private endpoints on all PaaS services. If your registry doesn't have a private endpoint configured, the policy may prevent you from pulling or pushing images even if authentication is correct. Check for policy compliance issues on your registry:
az policy state list \
--resource $(az acr show --name <registry-name> --query id -o tsv) \
--filter "complianceState eq 'NonCompliant'" \
--output table
Any non-compliant policies listed here may be what is blocking your registry operations. Talk to your Azure governance team before modifying policies in shared environments.
Geo-Replication Consistency Issues
If you're using a Premium SKU registry with geo-replication enabled and seeing intermittent pull failures, you may be hitting a replication lag issue. A client in East US pulling from a registry that was just pushed from West Europe might be hitting the East US replica before the manifest has fully replicated. Check replication health:
az acr replication list \
--registry <registry-name> \
--output table
Look for replicas with a provisioningState of Failed or Updating. A replication in a bad state will cause all pull requests routed to that region to fail. You can delete and re-create a broken replica:
az acr replication delete --registry <registry-name> --name <replication-name>
az acr replication create --registry <registry-name> --location <azure-region>
Event Viewer and Windows-Side Debugging
If you're running Docker Desktop on Windows and seeing ACR errors in local development, check the Windows Application Event Log. Open Event Viewer > Windows Logs > Application and filter by Source docker; errors logged around the time of the failed login often point to credential helper failures. The Windows Credential Manager (Control Panel > Credential Manager > Windows Credentials) may be caching an old ACR token under a https://yourregistry.azurecr.io entry. Remove it and re-run az acr login.
Entra ID Conditional Access Blocking Service Principals
If your tenant has Conditional Access policies requiring MFA or compliant devices, and you're using a service principal for ACR auth, those policies can start blocking token issuance if they're misconfigured to apply to service principals. This shows up as AADSTS53003: Access has been blocked by Conditional Access policies in the underlying OAuth error, even though the Docker client just shows 401. The fix is to create a named location or service principal exclusion in the Conditional Access policy. Work with your identity team on this one.
Prevention & Best Practices
Most ACR incidents I've responded to were preventable. The pattern is almost always the same: a system that worked fine for months hits a change that wasn't tracked, such as a secret that expired, a network rule that was tightened, or a SKU limit that crept up slowly. Here's how to stay ahead of it.
Use managed identities instead of service principal secrets wherever possible. Managed identities don't have passwords that expire and don't require credential rotation. For AKS, App Service, Container Instances, and Azure Functions, there's no good reason to use a service principal for ACR auth. The az aks update --attach-acr command sets this up in one step. You eliminate an entire class of "it was working yesterday" failures.
Set up Azure Monitor alerts on ACR authentication failures. Create an alert rule on the ContainerRegistryLoginEvents table that fires when failed logins exceed 10 in a 5-minute window. You want to know about this before a broken deployment takes down a service. In the portal, go to Monitor > Alerts > Create > Alert rule, select your Log Analytics workspace, and use a custom log search query.
Automate image lifecycle management. The single best way to avoid storage quota failures is to run acr purge on a schedule. Configure it as an ACR Task that runs nightly:
az acr task create \
--registry <registry-name> \
--name purge-untagged \
--cmd "acr purge --filter '.*:.*' --untagged --keep 5 --ago 14d" \
--schedule "0 1 * * *" \
--context /dev/null
Document and version your network rules. Store your ACR network rule set in infrastructure-as-code (Bicep, Terraform, or ARM) and treat any manual change to it as a deployment that goes through your normal change process. Ad-hoc firewall rule additions that aren't tracked are a leading cause of "we didn't change anything" outages.
- Run az acr check-health --name <registry-name> monthly; it checks connectivity, auth config, and DNS resolution in one command and flags known misconfiguration patterns
- Set calendar reminders 30 days before service principal secret expiry; az ad sp credential list shows the expiry date, and you can pipe it to a Logic App for automated alerting
- Upgrade from Basic to Standard SKU for any registry used in production CI/CD; the concurrency and storage difference is significant and Basic SKU rate limits hit hard under parallel build load
- Enable the quarantine policy on Premium SKU registries; it prevents images from being pulled until a vulnerability scan completes, which gives you a security gate without custom tooling
Frequently Asked Questions
Why does az acr login work fine but docker push still fails with "denied"?
This is almost always a Docker credential helper conflict. az acr login writes a token to the Docker credential store, but if your ~/.docker/config.json has a credsStore or credHelpers entry pointing to a system credential manager (like wincred on Windows or osxkeychain on Mac), Docker may be reading an older cached credential from that store instead of the fresh token. Run docker logout yourregistry.azurecr.io, then clear the stored credential from your OS credential manager, and then re-run az acr login --name <registry>. The push should succeed after that.
My AKS pods get ImagePullBackOff even though az aks check-acr passes. What else could it be?
The most common culprit after RBAC checks out is that the image tag you're referencing in your Kubernetes manifest doesn't exist in the registry. Run az acr repository show-tags --name <registry> --repository <image-name> --orderby time_desc --output table to confirm the exact tags available. The second possibility is a Windows node pool trying to pull a Linux container image. Windows nodes can only pull Windows-based images and will fail silently with a confusing error. Check your node selector and toleration configuration in the pod spec.
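A quick way to see whether a credential helper is in play is to look for the relevant keys in the Docker client config. This is a rough sketch that uses a text match rather than a JSON parser; the function name uses_credential_helper is hypothetical, and the default path assumes a standard Docker client setup.

```shell
# Hypothetical helper: succeed if the Docker client config delegates to a credential helper.
# $1 = optional path to config.json (defaults to the per-user Docker config).
uses_credential_helper() {
  local config="${1:-$HOME/.docker/config.json}"
  # Look for either key followed by a colon, ignoring whitespace in between.
  [ -f "$config" ] && grep -Eq '"(credsStore|credHelpers)"[[:space:]]*:' "$config"
}

# Example (not executed here):
# if uses_credential_helper; then echo "credential helper configured; clear stale entries"; fi
```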
How do I fix "ERROR: The registry login server 'x.azurecr.io' is not accessible" in Azure DevOps pipelines?
This error from the Azure DevOps Docker task means the pipeline agent can't reach the registry endpoint at the network level; it's not an authentication error at all. If you're using Microsoft-hosted agents, confirm your registry isn't locked down to specific IP ranges, since hosted agent IPs change weekly. Download the current Azure IP JSON file, find your agent pool's service tag (e.g., AzureDevOps), and add those ranges to your ACR network rules. If you're using self-hosted agents, check whether the agent machine can reach yourregistry.azurecr.io on port 443; proxy configurations and NSG rules are the usual blockers.
Can I use ACR without enabling the admin user account?
Yes, and you should. The admin account is a single shared credential that anyone with access to the registry settings page can see. For any automated workload, use a service principal with AcrPull/AcrPush assigned at registry scope, or better yet, use managed identity if the consuming service supports it. For individual developer access, use az acr login, which uses your personal Entra ID identity. The admin account is really only appropriate for quick ad-hoc testing or legacy systems that can't use token-based authentication.
Why am I getting "toomanyrequests" errors from my ACR on Basic tier?
Basic SKU ACR enforces throttling limits on concurrent operations, specifically around 10 concurrent pull requests at the registry level. If you have a large AKS cluster scaling up rapidly and all nodes try to pull the same large image simultaneously, you'll hit this. The fix is to upgrade to Standard or Premium SKU, which have significantly higher concurrency limits. As a short-term workaround, enable image caching at the Kubernetes level using a DaemonSet-based pre-pull mechanism, or configure your HPA to scale more gradually to spread out the pull requests over time.
How do I transfer images from one Azure Container Registry to another without re-pulling locally?
Use the az acr import command. It transfers directly between registries server-side without touching your local machine or network bandwidth. The source registry identity needs AcrPull permissions, and the destination registry identity needs AcrPush. Run: az acr import --name <dest-registry> --source <source-registry>.azurecr.io/image:tag --image image:tag. If the source registry has network restrictions, you may need to temporarily allow access from Azure's service IPs or run the command from within the same VNet using Azure Cloud Shell attached to the VNet.