Azure Container Instances CrashLoopBackOff, ImagePullBackOff & Pod Fix Guide

Microsoft Fix · Intermediate · 14 min read · Official Docs Grounded · Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why This Is Happening

I've seen this exact scenario play out dozens of times: you spend an hour configuring your Azure Container Instances deployment, you hit "Create" or fire off your az container create command, and within seconds you're staring at an error that Microsoft's portal makes almost no effort to explain. Whether it's a CrashLoopBackOff, an ImagePullBackOff, a ContainerGroupQuotaReached, or a silent GPU provisioning failure, the frustration is real, especially when your work is blocked and a deadline is looming.

Azure Container Instances is genuinely one of the fastest ways to run containerized workloads in the cloud. No Kubernetes cluster to manage, no node pools to scale; just bring a container image and go. But that simplicity hides a surprising number of sharp edges. The error messages that surface when things break are often cryptic, terse, and designed more for internal Azure logging than for the human being trying to debug their deployment at 11 p.m.

Let's talk about what's actually going on under the hood. There are four major buckets that cause most ACI deployment failures:

1. Image pull failures. Your container group can't reach the registry, doesn't have the right credentials, or is pointing at an image tag that doesn't exist. This manifests as an ImagePullBackOff state: ACI tried to pull your image, failed, and is now backing off before retrying.

2. Container runtime crashes. The image pulled fine, but the container process exited immediately or keeps crashing on startup. This is the ACI equivalent of Kubernetes' CrashLoopBackOff. It usually means your entrypoint command is wrong, an environment variable is missing, or the application itself is failing to initialize.

3. Quota and subscription limits. Azure enforces default limits on how many container groups you can run per region and per subscription. When you hit those limits, you get a ContainerGroupQuotaReached error that looks alarming but is actually fixable, because the limits can be raised on request. Spot containers have their own separate quota, StandardSpotCores, whose default depends on your subscription type and can be as low as zero.

4. GPU provisioning failures. If you're running GPU-accelerated workloads, there's an additional layer of complexity: the right GPU SKU has to be available in your chosen region, your container image needs the right NVIDIA drivers, and the CUDA or TensorRT libraries must be present. Miss any one of those and your container won't start.

None of these root causes is obvious from the error messages alone. That's what this guide is for: working through each failure mode with specific, tested solutions that actually get Azure Container Instances running again.

The Quick Fix: Try This First

If you're getting a general deployment failure or your container group is stuck in a bad state, start here before diving into the deeper steps. This single sequence resolves the majority of Azure Container Instances failures I see in practice.

Step 1: Pull the container logs immediately. Before you change anything, grab the logs from your failed container. Run this in Azure CLI:

az container logs --resource-group MyResourceGroup --name myapp

If the container started and then crashed, the application logs will usually tell you what went wrong. Nine times out of ten, the answer is right there: a missing environment variable, a bad database connection string, or an entrypoint script that exits with code 1 because of a typo.

Step 2: Check the container events. Logs only appear if the container actually started. If you're getting an ImagePullBackOff or the container never started at all, check events instead:

az container show --resource-group MyResourceGroup --name myapp --query "containers[0].instanceView.events" -o table

This surfaces the low-level provisioning events, including authentication failures against your container registry, image not found errors, and GPU allocation failures.
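Because the next step deletes the container group, it can help to save both outputs to disk first. The function below is a minimal sketch of that idea, wrapping the same two commands from Steps 1 and 2; the function name snapshot_aci and the default output directory are my own choices, not part of the Azure CLI.

```shell
# Save logs and events for one container group before changing anything.
# Usage: snapshot_aci <resource-group> <container-group> [output-dir]
# (snapshot_aci and ./aci-snapshot are illustrative names, not Azure CLI features.)
snapshot_aci() {
  rg=$1
  name=$2
  out=${3:-./aci-snapshot}
  mkdir -p "$out" || return 1

  # Logs exist only if the container process actually started.
  az container logs --resource-group "$rg" --name "$name" \
    > "$out/logs.txt" 2>&1

  # Events cover image pulls and provisioning, even when there are no logs.
  az container show --resource-group "$rg" --name "$name" \
    --query "containers[0].instanceView.events" --output json \
    > "$out/events.json" 2>&1

  echo "Saved logs and events to $out"
}
```

Run it as snapshot_aci MyResourceGroup myapp, then read logs.txt and events.json at your leisure once the group has been recreated.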

Step 3: Delete and recreate the container group. ACI container groups are largely immutable once created. If you hit a bad state, the cleanest path forward is often to delete the group and redeploy:

az container delete --resource-group MyResourceGroup --name myapp --yes
az container create --resource-group MyResourceGroup --name myapp --image myregistry.azurecr.io/myapp:1.0.0 --cpu 2 --memory 4

This sounds heavy-handed, but ACI is stateless by design. Container groups don't persist data between restarts unless you've mounted an Azure File Share, so deleting and recreating is safe and fast, typically under 60 seconds.

Pro Tip

Always specify an explicit image tag; never use :latest in production ACI deployments. The :latest tag is the single biggest source of ImagePullBackOff errors I've seen, because the tag can point at different images over time and across environments, and caching behavior is unpredictable. Pin to a specific digest or version tag like myimage:1.4.2 and your deployments become dramatically more predictable.
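If you want to enforce this in a deploy script, a small guard can refuse unpinned references before az container create ever runs. This is a sketch under my own definition of "pinned" (an explicit tag other than latest, or a sha256 digest); the function name is_pinned_image is hypothetical.

```shell
# Return 0 if the image reference is pinned, 1 otherwise.
# "Pinned" here means: a sha256 digest, or an explicit tag other than "latest".
is_pinned_image() {
  image=$1
  case "$image" in
    *@sha256:*) return 0 ;;   # digest references never move
  esac
  # Look only at the last path segment, so a registry port such as
  # localhost:5000/myapp is not mistaken for a tag.
  last_segment=${image##*/}
  case "$last_segment" in
    *:latest) return 1 ;;     # explicit :latest
    *:?*)     return 0 ;;     # any other explicit tag
    *)        return 1 ;;     # no tag at all resolves to :latest
  esac
}

# Example guard in a deploy script:
# is_pinned_image "$IMAGE" || { echo "refusing unpinned image: $IMAGE" >&2; exit 1; }
```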

Diagnose the Exact Failure: Read the Container State

The first thing to nail down is whether you're dealing with an image pull failure, a runtime crash, a quota error, or a GPU provisioning issue. Each one has a different fix path. Mixing them up wastes time.

Go to the Azure portal, navigate to Container Instances → [your container group] → Overview. Under "Containers," look at the Status column. You'll see one of several states:

Waiting: the container hasn't started yet, likely stuck on image pull
Running: the container started but may still be crashing in a loop
Terminated: the container exited; check the exit code

Click on the container name, then select the Logs tab and the Events tab. The Events tab is especially important for ACI ImagePullBackOff scenarios. It will show you messages like Failed to pull image: unauthorized: authentication required or manifest unknown: manifest tagged by 'latest' is not found.

From the CLI, you can get a full JSON dump of the container state:

az container show \
  --resource-group MyResourceGroup \
  --name myapp \
  --output json

Look at the instanceView.currentState and instanceView.previousState fields under each entry in the containers array. The detailStatus field within those objects often contains the human-readable reason for a failure that the portal summary doesn't show.

If you see a ContainerGroupQuotaReached error, the output will look something like: "Code: ContainerGroupQuotaReached, Message: Resource type 'Microsoft.ContainerInstance/containerGroups' container group quota 'StandardSpotCores' exceeded in region...". At that point, jump straight to the ContainerGroupQuotaReached section below. For image and runtime issues, continue with the next two sections.

What you should see when this step works: You've identified whether the failure is at image pull time, container startup, or quota enforcement. That clarity cuts your troubleshooting time in half.
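The triage in this step can be sketched as a small function that maps a message (and, optionally, an exit code) to one of the failure buckets. The patterns are taken from the example messages quoted in this guide; real ACI messages vary, so treat the matching as a starting point rather than a complete list. The name classify_aci_failure is my own.

```shell
# Map an ACI event or error message (plus an optional exit code) to a bucket.
# Usage: classify_aci_failure "<message>" [exit-code]
classify_aci_failure() {
  message=$1
  exit_code=${2:-}
  case "$message" in
    *ContainerGroupQuotaReached*)
      echo "quota"
      return 0 ;;
    *"Failed to pull image"*|*"authentication required"*|*"manifest unknown"*)
      echo "image-pull"
      return 0 ;;
  esac
  # No recognisable provisioning message: fall back to the exit code, if any.
  if [ -n "$exit_code" ] && [ "$exit_code" != "0" ]; then
    echo "runtime-crash"
  else
    echo "unknown"
  fi
}
```

For example, classify_aci_failure "Failed to pull image: unauthorized: authentication required" prints image-pull, while an empty message with exit code 137 prints runtime-crash.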

Fix Azure Container Instances Image Pull Failures

If the events show an authentication error or a missing manifest, the fix depends on where your image lives.

For Azure Container Registry (ACR): You need to attach the registry credentials to your container group. For production, a managed identity is the cleaner option; the ACR admin account is the quickest way to confirm that credentials are the problem, and it is what the commands below use. First, enable admin on your registry if it isn't already:

az acr update --name myregistry --admin-enabled true

Then pull the credentials:

az acr credential show --name myregistry

Now recreate your container group with those credentials explicitly passed:

az container create \
  --resource-group MyResourceGroup \
  --name myapp \
  --image myregistry.azurecr.io/myimage:1.0.0 \
  --registry-login-server myregistry.azurecr.io \
  --registry-username <username> \
  --registry-password <password> \
  --cpu 2 --memory 4
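To avoid copying the username and password by hand, the two lookups and the create call can be chained. This is a rough sketch assuming the admin account is enabled and that the registry's login server is <name>.azurecr.io; the function name and the --os-type Linux flag are my additions. The password still ends up on the az command line, so treat this as a debugging aid, not a production pattern.

```shell
# Fetch ACR admin credentials and pass them straight to az container create.
# Usage: create_with_acr_admin_creds <registry-name> <resource-group> <group-name> <image:tag>
create_with_acr_admin_creds() {
  registry=$1
  rg=$2
  name=$3
  image=$4

  username=$(az acr credential show --name "$registry" \
    --query username --output tsv) || return 1
  password=$(az acr credential show --name "$registry" \
    --query "passwords[0].value" --output tsv) || return 1

  # Assumes the default login server name <registry>.azurecr.io.
  az container create \
    --resource-group "$rg" --name "$name" \
    --image "$registry.azurecr.io/$image" \
    --os-type Linux \
    --registry-login-server "$registry.azurecr.io" \
    --registry-username "$username" --registry-password "$password" \
    --cpu 2 --memory 4
}
```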

For Docker Hub or other public registries: If you're getting rate-limit errors (Docker Hub introduced aggressive pull limits for unauthenticated requests), pass your Docker Hub credentials the same way, using --registry-login-server index.docker.io.

For private registries behind a firewall: Make sure your ACI container group has network access to the registry endpoint. If you're using a VNet-injected container group, verify that the subnet's NSG rules allow outbound HTTPS on port 443 to your registry's IP range.

For image tag errors: Run az acr repository show-tags --name myregistry --repository myimage --output table to list all available tags and confirm the one you're using actually exists. A misspelled tag produces a manifest unknown error that is easy to mistake for an auth failure at a glance, so read the event message itself rather than the summary status.

What success looks like: The Events tab shows Pulling image... followed by Successfully pulled image and the container moves to Running state.

Fix CrashLoopBackOff: Stop the Container From Dying at Startup

The image pulled successfully, but your container keeps crashing. This is the ACI equivalent of a CrashLoopBackOff, and it almost always comes down to one of three things: a bad entrypoint command, missing environment variables, or an application that depends on a service it can't reach.

Check the exit code first. From the CLI:

az container show \
  --resource-group MyResourceGroup \
  --name myapp \
  --query "containers[0].instanceView.currentState" \
  --output json

An exit code of 1 is a general application error; check your app logs. Exit code 127 means the command wasn't found (wrong entrypoint). Exit code 137 means the container was killed, often due to an OOM (out of memory) condition; increase your memory allocation.

Temporarily override the entrypoint for debugging. To get an interactive shell inside your container image and poke around, create a debug instance with the command overridden so the container just idles:
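The exit codes above can be kept in a small lookup helper for scripts or runbooks. This sketch covers the codes named in this guide and adds one assumption of mine: by common shell convention, a code above 128 means the process was terminated by signal (code minus 128), which is why 137 corresponds to SIGKILL (128 + 9).

```shell
# Translate a container exit code into a first-guess diagnosis.
# Usage: explain_exit_code <code>
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly" ;;
    1)   echo "general application error: check the app logs" ;;
    127) echo "command not found: check the entrypoint or --command-line" ;;
    137) echo "killed by SIGKILL (128+9): often out of memory, try more --memory" ;;
    *)
      # Convention, not an ACI guarantee: codes above 128 are 128 + signal number.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-specific exit code $code"
      fi ;;
  esac
}
```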

az container create \
  --resource-group MyResourceGroup \
  --name myapp-debug \
  --image myregistry.azurecr.io/myimage:1.0.0 \
  --command-line "tail -f /dev/null" \
  --cpu 1 --memory 2

Then exec into it:

az container exec \
  --resource-group MyResourceGroup \
  --name myapp-debug \
  --exec-command "/bin/sh"

From inside, you can manually run your entrypoint script, check environment variables with env, test network connectivity, and confirm that required files are in place. This approach has saved me hours of guessing.
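Once you have a shell in the debug container, checking for missing environment variables is quicker with a helper than by reading the full env output. A minimal sketch follows; the function name is mine and the variable names in the usage line are placeholders for whatever your application actually requires. Note that it only checks presence, so a variable set to an empty string still passes.

```shell
# Report which of the named environment variables are absent.
# Usage: check_required_env VAR1 VAR2 ...
# Returns 0 if all are present, 1 if any is missing.
check_required_env() {
  missing=0
  for var in "$@"; do
    # printenv exits non-zero when the variable is not in the environment.
    if ! printenv "$var" >/dev/null; then
      echo "missing: $var" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Example (placeholder names): check_required_env DATABASE_URL API_KEY
```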

What success looks like: You've identified why the container exits immediately and fixed it, either by correcting the command-line argument, adding missing environment variables via --environment-variables, or increasing the CPU/memory allocation.

Resolve ContainerGroupQuotaReached and StandardSpotCores Limit Errors

If you're deploying Spot containers and hitting a ContainerGroupQuotaReached error, the issue is that your Azure subscription has a default cap on Spot core usage that you've exceeded. This is separate from your regular container group quota.

The error message itself is actually informative: it tells you your limit, your current usage, and what you requested.

Code: ContainerGroupQuotaReached
Message: Resource type 'Microsoft.ContainerInstance/containerGroups'
container group quota 'StandardSpotCores' exceeded in region 'eastus'.
Limit: '100', Usage: '12', Requested: '90'.
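The three numbers in that message are enough to work out how far over the quota a request is. The sketch below parses them with sed, assuming the message keeps the Limit / Usage / Requested format shown above; the function name quota_shortfall is hypothetical.

```shell
# Parse Limit/Usage/Requested out of a ContainerGroupQuotaReached message.
# Usage: quota_shortfall "<error message>"
quota_shortfall() {
  msg=$1
  limit=$(printf '%s\n' "$msg" | sed -n "s/.*Limit: '\([0-9][0-9]*\)'.*/\1/p")
  usage=$(printf '%s\n' "$msg" | sed -n "s/.*Usage: '\([0-9][0-9]*\)'.*/\1/p")
  requested=$(printf '%s\n' "$msg" | sed -n "s/.*Requested: '\([0-9][0-9]*\)'.*/\1/p")

  if [ -z "$limit" ] || [ -z "$usage" ] || [ -z "$requested" ]; then
    echo "could not parse quota message" >&2
    return 1
  fi

  available=$((limit - usage))
  echo "available=$available requested=$requested over_by=$((requested - available))"
}
```

With the example message above (limit 100, usage 12, requested 90), 88 cores are available and the request is over by 2, so either trimming the request to 88 cores or raising the limit resolves it.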

Here's the quick breakdown of default limits by subscription type: Enterprise Agreement subscriptions get 100 StandardSpotCores, Default (Pay-As-You-Go) subscriptions get 10, and all other subscription types get 0. If you're on an Other plan trying to use Spot containers, you'll need to change your subscription type first.

To check which subscription type you're on, go to Azure Portal → Cost Management + Billing → Billing accounts. Your account type is shown in the Properties panel.

There are three paths forward depending on your situation:

Option A, Switch to a Default subscription: This gets you 10 StandardSpotCores. Enough for testing, probably not for production workloads.

Option B, Switch to Enterprise Agreement: This bumps you to 100 cores. The right move if you're running production-scale Spot workloads.

Option C, File a quota increase request: For anything beyond the EA default, you'll need to open a support request with Microsoft. Go to Azure Portal → Help + Support → New Support Request, set Issue Type to "Service and subscription limits (quotas)", and select Container Instances. Be specific about the region and the core count you need.

One important thing to keep in mind: ACI Spot containers are still in public preview as of early 2026. Microsoft explicitly doesn't recommend them for production scenarios. If you need guaranteed capacity, use standard (non-Spot) container groups instead.

What success looks like: After quota adjustment, your az container create --priority Spot command completes without error and the container group shows in Running state.

Fix Azure Container Instances GPU Failures: Drivers, SKUs, and Region Support

GPU-enabled Azure Container Instances have their own unique failure modes, and they're some of the most confusing because the errors often look like generic provisioning failures. There are three distinct things that have to be right simultaneously: the GPU SKU has to exist in your region, your container image has to have the right drivers, and the NVIDIA toolkit has to be installed correctly.

Check GPU SKU region availability first. Not every Azure region supports GPU SKUs for ACI. As of January 2026, V100 GPU SKUs for Linux containers are available in: East US, West Europe, West US 2, Southeast Asia, and Central India. That's it. If you're trying to deploy a GPU container group in, say, North Europe or Australia East, it won't work. You need to either move to a supported region or choose a GPU SKU that's actually available there.

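To verify what's available in your target region, query the ACI location capabilities endpoint. As far as I can tell, the az container command group has no dedicated capabilities subcommand, so this goes through az rest; treat the api-version below as an assumption and check it against the current REST reference:

az rest --method get \
  --url "https://management.azure.com/subscriptions/{subscriptionId}/providers/Microsoft.ContainerInstance/locations/eastus/capabilities?api-version=2023-05-01" \
  --query "value[?gpu!='None'].{os:osType, gpu:gpu, maxGpuCount:capabilities.maxGpuCount}" \
  --output table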
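For a quick offline pre-check in a deploy script, the region list quoted above can be hard-coded. This is only a sketch: the list reflects what this guide states as of January 2026 and will go stale, so treat a positive result as "worth trying", not a guarantee. The function name is my own.

```shell
# Return 0 if the region is on this guide's V100 list (as of January 2026).
# Accepts display names ("East US") or CLI names ("eastus").
is_v100_region() {
  # Normalise: lower-case and strip spaces so both spellings match.
  region=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr -d ' ')
  case "$region" in
    eastus|westeurope|westus2|southeastasia|centralindia) return 0 ;;
    *) return 1 ;;
  esac
}

# Example guard:
# is_v100_region "$LOCATION" || echo "warning: $LOCATION is not on the V100 list" >&2
```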

Fix your container image's GPU drivers. Even if the region and SKU are right, your container image needs NVIDIA drivers and either CUDA or TensorRT libraries installed. Without these, the container will start but won't be able to access the GPU hardware at all, and often exits with a cryptic error.

There are two solid options here. First, you can use the NVIDIA Container Toolkit by adding it to your Dockerfile:

FROM nvidia/cuda:12.3.0-base-ubuntu22.04
RUN apt-get update && apt-get install -y nvidia-container-toolkit

Second, and often easier, is to pull a prebuilt image from the NVIDIA GPU Cloud (NGC) repository. NGC has ready-to-run images for PyTorch, TensorFlow, TensorRT, and many other frameworks that already have correct driver versions baked in. Search at catalog.ngc.nvidia.com for your framework.

For Azure ML workloads, Microsoft also maintains base images with correct GPU driver versions pre-installed. These are available through the Azure Machine Learning registry and are validated to work with ACI GPU SKUs.

What success looks like: Your GPU container starts successfully and running nvidia-smi inside the container returns the GPU device info, driver version, and CUDA version without errors.

Advanced Troubleshooting

If you've worked through the five steps above and you're still stuck, here's where things get more nuanced. These scenarios come up less often, but when they do, they're particularly hard to diagnose without knowing where to look.

Networking and VNet Injection Issues

When you deploy Azure Container Instances into a virtual network (VNet injection), you introduce a whole new set of potential failure modes. The container group gets a private IP on your subnet, which means DNS resolution, outbound internet access, and registry connectivity all depend on your VNet configuration being correct.

If your containers can't pull images or connect to external services after VNet injection, check these in order. First, verify the subnet has been delegated to Microsoft.ContainerInstance/containerGroups. Go to Portal → Virtual Networks → [your VNet] → Subnets → [your subnet] and look at "Subnet delegation." Second, check your NSG rules: outbound HTTPS (port 443) to your container registry and outbound DNS (port 53) to your DNS server must both be allowed. Third, if you're using a custom DNS server on the VNet, make sure it can resolve public hostnames like mcr.microsoft.com and your ACR endpoint.
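The first of those checks, subnet delegation, can also be done from the CLI instead of the portal. A minimal sketch, assuming az network vnet subnet show returns the delegation's serviceName as described; the function name check_aci_delegation is hypothetical.

```shell
# Check whether a subnet is delegated to ACI.
# Usage: check_aci_delegation <resource-group> <vnet-name> <subnet-name>
check_aci_delegation() {
  rg=$1
  vnet=$2
  subnet=$3

  delegations=$(az network vnet subnet show \
    --resource-group "$rg" --vnet-name "$vnet" --name "$subnet" \
    --query "delegations[].serviceName" --output tsv) || return 1

  case "$delegations" in
    *Microsoft.ContainerInstance/containerGroups*)
      echo "delegated"
      return 0 ;;
    *)
      echo "not delegated to Microsoft.ContainerInstance/containerGroups"
      return 1 ;;
  esac
}
```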

Persistent "ServiceUnavailable" Quota Errors Across Multiple Regions

Sometimes a ServiceUnavailable error about ContainerGroups quota being exceeded appears even when you're not deploying Spot containers and you haven't intentionally created many groups. This usually means you've hit the per-region default limit on the total number of container groups, not just Spot cores.

To check your current container group count in a region:

az container list \
  --query "[?location=='eastus'] | length(@)" \
  --output tsv

Clean up any container groups left over from previous deployments that aren't actively running. Then, if you legitimately need more groups than the default allows, open a quota increase request the same way as for Spot cores, via Help + Support → Service and subscription limits.

Container Group Restart Policy Confusion

ACI has three restart policies: Always (the default), OnFailure, and Never. If your container is a one-off task (a migration script, a batch job) and the policy is Always, whether you set it or simply inherited the default, ACI restarts the container after every completion even though it succeeded, which looks exactly like a crash loop. Check your restart policy:

az container show \
  --resource-group MyResourceGroup \
  --name myapp \
  --query "restartPolicy"

For batch jobs, use --restart-policy Never or --restart-policy OnFailure.
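Put together, a one-off job can be deployed with OnFailure and then polled until it terminates. The sketch below assumes a single-container group and a Linux image; the function names, the 1 CPU / 1 GB sizing, and the 10-second polling interval are arbitrary choices of mine.

```shell
# Deploy a one-off job that is retried only if it fails.
# Usage: run_batch_job <resource-group> <group-name> <image>
run_batch_job() {
  az container create \
    --resource-group "$1" --name "$2" --image "$3" \
    --os-type Linux \
    --restart-policy OnFailure \
    --cpu 1 --memory 1
}

# Poll until the first container reaches Terminated, then print its exit code.
# Usage: wait_for_job <resource-group> <group-name> [max-tries]
wait_for_job() {
  rg=$1
  name=$2
  tries=${3:-30}
  while [ "$tries" -gt 0 ]; do
    state=$(az container show --resource-group "$rg" --name "$name" \
      --query "containers[0].instanceView.currentState.state" --output tsv)
    if [ "$state" = "Terminated" ]; then
      az container show --resource-group "$rg" --name "$name" \
        --query "containers[0].instanceView.currentState.exitCode" --output tsv
      return 0
    fi
    tries=$((tries - 1))
    sleep 10
  done
  echo "timed out waiting for $name" >&2
  return 1
}
```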

Multi-Container Group Failures

If your container group has multiple containers (a sidecar pattern, for example), a failure in any one container can prevent the whole group from becoming healthy. Use az container show and check the instanceView for each container individually, not just the first one. The failing container isn't always the one you'd expect.
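One way to see every container at once is a JMESPath projection over the whole containers array rather than containers[0]. A sketch, with a function name of my own choosing:

```shell
# Show name, state, exit code, and detail status for every container in a group.
# Usage: show_all_container_states <resource-group> <group-name>
show_all_container_states() {
  az container show --resource-group "$1" --name "$2" \
    --query "containers[].{name:name, state:instanceView.currentState.state, exitCode:instanceView.currentState.exitCode, detail:instanceView.currentState.detailStatus}" \
    --output table
}
```

Any row whose state is not Running, or whose exit code is non-zero, is the one to pull logs for with az container logs --container-name.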

When to Call Microsoft Support

If you've confirmed your image is correct, your quota is sufficient, your GPU SKU is in a supported region, and you're still getting unexplained provisioning failures, it's time to escalate. Azure Container Instances provisioning happens on Microsoft's backend infrastructure, and occasionally there are platform-level issues in specific regions that only Microsoft can see. Open a ticket via Microsoft Support. Include your container group resource ID (from az container show --query id), the exact error message, the region, and the timestamp of the failed deployment. That information will get you a meaningful response much faster than a generic description.

Prevention & Best Practices

Most Azure Container Instances failures are preventable. After working through enough of these deployments, you start to see the same patterns over and over. Here's what the teams that never seem to hit these issues are doing differently.

Build slim, self-contained images. The more dependencies your container image has at runtime, the more things can go wrong. Where possible, bake everything your application needs into the image at build time rather than relying on startup scripts to install packages. This makes images faster to pull, reduces ImagePullBackOff risk due to network timeouts, and makes your deployments more deterministic.

Test your image locally before deploying to ACI. Run docker run --env-file .env myimage:1.0.0 locally before any ACI deployment. If it doesn't work locally, it won't work in ACI. This sounds obvious, but I've seen teams skip this step and spend two hours debugging an ACI deployment when the real problem was a missing environment variable that would have been obvious in 30 seconds of local testing.

Monitor your quota usage proactively. Don't wait for a ContainerGroupQuotaReached error to find out you're near your limits. Set up Azure Monitor alerts on your subscription's quota usage so you get notified when you're at 80% of your container group or core limits. Go to Portal → Monitor → Alerts → New Alert Rule and select "Subscription quota remaining" as the signal.

Use infrastructure-as-code for all ACI deployments. Managing container groups with a Bicep template or ARM template means you have a reliable, version-controlled record of your exact configuration. When a deployment fails, you can diff the current template against the last known good state, rather than trying to remember what you typed in the portal three weeks ago.

For GPU workloads, validate your driver stack in staging first. GPU driver compatibility between your CUDA version, the NVIDIA Container Toolkit, and the V100 hardware available in ACI is not always straightforward. Build a test image that just runs nvidia-smi and confirm it works before building your full production image on top of that base. One failed nvidia-smi check in staging saves you a broken production deployment.
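The staging check described above can be as small as an entrypoint script that fails loudly when nvidia-smi is missing or cannot see a device. A minimal sketch, with a function name I made up; the messages are illustrative and nothing here is ACI-specific.

```shell
# Smoke test for a GPU base image: succeed only if nvidia-smi exists and runs.
gpu_smoke_test() {
  if ! command -v nvidia-smi >/dev/null 2>&1; then
    echo "nvidia-smi not found: image is missing the NVIDIA user-space tools" >&2
    return 1
  fi
  if ! nvidia-smi; then
    echo "nvidia-smi failed: driver or GPU device not visible in the container" >&2
    return 1
  fi
}

# As an entrypoint: gpu_smoke_test || exit 1
```

A non-zero exit shows up in the container's currentState, so a failed smoke test is visible from az container show without opening a shell.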

Quick Wins

Always pin image tags to specific versions; never deploy with :latest in non-development environments
Set the container restart policy explicitly: OnFailure for jobs, Always only for long-running services
Pre-validate your ACR credentials with docker login before attaching them to an ACI deployment
Check GPU SKU region availability before you design your architecture around a specific region; the supported list is short and doesn't cover every popular Azure region

Frequently Asked Questions

How do I file a quota increase request for ACI Spot containers?

Go to Azure Portal → Help + Support → New Support Request. Set Issue Type to "Service and subscription limits (quotas)" and select Container Instances as the service. In the details, specify that you need an increase to the StandardSpotCores limit and include the region, your current limit, and the core count you need. Enterprise Agreement customers start with a 100-core Spot limit; Default (Pay-As-You-Go) subscribers get 10. If you need more than either of those defaults, a support ticket is the only path forward. Include a brief business justification; requests with context get processed faster.

Why does my Azure Container Instances GPU container start but can't access the GPU?

This almost always means the NVIDIA drivers or CUDA libraries aren't present in your container image. The GPU hardware gets allocated during provisioning, but without the right software stack inside the container, the container can't talk to it. Your best starting point is to rebuild your image using a base from the NVIDIA GPU Cloud (NGC) repository, which already has validated driver versions and CUDA/TensorRT libraries built in. Alternatively, you can install the NVIDIA Container Toolkit in your Dockerfile. Also double-check that you're deploying to a supported region: V100 GPUs are only available in East US, West Europe, West US 2, Southeast Asia, and Central India for Linux containers.

What's the difference between CrashLoopBackOff and ImagePullBackOff in ACI?

They fail at different stages. An ImagePullBackOff means ACI never even got the container running; it failed while trying to download the image from the registry. This is almost always a credentials problem, a network issue, or a bad image tag. A CrashLoopBackOff means the image pulled fine and the container started, but then the application inside crashed (exited with a non-zero code) and ACI is trying to restart it. Check the container events for image pull failures and check the container logs for crash details. The fix paths are completely different, so identifying which one you're dealing with first saves a lot of time.

Can I increase the default ACI container group limit beyond what's shown in my subscription?

Yes. The default container group quota is a soft limit and can be raised by filing a support request. Go to Help + Support → New Support Request, choose quota as the issue type, and select Container Instances. Be ready to specify the exact region, the quota name (either ContainerGroups or StandardSpotCores), and the new limit you need. Microsoft typically responds to quota requests within one business day for Enterprise Agreement customers. Standard (non-Spot) container group limits and Spot core limits are tracked separately, so you may need to file separate requests if you need both increased.

Are ACI Spot containers safe to use in production?

Not yet, officially. As of early 2026, ACI Spot containers are in public preview and Microsoft explicitly doesn't recommend them for production workloads. Spot containers can be evicted at any time when Azure needs capacity back, so your container group may stop without warning. They're great for batch processing, CI/CD jobs, data pipeline tasks, and anything that can tolerate interruption. For workloads that need to stay up, use standard ACI container groups. The cost savings from Spot pricing are real, but so is the eviction risk during periods of high regional demand.

Which GPU SKUs does Azure Container Instances support, and in which regions?

As of January 2026, ACI GPU support is limited to the V100 SKU on Linux containers. V100 is available in five regions: East US, West Europe, West US 2, Southeast Asia, and Central India. There is currently no GPU support for Windows containers in ACI. If you need a region that isn't on this list, you'll need to consider a different Azure service, like Azure Kubernetes Service with GPU node pools, or Azure Machine Learning compute clusters, which support GPU SKUs in more regions. Always verify current availability in the official Azure documentation before building your architecture around a specific region, as this list may change.

Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.