How to Troubleshoot Azure Container Instances

Updated April 20, 2026

What's in This Guide

Why This Is Happening
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why This Is Happening

I've seen this exact scenario more times than I can count: you spin up an Azure Container Instances group, the deployment says it succeeded, and then your container either sits in a Waiting state forever, crashes on startup, or exits with a cryptic non-zero code and no explanation. Azure's portal error messages are notoriously sparse. "Container group provisioning failed" tells you almost nothing useful, and that's genuinely infuriating when you're trying to ship something.

Azure Container Instances troubleshooting is tricky because ACI sits at an unusual intersection: it's a serverless compute platform, but it runs full container images. That means problems can originate from the image itself, from the Azure networking layer, from registry authentication, from resource quota limits, from your environment variable configuration, or from the container's own application logic. There's no single place to look.

The most common root causes I see fall into five buckets:

Image pull failures: either the image tag doesn't exist, the registry is private and credentials weren't passed correctly, or Docker Hub rate-limiting kicked in.
Container startup crashes: the application inside the container exits immediately because it can't find an expected environment variable, a required port binding fails, or a mounted volume path doesn't exist.
Networking misconfiguration: especially in VNet-injected container groups where DNS resolution breaks, or where a subnet doesn't have the Microsoft.ContainerInstance/containerGroups service delegation applied.
Resource exhaustion: the container is hitting its CPU or memory ceiling and getting OOM-killed (event code OOMKilled) without any obvious indication in the portal.
Restart policy or exit code confusion: containers configured with the Always restart policy that exit with code 0 (success) still restart, causing a loop that looks like a crash but isn't.

What makes Azure Container Instances troubleshooting especially frustrating is that the provisioning state and the container state are two separate things. Your container group can be in Succeeded provisioning state while the container itself is in Terminated with exit code 137. Microsoft's portal doesn't always surface this clearly. I'll show you exactly where to look.


The Quick Fix: Try This First

Before you go deep into logs and registry settings, run this Azure CLI command. It gives you everything diagnostic in one shot and saves at least 20 minutes of clicking around the portal:

az container show \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --query "containers[*].{Name:name, State:instanceView.currentState, PrevState:instanceView.previousState, RestartCount:instanceView.restartCount}" \
  --output table

This single query shows you the current container state, the previous state (which often tells you more than the current one), and the restart count. If the restart count is above 0, the container has already been restarted at least once; if it keeps climbing, you have a crash loop. If the current state is Terminated with an exit code, that exit code is your primary diagnostic signal.

Then immediately pull the logs:

az container logs \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --container-name <your-container-name>

If the container started and then crashed, its stdout/stderr output will be here. In my experience, about 60% of Azure Container Instances issues are fully diagnosed from these two commands alone. The container itself is usually telling you exactly what's wrong (a missing env var, a port conflict, a failed database connection), but nobody reads the logs first.

If logs come back empty or the container never started at all, that points to an image pull problem or a networking issue at the infrastructure level. In that case, move directly to Step 2 (image pull diagnostics) in the step-by-step section below.

For containers stuck in Waiting state for more than 3–4 minutes: that's almost always a private registry authentication failure or a VNet subnet delegation issue. Neither shows up as a helpful error in the portal; I'll cover both precisely in the steps below.

Pro Tip

Always set --restart-policy Never during initial debugging. With the default Always policy, Azure restarts the container after every exit, even a successful one, and with OnFailure it restarts crashed containers before you can capture their state. Never lets the container die once and preserves all diagnostic output for you to inspect.
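The three restart policies boil down to a small decision table. Here's a minimal sketch of it in shell. The will_restart helper is a hypothetical name I'm using for illustration; it isn't part of the Azure CLI, and it only models the policy rules as described in this guide:

```shell
# Hypothetical helper (not an Azure CLI command): prints "yes" or "no" for
# whether a container that exited with EXIT_CODE would be restarted under POLICY.
will_restart() {
  policy=$1
  exit_code=$2
  case $policy in
    Always)    echo yes ;;   # restarts after every exit, even exit code 0
    OnFailure) if [ "$exit_code" -ne 0 ]; then echo yes; else echo no; fi ;;
    Never)     echo no ;;    # the container runs at most once
    *)         echo "unknown policy: $policy" >&2; return 2 ;;
  esac
}

will_restart Always 0        # a clean exit still restarts
will_restart OnFailure 0
will_restart OnFailure 137
will_restart Never 1
```

If the first call surprises you, that's the "restart loop that isn't a crash" scenario: a batch job that finishes cleanly under Always gets started again.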

Step-by-Step Solution

Step 1: Read the Container Events and State in Azure Portal

Open the Azure Portal and navigate to your Container Instance resource. In the left blade, click Containers (not Overview, which gives you the group-level view). You'll see a list of containers in your group. Click on your specific container name.

You'll land on a page with several tabs, including Events, Properties, Logs, and Connect. The Properties tab is where the gold is. Look at:

Current state: Running, Terminated, or Waiting
Exit code: critical if the state is Terminated. Exit code 1 = application error. Exit code 137 = OOM kill or SIGKILL. Exit code 139 = segfault. Exit code 0 = container exited cleanly (but may still be restarting if the policy is Always)
Restart count: anything above 2 means you have a persistent crash loop
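The 137 and 139 values aren't arbitrary: by the usual Linux convention, an exit code above 128 means the process was killed by signal (code minus 128). A minimal sketch of that arithmetic follows; explain_exit_code is a hypothetical helper name, and the wording of each message is my own:

```shell
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  code=$1
  if [ "$code" -eq 0 ]; then
    echo "clean exit"
  elif [ "$code" -gt 128 ]; then
    # Convention: 128 + N means the process died from signal N.
    sig=$((code - 128))
    case $sig in
      9)  echo "killed by SIGKILL (signal 9): often an OOM kill" ;;
      11) echo "killed by SIGSEGV (signal 11): segmentation fault" ;;
      15) echo "killed by SIGTERM (signal 15): asked to stop" ;;
      *)  echo "killed by signal $sig" ;;
    esac
  else
    echo "application error (exit code $code)"
  fi
}

explain_exit_code 137
explain_exit_code 139
```

So 137 is 128 + 9 (SIGKILL) and 139 is 128 + 11 (SIGSEGV), which is why those two codes show up so consistently.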

Then click the Logs tab. If your container produced any output before crashing, it's here. Copy it in full; don't just skim the last line. Startup errors often appear several lines before the final crash message.

If you see no logs at all and the container is in Waiting state, the container never ran. This means the problem is infrastructure-level: image pull, networking, or subnet configuration. Proceed to Step 2.

If the container shows Terminated with exit code 137 and you see an event that says OOMKilling, your memory limit is too low. The fix is to redeploy with a higher memory allocation; ACI allows up to 16 GB for CPU-based containers.

Step 2: Diagnose and Fix Container Image Pull Failures

Image pull errors are the single most common cause of containers stuck in Waiting state. The error often doesn't surface clearly in the portal. The fastest way to see it is via CLI:

az container show \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --query "containers[0].instanceView.events" \
  --output json

Look for an event whose name is Failed, with a message that mentions a failed image pull (for example "Failed to pull image", ErrImagePull, or ImagePullBackOff). If you see one, the problem is one of three things:

1. Wrong image tag. Double-check that the exact tag you specified exists in your registry. Run:

az acr repository show-tags \
  --name <your-acr-name> \
  --repository <your-image-repo> \
  --output table

2. Private registry authentication missing or wrong. If you're pulling from Azure Container Registry (ACR), you need to either enable the Admin user and pass credentials, or use a managed identity. To check if admin is enabled:

az acr show --name <your-acr-name> --query adminUserEnabled

If it returns false, either enable admin (for dev) or assign the AcrPull role to the container group's managed identity.

3. Docker Hub rate limiting. If you're pulling a public Docker Hub image, unauthenticated pulls are throttled to 100 per 6 hours per IP. Since ACI containers often share NAT IPs, you can hit this quickly. Pass Docker Hub credentials explicitly, or mirror the image to ACR.

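After fixing the credentials or image reference, redeploy the container group. Re-running az container create with the same group name and the corrected image or credentials updates the existing group; if the redeploy is rejected, delete the group and recreate it.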
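For the wrong-tag case, it helps to split the image reference into the pieces that az acr repository show-tags expects. This is a minimal sketch, assuming the reference has the form <name>.azurecr.io/<repository>[:tag] with no digest and no registry port; split_image_ref is a hypothetical helper name:

```shell
# Hypothetical helper: split "<name>.azurecr.io/<repo>[:tag]" into
# "<acr-name> <repo> <tag>". Runs in a subshell so no variables leak.
split_image_ref() (
  ref=$1
  registry=${ref%%/*}          # everything before the first slash
  rest=${ref#*/}               # repository, optionally followed by :tag
  case $rest in
    *:*) repo=${rest%:*}; tag=${rest##*:} ;;
    *)   repo=$rest;      tag=latest ;;   # no tag given means :latest
  esac
  printf '%s %s %s\n' "${registry%%.*}" "$repo" "$tag"
)

split_image_ref myregistry.azurecr.io/team/api:1.4.2
# prints: myregistry team/api 1.4.2

# Usage sketch (not run here; IMAGE is a placeholder variable):
#   set -- $(split_image_ref "$IMAGE")
#   az acr repository show-tags --name "$1" --repository "$2" --output tsv | grep -Fx "$3"
```

If the grep in the usage sketch finds nothing, the tag you deployed doesn't exist in that repository.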

Step 3: Fix VNet Integration and Subnet Delegation Errors

If your container group is deployed into a Virtual Network (the --vnet flag in CLI or the Virtual network setting in the portal), subnet configuration problems are a major source of Azure Container Instances networking issues. Containers stuck in Waiting with no image pull errors usually point here.

The subnet you deploy into must have a service delegation for Microsoft.ContainerInstance/containerGroups. Without it, ACI can't inject the container group into the VNet. Check this in the portal by going to Virtual networks → your VNet → Subnets → click your subnet → scroll to Subnet delegation. It must say Microsoft.ContainerInstance/containerGroups.

To add it via CLI:

az network vnet subnet update \
  --resource-group <your-rg> \
  --vnet-name <your-vnet> \
  --name <your-subnet> \
  --delegations Microsoft.ContainerInstance/containerGroups

The second networking issue I see constantly is DNS resolution failures inside the container. When you inject into a VNet, the container uses Azure's internal DNS resolver at 168.63.129.16. If your Network Security Group (NSG) is blocking outbound UDP port 53 to that address, DNS will silently fail. Check your NSG rules:

az network nsg rule list \
  --resource-group <your-rg> \
  --nsg-name <your-nsg> \
  --output table

You need an outbound allow rule for destination 168.63.129.16, port 53, protocol UDP. If you're using ACR or Log Analytics, you also need outbound TCP port 443 to the AzureContainerRegistry and AzureMonitor service tags.
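If the DNS rule is missing, something like the following would add it. Treat this as a sketch: allow_azure_dns is a hypothetical wrapper name, and the rule name and priority are placeholders I picked, so adjust them to fit your existing NSG rules before running it.

```shell
# Hypothetical wrapper around `az network nsg rule create`.
# The rule name and priority are placeholder assumptions.
allow_azure_dns() {
  rg=$1
  nsg=$2
  az network nsg rule create \
    --resource-group "$rg" \
    --nsg-name "$nsg" \
    --name AllowAzureDnsOutbound \
    --priority 200 \
    --direction Outbound \
    --access Allow \
    --protocol Udp \
    --destination-address-prefixes 168.63.129.16 \
    --destination-port-ranges 53
}

# Usage: allow_azure_dns <your-rg> <your-nsg>
```

The priority has to be a lower number than any outbound deny rule that would otherwise match this traffic, since NSG rules are evaluated in priority order.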

Step 4: Resolve OOM Kills and CPU Throttling

Exit code 137 is the OOM killer's signature. Your container tried to use more memory than its limit and Azure killed it. What makes this especially annoying is that your application might run fine locally with no memory pressure, but in ACI the default limits (1 CPU, 1.5 GB RAM) can be surprisingly tight for certain workloads, especially anything using a JVM, Node.js with a large heap, or Python with in-memory data processing.

To check the current resource allocation:

az container show \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --query "containers[*].{Name:name, CPU:resources.requests.cpu, MemGB:resources.requests.memoryInGb}" \
  --output table

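To redeploy with increased limits (delete the existing container group first with az container delete, since CPU and memory can't be changed on an existing group):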

az container create \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --image <your-image> \
  --cpu 2 \
  --memory 4 \
  --restart-policy Never

ACI supports a maximum of 4 vCPU and 16 GB RAM for Linux containers in most regions. For Windows containers, limits vary; check the quota for your subscription with:

az container list-usage --location eastus --output table

If you're consistently hitting memory limits even after increasing them, the issue may be a memory leak in your application rather than an undersized limit. Enable Azure Monitor integration to track memory usage over time, which helps distinguish a gradual leak from an immediate spike on startup.

Step 5: Enable Diagnostic Logging with Azure Monitor and Log Analytics

The portal's built-in log view only shows you the last few thousand lines of stdout/stderr from the most recent container run. For production Azure Container Instances troubleshooting, you need persistent logs that survive container restarts. That means wiring up Log Analytics.

First, create a Log Analytics workspace if you don't have one:

az monitor log-analytics workspace create \
  --resource-group <your-rg> \
  --workspace-name <your-workspace-name> \
  --location eastus

Get the workspace ID and key:

WORKSPACE_ID=$(az monitor log-analytics workspace show \
  --resource-group <your-rg> \
  --workspace-name <your-workspace-name> \
  --query customerId -o tsv)

WORKSPACE_KEY=$(az monitor log-analytics workspace get-shared-keys \
  --resource-group <your-rg> \
  --workspace-name <your-workspace-name> \
  --query primarySharedKey -o tsv)

Then redeploy your container with diagnostics enabled:

az container create \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --image <your-image> \
  --log-analytics-workspace "$WORKSPACE_ID" \
  --log-analytics-workspace-key "$WORKSPACE_KEY"

Once connected, go to your Log Analytics workspace in the portal, click Logs, and query the ContainerInstanceLog_CL table. A useful starter query:

ContainerInstanceLog_CL
| where ContainerGroup_s == "<your-container-group>"
| order by TimeGenerated desc
| take 200

This gives you persistent, queryable log history even after the container has been deleted and recreated multiple times. This is the single most useful change you can make for ongoing ACI operations.

Advanced Troubleshooting

When the standard steps above don't solve the problem, you're usually dealing with one of several more obscure scenarios. Here's what I reach for after the basics are exhausted.

Using az container exec for Live Debugging

If your container is running but behaving incorrectly, you can exec into it directly, similar to docker exec:

az container exec \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --container-name <your-container> \
  --exec-command "/bin/sh"

From inside the container, you can test DNS resolution with nslookup, check environment variables with env, verify file mounts with ls -la, and test outbound connectivity with curl. This is invaluable for debugging ACI networking issues that are otherwise invisible from outside.

Container Group ARM Template Validation

Malformed deployment definitions are a common but non-obvious source of deployment failures. The portal error for a bad template is often just "deployment failed" with no detail. Export your current container group definition (az container export writes it as YAML, not as an ARM template) and review it:

az container export \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --file container-group.yaml

If you deploy through an ARM template, run az deployment group validate against that template before re-applying. Look specifically for invalid osType values, mismatched resource limits between containers and the group level, and GPU SKU requests in regions that don't support them.

Service Principal and Managed Identity Issues

If your container is trying to authenticate against Azure services (Key Vault, Storage, Service Bus) using a managed identity and getting 401 or 403 errors, verify the identity is actually assigned:

az container show \
  --resource-group <your-rg> \
  --name <your-container-group> \
  --query identity

If it returns null, the identity wasn't assigned at deployment time. Also check that the identity has the required RBAC role on the target resource: ACI containers frequently get the identity assigned, but the role assignment on the downstream resource is forgotten.

Azure Policy Blocking Deployment

In enterprise environments, Azure Policy can silently block ACI deployments that don't meet compliance requirements, for example, policies requiring all containers to use approved base images, or policies blocking privileged containers. Check the Activity Log for your resource group and look for any PolicyViolation events in the same time window as your deployment attempt. Navigate to your resource group → Activity log → filter by time range and look for Microsoft.ContainerInstance/containerGroups/write with a Failed status.

Region-Specific Quota Limits

ACI has per-subscription, per-region quotas. If you're getting provisioning errors with no clear cause, check your quota:

az container list-usage --location <your-region> --output table

Common limits that get hit: maximum container groups per region (100 by default), maximum CPU cores per region, and GPU container limits. Submit a support request to increase quotas; they're usually approved within 1–2 business days.

When to Call Microsoft Support

If your container group is stuck in Pending provisioning state for more than 15 minutes with no events, if you're seeing intermittent provisioning failures that you cannot reproduce consistently, or if you've confirmed your configuration is correct but ACI is still not starting, these are infrastructure-level issues that require Microsoft's internal tooling to diagnose. Open a support ticket at Microsoft Support and include the container group resource ID, the exact time window of the failure, and the output of az container show with --output json. The resource ID and timestamp let support engineers pull internal platform telemetry that isn't exposed to you.
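Gathering the material for a ticket can be scripted so you capture it while the failure is fresh. A minimal sketch: collect_aci_diagnostics and the output file names are hypothetical, and the timestamp it records is the capture time, so you still need to note the failure window yourself.

```shell
# Hypothetical helper: save the items a support ticket needs into OUT_DIR.
collect_aci_diagnostics() {
  rg=$1
  group=$2
  out_dir=${3:-aci-diagnostics}
  mkdir -p "$out_dir" || return 1
  # UTC capture time, to help line the files up with the failure window.
  date -u +%Y-%m-%dT%H:%M:%SZ > "$out_dir/captured-at.txt"
  # Full container group state, as requested for the ticket.
  az container show --resource-group "$rg" --name "$group" --output json \
    > "$out_dir/container-show.json"
  # The container group resource ID on its own line.
  az container show --resource-group "$rg" --name "$group" --query id --output tsv \
    > "$out_dir/resource-id.txt"
}

# Usage: collect_aci_diagnostics <your-rg> <your-container-group> ./ticket-files
```

Review container-show.json before attaching it, since it can include environment variable values you may not want to share.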

Prevention & Best Practices

Most Azure Container Instances issues I see are preventable. After enough of these debugging sessions, you develop habits that short-circuit the entire class of problems before they happen in production.

The biggest one: test your image locally against the exact same environment variables and resource constraints before deploying to ACI. Run it with docker run --memory="1.5g" --cpus="1" to simulate ACI's default limits. If it crashes locally under those constraints, it'll crash in ACI too, and at least locally you have easy access to logs and interactive debugging.
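To keep the local limits in step with whatever you actually deploy, you can derive the docker run flags from the same CPU and memory numbers you pass to ACI. A minimal sketch, with docker_limit_flags as a hypothetical helper name; it converts GB to whole megabytes so the memory value is always an integer:

```shell
# Hypothetical helper: turn ACI-style CPU and memory (GB) values into
# the equivalent `docker run` resource flags.
docker_limit_flags() {
  cpu=$1
  mem_gb=$2
  # awk handles fractional values such as 1.5, which shell arithmetic cannot.
  mem_mb=$(awk -v gb="$mem_gb" 'BEGIN { printf "%d", gb * 1024 }')
  printf '%s --memory=%sm\n' "--cpus=$cpu" "$mem_mb"
}

docker_limit_flags 1 1.5
# prints: --cpus=1 --memory=1536m

# Usage sketch (not run here):
#   docker run --rm $(docker_limit_flags 1 1.5) <your-image>
```

Passing the same environment variables with -e or --env-file alongside these flags gets you close to the conditions the container will see in ACI.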

Second, use Azure Container Registry rather than Docker Hub for all production images. ACR integrates natively with ACI via managed identity (no credentials to manage), geo-replicates if needed, and isn't subject to Docker Hub's pull rate limits. The Docker Hub rate limit issue is completely avoidable.

Third, structure your container images to fail fast and loudly. If a required environment variable is missing, your entrypoint script should print a clear error message and exit with a non-zero code immediately, not silently continue and fail in a confusing way ten seconds later. Something like:

#!/bin/sh
: "${DATABASE_URL:?DATABASE_URL is required but not set. Exiting.}"
: "${API_KEY:?API_KEY is required but not set. Exiting.}"
exec "$@"

Fourth, tag your images with meaningful version tags, never just :latest. The :latest tag is a notorious source of ACI container image pull errors: it can change underneath you between deployments and makes rollbacks nearly impossible to reason about.

Finally, set up Log Analytics integration from day one, not after the first production incident. The cost is minimal and having historical log data available when something goes wrong is invaluable for ACI container crash loop analysis and post-incident review.

Quick Wins

Always deploy with --restart-policy Never during initial testing; switch to OnFailure only once the container is proven stable
Store secrets in Azure Key Vault and pass them to the container as secure environment variables (--secure-environment-variables) or secret volumes at deployment time, not hardcoded in your deployment command
Configure a liveness probe in the container group definition (the livenessProbe property in YAML or ARM) so ACI can restart an unhealthy container; ACI's liveness checks are driven by this probe configuration rather than by a Dockerfile HEALTHCHECK instruction
Use resource request/limit pairs: set a modest request and a higher limit so ACI can bin-pack efficiently while giving your container room to burst

Frequently Asked Questions

My Azure Container Instance is stuck in "Waiting" state. What does that mean?

The Waiting state means the container hasn't started executing yet. The two most common causes are an image pull failure (wrong tag, wrong registry URL, or missing credentials for a private registry) and a VNet subnet that doesn't have the Microsoft.ContainerInstance/containerGroups delegation applied. Run az container show --query "containers[0].instanceView.events" to see the events list, which will almost always tell you exactly which of these two issues you're hitting. If events are empty, check your NSG rules for outbound HTTPS access to AzureContainerRegistry and mcr.microsoft.com.

My container exits with code 137. What's causing it and how do I fix it?

Exit code 137 in Linux containers means the process was killed by signal 9 (SIGKILL), which in ACI almost always means the container exceeded its memory limit and was OOM-killed by the kernel. The fix is to redeploy with a higher memory value; try doubling the current allocation first. If the problem persists even with generous memory, you may have a memory leak in your application. Check the Memory Usage metric for the container group in Azure Monitor to see the growth pattern over time before the kill event. For JVM-based applications, also consider explicitly setting -Xmx to a value below your ACI memory limit so the JVM doesn't balloon beyond the container ceiling.
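For the JVM case, here's a minimal sketch of deriving a heap size from the container's memory limit. The jvm_heap_mb helper is hypothetical, and the 75% default is purely my assumption: the right headroom depends on how much non-heap memory (metaspace, thread stacks, native buffers) your application uses.

```shell
# Hypothetical helper: heap size in MB as a percentage of the ACI memory
# limit in GB. The 75 percent default is an assumption, not an ACI rule.
jvm_heap_mb() {
  limit_gb=$1
  percent=${2:-75}
  awk -v gb="$limit_gb" -v pct="$percent" \
    'BEGIN { printf "%d\n", gb * 1024 * pct / 100 }'
}

echo "-Xmx$(jvm_heap_mb 4)m"
# prints: -Xmx3072m
```

On recent JVMs you can get a similar effect without the arithmetic by passing -XX:MaxRAMPercentage, which sizes the heap relative to the container's memory limit.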

How do I pull from a private Azure Container Registry without storing credentials in my deployment command?

The cleanest way is to assign a user-assigned managed identity to your container group and grant it the AcrPull role on your ACR. When you deploy, pass --assign-identity <identity-resource-id> and ACI will authenticate to ACR automatically without any passwords. You currently need to use the ACI REST API or an ARM template for this: the az container create CLI command supports system-assigned identity, but the ACR integration via managed identity works most reliably through ARM. The alternative for simpler setups is enabling the ACR admin user and passing those credentials, but that approach means rotating credentials manually, so the managed identity route is worth the extra setup.

Can I SSH into a running Azure Container Instance?

There's no SSH in ACI, but you can get an interactive shell using az container exec --exec-command "/bin/sh" (or /bin/bash if your image has bash). This works while the container is in Running state and gives you a direct terminal session inside the container. You can run diagnostics, check environment variables, test network connectivity, and inspect file mounts exactly as you would locally. One important constraint: this only works while the container is running, so if your container crashes immediately on startup, you won't be able to exec into it. In that case, override the entrypoint to sleep 3600 for a debugging deployment, exec in, and manually run your application to capture the error interactively.

Why does my container keep restarting even though it's not crashing?

This is almost always a restart policy configuration issue. If your container group was created with --restart-policy Always (the default), ACI will restart the container even when it exits with code 0, meaning a successful, intentional exit still triggers a restart. This is correct behavior for services that are supposed to run continuously, but it looks like a crash loop if you're running a batch job or one-shot container. Change your restart policy to OnFailure (only restart on non-zero exit) or Never (never restart) depending on your use case. You'll need to delete and recreate the container group since restart policy can't be changed in place.

My container works fine locally but fails in ACI with a networking error. What should I check?

The most common ACI-specific networking failures that don't reproduce locally are: DNS resolution failures (your container can't resolve external hostnames because an NSG is blocking UDP port 53 outbound), missing outbound internet access (ACI containers in a VNet have no outbound internet unless you've configured a NAT Gateway or firewall UDR), and TLS certificate validation failures against internal endpoints that use self-signed certs not trusted in the container's base image. Start by running az container exec to get a shell inside the running container and test curl -v https://your-endpoint.com and nslookup your-endpoint.com directly. The verbose output from those two commands will pinpoint whether you're hitting a DNS, routing, or TLS issue.


Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.