Azure Container Instances: Fix Common Errors & Setup Issues
Why This Is Happening
You've pushed your container image, fired off a deployment command, and Azure Container Instances just… sits there. Maybe it throws a cryptic provisioning error. Maybe it pulls the image fine but the container exits with code 1 immediately. Or maybe networking is broken in ways that make zero sense at first glance. I've seen every one of these scenarios, and the frustrating part is that the Azure Portal error messages are often so generic they tell you almost nothing actionable.
Azure Container Instances is designed to be the fastest path to running a Docker container in the cloud without touching a VM or setting up Kubernetes. Microsoft's documentation describes it as a solution for "any scenario that can operate in isolated containers, without orchestration": think event-driven apps, CI/CD pipeline jobs, batch data processing, and short-lived compute tasks. That simplicity is its strength. But when something goes wrong, the abstracted nature of ACI means the cause of the failure is sometimes hidden from you.
Most Azure Container Instances problems fall into five buckets:
- Image pull failures: ACI can't reach your registry, or authentication is misconfigured for a private Azure Container Registry (ACR) instance.
- Container exit failures: the image pulls successfully but the container process crashes at startup, often because of missing environment variables, bad entry points, or OS/architecture mismatches.
- Networking and virtual network issues: outbound connectivity fails, especially when the container group is deployed into a VNet subnet without a NAT gateway configured.
- TLS errors: ACI strictly requires TLS 1.2 for all secure connections. If your application or client library still negotiates TLS 1.0 or 1.1, it will be rejected, because Microsoft retired support for those older versions.
- Resource specification errors: CPU or memory requests are outside the limits allowed for the region or SKU, causing the deployment to fail at the resource allocation stage.
What makes this extra maddening is that ACI deployments happen fast (seconds, not minutes), so when they fail, you barely have time to see what happened before the container group lands in a terminal error state. And yes, I know how frustrating that is when you're trying to ship something.
The good news: every single one of these issues is solvable, and most of them have a clear fix once you know where to look.
The Quick Fix: Try This First
Before you go deep on diagnostics, do this one thing: check your container group's logs and events directly from the Azure CLI. This single command gives you more useful information than five minutes of clicking around in the Portal.
Open Azure Cloud Shell or any terminal with the Azure CLI installed and run:
az container show \
--resource-group YOUR_RESOURCE_GROUP \
--name YOUR_CONTAINER_NAME \
--query "containers[0].instanceView" \
--output json
Look at the currentState and events fields in the output. The events array in particular will often show you exactly what went wrong, whether that's a failed image pull (Failed event with a registry 401 or 404), an OOMKilled exit, or a networking provisioning error.
Then pull the actual container logs:
az container logs \
--resource-group YOUR_RESOURCE_GROUP \
--name YOUR_CONTAINER_NAME
If your container started and then crashed, those logs will show the application-level error: a missing environment variable, a failed database connection, or an uncaught exception. That output alone resolves about 60% of the cases I've seen.
If the container never started at all (image pull failed or provisioning failed), the logs will be empty. In that case, the instanceView.events output from the first command is your primary signal. Look for messages referencing InaccessibleImage, Failed to pull image, or OutOfMemory.
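As a rough illustration of how that triage could be scripted, the Python sketch below scans the JSON printed by the first command for the markers mentioned above. The field names (events, message, currentState, detailStatus) follow the output described in this section; the sample payload and the category labels are invented for the example and are not part of any Azure API.

```python
import json

# Substrings called out above, mapped to a hypothetical category label.
MARKERS = {
    "InaccessibleImage": "image-pull",
    "Failed to pull image": "image-pull",
    "OutOfMemory": "out-of-memory",
    "OOMKilled": "out-of-memory",
}


def classify_instance_view(view: dict) -> str:
    """Return a coarse failure category for one container's instanceView."""
    events = view.get("events") or []
    texts = [event.get("message") or "" for event in events]
    texts += [event.get("name") or "" for event in events]
    texts.append((view.get("currentState") or {}).get("detailStatus") or "")
    for text in texts:
        for marker, category in MARKERS.items():
            if marker in text:
                return category
    return "unknown"


# Made-up payload shaped like the output of
# az container show --query "containers[0].instanceView"
sample = json.loads("""
{
  "currentState": {"state": "Waiting", "detailStatus": ""},
  "events": [
    {"name": "Pulling", "type": "Normal", "message": "pulling image myapp:latest"},
    {"name": "Failed", "type": "Warning", "message": "Failed to pull image myapp:latest"}
  ]
}
""")
print(classify_instance_view(sample))  # prints: image-pull
```

In practice you would pipe the CLI output into json.load(sys.stdin) instead of using the inline sample.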
One more quick win: if you're seeing a deployment rejection rather than a runtime failure, verify your CPU and memory specs. ACI allows exact specifications, but there are per-region limits. For standard Linux containers, common valid combinations are 1 CPU / 1.5 GB, 2 CPU / 3.5 GB, or 4 CPU / 16 GB. Requesting, say, 3 CPU / 1 GB is invalid and will cause an immediate provisioning failure with an unhelpful generic error.
Tip: use --restart-policy Never when deploying a test or diagnostic container. The default restart policy (Always) will cause ACI to keep restarting a crashing container, which makes it much harder to capture logs before they're overwritten. Switch to Never so the container exits and holds its logs for inspection.
The Azure Portal gives you a high-level status, but it hides the details you actually need. Go straight to the CLI or REST API for real diagnostic output.
In the Portal, navigate to your container group: Portal → Resource Groups → [your group] → Container Instances → [your container name] → Containers. Click the container name, then select the Events tab. You'll see a timestamped list of provisioning events. If the container never started, this is where the image pull error or network provisioning failure will appear.
For a richer view, use the CLI:
az container show \
--resource-group YOUR_RG \
--name YOUR_CONTAINER \
--output table
The provisioningState field will read Succeeded, Failed, or Creating. If it's stuck on Creating for more than 3–4 minutes, something is blocking the provisioning pipeline, usually a network issue or an invalid resource spec.
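If you would rather script the "is it stuck?" check than watch the clock, a polling loop along these lines is one option. This is a minimal sketch: get_state is a hypothetical callable that you would implement yourself (for example by shelling out to az container show and reading provisioningState), and the 240-second default simply mirrors the 3 to 4 minute rule of thumb above.

```python
import time
from typing import Callable


def wait_for_provisioning(
    get_state: Callable[[], str],
    timeout_s: float = 240.0,  # roughly the 3-4 minute threshold above
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the state leaves "Creating" or the timeout elapses.

    Returns the final state, or "Stuck" if it never left "Creating".
    """
    deadline = time.monotonic() + timeout_s
    while True:
        state = get_state()
        if state != "Creating":
            return state
        if time.monotonic() >= deadline:
            return "Stuck"
        sleep(interval_s)


# Usage with a stand-in for the real lookup: a fake sequence of states.
states = iter(["Creating", "Creating", "Succeeded"])
print(wait_for_provisioning(lambda: next(states), sleep=lambda _: None))  # prints: Succeeded
```

Injecting the sleep function keeps the loop easy to test; in real use you would leave the default in place.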
Note the osType field too. ACI supports both Windows and Linux containers, but several features, including volume mounting (Azure Files, emptyDir, GitRepo, secret), GPU resources, and multiple containers per group, are currently restricted to Linux containers only. If you're on Windows and expecting multi-container support or Azure Files mounts, that's your problem right there.
If the state is Failed, look at containers[0].instanceView.currentState.detailStatus. A value of CrashLoopBackOff means the app is starting and crashing repeatedly. OOMKilled means the container hit its memory limit. Error with no exit code usually signals an image pull failure or a network-level provisioning problem.
Once you've identified the failure category, you know which step below applies to your situation.
Private registry authentication is one of the most common sources of Azure Container Instances deployment failures. If your container image lives in an Azure Container Registry (ACR) instance, there are two ways to authenticate: explicit credentials (username and password) or managed identity. Managed identity is the right long-term approach because it means you don't embed credentials anywhere in your container group definition.
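To set up managed identity for ACR image pulls, first create a user-assigned managed identity. The image pull happens while the container group is being created, before a system-assigned identity would exist, which is why the commands below pass the identity's resource ID. Then assign the AcrPull role to that identity, scoped to your ACR instance: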
# Assign AcrPull role to the managed identity
az role assignment create \
--assignee YOUR_MANAGED_IDENTITY_PRINCIPAL_ID \
--role AcrPull \
--scope /subscriptions/YOUR_SUB_ID/resourceGroups/YOUR_RG/providers/Microsoft.ContainerRegistry/registries/YOUR_ACR_NAME
Then deploy the container group with the identity and registry reference:
az container create \
--resource-group YOUR_RG \
--name mycontainer \
--image YOUR_ACR.azurecr.io/myapp:latest \
--acr-identity YOUR_MANAGED_IDENTITY_RESOURCE_ID \
--assign-identity YOUR_MANAGED_IDENTITY_RESOURCE_ID
If you're using explicit credentials and getting a 401 Unauthorized error, double-check that admin access is enabled on the ACR instance: Portal → Container Registry → [your registry] → Settings → Access keys → Admin user: Enabled. Copy the username and one of the two passwords, then pass them via --registry-username and --registry-password in your deploy command.
Also confirm the registry is in a supported configuration. ACI supports Docker Hub, Azure Container Registry, and other cloud-based Docker-compatible registries. If you're pulling from a self-hosted registry over a private endpoint only, you'll need a VNet-deployed container group with proper DNS resolution to reach it.
A successful image pull will show an event like Pulling image "YOUR_IMAGE" followed by Successfully pulled image in the container events. If you see those two events, your image pull is working, and the problem is in the application startup, not authentication.
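A malformed --scope string is an easy way to end up with a role assignment that silently doesn't apply to your registry. The sketch below assembles the scope and the role assignment arguments from their parts so they can be checked before anything runs. The helper names are hypothetical, the placeholder values are the same ones used in the commands above, and nothing here calls Azure.

```python
import shlex
from typing import List


def acr_scope(subscription_id: str, resource_group: str, registry_name: str) -> str:
    """Build the ARM resource ID used as --scope for an ACR instance."""
    return (
        f"/subscriptions/{subscription_id}"
        f"/resourceGroups/{resource_group}"
        f"/providers/Microsoft.ContainerRegistry/registries/{registry_name}"
    )


def acr_pull_assignment_command(principal_id: str, scope: str) -> List[str]:
    """Return the argument list for the role assignment (not executed here)."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", principal_id,
        "--role", "AcrPull",
        "--scope", scope,
    ]


scope = acr_scope("YOUR_SUB_ID", "YOUR_RG", "YOUR_ACR_NAME")
command = acr_pull_assignment_command("YOUR_MANAGED_IDENTITY_PRINCIPAL_ID", scope)
print(shlex.join(command))
```

Printing the command first and running it by hand is a cheap way to catch a swapped resource group or registry name.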
This one trips up a lot of engineers when they first move from standalone ACI deployments to VNet-integrated ones. The rule is non-negotiable: if you deploy your container group into a virtual network, you must use a NAT gateway for outbound connectivity. That is the only supported configuration.
Without a NAT gateway, your VNet-deployed container group will have no outbound internet connectivity: it can receive traffic if you've set it up correctly, but it cannot initiate outbound connections. Applications that phone home to an API, pull a config from a URL, or connect to an external database will silently fail or time out, and the container logs will show connection refused or timeout errors rather than anything pointing to the actual network configuration problem.
Here's how to attach a NAT gateway to the subnet your container group uses:
# Create a public IP for the NAT gateway
az network public-ip create \
--resource-group YOUR_RG \
--name myNatGatewayIP \
--sku Standard \
--allocation-method Static
# Create the NAT gateway
az network nat gateway create \
--resource-group YOUR_RG \
--name myNatGateway \
--public-ip-addresses myNatGatewayIP \
--idle-timeout 10
# Associate the NAT gateway with your ACI subnet
az network vnet subnet update \
--resource-group YOUR_RG \
--vnet-name YOUR_VNET \
--name YOUR_ACI_SUBNET \
--nat-gateway myNatGateway
After this, any container group deployed into that subnet will have outbound internet connectivity via the NAT gateway's static public IP. This also gives you a predictable outbound IP for firewall allowlisting, a side benefit that's genuinely useful in enterprise scenarios.
VNet-deployed container groups can also communicate securely with on-premises resources via a VPN gateway or ExpressRoute connection on the same virtual network. If that connectivity is broken, verify the VPN gateway health and check that your subnet's route table includes the correct routes for on-premises address spaces.
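Because missing outbound connectivity tends to surface as vague timeouts, it can help to run an explicit probe inside the container before blaming the application. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, and the targets you pass in should be the endpoints your own workload depends on.

```python
import socket
from typing import Iterable, List, Tuple


def can_connect(host: str, port: int, timeout_s: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def report_outbound(targets: Iterable[Tuple[str, int]]) -> List[str]:
    """Return one human-readable line per target."""
    lines = []
    for host, port in targets:
        status = "reachable" if can_connect(host, port) else "NOT reachable"
        lines.append(f"{host}:{port} {status}")
    return lines
```

Calling report_outbound([("example.com", 443)]) from a diagnostic container in the subnet, before and after attaching the NAT gateway, gives a quick before/after comparison.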
Azure Container Instances requires TLS 1.2 for all secure client connections, no exceptions. TLS 1.0 and TLS 1.1 support has been retired. If your application code, an SDK it depends on, or a client tool connecting to ACI is still negotiating on an older TLS version, the handshake will fail and you'll see connection errors that can look like generic network failures if you're not looking in the right place.
The most common scenario: a legacy .NET Framework application inside the container that defaults to TLS 1.0 or 1.1 when making outbound HTTPS calls. Fix this by explicitly setting the security protocol in your application code:
// .NET, force TLS 1.2
System.Net.ServicePointManager.SecurityProtocol =
System.Net.SecurityProtocolType.Tls12;
For Python applications using the requests library, make sure you're on a version of OpenSSL that supports TLS 1.2 in your container image. Alpine-based images sometimes ship with older OpenSSL builds. Switching to a Debian-based image (e.g., python:3.12-slim instead of python:3.12-alpine) often resolves this immediately.
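To check what a given image actually supports, and to make the TLS floor explicit rather than relying on defaults, something like the following can be run inside the container. It is a sketch that uses only the standard library (ssl and urllib rather than requests) so that it has no extra dependencies; the helper names are hypothetical.

```python
import ssl
import urllib.request


def tls12_context() -> ssl.SSLContext:
    """Create a client context that refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def fetch_status(url: str, timeout_s: float = 10.0) -> int:
    """Fetch a URL with the TLS 1.2+ context and return the HTTP status code."""
    with urllib.request.urlopen(url, timeout=timeout_s, context=tls12_context()) as response:
        return response.status


# Report what this image's Python was built against before blaming the network.
print(ssl.OPENSSL_VERSION)
print("TLS 1.2 supported:", ssl.HAS_TLSv1_2)
```

If the second line prints False, the fix is in the base image, not in your application code.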
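For Node.js, TLS 1.2 has been the default minimum since Node 12, so if you're on a modern base image you should be fine. If you're running something older and can't update, you can pin outbound HTTPS calls to TLS 1.2 via:
// Node.js, pin outbound requests to TLS 1.2
const https = require('https');
const agent = new https.Agent({
  // 'TLSv1_2_method' selects exactly TLS 1.2, not "1.2 or newer"
  secureProtocol: 'TLSv1_2_method'
});
// Pass the agent on each request, e.g. https.get(url, { agent }, callback)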
On the infrastructure side, if you're exposing your container group directly to the internet with a DNS name label (e.g., myapp.eastus.azurecontainer.io), make sure any load balancer, API gateway, or reverse proxy in front of it is also configured to reject TLS 1.0 and 1.1 connections from clients. The container FQDN format follows the pattern [customlabel].[region].azurecontainer.io. Keep in mind that the FQDN only resolves to the container group's IP address; HTTPS works on it only if something in the group, or in front of it, terminates TLS.
If you're debugging TLS issues from outside the container, the openssl s_client command is your best friend:
openssl s_client -connect YOUR_FQDN:443 -tls1_2
A successful TLS 1.2 handshake will return the certificate chain and end with Verify return code: 0 (ok). Anything else tells you exactly where the negotiation is breaking.
If you're running a multi-container setup, say, an application container alongside a logging sidecar or a data processor, and your containers can't see each other, the issue is almost always that you're not using a container group. Containers in the same ACI container group share the same host, local network, storage, and lifecycle. Containers in separate container groups are isolated from each other by default.
When deploying a multi-container group, YAML is the cleanest approach. Here's a minimal two-container group definition:
apiVersion: 2019-12-01
location: eastus
name: myContainerGroup
properties:
  containers:
  - name: app
    properties:
      image: myregistry.azurecr.io/myapp:latest
      resources:
        requests:
          cpu: 1
          memoryInGb: 1.5
      ports:
      - port: 80
  - name: logger
    properties:
      image: myregistry.azurecr.io/mylogger:latest
      resources:
        requests:
          cpu: 0.5
          memoryInGb: 0.5
  osType: Linux
  ipAddress:
    type: Public
    ports:
    - protocol: tcp
      port: 80
type: Microsoft.ContainerInstance/containerGroups
Note the osType: Linux setting. Multi-container groups are a Linux-only feature; Windows container deployments support only a single container per group.
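One detail worth checking before you deploy a group: ACI allocates resources to a multi-container group by adding up the requests of its containers, so the group as a whole has to fit within the limits for your region. The sketch below does that sum for the two containers in the YAML above. The dictionary layout is a simplification made up for the example, not the ARM schema.

```python
from typing import Dict, List, Tuple

# Mirrors the resource requests of the two containers in the YAML above.
containers: List[Dict] = [
    {"name": "app", "cpu": 1.0, "memoryInGb": 1.5},
    {"name": "logger", "cpu": 0.5, "memoryInGb": 0.5},
]


def group_totals(items: List[Dict]) -> Tuple[float, float]:
    """Sum CPU and memory requests across all containers in a group."""
    cpu = sum(item["cpu"] for item in items)
    memory = sum(item["memoryInGb"] for item in items)
    return cpu, memory


cpu, memory = group_totals(containers)
print(f"group requests {cpu} CPU and {memory} GB")  # prints: group requests 1.5 CPU and 2.0 GB
```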
For persistent storage, ACI supports direct mounting of Azure Files shares. First, create the file share in your storage account, then mount it in your container definition:
az container create \
--resource-group YOUR_RG \
--name mycontainer \
--image myimage:latest \
--azure-file-volume-account-name YOUR_STORAGE_ACCOUNT \
--azure-file-volume-account-key YOUR_STORAGE_KEY \
--azure-file-volume-share-name myshare \
--azure-file-volume-mount-path /mnt/data
Volume mounting (Azure Files, emptyDir, GitRepo, secret) is also Linux-only. If you're on a Windows container and wondering why the volume mount flag is being rejected, that's why. The fix is to switch to a Linux base image if your workload allows it.
One thing I always recommend: test your multi-container group locally with Docker Compose before deploying to ACI. Docker's ACI integration, which once let you deploy a docker-compose.yml file straight to ACI through the Docker CLI, has since been retired by Docker, so treat Compose as the local test harness and keep the YAML or ARM definition as the deployment artifact. Running the same images, environment variables, and ports locally still goes a long way toward reducing "works on my machine" problems.
Advanced Troubleshooting
GPU Resource Failures (Preview)
ACI supports scheduling Linux containers to use NVIDIA Tesla GPU resources, but this is a preview feature with significant caveats. GPU-enabled deployments are only available in specific regions, and the SKU must be explicitly specified in your deployment. If you're getting a deployment failure when requesting GPU resources, first confirm your target region supports GPU containers by checking the Azure products by region page. Then verify your container group spec explicitly sets the GPU SKU:
az container create \
--resource-group YOUR_RG \
--name gpu-container \
--image YOUR_GPU_IMAGE \
--gpu-count 1 \
--gpu-sku K80 \
--os-type Linux
GPU containers can only run on Linux. If --os-type is Windows or unspecified and defaults to Windows in your subscription's region config, the GPU request will be rejected.
Confidential Container Deployment Issues
Confidential containers on ACI run in a trusted execution environment (TEE) that provides hardware-based confidentiality and integrity protections. If your confidential container group is failing to provision, the most common cause is that you haven't specified the correct SKU at deployment time. Confidential containers are a distinct SKU selection; you can't take a regular container group and "upgrade" it to confidential after the fact. The SKU must be set at creation via the ARM template or the Portal's confidential container option.
For enterprise scenarios where confidential containers are used to protect data-in-use and encrypt data being processed in memory, make sure your workload's attestation configuration is correct. A misconfigured attestation policy will cause the TEE initialization to fail before your application code even starts.
Spot Container Preemptions
ACI Spot containers run on unused Azure capacity and can be preempted when Azure needs that capacity back. If your workload is being interrupted unexpectedly, you may have deployed to the Spot SKU without realizing it, or you may have chosen Spot for cost savings (up to 70% off regular ACI pricing) without designing for interruption. Spot containers are billed per-second for memory and core usage, but they are not suitable for workloads with strict availability requirements. If your workload needs to run to completion reliably, switch to regular-priority ACI. If you want to keep Spot for cost savings, implement checkpointing in your application so it can resume from where it left off after a preemption.
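Checkpointing does not have to be elaborate. The sketch below shows one simple pattern: record the index of the next unprocessed item after each unit of work, and start from that index on the next run. All names here are hypothetical, and it assumes the checkpoint file lives on storage that survives the preemption (for example a mounted Azure Files share) and that each item is safe to process again if the interruption lands between the work and the save.

```python
import json
import os
import tempfile
from typing import Callable, List


def load_checkpoint(path: str) -> int:
    """Return the index of the next item to process (0 if no checkpoint exists)."""
    try:
        with open(path, "r", encoding="utf-8") as handle:
            return int(json.load(handle)["next_index"])
    except FileNotFoundError:
        return 0


def save_checkpoint(path: str, next_index: int) -> None:
    """Write to a temp file, then rename, so an interruption can't leave a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"next_index": next_index}, handle)
    os.replace(tmp_path, path)


def run(items: List[str], path: str, work: Callable[[str], None]) -> None:
    """Process items in order, resuming after the last completed one."""
    for index in range(load_checkpoint(path), len(items)):
        work(items[index])
        save_checkpoint(path, index + 1)
```

A call such as run(batch, "/mnt/data/checkpoint.json", process_item) would then pick up where the previous container left off, with batch and process_item standing in for your own workload.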
Availability Zone Deployment Failures
ACI supports zonal container group deployments: you can pin a container group to a specific availability zone. If you're specifying a zone and the deployment fails, check that the zone you're targeting is actually supported for ACI in your region. Not every region has all three AZ options available for ACI. Also note that availability zone is specified per container group, not per container, so all containers in the group will land in the same zone.
Standby Pools for Faster Startup
If your use case requires even faster startup times than standard ACI cold starts provide, ACI supports standby pools, pre-provisioned container capacity that eliminates the cold start latency. Standby pools require a separate configuration step and add cost, but for latency-sensitive applications this can be the right trade-off. Misconfigured standby pool settings (wrong SKU, wrong region) will silently fall back to standard provisioning without error, so validate your standby pool is actually being used by checking the provisioning telemetry.
Contact Microsoft Support if deployments keep failing with platform-side errors (such as InternalServerError or ServiceUnavailable), or if a confidential container TEE initialization is failing with no actionable detail. Microsoft Support can pull platform-side logs that aren't exposed to customers.
Prevention & Best Practices
The best Azure Container Instances deployment is one that never needs troubleshooting. Here's what I consistently see working well for teams that run ACI at scale.
Always test locally with Docker before deploying to ACI. ACI accepts images from Docker Hub and other standard registries, so anything that runs correctly in a local docker run will almost certainly run correctly in ACI. The majority of application-level failures I see in ACI deployments would have been caught in a 30-second local test. Use the same environment variable values, the same volume mount paths, the same port mappings.
Use YAML or ARM templates for all non-trivial deployments. One-liner CLI commands are fine for quick tests, but for production deployments you want your container group definition in source control. YAML is simpler for most use cases; ARM templates give you more control for complex multi-container groups and confidential container deployments. The Azure docs provide official templates for both; use those as your starting point rather than building from scratch.
Build managed identity into your architecture from day one. Using managed identity for ACR image pulls means no credentials to rotate, no secrets to accidentally commit to source control, and no 401 errors caused by expired passwords. The ACI service supports managed identity for any service that uses Microsoft Entra authentication, not just ACR. If your container needs to talk to Azure Key Vault, Azure Storage, or Azure SQL, managed identity is the right authentication path for all of them.
Pin your restart policy to match your workload type. ACI supports three restart policies: Always (default, good for long-running services), OnFailure (good for batch jobs that should retry on error), and Never (good for one-shot tasks and debugging). Using the wrong restart policy causes confusing behavior: a batch job on Always will keep restarting after it completes successfully, consuming resources and generating spurious logs.
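The three policies boil down to a small decision table. The function below is only a model of the behavior described in this guide, written out so the cases can be compared side by side; it is not how ACI implements restarts.

```python
def should_restart(policy: str, exit_code: int) -> bool:
    """Model whether a container is restarted after it exits, per policy."""
    if policy == "Always":
        return True
    if policy == "OnFailure":
        return exit_code != 0
    if policy == "Never":
        return False
    raise ValueError(f"unknown restart policy: {policy!r}")


# Columns: policy, restarted after exit code 0, restarted after exit code 1
for policy in ("Always", "OnFailure", "Never"):
    print(policy, should_restart(policy, 0), should_restart(policy, 1))
# prints:
# Always True True
# OnFailure False True
# Never False False
```

The first row is the surprising one for batch jobs: a clean exit still triggers a restart under Always.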
- Tag every container image with a specific version SHA, not just :latest. ACI caches common base OS images, but :latest tags can cause unexpected behavior when the underlying image changes between deployments.
- Set explicit CPU and memory resource requests that match your application's actual usage. You pay by the second based on what you provision, so over-provisioning burns money and under-provisioning causes OOMKilled failures.
- If deploying into a VNet, create the NAT gateway and configure the subnet association before running any az container create commands. It's much easier than fixing outbound connectivity after the fact.
- Use Azure Monitor metrics to track container CPU and memory utilization over time. ACI exposes these via the Azure Monitor integration, and they're invaluable for right-sizing your resource requests.
Frequently Asked Questions
What is Azure Container Instances and when should I use it instead of AKS?
Azure Container Instances is a serverless container platform: you give Azure a Docker image and it runs it, without you managing any VMs, clusters, or orchestration infrastructure. It's the right choice for event-driven workloads, short-lived batch jobs, CI/CD pipeline steps, and any scenario where you need a container to run fast and then stop. Azure Kubernetes Service (AKS) is better for long-running, always-on services that need horizontal scaling, rolling updates, service mesh features, and complex orchestration. That said, the two aren't mutually exclusive. ACI can be used as burst capacity for AKS by deploying pods as virtual nodes on ACI, which is a common pattern for variable-load production workloads.
Why does my Azure Container Instance keep restarting instead of stopping after it finishes?
This is almost always a restart policy issue. The default restart policy for ACI is Always, which means ACI will restart your container every time it exits, even if it exits cleanly with code 0. For batch jobs or one-shot tasks, set --restart-policy Never or --restart-policy OnFailure in your deployment command. Use Never if the job should run exactly once regardless of outcome, or OnFailure if you want it to retry on non-zero exit codes but stop when it succeeds. The restart policy can't be changed in place on an existing container group; delete the group and redeploy it with the new policy.
Can I run Windows containers in Azure Container Instances?
Yes, ACI supports both Windows and Linux containers using the same API; you specify your OS preference via --os-type Windows or --os-type Linux at creation time. However, Windows containers have meaningful feature restrictions compared to Linux: they support only a single container per container group (no multi-container sidecars), no volume mounting of any kind (Azure Files, emptyDir, GitRepo, secret), no GPU resources, and no Azure Monitor resource usage metrics. For Windows container deployments, Microsoft recommends using images based on common Windows base images like mcr.microsoft.com/windows/servercore or mcr.microsoft.com/windows/nanoserver.
How do I give my Azure Container Instance access to other Azure services securely?
Use managed identity: it's the right way to authenticate from ACI to any Azure service that supports Microsoft Entra authentication, including Azure Key Vault, Azure Storage, Azure SQL, and Azure Container Registry. You enable a system-assigned or user-assigned managed identity on your container group at deployment time, then assign the appropriate Azure RBAC role to that identity scoped to the target resource. Inside your container, the identity is available via the Azure Instance Metadata Service endpoint (http://169.254.169.254/metadata/identity), and Azure SDKs like the Azure Identity library will pick it up automatically via DefaultAzureCredential. This way your container code never handles a secret, connection string, or password directly.
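If you are not using an Azure SDK, the token request can be made by hand. The sketch below only builds the request so its shape can be inspected; it does not send it. The /oauth2/token path, the api-version value, and the Metadata header are assumptions based on the general Instance Metadata Service convention, so check them against the current Azure documentation before relying on them.

```python
import urllib.parse
import urllib.request

# Base endpoint from the text above; the path suffix and api-version are assumptions.
IDENTITY_ENDPOINT = "http://169.254.169.254/metadata/identity/oauth2/token"
API_VERSION = "2018-02-01"


def build_token_request(resource: str) -> urllib.request.Request:
    """Build (but do not send) a managed identity token request for a resource."""
    query = urllib.parse.urlencode({"api-version": API_VERSION, "resource": resource})
    return urllib.request.Request(
        f"{IDENTITY_ENDPOINT}?{query}",
        headers={"Metadata": "true"},
    )


request = build_token_request("https://vault.azure.net")
print(request.full_url)
```

Inside a container group with an identity attached, passing this request to urllib.request.urlopen should return a JSON document containing an access token. Outside Azure the address is not reachable, which is expected.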
My container deployed successfully but I can't reach it from the internet. What's wrong?
Check three things in order. First, confirm you deployed with a public IP address; ACI containers may not get a public IP unless you explicitly request one with --ip-address Public. Second, make sure you've opened the correct port in your container group definition with the --ports flag matching what your application listens on. Third, if you specified a DNS name label, verify the FQDN format ([your-label].[region].azurecontainer.io) and that the DNS name isn't already taken in that region. If all three are correct and you still can't connect, check that your application is actually binding to 0.0.0.0 and not just 127.0.0.1 or localhost. That's a common Docker container networking gotcha that will block all external connections.
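The bind address difference is easy to see in a few lines of code. This sketch opens a listener on each address and prints what the socket is bound to; the helper name is hypothetical, and port 0 is used so the operating system picks a free port for the demonstration.

```python
import socket


def listen_on(host: str, port: int = 0) -> socket.socket:
    """Open a TCP listener; port 0 lets the OS pick a free port for the demo."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    server.bind((host, port))
    server.listen()
    return server


# 0.0.0.0 accepts connections arriving on any interface, including the
# container's external one; 127.0.0.1 only accepts loopback traffic.
for host in ("0.0.0.0", "127.0.0.1"):
    with listen_on(host) as server:
        print("listening on", server.getsockname()[0])
```

Most web frameworks expose the same choice as a host or bind setting, and many default to the loopback address in development mode.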
What's the difference between ACI Spot containers and regular ACI containers?
ACI Spot containers run on unused Azure capacity at discounted prices of up to 70% compared to regular-priority containers, billed per second for memory and core usage just like standard ACI, but at a much lower rate. The trade-off is availability: Spot containers can be preempted (forcibly stopped) when Azure needs that capacity back for higher-priority workloads. This makes Spot ideal for interruptible batch processing, non-time-sensitive data jobs, development and testing environments, and any workload that can handle interruption gracefully. Spot containers are not suitable for production APIs, user-facing services, or any workload where an unexpected restart would cause a problem. If you need cost savings on always-on workloads, AKS with Spot node pools is a better fit.