Azure HPC Not Working, Diagnosed and Fixed (2026 Guide)
Why Azure HPC Is Not Working
You've submitted your Azure Batch job, the backbone of Azure HPC workloads, and nothing happens. Or worse, it crashes with an error code so cryptic it reads like a ransom note. I've seen this happen on everything from brand-new Batch pool deployments to production clusters that worked fine last Tuesday. The problem is almost never one thing. It's usually a chain of configuration decisions that made sense in isolation but break down the moment real compute pressure hits.
Let's be direct about what Azure HPC actually means in practice. Most Azure HPC setups run on Azure Batch, Microsoft's managed service for high-performance, large-scale parallel workloads. Whether you're running molecular dynamics simulations, financial risk models, or rendering pipelines, Batch is doing the heavy lifting. And Batch is deeply dependent on two other services: Azure Storage (for resource files, application packages, and output data) and Azure Virtual Networks (for node communication and security). When any one of those three legs wobbles, your entire HPC pipeline collapses.
The errors themselves don't help. ResourceContainerAccessDenied sounds like a permissions problem, and it is, technically, but the real cause is almost always a storage account firewall rule that doesn't account for how Batch routes its traffic. Then there are Azure Batch node creation delays, where your nodes sit in a "waiting" state for 20, 30, even 60 minutes before timing out entirely. That one's usually a Python package that's too large to install via start task, or a mismatch between your VM image generation (Gen1 vs Gen2) and the VM family you selected.
Enterprise teams running domain-joined clusters hit an additional layer of pain: managed identity configuration errors, subnet service endpoint gaps, and cross-region networking rules that silently drop traffic. The Azure portal gives you just enough visibility to know something is wrong, but rarely enough to pinpoint exactly why.
The good news? Every one of these issues is fixable. This guide walks you through the most common reasons Azure HPC is not working, from storage firewall misconfigurations to slow node provisioning, and gives you the exact steps to get your compute cluster running reliably.
The Quick Fix, Try This First
If your Azure HPC jobs are failing with an access denied error, specifically ResourceContainerAccessDenied on your Azure Blob containers, the fastest path to resolution depends on one key question: are your Batch pool and your storage account in the same Azure region?
Here's why that matters. When both are in the same region, traffic between your Batch nodes and the storage account travels over the Azure backbone network using private IP addresses, not public ones. That's actually good for performance. But it creates a nasty side effect: you cannot add private IP addresses to a storage account firewall allowlist. The firewall only accepts public IPs and subnet rules. So if your storage account firewall is enabled and has no rule for the pool's subnet, same-region Batch traffic is rejected and you get ResourceContainerAccessDenied.
Same-region quick fix: Go to your storage account in the Azure portal → Networking → Firewalls and virtual networks. Switch Public network access to Enabled from selected virtual networks and IP addresses. Then add the subnet your Batch pool is running in. That's it. The traffic routes through the subnet's service endpoint instead of a public IP, which the firewall can actually evaluate and allow.
Cross-region quick fix: If your pool and storage account are in different regions, your Batch node traffic exits over the public internet. In that case, you need a static public IP address assigned to your Batch pool. Navigate to Azure portal → Batch Account → your Pool → Properties to find the current public IP. Then add that IP to your storage account's firewall allowlist under Networking → Firewall by entering it manually in the address range field. Don't use the Add your client IP address checkbox for this; it adds the address of the machine you're browsing from, not the pool's.
One important caveat before you start: if you are uploading application packages to this storage account, neither fix above applies. Application package uploads require the storage account to have no firewall configured at all. Microsoft's documentation is explicit on this point: it's a hard requirement, not a workaround.
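To answer the same-region question without clicking through the portal, you can compare the two resources' locations from the command line. The following is a minimal sketch, assuming you are signed in with the Azure CLI. The function name `suggest_storage_fix` and its argument order are illustrative and not part of any Azure tooling. It relies on a Batch pool living in the same region as its Batch account.

```shell
# Sketch: decide which quick fix applies by comparing resource regions.
# suggest_storage_fix and its arguments are hypothetical placeholders.
suggest_storage_fix() {
  batch_rg="$1"; batch_account="$2"; storage_rg="$3"; storage_account="$4"

  # 'location' comes back in normalized form, e.g. "eastus"
  batch_region=$(az batch account show --resource-group "$batch_rg" \
    --name "$batch_account" --query location --output tsv)
  storage_region=$(az storage account show --resource-group "$storage_rg" \
    --name "$storage_account" --query location --output tsv)

  if [ "$batch_region" = "$storage_region" ]; then
    echo "same region ($batch_region): add the pool subnet to the storage firewall"
  else
    echo "cross region ($batch_region vs $storage_region): allowlist a static public IP"
  fi
}
```

Call it as `suggest_storage_fix yourBatchRG yourBatchAccount yourStorageRG yourStorageAccount` and follow whichever quick fix it names.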
Step 1: Identify the Exact Failure
Before you touch any configuration, you need to know exactly what kind of Azure HPC failure you're dealing with. Open the Azure portal and navigate to your Batch account. Go to Jobs, find the failed job, and click into it. Select the failed Task, then click Task Information to see the raw error output.
You're looking for two key fields: Category and Code. The most common patterns are:
- Category: UserError / Code: ResourceContainerAccessDenied: storage firewall issue, covered in Steps 2 and 3
- Nodes stuck in WaitingForStartTask or StartTaskFailed state: application installation failure, covered in Steps 4 and 5
- Pool stuck in Resizing with no nodes appearing: VM quota, image compatibility, or region mismatch problem
For node-level detail, go to Pools → your pool → Nodes. Click any individual node and then Upload Batch logs to retrieve the agent logs to a storage container. These logs contain the actual error messages from the node agent. They're far more specific than job-level errors and will tell you exactly where the start task choked.
Also check your pool's Properties pane. Note the Subnet under Network Configuration, the public IP addresses listed, and the VM size and image SKU. You'll need all of this in later steps, so write it down now rather than navigating back and forth mid-fix.
If you see the node state cycling through Starting → WaitingForStartTask → Unusable in a loop, that's a definitive sign your start task is failing, almost certainly due to an oversized package install or a bad Python path configuration.
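The same triage can be done from a terminal. The sketch below assumes you have already run `az batch account login` for your Batch account, and that the CLI output uses the `executionInfo.failureInfo` field names from the Batch REST API. The three function names are hypothetical helpers, and `triage` simply maps what you see to the step numbers used in this guide.

```shell
# Sketch: pull failure details from the CLI instead of the portal.
# Assumes 'az batch account login' has already been run.

# Print the failure category, code, and message for one task.
show_task_failure() {
  az batch task show --job-id "$1" --task-id "$2" \
    --query "executionInfo.failureInfo" --output json
}

# Print each node's ID and state for a pool, one node per line.
list_node_states() {
  az batch node list --pool-id "$1" \
    --query "[].[id, state]" --output tsv
}

# Map an error code or node state to the step in this guide.
# Input is lowercased first, so the casing of the value does not matter.
triage() {
  key=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$key" in
    resourcecontaineraccessdenied) echo "storage firewall: Steps 2 and 3" ;;
    waitingforstarttask|starttaskfailed|unusable) echo "start task or image: Steps 4 and 5" ;;
    *) echo "unrecognized: inspect the uploaded agent logs" ;;
  esac
}
```

For example, `triage "$(show_task_failure yourJobId yourTaskId | grep -o 'ResourceContainerAccessDenied')"` would point you at the storage firewall steps when that code is present.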
Step 2: Fix Same-Region Storage Firewall Access
This step applies when your Azure Batch pool and your Azure Storage account are deployed in the same Azure region. Same-region traffic between Batch nodes and storage travels over Microsoft's internal backbone network, which means nodes use private IP addresses, and private IPs cannot be added to a storage account IP-based firewall allowlist.
The correct fix is to allow access via your Batch pool's virtual network subnet instead. Here's the exact path:
- Go to Azure portal → Batch Account → Pools → [your pool] → Properties
- Under Network Configuration, find and copy the full Subnet resource ID (it looks like /subscriptions/.../virtualNetworks/.../subnets/...)
- Navigate to your Storage Account → Networking → Firewalls and virtual networks
- Set Public network access to Enabled from selected virtual networks and IP addresses
- Under Virtual networks, click + Add existing virtual network
- Select the subscription, virtual network, and subnet that your Batch pool uses
If the subnet you're adding does not have the Microsoft.Storage service endpoint enabled, Azure shows this warning: "The following networks don't have service endpoints enabled for 'Microsoft.Storage'. Enabling access will take up to 15 minutes to complete." Click through and save. Wait the full 15 minutes. Then resubmit your Batch job and check whether the ResourceContainerAccessDenied error clears.
If the save succeeds and your job still fails after 15 minutes, go back and verify the Batch pool is actually attached to the subnet you added, not a different subnet or a peered network that resolves differently.
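If you prefer to script this step, the portal clicks above correspond roughly to three CLI calls. This is a sketch under assumptions: `allow_batch_subnet` is a hypothetical helper, the subnet ID is the one you copied from the pool's Properties pane, and the storage account may live in a different resource group than the virtual network.

```shell
# Sketch: add a Batch pool subnet to a storage account firewall.
# allow_batch_subnet and its arguments are hypothetical placeholders.
allow_batch_subnet() {
  subnet_id="$1"; storage_rg="$2"; storage_account="$3"

  # Split /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.Network/
  # virtualNetworks/<vnet>/subnets/<subnet> on "/" and keep the parts we need.
  IFS=/ read -r skip skip skip skip vnet_rg skip skip skip vnet skip subnet <<EOF
$subnet_id
EOF

  # Enable the Microsoft.Storage service endpoint on the subnet.
  # This flag takes the complete endpoint list, so add any others you rely on.
  az network vnet subnet update --resource-group "$vnet_rg" \
    --vnet-name "$vnet" --name "$subnet" \
    --service-endpoints Microsoft.Storage || return 1

  # Allow the subnet first, then make sure everything else is denied.
  az storage account network-rule add --resource-group "$storage_rg" \
    --account-name "$storage_account" --subnet "$subnet_id" || return 1
  az storage account update --resource-group "$storage_rg" \
    --name "$storage_account" --default-action Deny
}
```

The subnet rule is added before the default action is set to Deny so that the pool is never locked out between the two calls.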
Step 3: Fix Cross-Region Storage Firewall Access
When your Batch pool and storage account are in different Azure regions, the situation is actually simpler in one way: cross-region traffic exits over the public internet, so your Batch nodes will have a public IP address that you can add to the storage firewall allowlist. But there's a catch: that IP address can change.
Every time you resize a pool down to zero nodes and then scale back out, Azure may assign different public IP addresses to the new nodes. This is the core of the problem. If you're manually managing the allowlist, you'll be chasing rotating IPs forever.
The permanent solution is to create a Batch pool with a static public IP address. Here's the approach:
- In the Azure portal, create a new Public IP Address resource. Set it to Static allocation, not Dynamic.
- Create a new Batch pool (or recreate the existing one) with this static public IP assigned. Use the Create a pool with specified public IP addresses option in the Batch account pool creation wizard.
- Once the pool is created, go back to Pool → Properties and verify the static public IP is listed under the pool's Load Balancer configuration.
- Navigate to Storage Account → Networking → Firewall
- Under Firewall, add the static public IP address you assigned
After saving, rerun your Azure HPC job using the newly created pool. The storage access error should be gone. Importantly, delete the old pool once you've confirmed the new one works. Running two pools simultaneously burns unnecessary compute quota and budget.
# Verify your Batch pool's public IP via Azure CLI
az batch pool show \
--pool-id yourPoolId \
--account-name yourBatchAccount \
--account-endpoint yourBatchEndpoint \
--query "networkConfiguration.publicIPAddressConfiguration"
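The public IP and firewall parts of this step can also be scripted. The sketch below only covers creating the static address and allowlisting it; the pool itself still has to be created with that address as described above, and Batch may impose further requirements on the IP resource that you should check in its documentation. Both function names are hypothetical. The private-range check reflects the fact that storage IP rules accept only public addresses.

```shell
# Sketch: create a static public IP and allowlist it on the storage firewall.
# Function names and arguments are hypothetical placeholders.

# Succeed if the address is in an RFC 1918 private range,
# which storage account IP rules do not accept.
is_private_ip() {
  case "$1" in
    10.*|192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

allowlist_static_ip() {
  ip_rg="$1"; ip_name="$2"; storage_rg="$3"; storage_account="$4"

  az network public-ip create --resource-group "$ip_rg" --name "$ip_name" \
    --sku Standard --allocation-method Static --output none || return 1

  # Read the allocated address back from the resource.
  ip=$(az network public-ip show --resource-group "$ip_rg" --name "$ip_name" \
    --query ipAddress --output tsv)

  if [ -z "$ip" ] || is_private_ip "$ip"; then
    echo "refusing to allowlist '$ip'" >&2
    return 1
  fi

  az storage account network-rule add --resource-group "$storage_rg" \
    --account-name "$storage_account" --ip-address "$ip"
}
```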
Step 4: Replace Heavy Start Tasks with a Custom VM Image
This is the fix for the second major Azure HPC failure mode: nodes that take forever to become ready, or fail during the start task with no obvious error. I've seen engineers wait 45 minutes watching a node sit in WaitingForStartTask before the whole thing times out. If your start task installs a large Python runtime, or any large application suite, this is almost certainly your problem.
The root cause is architectural. Azure Batch's start task model was designed for lightweight environment setup: small scripts, config file drops, service registrations. When you try to install a full Python distribution plus dozens of packages (NumPy, SciPy, TensorFlow, whatever your HPC stack needs) every single time a node is provisioned or reimaged, you're fighting the platform instead of working with it.
The correct fix is to build a custom VM image with Python and all your packages pre-installed, then point your Batch pool at that image. Here's how to do that for a Windows node:
- Create a Windows VM in the Azure portal. Match the region to your Batch account's region exactly; this matters for image compatibility.
- Choose an image with Gen1 in the name (e.g., Windows Server 2019 Datacenter - Gen1). Some VM families don't support Gen2, and Batch can fail silently if there's a mismatch.
- Connect via RDP and run your Python installer. Use the Customize installation option and on the Advanced Options screen, install to C:\Python310 (or your chosen path).
- Manually append that path to the System PATH environment variable so all users (including the Batch node agent user) can call Python without a full path.
- Install all required packages via pip install.
- Sysprep and capture the VM as a managed image.
- Create your Batch pool using this custom image under Operating system → Custom image.
After this, node provisioning times drop dramatically, typically from 30–60 minutes down to under 5 minutes, because the start task no longer needs to run a massive install sequence.
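The capture step in the list above can be done from the CLI once the VM has been generalized from the inside (Sysprep on Windows). This is a sketch, assuming the VM and the image live in the same resource group; `capture_batch_image` is a hypothetical helper name.

```shell
# Sketch: capture a prepared VM as a Gen1 managed image.
# Run Sysprep inside the VM before calling this.
# capture_batch_image and its arguments are hypothetical placeholders.
capture_batch_image() {
  rg="$1"; vm_name="$2"; image_name="$3"

  # The VM must be deallocated and marked generalized before capture.
  az vm deallocate --resource-group "$rg" --name "$vm_name" || return 1
  az vm generalize --resource-group "$rg" --name "$vm_name" || return 1

  # Create the managed image and print its resource ID for the pool definition.
  az image create --resource-group "$rg" --name "$image_name" \
    --source "$vm_name" --hyper-v-generation V1 \
    --query id --output tsv
}
```

The printed resource ID is what you select (or paste) as the custom image when creating the pool. A generalized VM can no longer be started, so treat it as disposable.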
Step 5: Match the Image Generation and SKU to the VM Family
Even after building a custom image, Azure HPC node creation can still fail if there's a mismatch between the image SKU, VM family, and pool operating system settings. This is a subtle configuration problem that catches a lot of engineers off guard.
Here's the specific thing to check. When you create a Batch pool in the Azure portal and select your operating system, there's a SKU dropdown in the Operating system section. This list must contain the exact OS version you specified in your custom image. If it doesn't match, Batch may provision the pool but nodes will fail to initialize correctly.
The Gen1 vs Gen2 distinction is particularly important. Microsoft's guidance for this scenario is to pick a Gen1 image unless you have a specific reason not to, because some VM families don't support Gen2 images, including some H-series and N-series sizes that are common in HPC workloads. Support varies by size, and some newer sizes support only Gen2, so confirm the supported generations for your exact VM size before building the image. If you pair an image with a VM size that doesn't support its generation, you'll get a cryptic pool creation error or nodes that provision but immediately become unusable.
To check your image's generation, go to Azure portal → Images → [your custom image] → Properties. Look for the Hyper-V generation field. It should say V1.
For Linux nodes, the equivalent setup involves Ubuntu Server 18.04 LTS - Gen1, and you'll want to configure your Python path system-wide, for example with a script in /etc/profile.d, rather than in the .bashrc of the user who built the image. Batch tasks run under a different user context than the one you used during image setup, so user-scoped PATH changes won't carry over.
# Linux: make Python resolvable system-wide during image preparation
# /etc/environment takes literal KEY=VALUE pairs and does not expand $PATH,
# so export the path from a profile script instead
echo 'export PATH="/usr/local/python310/bin:$PATH"' | sudo tee /etc/profile.d/python310.sh
source /etc/profile.d/python310.sh
python3 --version # Verify the interpreter resolves from the updated PATH
After reconfiguring the pool with a correctly matched image, delete all existing nodes and let the pool reprovision from scratch. Don't try to reimage existing broken nodes; start clean.
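The Hyper-V generation check described above can be run from the CLI before you create the pool. This is a minimal sketch; `check_image_generation` is a hypothetical helper, and the expected generation defaults to V1 to match the Gen1 setup in this guide.

```shell
# Sketch: confirm a managed image's Hyper-V generation before creating a pool.
# check_image_generation is a hypothetical helper; expected defaults to V1.
check_image_generation() {
  rg="$1"; image_name="$2"; expected="${3:-V1}"

  actual=$(az image show --resource-group "$rg" --name "$image_name" \
    --query hyperVGeneration --output tsv)

  if [ "$actual" = "$expected" ]; then
    echo "ok: $image_name is $actual"
  else
    echo "mismatch: $image_name is ${actual:-unknown}, expected $expected"
    return 1
  fi
}
```

Because the function returns non-zero on a mismatch, it can gate a pool creation script: `check_image_generation yourRG yourImage && ...`.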
Advanced Troubleshooting
When the standard fixes don't resolve your Azure HPC not working situation, it's time to go deeper. These scenarios typically affect enterprise environments, domain-joined Batch accounts, or deployments with complex networking topology.
Managed Identity Configuration Failures
Azure Batch now supports managed identities as a way to authenticate to storage accounts without storing credentials anywhere. This is the right approach for enterprise HPC, but it introduces a new failure mode. If you've assigned a managed identity to your Batch account or pool but forgot to grant it the correct RBAC role on the storage account, you'll get access denied errors that look identical to firewall errors on the surface.
Check this in the Azure portal: Storage Account → Access Control (IAM) → Role assignments. Your Batch account's managed identity needs at minimum Storage Blob Data Contributor on any container it needs to write to, and Storage Blob Data Reader on containers it only reads from. If the role assignment is missing or scoped to the wrong level (e.g., container vs account level), Batch jobs will fail with access denied regardless of firewall settings.
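The role check can also be done from the CLI. The sketch below assumes you know the managed identity's object (principal) ID and the resource ID of the storage account or container you want to scope to. `ensure_blob_role` is a hypothetical helper. Note that the list call only looks at assignments made at exactly the scope you pass, not inherited ones.

```shell
# Sketch: check for a blob data role and grant it if it is missing.
# ensure_blob_role is hypothetical; principal_id is the identity's object ID.
ensure_blob_role() {
  principal_id="$1"; scope="$2"; role="${3:-Storage Blob Data Contributor}"

  # Look for an existing assignment of this role at this exact scope.
  existing=$(az role assignment list --assignee "$principal_id" --scope "$scope" \
    --query "[?roleDefinitionName=='$role'].id | [0]" --output tsv)

  if [ -n "$existing" ]; then
    echo "already assigned: $role"
    return 0
  fi

  az role assignment create --assignee-object-id "$principal_id" \
    --assignee-principal-type ServicePrincipal \
    --role "$role" --scope "$scope" --output none && echo "assigned: $role"
}
```

Pass "Storage Blob Data Reader" as the third argument for containers the pool only reads from. New role assignments can take a few minutes to take effect, so allow for that before retesting a job.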
Subnet Service Endpoint Propagation Delays
When you enable a Microsoft.Storage service endpoint on a subnet, Azure says it takes "up to 15 minutes." In practice, on subnets attached to large virtual networks with many route table entries, propagation can take closer to 25–30 minutes. If you're validating a fix and giving it exactly 15 minutes before concluding it didn't work, try waiting longer before reverting changes.
You can monitor service endpoint status via Azure CLI:
az network vnet subnet show \
--resource-group yourRG \
--vnet-name yourVNet \
--name yourSubnet \
--query "serviceEndpoints"
Look for "provisioningState": "Succeeded" on the Microsoft.Storage entry. If it says Updating, keep waiting.
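Rather than rerunning that command by hand, you can poll it. The following is a sketch with hypothetical names: `wait_for_storage_endpoint` checks once per interval (60 seconds by default) for up to 30 attempts, which covers the longer propagation window described above.

```shell
# Sketch: poll until the Microsoft.Storage endpoint reports Succeeded.
# wait_for_storage_endpoint is hypothetical; interval and attempts are tunable.
wait_for_storage_endpoint() {
  rg="$1"; vnet="$2"; subnet="$3"
  interval="${4:-60}"; attempts="${5:-30}"

  i=1
  while [ "$i" -le "$attempts" ]; do
    # Extract only the provisioning state of the Microsoft.Storage entry.
    state=$(az network vnet subnet show --resource-group "$rg" \
      --vnet-name "$vnet" --name "$subnet" \
      --query "serviceEndpoints[?service=='Microsoft.Storage'].provisioningState | [0]" \
      --output tsv)
    if [ "$state" = "Succeeded" ]; then
      echo "ready after $i check(s)"
      return 0
    fi
    sleep "$interval"
    i=$((i + 1))
  done
  echo "still '$state' after $attempts checks" >&2
  return 1
}
```

Usage: `wait_for_storage_endpoint yourRG yourVNet yourSubnet && echo "resubmit the job"`.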
Cross-Subscription Storage Account Access
Some enterprise HPC deployments use a storage account that lives in a different Azure subscription than the Batch account. In this scenario, virtual network rules work differently. The storage account's virtual network firewall can only reference virtual networks in the same subscription by default. Cross-subscription VNet rules require setting up resource provider registrations in both subscriptions and using the full resource ID format when adding the VNet rule.
Azure Batch Pool Deletion Failures
If an old pool is stuck in a Deleting state and blocking you from recreating it, the usual cause is nodes that are still running tasks or are in a transitional state. Batch has to wind down running tasks and resize the pool to zero before it can remove it, so active jobs prolong the deletion. Terminate the jobs first via Jobs → [job] → Terminate, then wait 2–3 minutes before attempting pool deletion again. If the pool still won't delete, open a support ticket; forced pool deletion at the infrastructure level is not something the portal can always handle gracefully.
Event Log Analysis for Node-Level Failures
For Windows Batch nodes, RDP into a failed node (if it's still accessible) and check the Windows Event Viewer under Windows Logs → Application. Filter by Source = MicrosoftAzureBatch. Node agent events are verbose and will tell you exactly which command in your start task failed and why, including full paths and return codes that don't appear in the portal UI.
Prevention & Best Practices
Once you get Azure HPC working, the goal is to keep it working. The issues covered in this guide are almost entirely preventable with the right upfront architecture decisions. Here's what to put in place before your next deployment.
Co-locate Batch Accounts and Storage Accounts
Always create your Azure Batch account and its associated Azure Storage account in the same region. Yes, cross-region storage access is technically possible, but it routes over the public internet, requires static public IP management, and introduces latency into your HPC data pipeline. Same-region deployments use the Azure backbone and resolve the IP allowlist problem at the subnet level, which is far more maintainable at scale.
Build Custom Images, Not Heavy Start Tasks
Define a clear boundary: start tasks should handle configuration, not installation. If your start task takes more than 2–3 minutes on a clean node, you're doing too much there. Move application installation into a custom VM image using the managed image pipeline. This gives you faster node provisioning, reproducible environments, and eliminates an entire class of timeout-related failures.
Use Static Public IPs for Cross-Region Scenarios
If you genuinely need cross-region storage access, perhaps because your data lives in a specific region for compliance reasons, always assign a static public IP to your Batch pool at creation time. Never rely on dynamically assigned IPs and then try to keep the storage firewall allowlist current manually. That breaks every time you scale to zero and back out.
Validate Gen1 Compatibility Before Pool Creation
Before you create any Batch pool against a custom image, check the image's Hyper-V generation and cross-reference it against Microsoft's VM family compatibility list. H-series (HB, HC, HBv2, HBv3) and N-series (NC, ND, NV) are the VM families most commonly used in HPC workloads, and several of them only support Gen1 images. Catching this mismatch before pool creation saves hours of debugging.
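One way to do that cross-reference from a terminal is to ask the compute SKU listing which generations a size supports. This is a sketch under an assumption: it relies on the `HyperVGenerations` capability that `az vm list-skus` reports for VM sizes (a comma-separated value such as "V1,V2"). Both function names are hypothetical.

```shell
# Sketch: list the Hyper-V generations a VM size supports in a region.
# Relies on the HyperVGenerations capability reported by 'az vm list-skus'.
supported_generations() {
  location="$1"; size="$2"
  az vm list-skus --location "$location" --size "$size" \
    --resource-type virtualMachines \
    --query "[?name=='$size'] | [0].capabilities[?name=='HyperVGenerations'].value | [0]" \
    --output tsv
}

# Succeed if the size supports the given generation (V1 or V2).
size_supports_generation() {
  gens=$(supported_generations "$1" "$2")
  case ",$gens," in
    *",$3,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, `size_supports_generation eastus Standard_HB120rs_v3 V1 || echo "pick a different image generation"`.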
- Tag every Batch account and storage account with matching environment and region tags to catch cross-region mismatches at a glance in the portal
- Enable Diagnostic Settings on your Batch account and stream logs to a Log Analytics workspace. This gives you a persistent, searchable history of node events and job failures
- Set a pool resize timeout explicitly (Azure default is 15 minutes, but complex start tasks need 30+) to prevent premature node failure declarations
- Use User Subscription pool allocation mode instead of Batch Service if you need VNet integration and managed identity support in the same pool; Batch Service mode has limitations in those configurations
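As an example of the explicit resize timeout from the list above, the sketch below wraps the resize call and converts minutes into the ISO 8601 duration the CLI expects. `resize_with_timeout` is a hypothetical helper, and it assumes `az batch account login` has already been run.

```shell
# Sketch: resize a pool with an explicit timeout given in minutes.
# resize_with_timeout is a hypothetical helper; the default is 30 minutes.
resize_with_timeout() {
  pool_id="$1"; nodes="$2"; minutes="${3:-30}"

  # Batch expects an ISO 8601 duration, e.g. PT30M for 30 minutes.
  az batch pool resize --pool-id "$pool_id" \
    --target-dedicated-nodes "$nodes" \
    --resize-timeout "PT${minutes}M"
}
```

Usage: `resize_with_timeout yourPoolId 8 45` requests eight dedicated nodes and allows 45 minutes for the resize.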
Frequently Asked Questions
Why does my Azure Batch job say ResourceContainerAccessDenied even though I added my IP to the storage firewall?
If your Batch pool and storage account are in the same Azure region, adding an IP address to the storage firewall won't help: same-region traffic flows over the Azure backbone using private IPs, which can't be allowlisted. You need to add the Batch pool's subnet to the storage account's virtual network rules instead. Go to Storage Account → Networking → Firewalls and virtual networks → add your Batch subnet. Wait up to 15 minutes for the service endpoint to propagate before retesting.
My Batch nodes are stuck in WaitingForStartTask for 30+ minutes, what do I do?
This is almost always a start task that's taking too long to complete, most commonly because you're installing a large Python distribution or a heavyweight application package at node initialization time. The fix is to pre-bake that installation into a custom VM image rather than running it as a start task. Build the image once with everything pre-installed, capture it as a managed image, and point your Batch pool at that image. Node ready times drop from 30–60 minutes to under 5 minutes.
Can I use Azure Batch with a storage account that has a firewall if I'm uploading application packages?
No. This is a hard limitation documented by Microsoft. Application package uploads to the storage account associated with your Batch account require that the storage account has no firewall configured at all. All of the subnet-based and IP-based workarounds described for job execution do not apply to application package uploads. If you need both a firewall and application packages, you'll need to use a separate, unfirewalled storage account for packages and a firewalled one for job data.
Does it matter whether I use Gen1 or Gen2 VM images for Azure Batch HPC pools?
Yes, it matters a lot for HPC workloads specifically. The H-series and N-series VM families commonly used in Azure HPC (HB, HC, HBv2, HBv3, NC, ND) don't all support Gen2 images. If you create a Batch pool using a Gen2 image on a VM size that doesn't support it, you'll hit pool creation errors or nodes will fail to initialize. Use Gen1 images for HPC Batch pools unless you've confirmed Gen2 support for your specific VM SKU. When creating a pool, verify the SKU list in the Operating system section matches what your image was built on.
How do I find out why a specific Batch node failed without RDP access?
From the Azure portal, go to Batch Account → Pools → your pool → Nodes → click the specific node → Upload Batch logs. You'll need to specify a storage container for the logs to be uploaded to. Once uploaded, download the agent-log.txt and start-task-stderr.txt files. These contain the raw output from the node agent and your start task's standard error stream. They're far more detailed than anything visible in the portal UI and will tell you exactly which command failed and with what return code.
My Batch pool's public IP changes every time I resize to zero, is there any way to keep it static?
Yes. Create a dedicated Public IP Address resource in the Azure portal with Static allocation, then assign it to your Batch pool at creation time using the "Create a pool with specified public IP addresses" option. Once assigned, that IP stays the same regardless of how many times you resize the pool to zero and back out. You then add this one static IP to your storage account firewall allowlist and never have to touch the allowlist again. This is the recommended approach for any cross-region Batch and storage deployment.