How to Troubleshoot Azure HPC Batch (Complete Fix Guide)
Why This Is Happening
I've seen this exact scenario dozens of times: you submit a perfectly valid Azure HPC Batch job, walk away to grab coffee, and come back to find it stuck in Active state, or worse, your compute nodes are sitting in Unusable with zero explanation in the portal. Azure HPC Batch troubleshooting is genuinely one of the harder distributed-computing problems to debug because the failure surface is enormous. You've got pool allocation, node provisioning, startup tasks, application packages, task scheduling, and inter-node communication, and any layer can silently fail.
Let me be honest: Azure's error messages are often maddening. "The node is currently unusable" tells you nothing. "Pool resize failed" gives no root cause. The portal hides half the relevant state behind nested blades that most engineers never find. That's exactly why I wrote this guide, to give you the full troubleshooting chain that Microsoft support would walk through internally.
The most common root causes I encounter for Azure HPC Batch job failures and pool errors fall into five buckets:
- Quota exhaustion: Your subscription's regional vCPU quota or Batch account core quota is maxed out. The pool resize stalls, and the block is easy to miss unless you inspect the pool's resizeErrors field.
- Startup task failures: Your pool startup task script exits with a non-zero code, which marks every node as unusable before a single job task runs.
- Application package deployment failures: A referenced package version doesn't exist, the storage account SAS token expired, or the package ZIP is malformed.
- VM SKU unavailability: The H-series or N-series SKU you requested isn't available in your target region at that moment (this is especially common for GPU nodes and high-performance compute SKUs like HB120rs_v3).
- Inter-node communication misconfiguration: MPI workloads on multi-instance tasks fail because the subnet NSG is blocking required ports, or the pool isn't configured with inter-node communication enabled.
This guide covers Azure Batch job scheduling errors, HPC workload task failures, pool allocation problems, and the deeper diagnostic paths that surface what's actually broken. Whether you're running MPI jobs on HPC Pack clusters, tightly-coupled scientific workloads, or embarrassingly parallel rendering tasks, these steps apply.
The Quick Fix: Try This First
Before you go deep, run this Azure CLI diagnostic sequence. It covers the top three causes of Azure HPC Batch job stuck in active state and pool failures in under five minutes.
Open Azure Cloud Shell or your local terminal with the Azure CLI installed and authenticated. Run:
# Check your Batch account quota usage
az batch account show \
--name <your-batch-account> \
--resource-group <your-rg> \
--query "{dedicatedCoreQuota:dedicatedCoreQuota, dedicatedCoreQuotaPerVMFamily:dedicatedCoreQuotaPerVMFamily}"
# Check pool state and last resize error
az batch pool show \
--pool-id <your-pool-id> \
--account-name <your-batch-account> \
--account-endpoint https://<your-batch-account>.<region>.batch.azure.com \
--query "{state:state, resizeErrors:resizeErrors, currentDedicatedNodes:currentDedicatedNodes, targetDedicatedNodes:targetDedicatedNodes}"
# Check job state and recent task failures
az batch job show \
--job-id <your-job-id> \
--account-name <your-batch-account> \
--account-endpoint https://<your-batch-account>.<region>.batch.azure.com \
--query "{state:state, executionInfo:executionInfo}"
Look at the resizeErrors field. If the code is AccountCoreQuotaReached, you've hit your Batch account core quota: file a quota increase request in the Azure portal under Help + Support → New support request → Service and subscription limits (quotas). If the code is AllocationFailed, Azure could not allocate the VMs you asked for, which usually points to SKU capacity in the region rather than your quota. If instead the failure is StartTaskFailed or NodeAgentClientTimeout, which surface on individual nodes, jump straight to the startup task section below.
If the pool state is Active but currentDedicatedNodes is zero and targetDedicatedNodes is non-zero, your pool resize is silently blocked, which means the job will never dispatch tasks. That's not a job problem; it's a pool problem. Fix the pool first.
Always use --query with az batch commands. The raw JSON output for pool objects is hundreds of lines long and the signal gets buried. The field that saves the most debugging time is resizeErrors[].code combined with resizeErrors[].message. I've watched engineers spend two hours in the portal on something that field reveals in ten seconds.
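If you'd rather script that check than eyeball JSON, here's a minimal Python sketch. It assumes you've captured the output of az batch pool show as JSON and parsed it into a dict. The function name diagnose_pool and the hint wording are my own illustration, not part of any Azure SDK, and the hint table only covers the two resize error codes discussed in this guide.

```python
import json

# Hints for the resize error codes this guide discusses (illustrative, not exhaustive).
RESIZE_ERROR_HINTS = {
    "AccountCoreQuotaReached": "Batch account core quota is exhausted; request an increase.",
    "AllocationFailed": "Azure could not allocate the VMs; check SKU capacity in the region.",
}


def diagnose_pool(pool):
    """Return a list of findings for one pool object parsed from az batch pool show JSON."""
    findings = []
    for err in pool.get("resizeErrors") or []:
        code = err.get("code", "Unknown")
        hint = RESIZE_ERROR_HINTS.get(code, "Unrecognized code; read the message text.")
        findings.append(f"{code}: {hint}")
    current = pool.get("currentDedicatedNodes") or 0
    target = pool.get("targetDedicatedNodes") or 0
    if current < target:
        findings.append(f"Only {current} of {target} dedicated nodes allocated; fix the pool first.")
    return findings


# Hypothetical sample shaped like the --query output from the quick-fix commands.
sample = json.loads(
    '{"state": "active", "currentDedicatedNodes": 0, "targetDedicatedNodes": 4,'
    ' "resizeErrors": [{"code": "AccountCoreQuotaReached", "message": "quota reached"}]}'
)
for finding in diagnose_pool(sample):
    print(finding)
```

A healthy pool (no resize errors, current equals target) returns an empty list, which makes the function easy to drop into a pre-submission check.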
Pool allocation failure is the number-one cause of Azure Batch compute node unusable state and jobs that never run. Here's how to nail the exact cause.
In the Azure portal, navigate to your Batch account, click Pools in the left menu, select your pool, then click Resize history. You'll see a log of every scale operation with a status. Failed entries show an error code inline.
The two most common error codes here:
- AllocationFailed / OverconstrainedAllocationRequest: The VM SKU you specified (e.g., Standard_HB120rs_v3) is not available in enough quantity in your region right now. The fix is to use a different region, use spot/low-priority nodes, or wait and retry.
- AccountCoreQuotaReached: Your Batch account dedicated core quota is exhausted. Navigate to your Batch account → Quotas blade to see current usage vs. limit.
To request a quota increase via CLI:
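# Requires the Azure CLI support extension. The --problem-classification ID below is a
# placeholder: look up the real one with az support services problem-classifications list.
# Quota tickets may also need the --quota-change-* parameters.
az support tickets create \
--ticket-name "Batch-Core-Quota-Increase-$(date +%Y%m%d)" \
--title "Increase Azure Batch dedicated core quota" \
--description "Need dedicated core quota increase for HPC workloads in East US" \
--problem-classification "/providers/Microsoft.Support/services/batch/problemClassifications/quota" \
--severity "moderate" \
--contact-first-name "Your" \
--contact-last-name "Name" \
--contact-email "you@company.com" \
--contact-country "USA" \
--contact-language "en-US" \
--contact-method "email" \
--contact-timezone "Pacific Standard Time"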
If the SKU is the issue, check regional availability:
az vm list-skus \
--location eastus \
--size Standard_HB \
--output table
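Look for NotAvailableForSubscription in the Restrictions column. If your target SKU shows that, try westus2 or northeurope; HPC SKUs have uneven availability. Keep in mind that a pool lives in its Batch account's region, so switching regions means using a Batch account there. If the SKU is unrestricted elsewhere, or a different VM size is available where you are, create a new pool with that configuration, because the VM size of an existing pool cannot be changed.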
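The table output gets long when you check several SKU families at once, so here's a small Python sketch that filters the JSON form of the same command. It assumes the output of az vm list-skus --output json, where each SKU carries a restrictions list whose entries have type and reasonCode fields. The function name restricted_skus and the sample data are hypothetical.

```python
import json


def restricted_skus(skus, reason="NotAvailableForSubscription"):
    """Map each SKU name to the restriction types that carry the given reasonCode."""
    blocked = {}
    for sku in skus:
        types = [
            r.get("type")
            for r in sku.get("restrictions") or []
            if r.get("reasonCode") == reason
        ]
        if types:
            blocked[sku["name"]] = types
    return blocked


# Hypothetical, trimmed sample of az vm list-skus --output json.
sample = json.loads("""
[
  {"name": "Standard_HB120rs_v3",
   "restrictions": [{"type": "Location", "reasonCode": "NotAvailableForSubscription"}]},
  {"name": "Standard_HB120rs_v2", "restrictions": []}
]
""")
print(restricted_skus(sample))
```

Any SKU missing from the returned dict has no restriction with that reason code in the region you queried.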
Success looks like: resizeErrors is empty, currentDedicatedNodes equals targetDedicatedNodes, and pool state is Active with nodes in Idle state.
If your nodes are showing up but sitting in Unusable state, specifically with a StartTaskFailed reason, the startup task script is crashing. This is incredibly common with HPC Batch setups because startup tasks often install MPI runtimes, CUDA drivers, or custom software, and any of those can fail silently on a fresh node image.
First, get the exact failure output. In the portal: Batch account → Pools → your pool → Nodes → click a failed node → Files on node. Navigate to startup\stderr.txt and startup\stdout.txt. That's where your startup script's output lives. Download both.
Via CLI:
# List files on a failed node
az batch node file list \
--pool-id <pool-id> \
--node-id <node-id> \
--account-name <account-name> \
--account-endpoint https://<account>.<region>.batch.azure.com \
--recursive true \
--query "[?name=='startup/stderr.txt' || name=='startup/stdout.txt']"
# Download the stderr log
az batch node file download \
--pool-id <pool-id> \
--node-id <node-id> \
--file-path startup/stderr.txt \
--destination ./node-stderr.txt \
--account-name <account-name> \
--account-endpoint https://<account>.<region>.batch.azure.com
Common startup task errors I see:
- "command not found": Your script references a tool not on the base image (e.g., module load mpi works on HPC Pack but not vanilla Ubuntu).
- "Permission denied": The task is running as a non-admin user but trying to install packages. Set elevationLevel to admin in your pool startup task definition.
- "apt-get: unable to lock /var/lib/dpkg/lock": Race condition with the OS background package manager. Add a sleep 30 && apt-get -y update at the top of your script.
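When you're triaging stderr from many nodes, a quick pattern scan beats reading each file by hand. The sketch below is a hedged illustration in Python: it only knows the three error signatures listed above, and the names STARTUP_PATTERNS and triage_startup_stderr are my own, not anything Batch provides.

```python
import re

# (pattern, suggestion) pairs for the three startup task errors listed above.
STARTUP_PATTERNS = [
    (re.compile(r"command not found"), "tool missing from the base image"),
    (re.compile(r"permission denied", re.IGNORECASE), "run the start task with elevationLevel admin"),
    (re.compile(r"/var/lib/dpkg/lock"), "package manager lock race; wait and retry apt-get"),
]


def triage_startup_stderr(text):
    """Return (line_number, suggestion) for every stderr line matching a known pattern."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for pattern, suggestion in STARTUP_PATTERNS:
            if pattern.search(line):
                hits.append((number, suggestion))
                break  # one suggestion per line is enough
    return hits


# Hypothetical stderr content, as downloaded from startup/stderr.txt.
sample_stderr = (
    "installing dependencies\n"
    "./setup.sh: line 4: module: command not found\n"
    "mkdir: cannot create directory '/opt/app': Permission denied\n"
)
for line_number, suggestion in triage_startup_stderr(sample_stderr):
    print(f"line {line_number}: {suggestion}")
```

Lines that match none of the patterns are skipped, so an empty result means you still need to read the log yourself.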
After fixing the script, reimage the nodes:
az batch node reimage \
--pool-id <pool-id> \
--node-id <node-id> \
--account-name <account-name> \
--account-endpoint https://<account>.<region>.batch.azure.com \
--node-reimage-option requeue
Success: the node transitions from Reimaging → Starting → WaitingForStartTask → Idle. If it goes back to Unusable, the stderr log has more detail to chase.
Azure Batch application package deployment failure is one of those issues that shows up as a generic node error but is actually a storage or packaging problem. Batch downloads application packages from your linked Azure Storage account at node startup. If that download fails, the node goes unusable, and the error message in the portal is often just "node unusable" with no package-specific detail.
Check the node's agent log. It's at agent-logs\batch_agent.log on the node files browser. Search for lines containing ApplicationPackage or download. You'll see entries like:
[ERROR] Failed to download application package 'myapp' version '1.2.0': BlobNotFound (404)
[ERROR] SAS token for storage container expired at 2026-03-15T12:00:00Z
The three fixes, in order of likelihood:
1. Missing or wrong package version. In the portal: Batch account → Applications → find your application → verify the exact version string. Your pool definition must reference the exact version. A typo ("1.2.0" vs "1.20") will cause a 404 every time.
2. Expired storage account SAS token. Batch auto-generates SAS tokens for its linked storage account, but if you've manually linked a storage account and something changed (account key rotation, policy change), tokens can expire. Re-link the storage account:
az batch account set \
--name <batch-account> \
--resource-group <rg> \
--storage-account <storage-account-name>
3. Malformed application package ZIP. Batch expects the ZIP to extract cleanly. If your build pipeline created a nested ZIP (a ZIP inside a ZIP) or left macOS __MACOSX directories in the archive, extraction can fail. Test locally:
unzip -t myapp-1.2.0.zip
Fix the packaging, upload a new version, increment the version number (you can't overwrite an existing version, Batch locks them), and update your pool definition to reference the new version. Then reimage affected nodes.
Success: node agent log shows Successfully downloaded application package 'myapp' version '1.2.1' and nodes transition to Idle.
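If you want to catch packaging problems in your build pipeline instead of on a node, here's a minimal Python sketch using only the standard library zipfile module. It checks for the two issues mentioned above (nested ZIPs and __MACOSX entries) plus corrupt members. The function name package_problems is hypothetical, and treating every .zip member as a packaging mistake is my assumption; adjust it if your application legitimately ships archives.

```python
import io
import zipfile


def package_problems(zip_bytes):
    """Return a list of problems found in an application package ZIP (empty if clean)."""
    problems = []
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        corrupt = zf.testzip()  # name of the first member failing its CRC check, or None
        if corrupt is not None:
            problems.append(f"corrupt member: {corrupt}")
        for name in zf.namelist():
            if name.startswith("__MACOSX/"):
                problems.append(f"macOS metadata entry: {name}")
            elif name.lower().endswith(".zip"):
                problems.append(f"nested archive: {name}")
    return problems


# Build a small in-memory package to show the check in action.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as demo:
    demo.writestr("bin/app", b"binary goes here")
    demo.writestr("__MACOSX/._app", b"resource fork")
print(package_problems(buffer.getvalue()))
```

In a real pipeline you would read the package from disk, for example with open("myapp-1.2.0.zip", "rb"), and fail the build when the list is not empty.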
Your pool is healthy, nodes are idle, job is active, but tasks keep failing with exit code 1, or -1073741502 (a Windows process crash code), or 137 (Linux OOM kill). Azure HPC Batch task exit code errors need a different diagnostic path than pool or node errors.
Start by listing failed tasks and their exit codes:
az batch task list \
--job-id <job-id> \
--account-name <account-name> \
--account-endpoint https://<account>.<region>.batch.azure.com \
--filter "state eq 'completed'" \
--query "[?executionInfo.exitCode != \`0\`].{id:id,exitCode:executionInfo.exitCode,failureInfo:executionInfo.failureInfo}" \
--output table
Then pull the stdout and stderr for a specific failed task:
az batch task file download \
--job-id <job-id> \
--task-id <task-id> \
--file-path stderr.txt \
--destination ./task-stderr.txt \
--account-name <account-name> \
--account-endpoint https://<account>.<region>.batch.azure.com
Interpret common exit codes:
- Exit code 1: Generic script/application error. Read stderr; it's almost always a missing dependency, wrong path, or invalid argument.
- Exit code 137: The process was killed with SIGKILL (128 + 9), which on Linux usually means the OOM killer. Your task is consuming more memory than the node SKU provides. Either increase the VM size or reduce per-task memory usage. Check /var/log/syslog on the node for Out of memory: Kill process lines.
- Exit code -1073741502 (0xC0000142): Windows DLL initialization failure. A dependent library is missing. Check that your application package includes all required DLLs, or install the Visual C++ Redistributable in your startup task.
- Exit code 2 on MPI tasks: Usually a rank 0 connectivity issue. Verify inter-node communication is enabled on the pool and check NSG rules (see Step 5).
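The arithmetic behind those codes is worth seeing once. This Python sketch decodes an exit code the way the list above does by hand: Linux codes above 128 are 128 plus a signal number, and negative Windows codes are NTSTATUS values stored as signed 32-bit integers. The function name explain_exit_code and the small signal table are my own illustration, not a Batch API.

```python
# Common POSIX signal numbers; a shell-style exit code above 128 means 128 + signal.
SIGNAL_NAMES = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_exit_code(code, windows=False):
    """Return a short description of a Batch task exit code."""
    if code == 0:
        return "success"
    if windows:
        if code < 0:
            # Windows reports NTSTATUS failures as signed 32-bit integers;
            # masking recovers the familiar unsigned hex form.
            return f"NTSTATUS 0x{code & 0xFFFFFFFF:08X}"
        return f"application error (exit code {code})"
    if code > 128:
        number = code - 128
        return f"killed by {SIGNAL_NAMES.get(number, 'signal ' + str(number))}"
    return f"application error (exit code {code})"


print(explain_exit_code(137))
print(explain_exit_code(-1073741502, windows=True))
```

The second call prints the hex form 0xC0000142, which is the value you would search for in Microsoft's NTSTATUS documentation.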
For multi-instance MPI tasks specifically, check the coordination command output, it runs before the actual task command and sets up the MPI environment. It has its own stdout/stderr files on each node under the task directory.
Success: tasks complete with exit code 0 and show state Completed in az batch task list.
This one bites almost every team running tightly-coupled HPC Batch workloads. MPI jobs require direct TCP/UDP communication between nodes on specific ports, and a locked-down NSG on the pool subnet can block it. Azure's default NSG rules allow traffic within the virtual network, so the usual culprit is a custom deny rule or a missing allow rule. The symptom: MPI hangs indefinitely, rank 0 times out waiting for other ranks, and you see connection refused errors in the task stderr.
First, verify your pool has inter-node communication enabled:
az batch pool show \
--pool-id <pool-id> \
--account-name <account-name> \
--account-endpoint https://<account>.<region>.batch.azure.com \
--query "enableInterNodeCommunication"
If this returns false, you must recreate the pool with the --enable-inter-node-communication flag. You cannot toggle this on an existing pool.
Next, check the VNet NSG attached to your Batch pool subnet. Batch requires these inbound rules to be open within the subnet's CIDR range:
# Required NSG rules for Azure Batch VNet pools (port ranges)
# Batch service management traffic
Source: BatchNodeManagement.<region> Port: 29876-29877 Protocol: TCP Action: Allow
# Node-to-node inter-node communication (MPI)
Source: VirtualNetwork Port: Any Protocol: Any Action: Allow
# Required outbound rules
Destination: Storage Port: 443 Protocol: TCP Action: Allow
Destination: BatchNodeManagement Port: 443 Protocol: TCP Action: Allow
Add the missing rules via CLI:
az network nsg rule create \
--resource-group <rg> \
--nsg-name <nsg-name> \
--name AllowBatchNodeManagementInbound \
--priority 100 \
--direction Inbound \
--source-address-prefixes BatchNodeManagement.eastus \
--destination-port-ranges 29876 29877 \
--protocol Tcp \
--access Allow
az network nsg rule create \
--resource-group <rg> \
--nsg-name <nsg-name> \
--name AllowVNetInterNode \
--priority 110 \
--direction Inbound \
--source-address-prefixes VirtualNetwork \
--destination-port-ranges '*' \
--protocol '*' \
--access Allow
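Priority numbers matter here because NSG rules are evaluated from the lowest priority number upward and the first match wins. The Python sketch below is a deliberately simplified model of that behavior so you can reason about rule ordering. It assumes rule dicts shaped like az network nsg rule list output but only handles the single-valued fields (sourceAddressPrefix, destinationPortRange), compares service tags as plain strings, and ignores destination prefixes. The function names are hypothetical.

```python
def port_matches(spec, port):
    """True if one destinationPortRange value ('*', '443', '29876-29877') covers port."""
    if spec == "*":
        return True
    low, _, high = spec.partition("-")
    return int(low) <= port <= int(high or low)


def first_matching_rule(rules, direction, source, port, protocol):
    """Return the matching rule with the lowest priority number, or None."""
    for rule in sorted(rules, key=lambda r: r["priority"]):
        if rule["direction"].lower() != direction.lower():
            continue
        if rule["sourceAddressPrefix"] not in ("*", source):
            continue
        if rule["protocol"] != "*" and rule["protocol"].lower() != protocol.lower():
            continue
        if port_matches(rule["destinationPortRange"], port):
            return rule
    return None


# Hypothetical rule set: a custom deny-all plus the inter-node allow rule from above.
rules = [
    {"name": "DenyAllInbound", "priority": 200, "direction": "Inbound", "access": "Deny",
     "sourceAddressPrefix": "*", "destinationPortRange": "*", "protocol": "*"},
    {"name": "AllowVNetInterNode", "priority": 110, "direction": "Inbound", "access": "Allow",
     "sourceAddressPrefix": "VirtualNetwork", "destinationPortRange": "*", "protocol": "*"},
]
hit = first_matching_rule(rules, "Inbound", "VirtualNetwork", 5000, "Tcp")
print(hit["name"], hit["access"])
```

If you delete the allow rule from the list, the same flow falls through to DenyAllInbound, which is exactly the failure mode that makes MPI ranks hang.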
Also verify your MPI library matches across all nodes. OpenMPI 4.x and MPICH 3.x have different process manager ports. If you're using InfiniBand-accelerated SKUs (HB, HC, HBv3), confirm that the ib0 interface is coming up in your startup task: run ibstatus or check /sys/class/infiniband/ to verify the IB fabric is active.
Success: MPI ranks connect, the coordination command completes without timeout, and task stdout shows all ranks reporting in (e.g., Hello from rank 0 of 64).
Advanced Troubleshooting
When the five steps above don't resolve it, you're dealing with something deeper. Here's what I reach for in those cases.
Azure Monitor Diagnostic Logs for Batch
Enable diagnostic settings on your Batch account to ship logs to a Log Analytics workspace. Navigate to Batch account → Diagnostic settings → Add diagnostic setting. Enable ServiceLog and AllMetrics. Send to a Log Analytics workspace.
Once logs are flowing (takes about 5 minutes), query in Log Analytics:
// Find all pool resize errors in the last 24 hours
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.BATCH"
| where Category == "ServiceLog"
| where OperationName contains "PoolResize"
| where ResultType == "Failed"
| project TimeGenerated, OperationName, ResultDescription, properties_s
| order by TimeGenerated desc
// Task failure analysis by exit code
AzureDiagnostics
| where Category == "ServiceLog"
| where OperationName == "TaskComplete"
| extend exitCode = tostring(parse_json(properties_s).exitCode)
| where exitCode != "0"
| summarize count() by exitCode
| order by count_ desc
Event Viewer on Batch Nodes
For Windows Batch nodes (Server 2019 / 2022 base images), the Windows Event Log has critical diagnostics. RDP into a problem node (enable RDP access via the portal: Pool → Nodes → Connect → RDP) and check:
- System log: Event ID 41 (Kernel-Power, unexpected restart), Event ID 6008 (unexpected shutdown), Event ID 7001 (service failed to start)
- Application log: Filter on Error level events around the time the node went unusable
- Microsoft-Windows-TaskScheduler/Operational: Shows Batch agent task scheduling failures
Group Policy Interference (Domain-Joined Nodes)
In enterprise environments where Batch nodes are joined to an Active Directory domain, Group Policy can actively break Batch. I've seen GPOs that:
- Remove local admin rights from the Batch task user account (_azbatch)
- Deploy firewall rules that block inter-node ports
- Force proxy settings that prevent Batch agent from reaching Storage or the Batch service endpoint
- Apply disk encryption (BitLocker) that interferes with the Batch agent's working directory
To check active GPO application on a node, open PowerShell as admin and run:
gpresult /H C:\Temp\gpo-report.html /F
# Then open the HTML report and look for policies under Computer Configuration\Windows Settings\Security Settings
If GPO is the culprit, work with your AD team to create an OU for Batch nodes and exclude that OU from the conflicting policies.
Batch Node Agent Version Mismatch
The Batch node agent updates automatically, but there's a window after a new agent version ships where it can conflict with older pool configurations. Check the current agent version on a node:
az batch node show \
--pool-id <pool-id> \
--node-id <node-id> \
--account-name <account-name> \
--account-endpoint https://<account>.<region>.batch.azure.com \
--query "nodeAgentInfo"
Compare to the latest listed at the Microsoft Support Batch node agent release notes. If there's a known regression in the current version, Microsoft typically pushes a hotfix within 48 hours, but you can temporarily pin your pool to a specific VM image version as a workaround.
Escalate to Microsoft Support if: your pool allocation fails with InternalServerError (their backend problem, not yours), the Batch service endpoint is returning 5xx errors confirmed by Azure Status page, a node goes unusable with no startup task failure and empty agent logs, or you're seeing data corruption in Batch output files. Open a Severity B (business impact) support ticket with your Batch account name, pool ID, node IDs, and the time window of the failures. The faster you provide those four things, the faster they'll respond.
Prevention & Best Practices
Fixing broken HPC Batch workloads is painful. Here's how to avoid most of these problems in the first place.
Monitor quota proactively. Don't wait for an allocation failure at 2 AM. Set up an Azure Monitor alert on the CoreCount metric for your Batch account with a threshold at 80% of your quota. Alert to your team's Slack or Teams channel so you request quota increases well before you need them.
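The threshold math is simple, but it's easy to get the edge case wrong right at 80%. Here's a small Python sketch of the check you'd encode in the alert rule or in a pre-submission script. The function names and the idea of feeding in a cores-in-use figure (for example, from the CoreCount metric) are my assumptions for illustration.

```python
def quota_alert(cores_in_use, core_quota, threshold_percent=80):
    """True when usage has reached the threshold share of the quota."""
    if core_quota <= 0:
        raise ValueError("core_quota must be positive")
    # Integer arithmetic avoids floating-point edge cases right at the threshold.
    return cores_in_use * 100 >= core_quota * threshold_percent


def cores_until_alert(cores_in_use, core_quota, threshold_percent=80):
    """How many more cores can be allocated before the alert fires (0 if already firing)."""
    limit = -(-core_quota * threshold_percent // 100)  # ceiling of quota * percent / 100
    return max(0, limit - cores_in_use)


print(quota_alert(400, 500))        # usage is exactly at 80% of quota
print(cores_until_alert(300, 500))  # headroom before the alert fires
```

With a 500-core quota the alert fires once 400 cores are in use, which leaves 100 cores of headroom to file and receive a quota increase.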
Test startup tasks in isolation. Before deploying a new startup task to a production pool, create a small test pool (2 nodes) with the same VM SKU and startup task. Let it provision, verify nodes reach Idle state, then promote to production. A 10-minute pool test saves hours of production debugging.
Use pool auto-scaling with a conservative formula. Don't target 100% capacity in your autoscale formula; that leaves no headroom for Azure's allocation system. I recommend targeting 90% of your peak workload as the autoscale maximum and keeping at least 2 dedicated nodes running continuously to avoid cold-start delays.
Pin your VM image version. Instead of using the latest marketplace image (which changes over time), pin to a specific image version in your pool configuration. This makes your environment reproducible and avoids being caught by unexpected OS updates breaking your startup task.
Implement task retry logic. Set maxTaskRetryCount to at least 2 for all tasks. Transient failures (node faults, spot eviction, Azure fabric maintenance, network blips) are a reality in cloud HPC. Tasks that auto-retry handle these transparently without human intervention.
Archive task output to Blob Storage. By default, stdout/stderr are only available while the node exists. If the node is deallocated (e.g., autoscale down), you lose those logs. Configure output file upload in your task definition to ship logs to a Blob container immediately on task completion.
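To make the last two recommendations concrete, here's a hedged Python sketch that builds a task definition with both a retry count and an output file upload. The property names (constraints.maxTaskRetryCount, outputFiles, filePattern, uploadOptions.uploadCondition) follow the Batch task JSON schema. The helper name build_task, the task ID, and the container URL are placeholders I made up; a real containerUrl needs write access, for example a SAS token or a managed identity reference.

```python
import json


def build_task(task_id, command_line, container_url, max_retries=2):
    """Build a Batch task definition that retries and archives stdout/stderr."""
    return {
        "id": task_id,
        "commandLine": command_line,
        "constraints": {"maxTaskRetryCount": max_retries},
        "outputFiles": [
            {
                # stdout.txt and stderr.txt sit one level above the task working directory.
                "filePattern": "../std*.txt",
                "destination": {
                    "container": {
                        "containerUrl": container_url,
                        "path": f"{task_id}/logs",
                    }
                },
                # Upload whether the task succeeded or failed.
                "uploadOptions": {"uploadCondition": "taskcompletion"},
            }
        ],
    }


# Placeholder values for illustration only.
task = build_task(
    "task-001",
    "/bin/bash -c 'run-simulation.sh'",
    "https://<storage-account>.blob.core.windows.net/<container>?<sas-token>",
)
print(json.dumps(task, indent=2))
```

You can save the printed JSON to a file and submit it with az batch task create --job-id <job-id> --json-file task.json.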
- Set taskSlotsPerNode correctly: leaving it at 1 on a 120-core HBv3 node leaves 119 cores idle whenever your tasks are single-threaded
- Enable Batch diagnostic logs to Log Analytics before you have a problem, not after
- Use Managed Identity for Batch pools instead of storage account keys, which eliminates the SAS token expiry class of failures entirely
- Tag all Batch resources (account, pools, jobs) with environment and cost-center tags to correlate billing spikes with specific workloads
Frequently Asked Questions
Why is my Azure Batch job stuck in Active state and never running any tasks?
A job stuck in Active with zero running tasks almost always means there are no idle nodes to pick up the work. Check your pool's currentDedicatedNodes count, if it's zero or less than targetDedicatedNodes, your pool resize is blocked. The most common causes are quota exhaustion (check the resizeErrors field on the pool) or the target VM SKU being temporarily unavailable in your region. Fix the pool first, and once nodes reach Idle state, the Batch scheduler will automatically dispatch tasks. You don't need to resubmit the job.
What does "node is unusable" mean in Azure Batch and how do I fix it?
Unusable nodes have failed at some point during the provisioning sequence, either startup task failure, application package download failure, or an internal node agent error. The first thing to check is startup/stderr.txt on the node files browser, which shows exactly why the startup task crashed. If that file is empty, check agent-logs/batch_agent.log for download or initialization errors. After fixing the underlying cause (script error, missing package version, expired SAS token), reimage the node with az batch node reimage and watch it re-provision. If it goes unusable again, the error log will give you the next clue.
My MPI job hangs forever on Azure Batch, how do I debug it?
MPI hangs in Batch are almost always NSG or inter-node communication issues. First, confirm your pool has enableInterNodeCommunication: true. If not, you'll need to recreate the pool because this can't be changed after creation. Second, check your VNet NSG for rules that might be blocking inter-node TCP/UDP traffic. Specifically, ensure that traffic from the VirtualNetwork service tag is allowed on all ports within the subnet. Third, check that all nodes are on the same MPI library version and that your coordination command completes successfully. Pull the coordination command stderr from each node (not just rank 0) to see which rank is timing out first.
How do I see why a Batch task failed if the node was already deallocated?
This is why output file persistence matters so much. If the node is gone, az batch task file download will return a 404. The fix going forward is to configure output files in your task definition to upload stdout/stderr to Azure Blob Storage immediately on task completion or failure, use uploadCondition: taskCompletion or taskFailure. For past failures where logs are already gone, check if you have Azure Monitor diagnostic logs enabled for your Batch account, the ServiceLog category captures task completion events including exit codes and failure reasons, even after the node is deallocated.
Can I run Azure HPC Batch on spot/low-priority VMs and how do I handle evictions?
Yes, and it can cut your compute cost by 60–90% for fault-tolerant workloads. Set targetLowPriorityNodes instead of targetDedicatedNodes in your pool config. The catch is eviction: when Azure reclaims a spot node, Batch requeues any task that was running on it. Separately, set maxTaskRetryCount greater than 0 so that tasks exiting with a non-zero code are retried as well. Design your task logic to be idempotent: use checkpoint files in Blob Storage so a restarted task can resume mid-computation rather than starting from zero. Don't use spot nodes for MPI multi-instance tasks unless you can afford to restart the entire job on eviction, since losing one rank kills the entire MPI job.
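Here's what checkpoint-and-resume can look like in its simplest form. This Python sketch uses a local JSON file as a stand-in for a checkpoint blob, which is an assumption made to keep the example self-contained; in a real task you would download the checkpoint from Blob Storage at start and upload it after each step. The function name run_with_checkpoint and the next_index field are hypothetical.

```python
import json
import os


def run_with_checkpoint(items, checkpoint_path, process):
    """Process items in order, resuming after the last checkpointed index.

    Returns the number of items processed in this run.
    """
    start = 0
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as handle:
            start = json.load(handle)["next_index"]
    for index in range(start, len(items)):
        process(items[index])
        # Write to a temp file and rename, so an eviction mid-write
        # cannot leave a half-written checkpoint behind.
        temp_path = checkpoint_path + ".tmp"
        with open(temp_path, "w") as handle:
            json.dump({"next_index": index + 1}, handle)
        os.replace(temp_path, checkpoint_path)
    return len(items) - start
```

Note that an eviction can land after process() finishes but before the checkpoint is written, so that one item may run twice on restart. That is why the processing step itself still needs to be idempotent.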
How do I increase the Azure Batch dedicated core quota for HPC SKUs?
Quota increases for HPC SKUs (HB, HC, HBv3, NDv4) require a support ticket, they're not available through the self-service quota increase path in the portal. Go to Help + Support → New support request, select Service and subscription limits (quotas), choose your subscription, then select Batch as the quota type. In the details, specify the VM family (e.g., HBv3 Series), target region, and the number of cores you need. Include a brief justification, "running computational fluid dynamics simulations, need 5,000 HBv3 cores in East US", because the Batch team prioritizes requests with clear workload context. Response time is typically 1–3 business days.