Azure HPC Pack Troubleshooting: Fix Every Major Issue

Microsoft Fix Advanced 14 min read Official Docs Grounded Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why Azure HPC Pack Troubleshooting Is So Painful

I've seen this exact situation play out dozens of times in enterprise environments. You've got a high-performance computing cluster running critical workloads , scientific simulations, financial modeling, rendering pipelines , and suddenly something breaks. Jobs sit stuck in canceling mode for hours. A Linux compute node refuses to come online. The HPC Management Service won't start after you restored a backup. And the error messages Azure hands you? Usually cryptic enough to send even experienced engineers straight to a search engine.

Azure HPC Pack is a powerful product, it's Microsoft's full-featured job scheduler and cluster manager built for the Azure cloud, but it's also a genuinely complex piece of infrastructure. You're dealing with multiple services running across head nodes and compute nodes, SQL databases handling scheduler state, Azure IaaS VMs running Linux extensions, and a management layer that ties it all together. When any one of those layers misbehaves, the cascade of failures can look completely unrelated to the actual root cause.

The three issues that cause the most pain for HPC Pack administrators are: the Linux compute node agent extension failing to install, HPC jobs getting permanently stuck in canceling mode, and the HPC Management Service refusing to start after a database restore. Each one has a distinct cause, and none of them are obvious from the error text alone. The scheduler log will tell you "The scheduler server is busy" without mentioning that your Azure SQL database has hit its DTU ceiling. The management event log will throw an InstanceCacheLoadException without explaining that duplicate instance state records in your database are the culprit.

This guide covers all three of those core Azure HPC Pack issues in depth, plus the full range of supporting problems: HPC compute nodes not showing up or sitting in error state, node manager service becoming unreachable, Excel offloading jobs stalling out, and reporting database connection and permission failures. Every fix here is grounded in Microsoft's official documentation and tested against real cluster environments.

If you're dealing with any of these issues right now, you're in the right place. Browse all Microsoft fix guides →

The Quick Fix, Try This First

Before you spend an hour digging through event logs and PowerShell transcripts, run through this fast triage sequence. It catches the majority of Azure HPC Pack troubleshooting scenarios in under five minutes.

If your jobs are stuck in canceling mode: Open the Azure portal, navigate to your SQL database resource, and check the DTU consumption graph. If it's pegged at or near 100%, that's your problem right there, the head node is trying to communicate with the scheduler database and the database literally can't keep up. The fix is to scale up. Go to Configure → Pricing tier for the Azure SQL database powering your HPC scheduler, and bump the DTU to a higher SKU. Microsoft's documentation is explicit: the minimum starting DTU for the HPC scheduler database is 100 DTU. If you're running below that, or your workload has grown past what your current SKU can handle, you'll see exactly this symptom, jobs frozen in canceling, new jobs stuck in queue even with idle compute nodes sitting right there.

If the HPC Management Service won't start: The fastest diagnostic is a single SQL query against your HPC Management database. Connect to the database with SQL Server Management Studio or Azure Data Studio, then run:

SELECT instanceId, count(*) as Number 
FROM Instances 
WHERE instanceState = 2 
GROUP BY instanceId 
HAVING count(*) > 1

If that query returns any rows, you've confirmed the issue: duplicate "Current" state records are confusing the service at startup. The resolution script (covered in detail in Step 3 below) handles this automatically. Don't try to manually edit those rows, the PowerShell script does it safely.

If a Linux compute node agent extension failed to install: The extension log on the node itself is your first and best diagnostic source. Don't spend time guessing, SSH into the node and read the log directly. That's your fastest path to the actual error.

Restarting services or rebooting nodes almost never fixes any of these three issues. I've watched admins restart the head node three times while the real problem, a database DTU limit, quietly continued causing the same failure. Check the root cause first.

Pro Tip

When Azure HPC Pack jobs are stuck and you restart nodes with no improvement, resist the instinct to keep cycling services. Check the PaaS database metrics in the Azure portal first. A saturated SQL database will make every downstream symptom look like a compute or network problem when it's actually a database tier problem, and the fix takes two minutes once you know where to look.

Enable SSH on the Head Node and Access the Extension Log

When the Microsoft.HpcPack.LinuxNodeAgent2016U1 extension fails to install on a Linux compute node, your diagnostic starting point is the extension log file sitting on that node. But to get there, you first need SSH connectivity from the head node, and on a fresh Windows Server head node, SSH server capability isn't enabled by default.

Open an administrative PowerShell console on the head node and run these commands in sequence:

dism /Online /Add-Capability /CapabilityName:OpenSSH.Server~~~~0.0.1.0
Start-Service sshd
Set-Service -Name sshd -StartupType 'Automatic'
Set-Service -Name ssh-agent -StartupType 'Automatic'
Start-Service ssh-agent

The dism command adds the OpenSSH server capability to Windows. The subsequent commands start the SSH daemon and the SSH agent service, and set both to start automatically so they survive a reboot. You'll see output from DISM confirming the operation completed successfully, look for "The operation completed successfully." at the end of that command's output.

Once SSH is running on the head node, connect to the Linux compute node using its private IP address and your domain administrator credentials:

ssh <domain-administrator-name>@<private-ip-address-of-linux-compute-node>

Enter the domain administrator account password when prompted. Once you're connected to the Linux node, elevate to root and check that the extension log file exists:

sudo su ls -la /var/log/azure/Microsoft.HpcPack.LinuxNodeAgent2016U1/extension.log

If that path doesn't exist at all, it typically means the extension installation didn't even begin, which points to a networking or authentication issue between Azure and the VM rather than a configuration problem with the extension itself. If the file exists, open it with cat, less, or whatever text viewer you prefer, and scroll to the most recent entries at the bottom. The log will tell you exactly where the installation failed.

Provision a Test Linux IaaS Node to Isolate the Failure

Reading the extension log gets you the error message, but sometimes you need to rule out whether the problem is specific to one node's configuration or a broader issue with your node template. The way to do that is to spin up a fresh IaaS compute node in a controlled test and observe whether it provisions successfully.

In HPC Cluster Manager, create a new Azure IaaS node template. Work through the template creation wizard and when you reach the Specify VM Image section, configure it with these exact settings:

Field	Value
Image Type	MarketplaceImage
OS Type	Linux
Image Label	Red Hat Enterprise Linux 7.8

Save the template, then create the IaaS compute nodes using it. In the Add Node wizard, at the Specify New Nodes section, use these settings:

Field	Value
Node template	The name of the template you just created
Number of nodes	1
VM Size of nodes	A1 (1 core, 1.75 GB Memory)

Submit a new job in HPC Cluster Manager for this test node. In the Resource Selection section of the job creation wizard, find LinuxNodes in the Available node groups list, select it, and click Add to move it to the Selected node groups list. Submit the job. If the Linux node provisions correctly and the job runs, the issue is isolated to your original node's specific configuration, not the template or Azure extension delivery mechanism. If this test also fails, you're dealing with something at the infrastructure layer that needs deeper investigation.

Fix the Database Instance State to Restore HPC Management Service

This one trips up even experienced HPC Pack administrators. After restoring a corrupted HPC Management database, the HPC Management Service fails with InstanceCacheLoadException in the event log, and the full error reads: "The instance collection of IDs cannot be resolved in the current instance view." The service simply cannot initialize.

What's actually happening: the restore process left duplicate records in the Instances table of the HPC Management database. Each instance should have exactly one record with instanceState = 2 (the "Current" state). After a bad restore, some instances end up with two or three records in that Current state, and the service's startup routine can't resolve which one is authoritative, so it crashes out entirely.

The fix is a PowerShell script that automatically identifies those duplicate records and corrects their state. Save the following as FixInstanceStateError.ps1:

param (
    [Parameter(Mandatory=$true)]
    [string] $ServerInstance,
    [Parameter(Mandatory=$false)]
    [string] $Database = "HpcManagement"
)

$dupInstances = Invoke-Sqlcmd -ServerInstance $ServerInstance -Database $Database `
    -Query "SELECT instanceId, count(*) as Number FROM Instances 
            WHERE instanceState = 2 
            GROUP BY instanceId 
            HAVING count(*) > 1"

$instanceIds = $dupInstances.instanceId
$idsString = $instanceIds -join "','"

$instanceEntries = Invoke-Sqlcmd -ServerInstance $ServerInstance -Database $Database `
    -Query "SELECT * FROM Instances 
            WHERE instanceId IN ('$idsString') 
            AND instanceState = 2"

$sortedEntries = $instanceEntries | Sort-Object `
    -Property @{Expression="instanceId"; Descending=$true},
              @{Expression="instanceVersion"; Descending=$true}

$idMap = @{}
foreach ($entry in $sortedEntries) {
    if ($idMap[$entry.instanceId]) {
        Invoke-Sqlcmd -ServerInstance $ServerInstance -Database $Database `
            -Query "UPDATE Instances SET instanceState = 3 
                    WHERE instanceId = '$($entry.instanceId)' 
                    AND instanceVersion = $($entry.instanceVersion)"
    } else {
        $idMap[$entry.instanceId] = $true
    }
}

Run the script with your SQL server instance name: .\FixInstanceStateError.ps1 -ServerInstance "YourSQLServerName". After it completes, try starting the HPC Management Service again. You should see it initialize cleanly. If it still fails, run the diagnostic SQL query again to verify no duplicate records remain.

Scale Up the Azure SQL Database DTU to Unblock Stuck Jobs

When HPC jobs are stuck in canceling mode and new jobs won't transition from queued to running, even when compute nodes are sitting idle, the scheduler database is the place to look, not the compute layer. The HPC head node's scheduler log will show: "The scheduler server is busy. It can not handle the client request now. Please try again later." That message, combined with jobs frozen in state, is the signature of a DTU-saturated Azure SQL database.

Here's what's happening operationally: the HPC scheduler relies on constant read and write operations to the Azure SQL database to manage job state transitions. When that database's performance tier can't keep up with the transaction volume, it hits 100% DTU consumption, the scheduler falls behind. Cancel operations get queued but never complete. New job submissions sit in queue because the scheduler can't write the state transitions needed to assign them to nodes. Restarting the head node or compute nodes doesn't help at all because the bottleneck is in the database tier, not the compute layer.

To fix it: open the Azure portal and navigate to your Azure SQL database resource, specifically the one serving as the HPC scheduler database. In the left menu, select Configure, then choose Pricing tier (or Compute + storage depending on your Azure portal version). Check the current DTU allocation.

Microsoft's official guidance is clear: the minimum starting DTU for the HPC scheduler database is 100 DTU. If you're running at 50 DTU or below, scale up immediately. For larger clusters or high job submission rates, you may need 200 DTU or more. Scale to the next tier, apply the change (it takes effect within a few minutes without downtime on most tiers), and watch the DTU consumption graph. Once it drops below saturation, the scheduler will catch up on the pending state transitions and your stuck jobs should resolve.

After scaling, check the scheduler log to confirm the error messages stop appearing, and verify that new job submissions begin transitioning to running state normally.

Diagnose and Resolve HPC Compute Node and Service Errors

Beyond the three primary issues above, Azure HPC Pack troubleshooting regularly surfaces a cluster of related problems: compute nodes not appearing in HPC Cluster Manager, nodes stuck in error state, the HPC Node Manager service becoming unreachable, Excel offloading jobs stalling, and reporting database connectivity or permission failures. Here's how to work through each systematically.

Nodes not shown or in error state: Start by checking network connectivity between the head node and the affected compute nodes. Verify that the HPC Node Manager service is running on each compute node, if it's stopped or in a failed state, that's why the head node can't see the node. From HPC Cluster Manager, right-click the affected node and choose Bring Online. If the node transitions to an error state immediately, examine the node's event logs for service-level failures.

HPC Node Manager service unreachable: This usually comes down to firewall rules or certificate issues. Verify that the required HPC Pack communication ports aren't being blocked by Windows Firewall or a network security group. The HPC Node Manager service uses specific ports for head-node-to-compute-node communication, check that those are open in your NSG inbound rules for the compute node subnet.

Excel offloading job stalled: These jobs stall when the Excel workbook session can't establish a connection back to the compute nodes running the calculation. Verify the Excel-related HPC services are running on the head node, and that the compute nodes assigned to LinuxNodes or WindowsNodes groups have the correct capabilities registered. Check the job's detail view in HPC Cluster Manager for specific error messages on the task level.

Reporting database connection or permission problems: Connect to the HPC reporting database directly with SQL Server Management Studio and verify that the HPC service account has the correct permissions, at minimum db_datareader and db_datawriter roles on the reporting database. Connection problems often trace back to a service account password change that wasn't propagated to the HPC service configuration.

Advanced Troubleshooting for Azure HPC Pack

If the standard steps above haven't resolved your issue, it's time to go deeper. These are the techniques I use when a problem isn't behaving the way the documentation says it should.

Reading HPC Scheduler Event Logs in depth: The HPC scheduler writes detailed diagnostic information to the Windows Event Log under Applications and Services Logs → HPC Pack. Open Event Viewer on the head node, navigate to that path, and filter for Error and Warning events in the time window when the problem occurred. The InstanceCacheLoadException and the scheduler busy messages mentioned in the official docs both show up here with additional context that the standard error output omits. Pay particular attention to correlation IDs if you see them, they can link related events across the scheduler, management service, and node manager logs.

SQL query diagnostics for Management Database health: Beyond the duplicate instance state check from Step 3, you can run targeted queries to understand the broader state of your HPC Management database. If you're seeing intermittent service failures or unexpected behavior after any database maintenance operation, check for locked records or open transactions that didn't commit cleanly. Use SQL Server's sys.dm_exec_sessions and sys.dm_exec_requests views to see what's active against the database at the time of a failure.

Azure Virtual Network configuration for "No virtual network found" errors: One error that catches Azure HPC Pack administrators off guard is "No virtual network found when you enter the application ID", it appears when configuring Azure burst connectivity. This error means the service principal or application registration doesn't have visibility to the VNet resource. Verify that the application ID you're entering has the correct RBAC role assignments in Azure, specifically, it needs at least Network Contributor access on the virtual network resource, or a custom role with equivalent network read permissions. The application registration in Azure Active Directory also needs to match the tenant where the VNet lives.

Linux extension failures traced to proxy or firewall: If the extension log from Step 1 shows download or connectivity errors, the Azure VM extension delivery mechanism may be blocked. Azure delivers VM extensions through the Azure fabric, if your network security group or a corporate proxy intercepts outbound traffic from the compute node subnet, the extension download fails silently from the HPC Pack perspective. Check that the compute node subnet has outbound access to the Azure extension delivery endpoints, or configure a proxy that allows those through.

Remote database preparation for large clusters: When scaling to large HPC clusters using Azure SQL as a remote database, the preparation steps matter. The head node needs proper connection strings configured for both the scheduler database and the management database. If those strings are incorrect or the firewall rules on the Azure SQL server don't allow connections from the head node's IP, you'll see a cascade of failures that look like service errors but are actually just connection refusals at the database layer.

When to Call Microsoft Support

If you've worked through all the steps above and the HPC Management Service still won't start, jobs remain frozen despite database scaling, or Linux node agents continue failing after a clean test provisioning, it's time to escalate. Collect the HPC Pack event logs, the scheduler log files, and output from the diagnostic SQL queries before you call. That evidence dramatically speeds up the support process. You can reach Microsoft Support directly for Azure HPC Pack, it's covered under Azure support plans. For production clusters, don't spend more than two hours on a blocking issue before opening a ticket.

Prevention & Best Practices for Azure HPC Pack

I'll be honest: most of the Azure HPC Pack troubleshooting scenarios in this guide are preventable. The issues that cause the most pain, DTU saturation, corrupt database state, unreachable node managers, tend to sneak up gradually rather than hitting suddenly. Here's what I recommend building into your standard operating practice.

Set up DTU monitoring alerts before you need them. Don't wait until jobs are frozen to discover your Azure SQL database is saturated. In the Azure portal, configure a metric alert on your HPC scheduler database to fire when DTU consumption exceeds 80%. That gives you a heads-up before it hits the performance ceiling that blocks scheduler operations. Route the alert to your team's notification channel, email, SMS, or your monitoring platform of choice.

Test database restores in a staging environment, not production. The HPC Management Service failure after database restore is almost always discovered the hard way, in production, after an incident, when you need the cluster back up urgently. Run restore tests in a non-production environment on a regular schedule. Use the diagnostic SQL query from Step 3 as part of your post-restore verification checklist to confirm the instance state records are clean before you bring services back up.

Document your node template configuration. When a Linux compute node agent extension fails, one of the first questions is "what changed in the node template recently?" If nobody remembers, you lose diagnostic time. Keep your node template settings, image type, OS type, image label, VM size, in version-controlled documentation. Changes to node templates should go through a change management process, even in smaller environments.

Validate network security group rules before adding compute nodes. NSG misconfigurations are a top cause of node manager unreachability and extension installation failures. Before provisioning a new batch of compute nodes, run a connectivity test from an existing node in the same subnet to verify that HPC Pack's required ports are open and that Azure extension delivery is reachable.

Quick Wins

Set Azure SQL DTU consumption alerts at 80% threshold, catch database saturation before it freezes jobs
Run the instance state diagnostic SQL query as part of every database restore verification procedure
Keep OpenSSH server enabled on head nodes so you can reach Linux compute nodes immediately when extensions fail
Scale your HPC scheduler database to at minimum 100 DTU before going to production, the official minimum exists for a reason

Frequently Asked Questions

Why are my HPC Pack jobs stuck in canceling and won't move even after I restart the head node?

Restarting the head node won't fix this because the root cause is almost certainly your Azure SQL database hitting its DTU performance limit, not a problem with the head node services themselves. The scheduler service is dependent on that database for every state transition, including canceling operations, and when the database can't keep up with transaction volume, state changes stop completing. Open the Azure portal, navigate to your HPC scheduler SQL database, and check the DTU consumption metric. If it's at or near 100%, scale to a higher DTU tier. The minimum Microsoft recommends for the HPC scheduler database is 100 DTU, and production clusters with significant job volumes often need more than that.

The HPC Management Service event log shows "InstanceCacheLoadException", what does that actually mean and how do I fix it?

That error means the HPC Management Service found database records in a contradictory state at startup, specifically, some instances in the HpcManagement database have multiple records all claiming to be the "Current" version (instanceState = 2), when there should only ever be one. This typically happens after restoring a database backup that was captured while writes were in flight. The fix is the PowerShell script covered in Step 3 of this guide, it identifies those duplicate records and corrects the state of the extra copies so only one Current record remains per instance. After running the script, try starting the HPC Management Service again. Don't try to manually edit the database rows; the script handles the logic correctly and safely.

How do I find out why the Linux compute node agent extension failed to install?

The extension log file on the Linux compute node itself is your best source of truth, it records the full installation attempt including any errors. To access it, you need SSH connectivity from the head node. Enable OpenSSH Server on the head node using the DISM command and PowerShell service commands covered in Step 1, then SSH into the Linux compute node using its private IP address and your domain administrator credentials. Once connected, run sudo su ls -la /var/log/azure/Microsoft.HpcPack.LinuxNodeAgent2016U1/extension.log to confirm the log exists, then open it to read the failure details. Common causes include network blocks preventing extension download, authentication issues, and OS-level permission problems during the installation.

What's the minimum Azure SQL database size I need for Azure HPC Pack?

Microsoft's official documentation specifies 100 DTU as the minimum for the HPC scheduler database. Running below that threshold, on a 50 DTU or Basic tier database, for example, will cause the scheduler to hit performance limits under any meaningful job load, resulting in jobs getting stuck and the "scheduler server is busy" error appearing in the scheduler logs. For clusters with high job submission rates or large numbers of concurrent jobs, 200 DTU or more is commonly needed. Monitor your actual DTU consumption after going live and scale proactively rather than waiting for jobs to freeze.

I'm getting "No virtual network found" when I enter the application ID for Azure burst, how do I fix that?

This error appears when the service principal associated with your application ID doesn't have permission to see the virtual network resource in Azure. The application registration needs appropriate RBAC role assignments on the VNet, at minimum Network Contributor, or a custom role that includes network read permissions. Go to the Azure portal, navigate to your Virtual Network resource, select Access control (IAM), and verify that your application's service principal has a role assignment there. Also confirm that the application registration lives in the same Azure AD tenant as the subscription hosting the VNet. A tenant mismatch will produce the same "no virtual network found" error even if permissions look correct.

My HPC compute nodes are showing in error state and bringing them online doesn't help, what should I check?

Nodes stuck in error state after you've tried bringing them online from HPC Cluster Manager usually come down to one of three things: the HPC Node Manager service on the compute node is stopped or crashed, a network security group rule is blocking communication between the head node and compute node on a required HPC Pack port, or there's a certificate validation failure in the HPC Pack authentication chain between head and compute nodes. Check the compute node's Windows Event Log (or Linux system log) for HPC Node Manager service errors first, that's the quickest path to the actual cause. If the node manager service is running but the head node still can't reach it, review your NSG rules for the compute node subnet and verify the required HPC Pack communication ports are open inbound from the head node subnet.

Related Microsoft Fix Guides

Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.