How to Troubleshoot Azure HPC CycleCloud
Why This Is Happening
I've worked on hundreds of Azure HPC CycleCloud troubleshooting cases, and I'll tell you right now: when your cluster is broken and your HPC jobs are sitting idle, the pressure is real. Researchers miss deadlines. Simulation queues back up. Engineering teams are stuck. I get it, and this guide is going to walk you through every layer of the problem systematically.
Azure HPC CycleCloud is a powerful platform: it orchestrates entire HPC clusters, manages autoscaling, integrates with schedulers like Slurm, PBS Pro, LSF, and Grid Engine, and handles the provisioning of everything from Standard_HB120rs_v3 nodes to GPU-heavy NC-series VMs. But that power comes with complexity. When something breaks, it can break at any of a dozen different layers, and Microsoft's error messages almost never tell you which one.
Here are the most common root causes I see in real enterprise environments:
Service principal or managed identity misconfiguration. CycleCloud needs an Azure identity with Contributor or Owner rights on the subscription (or at minimum the relevant resource groups). The moment those credentials expire, rotate, or get revoked (usually because someone ran an audit sweep), the CycleCloud server can no longer provision or deallocate VMs. Nodes get stuck in "waiting" or "failed" states and the job scheduler queues pile up.
Azure Quota exhaustion. This one bites every team eventually. Your autoscale policy says "spin up 64 HB120 cores" but your subscription only has 48 vCPUs of that SKU quota remaining in that region. The provisioning silently fails, or worse, partially succeeds, creating a half-formed cluster that confuses the scheduler.
Network Security Group (NSG) rules blocking cluster communication. CycleCloud uses specific ports for node-to-server communication: TCP 9443 for the CycleCloud web API, TCP 443 for Azure Resource Manager calls, and various scheduler-specific ports (Slurm uses TCP 6817/6818/6819 for slurmctld/slurmd/slurmdbd). A single overly-restrictive NSG rule can silently sever these channels.
Storage mount failures at node startup. Azure HPC CycleCloud troubleshooting often leads here: NFS mounts that time out, BlobFuse volumes that fail to initialize because the storage account firewall changed, or Azure Files shares that error out because the SMB signing policy was updated. When the mount fails, the node's configuration script errors out, and the job scheduler marks the node unhealthy.
CycleCloud application server errors. The CycleCloud server itself is a Java application running on a Linux VM. Memory leaks, disk full conditions (especially in /opt/cycle_server/logs), or a corrupt configuration database can bring the whole management plane down.
The important thing to know: most Azure HPC CycleCloud node provisioning failures leave traces. You just have to know where to look, and this guide will show you exactly that.
The Quick Fix: Try This First
Before you go deep into logs and configuration files, run through this five-minute health check. In my experience, about 40% of Azure HPC CycleCloud troubleshooting cases resolve at this stage.
Step 1: Check the CycleCloud web interface is actually responding. Open your browser and navigate to https://<your-cyclecloud-vm-ip> on port 443. If you get a connection timeout (not an SSL error, an actual timeout), the CycleCloud server process has crashed or the VM itself is down. Go to the Azure Portal, find your CycleCloud VM, and check its status. If it shows "Stopped (deallocated)", start it. If it's running but the port isn't responding, SSH in and run the following (if systemd reports that no cyclecloud unit exists, the service is probably registered as cycle_server on your install, so substitute that name):
sudo systemctl status cyclecloud
sudo journalctl -u cyclecloud -n 100 --no-pager
Step 2: Check Azure subscription quota right now. In the Azure Portal, go to Subscriptions > [Your Subscription] > Usage + quotas. Filter by the VM family you're using (e.g., "HBv3" or "NCv3"). If your current usage is at or near the limit, quota exhaustion is your problem. You can request a quota increase directly from that page, but be aware it can take 24–48 hours to process.
Step 3: Check the CycleCloud cluster activity log. Log into the CycleCloud web UI, navigate to your cluster, and click the Activity tab. This shows provisioning events in chronological order. Look for any red "Error" entries. CycleCloud is much more verbose here than the Azure Portal activity log; it will often tell you the exact VM allocation failure reason, such as AllocationFailed, SkuNotAvailable, or QuotaExceeded.
Step 4: Validate service principal credentials. In the CycleCloud web UI, go to Settings > Cloud Providers. Check when the credentials were last validated. If there's a yellow warning icon, your service principal secret has expired. Click Validate Credentials; if it fails, you need to generate a new client secret in Entra ID (formerly Azure AD) and update it here.
Step 5: Restart the CycleCloud service. If the UI is sluggish or showing stale data, a clean restart often clears transient state issues:
sudo systemctl restart cyclecloud
# Wait 60 seconds, then check status
sudo systemctl status cyclecloud
Always check /opt/cycle_server/logs/cycle_server.log on the CycleCloud VM before opening a support ticket. The Java stack traces in that file will almost always point you to the exact failure, whether it's an ARM API call that returned 403, a database write that failed, or an out-of-memory condition. Copy the last 200 lines of that file before restarting anything, so you don't lose the evidence.
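To make that "copy before you restart" habit easy, here's a minimal sketch of a helper that saves the tail of a log to a timestamped file. The function name snapshot_log, the /tmp default, and the output file naming are my own assumptions for illustration, not something CycleCloud ships, and reading the real log may require sudo on your VM.

```shell
# snapshot_log [LOG_FILE] [OUT_DIR] [LINES]
# Hypothetical helper: copies the last LINES lines of LOG_FILE into a
# timestamped file under OUT_DIR and prints the path of that copy.
snapshot_log() {
  local log_file="${1:-/opt/cycle_server/logs/cycle_server.log}"
  local out_dir="${2:-/tmp}"
  local lines="${3:-200}"
  local out_file
  out_file="${out_dir}/$(basename "$log_file").$(date -u +%Y%m%dT%H%M%SZ).tail"
  # Fail loudly if the log cannot be read, so a restart isn't done blind.
  tail -n "$lines" "$log_file" > "$out_file" || return 1
  echo "$out_file"
}

# Example (run before any restart):
#   saved=$(snapshot_log) && echo "evidence saved to $saved"
```

Keeping the copy outside /opt/cycle_server means it survives log cleanup on that directory.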
Step 1: Check the CycleCloud Server
SSH into your CycleCloud application server VM. This is the VM where you installed CycleCloud, typically named something like cyclecloud-server or cc-mgmt-vm. Once connected, run a full health snapshot:
# Check CycleCloud service status
sudo systemctl status cyclecloud
# Check disk space, a full disk kills the DB
df -h /opt/cycle_server
# Check memory pressure
free -m
# Tail the main application log
tail -n 200 /opt/cycle_server/logs/cycle_server.log
# Check for any OOM kills in the last 24 hours
sudo dmesg | grep -i "oom\|killed process" | tail -20
If df -h shows /opt/cycle_server at 95%+ capacity, that's your culprit. CycleCloud writes extensive logs and keeps a local database (an embedded H2 database in older versions, or an external Azure Database for MySQL/PostgreSQL in newer enterprise deployments). When the disk fills, the DB can't write transactions, and the entire management plane seizes up.
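If you'd rather not eyeball the df output, the 95% check can be scripted. This is a rough sketch: parse_use_pct and over_threshold are hypothetical names I'm using for illustration, and the parser assumes the POSIX output format of df -P (one row per filesystem, usage in the fifth column).

```shell
# parse_use_pct: reads `df -P` output on stdin and prints the integer
# usage percentage of the first filesystem row (the line after the header).
parse_use_pct() {
  awk 'NR==2 { gsub(/%/, "", $5); print $5 }'
}

# over_threshold PCT [THRESHOLD]: succeeds when PCT is at or above
# THRESHOLD (defaults to the 95% figure used in this guide).
over_threshold() {
  local pct="$1"
  local threshold="${2:-95}"
  [ "$pct" -ge "$threshold" ]
}

# Example:
#   pct=$(df -P /opt/cycle_server | parse_use_pct)
#   over_threshold "$pct" && echo "disk nearly full: ${pct}%"
```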
Clean up old logs safely:
# List log files older than 30 days
find /opt/cycle_server/logs -name "*.log.*" -mtime +30 -ls
# Remove them (verify the list above first!)
find /opt/cycle_server/logs -name "*.log.*" -mtime +30 -delete
After clearing space, restart CycleCloud. Watch the log for the line CycleCloud application started; that's your signal that the server came up clean. Then navigate to the web UI. If you can log in and see your clusters, you've resolved the server layer. If the UI loads but shows clusters in error states, move to Step 2.
One more thing: if you're running CycleCloud behind an Application Gateway or Azure Load Balancer, check the health probe status in the Azure Portal. A misconfigured health probe will mark the backend pool unhealthy and block all incoming traffic even when CycleCloud itself is running fine.
Step 2: Read the Provisioning Errors
This is where most Azure HPC CycleCloud node provisioning failures get resolved. The CycleCloud Activity log is your primary diagnostic tool; it's far more detailed than the Azure Portal's own activity log.
In the CycleCloud web UI: go to Clusters > [Your Cluster] > Activity. You'll see timestamped entries for every allocation attempt. Look for entries with status Failed or Error. Click on the entry to expand the full error detail.
The most common provisioning error codes I see in Azure HPC environments:
- OverconstrainedAllocationRequest: usually means the requested VM SKU isn't available in that specific availability zone. Try spreading across zones or switching to a different region.
- SkuNotAvailable: that VM size is not offered in your target region at all. Check the Azure Products by Region page to confirm availability.
- QuotaExceeded: your vCPU quota for that VM family is exhausted. The error message will include your current limit and what was requested.
- AllocationFailed: Azure couldn't find physical capacity. This is a transient error for specialty HPC SKUs; retry after 15–30 minutes, or switch to a different region/zone.
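If you triage these often, the list above fits in a small lookup you can call from a script. This is only a sketch: suggest_action is a hypothetical helper name, and the advice strings are condensed from the list above rather than taken from any Azure tooling.

```shell
# suggest_action CODE: prints the first remediation to try for a
# provisioning error code copied from the CycleCloud Activity log.
# Returns 1 for codes this guide does not cover.
suggest_action() {
  case "$1" in
    OverconstrainedAllocationRequest)
      echo "spread across zones or switch to a different region" ;;
    SkuNotAvailable)
      echo "confirm the VM size is offered in the target region" ;;
    QuotaExceeded)
      echo "request a vCPU quota increase for the VM family" ;;
    AllocationFailed)
      echo "retry after 15-30 minutes or switch region/zone" ;;
    *)
      echo "not covered here: read the full Activity entry"
      return 1 ;;
  esac
}

# Example:
#   suggest_action QuotaExceeded
```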
You can also query Azure's own activity log from the CLI to cross-reference:
# status.value holds the operation outcome; properties.statusCode holds the HTTP status name
az monitor activity-log list \
  --resource-group <your-cluster-rg> \
  --start-time "2026-04-20T00:00:00Z" \
  --query "[?status.value=='Failed']" \
  --output table
For Slurm-based clusters specifically, check the Slurm controller log on the scheduler node:
sudo grep -iE "error|fail|node" /var/log/slurmctld.log | tail -50
If you see Node compute-[001-064] NOT responding entries in the Slurm log, that tells you the nodes provisioned but never joined the cluster, which points to a network connectivity problem between the nodes and the scheduler, not a provisioning failure. That's Step 4's territory.
Step 3: Verify Authentication and Permissions
Authentication errors are the second most common reason CycleCloud autoscale stops working. CycleCloud needs to make ARM (Azure Resource Manager) API calls to provision VMs, manage NSGs, attach disks, and assign managed identities to nodes. If the identity it's using doesn't have the right permissions, or its secret has expired, every one of those calls returns 401 or 403.
There are two authentication patterns in CycleCloud:
Service Principal (older deployments): Check the Entra ID (formerly Azure AD) app registration associated with CycleCloud. Go to Azure Portal > Microsoft Entra ID > App registrations > [Your CycleCloud App] > Certificates & secrets. Check the expiry date on the client secret. If it's expired or within 7 days, generate a new one immediately. Then update it in CycleCloud: Settings > Cloud Providers > [Your subscription] > Edit, paste the new secret, click Save, then Validate.
Managed Identity (recommended for newer deployments): If CycleCloud uses a system-assigned or user-assigned managed identity, the secret never expires, but the role assignments can be accidentally removed. Check this:
# Replace <managed-identity-client-id> with the client ID of the identity CycleCloud uses
az role assignment list \
--assignee <managed-identity-client-id> \
--all \
--output table
CycleCloud needs at minimum Contributor on the subscription or resource groups it manages. If you see only Reader, that's the problem. Fix it:
az role assignment create \
--assignee <managed-identity-client-id> \
--role "Contributor" \
--scope "/subscriptions/<subscription-id>"
After updating credentials or role assignments, always click Validate Credentials in the CycleCloud UI. It makes a test API call and tells you whether the identity can actually reach ARM. A successful validation looks like a green checkmark and the message Credentials are valid. Anything else means you still have a permission gap to close.
One subtlety that trips people up: if your organization uses Conditional Access policies in Entra ID, service principal authentication can be blocked by policies requiring MFA or device compliance, even for non-interactive logins. Check Entra ID Sign-in logs for the CycleCloud app registration if validation keeps failing despite correct credentials.
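Going back to the service principal case: the "expired or within 7 days" check on the client secret is easy to script once you have the expiry timestamp. Here's a minimal sketch. days_until and needs_rotation are hypothetical helper names, the code relies on GNU date (the -d flag), and the az command in the comment is my assumption about how you'd fetch the expiry, so verify the field name against your CLI version.

```shell
# days_until EXPIRY [NOW_EPOCH]: prints whole days from now until EXPIRY
# (an ISO 8601 timestamp). Negative or zero means it is already expiring.
days_until() {
  local expiry="$1"
  local now_epoch="${2:-$(date -u +%s)}"
  local expiry_epoch
  expiry_epoch=$(date -u -d "$expiry" +%s) || return 2
  echo $(( (expiry_epoch - now_epoch) / 86400 ))
}

# needs_rotation EXPIRY [NOW_EPOCH] [WINDOW_DAYS]: succeeds when the secret
# is expired or expires within WINDOW_DAYS (default 7, as in this guide).
needs_rotation() {
  local days
  days=$(days_until "$1" "${2:-}") || return 2
  [ "$days" -le "${3:-7}" ]
}

# Example (assumed command and field name, check before relying on it):
#   expiry=$(az ad app credential list --id <app-id> --query "[0].endDateTime" -o tsv)
#   needs_rotation "$expiry" && echo "rotate the client secret now"
```

The optional NOW_EPOCH argument exists only so the check can be exercised against a fixed date.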
Step 4: Check Storage Mounts and Network Connectivity
When CycleCloud nodes provision successfully but jobs never run, or nodes immediately fail after joining the cluster, storage mount failures are almost always the culprit. HPC workloads depend on shared storage: an NFS server, Azure Files share, Azure Managed Lustre, or BlobFuse-mounted blob container. If any of those mounts fail during the node's cluster-init phase, the node's startup script exits non-zero, and the job scheduler marks it DOWN or DRAIN.
Checking storage mount failures on a node:
# SSH into a compute node (via the CycleCloud web UI terminal or bastion)
# Check if mounts are present
mount | grep -E "nfs|cifs|blobfuse|afs"
# Check cluster-init logs, this is where startup script output lives
sudo tail -n 100 /opt/cycle/jetpack/logs/cluster-init.log
# Check systemd for failed mount units
sudo systemctl --failed
For Azure Files (SMB/CIFS) mounts, a very common failure is the storage account firewall being updated to restrict access to specific VNets or IP ranges, but the HPC subnet wasn't added to the allowed list. Go to Azure Portal > Storage Account > Networking > Firewalls and virtual networks and ensure your HPC VNet subnet is in the allowed list.
For NFS mounts, check that NSG rules allow TCP/UDP 2049 from compute nodes to the NFS server:
az network nsg rule list \
--resource-group <rg-name> \
--nsg-name <nsg-name> \
--output table
For BlobFuse2 mounts, check that the storage account allows access from the compute node's subnet, and that the managed identity assigned to compute nodes has at minimum Storage Blob Data Reader role on the storage account.
CycleCloud-to-node communication ports: Compute nodes need to reach the CycleCloud server on TCP 9443. CycleCloud needs to reach nodes on TCP 22 (SSH) for configuration. Check your NSG rules on both the CycleCloud server subnet and the compute node subnet for these rules:
# Test connectivity from a compute node to CycleCloud server
nc -zv -w 5 <cyclecloud-server-private-ip> 9443
If that times out, you have an NSG rule blocking it. Add an inbound rule on the CycleCloud server's NSG to allow TCP 9443 from the compute node subnet CIDR.
Step 5: Check the Job Scheduler
The job scheduler layer (whether that's Slurm, PBS Pro, IBM LSF, or Grid Engine) sits between CycleCloud's orchestration and your actual HPC jobs. When CycleCloud autoscale isn't working, or when nodes provision but jobs stay in pending state forever, the scheduler configuration is often the disconnect point.
For Slurm clusters (the most common in Azure HPC CycleCloud environments):
# On the Slurm controller/scheduler node:
# Check overall cluster state
sinfo -a
# Check why a job is pending
squeue -j <job-id> --format="%i %j %R %D %C"
# Check node state in detail
scontrol show node compute-0001
# Reconfigure Slurm after CycleCloud config changes
sudo scontrol reconfigure
# View slurmctld errors in real time
sudo journalctl -u slurmctld -f
If sinfo shows nodes in DOWN* state (the asterisk is important: it means the node isn't responding, not that it's intentionally drained), run:
sudo scontrol update NodeName=compute-[0001-0010] State=RESUME
This tells Slurm to try communicating with those nodes again. If they're genuinely up and healthy, they'll transition to IDLE within about 60 seconds.
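When you have dozens of nodes, picking the non-responding ones out of sinfo by eye gets old fast. Here's a small sketch of a filter for that. unresponsive_nodes is a hypothetical name, and it assumes you feed it one "node state" pair per line, for example from sinfo -N -h -o "%N %T".

```shell
# unresponsive_nodes: reads "<node> <state>" pairs on stdin and prints
# each node whose state is down* (asterisk = not responding), once.
# Plain "down" (set deliberately) and all other states are ignored.
unresponsive_nodes() {
  awk 'tolower($2) == "down*" { print $1 }' | sort -u
}

# Example:
#   sinfo -N -h -o "%N %T" | unresponsive_nodes
```

The sort -u is there because a node that belongs to several partitions shows up once per partition in node-oriented output.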
CycleCloud autoscaler integration with Slurm works through the cyclecloud-slurm plugin. If autoscale isn't firing when jobs queue up, check the autoscale log:
sudo tail -n 100 /opt/cycle/jetpack/logs/autoscale.log
Common autoscale failures: the cyclecloud_api credentials in /etc/slurm/azure.conf are stale, the CycleCloud server URL in the config is wrong (especially after an IP change), or the maximum node count defined in the CycleCloud cluster template is already reached.
For PBS Pro clusters, check job pending reasons with:
qstat -f <job-id> | grep comment
pbsnodes -a | grep -A 5 "state ="
A comment = Not Running: Node is busy message when nodes look idle usually means the PBS complex resources (like ncpus or mem) don't match what the job is requesting. Check your job's resource request against the node's reported resources.
After making any scheduler configuration changes, always verify the change actually propagated by querying the scheduler's running config; don't assume a file edit is enough. Slurm's scontrol show config and PBS's qmgr -c "p s" show the live running configuration, which is what matters.
Advanced Troubleshooting
If you've worked through all five steps and things still aren't working, you're likely dealing with a more complex enterprise-specific scenario. Here's where to dig deeper.
Azure Monitor Log Analytics. For the CycleCloud server VM, Azure Monitor can be invaluable. If you have the Log Analytics agent (or Azure Monitor Agent) installed, query for CycleCloud-related events:
// In Log Analytics workspace, Kusto Query Language
Syslog
| where Computer == "cyclecloud-server"
| where Facility == "daemon" or SyslogMessage contains "cyclecloud"
| where TimeGenerated > ago(24h)
| order by TimeGenerated desc
| take 200
For compute nodes that fail silently during startup, check Azure Boot Diagnostics on the VM. Go to Azure Portal > Virtual Machines > [Failed Node VM] > Boot diagnostics > Serial log. You'll see the exact kernel messages and cloud-init output from when the node first came up, including any mount failures, network errors, or package installation failures that occurred before SSH was even available.
Group Policy and domain-joined scenarios. In enterprise environments where HPC compute nodes are domain-joined to Active Directory, Group Policy can interfere with CycleCloud's cluster-init scripts. Specifically: GPO-enforced firewall policies can block inter-node communication, password complexity GPOs can conflict with local user creation scripts, and software restriction policies can block the installation of HPC-specific packages. Check the node's GPO application status:
# On a Linux compute node joined to AD
sudo realm list
sudo net ads gpo list
Azure Policy conflicts. Your organization's Azure Policies might be denying specific resource configurations that CycleCloud needs. Common conflicts: policies requiring specific tags on all VMs (CycleCloud doesn't add these by default), policies enforcing specific VM SKUs or disallowing certain VM sizes, or policies requiring encryption at host (which must be explicitly enabled in the CycleCloud cluster template). Check the Policy compliance blade for your resource group.
Private endpoint and DNS resolution issues. In locked-down environments where storage accounts and Key Vault use private endpoints, DNS resolution on compute nodes can fail if the private DNS zones aren't linked to the HPC VNet. Test from a compute node:
nslookup <your-storage-account>.blob.core.windows.net
# Should return a 10.x.x.x private IP, not a public IP
# If it returns a public IP, your private DNS zone isn't linked correctly
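The private-versus-public check can be automated if you want it in a node health script. The sketch below is deliberately simple: is_private_ipv4 is a hypothetical helper that matches on address prefixes only (it does not validate that the input is a well-formed IP). It accepts all three RFC 1918 ranges rather than just 10.x.x.x, since your VNet may use 172.16.0.0/12 or 192.168.0.0/16.

```shell
# is_private_ipv4 ADDRESS: succeeds when ADDRESS starts with an RFC 1918
# private prefix (10/8, 172.16/12, 192.168/16); fails otherwise.
is_private_ipv4() {
  case "$1" in
    10.*) return 0 ;;
    192.168.*) return 0 ;;
    172.1[6-9].*|172.2[0-9].*|172.3[01].*) return 0 ;;
    *) return 1 ;;
  esac
}

# Example: pass in the address that nslookup returned for the storage account
#   is_private_ipv4 "$resolved_ip" || echo "public IP: check the private DNS zone link"
```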
CycleCloud database corruption recovery. If the CycleCloud server shows a blank cluster list or throws Java NullPointerExceptions, the embedded database may be corrupt. Stop CycleCloud, take a VM snapshot (your recovery point), then:
sudo systemctl stop cyclecloud
ls -la /opt/cycle_server/data/
# Look for *.lock files, delete any stale ones
sudo find /opt/cycle_server/data -name "*.lock" -delete
sudo systemctl start cyclecloud
Have cycle_server.log and an Azure support ticket timeline ready when reaching out.
Prevention & Best Practices
After spending years on Azure HPC CycleCloud troubleshooting calls, here's what I've seen the most resilient production HPC environments do differently. These aren't theoretical best practices; they're the things that separate teams who have smooth cluster operations from the ones who are constantly firefighting.
Use managed identity instead of service principals wherever possible. Service principal secrets expire. Someone always forgets to rotate them, or the engineer who set it up leaves the company and nobody knows where the secret is documented. A system-assigned or user-assigned managed identity has no credentials to expire. If you're still on service principals, migrate. The CycleCloud documentation covers this migration path; it's a one-afternoon project that buys you years of operational peace of mind.
Set up Azure Monitor alerts on the CycleCloud VM. Create metric alerts for CPU >90% (Java GC pressure), available memory <1GB, and disk free space <20% on the data partition. Add a log alert for the string ERROR in /opt/cycle_server/logs/cycle_server.log. These alerts give you 30–60 minutes of warning before a degraded CycleCloud server becomes a fully broken one.
Pre-validate quota before scheduling large runs. Before submitting a job array that will request 500+ compute cores, run a quick quota check with the Azure CLI. Build this into your job submission wrappers:
# Matches the HBv3 family by substring; run without --query to see the exact family names
az vm list-usage \
  --location eastus \
  --query "[?contains(name.value,'HBv3')]" \
  --output table
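To turn that output into a go/no-go decision inside a submission wrapper, you only need one subtraction and one comparison. Here's a minimal sketch. quota_remaining and quota_ok are hypothetical helper names, and the commented az query (including the currentValue and limit fields and the family filter) is my assumption about how you'd feed them, so check it against your own CLI output.

```shell
# quota_remaining CURRENT LIMIT: prints the vCPUs still available.
quota_remaining() {
  echo $(( $2 - $1 ))
}

# quota_ok CURRENT LIMIT REQUESTED: succeeds only when the requested
# core count fits inside the remaining quota.
quota_ok() {
  local remaining
  remaining=$(quota_remaining "$1" "$2")
  [ "$3" -le "$remaining" ]
}

# Example (assumed query; the family filter is a placeholder):
#   set -- $(az vm list-usage --location eastus \
#     --query "[?contains(name.value,'<family-filter>')].[currentValue,limit]" -o tsv)
#   quota_ok "$1" "$2" 64 || echo "not enough quota for 64 cores"
```

With 72 vCPUs in use against a limit of 120, quota_remaining reports 48, so a 64-core request is rejected up front instead of failing partway through provisioning.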
Keep a tested recovery runbook. Document exactly how to rebuild your CycleCloud server from scratch: what config files to back up, how to restore the database, how to re-import cluster templates. Test it once a quarter. The worst time to figure out your recovery procedure is at 2 AM when a cluster is down and jobs have been queued for six hours.
Pin CycleCloud and scheduler package versions in your cluster templates. The default CycleCloud behavior installs the latest version of Slurm or PBS Pro on new clusters. A scheduler version bump can silently change behavior. Pin to a tested version in your cluster template's cluster-init spec and only upgrade deliberately after validation in a non-production cluster.
- Enable Azure Backup on the CycleCloud server VM: daily snapshots, 30-day retention, costs less than $5/month for most sizes
- Configure CycleCloud log rotation with logrotate to prevent disk fill; max 14 days of logs is plenty for most environments
- Request quota 2x your expected peak need; Azure quota increases are free and prevent QuotaExceeded failures at the worst possible moment
- Use CycleCloud's built-in Health Check feature (Settings > Health) to get a weekly automated report of credential validity, quota headroom, and cluster template compatibility with the current CycleCloud version
Frequently Asked Questions
Why are my CycleCloud compute nodes stuck in "waiting" state and never starting?
"Waiting" in the CycleCloud UI means CycleCloud has sent the VM provisioning request to Azure ARM but hasn't received confirmation that the VM has started. The most common causes are: quota exhaustion (check Subscriptions > Usage + quotas in the portal), an ARM API error due to stale service principal credentials, or a transient Azure capacity issue for specialty HPC SKUs. Check the CycleCloud Activity log for the specific error code, it will tell you exactly which of these you're hitting. For AllocationFailed errors, simply retrying after 15–30 minutes often resolves it as Azure capacity is freed in other physical clusters.
CycleCloud autoscale isn't adding nodes even though jobs are sitting in the queue, how do I fix it?
The autoscaler runs on the Slurm controller (or scheduler node) and queries CycleCloud's API to request new nodes. Three things to check. First, verify the cyclecloud-slurm service is running on the scheduler node with sudo systemctl status cyclecloud-slurm. Second, check /opt/cycle/jetpack/logs/autoscale.log for API connection errors; if the CycleCloud server's IP changed, the config in /etc/slurm/azure.conf will be wrong. Third, check that your cluster's Max Count setting in the CycleCloud UI hasn't been hit, because autoscale hard-stops at that limit. Also verify that the pending jobs are actually eligible for the configured node types by running squeue --format="%i %j %R" and looking at the reason column.
My Azure HPC CycleCloud storage mount is failing on compute nodes, what's the exact fix?
Start by checking /opt/cycle/jetpack/logs/cluster-init.log on the failing node; the exact mount command and its error output will be there. For NFS failures, the most common cause is NSG rules blocking UDP/TCP 2049 between compute nodes and the NFS server. For Azure Files (CIFS) failures, check that the storage account firewall allows the compute node subnet and that TCP 445 isn't blocked by NSG rules. For BlobFuse2 failures, verify the managed identity assigned to compute nodes has the Storage Blob Data Reader (or higher) role on the storage account. After fixing the root cause, you can manually re-run the cluster-init scripts with sudo /opt/cycle/jetpack/bin/jetpack converge without reprovisioning the node.
How do I update the Azure service principal credentials in CycleCloud without losing my cluster configurations?
Your cluster configurations are stored in CycleCloud's database, not in the service principal credentials, so updating credentials doesn't affect existing clusters. In the CycleCloud web UI go to Settings > Cloud Providers > [Your Azure subscription] > Edit. Update the Application Secret field with the new client secret you generated in Entra ID > App registrations > [App] > Certificates & secrets. Click Save, then click Validate Credentials; you should see a green success message. All running clusters will automatically use the new credentials for their next ARM API call. There is no service restart required for credential updates.
CycleCloud web interface won't load and I'm getting a "Connection refused" error on port 443, now what?
SSH into the CycleCloud server VM directly (bypassing the web UI entirely) and run sudo systemctl status cyclecloud. If it shows failed or inactive, check the full journal with sudo journalctl -u cyclecloud -n 500 --no-pager. The most common causes of the CycleCloud Java process not starting: disk full on the data partition (run df -h), a corrupt database lock file in /opt/cycle_server/data/, or a Java out-of-memory condition. Fix the underlying issue (clear disk space, remove stale .lock files), then restart with sudo systemctl start cyclecloud and watch the log for the startup confirmation message.
How do I recover a Slurm node that's stuck in "DOWN" state after a CycleCloud restart?
A node goes into DOWN state in Slurm when the slurmctld daemon loses contact with the slurmd daemon on the compute node. First confirm the node is actually running and reachable by pinging its private IP from the scheduler node. If it's up, the slurmd service probably didn't restart cleanly; SSH to the compute node and run sudo systemctl restart slurmd. Then back on the scheduler node, tell Slurm to re-accept the node: sudo scontrol update NodeName=<nodename> State=RESUME. If the node genuinely crashed and was reprovisioned by CycleCloud, the new node will self-register with Slurm within about 3–5 minutes of coming online.