Azure CycleCloud 8: Fix Setup, Config & Autoscaling Issues
Why Azure CycleCloud 8 Is Giving You Trouble
I've seen this exact situation play out on dozens of Azure HPC projects: an organization has a mature on-premises HPC environment running Slurm or PBSPro, they're told "just move it to Azure CycleCloud," and within 48 hours they're staring at a web UI that won't load, compute nodes that refuse to join the cluster, or an autoscaler that's either doing nothing or spinning up VMs at a rate that makes finance very unhappy.
The frustration is real. Azure CycleCloud 8 is genuinely powerful: it orchestrates the entire HPC stack on Azure, from virtual machines and scale sets all the way up through schedulers, parallel file systems, and authentication hosts. But that power comes with a lot of moving parts, and the error messages you get when something breaks are often about as helpful as "something went wrong."
Here's the honest picture of who runs into Azure CycleCloud 8 problems and why:
HPC admins migrating from on-premises clusters usually hit credential and subscription configuration issues first. CycleCloud is installed as a web application, either on an Azure VM or on-premises, and it needs specific Azure role assignments before it can touch your subscription's compute resources. If those IAM permissions aren't right, you'll see nodes stuck in "Acquiring" state forever while the CycleCloud logs quietly complain about authorization failures.
Users standing up a cluster for the first time typically trip over cluster template syntax. CycleCloud's declarative templating format is expressive and powerful, but it is not particularly forgiving. A misplaced bracket or an incorrectly referenced nodearray will silently produce a cluster that looks fine in the UI but falls apart the moment you try to start it.
Teams running domain-joined or enterprise environments run into a third class of problems: network rules blocking CycleCloud's agent communication, Azure Monitor integration requiring additional permissions, and autoscaling plugins failing to connect back to the CycleCloud server because of restrictive NSGs or missing private DNS entries.
Microsoft's error messages in CycleCloud 8 don't always point you to the right layer. A node that can't start might show a generic provisioning error in the UI when the real cause is a quota limit, a networking issue, or a broken cloud-init script, all completely different root causes that need completely different fixes. This guide cuts through that ambiguity.
The Quick Fix: Try This First
Before you spend three hours digging through logs, do this one check: verify that your CycleCloud service principal has the right Azure role and that the CycleCloud web application can actually reach the Azure Resource Manager API. I'd estimate this single root cause is behind more than half of all "my cluster won't start" tickets I've worked.
Open the CycleCloud web application (by default it runs on port 443 of whatever host you installed it on). Navigate to Settings → Cloud Providers and click the pencil icon next to your Azure subscription entry. You should see a green checkmark next to "Connection Status." If you see a yellow warning or red error instead, stop here: that's your problem.
To fix it, open the Azure portal and navigate to Subscriptions → [your subscription] → Access control (IAM) → Role assignments. Look for the service principal or managed identity that CycleCloud is using. It needs at minimum the Contributor role at the subscription scope (or a custom role that covers Microsoft.Compute/*, Microsoft.Network/*, and Microsoft.Storage/*). If you're using managed identity on the CycleCloud VM, also confirm that the VM's system-assigned identity is enabled under the VM's Identity blade.
Once you've confirmed or corrected the role assignment, go back to CycleCloud → Settings → Cloud Providers and click Validate. It should turn green within 30 seconds. If it does, try starting your cluster again; there's a good chance everything else starts working.
If the connection shows green but nodes still won't start, scroll down and check the cluster's Event Log (click the cluster name → Activity tab). That log is far more descriptive than the node status icons in the main view and will usually name the specific Azure API call that failed.
The first thing Azure CycleCloud 8 does when you ask it to start a cluster is make Azure API calls to create VMs, network interfaces, and scale sets. If the identity it's acting under doesn't have permission to do that, everything downstream breaks, but the error often surfaces several layers up as a vague node provisioning failure.
Start by confirming which identity CycleCloud is using. If you installed CycleCloud on an Azure VM and configured it to use a managed identity, run this from the CycleCloud VM's command line:
# Verify managed identity is active
curl -s -H "Metadata: true" \
"http://169.254.169.254/metadata/identity/oauth2/token?api-version=2018-02-01&resource=https://management.azure.com/" \
| python3 -m json.tool | grep expires_on
If that returns a token with a future expiry, managed identity is working. If it fails or returns an error, go to the Azure portal → your CycleCloud VM → Identity → System assigned, and toggle Status to On.
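If you'd rather script that check than eyeball the grep output, here is a minimal Python sketch. It assumes you have captured the raw curl response as a string, and the function name token_seconds_remaining is my own, not part of any Azure SDK. It relies only on the expires_on field, which the metadata service returns as a Unix timestamp in string form.

```python
import json
import time
from typing import Optional


def token_seconds_remaining(imds_response: str, now: Optional[float] = None) -> float:
    """Return seconds until the token expires (negative means already expired)."""
    payload = json.loads(imds_response)
    if "error" in payload:
        # The metadata service reports failures as a JSON object with an "error" key.
        raise ValueError("metadata service returned an error: %s" % payload["error"])
    expires_on = float(payload["expires_on"])  # Unix timestamp, delivered as a string
    current = time.time() if now is None else now
    return expires_on - current


# Illustrative response with a made-up timestamp and a redacted token.
sample = '{"access_token": "<redacted>", "expires_on": "1700003600", "token_type": "Bearer"}'
print(token_seconds_remaining(sample, now=1700000000))
```

A positive result means managed identity is handing out valid tokens; a ValueError or a negative number points back at the Identity blade.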
Next, assign the right role. In the Azure portal, navigate to Subscriptions → [your subscription] → Access control (IAM) → + Add → Add role assignment. Choose Contributor, select your managed identity or service principal, and save. Role assignments in Azure can take several minutes to propagate, so don't test immediately.
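To confirm the assignment from a script rather than the portal, a sketch along these lines can help. It assumes you have the JSON output of az role assignment list --all --assignee <principal-id> -o json loaded as a list of dictionaries, and that each entry carries principalId, roleDefinitionName, and scope. The function name is hypothetical, and I treat Owner as acceptable because it is a superset of Contributor.

```python
from typing import Dict, List


def has_contributor_at_subscription(
    assignments: List[Dict], principal_id: str, subscription_id: str
) -> bool:
    """True if the principal holds Contributor (or Owner) at subscription scope."""
    wanted_scope = ("/subscriptions/%s" % subscription_id).lower()
    for entry in assignments:
        scope = entry.get("scope", "").lower().rstrip("/")
        if (
            entry.get("principalId") == principal_id
            and entry.get("roleDefinitionName") in ("Contributor", "Owner")
            and scope == wanted_scope
        ):
            return True
    return False


# Made-up identifiers, for illustration only.
example = [
    {
        "principalId": "11111111-1111-1111-1111-111111111111",
        "roleDefinitionName": "Contributor",
        "scope": "/subscriptions/aaaaaaaa-0000-0000-0000-000000000000",
    }
]
print(has_contributor_at_subscription(
    example,
    "11111111-1111-1111-1111-111111111111",
    "aaaaaaaa-0000-0000-0000-000000000000",
))
```

An assignment scoped only to a resource group returns False here, which is the point: the quick fix above calls for subscription scope.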
Back in the CycleCloud UI, go to Settings → Cloud Providers → [your Azure account] → Edit → Validate. A green checkmark means CycleCloud can reach ARM and your credential setup is good. This step alone resolves the majority of Azure CycleCloud HPC cluster startup failures I've worked through.
Azure CycleCloud 8's cluster templates are written in a declarative format that describes every component of your HPC environment (scheduler nodes, execute nodes, file systems, network settings) in a single template file. The format is powerful, but small errors produce confusing behavior rather than clear syntax errors.
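If your cluster is created but shows errors on start, or if nodes are created in the wrong configuration, download your current template first. Subcommand names differ between CLI releases, so confirm this one with cyclecloud --help; if it isn't listed, work from the template file you originally imported: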
# Using the CycleCloud CLI
cyclecloud export_cluster MyClusterName -f my_cluster_template.txt --force
Open the exported template and look for these common problems. First, check that every [[node]] or [[nodearray]] block that references a cluster-init script has a valid ClusterInitSpecs entry. A reference to a project or spec that doesn't exist in your CycleCloud locker will silently prevent the node from configuring itself correctly after it boots.
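Second, verify your subnet reference. A bare subnet name is not enough: the SubnetId parameter has to identify the resource group, the virtual network, and the subnet. CycleCloud's node reference documents the three-part form shown below; if your template carries a full Azure resource ID instead (/subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>), confirm that your CycleCloud version accepts it. The documented form looks like:
SubnetId = <rg>/<vnet>/<subnet>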
Third, confirm that any MachineType values you've specified exist in your target Azure region and are available in your subscription quota. The CycleCloud CLI command cyclecloud show_cluster MyClusterName will surface most validation errors before you waste time starting a broken cluster.
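A quick way to catch the bare-subnet-name mistake before importing a template is to classify the SubnetId value up front. This is only a sketch under my own assumptions: it recognizes two shapes, a full ARM resource ID and a three-part resource-group/vnet/subnet string, and flags everything else. Which of the two your CycleCloud version accepts is something to confirm against its template reference; the function name is hypothetical.

```python
import re

# Full ARM resource ID for a subnet.
ARM_SUBNET_ID = re.compile(
    r"^/subscriptions/[^/]+/resourceGroups/[^/]+/providers/Microsoft\.Network"
    r"/virtualNetworks/[^/]+/subnets/[^/]+$",
    re.IGNORECASE,
)
# Three slash-separated parts: resource group, virtual network, subnet.
SHORT_SUBNET_ID = re.compile(r"^[^/]+/[^/]+/[^/]+$")


def classify_subnet_id(value: str) -> str:
    """Return 'arm-resource-id', 'rg/vnet/subnet', or 'invalid'."""
    value = value.strip()
    if ARM_SUBNET_ID.match(value):
        return "arm-resource-id"
    if SHORT_SUBNET_ID.match(value):
        return "rg/vnet/subnet"
    return "invalid"


for candidate in ("compute", "hpc-rg/hpc-vnet/compute"):
    print(candidate, "->", classify_subnet_id(candidate))
```

Anything that comes back as invalid, such as a lone subnet name, is worth fixing before you start the cluster.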
After fixing the template, import it back and restart the cluster:
cyclecloud import_cluster -f my_cluster_template.txt --force
cyclecloud start_cluster MyClusterName
If the cluster starts and nodes reach "Ready" state in the UI, your template is now valid.
One of the things that makes Azure CycleCloud worth using is that it ships autoscaling plugins for the major HPC schedulers: Slurm, PBSPro, LSF, Grid Engine, and HTCondor. Instead of building your own job-aware scaling logic, the plugin watches the scheduler's queue and tells CycleCloud when to add or remove nodes. But the plugin has its own configuration that breaks in specific ways.
For Slurm specifically, the autoscaling integration requires that the cyclecloud-slurm package is correctly installed on the scheduler node and that the slurm.conf file generated by CycleCloud matches the actual node names and partition configuration. If you've manually edited slurm.conf or if the CycleCloud-managed node names don't match what Slurm expects, jobs will queue indefinitely without triggering scale-out.
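The name-matching requirement is easy to check mechanically. The sketch below assumes you can read slurm.conf as text and that you supply the nodearray names from your template by hand; it pulls out every PartitionName= entry (skipping comments and Slurm's special DEFAULT entry) and reports partitions with no same-named nodearray. The helper names are mine.

```python
import re
from typing import Iterable, List, Set


def partitions_in_slurm_conf(conf_text: str) -> Set[str]:
    """Collect partition names declared with PartitionName=... lines."""
    names = set()
    for line in conf_text.splitlines():
        line = line.split("#", 1)[0]  # drop trailing comments
        match = re.match(r"\s*PartitionName=(\S+)", line, re.IGNORECASE)
        if match and match.group(1).upper() != "DEFAULT":
            names.add(match.group(1))
    return names


def unmatched_partitions(conf_text: str, nodearray_names: Iterable[str]) -> List[str]:
    """Partitions that have no nodearray with exactly the same name."""
    return sorted(partitions_in_slurm_conf(conf_text) - set(nodearray_names))


# Illustrative slurm.conf fragment and nodearray list.
conf = (
    "PartitionName=hpc Nodes=hpc-[1-4] Default=YES\n"
    "PartitionName=htc Nodes=htc-[1-8]\n"
    "# PartitionName=retired Nodes=old-[1-2]\n"
)
print(unmatched_partitions(conf, ["hpc", "execute"]))
```

An empty list means every partition has a nodearray to scale; any name that shows up is a partition whose jobs will sit in the queue.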
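Check the autoscaler logs on the scheduler node (service names and log paths differ between plugin releases, so adjust these to what is actually installed on your scheduler):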
# For Slurm
sudo journalctl -u cyclecloud_slurm -n 100 --no-pager
# For PBSPro
sudo tail -n 100 /opt/cycle/pbspro/logs/autoscale.log
Common messages that indicate autoscaling is broken:
ERROR: Could not connect to CycleCloud at https://<cc-host>
ERROR: Cluster 'MyCluster' not found or not accessible
WARNING: No nodearray found for queue 'htc'
The first two errors mean the scheduler node can't reach the CycleCloud web application. Check your NSG rules to make sure port 443 from the scheduler node's subnet to the CycleCloud VM's private IP is allowed. The third error means the Slurm partition name doesn't map to a CycleCloud nodearray name; these need to match exactly in your cluster template's [[nodearray]] section.
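When the log is long, it helps to bucket lines by the layer they point at. This sketch matches on substrings taken from the three example messages above and follows the same mapping (the first two to connectivity, the third to a name mismatch). The exact wording in your logs may differ by plugin version, so treat the SIGNATURES table as an assumption to adjust.

```python
from collections import Counter
from typing import Iterable

# Substrings from the example messages, mapped to the layer to inspect.
SIGNATURES = [
    ("could not connect to cyclecloud", "connectivity"),
    ("not found or not accessible", "connectivity"),
    ("no nodearray found for queue", "name-mismatch"),
]


def diagnose(line: str) -> str:
    """Map one log line to a likely cause, or 'other' if nothing matches."""
    lowered = line.lower()
    for needle, cause in SIGNATURES:
        if needle in lowered:
            return cause
    return "other"


def summarize(lines: Iterable[str]) -> Counter:
    """Count how many lines fall into each cause bucket."""
    return Counter(diagnose(line) for line in lines)


log = [
    "ERROR: Could not connect to CycleCloud at https://<cc-host>",
    "WARNING: No nodearray found for queue 'htc'",
    "INFO: autoscale cycle complete",
]
print(summarize(log))
```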
After fixing, restart the autoscaler: sudo systemctl restart cyclecloud_slurm. Submit a test job and watch the CycleCloud UI's Nodes tab. If autoscaling is working, new nodes should appear in "Acquiring" state within about 60 seconds.
Every compute node that CycleCloud starts runs a CycleCloud agent that maintains a connection back to the CycleCloud server. This is how CycleCloud knows the node's state, pushes configuration, and coordinates orderly shutdown. If this agent can't connect, which happens frequently in enterprise environments with strict network controls, nodes get stuck in "Preparing" state and never become usable.
The agent connects to the CycleCloud server on port 443 (HTTPS) and also uses port 9443 for a secondary management channel in CycleCloud 8. Check your Azure Network Security Groups. The NSG on the subnet where your compute nodes live needs an outbound rule allowing TCP 443 and TCP 9443 to the CycleCloud server's private IP address. If you're using Azure Firewall or a third-party NVA, add the same rules to those policies.
To test from a stuck compute node (SSH in via the scheduler node as a jump host):
# Test HTTPS connectivity to CycleCloud
curl -sk https://<cyclecloud-private-ip>/health_check
# Test port 9443
nc -zv <cyclecloud-private-ip> 9443
A successful health check returns JSON with "status": "OK". If curl times out, you have a network path issue, not a software bug.
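If nc isn't installed on the node image, the same reachability test can be done with Python's standard library. This is a minimal sketch: port_open is a hypothetical helper name, and the ports you pass in are whichever ones your CycleCloud server actually listens on.

```python
import socket


def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts alike.
        return False
```

From a stuck compute node you would call something like port_open("<cyclecloud-private-ip>", 443) and port_open("<cyclecloud-private-ip>", 9443). A False on either one points at an NSG, firewall, or routing problem rather than at CycleCloud itself.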
Also check that your Azure Virtual Network has the right DNS configuration. Compute nodes need to resolve the CycleCloud server's hostname. If you're using a custom DNS server in your VNet and the CycleCloud VM's private DNS entry isn't propagated, nodes may time out on TLS certificate validation because the hostname in the cert doesn't match what they resolved. Either use the private IP directly in CycleCloud's server URL configuration, or add a private DNS A record for the CycleCloud host.
Once you fix network connectivity, nodes that were stuck in "Preparing" will either self-heal within 5 minutes (the agent retries automatically) or you may need to terminate and re-acquire them from the CycleCloud UI's Nodes tab.
Azure CycleCloud 8 supports multiple file system types that HPC workloads commonly need: NFS exports from a dedicated NFS server node, BeeGFS parallel file systems, Azure NetApp Files, Azure Managed Lustre, and blob storage mounted via blobfuse. Each one has a distinct configuration path in CycleCloud, and mixing them up or misconfiguring the mount options causes jobs to fail with cryptic I/O errors.
The most common setup is an NFS server provided by CycleCloud itself: a dedicated VM in the cluster that exports /shared and /home to all other nodes. If compute nodes can't mount this NFS export, you'll see errors like:
mount.nfs: Connection timed out
mount.nfs: No route to host
First, confirm that the NFS server node is actually running. In the CycleCloud UI, click your cluster name → Nodes → look for a node with Role = "NFS." If it's in an error state, fix that before touching compute nodes.
Second, check that the NFS exports are correct on the server node itself:
sudo showmount -e <nfs-server-private-ip>
This should list /shared and /home with appropriate client CIDR blocks. If showmount returns nothing, the NFS server's export configuration failed, usually because the CycleCloud cluster-init scripts didn't complete successfully. Check /var/log/cluster-init/ on the NFS node for errors.
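To turn that eyeball check into something scriptable, the sketch below parses showmount -e output and reports which required exports are missing. It assumes the usual layout (a header line, then one export path per line followed by a comma-separated client list); the function names are my own.

```python
from typing import Dict, List, Sequence


def parse_exports(showmount_output: str) -> Dict[str, List[str]]:
    """Map each exported path to the list of clients allowed to mount it."""
    exports = {}
    for line in showmount_output.splitlines():
        parts = line.split()
        if parts and parts[0].startswith("/"):  # skips the "Export list for ..." header
            exports[parts[0]] = parts[1].split(",") if len(parts) > 1 else []
    return exports


def missing_exports(
    showmount_output: str, required: Sequence[str] = ("/shared", "/home")
) -> List[str]:
    """Required paths that the server is not exporting."""
    exports = parse_exports(showmount_output)
    return [path for path in required if path not in exports]


# Illustrative output with made-up addresses.
sample_output = "Export list for 10.0.0.4:\n/shared 10.0.0.0/16\n"
print(missing_exports(sample_output))
```

Any path in the result is an export the cluster-init scripts never set up, which is your cue to read the logs on the NFS node.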
For BeeGFS parallel file systems configured through CycleCloud, the critical parameter is ensuring the BeeGFS management node hostname is resolvable from all client nodes. Add a [[[configuration]]] block to your cluster template that sets beegfs.management_host to the management node's private IP rather than its hostname if you're having DNS resolution problems in your environment.
After correcting file system configuration, you can force a re-mount on a compute node without replacing it: sudo cyclecloud-agent --restart-mounts or simply restart the node through the CycleCloud UI if that command isn't available on your version.
Advanced Troubleshooting for Azure CycleCloud 8
If the step-by-step fixes above didn't solve your problem, you're likely dealing with one of the more complex enterprise-specific issues. Here's where to dig next.
CycleCloud REST API for diagnosing cluster state programmatically. CycleCloud 8 ships a full RESTful API that lets you query cluster state, node state, and event logs without going through the web UI. This is invaluable when the UI itself is hanging or when you need to script diagnostics. The API is documented in the CycleCloud reference docs and is accessible at https://<cyclecloud-host>/api/. For example, to get a list of all nodes in a cluster and their statuses:
curl -sk -u admin:<password> \
"https://<cyclecloud-host>/api/clusters/MyCluster/nodes" \
| python3 -m json.tool | grep -E '"Name"|"Status"'
This is also how you can identify nodes that are in a transitional state the UI isn't surfacing clearly: the API will show you exact state machine transitions and timestamps.
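Grepping for "Name" and "Status" works, but a short script gives you a per-status count you can alert on. Since I don't want to assume the exact response layout, this sketch walks whatever JSON comes back and collects every object that has both a Name and a Status key. Those two key names come from the grep above; everything else, including the sample payload, is illustrative.

```python
import json
from collections import Counter
from typing import Any, List, Tuple


def collect_nodes(obj: Any) -> List[Tuple[str, str]]:
    """Recursively gather (Name, Status) pairs from arbitrarily nested JSON."""
    found = []
    if isinstance(obj, dict):
        if "Name" in obj and "Status" in obj:
            found.append((obj["Name"], obj["Status"]))
        for value in obj.values():
            found.extend(collect_nodes(value))
    elif isinstance(obj, list):
        for item in obj:
            found.extend(collect_nodes(item))
    return found


def status_counts(response_text: str) -> Counter:
    """Count nodes per status in a raw API response body."""
    return Counter(status for _, status in collect_nodes(json.loads(response_text)))


# Made-up response body for illustration.
sample_response = json.dumps({"nodes": [
    {"Name": "scheduler", "Status": "Ready"},
    {"Name": "hpc-1", "Status": "Failed"},
    {"Name": "hpc-2", "Status": "Ready"},
]})
print(status_counts(sample_response))
```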
Azure quota limits causing silent provisioning failures. Azure CycleCloud's autoscaler will request as many VMs as the scheduler says are needed, but if your Azure subscription is at quota for a given VM family in a given region, the ARM API returns a quota error and CycleCloud marks the node as failed. The cluster's Activity log will show the error, but you need to know to look for it. Go to Subscriptions → [your subscription] → Usage + quotas, filter by the VM family you're using (Standard_HB, Standard_HC, Standard_ND, etc.), and confirm you have headroom. Submit a quota increase request through the Azure portal if you don't. This is one of the most common reasons Azure HPC cluster scaling suddenly stops working after a period of normal operation.
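The headroom arithmetic is simple enough to script so you can check it before a large job lands. This sketch assumes usage data shaped like the output of az vm list-usage --location <region> -o json, where each entry has name.value, currentValue, and limit. It converts the numbers with int() because some tools emit them as strings. The family name and figures in the sample are made up.

```python
from typing import Dict, List


def cores_available(usages: List[Dict], family: str) -> int:
    """Remaining vCPU quota for one VM family."""
    for entry in usages:
        if entry["name"]["value"].lower() == family.lower():
            return int(entry["limit"]) - int(entry["currentValue"])
    raise KeyError("no usage entry for %r" % family)


def can_scale(usages: List[Dict], family: str, nodes: int, cores_per_node: int) -> bool:
    """True if the requested nodes fit inside the remaining quota."""
    return nodes * cores_per_node <= cores_available(usages, family)


# Illustrative usage entry: 240 of 360 vCPUs already consumed.
sample_usage = [
    {"name": {"value": "standardHBv3Family"}, "currentValue": "240", "limit": "360"},
]
print(cores_available(sample_usage, "standardHBv3Family"))
print(can_scale(sample_usage, "standardHBv3Family", nodes=2, cores_per_node=120))
```

With those numbers there are 120 vCPUs left, so one 120-core node fits and two do not, which is exactly the situation where scaling appears to stop for no reason.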
Event Viewer on the CycleCloud server VM. If you installed CycleCloud on a Windows Azure VM (less common, but supported), check the Windows Event Viewer under Applications and Services Logs → CycleCloud. On Linux VMs, the CycleCloud service writes to /opt/cycle_server/logs/cycle_server.log. Entries at the ERROR or FATAL level here usually indicate database corruption, certificate issues, or Java heap exhaustion. If you see java.lang.OutOfMemoryError in those logs, the CycleCloud VM is undersized for your cluster count. The recommended minimum for production Azure CycleCloud deployments managing more than 5 clusters is a Standard_D4s_v5 or larger.
Domain-joined node configuration. Enterprise customers who need compute nodes joined to an Active Directory domain will often find that the CycleCloud-managed cloud-init sequence conflicts with domain join scripts. CycleCloud runs its initialization (installing the agent, running cluster-init specs) before your domain join completes, which can cause timing issues where the node registers with CycleCloud using a hostname the domain doesn't yet know about. The solution is to use CycleCloud's chef_timing configuration in your cluster template to defer certain initialization steps until after domain join succeeds, or to structure your domain join as an early cluster-init spec that blocks later specs until it completes successfully.
If you've verified permissions, network connectivity, cluster template syntax, file system configuration, and quota, and nodes are still failing to start or the CycleCloud web application itself is throwing 500 errors, it's time to escalate. Collect the following before you call: your CycleCloud version number (Settings → About), the full contents of /opt/cycle_server/logs/cycle_server.log (last 2,000 lines), the cluster Activity log exported from the UI, and the output of cyclecloud show_cluster MyClusterName. Open a support case at Microsoft Support under the Azure category and select CycleCloud as the product. Having those logs ready will cut your resolution time significantly.
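Grabbing the last 2,000 lines of a large log without loading the whole file into memory takes only a few lines of Python. This is a minimal sketch; tail_lines is a hypothetical helper, and the path you pass is whichever log you need to attach.

```python
from collections import deque
from typing import List


def tail_lines(path: str, count: int = 2000) -> List[str]:
    """Return the last `count` lines of a text file, reading it as a stream."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # A bounded deque keeps only the most recent `count` lines as it reads.
        return list(deque(handle, maxlen=count))
```

Calling tail_lines("/opt/cycle_server/logs/cycle_server.log") and writing the result to a new file gives you something small enough to attach to the support case.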
Prevention & Best Practices for Azure CycleCloud 8
I want to be direct here: most of the Azure CycleCloud 8 problems I've seen in production were entirely preventable. The platform is solid when it's set up with the right foundations. Here's how to make sure you're not the person filing an emergency support ticket at 2am because your HPC cluster won't start.
Size your CycleCloud VM correctly from the start. The CycleCloud web application runs a Java-based server with an embedded database. For small test clusters (fewer than 50 nodes total), a Standard_D2s_v5 is fine. For production workloads with hundreds of compute nodes across multiple clusters, start at Standard_D8s_v5 and monitor memory usage in Azure Monitor. Running out of memory causes CycleCloud to drop node state, which produces mysterious failures that are hard to diagnose after the fact.
Use managed identity instead of service principal secrets. Service principal secrets expire. When they do, every cluster stops dead. Managed identity on the CycleCloud VM eliminates that failure mode entirely: Azure handles the credential rotation automatically, so it never expires on you at the worst possible moment. If your CycleCloud installation is on an Azure VM, switch to managed identity now before it becomes an incident.
Pin your cluster templates to a specific CycleCloud project version. In your cluster template's [[node defaults]] section, always specify an explicit version for each project reference rather than using latest. Using latest means a project update can change node behavior without warning the next time a node is acquired. In an HPC environment where job reproducibility matters, that's a serious operational risk.
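A template lint for unpinned references can run in CI next to your Git-tracked templates. The sketch below rests on an assumption you should verify against your own templates: that project references appear as cluster-init section headers of the form project:spec:version inside triple brackets. It flags any reference whose last component is missing or isn't a dotted numeric version. The function name is hypothetical.

```python
import re
from typing import List

# Assumed header shape: [[[cluster-init project:spec:version]]]
CLUSTER_INIT = re.compile(r"\[\[\[\s*cluster-init\s+([^\]\s]+)\s*\]\]\]", re.IGNORECASE)
VERSION = re.compile(r"^\d+(\.\d+)*$")


def unpinned_specs(template_text: str) -> List[str]:
    """Return cluster-init references that lack an explicit numeric version."""
    unpinned = []
    for reference in CLUSTER_INIT.findall(template_text):
        parts = reference.split(":")
        if len(parts) < 3 or not VERSION.match(parts[-1]):
            unpinned.append(reference)
    return unpinned


# Illustrative template fragment with made-up project names.
template = (
    "[[node defaults]]\n"
    "    [[[cluster-init myproject:default:1.2.0]]]\n"
    "    [[[cluster-init otherproject:default]]]\n"
    "    [[[cluster-init thirdproject:default:latest]]]\n"
)
print(unpinned_specs(template))
```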
Test autoscaling with small node counts before production jobs. After initial setup, submit a job that requires exactly two nodes and watch the CycleCloud UI in real time. Confirm nodes go through the full lifecycle: Acquiring → Preparing → Ready → the job runs → nodes terminate when idle. This end-to-end test takes 10 minutes and will catch 80% of configuration problems before real production workloads expose them in the worst way.
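If you record the states a test node passes through (for example by polling the UI or API every few seconds), a few lines of Python can confirm the order. The sketch checks that Acquiring, Preparing, and Ready appear in that order within the observed sequence, tolerating repeats and extra states in between. The state names come from the lifecycle above; the sample observations are made up.

```python
from typing import Iterable, Sequence

EXPECTED_ORDER = ("Acquiring", "Preparing", "Ready")


def follows_lifecycle(
    observed: Iterable[str], expected: Sequence[str] = EXPECTED_ORDER
) -> bool:
    """True if every expected state appears in order within the observed states."""
    remaining = iter(observed)
    # `state in remaining` advances the iterator, so order is enforced.
    return all(state in remaining for state in expected)


healthy = ["Off", "Acquiring", "Acquiring", "Preparing", "Ready", "Terminating"]
stuck = ["Off", "Acquiring", "Acquiring", "Acquiring"]
print(follows_lifecycle(healthy), follows_lifecycle(stuck))
```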
Set up Azure Monitor integration early. CycleCloud integrates with Azure Monitor to expose cluster and node metrics. Turn this on during initial setup, not after something breaks. Having a week of baseline performance data is invaluable when you need to prove to your team that a slowdown is caused by a noisy neighbor VM problem rather than your cluster configuration. It also gives you alerting on quota exhaustion before it hits zero.
- Switch from service principal secrets to managed identity on your CycleCloud VM; this eliminates the entire class of "credential expired" incidents
- Export and version-control your cluster templates in Git; treat them like infrastructure code, because that's what they are
- Set idle node termination timeouts conservatively (15–30 minutes) during initial setup; you can tighten them once you understand your job submission patterns
- Enable CycleCloud's built-in cost reporting integration with Microsoft Cost Management to catch runaway autoscaling before it becomes a budget problem
Frequently Asked Questions
What is Azure CycleCloud 8 and how is it different from Azure Batch?
Azure CycleCloud is an installable web application that lets you deploy and manage complete HPC environments on Azure, including the scheduler, compute nodes, file systems, and authentication infrastructure, using the specific HPC scheduler your team already knows (Slurm, PBSPro, LSF, Grid Engine, HTCondor). Azure Batch, by contrast, is a managed "Scheduler as a Service" where Microsoft handles the scheduler layer and you just submit jobs. If your team has deep expertise in a particular scheduler and you need to bring that exact setup to Azure with minimal re-tooling, CycleCloud is the right choice. If you want to skip the scheduler admin work entirely and just run jobs, Batch is simpler.
My Azure CycleCloud nodes are stuck in "Acquiring" state. How do I fix this?
Nodes stuck in "Acquiring" almost always mean the Azure API call to create the VM is failing. Go to your cluster's Activity tab in the CycleCloud UI and look at the most recent events. You'll see the actual ARM error, which is usually one of three things: insufficient permissions on your CycleCloud service identity, Azure quota exhausted for the VM family you're requesting, or a policy deny from Azure Policy that's blocking VM creation in that region or with those tags. Fix the specific error you see in the Activity log rather than guessing, because each of those three root causes needs a different fix.
How do I set up Slurm autoscaling with Azure CycleCloud 8?
CycleCloud ships a Slurm cluster template that includes the autoscaling integration out of the box. When you create a Slurm cluster from that template, CycleCloud installs the cyclecloud-slurm package on the scheduler node, which adds a plugin that watches the Slurm queue and calls the CycleCloud API to request or release nodes based on pending jobs. The key configuration that most people miss is that the Slurm partition names in slurm.conf need to exactly match the nodearray names in your CycleCloud cluster template. If they don't match, the autoscaler won't know which nodearray to scale for each partition and scaling simply won't happen. Check sudo journalctl -u cyclecloud_slurm on the scheduler node for detailed autoscaler logs.
Can I use Azure CycleCloud 8 to mount Azure NetApp Files or Azure Managed Lustre?
Yes, Azure CycleCloud 8 supports mounting external file systems including Azure NetApp Files and Azure Managed Lustre on compute nodes through cluster template configuration. For Azure NetApp Files, you configure the NFS mount point in your cluster template's [[[configuration]]] block using the volume's mount target IP and export path. For Azure Managed Lustre, CycleCloud can configure the Lustre client on nodes and mount the file system during node initialization. The file system itself must exist in Azure before your cluster starts: CycleCloud mounts it but doesn't create it. Make sure your cluster's VNet has a peering or private endpoint connection to wherever the file system is hosted.
Why does the CycleCloud web application keep going to a 502 error or timing out?
A 502 or timeout on the CycleCloud web application almost always means the CycleCloud server process crashed or is overwhelmed. Check /opt/cycle_server/logs/cycle_server.log on the CycleCloud VM for OutOfMemoryError entries. If you see them, your CycleCloud VM is undersized; resize it to at least Standard_D4s_v5 so the CycleCloud service gets more heap space. If there's no OOM error, look for database lock errors, which can happen if CycleCloud's embedded datastore was corrupted by an unclean shutdown. In that case, stop the CycleCloud service, run /opt/cycle_server/bin/cycle_server verify, and restart. This triggers a consistency check that often self-heals minor corruption.
How do I use the CycleCloud REST API or Python API to manage clusters programmatically?
CycleCloud 8 ships a full RESTful API at https://<your-cyclecloud-host>/api/ that covers cluster creation, node management, template import/export, and event log queries: all the same operations you can do through the web UI. Microsoft also provides an official Python API wrapper (the CycleCloud Python client) that makes it much easier to script against the REST API without handling raw HTTP calls. The Python API reference is available in the CycleCloud documentation under the "Python API" section. Authentication uses the same username and password as the web UI, or you can use an API token that you generate under your CycleCloud user profile settings. Using the API is the right approach for integrating CycleCloud into existing automation pipelines or CI/CD workflows.