Azure Compute Fleet: Fix Setup & Configuration Errors
Why This Is Happening
You've heard about Azure Compute Fleet, Microsoft's answer to large-scale, mixed VM workloads, and you're trying to get it running. Maybe you're provisioning a big data pipeline, a CI/CD farm, or a financial modeling job that needs hundreds of cores on demand. You go through the setup, hit Create, and get a vague deployment failure. Or the fleet spins up but your Spot VMs keep disappearing without replacement. Or you're staring at a quota error you didn't expect. I get it. This product is still maturing and the error messages are not always helpful.
Azure Compute Fleet is a relatively recent addition to the Azure portfolio. At its core it gives you a single API call to deploy up to 10,000 VMs, mixing Spot VMs and standard pay-as-you-go instances in the same fleet, all managed automatically. It handles eviction replacement, capacity monitoring, and SKU diversification so you don't have to babysit dozens of individual scale sets. That sounds great. But the configuration surface area is wide, and the concepts around minimum starting capacity, allocation strategies, and quota scoping catch people off guard constantly.
The most common reasons Azure Compute Fleet deployments fail or misbehave:
- Minimum starting capacity not satisfied. If Azure can't provision at least the number of VMs you set as the minimum, the whole deployment fails, even if you only needed 20 out of a 100-VM target.
- Regional quota exhaustion. Each region has hard limits: 500 active Compute Fleets per region, a 10,000 VM cap per fleet, and a 100,000 VM cap across all fleets in a single region. Hitting any of these can make provisioning fail without a clear error message.
- Misconfigured allocation strategy. Choosing the wrong strategy for Spot vs. Standard VMs means you either overpay or constantly fight capacity shortages.
- Trying to deploy in unsupported regions. Azure Compute Fleet is available in all public Azure regions except China-based regions. Attempting deployment there throws a non-obvious error.
- SDK or ARM template version mismatch. If you're deploying via infrastructure-as-code rather than the portal, outdated SDK versions or ARM schema mismatches cause silent misconfigurations.
I know this is frustrating, especially when the fleet concept is supposed to simplify compute management and you're spending more time debugging it than running workloads. This guide covers every major failure mode I've seen, with exact steps to resolve each one.
The Quick Fix, Try This First
If your Azure Compute Fleet deployment is failing outright on creation, the number-one culprit is a minimum starting capacity mismatch. This is the single most common issue I see with new fleet setups, and it's fixable in under two minutes.
Here's what's happening: when you configure a Compute Fleet, you can optionally set a minimum starting capacity, a threshold number of VMs that must be provisionable before the fleet is considered successfully deployed. If Azure can't fulfill that threshold (because of regional capacity pressure or Spot availability), the entire deployment fails, even if your actual target is much higher.
To fix it:
- Open the Azure portal and navigate to your Compute Fleet resource (or to the creation wizard if you're mid-setup).
- In the Basics tab, locate the Minimum starting capacity field.
- Reduce this number. If your target is 100 VMs and your minimum is 50, try dropping the minimum to 10 or 15. You want just enough to validate that capacity exists in that region.
- Alternatively, switch the configuration type to Maintain capacity. Minimum starting capacity is not configurable in that mode because Azure handles it automatically, so switching removes the constraint entirely.
- Click Review + create and deploy again.
If deployment succeeds this time, your fleet is healthy. The minimum starting capacity setting was simply too aggressive for current regional Spot availability. You can always adjust target capacity upward later once the fleet is running.
If the fleet still fails, keep reading. The remaining steps cover quota issues, allocation strategy misconfigurations, and portal-level errors.
Getting the Azure Compute Fleet setup started correctly in the portal avoids a surprising number of downstream headaches. The fleet creation UI is not surfaced from the standard "Virtual Machines" blade, which trips people up right away.
Here's the exact navigation path:
- Sign in to portal.azure.com.
- In the top search bar, type Compute Fleet. Under the results, select the result listed under Marketplace, not Services. This distinction matters; selecting the wrong result drops you in the wrong blade.
- On the Compute Fleet Marketplace page, click Create.
- In the Basics tab under Project details, confirm the correct subscription. If you're working across multiple tenants or subscriptions, double-check here. I've seen engineers burn 30 minutes wondering why quota errors show up, only to realize they were deploying into a dev subscription with a 10-VM cap.
- Create a new resource group. The docs suggest naming it something like myFleetResourceGroup. Keep it descriptive; something like rg-computefleet-prod-eastus saves a lot of confusion when you have multiple fleets.
- Set the fleet name under Instance details.
For the image selection, Azure Compute Fleet supports Windows Server images and Linux images including RHEL, CentOS, Ubuntu, and SLES. Pick the image that matches your workload. If you're running containerized jobs, Ubuntu LTS with Docker pre-installed is a common choice.
What you should see if this step worked: The Basics tab validates cleanly with no red error indicators, and you can advance to the next tab. If you see a subscription or region validation error here, that's your first sign of a quota or access issue. Resolve that before going further.
This is the step most people rush through, and it causes more ongoing headaches than anything else in the Azure Compute Fleet configuration. Your allocation strategy determines how the fleet picks VMs when provisioning, and a bad match between strategy and workload means you'll either overpay or constantly fight capacity gaps.
Azure Compute Fleet offers distinct allocation strategies for Spot VMs and Standard VMs. Here's how to think about each:
For Spot VMs:
- Lowest price: Fleet picks whichever SKUs are currently cheapest. Great for fault-tolerant batch jobs where cost is king and eviction is tolerable.
- Capacity optimized: Fleet picks SKUs with the most available capacity, reducing eviction risk. Good for jobs where interruption is expensive even if you're using Spot.
- Price & capacity balanced: A blend of both. This is the right default for most workloads. It's what I'd recommend starting with unless you have a specific cost target or uptime requirement that forces you toward one extreme.
For Standard (pay-as-you-go) VMs:
- Lowest price: Picks the cheapest standard SKUs in your allowed list. Works well when your SKU preference list is broad.
- Prioritized: Respects the order of your SKU list. You define priority, and the fleet honors it. Use this when you have infrastructure requirements that mandate specific VM families.
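To make the difference between the strategies concrete, here is a minimal Python sketch of how each one could rank the same set of offers. It is a toy model, not Azure's actual selection logic: the `SkuOffer` type, the prices, and the `capacity_score` values are all made-up assumptions for illustration, and the balanced strategy is left out because the source does not say how it weighs price against capacity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SkuOffer:
    name: str
    hourly_price: float   # illustrative number, not real Azure pricing
    capacity_score: int   # hypothetical 0-100 estimate of spare capacity


def pick_lowest_price(offers):
    """Lowest price: choose the cheapest offer."""
    return min(offers, key=lambda o: o.hourly_price)


def pick_capacity_optimized(offers):
    """Capacity optimized: choose the offer with the most headroom."""
    return max(offers, key=lambda o: o.capacity_score)


def pick_prioritized(offers, priority_order):
    """Prioritized: choose the first SKU in the caller's order that is on offer."""
    on_offer = {o.name: o for o in offers}
    for name in priority_order:
        if name in on_offer:
            return on_offer[name]
    raise LookupError("none of the prioritized SKUs is currently on offer")


# Hypothetical snapshot of what a region might be offering right now.
OFFERS = [
    SkuOffer("Standard_D4s_v5", hourly_price=0.05, capacity_score=40),
    SkuOffer("Standard_E4s_v5", hourly_price=0.07, capacity_score=85),
    SkuOffer("Standard_F4s_v2", hourly_price=0.04, capacity_score=15),
]
```

With this sample data the cheapest offer is also the one with the least headroom, which is the trade-off behind the eviction exposure of the lowest-price strategy.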
To set the strategy in the portal, look for the Allocation strategy section within the fleet creation wizard. Set it separately for Spot and Standard if you're mixing both types.
One thing the Azure Compute Fleet docs don't shout loudly enough: you can also mix purchasing models inside a single fleet. Reserved Instances, Savings Plans, Spot, and pay-as-you-go all coexist. There's no extra charge from Microsoft for running a fleet itself. You only pay for the underlying VMs, billed per hour.
What you should see: After selecting strategies, the configuration panel shows your combined fleet setup without errors. If the strategy dropdown is grayed out, check whether you accidentally left the fleet type as a one-time request rather than a managed fleet; the UI behavior differs between those two modes.
Target capacity is where Azure Compute Fleet gets powerful, and where it's easy to misconfigure in ways that cause silent failures. You set target capacity separately for Spot VMs and pay-as-you-go VMs. These are independent numbers, and the fleet manages them independently based on your workload.
To configure this correctly:
- In the fleet creation wizard, locate the Target capacity section.
- Enter your desired Spot VM count and your desired Standard VM count separately. The combined total across all fleets in a region cannot exceed 100,000 VMs, and a single fleet is capped at 10,000 VMs.
- Decide on Minimum starting capacity. This field tells Azure: "Don't finish the deployment unless you can give me at least this many VMs." If you set this too high relative to regional availability, the entire deployment will fail with a capacity error.
Here's the exact behavior described in the official docs, and it's worth understanding precisely: if your target is 100 VMs and your minimum starting capacity is 20, the deployment only succeeds if Azure can provision at least 20. If regional capacity is tight and only 15 are available, you get a hard failure, not a partial deployment.
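The all-or-nothing rule is easy to get wrong, so here is a small Python sketch of it. The function name and the return shape are hypothetical, and the assumption that a successful deployment provisions as many VMs as are available (up to the target) is mine, not something the docs quoted here spell out.

```python
def deployment_outcome(target, minimum_start, available):
    """Model the minimum-starting-capacity check for a fleet deployment.

    target        -- VMs requested in total
    minimum_start -- VMs that must be provisionable for the deployment to succeed
    available     -- VMs the region can supply right now (hypothetical input)

    Returns a (status, provisioned_count) tuple.
    """
    if minimum_start > target:
        raise ValueError("minimum starting capacity cannot exceed the target")
    if available < minimum_start:
        # Hard failure: nothing is deployed, not even the VMs that were available.
        return ("failed", 0)
    # Assumption: on success the fleet takes what it can get, capped at the target.
    return ("succeeded", min(available, target))
```

For the case described above, `deployment_outcome(100, 20, 15)` reports a failure with zero VMs, while lowering the minimum to 10 with the same 15 VMs available would succeed.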
Important constraint: If you configure your fleet with Maintain capacity as the capacity preference type, you cannot set a minimum starting capacity. Azure takes over that responsibility automatically and attempts to keep the fleet at your target. This is the right mode for long-running services where you need consistent compute.
# To check current regional quota usage via Azure CLI
# (az quota ships as a CLI extension: az extension add --name quota):
az quota show \
  --resource-name "computeFleets" \
  --scope "/subscriptions/{subscriptionId}/providers/Microsoft.Compute/locations/{region}"
What you should see: After entering capacity values, the portal validates the numbers against your current quota. If you see a red warning about quota limits before you even try to deploy, you need to request a quota increase first via the Azure portal's Quotas blade.
One of the more underused features of Azure Compute Fleet is attribute-based VM selection. Instead of manually listing every VM SKU you'll accept, you tell the fleet what your workload actually needs (a certain amount of memory, a vCPU range, a storage type), and Azure picks from whatever SKUs satisfy those attributes at any given moment. This is especially valuable for Azure Compute Fleet Spot configurations where SKU availability shifts constantly.
To set this up in the portal:
- In the VM configuration section, look for the option to specify VM sizes. You'll see two modes: Specify VM sizes (manual list) and Let Azure Compute Fleet decide based on attributes.
- Select the attribute-based option.
- You can then set constraints for:
- vCPU range: e.g., minimum 4 vCPUs, maximum 32
- Memory range: e.g., minimum 16 GB RAM
- Local storage: whether local temp disk is required
- VM architecture: x64 vs. Arm64 if your workload cares
The advantage here is significant: instead of a static SKU list that can run dry on capacity, you're effectively giving the fleet a wide pool to draw from. If the D4s_v5 is fully allocated in your region, the fleet can fall back to E4s_v5 or any other SKU that meets your attribute thresholds. This is why attribute-based VM selection in Azure Compute Fleet dramatically reduces capacity-related deployment failures.
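The filtering idea can be sketched in a few lines of Python. This is an illustration of the concept only: the `Sku` and `Attributes` types and the catalog entries are assumptions written for this example, and the three `local_storage` values are modeled on the "Excluded" value used in the template snippet in this guide rather than taken from a verified schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sku:
    name: str
    vcpus: int
    memory_gib: float
    has_local_disk: bool


@dataclass(frozen=True)
class Attributes:
    min_vcpus: int
    max_vcpus: int
    min_memory_gib: float
    # Assumed values: "Included" (don't care), "Excluded", or "Required".
    local_storage: str = "Included"


def matches(sku, attrs):
    """Return True when the SKU satisfies every attribute constraint."""
    if not (attrs.min_vcpus <= sku.vcpus <= attrs.max_vcpus):
        return False
    if sku.memory_gib < attrs.min_memory_gib:
        return False
    if attrs.local_storage == "Excluded" and sku.has_local_disk:
        return False
    if attrs.local_storage == "Required" and not sku.has_local_disk:
        return False
    return True


def eligible_skus(catalog, attrs):
    """Names of all catalog SKUs the fleet could fall back to."""
    return [sku.name for sku in catalog if matches(sku, attrs)]


# Illustrative catalog; check real SKU specs before relying on these numbers.
CATALOG = [
    Sku("Standard_D2s_v5", vcpus=2, memory_gib=8, has_local_disk=False),
    Sku("Standard_D4s_v5", vcpus=4, memory_gib=16, has_local_disk=False),
    Sku("Standard_E4s_v5", vcpus=4, memory_gib=32, has_local_disk=False),
    Sku("Standard_D4ds_v5", vcpus=4, memory_gib=16, has_local_disk=True),
]
```

Calling `eligible_skus(CATALOG, Attributes(4, 32, 16, "Excluded"))` keeps both the D4s_v5 and the E4s_v5, which is exactly the fallback pair described above.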
Note that attribute-based selection is currently in Preview. That means the UI might have rough edges, and you should test in a non-production environment before counting on it for critical workloads. The feature is surfaced in the portal as of the latest update cycle.
# Example ARM template snippet for attribute-based selection:
"vmAttributes": {
  "vCpuCount": {
    "min": 4,
    "max": 32
  },
  "memoryInGiB": {
    "min": 16
  },
  "localStorageSupport": "Excluded"
}
What you should see: The portal displays a summary of matching VM SKUs based on your attribute filters. If the count of matching SKUs shows as very low (under 5), consider loosening your constraints. A narrow attribute set behaves the same as a short manual SKU list and carries the same capacity risk.
If you're running Spot VMs inside your Azure Compute Fleet (and most people are, because that's where the cost savings come from), eviction is not an if, it's a when. Azure Compute Fleet has built-in eviction handling, but you have to configure it correctly or you'll find your fleet silently shrinking over time.
Here's what happens when a Spot VM gets evicted:
- The eviction can be triggered by Azure needing that capacity back (a capacity eviction) or because the Spot price exceeded your maximum price threshold (a price eviction).
- Azure Compute Fleet can automatically attempt to replace evicted Spot VMs to maintain your target capacity, but this only happens if your fleet is configured with Maintain capacity mode.
To verify your eviction replacement is configured:
- Open your fleet resource in the Azure portal.
- Under Configuration, confirm the capacity type is set to Maintain capacity rather than Request capacity (one-time).
- Check your allocation strategy for Spot VMs. If it's set to Lowest price, you're most exposed to eviction. Consider switching to Capacity optimized or the balanced option if eviction frequency is hurting your workload.
You should also make sure your applications running on these VMs handle eviction gracefully. Azure sends a 30-second eviction notice to the VM before termination. Your app should listen for the scheduled events endpoint to catch this signal and checkpoint work before the VM goes away:
# Poll the Instance Metadata Service for scheduled eviction events:
curl -H "Metadata: true" \
  "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
If EventType returns Preempt, your VM has 30 seconds. Use that window to flush state, write checkpoints, or gracefully drain connections.
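For workloads written in Python, the same check can live inside the job itself. The sketch below uses only the standard library and the same endpoint as the curl command above. The helper names (`fetch_scheduled_events`, `preempt_events`, `poll_once`) are hypothetical, and the code assumes the response is a JSON document with an `Events` list whose entries carry an `EventType` field.

```python
import json
import urllib.request

# Same endpoint and API version as the curl example above.
IMDS_URL = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_scheduled_events(timeout=2.0):
    """Fetch the scheduled-events document from the Instance Metadata Service."""
    request = urllib.request.Request(IMDS_URL, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def preempt_events(document):
    """Return only the events that announce a Spot eviction."""
    return [e for e in document.get("Events", []) if e.get("EventType") == "Preempt"]


def poll_once(on_preempt, fetch=fetch_scheduled_events):
    """Check once; call on_preempt(events) if an eviction is pending.

    Returns True when at least one Preempt event was found.
    """
    events = preempt_events(fetch())
    if events:
        on_preempt(events)  # e.g. flush state or write a checkpoint
    return bool(events)
```

A worker would call `poll_once(save_checkpoint)` in a loop every few seconds, where `save_checkpoint` is whatever your job uses to persist progress. Passing a fake `fetch` function makes the handler testable off-Azure, since the metadata address only answers from inside a VM.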
To monitor evictions at the fleet level, check the fleet's Activity log in the portal. Eviction events appear there with timestamps and affected VM instance IDs, which helps you correlate evictions with workload failures.
What you should see: In a healthy fleet under Maintain capacity mode, the instance count under Overview should hover near your target even after eviction events. If it keeps dropping, your allocation strategy may not have access to enough diverse SKUs to replace evicted capacity. Widen your SKU list or switch to attribute-based selection.
Advanced Troubleshooting
Quota Errors and How to Read Them
Azure Compute Fleet quota errors are frustratingly generic in the portal UI but carry specific numeric limits you can check against. The three limits that bite people most often:
- 500 active Compute Fleets per region: rare unless you're a large enterprise running many separate fleets, but real in managed service provider scenarios.
- 10,000 VMs per fleet: hard cap. If you're designing a workload that needs more, you need multiple fleets.
- 100,000 VMs across all fleets in a region: this is the one that catches scale-ups by surprise. You might be at 90,000 VMs spread across several fleets and try to add a new large fleet, only to hit this wall.
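A pre-flight check against these three limits is simple enough to run before any deployment. The Python sketch below hard-codes the numbers listed above; the function name and its inputs are hypothetical, and you would need to supply the sizes of your existing fleets in the region yourself.

```python
MAX_FLEETS_PER_REGION = 500
MAX_VMS_PER_FLEET = 10_000
MAX_VMS_PER_REGION = 100_000


def limit_violations(new_fleet_vms, existing_fleet_sizes):
    """List which regional limits a new fleet would break.

    new_fleet_vms        -- target VM count of the fleet you want to create
    existing_fleet_sizes -- VM counts of the fleets already in the region
    """
    problems = []
    if len(existing_fleet_sizes) + 1 > MAX_FLEETS_PER_REGION:
        problems.append("fleet count would exceed 500 per region")
    if new_fleet_vms > MAX_VMS_PER_FLEET:
        problems.append("fleet would exceed 10,000 VMs")
    if sum(existing_fleet_sizes) + new_fleet_vms > MAX_VMS_PER_REGION:
        problems.append("region would exceed 100,000 VMs across all fleets")
    return problems
```

An empty list means the request fits within the documented caps. It says nothing about your subscription's own vCPU quotas, which are checked separately.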
To check and request quota increases:
- In the Azure portal, search for Quotas.
- Select Compute from the provider list.
- Filter by your target region and look for Compute Fleet-related entries.
- Use the Request increase button directly in that blade. For production workloads, request at least 2x your expected peak, because quota increase requests take time to process.
ARM Template Deployments: Schema Version Issues
If you're deploying Azure Compute Fleet via ARM templates or Bicep and the deployment keeps returning schema validation errors, the most likely cause is an outdated API version. Make sure your ARM template targets the correct API version for the Microsoft.AzureFleet/fleets resource type. Check the Azure REST API docs for the current stable version. Using a preview API version in a production pipeline can cause silent property mismatches where fields you've set are simply ignored.
{
  "type": "Microsoft.AzureFleet/fleets",
  "apiVersion": "2024-11-01",
  "name": "[parameters('fleetName')]",
  "location": "[parameters('location')]",
  ...
}
Multi-Region Fleet Configuration (Preview)
Azure Compute Fleet now supports distributing workloads across multiple regions (up to three regions per fleet) as a preview feature. If you're configuring multi-region fleet deployment and the secondary regions aren't getting capacity, confirm:
- Each region is listed explicitly in your fleet's region configuration.
- You have sufficient quota in each target region independently; quota is not shared across regions.
- Your VM SKU list (or attribute filters) includes SKUs available in all target regions. A SKU that exists in East US may not exist in West Europe.
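The SKU check in the last bullet is a set intersection. Here is a minimal Python sketch; the region names and SKU lists are invented sample data, and in practice you would fill the mapping from whatever SKU availability listing you trust for each region.

```python
def skus_available_everywhere(regional_skus):
    """Return the SKUs offered in every region of the mapping."""
    sets = [set(skus) for skus in regional_skus.values()]
    if not sets:
        return set()
    return set.intersection(*sets)


def missing_by_region(requested, regional_skus):
    """Map each region to the requested SKUs it does not offer."""
    return {
        region: sorted(set(requested) - set(skus))
        for region, skus in regional_skus.items()
        if set(requested) - set(skus)
    }


# Hypothetical availability data for a three-region fleet.
REGIONAL_SKUS = {
    "eastus": ["Standard_D4s_v5", "Standard_E4s_v5", "Standard_F4s_v2"],
    "westeurope": ["Standard_D4s_v5", "Standard_E4s_v5"],
    "southeastasia": ["Standard_D4s_v5", "Standard_F4s_v2"],
}
```

Any SKU that shows up in `missing_by_region` is a candidate explanation for a secondary region that never receives capacity.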
SDK Integration Problems
Azure Compute Fleet SDKs are available for Java, JavaScript, Go, and Python. If SDK calls are returning unexpected errors, first verify you're using the current SDK version. To find the right SDK:
- Go to azure.github.io/azure-sdk.
- Search for Compute Fleet in the search bar.
- Select your language and verify the SDK version you're using matches the latest stable release.
Authentication errors from SDK calls almost always trace back to service principal permission scope. The service principal or managed identity making fleet API calls needs Contributor role (or a custom role with Microsoft.AzureFleet/* permissions) on the subscription or resource group.
Prevention & Best Practices
Most Azure Compute Fleet problems I've seen in production are preventable with a bit of upfront planning. The product is designed to remove operational complexity from large-scale compute, but "fire and forget" only works well if the initial configuration is solid. Here's how to set yourself up for long-term stability.
Design your SKU list wide, not narrow. A fleet with three allowed VM types will fail far more often than a fleet with fifteen. SKU availability fluctuates across regions and availability zones constantly. The more VM types you allow, or the broader your attribute filter, the more fallback options the fleet has. For most batch workloads, the exact CPU microarchitecture doesn't matter. If your workload is CPU-generation agnostic, let the fleet pick from a wide pool.
Use Maintain capacity mode for persistent workloads. If your fleet is running a service that needs to stay at a target size, the one-time request mode is the wrong choice. Maintain capacity mode keeps the fleet actively monitoring and replacing VMs, handling Spot evictions automatically. Think of it like Azure managing the fleet health loop so you don't have to.
Separate your Spot and Standard capacity targets intentionally. Don't just put everything in Spot and assume the fleet handles it. Define a stable baseline of Standard VMs for the minimum viable workload, then use Spot VMs for burst capacity above that baseline. This way, even when Spot capacity tightens in a region, your core operation continues uninterrupted.
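As a back-of-the-envelope aid, the baseline-plus-burst split can be written down explicitly. This Python sketch is purely arithmetic; the function name and the dictionary keys are hypothetical and do not correspond to any fleet API fields.

```python
def split_capacity(total_target, standard_baseline):
    """Split a total VM target into a Standard baseline and a Spot burst.

    total_target      -- VMs wanted when Spot capacity is plentiful
    standard_baseline -- VMs the workload needs to stay minimally viable
    """
    if standard_baseline < 0 or total_target < standard_baseline:
        raise ValueError("baseline must be between 0 and the total target")
    return {
        "standard": standard_baseline,
        "spot": total_target - standard_baseline,
        # Worst case if every Spot VM is evicted at once.
        "guaranteed_floor": standard_baseline,
    }
```

For a 100-VM job that can limp along on 20, `split_capacity(100, 20)` gives 20 Standard and 80 Spot, and the floor you can count on is the 20 Standard VMs.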
Tag your fleet resources consistently. Compute Fleets can spawn a large number of VMs quickly, up to 10,000 in a single fleet. Without proper tagging, cost attribution across projects, environments, or teams becomes a nightmare. Set resource tags at fleet creation time and confirm they propagate to child VM resources.
- Request 2x your expected peak quota before you need it. Quota increases take time and you don't want to be blocked mid-scale-up.
- Enable Azure Monitor alerts on your fleet's instance count so you know the moment a sustained capacity drop happens.
- If using Spot VMs, always implement the Instance Metadata Service eviction handler in your workload code. The 30-second notice is enough to checkpoint most jobs.
- Test your fleet configuration with a small target capacity first, then scale. A 10-VM test deployment validates quotas, SKU availability, networking, and image configuration at a fraction of the cost of debugging a 1,000-VM failure.
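For the instance-count alert in the list above, the part worth thinking through is what counts as "sustained". The Python sketch below shows one possible rule: alert only when the count stays under a fraction of the target for several consecutive samples. The threshold, the window, and the function name are assumptions, and this is a model of the rule, not an Azure Monitor configuration.

```python
def sustained_drop(samples, target, threshold=0.9, window=3):
    """Return True if the instance count stayed low for `window` samples in a row.

    samples   -- instance counts in time order
    target    -- the fleet's target capacity
    threshold -- fraction of target below which a sample counts as low
    window    -- how many consecutive low samples trigger the alert
    """
    floor = target * threshold
    run = 0
    for count in samples:
        run = run + 1 if count < floor else 0
        if run >= window:
            return True
    return False
```

A single dip after an eviction that recovers on the next sample does not trigger the alert, which keeps normal Maintain capacity replacement from paging anyone.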
Frequently Asked Questions
What exactly is Azure Compute Fleet and how is it different from a VM Scale Set?
Azure Compute Fleet is a higher-level building block that lets you deploy up to 10,000 VMs in a single API call, mixing Spot and Standard VM types across multiple SKUs with automated management. A VM Scale Set (VMSS) is scoped to a single VM type and requires more manual management for multi-SKU scenarios. Compute Fleet is designed for workloads that benefit from SKU diversity (batch processing, CI pipelines, financial simulations), where you want Azure to handle SKU selection, Spot eviction recovery, and capacity optimization automatically. If your workload is a traditional autoscaled web tier with a fixed VM type, VMSS may still be the right tool. Compute Fleet shines when you need scale, price flexibility, and hands-off management simultaneously.
Why does my Azure Compute Fleet deployment keep failing with a capacity error?
The most common cause is a minimum starting capacity that's set too high relative to available regional capacity at the time of deployment. If you've set a minimum of 50 VMs and the region only has 30 Spot units available for your allowed SKUs, the whole deployment fails. Lower the minimum starting capacity or switch to Maintain capacity mode, which removes that threshold check entirely. It's also worth widening your allowed VM SKU list: a narrow SKU list means fewer options when capacity is tight, and a single unavailable SKU can block an otherwise viable deployment. You can verify capacity indirectly by checking Spot VM availability in the region via the Azure portal's Spot pricing page before deploying.
Is there any extra cost for using Azure Compute Fleet on top of VM costs?
No, Microsoft doesn't charge a management fee or platform surcharge for Azure Compute Fleet itself. You pay only for the VMs the fleet provisions, billed per hour at the same rates as if you had launched those VMs directly. Spot VMs in the fleet are billed at Spot prices, Standard VMs at pay-as-you-go or Reserved Instance rates depending on how you've configured your purchasing model. This makes Compute Fleet effectively a free management layer on top of your existing compute spend. The value is in the automation and capacity management, not in a premium product tier.
Can I deploy Azure Compute Fleet across multiple Azure regions?
Yes, and this is one of the more powerful features of the product. A single Compute Fleet can span up to three Azure regions simultaneously, distributing your workload across geographic locations for better capacity availability and resilience. This multi-region deployment capability is currently in Preview. Each region in your fleet configuration is independent from a quota perspective: the 100,000 VM cap applies per region, so three regions give a combined ceiling of 300,000 VMs across all your fleets (subject to your subscription limits). A single fleet is still bound by the 10,000 VM per-fleet cap described earlier. Keep in mind that networking costs for cross-region VM communication can add up quickly, so architect your workload to minimize data movement between regions if cost is a factor.
What programming languages can I use to manage Azure Compute Fleet via SDK?
Azure Compute Fleet SDKs are available for Java, JavaScript, Go, and Python. Each SDK is designed to follow that language's idioms: you're not just wrapping raw REST calls, you get proper objects, error handling, and async patterns appropriate for the language. To find the current SDK version and documentation, go to the official Azure SDKs page and search for "Compute Fleet." You'll see the available packages grouped by language. Always use the latest stable release rather than preview versions in production code, and make sure the service principal or managed identity your code authenticates with has Contributor permissions on the fleet resource or its resource group.
My Spot VMs keep getting evicted. How do I reduce eviction frequency in Azure Compute Fleet?
Eviction frequency is primarily controlled by two things: your allocation strategy and your SKU diversity. If you're using the Lowest price allocation strategy, you're selecting SKUs based on cheapest current price, which often means SKUs under high demand pressure and therefore higher eviction risk. Switching to Capacity optimized or the balanced price-and-capacity strategy significantly reduces eviction frequency because the fleet prioritizes SKUs with more available headroom. Additionally, widening your allowed SKU list (or using attribute-based VM selection) gives the fleet more fallback options: if one SKU type is being reclaimed by Azure, the fleet can quickly substitute a different SKU that meets your attribute requirements but has more available capacity at that moment.