Azure Virtual Machine Scale Sets: Fix Common Errors
Why This Is Happening
I've seen this exact situation play out dozens of times: an Azure engineer deploys a Virtual Machine Scale Set, everything looks fine in the portal, and then nothing works. VMs can't reach the internet. The load balancer won't attach. Backup fails silently. The Azure portal just says "Provisioning failed" with an error code that tells you almost nothing useful.
Here's the thing about Azure Virtual Machine Scale Sets that trips people up most often: Microsoft fundamentally changed how they work when they introduced Flexible orchestration mode. If you're coming from an older deployment or a tutorial written before 2023, you're probably mixing Flexible-mode assumptions with Uniform-mode configurations, and Azure will let you deploy it, then fail in strange ways at runtime.
Azure Virtual Machine Scale Sets give you the ability to run and manage a group of load-balanced VMs automatically. The platform handles scaling, fault distribution, and availability. That's the pitch. The reality is that there are now three separate deployment models (Flexible orchestration, Uniform orchestration, and classic Availability Sets), and each one has a completely different feature matrix. Picking the wrong one, or assuming features carry over between them, is the root cause of most Azure Virtual Machine Scale Sets problems I see in support queues.
Flexible orchestration mode is the current recommended path. It supports up to 1,000 VMs per scale set, spreads instances across up to three fault domains in a region, and works with availability zones. But it drops support for several things Uniform mode had, and Microsoft's error messages rarely spell out which specific unsupported parameter is causing your deployment to fail.
For example: if you try to use a Basic Load Balancer with a Flexible orchestration scale set, the deployment fails. If you configure port forwarding using a NAT Pool (the Uniform way of doing it) instead of NAT Rules (the Flexible way), it fails. If you enable a system-assigned Managed Identity instead of a user-assigned one, it fails. None of these error messages say "you used the wrong orchestration mode"; the deployment just fails.
The other major pain point is outbound connectivity. In Uniform orchestration, VMs had default outbound internet access. In Flexible orchestration, that's gone. You must explicitly configure outbound connectivity, either via a NAT gateway, a Standard Load Balancer with outbound rules, or a public IP on each instance. Skip this, and your VMs will be silently cut off from the internet and from Azure services that require outbound access.
I know this is frustrating, especially when you're in the middle of a production deployment and the Azure documentation splits critical information across four different pages. Let me give you a clear, sequential path through the most common Azure Virtual Machine Scale Sets errors and how to fix them.
The Quick Fix: Try This First
Before you touch any advanced settings, run this PowerShell command to pull the current state of your scale set, including its orchestration mode and any deployment errors. This single output will tell you whether you're dealing with a mode mismatch, a networking gap, or a parameter conflict.
$vmss = Get-AzVmss -ResourceGroupName "YourResourceGroup" -VMScaleSetName "YourScaleSetName"
$vmss.OrchestrationMode
$vmss.ProvisioningState
$vmss.Sku
If OrchestrationMode returns Flexible, then confirm that every parameter you've configured is on the Flexible-supported list. The most common blockers are:
- Basic Load Balancer attached: Flexible does not support Basic LB. You must use Standard SKU.
- NAT Pool configured for port forwarding: Flexible uses NAT Rules per individual instance, not NAT Pools. Delete the pool and create individual NAT Rules instead.
- System-assigned Managed Identity enabled: Flexible only supports user-assigned Managed Identity. Go to your scale set's Identity blade, disable system-assigned, and attach a user-assigned identity.
- Single placement group set to true: Leave this null in Flexible mode. The platform picks the correct value automatically.
- Upgrade policy set to anything other than null/empty: In Flexible mode, the upgrade policy field must be left blank.
To fix a Managed Identity issue via PowerShell without redeploying from scratch:
# Remove system-assigned identity and assign user-assigned instead
$identity = Get-AzUserAssignedIdentity -ResourceGroupName "YourRG" -Name "YourManagedIdentity"
Update-AzVmss `
-ResourceGroupName "YourResourceGroup" `
-VMScaleSetName "YourScaleSetName" `
-IdentityType "UserAssigned" `
-IdentityId $identity.Id
If your scale set provisioned successfully but VMs can't reach the internet, that's the outbound connectivity gap. Jump to Step 2 below; that's almost certainly your issue.
Check singlePlacementGroup, upgradePolicy, and identity.type first. These three fields cause 80% of silent deployment failures I've seen in enterprise environments.
Step 1: Confirm Your Orchestration Mode and Scale Limits
The first thing to nail down is which orchestration mode you're running and whether your expected scale is actually supported. This isn't obvious from the Azure portal's main overview page; you have to look for it.
In the Azure portal, navigate to your scale set resource, then go to Properties in the left sidebar. Look for Orchestration mode. If you don't see it, scroll down; it's easy to miss. Alternatively, run this CLI command:
az vmss show \
--resource-group YourResourceGroup \
--name YourScaleSetName \
--query "{mode:orchestrationMode, capacity:sku.capacity, state:provisioningState}"
Here's what the limits mean for your planning:
- Flexible orchestration: Maximum 1,000 VM instances per scale set. Fault domain availability guaranteed up to 1,000 instances spread across up to 3 fault domains. No update domains; Azure handles maintenance fault-domain-by-fault-domain instead.
- Uniform orchestration: Maximum 3,000 VM instances with fault domain availability guarantees, up to 5 update domains.
- Availability Sets: Maximum 200 instances, up to 20 update domains.
If you need more than 1,000 VMs in a single Flexible scale set, that's currently not possible. Your options are to use Uniform orchestration (which supports up to 3,000 with FD guarantees) or split workloads across multiple scale sets behind a shared load balancer.
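If you go the multiple-scale-set route, the split itself is simple arithmetic. Here's a minimal Python sketch, assuming the 1,000-instance Flexible cap quoted above. The name plan_scale_sets is a hypothetical helper of mine, not an Azure API, and it only does the math; it never calls Azure.

```python
# Per-scale-set instance cap quoted in this guide (a constant, not queried from Azure).
FLEXIBLE_MAX_INSTANCES = 1000


def plan_scale_sets(total_vms: int, cap: int = FLEXIBLE_MAX_INSTANCES) -> list[int]:
    """Split total_vms into the fewest scale sets that each stay at or under cap.

    Returns one instance count per scale set, balanced so that no two
    scale sets differ by more than one instance.
    """
    if total_vms < 0:
        raise ValueError("total_vms must be non-negative")
    if cap <= 0:
        raise ValueError("cap must be positive")
    if total_vms == 0:
        return []
    # Ceiling division without floating point.
    count = (total_vms + cap - 1) // cap
    base, extra = divmod(total_vms, count)
    # The first `extra` scale sets take one additional instance.
    return [base + 1 if i < extra else base for i in range(count)]
```

For 2,500 VMs, plan_scale_sets(2500) returns three scale sets of 834, 833, and 833 instances. Pass cap=600 if you're planning around the lower custom-image limit covered in the FAQ.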
Also confirm availability zone configuration. Flexible and Uniform both support deploying across availability zones. Availability Sets do not. If your architecture diagram shows multi-zone deployment but you're running an Availability Set, that's your gap; you'll need to migrate to a scale set.
If everything checks out and provisioning state shows Succeeded but VMs aren't healthy, move to the next step.
Step 2: Configure Outbound Connectivity
This is the single most common Azure Virtual Machine Scale Sets problem I see with Flexible mode deployments. In Uniform orchestration, virtual machines got default outbound internet connectivity automatically. That default is gone in Flexible. Your VMs are deployed with no outbound path unless you explicitly build one.
You have three solid options. Pick the one that fits your architecture:
Option A: NAT Gateway (recommended for most cases)
# Create a public IP for the NAT gateway
az network public-ip create \
--resource-group YourRG \
--name NatGatewayIP \
--sku Standard \
--allocation-method Static
# Create the NAT gateway
az network nat gateway create \
--resource-group YourRG \
--name ScaleSetNatGateway \
--public-ip-addresses NatGatewayIP \
--idle-timeout 10
# Associate it with the subnet your scale set uses
az network vnet subnet update \
--resource-group YourRG \
--vnet-name YourVNet \
--name YourSubnet \
--nat-gateway ScaleSetNatGateway
Option B: Standard Load Balancer with outbound rules. Attach a Standard SKU Load Balancer (Basic is unsupported in Flexible mode) and define an outbound rule. In the Azure portal, go to your Load Balancer → Outbound rules → Add, select your backend pool and a frontend public IP, then save.
Option C: Public IP per instance. If you need each VM to have its own public IP (uncommon but valid for certain scenarios), configure this in the scale set's Network Interface settings under IP configurations.
After applying any of these, SSH or RDP into one of your scale set instances and run curl -s https://ifconfig.me (Linux) or Invoke-WebRequest -Uri https://ifconfig.me (Windows) to confirm outbound internet is working.
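If curl isn't available on the image, or you want the same check inside a startup script, a TCP-level probe does the job. This is a rough Python sketch using only the standard library. The function name can_reach is my own, and it only proves that a TCP connection opens; it does not tell you which public IP the traffic leaves from.

```python
import socket


def can_reach(host: str, port: int = 443, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False
```

Calling can_reach("ifconfig.me") from an instance should return True once one of the three outbound options is in place, and False while the instance still has no outbound path.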
Step 3: Replace NAT Pools with NAT Rules
If you're migrating from Uniform orchestration to Flexible, or if you copied a Uniform-mode ARM template and applied it to a Flexible scale set, you may have NAT Pools configured for port forwarding. NAT Pools are not supported in Flexible orchestration. The deployment will either fail or silently create a broken configuration.
Here's how to identify the problem. In the Azure portal, go to your Load Balancer and click Inbound NAT pools in the left menu. If you see pools listed there and your scale set is in Flexible mode, they won't work as expected.
The fix is to delete the NAT Pool and create individual NAT Rules instead. NAT Rules in Flexible mode target specific VM instances rather than a dynamic pool:
# List existing NAT rules on your load balancer
az network lb inbound-nat-rule list \
--resource-group YourRG \
--lb-name YourLoadBalancer
# Create a NAT rule targeting a specific instance (e.g., SSH to instance 0 on port 50000)
az network lb inbound-nat-rule create \
--resource-group YourRG \
--lb-name YourLoadBalancer \
--name NatRule-Instance0-SSH \
--protocol Tcp \
--frontend-port 50000 \
--backend-port 22 \
--frontend-ip-name YourFrontendIPConfig
Then associate the NAT rule with the specific VM's NIC:
az network nic ip-config inbound-nat-rule add \
--resource-group YourRG \
--nic-name YourVMNicName \
--ip-config-name ipconfig1 \
--lb-name YourLoadBalancer \
--inbound-nat-rule NatRule-Instance0-SSH
If you have many instances, you'll want to script this in a loop. It's more work than a NAT Pool, but it's the correct pattern for Flexible orchestration and gives you per-instance control that NAT Pools never really offered anyway.
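One way to script that loop is to generate the two commands per instance and review them before running anything. Below is a minimal Python sketch that does exactly that. It reuses only the az flags shown above; the function name, the rule naming scheme, and the idea of handing in a list of NIC names are assumptions of mine, so adapt them to however you enumerate your instances' NICs.

```python
import shlex


def nat_rule_commands(resource_group, lb_name, frontend_ip_name, nic_names,
                      base_frontend_port=50000, backend_port=22,
                      ip_config_name="ipconfig1"):
    """Build the az CLI commands that map one frontend port to each NIC.

    The NIC at position i in nic_names gets frontend port
    base_frontend_port + i. Nothing is executed; the commands are
    returned as shell-quoted strings for review.
    """
    commands = []
    for index, nic_name in enumerate(nic_names):
        rule_name = f"NatRule-Instance{index}-Port{backend_port}"
        create = [
            "az", "network", "lb", "inbound-nat-rule", "create",
            "--resource-group", resource_group,
            "--lb-name", lb_name,
            "--name", rule_name,
            "--protocol", "Tcp",
            "--frontend-port", str(base_frontend_port + index),
            "--backend-port", str(backend_port),
            "--frontend-ip-name", frontend_ip_name,
        ]
        attach = [
            "az", "network", "nic", "ip-config", "inbound-nat-rule", "add",
            "--resource-group", resource_group,
            "--nic-name", nic_name,
            "--ip-config-name", ip_config_name,
            "--lb-name", lb_name,
            "--inbound-nat-rule", rule_name,
        ]
        commands.append(shlex.join(create))
        commands.append(shlex.join(attach))
    return commands
```

Printing the returned list gives you a script you can read through and paste into a shell. Generating text instead of calling az directly is a deliberate choice here: a wrong port mapping is much easier to spot on the page than after it has been applied to a production load balancer.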
After this change, test connectivity by attempting to connect to a specific instance through the mapped frontend port on your load balancer's public IP. A successful connection confirms the NAT rule is working.
Step 4: Set Up Backup and Disaster Recovery
Azure Backup support across orchestration modes is one of the most important, and least-documented, differences between the three deployment types. Getting this wrong means you think you have backups when you don't.
Here's the hard rule: Azure Backup works with Flexible orchestration and Availability Sets, but does not work with Uniform orchestration scale sets. Similarly, Azure Site Recovery works with Flexible (via PowerShell) and Availability Sets, but not Uniform. If you're running Uniform orchestration and relying on Azure Backup, you currently have no backup coverage at the scale set level.
For Flexible orchestration scale sets, set up Azure Backup through the Recovery Services vault. In the Azure portal:
- Navigate to Recovery Services vaults and create or open your vault
- Click Backup → set Where is your workload running? to Azure
- Set What do you want to back up? to Virtual Machine
- Click Backup, then select your scale set instances from the VM list
- Assign a backup policy and click Enable Backup
For Azure Site Recovery on a Flexible scale set, use PowerShell; the portal UI doesn't fully support it yet:
# Enable replication for a scale set VM via PowerShell
$vault = Get-AzRecoveryServicesVault -Name "YourVaultName" -ResourceGroupName "YourRG"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault
# Then configure replication fabric, container, and policy as per your DR design
# See: New-AzRecoveryServicesAsrReplicationProtectedItem
After enabling backup, go back to your Recovery Services vault, click Backup items → Azure Virtual Machine, and verify your instances appear with status Backup enabled. If instances are missing, check that the VM agent is installed and healthy on each scale set instance.
Step 5: Install the Application Health Extension
Here's something that catches a lot of teams off guard when moving to Flexible orchestration: you can't use an SLB (Standard Load Balancer) health probe to report application health back to the scale set. That feature is Uniform-only. In Flexible mode, you need the Application Health Extension installed on each instance to get real health signal back to the scale set orchestrator.
Without the Application Health Extension, your scale set doesn't know whether the application inside the VM is actually working; it only knows the VM is running. That means automatic repair policies and rolling upgrades can't function correctly.
Install the Application Health Extension via Azure CLI:
az vmss extension set \
--resource-group YourRG \
--vmss-name YourScaleSetName \
--name ApplicationHealthLinux \
--publisher Microsoft.ManagedServices \
--version 1.0 \
--settings '{"protocol": "http", "port": 80, "requestPath": "/health"}'
For Windows VMs, replace ApplicationHealthLinux with ApplicationHealthWindows.
The extension sends HTTP or TCP probes to a local endpoint on each VM. Your application needs to expose a health endpoint that returns HTTP 200 when healthy. If you don't have one yet, a simple response on /health is enough to start. The exact content doesn't matter, only the status code.
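If your application has no health endpoint at all, a stand-in takes only a few lines. Here's a minimal sketch using Python's standard library, assuming the port 80 and /health path from the extension settings above. HealthHandler and make_server are names I made up for illustration, and a real endpoint should report the state of your actual application rather than always answering 200.

```python
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Answer GET /health with 200 and every other path with 404."""

    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)
        else:
            body = b"Not Found"
            self.send_response(404)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Suppress per-request logging so repeated probes don't flood stderr.
        pass


def make_server(port: int = 80, host: str = "0.0.0.0") -> ThreadingHTTPServer:
    """Create (but do not start) a server that serves the health endpoint."""
    return ThreadingHTTPServer((host, port), HealthHandler)
```

Start it with make_server().serve_forever(). Binding to port 80 normally requires elevated privileges on Linux, so pick a higher port and change the extension's port setting to match if you run it as an unprivileged user.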
After installation, update your instances to apply the extension:
az vmss update-instances \
--resource-group YourRG \
--name YourScaleSetName \
--instance-ids "*"
Then check instance health in the portal under your scale set → Instances. Each instance should show a health state of Healthy once the extension is running and the endpoint is responding. If an instance shows Unhealthy, SSH/RDP in and verify the application health endpoint is reachable locally first: curl http://localhost:80/health.
Advanced Troubleshooting
When the basic steps don't resolve things, it's time to dig into platform-level diagnostics. Azure Virtual Machine Scale Sets failures leave traces in Activity Logs, Azure Monitor, and the VM's own event system; you just have to know where to look.
Azure Activity Log (your first stop): In the Azure portal, open your scale set resource and click Activity log in the left sidebar. Filter by Timespan: Last 1 hour and Status: Failed. Expand any failed operations and read the Status message field in the JSON. This is where Azure actually tells you what went wrong, in far more detail than what surfaces in the portal notifications. Copy the full statusMessage text and search for it; Microsoft docs often have specific KB articles keyed to exact status message strings.
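If you'd rather pull those status messages out of exported Activity Log JSON than click through the portal, a short filter is enough. This is a hedged Python sketch: the field names status.value, operationName.value, and properties.statusMessage are my assumption about the shape of the exported entries, and failed_status_messages is a hypothetical helper, so check both against your own export before relying on it.

```python
import json


def failed_status_messages(entries):
    """Return (operation, message) pairs for entries whose status is Failed.

    `entries` is a list of Activity Log entries already parsed from JSON.
    The field names used here are assumptions about that export format.
    """
    results = []
    for entry in entries:
        status = (entry.get("status") or {}).get("value") or ""
        if status.lower() != "failed":
            continue
        operation = (entry.get("operationName") or {}).get("value") or ""
        raw = (entry.get("properties") or {}).get("statusMessage") or ""
        # statusMessage is often a JSON document serialized as a string;
        # decode it when possible so nested error codes are easy to read.
        try:
            message = json.loads(raw)
        except (TypeError, ValueError):
            message = raw
        results.append((operation, message))
    return results
```

Feed it the list you get from json.load() on the saved export, then search for the exact message strings it returns.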
VM-level Event Viewer (Windows): RDP into a scale set instance and open Event Viewer. Navigate to Windows Logs → Application and Windows Logs → System. For Azure extension failures, also check Applications and Services Logs → Microsoft → WindowsAzure → GuestAgent → Admin. Event ID 1 in the GuestAgent log with a red error icon usually points to a VM agent communication failure with the Azure fabric.
Diagnose scale-in/scale-out failures:
# Show the autoscale setting, including its profiles and rules
az monitor autoscale show \
--resource-group YourRG \
--name YourAutoscaleSetting
# List Activity Log entries from the last 2 hours for scale operations to see why they fired or failed
az monitor activity-log list \
--resource-group YourRG \
--offset 2h \
--query "[?contains(operationName.value, 'scale')]"
Ultra disk configuration errors: If your deployment template includes diskIOPSReadWrite or diskMBpsReadWrite fields for ultra disk tuning, remove them entirely when using Flexible orchestration. These parameters are not supported. The error you'll see is typically a generic InvalidParameter with a body that only mentions "disk configuration", which makes the real cause far from obvious.
Image-based Automatic OS Upgrades: This feature doesn't work in Flexible orchestration. If you had it enabled in a Uniform scale set and you're migrating, remove the automaticOSUpgradePolicy block from your ARM template before deploying to Flexible. Leave the upgrade policy null.
InfiniBand networking: If your workload uses high-performance computing with InfiniBand networking, Flexible orchestration isn't your answer. InfiniBand is only supported in Uniform orchestration with single placement group enabled. HPC workloads requiring InfiniBand should stay on Uniform for now.
Unmanaged disks: Flexible orchestration doesn't support unmanaged disks. If you're running any instances with VHDs stored directly in Storage Account blobs (as opposed to Azure Managed Disks), you'll need to migrate those to Managed Disks before moving to a Flexible scale set. Use the az vm convert command on individual VMs before bringing them into the scale set.
If you hit a ProvisioningState: Failed that doesn't produce a clear error in the Activity Log, that's a platform-level issue and you need to escalate. Also escalate if your scale set is stuck in Updating state for more than 30 minutes, or if you're seeing quota errors that portal quota increase requests aren't resolving. Reach out directly at Microsoft Support and include your scale set resource ID, the Activity Log JSON, and your subscription ID in the first message; it cuts response time significantly.
Prevention & Best Practices
Most Azure Virtual Machine Scale Sets problems are avoidable. The teams that run scale sets cleanly share a few habits that stop issues before they become incidents.
Always specify orchestration mode explicitly in your ARM templates or Bicep files. Don't rely on portal defaults. Set "orchestrationMode": "Flexible" in every deployment artifact. When you migrate templates between environments or hand them off to another engineer, the mode is right there: no guessing, no accidental mode mismatch.
Build a pre-deployment checklist specific to Flexible orchestration. Before any new scale set goes out, run through: Standard LB only? NAT Rules (not Pools)? User-assigned Managed Identity? No upgrade policy set? No ultra disk config fields? No unmanaged disks? Application Health Extension in the template? Outbound connectivity method chosen? That checklist, run in five minutes, eliminates the vast majority of deployment failures I see.
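Part of that checklist can be automated against the template before it ever reaches Azure. Here's a minimal Python sketch that checks a scale set resource definition, already loaded as a dict, for the fields this guide calls out: orchestrationMode, singlePlacementGroup, upgradePolicy, identity.type, and the two ultra disk tuning fields. The function names and the exact nesting I assume (properties.* and identity.type) are illustrative, not an official schema validator. It deliberately skips the load balancer, NAT, and outbound items because those live on other resources.

```python
def _contains_key(node, key):
    """Return True if key appears anywhere in a nested dict/list structure."""
    if isinstance(node, dict):
        return key in node or any(_contains_key(v, key) for v in node.values())
    if isinstance(node, list):
        return any(_contains_key(item, key) for item in node)
    return False


def flexible_checklist(resource):
    """Return a list of problems found in a scale set resource definition."""
    props = resource.get("properties") or {}
    if props.get("orchestrationMode") != "Flexible":
        # The remaining checks only make sense for Flexible mode.
        return ["orchestrationMode is not explicitly set to Flexible"]

    problems = []
    if props.get("singlePlacementGroup") is True:
        problems.append("singlePlacementGroup is true; leave it unset")
    if props.get("upgradePolicy"):
        problems.append("upgradePolicy is set; leave it empty")
    identity_type = (resource.get("identity") or {}).get("type") or ""
    if "SystemAssigned" in identity_type:
        problems.append("system-assigned identity requested; use user-assigned")
    for field in ("diskIOPSReadWrite", "diskMBpsReadWrite"):
        if _contains_key(resource, field):
            problems.append(f"{field} present; remove ultra disk tuning fields")
    return problems
```

An empty list means the resource passed these particular checks, not that the deployment is guaranteed to succeed. Wiring this into a pull request check is one way to make the five-minute checklist run itself.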
Use Azure Policy to enforce configuration standards. You can create a custom Azure Policy definition that audits or denies scale set deployments that don't meet your configuration standards, for example, requiring a Standard SKU Load Balancer or blocking Basic LB attachment. This catches human error before it reaches production. The policy JSON can target the Microsoft.Compute/virtualMachineScaleSets resource type.
Enable VM Insights on your scale set instances. VM Insights can be installed directly into individual scale set VMs regardless of orchestration mode. It gives you CPU, memory, disk, and network performance data without requiring separate agent setup. In the Azure portal, go to your scale set → Monitoring → Insights and follow the onboarding wizard. This is especially helpful for diagnosing performance-driven scale events that look like errors but are actually capacity decisions.
Test backup and recovery before you need it. If you're using Flexible orchestration with Azure Backup, run a test restore every quarter. Go to your Recovery Services vault, pick a backup item, and restore it to an alternate location. The first time you discover your backup isn't configured correctly should never be during an actual outage.
- Tag all scale set resources with orchestrationMode: Flexible so any team member can instantly identify the deployment type without running CLI commands
- Set up Azure Alerts on Provisioning Failed events for your scale sets so failures page you immediately, rather than finding out from a user report
- Keep your ARM/Bicep templates in source control and enforce pull request review for any changes to orchestration mode, identity type, or load balancer SKU
- Never assume Uniform-mode documentation applies to Flexible; always verify each feature in the Flexible orchestration feature comparison table before building on it
Frequently Asked Questions
How much scale does Flexible orchestration support? What's the actual VM limit?
Flexible orchestration mode supports up to 1,000 VM instances per scale set. That's the hard cap per scale set, with fault domain availability guaranteed across all 1,000 instances, spread across up to 3 fault domains depending on what your region supports. If you need more than 1,000 VMs, your options are Uniform orchestration (which supports up to 3,000 instances with fault domain guarantees) or deploying multiple Flexible scale sets behind a shared Standard Load Balancer or Application Gateway. There's no native way to get a single Flexible scale set beyond 1,000 instances right now.
How does availability in Flexible orchestration compare to Uniform orchestration and Availability Sets?
All three support fault domain distribution, but the numbers and mechanics differ significantly. Flexible orchestration spreads up to 1,000 instances across up to 3 fault domains and supports deployment across availability zones, but has no update domains; Azure handles maintenance fault-domain-by-fault-domain. Uniform orchestration supports up to 3,000 instances with FD guarantees, up to 5 update domains, and also supports availability zones. Availability Sets max out at 200 instances with up to 20 update domains, but do not support deployment across availability zones at all. For new workloads needing zone redundancy at scale, Flexible orchestration is the right choice.
How much does it cost to use Azure Virtual Machine Scale Sets?
The scale set itself is free: Microsoft charges nothing for the orchestration layer on top of your VMs. What you pay for is exactly what's running inside: the virtual machine compute hours, the storage attached to those VMs, networking bandwidth, and any other Azure services connected to your scale set like load balancers or NAT gateways. This means you can spin up a scale set with zero instances and pay nothing until you actually add VMs. Scale sets are genuinely just a management wrapper around standard Azure VMs from a billing perspective.
How many VMs can I actually put in a scale set? I've seen different numbers in different docs.
The numbers vary based on your image type. If you're deploying from a platform image (Windows Server, Ubuntu, etc. from the Azure Marketplace), a single scale set can hold 0 to 1,000 VMs. If you're using a custom image (one you've built yourself and stored as a Managed Image or in Azure Compute Gallery), the limit drops to 0 to 600 VMs per scale set. The 1,000-VM figure you see in Flexible orchestration documentation specifically refers to platform images. If you're building on a custom image and planning for large scale, factor the 600 cap into your architecture now before you hit it unexpectedly.
Are data disks supported in Azure Virtual Machine Scale Sets?
Yes, data disks are fully supported in scale sets. You define the data disk configuration at the scale set level, and that configuration applies uniformly to every VM instance in the set: same disk size, same SKU, same caching settings across all instances. This is a good fit for workloads where every VM needs identical attached storage. The important caveat is that Flexible orchestration does not support unmanaged disks; you must use Azure Managed Disks. Also, the ultra disk performance tuning fields (diskIOPSReadWrite and diskMBpsReadWrite) are not supported in Flexible orchestration mode, though ultra disk attachment itself may work depending on your VM series and region.
Can I use Azure Backup with Virtual Machine Scale Sets, or do I need a different strategy?
Azure Backup support depends entirely on which orchestration mode you're using. Flexible orchestration scale sets support Azure Backup natively, and Azure Site Recovery works too (via PowerShell; the portal UI has limited support). Availability Sets also support both. Uniform orchestration scale sets, however, support neither Azure Backup nor Azure Site Recovery at the scale set level. If you're running Uniform orchestration and need backup, you're currently limited to OS-level backup agents or application-consistent snapshots using custom scripts, not the managed Azure Backup service. This is one of the more compelling operational reasons to migrate workloads toward Flexible orchestration where possible.