How to Troubleshoot Azure Virtual Machine Scale Sets
Why Azure Virtual Machine Scale Sets Break, and Why the Errors Are So Confusing
I've spent years watching Azure Virtual Machine Scale Sets go sideways at exactly the wrong moment: right before a product launch, during a Black Friday spike, or on a Monday morning when your boss is watching dashboards. And the maddening part is that Azure's error messages are often spectacularly unhelpful. "Provisioning failed." Thanks. Really narrows it down.
Here's the reality. Azure Virtual Machine Scale Sets are genuinely complex orchestration machinery under the hood. When you ask a scale set to spin up five new instances, Azure is simultaneously coordinating with the Compute fabric, the network stack, the storage layer, your load balancer's backend pool, the health extension agent, and, if you're using custom images, the image gallery replication pipeline. Any one of those can fail silently, produce a cryptic allocation error, or just quietly leave instances stuck in a Creating or Failed provisioning state while the portal shows you a spinner for 20 minutes.
The most common root causes I see in the field break down like this:
- Quota exhaustion: Your subscription hit its regional vCPU limit. The error code is OperationNotAllowed, with message text containing "quota" or the specific quota name, like standardDSv3Family. This is by far the number one cause I see.
- Unhealthy health probes: Your load balancer health probe is returning failures, so new instances get drained immediately after provisioning, triggering a cascade of scale-in and scale-out events.
- Extension failures: A VM extension (Custom Script Extension, Azure Monitor agent, Microsoft Antimalware) fails during provisioning, which marks the whole instance as failed. The VM is actually running; the extension just timed out or hit a transient error.
- Autoscale policy misconfiguration: Conflicting scale-in and scale-out rules, missing cooldown periods, or a metric that never fires because the namespace wasn't enabled.
- Upgrade policy conflicts: Manual vs. Automatic vs. Rolling upgrade modes interact with instance refresh operations in ways that leave instances on outdated model versions without obvious warning.
- Spot instance evictions: If you're running VMSS on Azure Spot, evictions can look like mysterious instance disappearances if you haven't wired up the Spot eviction scheduled event handler.
- Image replication lag: Using a Shared Image Gallery (Azure Compute Gallery) image that hasn't fully replicated to your target region yet produces AllocationFailed errors that look like capacity problems but aren't.
I know this is frustrating, especially when your scale set backs a production workload and the Azure portal just shows you a red X without telling you which of the seven things above actually broke. The good news: almost every VMSS problem leaves evidence in the Activity Log, the instance health model, or Azure Monitor. You just need to know where to look.
The Quick Fix: Check the Activity Log First, Always
Before you touch a single setting, go straight to the Activity Log. This sounds obvious, but most people instinctively click into the individual instances list and start guessing. The Activity Log is where Azure records exactly what operation failed and why, often with the specific error code your team or Microsoft Support will need.
Here's how to pull it up for your specific scale set:
- Go to the Azure Portal (portal.azure.com) and navigate to your Virtual Machine Scale Set resource.
- In the left-hand menu, under Monitoring, click Activity log.
- Set the Timespan dropdown to Last 24 hours (or the window when you first noticed the problem).
- Filter by Status = Failed. This cuts through the noise immediately.
- Click any failed operation to expand it. Then click the JSON tab in the detail pane on the right.
In the JSON, look for the statusMessage field. Ignore the outer wrapper and drill down until you find it nested inside the properties object. The real error is almost always buried two or three levels deep. You'll see something like:
"statusMessage": "{\"status\":\"Failed\",\"error\":{\"code\":\"OperationNotAllowed\",\"message\":\"Operation results in exceeding quota limits of Core. Maximum allowed: 100, Current in use: 96, Additional requested: 10.\"}}"
That's your answer. Quota. Done. No need to dig further: go to Subscriptions > Usage + quotas and request an increase for the relevant family.
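If you pull these entries often, decoding the nested string by hand gets old. Here's a minimal Python sketch that unpacks a statusMessage value like the one above and, for quota errors, pulls out the three numbers. The function name extract_error and the regular expression are my own assumptions based on the sample message; other error types will use different wording.

```python
import json
import re


def extract_error(status_message: str) -> dict:
    """Decode the JSON string stored in statusMessage and return its error details."""
    payload = json.loads(status_message)
    error = payload.get("error") or {}
    result = {"code": error.get("code"), "message": error.get("message", "")}
    # Quota messages (as in the sample above) carry three numbers; extract them if present.
    match = re.search(
        r"Maximum allowed:\s*(\d+), Current in use:\s*(\d+), Additional requested:\s*(\d+)",
        result["message"],
    )
    if match:
        limit, in_use, requested = (int(group) for group in match.groups())
        result["quota"] = {
            "limit": limit,
            "in_use": in_use,
            "requested": requested,
            # How many units over the limit the request would land.
            "shortfall": in_use + requested - limit,
        }
    return result


# The sample statusMessage from this guide, after the outer JSON has been decoded.
sample = (
    '{"status":"Failed","error":{"code":"OperationNotAllowed",'
    '"message":"Operation results in exceeding quota limits of Core. '
    'Maximum allowed: 100, Current in use: 96, Additional requested: 10."}}'
)
print(extract_error(sample))
```

The shortfall value tells you the minimum increase to request; in practice you'd ask for more than that to leave headroom.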
If the Activity Log shows a successful scale operation but instances are still stuck in Creating state for more than 15 minutes, the issue is typically an extension failure, not a provisioning failure. In that case, go to the instance view checks later in this guide, which surface extension-level errors.
If the Activity Log is clean, with no failures at all, and your VMSS simply isn't scaling out when you expect it to, your autoscale rules aren't firing. Skip to the autoscale diagnostic flow (the Run history walkthrough) later in this guide.
Once you've reviewed the Activity Log, your next stop is Resource Health, specifically at the individual instance level. This is where Azure Virtual Machine Scale Sets troubleshooting gets granular.
In the Azure Portal, navigate to your scale set, then click Instances in the left menu. You'll see a list of all instances with their Provisioning state and Power state. Any instance showing Failed in the Provisioning state column is your target.
Click the failed instance name. In the instance detail view, go to Support + troubleshooting > Resource health. This gives you a health timeline specific to that VM, separate from the scale set's overall health. Look for events marked Degraded or Unavailable with timestamps matching when your problem started.
Now run this Azure CLI command to get the raw instance view, which includes extension-level failure detail that the portal sometimes hides:
az vmss get-instance-view \
--resource-group "YourResourceGroup" \
--name "YourScaleSetName" \
--instance-id 0
Replace --instance-id 0 with the specific instance number. In the output, look for the extensions array. Each extension has a statuses field. A healthy extension shows "code": "ProvisioningState/succeeded". A broken one shows "code": "ProvisioningState/failed" with a message that tells you exactly which extension choked and why.
Common extension failures I see: Custom Script Extension timing out on a slow download, the Azure Monitor Agent failing because the workspace ID was wrong, or the Microsoft Dependency Agent crashing because the OS image's kernel version isn't one the VM insights agent supports.
If you confirm an extension failure and the underlying VM is actually healthy, you can remediate without reprovisioning the whole instance: reapply just the extension through the portal under Extensions + applications, or reapply the scale set model to that instance via:
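When you have more than a handful of instances, scanning that JSON by eye is slow. The sketch below filters an instance view for failed extensions. It assumes the output of az vmss get-instance-view has been loaded into a Python dict with the extensions and statuses fields described above; the function name and the sample data are made up for illustration.

```python
def failed_extensions(instance_view: dict) -> list:
    """Return (extension name, message) pairs for every extension status that reports failure."""
    failures = []
    for extension in instance_view.get("extensions") or []:
        for status in extension.get("statuses") or []:
            code = (status.get("code") or "").lower()
            # Prefix match, in case the code carries a trailing suffix after "failed".
            if code.startswith("provisioningstate/failed"):
                failures.append((extension.get("name"), status.get("message") or ""))
    return failures


# Hypothetical instance view trimmed down to the fields this check reads.
sample_view = {
    "extensions": [
        {
            "name": "AzureMonitorLinuxAgent",
            "statuses": [{"code": "ProvisioningState/succeeded", "message": "ok"}],
        },
        {
            "name": "CustomScript",
            "statuses": [{"code": "ProvisioningState/failed/1", "message": "Download timed out"}],
        },
    ]
}
print(failed_extensions(sample_view))
```

Run it once per instance ID and you get a short list of which extension broke where, instead of a wall of JSON.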
az vmss update-instances \
--resource-group "YourResourceGroup" \
--name "YourScaleSetName" \
--instance-ids 0
When it works, the instance provisioning state changes to Succeeded within 5–10 minutes and the instance moves into the load balancer's healthy pool.
One of the most insidious Azure Virtual Machine Scale Sets problems is when everything looks like it's working: instances provision successfully and the scale set shows a healthy status, but you're seeing constant scale-out and scale-in events. Instances spin up, get marked unhealthy almost immediately, and the scale set frantically tries to compensate. Your costs skyrocket and nothing actually handles traffic.
This is almost always a health probe misconfiguration. Here's how to diagnose it systematically.
First, check what health mode your VMSS is using. Navigate to your scale set > Health and repair in the left menu. You'll see either Application health extension or Load balancer health probe configured. Note which one.
If you're using a load balancer health probe, navigate to your Load Balancer resource, then Health probes. Check the probe settings carefully:
- Protocol: Is it HTTP or HTTPS? If your app only listens on HTTP but the probe is HTTPS, every probe silently fails.
- Port: Is the probe port actually open in your NSG (Network Security Group)? Azure's internal health probe traffic comes from the address 168.63.129.16; if your NSG blocks this, probes fail silently.
- Path: For HTTP probes, does the path return a 200 status code? Not a 302 redirect, not a 401, specifically 200. Many apps return 302 on the root path.
To verify NSG rules aren't blocking probe traffic, run:
az network nsg rule list \
--resource-group "YourResourceGroup" \
--nsg-name "YourNSGName" \
--output table
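Look for an inbound rule that allows TCP traffic from the AzureLoadBalancer service tag. NSG rules are evaluated in ascending order of priority number, so a rule with a smaller number wins. If the Allow rule is missing, or a Deny rule with a smaller priority number matches the probe traffic first, add it: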
az network nsg rule create \
--resource-group "YourResourceGroup" \
--nsg-name "YourNSGName" \
--name "AllowAzureLBHealthProbe" \
--priority 100 \
--source-address-prefixes AzureLoadBalancer \
--destination-port-ranges 80 443 \
--access Allow \
--protocol Tcp \
--direction Inbound
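To reason about which rule actually wins for probe traffic, here's a rough Python sketch of the priority evaluation. It assumes the JSON from az network nsg rule list has been loaded into a list of dicts with fields such as priority, direction, access, sourceAddressPrefix and destinationPortRanges. It is deliberately simplified: it only treats "*" and the AzureLoadBalancer tag as matching the probe source, and it ignores the NSG's built-in default rules. The function names and sample rules are hypothetical.

```python
def port_matches(port_spec: str, port: int) -> bool:
    """True if an NSG port spec ('*', '443', or '8000-8100') covers the given port."""
    if port_spec == "*":
        return True
    if "-" in port_spec:
        low, high = port_spec.split("-", 1)
        return int(low) <= port <= int(high)
    return int(port_spec) == port


def deciding_rule(rules: list, port: int):
    """Return the first custom inbound rule (smallest priority number) matching probe traffic, or None."""
    inbound = [rule for rule in rules if rule.get("direction") == "Inbound"]
    for rule in sorted(inbound, key=lambda r: r["priority"]):
        sources = list(rule.get("sourceAddressPrefixes") or [])
        if rule.get("sourceAddressPrefix"):
            sources.append(rule["sourceAddressPrefix"])
        # Simplification: CIDR ranges that happen to contain the probe address are not checked.
        if not any(source in ("*", "AzureLoadBalancer") for source in sources):
            continue
        if (rule.get("protocol") or "*").lower() not in ("*", "tcp"):
            continue
        ports = list(rule.get("destinationPortRanges") or [])
        if rule.get("destinationPortRange"):
            ports.append(rule["destinationPortRange"])
        if any(port_matches(spec, port) for spec in ports):
            return rule
    return None


# Hypothetical rule set: a broad Deny sits ahead of the probe Allow rule.
sample_rules = [
    {"name": "DenyAllInbound", "priority": 200, "direction": "Inbound", "access": "Deny",
     "protocol": "*", "sourceAddressPrefix": "*", "destinationPortRange": "*"},
    {"name": "AllowAzureLBHealthProbe", "priority": 300, "direction": "Inbound", "access": "Allow",
     "protocol": "Tcp", "sourceAddressPrefix": "AzureLoadBalancer",
     "destinationPortRanges": ["80", "443"]},
]
winner = deciding_rule(sample_rules, 80)
print(winner["name"], winner["access"])
```

In this made-up rule set the Deny at priority 200 decides the outcome, which is exactly the situation where probes fail even though an Allow rule exists. A None result means no custom rule matched and the NSG's default rules apply.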
After fixing probe configuration, give it 5 minutes and check the Load Balancer's backend pool health under Backend pools > click your pool > view the health column. When it shows green checkmarks, your instances are genuinely healthy and the scale churn should stop.
Your autoscale rules look fine on paper. The CPU metric should be triggering a scale-out at 70%. But the VMSS just... sits there at two instances while your CPU screams at 90%. What's happening?
Azure autoscale has a dedicated diagnostic log that most people never look at. It's separate from the Activity Log and the Insights metrics. Go to your scale set > Scaling > then click the Run history tab in the autoscale blade. This shows every autoscale evaluation and exactly why it scaled, didn't scale, or got blocked.
Common reasons autoscale silently fails:
- Cooldown period active: After any scale event, autoscale won't fire again until the cooldown expires (default 5 minutes for scale-out). Check if your rules are triggering but getting blocked by an active cooldown.
- Min/Max capacity boundary: If your maximum instance count is set to 2 and you already have 2 instances, autoscale literally cannot scale out. Azure logs this as "already at maximum capacity" in the run history.
- Metric namespace not enabled: If you're scaling on a custom metric or a metric from Azure Monitor, the Diagnostic Settings for that metric must be enabled and flowing to the autoscale metric source. Check under your VMSS > Diagnostic settings.
To inspect your autoscale configuration exactly as Azure has stored it, use the Azure CLI autoscale show command:
az monitor autoscale show \
--resource-group "YourResourceGroup" \
--name "YourScaleSetName-autoscale" \
--output json
Check the profiles array. Each profile has rules, and each rule has metricTrigger and scaleAction. Verify that metricName exactly matches the metric name Azure exposes (spelling and case matter), and that metricResourceUri points to the correct resource ID.
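Two of those checks are easy to script. The following Python sketch flags profiles that are already at their maximum and rules whose metricResourceUri points somewhere other than the scale set. It assumes the JSON from az monitor autoscale show has been loaded into a dict shaped as described above; the function name, the placeholder resource IDs, and the sample setting are my own. A rule can legitimately watch another resource (a queue, for instance), so treat the second finding as a prompt to double-check rather than a definite error.

```python
def autoscale_findings(setting: dict, vmss_id: str, current_count: int) -> list:
    """Return human-readable notes about capacity limits and metric sources worth double-checking."""
    findings = []
    for profile in setting.get("profiles") or []:
        name = profile.get("name", "<unnamed>")
        # Capacity values may arrive as strings, so normalise them to int.
        maximum = int(profile["capacity"]["maximum"])
        if current_count >= maximum:
            findings.append(f"{name}: already at maximum capacity ({maximum})")
        for rule in profile.get("rules") or []:
            trigger = rule.get("metricTrigger") or {}
            uri = trigger.get("metricResourceUri") or ""
            if uri.lower() != vmss_id.lower():
                findings.append(
                    f"{name}: rule on '{trigger.get('metricName')}' watches a different resource"
                )
    return findings


# Placeholder IDs and a hypothetical autoscale setting for illustration.
vmss_id = "/subscriptions/000/resourceGroups/rg/providers/Microsoft.Compute/virtualMachineScaleSets/web"
sample_setting = {
    "profiles": [
        {
            "name": "default",
            "capacity": {"minimum": "2", "maximum": "2", "default": "2"},
            "rules": [
                {
                    "metricTrigger": {
                        "metricName": "Percentage CPU",
                        "metricResourceUri": vmss_id.replace("/web", "/old-web"),
                    },
                    "scaleAction": {"direction": "Increase", "cooldown": "PT5M"},
                }
            ],
        }
    ]
}
for finding in autoscale_findings(sample_setting, vmss_id, current_count=2):
    print(finding)
```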
If you want to force a test scale event without waiting for a metric threshold, you can manually override the instance count from the portal: Scaling > Manual scale > set a count > Save. This confirms the scale machinery itself works, and the issue is specifically with metric-triggered autoscale.
When autoscale starts firing correctly, you'll see green entries in the Run history tab with the reason "Scale out initiated" and the specific metric value that crossed the threshold.
This one catches experienced Azure engineers off guard. You updated your scale set's VM image, changed an environment variable, or modified the OS disk size. You saved the changes. But when you look at your instances, the Latest model column in the Instances view shows No for some or all of them. They're running the old configuration. How?
This is by design, and it's the Upgrade Policy biting you. Azure Virtual Machine Scale Sets support three upgrade modes:
- Manual: Instance models never update automatically. You must explicitly trigger an upgrade.
- Automatic: Azure updates instances as soon as the model changes, in no guaranteed order, so several instances (potentially all of them) can be down at the same time.
- Rolling: Azure updates instances in batches, with fine-grained control over batch size and health gates.
If you're on Manual mode and your instances show "No" in the Latest model column, you need to apply the model update. You can do this for all instances at once:
az vmss update-instances \
--resource-group "YourResourceGroup" \
--name "YourScaleSetName" \
--instance-ids "*"
The * argument selects all instances. Warning: in Manual mode this applies the model to every selected instance at once, so instances that need a restart may go down together. Do it during a maintenance window, or pass specific instance IDs to stagger the updates.
If you're on Automatic or Rolling mode and updates still aren't applying, check the Health and repair configuration. Rolling upgrades pause when instances are unhealthy. If your health probe is returning failures (see the health probe checks earlier in this guide), the rolling upgrade engine sees this as "upgrading broke the instance" and halts. Fix the health probe first, then the upgrade will resume automatically.
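If you want to stagger the update by hand, you first need the list of instances that are behind. Here's a small Python sketch that groups outdated instance IDs into batches. It assumes you've loaded the JSON from az vmss list-instances into a list of dicts and that each entry carries instanceId and latestModelApplied fields; the function name, the batch size, and the sample data are hypothetical.

```python
def outdated_batches(instances: list, batch_size: int) -> list:
    """Group the IDs of instances not on the latest model into batches of at most batch_size."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    outdated = [vm["instanceId"] for vm in instances if not vm.get("latestModelApplied", False)]
    return [outdated[i:i + batch_size] for i in range(0, len(outdated), batch_size)]


# Hypothetical instance list trimmed to the two fields this sketch reads.
sample_instances = [
    {"instanceId": "0", "latestModelApplied": False},
    {"instanceId": "1", "latestModelApplied": False},
    {"instanceId": "3", "latestModelApplied": True},
    {"instanceId": "4", "latestModelApplied": False},
]
for batch in outdated_batches(sample_instances, batch_size=2):
    # Each batch maps to one update-instances call, e.g. --instance-ids 0 1
    print("--instance-ids", " ".join(batch))
```

Wait for each batch to report healthy before starting the next one; that pause is what gives you the safety a single all-instances update lacks.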
To check the current upgrade policy and spot potential conflicts:
az vmss show \
--resource-group "YourResourceGroup" \
--name "YourScaleSetName" \
--query "upgradePolicy" \
--output json
The output shows your current mode and, for Rolling mode, the rollingUpgradePolicy settings including maxBatchInstancePercent and maxUnhealthyInstancePercent. When instances successfully apply the new model, the Latest model column switches to Yes and provisioning state shows Succeeded.
Quota errors are the most fixable VMSS problem and the one people panic about the most. When your scale set tries to provision new instances and hits the regional vCPU limit, you get the error code OperationNotAllowed and Azure simply refuses to allocate any new VMs. Your autoscale rules keep trying, keep failing, and the instances stay at their current count while your workload suffers.
First, find out exactly which quota you've hit. In the portal: Subscriptions > select your subscription > Usage + quotas in the left menu. In the search box, type the VM family name you're using (e.g., Standard DSv3). You'll see your current usage vs. your limit.
If you're at or near 100%, you have two paths forward:
Path 1: Request a quota increase. Click the pencil icon next to the quota entry and submit a quota increase request. Azure typically approves these within minutes for common SKUs in well-utilized regions. For large increases (over 200 vCPUs) or less common SKUs, it might take a few hours and require business justification.
Path 2: Spread across regions or use a different SKU. If you need capacity immediately and can't wait for a quota increase, consider changing your VMSS to use a different VM SKU with available quota, or deploy a second scale set in a paired region.
For ongoing quota management, run this command to get a consolidated view of all compute quotas in a region:
az vm list-usage \
--location "eastus" \
--output table \
--query "[?contains(name.localizedValue, 'Standard DSv3')].{Name:name.localizedValue, Current:currentValue, Limit:limit}"
Replace eastus with your target region and the SKU filter with your VM family.
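The arithmetic behind a quota failure is simple enough to check before you scale. This Python sketch takes usage entries shaped like the JSON from az vm list-usage (name.localizedValue, currentValue, limit) and reports whether a planned scale-out fits. The function name, the 70% warning level (borrowed from the prevention advice later in this guide), and the sample entry are assumptions for illustration.

```python
def quota_report(usages: list, family: str, requested: int, warn_at: float = 0.70) -> list:
    """Summarise quota entries matching a VM family and whether `requested` more vCPUs would fit."""
    rows = []
    for item in usages:
        label = item["name"]["localizedValue"]
        if family.lower() not in label.lower():
            continue
        # Values may arrive as strings or numbers depending on CLI version.
        used, limit = int(item["currentValue"]), int(item["limit"])
        rows.append({
            "name": label,
            "used": used,
            "limit": limit,
            "fits": used + requested <= limit,
            "warn": limit > 0 and used / limit >= warn_at,
        })
    return rows


# Hypothetical usage entry mirroring the numbers in the Activity Log example.
sample_usages = [
    {"name": {"localizedValue": "Standard DSv3 Family vCPUs", "value": "standardDSv3Family"},
     "currentValue": 96, "limit": 100},
    {"name": {"localizedValue": "Total Regional vCPUs", "value": "cores"},
     "currentValue": 120, "limit": 350},
]
print(quota_report(sample_usages, "Standard DSv3", requested=10))
```

With 96 of 100 vCPUs in use, a request for 10 more does not fit, which matches the OperationNotAllowed message shown earlier.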
There's a second type of allocation failure that isn't quota-related: AllocationFailed, with a message saying Azure does not have sufficient capacity for the requested VM size in the region. This means Azure genuinely doesn't have physical capacity for that SKU in that region or availability zone right now. Solutions: use a different VM size, try a different availability zone in the same region (change the zones property on your scale set), or enable Spot instances as a fallback tier if your workload can tolerate preemption.
When quota is resolved and allocation succeeds, the Activity Log shows a new Create or Update Virtual Machine Scale Set operation with status Succeeded, and your instance count climbs to the requested level within a few minutes.
Advanced Troubleshooting for Azure Virtual Machine Scale Sets
If the steps above haven't cracked it, you're dealing with something deeper. Here's where I go when the standard fixes fail.
Using Azure Monitor and Log Analytics for VMSS Diagnostics
If you have Log Analytics connected to your scale set through Azure Monitor, you can query the health history with Kusto Query Language (KQL). This is the fastest way to understand patterns across dozens of instances. Open Azure Monitor > Logs and run:
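// Column names follow the current AzureActivity schema; older workspaces may expose ResourceProvider, Resource, and ActivityStatus instead.
AzureActivity
| where ResourceProviderValue =~ "MICROSOFT.COMPUTE"
| where _ResourceId contains "YourScaleSetName"
| where ActivityStatusValue in ("Failure", "Failed")
| project TimeGenerated, OperationNameValue, ActivityStatusValue, Properties, Caller
| order by TimeGenerated desc
| take 50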
This gives you the last 50 failed operations on your scale set, with the full properties payload and who triggered each operation. Extremely useful for tracking down intermittent failures that don't show up when you check the portal manually.
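For health extension data specifically, if you're using the Application Health Extension, you can query it the same way. Note that VMHealth_CL is a custom log table (that's what the _CL suffix means), so the query below only works if you already ingest health data into a table with that name and those column names; adjust both to match your own setup: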
VMHealth_CL
| where ResourceId contains "YourScaleSetName"
| summarize arg_max(TimeGenerated, *) by instanceId_s
| project instanceId_s, healthState_s, TimeGenerated
Group Policy and Enterprise Domain-Joined VMSS
Domain-joined VMSS instances in enterprise environments have a specific failure mode I see regularly: the instances provision fine, but after joining the domain, Group Policy pushes settings that conflict with the workload, disabling services, locking down network connections, or applying firewall rules that block the health probe port. The VM itself reports healthy to Azure, but the application layer is broken.
To diagnose this, RDP into a healthy instance and run:
gpresult /H C:\gpresult.html
Open the generated HTML file and look under Computer Configuration for any policies related to Windows Firewall, service startup type, or network settings. If you find a conflicting policy, work with your Active Directory team to create a WMI filter on the GPO that excludes your VMSS computer accounts. Never fight Group Policy by trying to undo its changes at the OS level; the policy will just reapply.
Spot Instance Eviction Troubleshooting
Azure Spot instances in a VMSS get a 30-second eviction notice via Scheduled Events. If you're not handling this event in your application, the VM just disappears mid-request and your users see connection resets. Implement the Scheduled Events endpoint in your application by polling:
http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01
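Every request to this endpoint must include the header Metadata: true, or the metadata service rejects it. Set this up as a background thread that polls every 10 seconds; a shorter interval leaves you more of the 30-second window to react. When you see an event of type Preempt, trigger your graceful shutdown sequence (drain connections, flush queues, update your service registry) before the 30-second window expires.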
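Here's a minimal Python sketch of that polling loop, using only the standard library. It assumes the response is a JSON document with an Events array whose entries carry an EventType field, and that requests carry the Metadata: true header. The function names, the callback, and the sample document are hypothetical, and the endpoint is only reachable from inside an Azure VM. Error handling for failed requests is left out to keep it short.

```python
import json
import time
import urllib.request

ENDPOINT = "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"


def fetch_events(timeout: float = 2.0) -> dict:
    """Query the metadata service for scheduled events (works only from inside an Azure VM)."""
    request = urllib.request.Request(ENDPOINT, headers={"Metadata": "true"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


def preempt_events(document: dict) -> list:
    """Return only the events that announce a Spot eviction."""
    return [event for event in document.get("Events") or [] if event.get("EventType") == "Preempt"]


def watch(on_preempt, interval_seconds: float = 10.0) -> None:
    """Poll until a Preempt event appears, then pass it to the caller-supplied on_preempt callback."""
    while True:
        events = preempt_events(fetch_events())
        if events:
            on_preempt(events[0])
            return
        time.sleep(interval_seconds)


# Hypothetical response used to exercise the filter without calling the endpoint.
sample_document = {
    "DocumentIncarnation": 7,
    "Events": [
        {"EventId": "a1", "EventType": "Freeze", "Resources": ["web_3"]},
        {"EventId": "b2", "EventType": "Preempt", "Resources": ["web_3"]},
    ],
}
print([event["EventId"] for event in preempt_events(sample_document)])
```

In a real service you'd start watch in a daemon thread at startup and pass your shutdown routine as on_preempt.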
Custom Image Replication Lag
If you're using Azure Compute Gallery (formerly Shared Image Gallery) and getting AllocationFailed errors that you're certain aren't quota-related, check whether your image version has finished replicating to the target region. Go to Azure Compute Gallery > your gallery > your image definition > your image version. Under the Replication tab, each region shows a replication status. If it shows Replicating or Failed, your VMSS can't use that image version yet.
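The same check can be scripted. This Python sketch lists the regions where an image version isn't ready yet. It assumes you've fetched the image version with its replication status included (for example via az sig image-version show with the ReplicationStatus expand option) and that the result contains replicationStatus.summary entries with region and state fields; treat that shape, the function name, and the sample data as assumptions to verify against your own output.

```python
def pending_regions(image_version: dict) -> dict:
    """Map each region whose replication is not Completed to its current state."""
    summary = (image_version.get("replicationStatus") or {}).get("summary") or []
    return {entry["region"]: entry.get("state") for entry in summary if entry.get("state") != "Completed"}


# Hypothetical image version trimmed to the fields this check reads.
sample_version = {
    "name": "1.0.3",
    "replicationStatus": {
        "aggregatedState": "InProgress",
        "summary": [
            {"region": "East US", "state": "Completed", "progress": 100},
            {"region": "West Europe", "state": "Replicating", "progress": 40},
            {"region": "Southeast Asia", "state": "Failed", "progress": 0},
        ],
    },
}
print(pending_regions(sample_version))
```

An empty result means every region reports Completed, so image replication is probably not the cause of your AllocationFailed errors.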
Fix a failed replication by editing the image version and re-adding the target region. Azure will retry the replication job. For ongoing issues, consider increasing the Storage account type from Standard HDD to Standard SSD; HDD replication is significantly slower and times out more often on large images.
Escalate to Microsoft Support if you're seeing InternalExecutionError or InternalOperationError in your Activity Log; these indicate Azure infrastructure-side failures that you cannot fix from your end. Also escalate if your scale set is stuck in a partial delete state (instances are gone but the resource itself won't delete), if you're experiencing consistent allocation failures for a SKU that shows available quota, or if Resource Health shows platform-level degradation affecting your region. When you open a support case, include your subscription ID, the full Activity Log JSON for the failed operations, and the output of az vmss show for your scale set. This dramatically reduces the back-and-forth with the support engineer.
Prevention & Best Practices for Azure Virtual Machine Scale Sets
After you've fixed the immediate problem, the goal is to never be in this position again. Here's what I put in place for every production VMSS deployment I touch.
Set up Azure Monitor alerts on your scale set before you go live. The two most important alerts: one for failed provisioning operations (Activity Log alert, condition: "Create or Update Virtual Machine Scale Set" with status Failed), and one for instance count dropping below your minimum healthy threshold (Metric alert on "VMs Running" metric going below your floor). Both alerts should notify your on-call channel, not just email; email is too slow for production incidents.
Test your health probe endpoint independently. Before deploying a new VMSS or changing your app, manually curl your health probe URL from inside the VNet to confirm it returns HTTP 200. Don't assume, verify. Use Azure Bastion or a jump box in the same subnet as your scale set instances.
Use instance protection for long-running workloads. If you have instances that are processing jobs that take longer than your scale-in cooldown period, enable Instance Protection on those instances. This prevents autoscale from terminating them mid-job. Set it programmatically:
az vmss update \
--resource-group "YourResourceGroup" \
--name "YourScaleSetName" \
--instance-id 3 \
--protect-from-scale-in true
Monitor your quota headroom proactively. Don't wait for a production scale-out failure to discover you're at 95% of your vCPU quota. Set up a quota alert in Azure Advisor or use a simple Azure Automation runbook that checks quota usage weekly and sends a report. Request quota increases when you hit 70%; the lead time for large increases can be hours, and you don't want to need capacity urgently.
Test your scale-out path under load before it matters. Once a month, manually trigger a scale-out to your maximum instance count, watch all instances become healthy, then scale back in. This validates your entire pipeline (image availability, extension provisioning, health probe configuration, load balancer registration) while you have time to fix problems rather than during a live incident.
- Enable Terminate notification under your VMSS > Instance termination notification. The delay is configurable (between 5 and 15 minutes), which gives workloads time to drain gracefully before an instance is deleted.
- Set Scale-in policy to OldestVM so freshly warmed instances stay up and stale ones are removed first; this reduces churn in rolling deployments.
- Store your VMSS ARM template or Bicep definition in source control and deploy via CI/CD pipeline. Manual portal changes are how config drift happens and how you end up with mystery settings nobody remembers adding.
- Enable Automatic OS image upgrades for production workloads on well-known base images. Let Azure handle security patch rollouts rather than maintaining your own patching pipeline for scale set VMs.
Frequently Asked Questions
Why are my VMSS instances stuck in "Creating" state for over 20 minutes?
Instances stuck in Creating almost always indicate an extension failure, not a provisioning failure. The VM itself is ready, but one of the extensions you've configured (Custom Script Extension, Azure Monitor Agent, etc.) is hanging or failing. Run az vmss get-instance-view --resource-group YourRG --name YourVMSS --instance-id 0 and look at the extensions array in the output. Each extension has a statuses field: look for ProvisioningState/failed or ProvisioningState/timedOut. Fix the extension configuration first, then run az vmss update-instances to reapply the model. The instance should move to Succeeded within 5–10 minutes after the extension completes successfully.
My autoscale keeps scaling out and then immediately scaling back in. How do I stop it?
This "flapping" behavior is almost always caused by one of two things: your scale-out metric threshold and scale-in metric threshold are too close together, or your health probe is marking new instances as unhealthy immediately after they join the pool, triggering a scale-in. Start by opening Scaling > Run history to see which metric is triggering each event. If it's metric-driven flapping, increase the gap between your scale-out threshold (e.g., CPU > 70%) and scale-in threshold (e.g., CPU < 30%), and increase the cooldown period to at least 10 minutes. If it's health-probe-driven, follow the health probe checks earlier in this guide to validate your probe configuration and NSG rules.
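To see why close thresholds cause flapping, it helps to estimate what the metric will look like right after a scale-in. The Python sketch below does that under a simplifying assumption of my own: load is spread evenly, so the average metric rises in proportion as instances are removed. The function names and numbers are illustrative only.

```python
def projected_after_scale_in(current_average: float, count: int, remove: int = 1) -> float:
    """Estimate the average metric once `remove` instances are gone, assuming evenly spread load."""
    remaining = count - remove
    if remaining < 1:
        raise ValueError("at least one instance must remain")
    return current_average * count / remaining


def would_flap(scale_out_threshold: float, scale_in_threshold: float, count: int, remove: int = 1) -> bool:
    """True if scaling in right at the scale-in threshold would push the metric past scale-out."""
    return projected_after_scale_in(scale_in_threshold, count, remove) >= scale_out_threshold


# With the 70% / 30% thresholds suggested above, three instances scale in safely.
print(would_flap(70, 30, count=3))
# With thresholds of 50% and 40%, the same scale-in immediately re-triggers scale-out.
print(would_flap(50, 40, count=3))
```

The smaller your instance count, the bigger the jump each removal causes, so small scale sets need a wider gap between the two thresholds.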
How do I upgrade all instances to the latest model without downtime?
The safest approach for zero-downtime upgrades is to switch your upgrade policy to Rolling mode with health gates. Configure maxBatchInstancePercent to 20% and maxUnhealthyInstancePercent to 20%; this upgrades up to 20% of instances at a time and pauses if more than 20% become unhealthy. Set your VMSS to Rolling mode via Upgrade policy in the portal, then enable the Application Health extension so Azure can gate upgrades on actual app health rather than just VM health. Once configured, updating the scale set model (changing the image version, for example) will automatically trigger a rolling upgrade across all instances while maintaining capacity.
Can I add instances to an existing VMSS without changing the instance count, like adding a specific VM?
In their default (Uniform) form, Azure Virtual Machine Scale Sets don't support adding individually managed VMs; that's fundamentally different from how scale sets work. All instances in a VMSS are created from the same model (image, SKU, configuration) and are interchangeable. If you need to run a specialized VM alongside your scale set, the right pattern is to deploy it as a standalone VM in the same subnet and register it manually with the same load balancer backend pool. Alternatively, if you need instance-level customization, look at Flexible orchestration mode VMSS, which allows attaching pre-existing VMs as members of the scale group.
Why does my VMSS keep getting error code "BadRequest" when I try to scale out?
A BadRequest error during scale-out usually means the scale set configuration has become invalid, for example, referencing a subnet that no longer exists, an image version that was deleted, or a key vault secret that expired. Pull the full error JSON from the Activity Log (as described in The Quick Fix section above) and look for the message field which tells you specifically what parameter Azure rejected. The most common culprits are subnet IDs that changed after a VNet migration, Shared Image Gallery image versions that were cleaned up by a retention policy, or user-assigned managed identity IDs that were deleted. Fix the broken reference in your scale set configuration by editing the relevant property, save the model, and retry the scale operation.
How do I stop Azure from auto-repairing my VMSS instances when they fail health checks?
The automatic repairs feature is configured under Health and repair in your scale set settings. To disable it, go to your VMSS in the portal > Health and repair > toggle Enable automatic repairs to Off and save. Via CLI: az vmss update --resource-group YourRG --name YourVMSS --enable-automatic-repairs false. Be aware that disabling automatic repairs means failed instances will stay failed and won't be replaced automatically; you'll need to handle instance replacement manually or through your own health management logic. I'd recommend disabling it only temporarily during debugging, then re-enabling it once you've fixed the underlying health probe issue. Automatic repair is a safety net you generally want active in production.