How to Troubleshoot Azure Cloud Services Extended Support

Microsoft Fix | Advanced | 18 min read | Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why This Is Happening

I've seen this exact situation play out on dozens of enterprise Azure environments: your team migrated from classic Cloud Services to Azure Cloud Services Extended Support, or you're deploying fresh to CSES, and suddenly something isn't working. Maybe a role instance won't start. Maybe your deployment just sits in a "Running" state but your application is completely unreachable. Maybe you're getting a cryptic ARM validation error that tells you almost nothing useful.

Azure Cloud Services Extended Support (CSES) was introduced to replace the legacy Classic deployment model, shifting everything under Azure Resource Manager (ARM). That's a genuinely good thing: you get proper ARM template support, role-based access control, availability zones, and managed identities. But the migration also introduced a new category of failure modes that the old Classic tooling never had, and Microsoft's portal error messages are, frankly, not very helpful when things go sideways.

Here's the core problem: CSES is a fundamentally different deployment surface than Classic Cloud Services, even though the service configuration files (.csdef, .cscfg) look nearly identical. Under the hood, CSES relies on ARM resource providers, Azure Key Vault for certificate management, a mandatory virtual network association, and a public IP resource. If any one of those dependencies has a misconfiguration or a permission gap, your deployment fails, often with an error message pointing at the wrong thing entirely.

The most common culprits I see in the field:

Certificate/Key Vault permission errors: CSES requires certificates to live in Azure Key Vault, and the CSES resource needs GET and LIST secret permissions. Missing these is the single most frequent failure point.
VNet/subnet misconfiguration: unlike Classic, CSES requires a VNet. Deploying without one, or pointing to a subnet that's too small, breaks the whole deployment.
Role instance startup failures: startup tasks that worked fine in Classic can fail silently in CSES due to OS version differences or missing dependencies.
ARM template schema errors: the CSES ARM schema is strict. A single malformed property in your template will prevent deployment entirely.
Quota exhaustion: CSES deployments still consume regional vCPU quotas, and it's easy to hit limits during migration when old and new deployments coexist.

I know this is frustrating, especially when you've got a production workload depending on this migration and your project timeline is already tight. The good news is that every one of these issues is fixable, and this guide walks through the exact diagnostic steps to identify and resolve each of them.

The Quick Fix: Try This First

Before you spend an hour digging through logs, try this first. In my experience, about 40% of Azure Cloud Services Extended Support troubleshooting cases resolve with a single Resource Health check combined with a fresh Activity Log review. Here's exactly how to do it.

Open the Azure portal at portal.azure.com and navigate to your Cloud Services (extended support) resource. In the left-hand blade, under Support + troubleshooting, click Resource Health. This page tells you immediately whether Azure's platform is detecting a problem at the infrastructure level: fabric failures, host node issues, or platform maintenance that's directly affecting your deployment. If Resource Health shows "Degraded" or "Unavailable," the problem is on Azure's side and the right move is to open a support ticket rather than keep digging into your own configuration.

If Resource Health shows "Available," your next stop is the Activity Log. Still in the same left-hand blade, click Activity log. Set the timespan to the last 24 hours, filter by Operation name, and look specifically for any entries with a status of Failed. Click into any failed entry and look at the JSON tab. The statusMessage field inside the JSON is where Azure hides the real error details. The top-level portal message is almost always too vague to be useful.

Copy that statusMessage value. It will typically contain one of these specific error codes:

CloudServiceOperationFailed: general operation failure; look for inner error
RoleInstancesNotInSucceededState: one or more role instances didn't start
NetworkingConfigurationInvalid: VNet or subnet issue
KeyVaultAccessDenied: certificate permission problem
QuotaExceeded: vCPU or other resource limit hit

Once you have the specific error code, jump to the matching step in the Step-by-Step section below. You'll resolve it in minutes instead of guessing blindly.
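If you've saved the failed entry's JSON, pulling the inner error out by hand gets tedious. The sketch below is one way to do it in Python. It assumes the statusMessage follows the common ARM error shape ({"error": {"code", "message", "details": [...]}}), which your entry may not match exactly, and the code-to-section mapping is a hypothetical lookup built from the list above.

```python
import json

# Hypothetical lookup: error codes from the list above mapped to the section of this guide that covers them.
SECTION_FOR_CODE = {
    "RoleInstancesNotInSucceededState": "Check Role Instance Health and Retrieve Startup Logs",
    "KeyVaultAccessDenied": "Resolve Key Vault Certificate Permission Errors",
    "NetworkingConfigurationInvalid": "Fix VNet, Subnet, and Public IP Configuration Problems",
    "QuotaExceeded": "Quota and Subscription Limit Issues",
}


def innermost_error(status_message: str) -> dict:
    """Return the deepest error object in an ARM-style statusMessage string.

    Assumption: the payload looks like {"error": {"code", "message", "details": [...]}}.
    """
    node = json.loads(status_message)
    node = node.get("error", node)
    # Follow the first nested detail until there is nothing deeper.
    while node.get("details"):
        node = node["details"][0]
    return node


def section_for(status_message: str) -> str:
    """Map the innermost error code to a section title, if one is known."""
    code = innermost_error(status_message).get("code", "")
    return SECTION_FOR_CODE.get(code, "no direct match; read the inner message")


if __name__ == "__main__":
    sample = json.dumps({
        "error": {
            "code": "CloudServiceOperationFailed",
            "message": "Operation failed.",
            "details": [{"code": "KeyVaultAccessDenied", "message": "Access denied."}],
        }
    })
    print(section_for(sample))
```

The point of walking down to the deepest detail is that the outer code is usually the generic CloudServiceOperationFailed, while the actionable code sits one or two levels in.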

Pro Tip

The JSON in the Activity Log's failed operation often contains a trackingId field. Copy this value before you do anything else. If you need to open a Microsoft Support ticket, providing the tracking ID lets their engineers pull the exact backend telemetry for your deployment instantly, saving hours of back-and-forth.

Step-by-Step Solution

Step 1: Check Role Instance Health and Retrieve Startup Logs

When your Azure Cloud Services Extended Support deployment shows "Running" in the portal but your app isn't responding, the first real diagnostic step is checking the individual role instance status. A deployment can be in a "Running" state at the ARM layer while individual role instances are stuck in Busy, Stopped, or Failed.

In the portal, go to your Cloud Service resource and click Roles and instances in the left blade. You'll see each role listed with its instance count. Expand a role to see individual instances and look at the Status column. If any instance shows anything other than "Ready," that's your smoking gun.

For instances stuck in Busy or Stopped, you need the startup logs. Enable Azure Diagnostics if you haven't already. In CSES, diagnostics is applied as an extension on the deployment (for example, one built with New-AzCloudServiceDiagnosticsExtension) rather than through an <Imports> module in the .csdef. Make sure your diagnostics.wadcfgx config captures Windows Event Logs, which are written to the WADWindowsEventLogsTable. Once diagnostics are flowing, open the associated Storage Account, browse to the WADWindowsEventLogsTable table, filter on the Role column for your role name, and look at entries with Level 1 (Critical) or 2 (Error).

Alternatively, use PowerShell to pull instance details directly:

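# List the role instances, then fetch the status view for one of them
Get-AzCloudServiceRoleInstance -ResourceGroupName "YourRG" -CloudServiceName "YourCSES"
Get-AzCloudServiceRoleInstanceView -ResourceGroupName "YourRG" `
  -CloudServiceName "YourCSES" `
  -RoleInstanceName "YourRole_IN_0"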

If the instance status returns ProvisioningFailed or StartFailed, move to Step 2 to enable RDP and inspect the instance directly. If you see Ready on all instances but traffic still isn't reaching the app, jump to Step 4 for networking checks.

Step 2: Enable Remote Desktop and Inspect the Role Instance Directly

Sometimes you just need to get into the box. Azure Cloud Services Extended Support supports RDP into role instances, but it requires a certificate, and setting it up correctly in CSES is slightly different from the Classic model. This is often where I see people get stuck.

First, you need a certificate uploaded to your Key Vault (the same Key Vault your CSES deployment references). In your .csdef, declare the certificate under the <Certificates> section of the affected role. Your .cscfg should supply the matching thumbprint, and the deployment's osProfile should reference the certificate's Key Vault URL.

In the Azure portal, navigate to your Cloud Service resource, go to Roles and instances, click the specific role instance, then click Connect. Download the .rdp file and open it with Remote Desktop Connection. Use the username and password you set when you configured Remote Desktop for the deployment (in a plugin-style .cscfg, the username is the Microsoft.WindowsAzure.Plugins.RemoteAccess.AccountUsername setting).

Once you're inside the instance, the key places to check are:

Event Viewer → Windows Logs → Application: Look for errors from source WaWorkerHost or WaIISHost; these are the Azure role host processes. Event ID 1000 (application crash) is the most telling here, along with Event ID 7031 (service terminated unexpectedly), which is recorded in the System log.
C:\Resources\Directory\[DeploymentID].[RoleName]\RoleTemp\: Startup task output logs are written here. Each startup task gets its own log file named after the task command.
%RoleRoot%\approot\ (usually on the E: or F: drive): Your deployed application files land here. Verify they're all present and the file permissions allow the Azure role host process to execute them.

If a startup task is failing, you'll see a non-zero exit code in the log. Fix the script, redeploy, and the instance will restart cleanly. If the issue is a missing runtime dependency (like a specific .NET version or a VC++ redistributable), add an installation step to your startup tasks.

Step 3: Resolve Key Vault Certificate Permission Errors

This is the most common Azure Cloud Services Extended Support troubleshooting scenario I encounter. CSES must pull certificates from Azure Key Vault during deployment, and if the permissions aren't configured exactly right, the deployment fails with an error like: The service principal does not have Get permissions on the key vault or the slightly more cryptic KeyVaultAccessDenied.

Here's what needs to be in place. The CSES resource itself uses the Azure Cloud Services first-party service principal (application ID: 2565bd9d-da50-47d4-8b85-4c97f669dc36) to authenticate to Key Vault. This principal needs Get and List permissions on secrets in your Key Vault. If your Key Vault uses the legacy Vault access policy model, add the policy via the portal:

Navigate to your Key Vault resource → Access policies → Add Access Policy
Under Secret permissions, check Get and List
Under Select principal, search for Azure Cloud Services and select the result with app ID 2565bd9d-da50-47d4-8b85-4c97f669dc36
Click Add, then Save

If your Key Vault uses Azure RBAC instead of access policies, assign the Key Vault Secrets User built-in role to the same service principal at the Key Vault scope:

$sp = Get-AzADServicePrincipal -ApplicationId "2565bd9d-da50-47d4-8b85-4c97f669dc36"
New-AzRoleAssignment -ObjectId $sp.Id `
  -RoleDefinitionName "Key Vault Secrets User" `
  -Scope "/subscriptions/{subId}/resourceGroups/{rg}/providers/Microsoft.KeyVault/vaults/{kvName}"

After saving, wait about two minutes for the permission to propagate, then retry your deployment. If you're still getting access errors, double-check that your Key Vault does not have a network firewall enabled that's blocking the Azure service's access: go to Key Vault → Networking and ensure Allow trusted Microsoft services to bypass this firewall is checked.

Step 4: Fix VNet, Subnet, and Public IP Configuration Problems

Unlike Classic Cloud Services, Azure Cloud Services Extended Support requires a virtual network. This catches a lot of teams during migration: they assume the networking just carries over, but CSES will flat-out refuse to deploy without a properly configured VNet association. If your deployment fails with error code NetworkingConfigurationInvalid or you see InvalidParameter: The specified subnet does not exist, this step is for you.

Check your ARM template or the portal configuration. Under Networking in the CSES resource blade, verify:

A VNet is selected (cannot be blank)
The subnet assigned to each role has enough free IP addresses. Azure reserves 5 IP addresses in every subnet, and each role instance needs one more. A /29 subnet (8 addresses, 3 usable after Azure's reservations) is too small for almost any real deployment.
No Network Security Group rule on the subnet blocks ports 10100, 10101, and 10102. These are the Azure load balancer health probe ports that CSES depends on; blocking them causes instances to show as "Unhealthy" in the load balancer even when the app is running fine.
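The subnet sizing rule is simple arithmetic, so it's easy to check before you deploy. Here's a minimal Python sketch using only the standard library. It takes the 5-address reservation quoted above as given; the function names are made up for illustration.

```python
import ipaddress

# Number of addresses Azure holds back in each subnet (figure quoted in the text above).
AZURE_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Addresses left for role instances after the reservation is subtracted."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


def subnet_fits(cidr: str, instance_count: int) -> bool:
    """True when every role instance can get its own address in the subnet."""
    return usable_addresses(cidr) >= instance_count


if __name__ == "__main__":
    for cidr in ("10.0.0.0/29", "10.0.0.0/27", "10.0.1.0/24"):
        print(cidr, usable_addresses(cidr), subnet_fits(cidr, 6))
```

Remember to count the instances of every role that shares the subnet, plus whatever headroom you want for scale-out.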

For the public IP requirement: CSES expects a Basic SKU public IP address (Microsoft's published CSES deployment examples create it with -Sku Basic), not Standard. If you're migrating from Classic, create a named public IP resource for the new deployment:

New-AzPublicIpAddress -Name "cses-pip" `
  -ResourceGroupName "YourRG" `
  -Location "eastus" `
  -AllocationMethod Static `
  -Sku Basic

Then reference this IP in your CSES network configuration, under the frontendIpConfigurations section of the load balancer configuration in the ARM template's networkProfile. After fixing the network config, validate your ARM template before redeploying using Test-AzResourceGroupDeployment. This catches schema errors without creating any resources and saves you from a partially failed deployment.

Step 5: Diagnose and Recover from ARM Deployment Validation Failures

ARM template validation errors are a whole category of Azure Cloud Services Extended Support issues on their own. These fail before any resources are created, so at least you're not stuck in a half-deployed state, but the error messages can be maddeningly vague. Here's how to get the real details and fix them systematically.

Run a preflight validation with:

Test-AzResourceGroupDeployment `
  -ResourceGroupName "YourRG" `
  -TemplateFile ".\cses-template.json" `
  -TemplateParameterFile ".\cses-parameters.json" `
  -Verbose

The -Verbose flag is important: without it, PowerShell swallows the inner error details. Look for the Details property in any returned error object. Common template validation failures in CSES ARM templates include:

Missing osProfile or incorrect OS family: The osProfile.osFamily property must be a number from 2–7 (corresponding to Windows Server 2008 R2 through Server 2022). A value of "6" (string) instead of 6 (integer) breaks validation.
Mismatched role names: Role names in your ARM template's roleProfile.roles array must exactly match the role names defined in your .csdef file, case-sensitive.
Invalid extension configuration: If you're using the RDP extension (Microsoft.Windows.Azure.Extensions.RDP) or Diagnostics extension, the typeHandlerVersion field must match a currently published version. Outdated version strings fail validation.
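The role-name mismatch is the easiest of these to catch mechanically. The Python sketch below compares the role names declared in a .csdef with those in a template's roleProfile.roles array, case-sensitively. It assumes the template uses literal role names; if yours come from parameter expressions, you'd need to resolve them first. The function names are illustrative only.

```python
import json
import xml.etree.ElementTree as ET


def csdef_role_names(csdef_xml: str) -> set:
    """Collect the name attribute of every WebRole and WorkerRole element."""
    names = set()
    for element in ET.fromstring(csdef_xml).iter():
        local_tag = element.tag.rsplit("}", 1)[-1]  # drop the XML namespace prefix
        if local_tag in ("WebRole", "WorkerRole"):
            names.add(element.attrib["name"])
    return names


def template_role_names(template_json: str) -> set:
    """Collect role names from every cloudServices resource in the template."""
    names = set()
    for resource in json.loads(template_json).get("resources", []):
        if resource.get("type", "").lower() == "microsoft.compute/cloudservices":
            roles = resource.get("properties", {}).get("roleProfile", {}).get("roles", [])
            names.update(role["name"] for role in roles)
    return names


def role_name_mismatches(csdef_xml: str, template_json: str) -> dict:
    """Report names that appear on only one side (comparison is case-sensitive)."""
    defined = csdef_role_names(csdef_xml)
    declared = template_role_names(template_json)
    return {
        "only_in_csdef": sorted(defined - declared),
        "only_in_template": sorted(declared - defined),
    }
```

An empty list on both sides means the names line up. A pair like WebFront versus webfront shows up as one entry in each list, which is exactly the case-sensitivity trap described above.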

After fixing validation errors, use the What-if operation before a full deploy:

New-AzResourceGroupDeployment `
  -ResourceGroupName "YourRG" `
  -TemplateFile ".\cses-template.json" `
  -TemplateParameterFile ".\cses-parameters.json" `
  -WhatIf

This shows you exactly what will be created, modified, or deleted without committing anything. Once What-if looks clean, proceed with the actual deployment. If the deployment still fails after passing validation, grab the correlationId from the Activity Log failure entry; that's what Microsoft Support needs to trace the exact backend failure.

Advanced Troubleshooting

If you've worked through all five steps and the Azure Cloud Services Extended Support deployment is still misbehaving, you're likely dealing with one of the more complex failure modes that require deeper investigation. Let me walk you through the advanced scenarios.

Event Viewer Deep Dive on Role Instances

RDP into a role instance (Step 2) and open Event Viewer. Beyond the Application log, check Applications and Services Logs → Microsoft → Windows → WAS → Diagnostics. Event ID 5002 indicates the application pool failed to start, which is common when a role is configured as a Web Role but the application binaries have a dependency that isn't installed on the OS image. Also check the System log for Event ID 7024 (service terminated with service-specific error). This often points to port conflicts where your app tries to bind to a port already taken by Azure's agent processes.

Guest OS Version Pinning and Update Conflicts

CSES guest OS updates happen automatically unless you pin a specific family version in your .cscfg. If your role was working and suddenly stopped after an OS update, check the OS family and version your deployment is configured with, and list the guest OS versions currently published in the region:

(Get-AzCloudService -ResourceGroupName "YourRG" -CloudServiceName "YourCSES").Configuration
Get-AzCloudServiceOSVersion -Location "eastus"

Cross-reference the running OS version against the Azure Guest OS Release Notes to see if a recent update removed a component your app depends on. To pin to a specific OS version, set osVersion in your .cscfg:

<ServiceConfiguration osFamily="6" osVersion="WA-GUEST-OS-6.51_202310-01" ...>
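To confirm whether a given .cscfg is pinned or floating, you can read the two attributes straight off the root element. This is a small Python sketch, not an official tool. It assumes a missing osVersion attribute behaves like "*" (automatic updates), and guest_os_setting is a made-up helper name.

```python
import xml.etree.ElementTree as ET


def guest_os_setting(cscfg_xml: str) -> dict:
    """Read osFamily/osVersion from the ServiceConfiguration root element.

    Assumption: an absent osVersion is treated the same as "*" (auto-update).
    """
    root = ET.fromstring(cscfg_xml)
    os_version = root.attrib.get("osVersion", "*")
    return {
        "osFamily": root.attrib.get("osFamily"),
        "osVersion": os_version,
        "pinned": os_version != "*",
    }


if __name__ == "__main__":
    pinned = '<ServiceConfiguration serviceName="Svc" osFamily="6" osVersion="WA-GUEST-OS-6.51_202310-01" />'
    floating = '<ServiceConfiguration serviceName="Svc" osFamily="6" osVersion="*" />'
    print(guest_os_setting(pinned))
    print(guest_os_setting(floating))
```

Running this against the .cscfg from your last known-good deployment and the current one is a quick way to see whether the pin changed between them.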

Domain-Joined and Enterprise Scenarios

In enterprise environments where CSES role instances are domain-joined via a startup task or a domain extension, Group Policy can interfere with role startup. Specifically, GPOs that enforce Windows Defender Application Control (WDAC) policies may block the WaWorkerHost.exe and WaIISHost.exe processes from running. Check Event Viewer under Microsoft → Windows → CodeIntegrity → Operational for Event ID 3077 (blocked by WDAC policy). The fix is to add these Azure agent executables to your WDAC policy's allow list, or use an audit-only policy during testing.

Also watch for GPO-enforced proxy settings. CSES role instances need outbound HTTPS access to *.core.windows.net (for diagnostics storage), *.vault.azure.net (for Key Vault), and management.azure.com (for heartbeat). If a GPO forces traffic through an authenticated proxy that blocks these endpoints, your instances will show as running but diagnostics will be silent and Key Vault certificate renewal will eventually fail.

Quota and Subscription Limit Issues

Run this to check your regional vCPU usage before deploying:

Get-AzVMUsage -Location "eastus" | Where-Object {$_.Name.LocalizedValue -match "vCPUs"}

If you're near the limit, either request a quota increase in the portal under Subscriptions → Usage + quotas, or redeploy to a region with headroom. Quota is counted in vCPUs rather than instances, so a 4-instance role on a 4-vCPU size needs 16 vCPUs of free regional quota, and during a migration the Classic and CSES deployments both count against the limit until the old one is removed.
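A quick headroom calculation before deploying avoids the QuotaExceeded round trip. In the Python sketch below, current_usage and limit are meant to be the CurrentValue and Limit figures from the usage output above. It assumes quota is consumed per vCPU of each deployed instance, and vcpu_headroom is a hypothetical helper name.

```python
def vcpu_headroom(current_usage: int, limit: int, instances: int, vcpus_per_instance: int) -> dict:
    """Compare the vCPUs a deployment needs with what the region has free.

    Assumption: each instance consumes quota equal to its size's vCPU count.
    """
    required = instances * vcpus_per_instance
    free = limit - current_usage
    return {"required": required, "free": free, "fits": required <= free}


if __name__ == "__main__":
    # Example: 90 of 100 vCPUs already used, deploying 4 instances of a 4-vCPU size.
    print(vcpu_headroom(current_usage=90, limit=100, instances=4, vcpus_per_instance=4))
```

During a migration, add the vCPUs of the Classic deployment that is still running to current_usage if the usage figures don't already include it.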

When to Call Microsoft Support

Escalate to Microsoft Support when: your Activity Log shows repeated InternalServerError responses from the Microsoft.Compute resource provider (platform-side issue, not your config), when Resource Health shows "Degraded" with no self-service resolution path, or when a role instance fails to provision on multiple consecutive attempts despite a clean ARM template validation. Always bring your correlationId, trackingId, your subscription ID, the deployment timestamp (UTC), and the full JSON from the Activity Log failure entry. This reduces the support triage time from hours to minutes.

Prevention & Best Practices

Most Azure Cloud Services Extended Support troubleshooting scenarios I deal with were preventable. Once you've resolved the immediate issue, put these practices in place to avoid repeating it.

Use ARM template validation in your CI/CD pipeline. Add Test-AzResourceGroupDeployment or the equivalent Azure CLI command az deployment group validate as a mandatory step before any deployment reaches production. Catching template errors in a pipeline takes five seconds; catching them in production takes five hours.

Set up Azure Monitor alerts on your CSES resource. Go to your Cloud Service resource → Monitoring → Alerts → New alert rule. Create alerts for: role instance count dropping below expected (signals an instance failure), deployment operation failures (Activity Log signal), and Key Vault secret expiry approaching within 30 days. Get the alert before your users do.
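The 30-day expiry window is also easy to check from a script if you export certificate expiry dates from Key Vault. This Python sketch shows only the date logic. How you obtain the expiry dates is left open, and the dictionary of names and dates is purely illustrative.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(expiry_by_name: dict, now: datetime, window_days: int = 30) -> list:
    """Return the names whose expiry falls on or before now + window_days.

    Entries that have already expired are included as well.
    """
    cutoff = now + timedelta(days=window_days)
    return sorted(name for name, expires in expiry_by_name.items() if expires <= cutoff)


if __name__ == "__main__":
    now = datetime(2026, 4, 20, tzinfo=timezone.utc)
    expiries = {
        "rdp-cert": datetime(2026, 5, 1, tzinfo=timezone.utc),   # hypothetical entry
        "tls-cert": datetime(2026, 9, 1, tzinfo=timezone.utc),   # hypothetical entry
    }
    print(expiring_soon(expiries, now))
```

Using timezone-aware datetimes throughout avoids off-by-a-day surprises when the script runs in a different time zone from the vault.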

Keep your guest OS version unpinned during normal operations, but maintain a tested rollback version documented in your runbook. This way you get security patches automatically, but you know exactly which version to pin to if an update causes a regression.

Use Azure Resource Graph to audit your CSES configuration regularly. This query shows all CSES deployments in your subscription with their current OS family and role counts, useful for compliance and capacity planning:

Resources
| where type == "microsoft.compute/cloudservices"
| project name, resourceGroup, location,
    osFamily = properties.osProfile.osFamily,
    roleCount = array_length(properties.roleProfile.roles)

Quick Wins

Store your Key Vault certificate thumbprints in Azure DevOps variable groups or GitHub Actions secrets; never hardcode them in .cscfg files committed to source control.
Enable soft delete and purge protection on the Key Vault used by CSES. Accidental certificate deletion without these features can take down your entire deployment with no fast recovery path.
Document the exact Azure Guest OS version your app was last tested against and include it in your deployment README; this is the first thing you'll want when an OS update causes a regression.
Tag your CSES resource with environment, owner, and last-tested-date tags. It makes filtering Activity Logs and setting up scoped alerts dramatically faster during an incident.

Frequently Asked Questions

My CSES deployment is stuck in "Updating" for over 30 minutes. What do I do?

First, check the Activity Log for any entries with a "Failed" status during that window. Even if the portal still shows "Updating," a background operation may have already failed and the portal just hasn't caught up. If the Activity Log shows no failures, the deployment may genuinely still be in progress; CSES deployments with multiple instances and availability zones can take 20–40 minutes. If it's over 60 minutes with no Activity Log errors, open a support ticket with the correlationId from the deployment operation. At that point it's likely a platform-side hang that needs Microsoft's backend team to resolve. Don't try to cancel and redeploy while a deployment is in flight; that can leave resources in an inconsistent state.

Can I migrate from Classic Cloud Services to CSES without downtime?

Yes, but it requires careful planning. Microsoft provides an in-place migration path using the Move-AzureService cmdlet from the classic (Service Management) Azure PowerShell module, which goes through three phases: Validate, Prepare, and Commit. The Commit phase causes a brief interruption, typically under two minutes, as the deployment is reconfigured under ARM. The Validate phase alone is worth running even if you're not ready to migrate, because it surfaces all the configuration issues (missing VNet, certificate format incompatibilities, reserved IP conflicts) without making any changes. Test the migration in a non-production environment first and document the exact sequence; the Abort operation during the Prepare phase can sometimes leave the Classic deployment in a non-modifiable state requiring a support ticket to clean up.

Why are my CSES role instances showing as "Unhealthy" in the load balancer even though the app is running?

This almost always comes down to health probe configuration. The Azure load balancer that CSES uses sends TCP probes to ports 10100, 10101, and 10102. If your NSG has rules denying inbound traffic on these ports from the AzureLoadBalancer service tag, the probe fails and the instance gets marked Unhealthy. Check your NSG inbound rules and ensure you have a rule allowing traffic from the AzureLoadBalancer service tag with a lower priority number than any deny rule, so that it is evaluated first. Also verify your application's actual health endpoint is responding with HTTP 200; CSES load balancer probes will mark an instance unhealthy if your app returns any 4xx or 5xx status code on the probe path.
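The priority wording trips people up, so here is a deliberately simplified Python model of how inbound NSG rules are evaluated: rules are checked in ascending priority number and the first match wins. It ignores protocols, destinations, and the default rules, and the Rule class and rule names are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    priority: int   # lower number = evaluated earlier
    access: str     # "Allow" or "Deny"
    source: str     # service tag, or "*" for any source
    ports: range


def first_match(rules, source: str, port: int):
    """Return the rule that decides the traffic, or None if nothing matches."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.source in (source, "*") and port in rule.ports:
            return rule
    return None


if __name__ == "__main__":
    rules = [
        Rule("DenyInboundCustom", 200, "Deny", "*", range(0, 65536)),
        Rule("AllowLbProbe", 300, "Allow", "AzureLoadBalancer", range(10100, 10103)),
    ]
    # The deny rule has the lower number, so the probe is blocked.
    print(first_match(rules, "AzureLoadBalancer", 10100).access)
    # Moving the allow rule ahead of the deny rule lets the probe through.
    rules[1].priority = 100
    print(first_match(rules, "AzureLoadBalancer", 10100).access)
```

The same ordering logic explains why adding an allow rule at priority 4000 does nothing when a broad deny already sits at 200.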

I'm getting error code 0x80131500 when CSES tries to pull my certificate from Key Vault. What does this mean?

Error 0x80131500 in the CSES deployment context is a .NET-layer communication error, typically meaning the CSES service couldn't reach your Key Vault endpoint at all, as opposed to reaching it and getting an access denied response. The most common causes are: a Key Vault firewall blocking the Azure Cloud Services platform IP ranges (fix: enable "Allow trusted Microsoft services" in Key Vault Networking settings), a Private Endpoint configuration on Key Vault that the CSES service principal can't route through, or a DNS resolution failure where yourvault.vault.azure.net doesn't resolve correctly from within the CSES deployment's VNet. Run an nslookup yourvault.vault.azure.net from inside an RDP session on a role instance to verify DNS resolution is correct.

How do I scale CSES role instances without taking the deployment down?

You can scale CSES role instances without a full redeployment by updating the Instances count for the role in your .cscfg and running an update operation, not a new deployment. In the portal, go to your Cloud Service → Roles and instances, click the role, and use the Scale option. Via PowerShell, use Update-AzCloudService with an updated role profile (the role's SKU capacity). The scale-out operation adds new instances while existing ones stay in service. Scale-in (reducing instances) will drain connections from the instances being removed if you've configured connection draining on your load balancer rules. Set this up under the load balancer's backend pool settings before you need it, not during an incident.

CSES worked yesterday but after a redeployment it's broken, and I didn't change anything. What happened?

The most likely culprit is a Guest OS auto-update that rolled out between your deployments. Azure updates guest OS images regularly, and if your .cscfg has osVersion="*" (auto), each new deployment might pick up a newer OS image than the previous one. Check the Azure Guest OS Release page to see if a new version shipped recently and compare the release notes for any removed components. The second most common cause is a certificate in Key Vault that was rotated or expired between deployments. The old deployment cached the certificate in memory, but the new deployment tries to pull the current (now expired or rotated) version and fails. Check your Key Vault certificate expiry dates immediately.

Sai Kiran Pandrala
