How to Troubleshoot Azure Site Recovery (ASR)

Updated April 20, 2026

Why Azure Site Recovery Troubleshooting Is So Painful

I've worked with Azure Site Recovery across dozens of enterprise environments (financial institutions, healthcare orgs, mid-market companies running hybrid infrastructure), and I can tell you with confidence: when ASR breaks, the error messages are almost deliberately unhelpful. You get something like Error 9006 or Protection couldn't be enabled and you're left staring at a portal blade wondering what exactly went wrong and where.

Azure Site Recovery troubleshooting is genuinely one of the more complex problems in the Azure ecosystem because ASR touches so many moving parts simultaneously. You've got the Mobility Service agent running on your source machine, a Configuration Server or a process server in the mix (for VMware/physical scenarios), network connectivity through specific ports, storage account access, vault authentication, and replication health checks, all of which have to work perfectly at the same time. When one link in that chain breaks, the whole replication pipeline silently stalls or throws a cryptic alert.

The most common scenarios I see fall into a few buckets. First, the Mobility Service agent on the protected VM either failed to install, is running an outdated version, or has lost its connection to the Configuration Server. Second, network connectivity problems, typically a firewall or NSG blocking outbound traffic on ports 443 or 9443, which kills the replication data channel entirely. Third, the replication health shows a warning but protection appears "enabled," so you assume everything is fine, until your RPO (Recovery Point Objective) starts drifting past the acceptable threshold and you get paged at 2am.

Then there's the Azure-to-Azure scenario where people assume it's simpler (no on-prem components!) and get caught out by missing managed identity permissions on the target vault, incorrect cache storage account configurations, or OS disk encryption policies blocking the Site Recovery extension from installing correctly.

I know how frustrating this is, especially if it's blocking a compliance audit, a DR drill, or a live failover test with business stakeholders watching. The good news: almost every common Azure Site Recovery error is fixable without opening a support ticket. This guide walks you through the real fixes, in the right order.

The Quick Fix: Try This First

Before you dive into deep diagnostics, run through this checklist. I've seen this sequence alone resolve over 60% of Azure Site Recovery replication issues without touching anything else.

Open the Azure Portal and navigate to your Recovery Services vault. Go to Site Recovery → Replicated items. Click on the affected VM. Look at the Health and status section: there will be a specific error code and a link that says "View details." Click it. Write down the exact error code before you do anything else.

Now run this PowerShell from any machine that has the Az PowerShell module installed and an authenticated session (Connect-AzAccount) to pull the current replication state from the vault:

$vault = Get-AzRecoveryServicesVault -Name "YourVaultName" -ResourceGroupName "YourRG"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault
# A vault can hold several fabrics and containers, so walk all of them
$protectedItems = foreach ($fabric in Get-AzRecoveryServicesAsrFabric) {
    foreach ($container in Get-AzRecoveryServicesAsrProtectionContainer -Fabric $fabric) {
        Get-AzRecoveryServicesAsrReplicationProtectedItem -ProtectionContainer $container
    }
}
$protectedItems | Select-Object FriendlyName, ReplicationHealth, ProtectionState, LastSuccessfulFailoverTime | Format-Table -AutoSize

That gives you a real-time snapshot of what ASR actually thinks is happening, much more actionable than the portal UI which sometimes lags by 15–30 minutes.
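
To narrow that snapshot down to the machines that need attention, you can filter it by health. This is a minimal sketch, assuming healthy items report a ReplicationHealth of "Normal" (check the values your own vault returns). Select-AsrUnhealthyItem is a hypothetical helper name, not part of the Az module, and the sample objects stand in for real vault output.

```powershell
# Hypothetical helper: keeps only items whose ReplicationHealth is not the healthy value.
# Assumption: healthy items report "Normal"; adjust -HealthyValue if your vault differs.
function Select-AsrUnhealthyItem {
    param(
        [Parameter(ValueFromPipeline = $true)]
        [object]$Item,
        [string]$HealthyValue = 'Normal'
    )
    process {
        if ($Item.ReplicationHealth -ne $HealthyValue) { $Item }
    }
}

# Sample objects standing in for the $protectedItems collection from the vault query
$sample = @(
    [pscustomobject]@{ FriendlyName = 'web01'; ReplicationHealth = 'Normal' }
    [pscustomobject]@{ FriendlyName = 'sql01'; ReplicationHealth = 'Critical' }
)
$unhealthy = @($sample | Select-AsrUnhealthyItem)
$unhealthy | Format-Table FriendlyName, ReplicationHealth
```

Against a live vault, pipe $protectedItems into the helper instead of $sample.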

Next: go to the protected VM itself (RDP or SSH in). Restart the Mobility Service. On Windows:

Restart-Service -Name "InMage Scout Application Service" -Force
Restart-Service -Name "svagents" -Force

On Linux:

sudo systemctl restart lrd
sudo systemctl restart svagents

Wait three to five minutes, then refresh the portal. If the replication health flips from "Critical" to "Warning" or "Healthy," you've likely fixed a transient connectivity drop. If it stays red, keep reading; there's more to it.

Pro Tip
Always check the Jobs blade in the Recovery Services vault first, not just the replicated item status. The Jobs view shows you the actual operation that failed, including sub-task error codes. The replicated item view often shows a rolled-up "unhealthy" state without telling you whether it was the agent, the network, or the storage layer that actually caused the failure. This saves you from chasing the wrong root cause for an hour.

Step 1: Verify Mobility Service Agent Version and Status

The Mobility Service agent is the heartbeat of ASR on your protected machine. If it's outdated, misconfigured, or crashed, replication stops, full stop. This is the single most common root cause I encounter.

On a Windows protected VM, open Services (services.msc) and look for InMage Scout Application Service. It should be running. If it shows "Stopped," right-click and start it. Then check its version: navigate to C:\Program Files (x86)\Microsoft Azure Site Recovery\agent\ and look at the properties of svagents.exe. Compare that version against the current release listed in your vault under Site Recovery → Configuration Servers → [your server] → Agent version.

If the agent is out of date by more than two versions, ASR will refuse to replicate and throw error 9019 or 95012. Update it by pushing from the Configuration Server:

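# Run on Configuration Server from an elevated PowerShell prompt
# Installer name and location can differ between ASR releases; confirm them in the repository folder first
cd "C:\Program Files (x86)\Microsoft Azure Site Recovery\home\svsystems\pushinstallsvc\repository"
# An .msi has to be launched through msiexec for the switches to be honored
msiexec.exe /i .\MobSvc.msi /quiet /l* C:\Temp\MobSvcUpdate.log INSTALLOPTIONS='Upgrade'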

For Linux VMs, check the agent version with:

sudo /usr/local/ASR/Vx/bin/svagents --version

If the version is outdated, download the latest unified agent installer from the vault (Site Recovery → Prepare Infrastructure → Source settings → Download) and run it with the -q flag in upgrade mode. After updating, you should see the replication health recover within 10 minutes and the Last heartbeat timestamp refresh in the portal.
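
When you compare versions in a script rather than by eye, cast them to [version] so each numeric component is compared properly. The sketch below assumes you already know the expected release; Test-AgentOutdated is a hypothetical helper name and the version strings are made-up placeholders.

```powershell
# Hypothetical helper: returns $true when the installed agent build is older than the expected one.
function Test-AgentOutdated {
    param(
        [Parameter(Mandatory)] [string]$Installed,
        [Parameter(Mandatory)] [string]$Expected
    )
    # [version] compares each numeric component, so 9.9 sorts below 9.10 (a plain string compare would not)
    return ([version]$Installed -lt [version]$Expected)
}

# On the protected VM the installed version can be read from the binary's metadata, e.g.:
#   $installed = (Get-Item "C:\Program Files (x86)\Microsoft Azure Site Recovery\agent\svagents.exe").VersionInfo.FileVersion
# The version strings below are placeholders, not real ASR releases.
$isOld  = Test-AgentOutdated -Installed '9.9.0.0'  -Expected '9.10.0.0'
$isSame = Test-AgentOutdated -Installed '9.10.0.0' -Expected '9.10.0.0'
"Outdated: $isOld; same build flagged: $isSame"
```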

Step 2: Validate Network Connectivity and Firewall Rules

Network issues are the second most common cause of Azure Site Recovery replication failures, and they're sneaky because the VM looks perfectly connected to everything else. ASR has very specific outbound connectivity requirements that standard corporate firewall rules often block by accident.

Your protected machines need outbound HTTPS access on port 443 to the following endpoints:

  • *.hypervrecoverymanager.windowsazure.com
  • *.backup.windowsazure.com
  • *.blob.core.windows.net
  • *.store.core.windows.net
  • login.microsoftonline.com

For VMware-to-Azure replication, the protected VM also needs to reach the Configuration Server on port 443 and the Process Server on port 9443 for replication data upload.

Test connectivity from the protected VM using PowerShell:

Test-NetConnection -ComputerName "login.microsoftonline.com" -Port 443
Test-NetConnection -ComputerName "your-config-server-ip" -Port 443
Test-NetConnection -ComputerName "your-config-server-ip" -Port 9443

If any of those return TcpTestSucceeded : False, you've found your problem. On the firewall or NSG, add outbound allow rules for those ports and destinations. If you're using Azure-to-Azure replication, check the NSG on the source VM's subnet: ASR needs outbound 443 to Azure public endpoints. You can use Service Tags in NSG rules: add an outbound allow rule for the AzureSiteRecovery service tag on port 443 (together with the Storage and AzureActiveDirectory tags that replication also depends on) to avoid maintaining a manual IP list that changes over time.

After adjusting firewall rules, restart the Mobility Service on the protected VM and wait 5 minutes. The replication lag should start dropping immediately if network was the issue.
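
If you have more than a couple of endpoints to check, a small loop saves retyping. This is a sketch, not an official tool: Test-AsrEndpoint is a hypothetical helper, the host names are the same placeholders used above, and wildcard entries such as *.blob.core.windows.net have to be replaced with a concrete host name (for example your own cache storage account) before they can be probed. The -Probe parameter exists only so the logic can be tried without network access; by default it wraps Test-NetConnection, which is available on Windows.

```powershell
# Hypothetical helper: probes each host/port pair and reports reachability.
function Test-AsrEndpoint {
    param(
        [Parameter(Mandatory)] [hashtable[]]$Endpoint,
        # Default probe wraps Test-NetConnection; pass a stub to dry-run the logic
        [scriptblock]$Probe = {
            param($HostName, $Port)
            (Test-NetConnection -ComputerName $HostName -Port $Port -WarningAction SilentlyContinue).TcpTestSucceeded
        }
    )
    foreach ($e in $Endpoint) {
        [pscustomobject]@{
            Host      = $e.Host
            Port      = $e.Port
            Reachable = [bool](& $Probe $e.Host $e.Port)
        }
    }
}

# Placeholder targets; substitute your own Configuration Server address
$targets = @(
    @{ Host = 'login.microsoftonline.com'; Port = 443 }
    @{ Host = 'your-config-server-ip';     Port = 443 }
    @{ Host = 'your-config-server-ip';     Port = 9443 }
)
# Live run on the protected VM:
#   Test-AsrEndpoint -Endpoint $targets | Where-Object { -not $_.Reachable }
```

Anything the live run prints is an endpoint your firewall or NSG is still blocking.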

Step 3: Fix Configuration Server and Process Server Health

In VMware-to-Azure and physical server replication scenarios, the Configuration Server is the brain of your ASR deployment. If it's unhealthy, everything downstream fails. Head to your vault in the Azure portal and go to Site Recovery Infrastructure → Configuration Servers. Click your Configuration Server and review the component health section.

You'll see status indicators for the Configuration Server itself, the Process Server, the Master Target Server, and the push installation service. Any component showing "Warning" or "Error" needs attention first.

Common Configuration Server problems and fixes:

Low disk space (Error 95144): The Configuration Server caches replication data temporarily. If the cache drive (usually D:\ or whichever drive was selected during setup) drops below 600GB free, replication throttles and eventually stops. Check disk space and clear the cache folder at C:\ProgramData\Microsoft Azure Site Recovery\Scratch, but only delete files older than 24 hours, and only if replication is paused.
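
Before deleting anything, list what would qualify under the 24-hour rule. The sketch below only lists files; Get-StaleCacheFile is a hypothetical helper, and the demo runs against a throwaway temp folder rather than the real cache path.

```powershell
# Hypothetical helper: lists files older than a cutoff without deleting anything.
function Get-StaleCacheFile {
    param(
        [Parameter(Mandatory)] [string]$Path,
        [int]$OlderThanHours = 24
    )
    $cutoff = (Get-Date).AddHours(-$OlderThanHours)
    Get-ChildItem -Path $Path -File -Recurse |
        Where-Object { $_.LastWriteTime -lt $cutoff }
}

# Demo against a throwaway folder so nothing real is touched
$demo = Join-Path ([System.IO.Path]::GetTempPath()) ("asr-cache-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $demo | Out-Null
$old = New-Item -ItemType File -Path (Join-Path $demo 'old.dat')
$new = New-Item -ItemType File -Path (Join-Path $demo 'new.dat')
$old.LastWriteTime = (Get-Date).AddHours(-48)   # backdate one file

$stale = @(Get-StaleCacheFile -Path $demo)
$stale | Select-Object Name, LastWriteTime
# To delete after review: $stale | Remove-Item -WhatIf   (drop -WhatIf once the list looks right)
```

Point -Path at the cache folder only after replication is paused, as described above.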

Process Server connectivity (Error 20011): This usually means the Process Server can't reach the Configuration Server on port 443. Run this from the Process Server:

Test-NetConnection -ComputerName "config-server-internal-ip" -Port 443
Test-NetConnection -ComputerName "config-server-internal-ip" -Port 9443

Certificate expiry (Error 20013): ASR uses certificates for internal component communication. If the cert has expired, you can re-run the Configuration Server setup wizard. Alternatively, navigate to C:\Program Files (x86)\Microsoft Azure Site Recovery\home\bin\ and run GenCert.exe to regenerate internal certs without a full reinstall.

After fixing any component issue, click Refresh in the portal and verify all components show green before moving to the next step.

Step 4: Resolve Replication Health Errors and RPO Violations

If the agent is healthy and connectivity checks pass, but your replication health still shows a warning or your RPO is drifting past your target, you're likely dealing with throughput throttling, storage account issues, or excessive churn on the protected disk.

First, check the replication lag. In the portal, go to your replicated item and look at RPO under the Overview section. If RPO is creeping above your target (say, 15 minutes when your target is 5 minutes), check disk churn. Run this on the protected Windows VM to see write I/O per disk:

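# Sample write throughput every 5 seconds for one minute, then report average and peak per disk
Get-Counter "\LogicalDisk(*)\Disk Write Bytes/sec" -SampleInterval 5 -MaxSamples 12 |
  Select-Object -ExpandProperty CounterSamples |
  Where-Object { $_.InstanceName -ne "_total" } |
  Group-Object InstanceName |
  Select-Object Name,
    @{N="AvgMB/s";E={[math]::Round(($_.Group | Measure-Object CookedValue -Average).Average / 1MB, 2)}},
    @{N="PeakMB/s";E={[math]::Round(($_.Group | Measure-Object CookedValue -Maximum).Maximum / 1MB, 2)}} |
  Sort-Object "AvgMB/s" -Descending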

If any disk is sustaining more than 54 MB/s of write throughput, you're hitting the ASR per-disk churn limit. The fix is either to move high-churn workloads to excluded disks (if the data is non-critical for DR) or upgrade your Process Server to a larger VM size to handle the increased throughput.

For storage account errors like Error 150097 (storage account not accessible), verify the cache storage account is in the same region as the source VM and that the vault's managed identity has Storage Blob Data Contributor role on that account:

$vaultIdentity = (Get-AzRecoveryServicesVault -Name "YourVaultName" -ResourceGroupName "YourRG").Identity.PrincipalId
New-AzRoleAssignment -ObjectId $vaultIdentity `
  -RoleDefinitionName "Storage Blob Data Contributor" `
  -Scope "/subscriptions/your-sub-id/resourceGroups/your-rg/providers/Microsoft.Storage/storageAccounts/yourcachesa"

After granting the role, disable and re-enable replication for the affected VM. This forces ASR to re-authenticate against the storage account with the updated permissions.

Step 5: Diagnose and Fix Failed Test Failover or Failover Errors

Running a test failover that fails, or worse, a production failover that gets stuck, is one of the most stressful situations in enterprise IT. I've been there. Here's how to work through it systematically.

When a test failover fails, the job in the vault will show a specific sub-task failure. Go to Site Recovery → Jobs and click the failed job. Expand the task tree until you find the red X. The most common failure points are:

VM not starting after failover (Error 150097 or 31008): This usually means the VM failed to boot in Azure after the disk was attached. Check if the source OS has Azure Guest Agent installed and up to date. For Windows, this is WindowsAzureGuestAgent.exe; confirm it's version 2.7.41491.1 or newer via Add/Remove Programs. For Linux, verify waagent --version returns a supported version.

Network not connected after failover: The failed-over VM boots but can't reach anything. This is almost always because the target virtual network or subnet wasn't pre-configured correctly in the replication settings. Go to the replicated item → Compute and Network and verify the target Virtual network, Subnet, and IP address settings are correct before re-running the failover.

Failover stuck at "Completing failover" (Error 520001): This specific error means the source machine couldn't be shut down gracefully. A test failover shouldn't shut down the source at all, so if that's what you ran, check your failover type. If it's a planned failover, verify that the source VM can still reach Azure over outbound HTTPS. Run this PowerShell on the source VM:

Test-NetConnection -ComputerName "management.azure.com" -Port 443

TcpTestSucceeded : True confirms outbound HTTPS from the source is working. After a test failover, always run Cleanup test failover in the portal to remove the test environment; leaving it running can exhaust your target resource group's quotas and block future failover tests.

Advanced Azure Site Recovery Troubleshooting

If the step-by-step fixes above haven't resolved your issue, you're dealing with something deeper. Here's where we go into the details that most guides skip.

Reading ASR Logs on the Configuration Server

The Configuration Server generates detailed logs that tell you exactly what's failing. Find them at:

C:\ProgramData\Microsoft Azure Site Recovery\Logs\
C:\Program Files (x86)\Microsoft Azure Site Recovery\home\svsystems\eventmanager\logs\

The most useful log for replication issues is obengine.log, which tracks all data upload operations. Search for lines containing "Error" or "Failed" with timestamps matching your incident window:

Select-String -Path "C:\ProgramData\Microsoft Azure Site Recovery\Logs\obengine.log" -Pattern "Error|Failed" | Select-Object -Last 50
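
When you don't know which log holds the failure, it can help to search every log that was written to recently. The sketch below filters on file modification time because I'm not assuming anything about the timestamp format inside the logs. Find-RecentLogError is a hypothetical helper, and the demo uses a temp folder with a made-up sample log.

```powershell
# Hypothetical helper: searches only log files modified within the last N hours.
function Find-RecentLogError {
    param(
        [Parameter(Mandatory)] [string]$Path,
        [int]$SinceHours = 2,
        [string]$Pattern = 'Error|Failed'
    )
    $cutoff = (Get-Date).AddHours(-$SinceHours)
    Get-ChildItem -Path $Path -Filter '*.log' -File |
        Where-Object { $_.LastWriteTime -ge $cutoff } |
        Select-String -Pattern $Pattern |
        Select-Object Filename, LineNumber, Line
}

# Demo with a made-up log in a throwaway folder
$logDir = Join-Path ([System.IO.Path]::GetTempPath()) ("asr-log-demo-" + [guid]::NewGuid())
New-Item -ItemType Directory -Path $logDir | Out-Null
Set-Content -Path (Join-Path $logDir 'sample.log') -Value 'upload ok', 'upload Failed: timeout'

$hits = @(Find-RecentLogError -Path $logDir)
$hits | Format-Table -AutoSize
```

On the Configuration Server, pass one of the log folders listed above as -Path and widen -SinceHours to cover your incident window.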

Event Viewer Analysis

On the protected VM itself, open Event Viewer and navigate to Applications and Services Logs → Microsoft → Azure → SiteRecovery. Filter for Event IDs in the 9000–9999 range. Event ID 9006 indicates the agent lost contact with the Configuration Server. Event ID 9007 means the agent registered successfully but replication data couldn't be sent. Event ID 9019 means the agent version is incompatible.

Group Policy Conflicts (Domain-Joined Machines)

On domain-joined servers, Group Policy can interfere with the Mobility Service in two ways. First, GPO-enforced firewall policies can silently block outbound 443/9443 traffic even after you've updated the Windows Firewall rules directly, because GPO overwrites them on next refresh. Check applied firewall policies with:

gpresult /r /scope:Computer | findstr /i "firewall"

Second, software restriction policies or AppLocker rules can prevent the Mobility Service from executing its binaries. Look for Event ID 8004 in the AppLocker log under Applications and Services Logs. If AppLocker is blocking ASR, you'll need to add a publisher exception for binaries signed by Microsoft Corporation in the %ProgramFiles(x86)%\Microsoft Azure Site Recovery\ path.

Azure-to-Azure Specific: Extension Installation Failures

For Azure-to-Azure ASR, protection is enabled via the Microsoft.Azure.RecoveryServices.SiteRecovery VM extension. If this extension fails to install, you'll see Error 150041. Common causes: the VM has a custom OS image with a locked-down extension policy, the VM is running a generation 2 (Gen2) image with Secure Boot enabled that blocks unsigned extensions, or the VM's managed identity permissions are missing.

Check extension status:

Get-AzVMExtension -ResourceGroupName "SourceRG" -VMName "SourceVM" | Where-Object {$_.Name -like "*SiteRecovery*"} | Select-Object Name, ProvisioningState, StatusCode

If ProvisioningState shows Failed, remove and reinstall the extension by disabling replication for the VM and re-enabling it from scratch. This forces ASR to redeploy the extension cleanly.

RPO Violations in High-Churn Environments

If your VMs are SQL servers, file servers with heavy write activity, or anything with sustained disk write rates above 30 MB/s, standard ASR Process Server sizing (8 cores, 16GB RAM) won't be enough. Scale up to a 16-core, 32GB configuration and set the ProcessServerThrottleMultiplier registry key on the Process Server:

HKLM:\SOFTWARE\Microsoft\Azure Site Recovery\ProcessServer
Value: ProcessServerThrottleMultiplier
Type: DWORD
Data: 2

This doubles the internal upload thread count, which significantly improves throughput for high-churn workloads.

When to Call Microsoft Support

Some ASR issues genuinely require Microsoft's backend involvement. Escalate to Microsoft Support when you see these specific situations: the replication job shows error 150097 persisting after storage permission fixes (possible backend storage account quota issue); you're seeing vault-level authentication failures that persist after re-registering the Configuration Server (possible Entra ID token issue); or a failover completed but the failed-over VM is in an inconsistent state and can't be powered on despite the disk being attached correctly (possible platform-level storage fault). Before you call, collect the vault correlation ID from the failed job. It's the first thing support will ask for, and having it ready shortens the back-and-forth on the ticket.

Prevention & Best Practices for Azure Site Recovery

Once you've fixed an ASR issue, the goal is never to see it again. These practices come from managing ASR at scale across large enterprise environments. They're not theoretical; they're the things that actually prevent the 2am pages.

Keep the Mobility Service agent updated proactively. Don't wait for ASR to flag version incompatibility; by then, replication has already been failing silently. Set a monthly maintenance window and push agent updates to all protected VMs through the Configuration Server. Microsoft releases ASR agent updates roughly every 30–45 days and each release fixes known replication bugs.

Run test failovers on a quarterly schedule. A DR solution you've never tested is a DR solution you don't actually have. Set a calendar reminder every 90 days, pick a representative subset of VMs, and run a test failover into an isolated virtual network. You'll catch RPO drift, stale credentials, and network config problems long before they matter in a real incident.

Monitor Configuration Server disk space automatically. Create an Azure Monitor alert on the Configuration Server VM for disk space below 650GB on the cache drive. The alert threshold built into the portal (600GB) is too late; at 650GB you still have time to clear space without disrupting replication.
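
If you also want a local check (for a scheduled task, say), the two thresholds can be expressed as a small classifier. This is a sketch under the figures used in this guide; Get-CacheSpaceStatus is a hypothetical helper, and the sample values are arbitrary.

```powershell
# Hypothetical helper: classifies free space (GB) against a warning and a critical threshold.
# Defaults mirror the figures in this guide: warn below 650 GB, critical below 600 GB.
function Get-CacheSpaceStatus {
    param(
        [Parameter(Mandatory)] [double]$FreeGB,
        [double]$WarnGB = 650,
        [double]$CriticalGB = 600
    )
    if ($FreeGB -lt $CriticalGB) { 'Critical' }
    elseif ($FreeGB -lt $WarnGB) { 'Warning' }
    else { 'OK' }
}

# On the Configuration Server, free space for the cache drive could be read via .NET, e.g.:
#   $freeGB = ([System.IO.DriveInfo]'D:\').AvailableFreeSpace / 1GB
# Arbitrary sample values to show the three outcomes:
$statuses = 700, 620, 500 | ForEach-Object {
    [pscustomobject]@{ FreeGB = $_; Status = Get-CacheSpaceStatus -FreeGB $_ }
}
$statuses | Format-Table -AutoSize
```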

Use the ASR Deployment Planner before adding VMs to replication. The Deployment Planner tool (ASRDeploymentPlanner.exe) profiles your VMs' disk churn rates and tells you exactly what Process Server sizing and network bandwidth you need. Running it after protection is enabled, after you're already hitting throughput limits, is the wrong time to find out your Process Server is undersized.

Tag your ASR-protected resources. Apply a consistent Azure tag like asr-protected: true to all VMs under replication, their cache storage accounts, and target resource groups. This prevents accidental deletion or policy assignment changes that silently break replication, something that happens more often than you'd think in organizations where multiple teams share Azure subscriptions.

Quick Wins
  • Set Azure Monitor alerts for replication health state changes, so you get notified within minutes instead of discovering issues during a DR drill
  • Enable auto-update for the Mobility Service agent in the vault settings (Site Recovery → Site Recovery Infrastructure → Auto Update settings) to eliminate manual patching overhead
  • Document your Configuration Server's passphrase and store it in Azure Key Vault; losing it means re-registering all protected VMs from scratch
  • Use Azure Policy to enforce that new VMs in designated resource groups are automatically enrolled in ASR replication within 24 hours of provisioning

Frequently Asked Questions

Why does my Azure Site Recovery replication health show "Critical" but the VM is still running fine?

This is one of the most confusing things about ASR: the source VM is totally unaffected by replication health. "Critical" status means ASR can't create new recovery points, so if a disaster happened right now, you'd be recovering from the last successful point (which might be hours or days old). The VM itself keeps running. The urgency is that your RPO is widening every minute replication stays broken. Check the specific error code under the replicated item's Health section and work through the connectivity and agent checks first, since those are the most common culprits behind a Critical status that appears without any obvious trigger.

How do I find the root cause when my ASR failover job fails with no useful error message?

Go to your Recovery Services vault → Monitoring → Jobs and click the failed job. In the job details blade, expand every sub-task until you find the one with a red X; the top-level "Failover failed" message is just a wrapper. Each sub-task has its own error code and description. Copy the Correlation ID from the bottom of the job details page. If you need to call support, this ID lets them pull the exact backend trace for your operation. Also check the vault's Activity Log (Settings → Activity log) filtered to the time of the failure for any authorization or resource lock errors that wouldn't show in the job view.

Can I replicate VMs that have Azure Disk Encryption enabled?

Yes, ASR supports Azure Disk Encryption (ADE), but you have to set it up correctly or you'll hit Error 150049. The vault's managed identity needs access to the Key Vault containing the disk encryption keys, specifically the Get, List, and WrapKey/UnwrapKey permissions in the Key Vault's access policies. You also need to pre-create a target Key Vault in the failover region and grant the same permissions there before enabling replication. ASR will automatically replicate the encryption keys alongside the disk data, but it won't create the target Key Vault or permissions for you; that's a manual prerequisite. Once those permissions are in place, enable replication as normal and ASR handles the rest.

My RPO is way over target even though replication shows Healthy. What's going on?

Healthy status means the agent is connected and sending data; it doesn't guarantee your RPO target is being met. High disk write churn is the most common cause: if a disk is writing faster than your Process Server can upload the delta to Azure, the replication queue builds up and RPO drifts. Check disk churn on the protected VM using the PowerShell counter command in Step 4 above. Also verify your available upload bandwidth: before any compression, 1 MB/s of sustained churn needs 8 Mbps on the wire, so size the link against your measured churn. If bandwidth is fine and churn is low, check whether your Process Server is CPU-constrained (above 85% CPU utilization consistently), which throttles the upload pipeline even when bandwidth is available.

How long does the initial replication (IR) take and why is it so slow?

Initial replication duration depends on three factors: the total size of the protected disks, your available upload bandwidth, and how aggressively ASR throttles the upload to avoid impacting production workloads. As a rough estimate, 1 Mbps of sustained upload moves about 0.45 GB per hour, so a 1TB VM on a 100Mbps connection takes around 22 hours at full line rate if nothing else is competing for bandwidth, and proportionally longer when replication is limited to a share of the link. ASR throttles its own upload to 25% of available bandwidth by default to protect production traffic. You can increase this on the Process Server by modifying the throttle settings in the cspsconfigtool.exe utility under the Bandwidth tab. During initial replication, RPO reporting is disabled; it only activates once IR completes and differential replication begins, so don't worry about RPO numbers during the IR phase.
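
For your own numbers, the estimate is plain unit conversion from line rate. The sketch below assumes decimal units (1 GB = 1000 MB), no compression, and no protocol overhead; Get-InitialReplicationHours and its -UsableFraction parameter are hypothetical names used to model whatever share of the link replication is allowed to use.

```powershell
# Hypothetical helper: estimates transfer time from raw line rate.
# Assumptions: 1 GB = 1000 MB, no compression, no protocol overhead.
function Get-InitialReplicationHours {
    param(
        [Parameter(Mandatory)] [double]$DiskSizeGB,
        [Parameter(Mandatory)] [double]$LinkMbps,
        [double]$UsableFraction = 1.0
    )
    # Mbps -> MB/s (divide by 8) -> MB/h (times 3600) -> GB/h (divide by 1000)
    $gbPerHour = $LinkMbps * $UsableFraction / 8 * 3600 / 1000
    [math]::Round($DiskSizeGB / $gbPerHour, 1)
}

$fullLink    = Get-InitialReplicationHours -DiskSizeGB 1000 -LinkMbps 100
$quarterLink = Get-InitialReplicationHours -DiskSizeGB 1000 -LinkMbps 100 -UsableFraction 0.25
"Full link: $fullLink h; quarter of the link: $quarterLink h"
```

Treat the result as a lower bound: real transfers compete with production traffic and incur overhead this calculation ignores.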

After re-registering the Configuration Server, all my VMs show as Unprotected. How do I fix this without losing recovery points?

This is a common panic moment and fortunately it's recoverable. When you re-register a Configuration Server, ASR temporarily loses the association between the server and its protected VMs. Go to each affected VM in Replicated items; they'll show "Unprotected" or "Registration in progress." Wait 20–30 minutes and refresh; in most cases ASR automatically re-discovers the protected VMs and restores the replication relationship without losing existing recovery points. If a VM doesn't re-appear automatically after 30 minutes, you can manually re-enable protection from the portal using the same policy and settings as before. ASR is smart enough to detect the existing replicated data and perform a delta sync instead of a full initial replication, so you won't lose your recovery points or have to wait for a full IR again.

Sai Kiran Pandrala
Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.