Azure Site Recovery: Fix Setup, Replication & Failover Errors
Why This Is Happening
I've seen this exact scenario more times than I can count: an IT team spends two weeks planning their Azure Site Recovery disaster recovery setup, finally gets everything connected, and then watches in frustration as the replication health dashboard turns red. The portal shows cryptic warnings, replication jobs are stuck in a "pending" state, and nobody can agree on whether the problem is the network, the appliance configuration, or something buried in the Azure Recovery Services vault itself.
Azure Site Recovery is genuinely one of the more powerful BCDR tools Microsoft offers: it can replicate Azure VMs between regions, handle on-premises VMware and Hyper-V workloads, and even cover physical Windows and Linux servers. But that breadth comes with real complexity. There are a lot of moving parts, and when one of them misbehaves, the error messages you get back from the portal are often vague enough to send you down five wrong paths before you find the real culprit.
The most common root causes I see fall into a few distinct buckets. First, there's replication appliance misconfiguration. The modernized ASR replication appliance (which replaced the older configuration server approach) is far better in terms of security and resilience, but it still requires precise network connectivity and firewall rules to work. Miss a single outbound HTTPS endpoint and replication silently stalls. Second, there are churn-related failures: if your workloads are IO-intensive and exceed the configured data churn threshold, replication health degrades fast. Azure Site Recovery now supports high-churn scenarios with up to 100 MB/s, but you have to explicitly enable and configure this; it doesn't happen automatically. Third, network mapping gaps catch people after failover, when VMs come up in the target region with no network connectivity because the virtual network mapping was never set up.
I know this is frustrating, especially when your organization is counting on this working before a compliance audit or a production cutover. The good news is that most Azure Site Recovery problems are fixable without opening a support ticket, once you know where to look. Let's work through it.
The Quick Fix: Try This First
Before diving into step-by-step troubleshooting, try this single check first. It resolves roughly 40% of the Azure Site Recovery replication issues I see in enterprise environments.
Open the Azure portal and navigate to your Recovery Services vault. In the left blade, select Site Recovery → Replicated Items. Click on any item showing a health warning. On that item's blade, look at the top for a yellow or red banner; it usually contains a direct link to the specific error. Click View Details on that error.
Nine times out of ten, the error details blade will tell you exactly what failed: a connectivity timeout to a storage endpoint, a certificate validation failure, or a process server that's running low on free disk space. Note the error code shown (it usually looks like a five-digit number starting with 9 or 15). Then, on a machine that has the Az PowerShell module installed and is signed in to Azure (the replication appliance VM works if the module is present), open a PowerShell session as Administrator and run:
Set-AzRecoveryServicesAsrVaultContext -Vault (Get-AzRecoveryServicesVault -Name "YourVaultName" -ResourceGroupName "YourRG")
Get-AzRecoveryServicesAsrJob | Sort-Object StartTime -Descending | Select-Object -First 10
This pulls the last 10 ASR jobs from your vault so you can see exactly which job failed and when. Cross-reference the job start time with any firewall changes, VM reboots, or network maintenance in your environment around that same window. In most cases, the root cause is sitting right there.
Also check the replication appliance's outbound internet connectivity. It needs to reach *.backup.windowsazure.com, *.blob.core.windows.net, and login.microsoftonline.com over port 443. A proxy that started blocking one of these silently after a security policy change is a classic culprit.
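If you'd rather script the endpoint check than test each host by hand, here's a minimal sketch. It assumes Test-NetConnection is available on the appliance and that you replace the wildcard entries with concrete host names; the function name Test-AsrEndpoint and the storage account name in the example are placeholders, not part of any ASR tooling. Note that this tests a direct TCP connection, so it won't tell you what an HTTP proxy is doing to the traffic.

```powershell
# Sketch: test outbound HTTPS reachability for a list of hosts.
# Wildcard entries such as *.blob.core.windows.net cannot be tested directly,
# so pass concrete host names (the storage account name below is a placeholder).
function Test-AsrEndpoint {
    param(
        [Parameter(Mandatory)][string[]]$HostName,
        [int]$Port = 443
    )
    foreach ($h in $HostName) {
        $result = Test-NetConnection -ComputerName $h -Port $Port -WarningAction SilentlyContinue
        [pscustomobject]@{
            HostName  = $h
            Port      = $Port
            Reachable = [bool]$result.TcpTestSucceeded
        }
    }
}

# Example (hypothetical storage account name):
# Test-AsrEndpoint -HostName 'login.microsoftonline.com', 'yourstorageaccount.blob.core.windows.net'
```

Any row with Reachable set to False is the endpoint to chase through your firewall or proxy logs.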
The Recovery Services vault is the control plane for all Azure Site Recovery operations: it's where replication policies, network mappings, and protected items are all managed from a single location. Before touching anything on the appliance side, confirm the vault itself is correctly configured.
In the Azure portal, navigate to your Recovery Services vault and select Site Recovery Infrastructure from the left menu. Click Replication Policies. You should see at least one policy defined. Click into it and check these specific values:
- Recovery point retention: This defines how far back you can recover to. For most workloads, 24 hours is the starting point, but IO-intensive apps may need shorter windows with more frequent recovery points to meet tighter RPO targets.
- App-consistent snapshot frequency: This setting controls how often application-consistent snapshots are taken. These snapshots capture disk data, all in-memory data, and all transactions in process; they're what make your recovery actually usable rather than just crash-consistent. If snapshots are taken too frequently, you may see a performance impact on the source VMs.
- Replication frequency (Hyper-V): If you're replicating Hyper-V VMs, confirm this is set to your desired value. Azure Site Recovery supports replication frequency as low as 30 seconds for Hyper-V workloads.
If you're replicating Azure VMs between regions, also confirm the source and target regions are correct. Navigate to Replicated Items, select an item, and verify the Source and Target fields match your disaster recovery architecture design. A mismatch here, usually caused by someone setting up replication in the wrong vault, means every subsequent step is built on a broken foundation.
When this step is correct, the Replicated Items blade shows all items with a green health icon and the Replication Policy column shows your named policy. If it instead shows "None" or a policy name that ends in "-default", someone skipped policy assignment during initial setup.
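The same policy check can be done in bulk. The sketch below is a plain filter over whatever objects you hand it; it assumes each object exposes FriendlyName and PolicyFriendlyName properties, which is what I'd expect from Get-AzRecoveryServicesAsrReplicationProtectedItem, but verify the property names against your installed Az.RecoveryServices version. The function name Find-UnassignedPolicyItem is made up for illustration.

```powershell
# Sketch: flag replicated items whose policy is missing or looks like a default.
# Assumes each input object has FriendlyName and PolicyFriendlyName properties;
# confirm those names against the objects your Az.RecoveryServices version returns.
function Find-UnassignedPolicyItem {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]$Item
    )
    process {
        $policy = [string]$Item.PolicyFriendlyName
        if ([string]::IsNullOrWhiteSpace($policy) -or $policy -eq 'None' -or $policy -like '*-default') {
            [pscustomobject]@{
                FriendlyName = $Item.FriendlyName
                Policy       = $policy
            }
        }
    }
}

# Example: $items | Find-UnassignedPolicyItem
```

An empty result means every item has a named, non-default policy attached.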
The modernized Azure Site Recovery replication appliance is significantly better than the old configuration server: it offers stronger security and improved resilience. But it needs to be correctly registered to your vault and maintain continuous outbound connectivity to Azure. When either of those breaks, replication stalls.
Log into the replication appliance VM (this is a Windows Server VM running in your on-premises environment or on Azure Stack). Open the Azure Site Recovery Configuration Manager tool; you'll find it on the desktop as a shortcut called ASR Replication Appliance Manager. Check the status panel at the top. It should show Connected next to the vault name. If it shows Not Connected or Registration Pending, the appliance has lost its registration with the vault.
To re-register, go back to the Azure portal. Navigate to your vault → Site Recovery Infrastructure → Replication Appliances. Select your appliance and click Download Configuration File to get a fresh registration key. Back on the appliance, in the Configuration Manager, click Re-register and supply the new key file.
While you're troubleshooting the appliance, also verify that port 9443 is open between your protected VMs and the appliance. This is the channel the mobility service uses to send replication data to the process server, so the connection is initiated from the source VM. Run this quick check from PowerShell on a Windows source VM:
Test-NetConnection -ComputerName <Appliance-IP> -Port 9443
A TcpTestSucceeded : True result means the channel is clear. If it's False, trace the firewall path between the source VMs and the appliance; a host-based firewall rule or a network security group is almost certainly blocking it.
After re-registration succeeds, give the appliance five to ten minutes and then refresh the Replicated Items blade in the portal. You should see the replication health move from Critical to Healthy as the appliance resynchronizes its state with the vault.
Once connectivity is confirmed, the next most common source of replication health warnings is data churn: the rate at which data is changing on your source disks. Azure Site Recovery has to keep up with those changes in near-real-time. When it can't, the replication lag grows, your RPO blows out, and you get health warnings.
Standard Azure Site Recovery configurations handle typical workloads comfortably. But if you're protecting SQL Server databases under heavy write load, large file servers, or any IO-intensive application, you may need to enable the High Churn option. This feature allows data churn up to 100 MB/s per VM, which covers the vast majority of even aggressive enterprise workloads.
To check your current churn rate, go to the Azure portal → your vault → Replicated Items → select the VM → Compute and Network. You'll see a per-disk breakdown of the data change rate. If any disk is consistently above 50 MB/s and you haven't enabled High Churn, that's your problem.
To enable High Churn support, navigate to the VM's replicated item blade and select Properties. Under the replication settings, you'll find the Churn option. Switch it to High Churn and save. Note that this change takes effect on the next replication cycle, so give it 15–20 minutes before checking health again.
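To make the decision rule concrete, here's a small sketch that classifies a sustained churn reading against the two figures quoted in this section (50 MB/s as the point where standard replication struggles, 100 MB/s as the High Churn ceiling). The function name and the result labels are illustrative only; plug in whatever numbers you read from the portal.

```powershell
# Sketch: classify a sustained data change rate against the thresholds
# quoted in this article (50 MB/s for standard churn, 100 MB/s for High Churn).
function Get-ChurnAssessment {
    param(
        [Parameter(Mandatory)][double]$ChurnMBps,
        [bool]$HighChurnEnabled = $false
    )
    if ($ChurnMBps -gt 100) { return 'ExceedsHighChurnLimit' }
    if ($ChurnMBps -gt 50 -and -not $HighChurnEnabled) { return 'EnableHighChurn' }
    return 'WithinLimits'
}

# Example: Get-ChurnAssessment -ChurnMBps 62
```

A result of ExceedsHighChurnLimit means the workload is beyond what the High Churn option covers, so switching the setting alone won't restore replication health.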
Also check available disk space on the appliance's cache disk. The process server component of the appliance buffers changed data before uploading it to Azure storage. If the cache disk (typically the D: or E: drive on the appliance VM) is above 85% full, replication will throttle and eventually pause. Free up space or resize the disk, then restart the Microsoft Azure Site Recovery process server service on the appliance:
Restart-Service -Name "svagents" -Force
After this restarts, watch the replication health dashboard for five minutes. In most cases, health returns to green within one full replication cycle.
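If you want to check the cache disk without opening Explorer, something like the following works as a rough sketch. The drive letter default is an assumption (adjust it to wherever your appliance cache actually lives), and both function names are made up for this example. The 85% threshold is the figure mentioned above.

```powershell
# Sketch: report how full a drive is and whether it crosses the 85% mark
# mentioned above. The drive letter is an assumption; adjust as needed.
function Get-UsedPercent {
    param(
        [Parameter(Mandatory)][double]$UsedBytes,
        [Parameter(Mandatory)][double]$FreeBytes
    )
    $total = $UsedBytes + $FreeBytes
    if ($total -le 0) { throw 'Drive reports no capacity.' }
    [math]::Round(100 * $UsedBytes / $total, 1)
}

function Test-CacheDrive {
    param(
        [string]$DriveLetter = 'D',
        [double]$ThresholdPercent = 85
    )
    # Get-PSDrive exposes Used and Free (in bytes) for file system drives.
    $drive = Get-PSDrive -Name $DriveLetter -PSProvider FileSystem
    $used  = Get-UsedPercent -UsedBytes $drive.Used -FreeBytes $drive.Free
    [pscustomobject]@{
        Drive         = $DriveLetter
        UsedPercent   = $used
        OverThreshold = $used -gt $ThresholdPercent
    }
}

# Example: Test-CacheDrive -DriveLetter 'E'
```

Running this on a schedule and alerting when OverThreshold is True gives you warning before replication starts to throttle.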
This is the step that surprises teams the most during their first test failover. Everything looks healthy (replication is green, recovery points are accumulating), but when they run a test failover, the VMs come up in Azure with no network connectivity. Nobody can RDP in, applications can't reach their dependencies, and the whole exercise fails. The cause is almost always missing network mapping.
Azure Site Recovery needs to know which virtual network in the target region to connect recovered VMs to. Without this mapping defined, VMs come up isolated. Setting it up takes three minutes and saves enormous pain.
In the Azure portal, go to your Recovery Services vault → Site Recovery Infrastructure → Network Mapping. If the list is empty, you haven't set this up yet. Click + Add Network Mapping.
- Source fabric: Select your source environment (e.g., your on-premises VMware fabric or the source Azure region)
- Source network: The virtual network your source VMs currently live in
- Target fabric: Your Azure region or target environment
- Target network: The Azure VNet you want recovered VMs to connect to
Click OK to save. The mapping takes effect immediately for future failovers, but if you have a test failover already in progress, clean it up first and rerun it.
Also configure static IP address reservation if your applications depend on fixed IPs. On each replicated item, go to Compute and Network → Network Interfaces and set the target IP for each NIC. If you leave this blank, Azure assigns a dynamic IP on failover, which breaks any application config that references the source IP directly. Azure Site Recovery integrates with Azure's network layer to reserve these IPs, so as long as the address is available in the target subnet, it'll be assigned correctly at failover time.
When network mapping is set correctly, a test failover will show your VMs coming up with the expected IP addresses and successfully resolving DNS names within two to three minutes of the failover job completing.
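Before you type a static target IP into the portal, it's worth confirming the address actually falls inside the target subnet. The sketch below does that arithmetic for IPv4. The function names are hypothetical, and the check is deliberately narrow: it doesn't know about the handful of addresses Azure reserves in every subnet or about addresses already in use, so treat a True result as necessary, not sufficient.

```powershell
# Sketch: check whether an IPv4 address falls inside a CIDR range.
function ConvertTo-IPv4Number {
    param([Parameter(Mandatory)][string]$IpAddress)
    $bytes = [System.Net.IPAddress]::Parse($IpAddress).GetAddressBytes()
    if ($bytes.Length -ne 4) { throw "Not an IPv4 address: $IpAddress" }
    # Double arithmetic is exact for 32-bit values and avoids integer overflow.
    ([double]$bytes[0] * 16777216) + ([double]$bytes[1] * 65536) + ([double]$bytes[2] * 256) + [double]$bytes[3]
}

function Test-IpInSubnet {
    param(
        [Parameter(Mandatory)][string]$IpAddress,
        [Parameter(Mandatory)][string]$Cidr      # e.g. '10.1.2.0/24'
    )
    $network, $prefix = $Cidr -split '/'
    if ($null -eq $prefix) { throw "Expected CIDR notation such as 10.1.2.0/24, got $Cidr" }
    $prefixLength = [int]$prefix
    if ($prefixLength -lt 0 -or $prefixLength -gt 32) { throw "Invalid prefix length in $Cidr" }
    # Two addresses share a subnet when they land in the same block of 2^(32 - prefix) addresses.
    $blockSize = [math]::Pow(2, 32 - $prefixLength)
    $ipBlock   = [math]::Floor((ConvertTo-IPv4Number -IpAddress $IpAddress) / $blockSize)
    $netBlock  = [math]::Floor((ConvertTo-IPv4Number -IpAddress $network) / $blockSize)
    $ipBlock -eq $netBlock
}

# Example: Test-IpInSubnet -IpAddress '10.1.2.15' -Cidr '10.1.2.0/24'
```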
One of the most underused features in Azure Site Recovery is the ability to run a complete disaster recovery drill without touching your live replication. I've seen organizations go 18 months between drills because they're afraid of disrupting production, and then discover during an actual incident that their recovery plans don't work. Don't be that organization.
A test failover in Azure Site Recovery spins up your VMs in an isolated Azure VNet using your replicated data, lets you verify everything works, and then tears the test environment down, all while your live replication continues uninterrupted in the background. This is built into the product specifically to make regular testing the default behavior, not the exception.
To run a test failover from the portal: navigate to Replicated Items → select the VM or recovery plan → click Test Failover. You'll be prompted to choose:
- Recovery Point: Choose Latest processed for the fastest RTO (uses the most recent recovery point that's been fully processed). Choose Latest app-consistent if your application requires a clean transaction boundary.
- Azure Virtual Network: Select an isolated test network, one that doesn't have any routing to your production environment. This is critical. If you use a production-connected VNet for test failovers, you risk IP conflicts and traffic interception.
Click OK. The failover job runs; you can watch progress under Site Recovery Jobs in real time. Once complete, RDP or SSH into the failed-over VM to verify the OS boots correctly, applications start, and database connections are healthy. Run whatever smoke tests your runbooks call for.
When done, click Cleanup Test Failover on the replicated item. This tears down the test VMs and marks the drill as complete in your compliance records. Recovery plans, which let you sequence multi-tier application failover with custom scripts and Azure Automation runbooks, make this process repeatable and auditable across your entire workload portfolio.
Advanced Troubleshooting
Diagnosing Mobility Service Installation Failures on VMware VMs
If you're replicating on-premises VMware VMs and seeing errors during the initial enable-replication step, the mobility service installation is the first suspect. The mobility service is a lightweight agent that runs on each source VM and sends replication data to the process server on the appliance. When it fails to install, you get a failed job and an error message that's often less helpful than you'd hope.
Check the installation logs on the source VM at C:\ProgramData\ASRSetupLogs\ (Windows) or /var/log/ua_install.log (Linux). Look for lines containing ERROR or FAILED. The most common failure reasons are:
- The source VM can't reach the process server on port 9443 (firewall blocking)
- The credentials supplied for push installation don't have local admin rights on the source VM
- A previous, partially installed mobility service agent version is conflicting; uninstall it via Programs and Features and retry
- SELinux or Windows Defender blocking the installer execution
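On a Windows source VM, you can pull the interesting lines out of the installer logs in one go. This is a minimal sketch: the default path is the Windows location named above, the *.log filter is my assumption about how the files are named, and Find-InstallError is just an illustrative function name.

```powershell
# Sketch: list lines containing ERROR or FAILED in the mobility service
# installer logs. The *.log filter is an assumption about file naming.
function Find-InstallError {
    param(
        [string]$Path = 'C:\ProgramData\ASRSetupLogs\*.log'
    )
    Select-String -Path $Path -Pattern 'ERROR|FAILED' -CaseSensitive |
        Select-Object Filename, LineNumber, Line
}

# Example: Find-InstallError | Format-Table -AutoSize -Wrap
```

The match is case-sensitive on purpose, so ordinary prose such as "no error" in informational lines doesn't flood the output; drop -CaseSensitive if your logs use mixed case.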
Event Viewer Analysis for Replication Job Failures
On the replication appliance, open Event Viewer and navigate to Applications and Services Logs → Microsoft → Azure Site Recovery. Filter for Error and Warning level events. Errors logged here correspond directly to what you see in the portal, but the Event Viewer entries often contain the full stack trace and inner exception messages that the portal omits. Sort by time and look for event clusters around the time your replication health turned red.
Also check the Windows Application log for errors from source svagents, InMage Scout Application Server, or Microsoft Azure Recovery Services Agent. These services are the core of the replication pipeline on the appliance side.
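The Application log check can be scripted with Get-WinEvent. Treat the following as a starting point, not a definitive query: the provider name patterns are taken from the sources named above and may not match exactly how they're registered on your appliance, and both function names are invented for this example.

```powershell
# Sketch: pull recent Error and Warning events from the Application log and
# keep those whose provider name matches one of the given wildcard patterns.
function Test-ProviderMatch {
    param(
        [string]$ProviderName,
        [string[]]$Pattern
    )
    foreach ($p in $Pattern) {
        if ($ProviderName -like $p) { return $true }
    }
    return $false
}

function Get-AsrApplicationEvent {
    param(
        [int]$HoursBack = 24,
        # Assumed patterns; adjust to the provider names on your appliance.
        [string[]]$ProviderPattern = @('svagents', 'InMage*', 'Microsoft Azure Recovery Services*')
    )
    $filter = @{
        LogName   = 'Application'
        Level     = 2, 3                               # 2 = Error, 3 = Warning
        StartTime = (Get-Date).AddHours(-$HoursBack)
    }
    Get-WinEvent -FilterHashtable $filter -ErrorAction SilentlyContinue |
        Where-Object { Test-ProviderMatch -ProviderName $_.ProviderName -Pattern $ProviderPattern } |
        Select-Object TimeCreated, ProviderName, Id, LevelDisplayName, Message
}

# Example: Get-AsrApplicationEvent -HoursBack 12 | Sort-Object TimeCreated
```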
PowerShell-Based Vault Diagnostics
For enterprise environments managing dozens of replicated items, the portal gets tedious fast. Use the Az.RecoveryServices PowerShell module to bulk-check health across all items:
# Set vault context
$vault = Get-AzRecoveryServicesVault -Name "YourVaultName" -ResourceGroupName "YourRG"
Set-AzRecoveryServicesAsrVaultContext -Vault $vault
# Get all fabrics
$fabrics = Get-AzRecoveryServicesAsrFabric
# Check replication protected items across all containers
foreach ($fabric in $fabrics) {
    $containers = Get-AzRecoveryServicesAsrProtectionContainer -Fabric $fabric
    foreach ($container in $containers) {
        Get-AzRecoveryServicesAsrReplicationProtectedItem -ProtectionContainer $container |
            Select-Object FriendlyName, ReplicationHealth, FailoverHealth, LastSuccessfulFailoverTime |
            Format-Table -AutoSize
    }
}
This gives you a clean table of every protected item's replication health, failover health, and last successful failover time. Sort by LastSuccessfulFailoverTime ascending to immediately spot items that haven't tested failover in too long.
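Because the loop above prints one table per container, sorting across the whole vault takes one more step: collect the items first, then filter and sort. Here's a sketch of that second half. It assumes the input objects carry the same FriendlyName and LastSuccessfulFailoverTime properties selected above; the function name and the 90-day default (roughly one quarter) are my own choices.

```powershell
# Sketch: return items whose last recorded failover is older than a cutoff
# (or missing entirely), oldest first.
function Get-StaleFailoverItem {
    param(
        [Parameter(Mandatory)][object[]]$Item,
        [int]$MaxAgeDays = 90,
        [datetime]$Now = (Get-Date)
    )
    $cutoff = $Now.AddDays(-$MaxAgeDays)
    $Item |
        Where-Object { -not $_.LastSuccessfulFailoverTime -or $_.LastSuccessfulFailoverTime -lt $cutoff } |
        Sort-Object LastSuccessfulFailoverTime |
        Select-Object FriendlyName, LastSuccessfulFailoverTime
}

# Example: collect items from every container into $allItems, then:
# Get-StaleFailoverItem -Item $allItems -MaxAgeDays 90
```

Items with no recorded failover time sort to the top, which is usually what you want to see first.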
Handling Zone-to-Zone and Extended Zone Replication Issues
Zone-to-zone disaster recovery, replicating Azure VMs between availability zones within the same region, is a newer scenario and has some specific configuration quirks. If you're seeing replication stuck after enabling zone-to-zone DR, verify that the target zone you selected actually has capacity for the VM SKU you're protecting. Zone capacity constraints are a real issue during high-demand periods and won't surface until failover time if you haven't checked ahead.
For Azure Extended Zones replication (currently in preview), be aware that recovery plans with custom scripts and manual actions aren't yet supported for Extended Zone scenarios; this feature is currently limited to region-to-region replication and will be added to Extended Zones in a future release. Plan your runbook automation accordingly if you're in this preview.
Prevention & Best Practices
Every Azure Site Recovery outage I've investigated had at least one preventable cause. The teams that keep ASR running cleanly long-term are the ones who treat it as an active system that needs regular attention, not a set-it-and-forget-it checkbox.
Keep the replication appliance updated. Microsoft releases Azure Site Recovery updates regularly, and running on outdated appliance versions is one of the top causes of compatibility issues and security gaps. The portal will show a banner when a new appliance update is available. Apply it within two weeks of release; never let it go more than 60 days. Navigate to Site Recovery Infrastructure → Replication Appliances to check the current version and kick off an update.
Run test failovers on a schedule. At minimum, quarterly. Monthly for critical workloads. Set a recurring calendar reminder and actually do it. The test failover feature is specifically designed so there's no excuse not to: it doesn't disrupt anything, it proves your recovery points are usable, and it validates your network mapping and IP assignments before a real incident forces the issue.
Monitor churn rates proactively. Set up an Azure Monitor alert on the Data Change Rate metric for your replicated items. If churn on any disk climbs above 70 MB/s and you haven't enabled High Churn support, you want to know before replication health degrades, not after. The High Churn option supports up to 100 MB/s and should be pre-enabled for any IO-intensive workload.
Document and test your recovery plans. Recovery plans let you sequence multi-tier application failover with scripts, manual approval gates, and Azure Automation runbooks. Build these plans during initial setup, not during an incident. Test them during your quarterly drills. A recovery plan that sequences your database tier before your application tier before your web tier, with health checks built in between each group, is the difference between a 20-minute recovery and a four-hour scramble.
Validate network mapping after any Azure networking changes. If someone adds, renames, or deletes a virtual network in either your source or target region, check your ASR network mappings immediately. Mappings don't auto-update when the underlying VNets change, and a stale mapping will silently produce a broken post-failover network configuration that you won't discover until test failover day.
- Enable Azure Monitor alerts for Replication Health and Data Change Rate on all replicated items so you catch degradation in minutes, not days
- Pre-configure static target IP addresses for every replicated VM NIC so post-failover DNS and application config are never broken
- Keep at least 20% of the replication appliance cache drive free at all times; a full buffer causes silent throttling before any alerts fire
- Apply the principle of least privilege to the push-installation credentials ASR uses for mobility service deployment: a dedicated service account with local admin rights only and no domain admin rights
Frequently Asked Questions
What exactly does Azure Site Recovery do and how is it different from Azure Backup?
Azure Site Recovery is a business continuity and disaster recovery service: its job is to keep your apps and workloads running during outages by replicating them to a secondary location and enabling failover. Azure Backup is focused on data protection, creating point-in-time copies of data you can restore from. Think of ASR as your "keep running" tool and Backup as your "restore from history" tool. They're designed to work together as part of a complete BCDR strategy, and both live under the Azure Recovery Services umbrella. For most production workloads, you want both.
What can Azure Site Recovery actually replicate?
Azure Site Recovery handles a wide range of scenarios. You can replicate Azure VMs from one Azure region to another, or from an Azure Extended Zone to its connected region. On the on-premises side, it covers VMware VMs (using the modernized replication appliance), Hyper-V VMs (with or without System Center VMM), physical Windows and Linux servers, and Azure Stack VMs. It can even replicate AWS Windows instances to Azure. You can also replicate on-premises VMware or Hyper-V workloads to a secondary on-premises site. Basically, if it runs on a machine that's in ASR's support matrix, it can be replicated. Check the official support matrix in the portal for the exact OS versions and configurations that are covered.
How fast can Azure Site Recovery replicate, and what's the minimum RPO I can achieve?
It depends on what you're replicating. For Azure VMs and VMware VMs, replication is continuous: changes are shipped to Azure storage as they happen, which means your RPO is typically measured in seconds to a few minutes under normal conditions. For Hyper-V VMs, you can set the replication frequency as low as 30 seconds, which gives you a very tight RPO. Application-consistent recovery points (which capture in-memory state and active transactions, not just disk data) are created on a configurable schedule that you set in your replication policy. These have slightly higher overhead but give you a clean, consistent recovery point rather than a crash-consistent one.
Can I test a failover without affecting production or breaking active replication?
Yes. This is one of ASR's most valuable features, and it's specifically designed for this purpose. A test failover spins up a copy of your VM in an isolated Azure VNet using your replicated recovery data, completely independent of live replication. Active replication continues without interruption while your test VMs are running. You validate everything works, then run "Cleanup Test Failover" and the test environment is torn down. The only thing to be careful about is choosing an isolated VNet for the test: if you accidentally point the test failover at a production-connected network, you risk IP conflicts. Always use a dedicated, isolated test VNet.
My replication health turned red overnight and I didn't change anything. What happened?
The most common overnight culprits are: a scheduled Windows Update that rebooted the replication appliance and didn't restart the ASR services cleanly; a network security group rule change or firewall policy update that started blocking outbound traffic to Azure endpoints; the appliance's cache disk filling up due to a spike in data churn; or an Azure Site Recovery service-side incident in your region (check the Azure Service Health blade for any active incidents). Start by checking that the appliance services are running, then verify outbound connectivity to *.backup.windowsazure.com and *.blob.core.windows.net, and look at the error details on the Replicated Items blade; the specific error code will point you to the right fix within minutes.
What's the difference between a planned failover and an unplanned failover in Azure Site Recovery?
A planned failover is used when you know an outage is coming: scheduled maintenance, a datacenter migration, or a planned Azure region evacuation. In this mode, ASR waits for all pending replication data to sync before failing over, so you get zero data loss. An unplanned failover is what you use during an actual disaster when the primary site is down and you can't wait. ASR fails over using the latest available recovery point, which may mean a small amount of data loss depending on how recent that recovery point is and the churn rate of your workloads. After either type of failover, once the primary location is available again, you can fail back to it and ASR will reverse-replicate any changes made in the secondary location back to the primary.