Azure Windows Virtual Machine Troubleshooting Guide
Why This Is Happening
You spun up an Azure Windows Virtual Machine, configured it exactly the way you needed it, and now something is broken. Maybe it won't boot. Maybe your RDP connection drops the moment you try to connect. Maybe the VM is running but grinding through tasks at a fraction of normal speed, with CPU pinned at 100% for no obvious reason. I've been in that exact seat more times than I can count, and I know how disorienting it feels when a cloud VM misbehaves, because the feedback loop is so much slower than on a physical machine you can actually touch.
Azure Windows Virtual Machine troubleshooting is genuinely different from on-prem troubleshooting. You don't have physical access to the hardware. You can't plug in a USB drive and boot from a recovery environment the normal way. The error messages you get, whether from the Azure portal, from Event Viewer, or from a failed RDP handshake, are often vague, and Microsoft's own error descriptions can read like they were written for someone who already knows the answer.
Here's the real picture of what causes most Azure Windows VM problems:
RDP and connectivity failures are the single most common complaint. These usually come from one of three places: a Network Security Group (NSG) rule that's blocking port 3389, a corrupt or disabled Remote Desktop Services component inside the guest OS, or an authentication mismatch, often tied to an expired password or a failed domain join. The Azure portal's built-in "Can't connect to my VM" guide is a good starting point, but it often misses guest-OS-level causes entirely.
Boot failures are the second most stressful category. A Windows VM that won't boot typically shows up in the Azure portal as stuck in a "Provisioning" or "Starting" state. Underneath, you're usually looking at one of these: a BitLocker encryption key that Azure can't access, a corrupted Boot Configuration Data (BCD) store, a bad Windows Update that wiped a critical boot file, or, my personal least favorite, a disk signature collision after a disk resize.
Performance degradation that seems random is almost always one of two things: the VM SKU is undersized for the actual workload (especially after SQL Server or IIS starts warming up), or there's a noisy-neighbor situation on the underlying Azure host. Azure's Performance Diagnostics tool in the portal can surface CPU, memory, and disk I/O bottlenecks quickly, but you need to know it exists to use it.
VM extension failures are the quiet killers. Extensions run as SYSTEM inside the guest OS and handle everything from monitoring agents to custom script deployments. When they fail, they usually fail silently from the user's perspective, and you only notice something is wrong when your monitoring stops reporting or a deployment pipeline throws a cryptic error.
The good news: almost every Azure Windows VM troubleshooting scenario has a documented fix path. The challenge is knowing which path applies to your situation. That's exactly what this guide walks you through.
The Quick Fix: Try This First
Before you dig into logs and registry edits, try the Azure portal's built-in VM repair tools. Microsoft added a solid "Redeploy" feature that moves your VM to a fresh Azure host node, and it resolves a surprising number of problems that look like software issues but are actually infrastructure-level glitches on the underlying host.
Here's how to redeploy your Azure Windows VM in under two minutes:
- Sign in to the Azure portal at portal.azure.com.
- In the left navigation, select Virtual Machines, then click your VM's name.
- In the VM's left menu, scroll down to the Help section and click Redeploy + Reapply.
- On the blade that opens, click the Redeploy button. Confirm if prompted.
- Wait 5–10 minutes. The VM will shut down, migrate to a new host node, and restart.
What does this actually fix? Any issue caused by a degraded physical host, network fabric problems, storage latency spikes, or host-level hypervisor instability. It does not fix problems that live inside the guest OS itself. If your VM comes back up and the same error is waiting for you, the problem is internal, and you need the full step-by-step below.
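If you'd rather script the redeploy than click through the portal, here is a minimal sketch using the Azure CLI. The function name `redeploy_vm` and the resource names are placeholders I made up for illustration; it assumes you're already signed in with `az login`.

```shell
# Redeploy a VM to a new host node, then print its power state.
# Usage: redeploy_vm <resource-group> <vm-name>
# (redeploy_vm is a hypothetical helper name, not an Azure command.)
redeploy_vm() {
  az vm redeploy --resource-group "$1" --name "$2" &&
    az vm show --show-details --resource-group "$1" --name "$2" \
      --query powerState --output tsv
}
```

Calling `redeploy_vm my-rg my-vm` should block until the redeploy finishes and then print the power state. As with the portal button, anything on the temporary disk is lost.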
If redeploy doesn't help and you're locked out entirely (can't RDP, can't see the desktop), your next fastest move is the Reset Password tool. In the VM menu under Help, click Reset password. Choose "Reset password," enter a new admin username and password, and hit Update. Azure pushes this through the VM agent even if the OS is otherwise unreachable via RDP. This alone fixes probably 30% of the "locked out of my VM" tickets I've seen come through.
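The same reset can be done from the Azure CLI with `az vm user update`. This is a sketch, not the only way to do it: `reset_vm_password` is a hypothetical helper name, and like the portal tool it depends on the VM agent being healthy.

```shell
# Reset (or create) a local admin account through the VM agent.
# Usage: reset_vm_password <resource-group> <vm-name> <username> <new-password>
# (reset_vm_password is a hypothetical helper name.)
reset_vm_password() {
  az vm user update --resource-group "$1" --name "$2" \
    --username "$3" --password "$4"
}
```

In practice, read the password from a prompt or a secret store rather than typing it on the command line, where it can end up in shell history.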
If the VM shows as Running in the portal but RDP still refuses to connect, check your Network Security Group before anything else. Go to your VM → Networking and inspect the inbound rules. You need a rule explicitly allowing TCP port 3389 from your source IP (or from any IP if this is a dev environment). If that rule is missing or set to Deny, no amount of OS-level fixing will help, because the traffic never reaches the VM.
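The NSG check above can be sketched with the Azure CLI as well. The helper names, the rule name `Allow-RDP-From-My-IP`, and the priority you pass in are all assumptions for illustration; pick a priority that doesn't collide with your existing rules.

```shell
# List inbound rules so a missing or Deny rule for 3389 is easy to spot.
# Usage: list_inbound_rules <resource-group> <nsg-name>
list_inbound_rules() {
  az network nsg rule list --resource-group "$1" --nsg-name "$2" \
    --query "[?direction=='Inbound'].{name:name, priority:priority, access:access, port:destinationPortRange, source:sourceAddressPrefix}" \
    --output table
}

# Allow RDP (TCP 3389) from a single source address or CIDR range.
# Usage: allow_rdp_from <resource-group> <nsg-name> <source-ip-or-cidr> <priority>
allow_rdp_from() {
  az network nsg rule create --resource-group "$1" --nsg-name "$2" \
    --name Allow-RDP-From-My-IP --priority "$4" \
    --direction Inbound --access Allow --protocol Tcp \
    --destination-port-ranges 3389 --source-address-prefixes "$3"
}
```

For example, `allow_rdp_from my-rg my-nsg 203.0.113.10 300` would open RDP to one address only, which is a safer default than allowing any source.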
On the client side, try running cmdkey /list in Command Prompt and deleting any stored entries for your VM's IP or hostname before reconnecting.
If you can get into the VM at all, even intermittently, your first move should be checking the Windows Event Logs. RDP failures leave very specific fingerprints. Open Event Viewer (press Win + R, type eventvwr.msc, hit Enter) and navigate to Windows Logs → Security.
Filter for these Event IDs, which tell you exactly what's failing:
- Event ID 4625: Failed logon. The "Failure Reason" field will say whether it's a bad password, an account lockout, or a logon type mismatch.
- Event ID 6273: Network Policy Server denied access. This shows up in domain-joined VMs when NPS policies are blocking the RDP session.
- Event ID 1158: Remote Desktop Services hit its connection limit. On Server SKUs, this means you've hit the 2-session limit without an RDS CAL.
For RDP-specific events, also check Applications and Services Logs → Microsoft → Windows → TerminalServices-LocalSessionManager → Operational. This log gives you a play-by-play of every RDP session attempt, including why the connection was terminated.
If you're seeing authentication errors and the VM is domain-joined, the problem is often a broken Kerberos ticket or a time skew between the VM and the domain controller. Check the VM's system clock: if it's off by more than 5 minutes from the DC, Kerberos authentication will fail every time. Run w32tm /query /status in an elevated Command Prompt to see the time source and sync status, then force a resync if needed.
w32tm /query /status
w32tm /resync /force
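To make the 5-minute rule concrete, here is a small sketch that takes a clock offset in seconds (one you've already measured between the VM and the DC) and tells you whether it's inside the Kerberos tolerance. The function name `kerberos_skew_ok` is hypothetical, and the 300-second limit is the default tolerance described above.

```shell
# Succeeds (exit 0) when the offset is within 5 minutes (300 s) either way.
# Usage: kerberos_skew_ok <offset-seconds>   (sign is ignored, decimals allowed)
# (kerberos_skew_ok is a hypothetical helper name.)
kerberos_skew_ok() {
  awk -v off="$1" 'BEGIN { if (off < 0) off = -off; exit !(off <= 300) }'
}
```

For example, `kerberos_skew_ok -312.5 || echo "resync the clock"` would print the warning, because 312.5 seconds is past the limit.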
If the VM is not domain-joined and you're still getting Event ID 4625 with "Failure Reason: Unknown user name or bad password" despite using the correct credentials, you likely have a cached credential issue or the local account got locked. Unlock it via the Reset Password blade in the portal as described above, then check Local Users and Groups (lusrmgr.msc) to confirm the account isn't disabled.
A Windows VM that won't boot is one of the most stressful situations in Azure troubleshooting. The VM shows as "Running" in the portal but you get nothing on the RDP connection, just a timeout. The Boot Diagnostics feature in the portal is your first stop. Under your VM's settings, go to Help → Boot diagnostics → Screenshot. This gives you a screenshot of what the VM's console actually shows, and that image tells you almost everything.
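If you want the screenshot without opening the portal, the Azure CLI can hand you a link to it. A minimal sketch, assuming boot diagnostics is already enabled on the VM; `boot_screenshot_uri` is a made-up helper name, and the `consoleScreenshotBlobUri` field is what I'd expect in the command's output.

```shell
# Print a download URI for the latest boot diagnostics console screenshot.
# Usage: boot_screenshot_uri <resource-group> <vm-name>
# (boot_screenshot_uri is a hypothetical helper name.)
boot_screenshot_uri() {
  az vm boot-diagnostics get-boot-log-uris --resource-group "$1" --name "$2" \
    --query consoleScreenshotBlobUri --output tsv
}
```

The URI is short-lived, so download the image right away if you need it for a support ticket.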
Common boot failure screens and what they mean:
- "BOOTMGR is missing": The Boot Configuration Data store is corrupt. This often happens after an aggressive disk resize operation or a failed Windows Update.
- "BitLocker recovery key required": The VM has BitLocker enabled and Azure can't automatically unlock the drive. You need to provide the recovery key through the Boot Diagnostics serial console.
- Blue screen with STOP 0x0000007B: Inaccessible boot device. Common after changing VM size to a series with different storage controllers, or after migrating disks between VMs.
- "Boot Configuration Update Error": BCD update failed during a Windows patch. The fix is repairing the BCD from a recovery environment.
For BCD corruption, you need to attach the OS disk as a data disk to a "repair VM" (a second Azure Windows VM), fix the BCD from there, then reattach it. Here are the commands to repair the BCD once you've booted the repair VM and identified the drive letter (let's say it's F:):
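rem bootrec.exe ships with the Windows Recovery Environment, so run these three from a WinRE prompt
bootrec /fixmbr
bootrec /fixboot
bootrec /rebuildbcd
rem bcdboot is available in full Windows; this assumes the boot files live on F: itself
bcdboot F:\Windows /s F: /f ALL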
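Creating the repair VM and swapping the disk back can be scripted too. This sketch assumes you've installed the Azure CLI's `vm-repair` extension (`az extension add --name vm-repair`); the two function names are hypothetical wrappers around its `create` and `restore` commands.

```shell
# Create a repair VM with a copy of the broken VM's OS disk attached as a data disk.
# Usage: create_repair_vm <resource-group> <broken-vm-name> <repair-admin-user> <repair-admin-password>
create_repair_vm() {
  az vm repair create --resource-group "$1" --name "$2" \
    --repair-username "$3" --repair-password "$4" --verbose
}

# After fixing the disk, swap the repaired copy back onto the original VM.
# Usage: restore_repaired_disk <resource-group> <broken-vm-name>
restore_repaired_disk() {
  az vm repair restore --resource-group "$1" --name "$2" --verbose
}
```

The flow is: create the repair VM, RDP into it, fix the attached disk, then run the restore step against the original VM's name.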
For BitLocker, navigate to the VM's Boot diagnostics → Serial console in the portal. At the SAC prompt, type cmd to create a command channel, then ch -si 1 to switch to it, and run manage-bde -unlock C: -RecoveryPassword <your-48-digit-key>. You can find the recovery key in Azure Key Vault if your organization stored it there during encryption setup.
Azure Disk Encryption for Windows VMs uses BitLocker under the hood, integrated with Azure Key Vault. When it breaks, the error messages are often opaque. The most common failure modes I've seen in the field are: Key Vault access policy missing the VM's managed identity, the encryption extension failing to install, or a disk being re-encrypted when it already had encryption applied (causing a metadata conflict).
First, verify the disk encryption status. In the portal, go to your VM → Disks and check the "Encryption" column for each disk. You can also run this from Azure CLI:
az vm encryption show --name <YourVMName> --resource-group <YourRG>
If the status shows "VMRestartPending," the VM needs a reboot to complete encryption. That's expected behavior, not an error. If it shows "EncryptionFailed," you need to look at the extension logs inside the VM at C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.Security.AzureDiskEncryption\.
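The two statuses above map to two different next steps, which is easy to encode if you're checking many VMs. A tiny sketch; `encryption_next_step` is a hypothetical helper, and it only knows about the two statuses this guide covers.

```shell
# Translate an encryption status string into the next step from this guide.
# Usage: encryption_next_step <status>
# (encryption_next_step is a hypothetical helper name.)
encryption_next_step() {
  case "$1" in
    VMRestartPending)
      printf '%s\n' 'Restart the VM to complete encryption (expected, not an error)' ;;
    EncryptionFailed)
      printf '%s\n' 'Check the logs in C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.Security.AzureDiskEncryption\' ;;
    *)
      printf '%s\n' "No next step described here for status: $1" ;;
  esac
}
```

Feed it whatever status you read out of `az vm encryption show` for the disk in question.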
For Virtual Disk Management issues, the Azure Managed Disks FAQ covers a lot of the "why can't I resize this disk" and "why is my snapshot not working" territory. A critical thing to understand: you cannot resize an OS disk while the VM is running. You must Stop (deallocate) the VM first. A restart is not enough; you have to actually deallocate it so Azure releases the compute resources. Then resize the disk in Disks → Size + performance, and start the VM again. Inside the OS, you'll still need to extend the partition using Disk Management or PowerShell:
# Extend the OS partition (C:) into the space added by the disk resize
$maxSize = (Get-PartitionSupportedSize -DriveLetter C).SizeMax
Resize-Partition -DriveLetter C -Size $maxSize
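The Azure-side half of the resize (deallocate, grow the disk, start again) can be sketched with the Azure CLI. `resize_os_disk` is a hypothetical helper name, and it assumes you already know the managed disk's name and that it lives in the same resource group as the VM.

```shell
# Deallocate the VM, grow the managed disk, and start the VM again.
# Each step runs only if the previous one succeeded.
# Usage: resize_os_disk <resource-group> <vm-name> <disk-name> <new-size-gb>
# (resize_os_disk is a hypothetical helper name.)
resize_os_disk() {
  az vm deallocate --resource-group "$1" --name "$2" &&
    az disk update --resource-group "$1" --name "$3" --size-gb "$4" &&
    az vm start --resource-group "$1" --name "$2"
}
```

For example, `resize_os_disk my-rg my-vm my-vm-osdisk 256` grows the disk to 256 GB. You still have to extend the partition inside Windows afterwards.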
Also remember: the temporary drive on Azure VMs (usually the D: drive) is ephemeral. It's wiped on every deallocation and redeployment. Never store anything important there. I've seen entire database transaction log directories get wiped because someone assumed D: was persistent. Don't be that person.
Azure Windows VM performance troubleshooting starts with one tool: Performance Diagnostics. Find it in the Azure portal under your VM → Help → Performance diagnostics. Click Run diagnostics, choose "Quick performance analysis" for a fast read or "Continuous performance analysis" if the issue is intermittent. The tool runs inside the VM via an extension and generates a detailed report covering CPU, memory, disk I/O, and network.
The report will flag issues like:
- High CPU utilization with a process breakdown, so you know if it's svchost, antivirus, or a runaway application
- Disk queue depth above 1.0, which indicates the storage tier can't keep up with I/O demands
- Network packet loss or excessive retransmits
- Memory pressure causing excessive paging to disk
If Performance Diagnostics points to a specific process eating CPU, RDP into the VM and open Resource Monitor (search for resmon in the Start menu). The CPU tab breaks usage down per process and lets you drill into the associated services, handles, and modules, which is more granular than Task Manager. Sort by "Average CPU" to find the real offender.
One thing that catches people off guard: Azure VM CPU throttling. If you're on a burstable VM series (like B-series), your VM has a CPU credit system. When credits run out, performance drops dramatically. Check your VM's "CPU Credits Remaining" metric in the portal under Monitoring → Metrics. If that number is at or near zero, you need to either right-size to a non-burstable SKU or reduce your baseline CPU load.
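You can pull the same credit metric from the command line. A sketch under a couple of assumptions: `cpu_credits_remaining` is a made-up helper name, and you pass it the VM's full resource ID (which `az vm show --query id` will give you).

```shell
# Show hourly averages of the "CPU Credits Remaining" metric for a B-series VM.
# Usage: cpu_credits_remaining <vm-resource-id>
# (cpu_credits_remaining is a hypothetical helper name.)
cpu_credits_remaining() {
  az monitor metrics list --resource "$1" \
    --metric "CPU Credits Remaining" \
    --interval PT1H --aggregation Average --output table
}
```

A series that trends down to zero during your busy hours is the signature of credit exhaustion rather than a runaway process.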
For persistent high CPU caused by the Windows Update service (TiWorker.exe or wuauserv), set the Windows Update service to manual startup during peak hours and schedule updates during maintenance windows. In an enterprise setup, use Azure Update Manager to control patch scheduling centrally rather than letting each VM update independently.
# Check top CPU-consuming processes via PowerShell
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10 Name, CPU, WorkingSet
VM extensions are powerful, and fragile. They run as SYSTEM inside the guest OS, communicate back to the Azure fabric through the VM Agent, and when they fail, the failure modes range from "the extension just won't install" to "the extension installed but silently does nothing." I've seen monitoring gaps last weeks because a failed extension update wasn't caught.
The first thing to check is the VM Agent status. In the portal, under your VM → Properties, look for the "Agent status" field. It should say "Ready." If it says "Not Ready" or "Unknown," the Azure VM Agent inside the guest OS has stopped communicating with the Azure fabric, and no extension will work until that's fixed.
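The same agent status is exposed through the VM's instance view, so it's easy to check from a script. A minimal sketch; both function names are hypothetical, and "Ready" is the healthy value described above.

```shell
# Print the VM agent's display status as reported by the Azure fabric.
# Usage: vm_agent_status <resource-group> <vm-name>
vm_agent_status() {
  az vm get-instance-view --resource-group "$1" --name "$2" \
    --query "instanceView.vmAgent.statuses[0].displayStatus" --output tsv
}

# Succeeds only when the agent reports "Ready".
# Usage: agent_is_ready <resource-group> <vm-name>
agent_is_ready() {
  [ "$(vm_agent_status "$1" "$2")" = "Ready" ]
}
```

Something like `agent_is_ready my-rg my-vm || echo "fix the agent before touching extensions"` makes a handy guard at the top of a deployment script.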
To check the agent inside the VM, open Services (services.msc) and look for Windows Azure Guest Agent. It should be running and set to Automatic. If it's stopped, start it. If it fails to start, check the logs at:
C:\WindowsAzure\Logs\WaAppAgent.log
C:\WindowsAzure\Logs\TransparentInstaller.log
For specific extension failures, the logs live here (replace <ExtensionName> with the actual extension):
C:\WindowsAzure\Logs\Plugins\<ExtensionName>\
The Custom Script Extension for Windows is particularly prone to failure when scripts have dependencies that aren't met, like trying to download from an internet URL when the VM has no outbound internet access, or when the script requires elevated privileges that the extension's execution context doesn't have. Always test your Custom Script Extension scripts locally with SYSTEM-level privileges before deploying through Azure.
If an extension is stuck in a "Transitioning" state for more than 20 minutes, you can force-remove it through the portal (VM → Extensions + applications → select the extension → Uninstall) and reinstall. Sometimes the Azure fabric loses track of extension state and a clean reinstall is the only path forward. After removing, wait 5 minutes before reinstalling, which lets the fabric fully clean up the old state.
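Listing extension states and removing a stuck one can also be done with the Azure CLI. A sketch with hypothetical helper names; the extension name you pass to the second function is whatever the first one shows in its `name` column.

```shell
# Show each extension on the VM with its provisioning state.
# Usage: list_extension_states <resource-group> <vm-name>
list_extension_states() {
  az vm extension list --resource-group "$1" --vm-name "$2" \
    --query "[].{name:name, state:provisioningState}" --output table
}

# Remove one extension by name so it can be reinstalled cleanly.
# Usage: remove_extension <resource-group> <vm-name> <extension-name>
remove_extension() {
  az vm extension delete --resource-group "$1" --vm-name "$2" --name "$3"
}
```

After `remove_extension`, give the fabric the same 5-minute pause before you reinstall.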
Advanced Troubleshooting
When the standard fixes don't resolve your Azure Windows VM issue, you need to go deeper. These are the techniques I pull out for the genuinely stubborn cases: the ones that survive a reboot, a redeploy, and an extension reinstall.
Serial Console Access is the most powerful tool you're probably not using. If your VM won't boot or RDP is completely unreachable, the Azure Serial Console gives you a text-based console directly to the VM's COM port, no networking required. Enable it under VM → Help → Serial console. At the SAC (Special Administration Console) prompt, type cmd to get a command channel, then ch -si 1 to open a CMD session. From there you can check and modify the registry, manage services, and repair boot files without needing RDP at all.
Registry-level fixes for RDP are sometimes necessary when the RDP listener itself is broken. Connect via Serial Console and run:
REG ADD "HKLM\SYSTEM\CurrentControlSet\Control\Terminal Server" /v fDenyTSConnections /t REG_DWORD /d 0 /f
REG ADD "HKLM\SYSTEM\CurrentControlSet\Control\Terminal Server\WinStations\RDP-Tcp" /v PortNumber /t REG_DWORD /d 3389 /f
netsh advfirewall firewall set rule group="remote desktop" new enable=Yes
Group Policy conflicts are a major source of pain on domain-joined Azure VMs. If your VM recently applied a new GPO and now something is broken (RDP blocked, a service disabled, a registry key locked), you need to identify which policy made the change. Run gpresult /h C:\gpreport.html and open the HTML report. It shows every GPO that applied, in order, with the specific settings each one enforced. Cross-reference this with what changed the day the problem started.
Event Viewer correlation across multiple logs is how you find the root cause when a symptom shows up in one log but the actual cause is in another. Set up a Custom View in Event Viewer covering System, Application, and Security logs filtered to the time window when the problem occurred. Export the filtered log (Action → Save Filtered Log File As) and open it in a text editor or Excel to look for event sequences, such as a disk error in System logs 30 seconds before a service crash in Application logs.
Network-level diagnostics for VMs that can reach some endpoints but not others: use the Network Watcher service in Azure. Go to Network Watcher → IP flow verify and test specific source IP + destination IP + port combinations. This tells you exactly which NSG rule is allowing or blocking traffic, far faster than manually reading through NSG rule lists. Also run Connection troubleshoot from Network Watcher to test end-to-end connectivity from the VM to a target host, which will surface routing issues that NSG inspection alone won't catch.
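IP flow verify has a CLI equivalent, which is handy when you're testing several source addresses in a row. A sketch assuming Network Watcher is enabled in the VM's region; `verify_inbound_rdp` is a hypothetical helper name, and the remote port is just an arbitrary client-side port you pick for the test.

```shell
# Ask Network Watcher whether inbound TCP traffic to port 3389 would be allowed.
# Usage: verify_inbound_rdp <resource-group> <vm-name> <vm-private-ip> <client-ip> <client-port>
# (verify_inbound_rdp is a hypothetical helper name.)
verify_inbound_rdp() {
  az network watcher test-ip-flow --resource-group "$1" --vm "$2" \
    --direction Inbound --protocol TCP \
    --local "$3:3389" --remote "$4:$5"
}
```

For example, `verify_inbound_rdp my-rg my-vm 10.0.0.4 203.0.113.10 50000` should report Allow or Deny along with the name of the rule that made the decision.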
Multiple certificates on IaaS VMs that use extensions can cause extension failures that look like networking problems. If your VM has several certificates in the machine store and extensions are failing with certificate validation errors, check C:\WindowsAzure\Logs\Plugins\ for "certificate not found" or "SSL validation failed" messages. The fix is typically re-pushing the correct certificates through the VM's certificate configuration in the portal under Operations → Configuration.
If you've tried redeploy, password reset, Boot Diagnostics repair, and Serial Console access and the VM is still non-functional, especially if it's a production system, it's time to escalate. Open a support ticket directly in the Azure portal under Help + support → New support request. Choose "Technical" as the issue type and "Virtual Machine running Windows" as the service. Include the VM's resource ID, the time the issue started, and any Event IDs you've found. For business-critical systems, choose Severity A (Critical business impact) to get a response within one hour. You can also reach Microsoft Support directly, though the portal route gets you Azure-specialized engineers faster. If you need developer-level help with Azure PowerShell or Azure CLI commands related to your VM, open a GitHub issue at https://github.com/Azure/azure-powershell/issues instead; the product teams monitor those actively.
Prevention & Best Practices
The best Azure Windows VM troubleshooting session is the one you never have to do. After years of working with Azure VMs, I've boiled down the most impactful preventive measures to a handful of practices that consistently prevent the most common failures.
Enable Boot Diagnostics before you need it. This sounds obvious, but I've seen countless situations where a VM needed Boot Diagnostics to recover and it wasn't enabled. Turn it on at VM creation time. It's under Management → Boot diagnostics in the VM creation wizard. Enable it with a managed storage account (the default option). It costs almost nothing and is invaluable when a VM won't start.
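For VMs that already exist, turning Boot Diagnostics on is a one-liner you can loop over. A minimal sketch; `enable_boot_diagnostics` is a hypothetical helper name, and leaving out a storage account is my assumption for getting the managed storage option.

```shell
# Enable boot diagnostics on an existing VM (no storage account given,
# so the platform-managed storage option is used).
# Usage: enable_boot_diagnostics <resource-group> <vm-name>
# (enable_boot_diagnostics is a hypothetical helper name.)
enable_boot_diagnostics() {
  az vm boot-diagnostics enable --resource-group "$1" --name "$2"
}
```

Run it across every VM in a resource group once and you won't be caught without a screenshot later.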
Use Azure Backup with application-consistent snapshots. Configure backup in the Azure portal under your VM → Backup. Application-consistent backups for Windows VMs use VSS (Volume Shadow Copy Service) to ensure the backup is taken in a state the OS and applications can actually recover from, not just a raw disk snapshot that might have partial writes. Set a daily backup schedule with at least 7 days of retention for any VM running workloads you care about.
Monitor the Azure VM Agent health proactively. The VM Agent is the backbone of everything: extensions, diagnostics, and password reset all depend on the agent being healthy. Set up an Azure Monitor alert on the "VM Availability" metric for your critical VMs. If the agent stops reporting, you get notified before users start complaining about RDP failures.
Keep your NSG rules documented and reviewed quarterly. NSG rule sprawl is real. Teams add "allow all" rules for debugging and forget to remove them. Other teams block ports for security reasons without documenting the impact. Do a quarterly review of your NSG inbound and outbound rules. Use the Network Watcher → Security group view to get a consolidated view of all NSG rules affecting a VM; it aggregates subnet-level and NIC-level rules in one place.
Don't store data on the temporary disk. The temporary disk (D: drive by default on most Azure Windows VM sizes) is tied to the physical host node. When the VM is deallocated or redeployed, that disk is wiped. Move your page file there (Azure recommends it for performance), but keep all application data, logs, and database files on managed data disks that are explicitly attached to the VM and backed up.
Test extension deployments in staging first. Before pushing a new Custom Script Extension or any other VM extension to production VMs, test it on a staging VM with identical configuration. Extensions run with SYSTEM privileges and a poorly written script can corrupt system state in ways that are difficult to reverse without disk-level recovery.
- Enable Boot Diagnostics on every Azure Windows VM today, before you need it
- Set an NSG rule explicitly allowing RDP only from your known IP ranges, not from 0.0.0.0/0
- Configure Azure Monitor alerts for VM availability so you know about outages before users do
- Store the BitLocker recovery key in Azure Key Vault at encryption time; you will need it someday
Frequently Asked Questions
My Azure Windows VM shows as "Running" in the portal but I can't RDP into it, what's happening?
This is one of the most common Azure Windows VM troubleshooting scenarios. "Running" in the portal only means the VM's compute resources are allocated; it doesn't confirm that RDP is functional or that the OS is responsive. Start by checking your NSG rules in the VM's Networking blade to ensure TCP 3389 is allowed from your IP. Then check the Boot Diagnostics screenshot to see if the OS is actually at a login screen or stuck on a blue screen. If the NSG looks fine and the screenshot shows a normal login screen, the problem is usually a firewall rule inside the guest OS blocking RDP. Connect via Serial Console and run netsh advfirewall firewall set rule group="remote desktop" new enable=Yes to re-enable it.
How do I reset my Azure VM admin password if I'm completely locked out?
You don't need RDP access to reset the password. The Azure portal's Reset Password tool pushes changes directly through the VM Agent, bypassing the RDP path entirely. Go to your VM in the portal → Help → Reset password, select "Reset password," enter a new username and strong password, and click Update. The change takes effect within 30–60 seconds. If the VM Agent itself is broken (which you can check under Properties → Agent status), this method won't work and you'll need to attach the OS disk to a repair VM and reset the SAM database offline, which is a more involved process documented in Microsoft's offline password reset guide.
Why does my Azure Windows VM keep rebooting on its own?
Unexpected reboots on Azure Windows VMs typically have one of three causes: Windows Update installed a patch requiring a restart and the auto-restart policy triggered it, the Azure platform performed planned maintenance on the underlying host (you get advance notice in the Azure portal under Service Health → Planned maintenance), or the Windows OS itself crashed with a BSOD and auto-restarted. To find out which it was, check Event Viewer → Windows Logs → System and filter for Event ID 1074 (user-initiated or update restart) and Event ID 41 (unexpected shutdown/crash). If you see Event ID 41 with Kernel-Power as the source, the VM crashed; check the minidump files in C:\Windows\Minidump\ for the crash analysis.
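If RDP is flaky because of the reboots, you can pull those two Event IDs remotely with Run Command instead. This is a sketch that assumes the VM agent is healthy; `recent_restart_events` is a hypothetical helper name, and the PowerShell it sends is one reasonable way to filter the System log, not the only one.

```shell
# Run a PowerShell query inside the VM for the 20 most recent
# restart-related events (IDs 1074 and 41) in the System log.
# Usage: recent_restart_events <resource-group> <vm-name>
# (recent_restart_events is a hypothetical helper name.)
recent_restart_events() {
  az vm run-command invoke --resource-group "$1" --name "$2" \
    --command-id RunPowerShellScript \
    --scripts "Get-WinEvent -FilterHashtable @{LogName='System'; Id=1074,41} -MaxEvents 20 | Format-List TimeCreated, Id, ProviderName, Message"
}
```

Entries with ID 1074 name the process and reason behind a requested restart, while ID 41 entries mark the crashes worth chasing in the minidumps.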
How do I fix the "Your credentials did not work" error when connecting to an Azure Windows VM via RDP?
This error almost always points to one of three things: an expired or incorrect password, an account lockout, or a credential type mismatch. First, try resetting the password via the Azure portal's Reset Password blade. If that doesn't help, check if the account is locked out by connecting via Serial Console and running net user <username>. If it shows "Account active: No" or "Account locked," run net user <username> /active:yes to re-enable it. For domain-joined VMs, the issue could be a stale Kerberos ticket or a time skew between the VM and domain controller; force a time sync with w32tm /resync /force from an elevated command prompt.
Can I get help with Azure Virtual Machine issues from the community without opening a paid support ticket?
Yes, Microsoft Q&A is the official community support destination for Azure VM questions. Post your question at learn.microsoft.com/answers using the tag azure-virtual-machines and you'll get responses from Microsoft engineers, Azure MVPs, and experienced community members. For issues specifically related to Azure tooling like Azure PowerShell, Azure CLI, or the Azure SDKs, the GitHub issue trackers are better: github.com/Azure/azure-powershell/issues for PowerShell and github.com/Azure/azure-cli/issues for CLI. These are actively monitored by product teams. Community support is best for non-urgent issues; if you're down in production, open a paid support request in the portal for guaranteed response times.
What's the difference between stopping and deallocating an Azure Windows VM, and why does it matter for troubleshooting?
This distinction trips up a lot of people and it directly affects several troubleshooting steps. "Stopping" a VM from inside the OS (Start → Shut down) puts it in a Stopped state: Azure keeps the compute resources allocated, you continue to pay for the VM SKU, and the VM stays on the same host node. "Deallocating" via the portal's Stop button (or via az vm deallocate) releases the compute resources: you stop paying for compute, but the VM may come back on a different host node when you start it again. Deallocation is required for disk resizing, changing the VM SKU, and certain networking changes. It also resets the temporary D: drive. For RDP or boot troubleshooting, a full Stop (deallocate) + Start cycle is much more effective than just restarting the OS, because it can move the VM to a fresh host node, much like the Redeploy feature.
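You can tell the two states apart from the power state string that `az vm show --show-details --query powerState --output tsv` prints. Here is a small sketch that turns that string into the billing consequence described above; `describe_power_state` is a hypothetical helper name.

```shell
# Explain what a VM power state string means for compute billing.
# Usage: describe_power_state "<power-state>"
# (describe_power_state is a hypothetical helper name.)
describe_power_state() {
  case "$1" in
    "VM stopped")
      echo "Stopped from inside the OS: compute is still allocated and billed" ;;
    "VM deallocated")
      echo "Deallocated: compute is released and no longer billed" ;;
    "VM running")
      echo "Running: compute is allocated and billed" ;;
    *)
      echo "Unrecognized or transitional state: $1" ;;
  esac
}
```

If a VM you thought was shut down reports "VM stopped", run `az vm deallocate` on it to actually release the compute.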