How to Troubleshoot Azure Linux Virtual Machines
Why This Is Happening
You SSH into your Azure Linux VM and get nothing. Connection refused. Timeout. Or maybe the VM itself shows as "Running" in the portal but you can't reach it. I've seen this scenario play out on hundreds of enterprise deployments, and the maddening part is that Azure's error messages rarely tell you what actually broke.
Linux VM issues on Azure tend to cluster around a few core culprits. The first and most common is SSH connectivity failure: either your Network Security Group (NSG) is blocking port 22, the sshd daemon crashed inside the VM, or a misconfigured /etc/ssh/sshd_config is rejecting your key. The second major category is boot failures: the VM won't start at all, usually because of a corrupt GRUB configuration, a failed /etc/fstab mount entry, or a kernel panic after the last update cycle. Third comes storage: the root disk filling to 100% is far more common than people expect, and Linux VMs on Azure with small OS disks (the default 30 GB in many marketplace images) hit this faster than you'd think.
Then there's the Azure Linux Agent (waagent). This agent is responsible for provisioning, extension execution, and communication with the Azure fabric. When it stops responding or gets corrupted, you lose the ability to run Custom Script Extensions, reset passwords through the portal, or get accurate health data. Azure marks the VM as "unresponsive" even though Linux itself is running fine underneath.
Network-level problems round out the picture. Azure VNets, subnets, NSG rules, and even peered network configurations can all silently block traffic. A firewall rule added by a well-meaning teammate, an NSG applied at the subnet level rather than the NIC, or an asymmetric routing issue in a hub-and-spoke topology can each make your VM unreachable without a single error message explaining why.
I know this is frustrating, especially when this VM is holding up a production workload or a CI/CD pipeline. The good news is that Azure gives you tools like Boot Diagnostics, Serial Console, and the Run Command feature that can reach a broken VM even when SSH is completely dead. We're going to use all of them.
The Quick Fix: Try This First
Before you go deep into diagnostics, run through this 90-second checklist in the Azure portal. It resolves roughly 40% of Azure Linux VM connectivity issues without touching the VM itself.
Open the Azure Portal at portal.azure.com, navigate to Virtual Machines, and click your VM. On the left sidebar, go to Help > Boot diagnostics. If you see a kernel panic message or a GRUB prompt frozen on screen, that's your answer: skip to Step 3 (Boot Diagnostics). If the screenshot shows a normal login prompt, the OS is alive and your problem is network or SSH.
Next, click Networking in the left sidebar. Look at the inbound port rules table. Find the rule allowing port 22 (SSH). If it doesn't exist, that's your problem. Click Add inbound port rule, set Destination port ranges to 22, protocol TCP, and a low priority number such as 300, then hit Add. Wait 30 seconds and try SSH again.
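If you prefer the CLI to the portal, the same rule can be sketched roughly as below. The function name, the rule name AllowSshInbound, and the AZ override (set AZ=echo to print the command instead of running it) are my own illustrative assumptions; the resource group, NSG name, and source address are placeholders you supply.

```shell
# Sketch: add an inbound allow rule for SSH to an existing NSG.
# AZ defaults to the real Azure CLI; AZ=echo gives a dry run (assumption for illustration).
allow_ssh_from() {
    # $1 = resource group, $2 = NSG name, $3 = source IP or CIDR (all placeholders)
    ${AZ:-az} network nsg rule create \
        --resource-group "$1" \
        --nsg-name "$2" \
        --name AllowSshInbound \
        --priority 300 \
        --direction Inbound \
        --access Allow \
        --protocol Tcp \
        --source-address-prefixes "$3" \
        --destination-port-ranges 22
}
```

Restricting the source to your own public IP, rather than leaving it open to any address, keeps port 22 from being exposed to the whole internet while you troubleshoot.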
If port 22 is allowed but SSH still fails, go to Help > Reset password in the left sidebar. Choose Reset SSH public key or Reset password depending on your auth method. This operation uses the Azure VM Agent to rewrite /etc/ssh/sshd_config and restart the sshd service from outside the VM. It takes about 60–90 seconds and fixes a surprising number of "SSH just stopped working" situations.
Finally, check your VM's Activity Log (left sidebar > Activity log). Filter the last 24 hours. Look for any operations that say Deallocate virtual machine, Resize virtual machine, or failed extension deployments. These operations can change your VM's public IP address or corrupt the agent state.
Step 1: Check VM Status and Agent Health
The Azure portal's VM overview page shows you a Status field. "Running" means the hypervisor thinks the VM is on. It doesn't mean Linux booted successfully or that the VM Agent is responsive. Pay attention to two additional fields: Agent status and VM health.
If Agent status shows Not Ready, the Azure Linux Agent (waagent) inside the VM is either crashed, misconfigured, or the VM itself never finished booting. If VM health shows Degraded or you see health alerts in the Insights blade, there's an active platform issue Azure is already aware of.
Check the Service Health dashboard to rule out an Azure regional outage:
https://status.azure.com/en-us/status
Also run this Azure CLI command to pull detailed instance view information, which gives you far more than the portal UI:
az vm get-instance-view \
--resource-group <your-rg> \
--name <your-vm-name> \
--query "instanceView.statuses[*].{Code:code, Display:displayStatus}" \
--output table
You're looking for two codes: PowerState/running confirms the VM is on, and ProvisioningState/succeeded means the last provisioning operation completed cleanly. If you see ProvisioningState/failed, go back to the Activity Log (left sidebar > Activity log) to find the specific error. Failed provisioning states often block subsequent operations like password reset or extension execution.
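To turn those two codes into a quick verdict in a script, something like the following sketch can help. It assumes the status codes arrive one per line on stdin, for example from az vm get-instance-view with --query "instanceView.statuses[].code" --output tsv; the function name and the wording of the messages are hypothetical.

```shell
# Reads instance-view status codes (one per line) and prints a one-line verdict.
classify_instance_view() {
    power=unknown
    prov=unknown
    while IFS= read -r code; do
        case "$code" in
            PowerState/*)        power=${code#PowerState/} ;;
            ProvisioningState/*) prov=${code#ProvisioningState/} ;;
        esac
    done
    # Matches any provisioning state that starts with "failed".
    case "$prov" in
        failed*) echo "provisioning failed; check the Activity Log"; return 0 ;;
    esac
    if [ "$power" != "running" ]; then
        echo "VM is not running (power state: $power)"
    else
        echo "platform view looks healthy; continue with network checks"
    fi
}
```

Keeping the parsing separate from the az call means you can feed it saved output from an earlier run when comparing states before and after a fix.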
If everything looks healthy here but you still can't SSH in, move on to Step 2. The problem is almost certainly at the network layer.
Step 2: Diagnose SSH and Network Connectivity
SSH failures on Azure Linux VMs almost always trace back to one of three places: the NSG, the VM's host-based firewall (firewalld or ufw), or the sshd configuration itself. Let's go through them in order.
Check NSG rules: In the portal, go to your VM > Networking. Click Effective security rules, which shows you the merged view of all NSG rules applied at both the NIC and subnet level. Look for any rule with a lower priority number than your SSH allow rule that has Action: Deny and covers port 22. NSG rules are evaluated lowest priority number first, so a Deny at priority 100 beats an Allow at priority 300.
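The first-match-wins behavior is easy to get backwards, so here is a deliberately simplified model of it. It assumes each rule is reduced to three fields (priority, action, port) and ignores source, destination, direction, and port ranges, all of which real NSG evaluation also considers; the function name is hypothetical.

```shell
# Reads "priority action port" lines on stdin and prints the action of the
# lowest-numbered rule that matches the requested port ("*" matches any port).
effective_action() {
    sort -n | awk -v port="$1" '
        $3 == port || $3 == "*" { print $2; found = 1; exit }
        END { if (!found) print "NoMatch" }
    '
}

# Example: a Deny at priority 100 wins over an Allow at priority 300.
printf '300 Allow 22\n100 Deny 22\n65500 Deny *\n' | effective_action 22
```

Removing the priority-100 line from the example input flips the result to Allow, which is exactly the situation you want to reach in the Effective security rules view.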
Use the IP flow verify tool: In the portal, go to your VM > Help > Connection troubleshoot. Enter your source IP (your workstation's public IP, check it at whatismyip.com), destination port 22, protocol TCP. Click Check. Azure will tell you exactly which NSG rule is allowing or blocking the connection. This is the fastest way to identify a subnet-level NSG blocking you.
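The same check is available from the CLI through Network Watcher, which needs to be enabled in the VM's region. A minimal sketch follows; the function name, the AZ=echo dry-run override, and the remote source port 60000 (an arbitrary ephemeral port) are assumptions for illustration, and all four arguments are placeholders.

```shell
# Sketch: ask Network Watcher whether inbound TCP to port 22 would be allowed.
check_ssh_flow() {
    # $1 = resource group, $2 = VM name, $3 = VM private IP, $4 = your public IP
    ${AZ:-az} network watcher test-ip-flow \
        --resource-group "$1" \
        --vm "$2" \
        --direction Inbound \
        --protocol TCP \
        --local "$3:22" \
        --remote "$4:60000"
}
```

The response names the rule that made the decision, so a subnet-level Deny shows up here even when the NIC-level NSG looks clean.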
If NSG is clear, use Run Command to check sshd: In the portal, go to Operations > Run command, select RunShellScript, and run:
systemctl status sshd
journalctl -u sshd --since "1 hour ago" --no-pager | tail -30
Look for lines like error: Could not load host key or Authentication refused: bad ownership or modes. The first means the SSH host keys are missing or corrupt; fix it with ssh-keygen -A. The second, which usually shows up on the client side as Permission denied (publickey,gssapi-keyex), means your public key isn't in the right place or has the wrong permissions. The ~/.ssh/authorized_keys file should be mode 600 and the ~/.ssh directory mode 700, both owned by the login user.
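Since Run Command executes as root, a small helper like this sketch can put the permissions back without an interactive session. The function name is hypothetical and the home directory is a placeholder; it only fixes modes, so if ownership is also wrong you would still need a chown to the login user.

```shell
# Sketch: restore the modes sshd expects on a user's .ssh directory.
fix_ssh_permissions() {
    # $1 = home directory of the affected user (placeholder)
    ssh_dir="$1/.ssh"
    if [ ! -d "$ssh_dir" ]; then
        echo "no .ssh directory under $1" >&2
        return 1
    fi
    chmod 700 "$ssh_dir" || return 1
    if [ -f "$ssh_dir/authorized_keys" ]; then
        chmod 600 "$ssh_dir/authorized_keys" || return 1
    fi
    return 0
}
```

A call such as fix_ssh_permissions /home/azureuser (the username being whatever you provisioned) is usually enough to clear a bad-modes rejection.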
If sshd is fully dead, restart it via Run Command: systemctl restart sshd. If it refuses to start, check the config: sshd -t will validate /etc/ssh/sshd_config and print the exact line causing the problem.
Step 3: Use Boot Diagnostics and Serial Console
When a VM won't boot at all, and SSH is obviously impossible, Azure's Serial Console is your lifeline. It gives you a direct console connection to the VM's serial port, completely bypassing the network. I cannot overstate how useful this is.
First, enable Boot Diagnostics if it's not already on: VM > Settings > Boot diagnostics > Enable with managed storage account > Save. Then access Serial Console: VM > Help > Serial console. You'll get a terminal window directly in your browser.
If you land at a login: prompt, type your username and password (or use the root account if password auth is enabled). If you're stuck at a GRUB menu, that's actually a good sign: it means the bootloader is alive. Use the arrow keys to select your kernel entry and press e to edit. Find the line starting with linux and append systemd.unit=rescue.target to the end. Press Ctrl+X to boot into rescue mode.
Common Serial Console findings and fixes:
If you see FAILED to mount /<mountpoint>, a bad /etc/fstab entry is preventing boot. In rescue mode, remount root as read-write and edit fstab:
mount -o remount,rw /
nano /etc/fstab
Comment out the offending line (add # at the start) and reboot.
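If no editor is available in rescue mode, the same change can be scripted. This is a minimal sketch assuming you know the mount point of the bad entry; the function name is hypothetical, and it keeps a .bak copy next to the file so the edit is reversible.

```shell
# Sketch: comment out the fstab entry for one mount point, keeping a backup.
comment_out_mount() {
    # $1 = path to fstab, $2 = mount point to disable (placeholder)
    cp -p "$1" "$1.bak" || return 1
    awk -v mp="$2" '
        $1 !~ /^#/ && $2 == mp { print "#" $0; next }   # disable the matching entry
        { print }                                        # pass everything else through
    ' "$1.bak" > "$1"
}
```

For mounts that are optional rather than wrong, adding the nofail option to the entry is an alternative that lets boot continue when the device is missing.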
If you see a kernel panic with VFS: Unable to mount root fs, the OS disk may have filesystem corruption. Run fsck against the root partition while it is not mounted read-write, for example from rescue mode or with the disk attached to a repair VM:
fsck -y /dev/sda1
Replace sda1 with your actual root partition; check with lsblk first. After fsck completes, reboot with systemctl reboot.
If the screen is completely black with no output, the Boot Diagnostics screenshot will show you what the VM's console looked like at that moment. Download the screenshot from VM > Help > Boot diagnostics > Screenshot.
Step 4: Fix a Full Disk
A 100%-full disk on an Azure Linux VM is one of the nastiest situations to recover from remotely, because every command that writes to disk, including creating log files, will fail silently or with cryptic errors. SSH sessions may appear to connect but immediately drop. The sudo command may fail to write to /var/log/auth.log. Things get weird fast.
First, confirm disk usage via Run Command (this works even when SSH is broken):
df -hT
du -sh /var/log/* 2>/dev/null | sort -rh | head -20
du -sh /tmp/* 2>/dev/null | sort -rh | head -10
The most common space hogs on Azure Linux VMs are: journal logs in /var/log/journal, old kernel packages, /tmp files, Docker images and containers (if Docker is installed), and core dump files in /var/crash.
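When you are checking several VMs, it helps to reduce the df output to just the filesystems that are in trouble. The sketch below assumes POSIX-format input (df -P) on stdin and a threshold in percent that defaults to 90; the function name and the default are my assumptions, and mount points containing spaces are not handled.

```shell
# Reads `df -P` output on stdin and prints mount points at or above the threshold.
full_filesystems() {
    awk -v limit="${1:-90}" '
        NR > 1 {
            use = $5
            sub(/%/, "", use)                 # "100%" -> "100"
            if (use + 0 >= limit + 0) print $6 " " use "%"
        }
    '
}
```

Typical use is df -P | full_filesystems 95, which prints nothing at all when every filesystem is below the limit.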
Quick space recovery via Run Command:
# Shrink the systemd journal by deleting archived journal files until it is under 100 MB
journalctl --vacuum-size=100M
# Clean package cache and unused packages (Ubuntu/Debian)
apt-get clean && apt-get autoremove -y
# Clean package cache (RHEL/CentOS/Rocky)
dnf clean all
# Remove old kernels and their config files (Ubuntu); -y because Run Command is non-interactive
apt-get autoremove --purge -y
# Find and remove core dumps
find /var/crash -name "*.crash" -delete
find /var/crash -name "core.*" -delete
If you need a long-term fix, expand the OS disk. In the portal, go to your VM > Settings > Disks. Click the OS disk, then Size + performance. Increase the disk size (you can only increase, never decrease). Depending on the disk and VM type, Azure may require you to stop (deallocate) the VM before it accepts the new size. Save, then back in your VM use Run Command to expand the partition:
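# For LVM-based systems (common on RHEL-family marketplace images)
# The partition number and the volume group / logical volume names below are examples; check yours with lsblk and lvs
growpart /dev/sda 1
pvresize /dev/sda1
lvextend -l +100%FREE /dev/ubuntu-vg/ubuntu-lv
# resize2fs handles ext filesystems only; on XFS use xfs_growfs <mount point> instead
resize2fs /dev/ubuntu-vg/ubuntu-lv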
For non-LVM systems, growpart followed by resize2fs on the raw partition (or xfs_growfs on the mount point for XFS) is usually sufficient. Confirm with df -hT afterward.
Step 5: Repair the Azure Linux Agent (waagent)
The Azure Linux Agent (waagent) is the bridge between your VM and Azure's management plane. When it breaks, you lose password reset from the portal, Custom Script Extensions fail silently, and VM health monitoring goes dark. The agent runs as a system service and talks to the Azure fabric through the WireServer endpoint at 168.63.129.16. The Instance Metadata Service (IMDS) at 169.254.169.254 is a separate platform endpoint, but it makes a convenient reachability check from inside the VM.
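Check agent status via Serial Console or Run Command. The service is named waagent on RHEL-family images and walinuxagent on Ubuntu/Debian, so substitute the name accordingly: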
systemctl status waagent
waagent --version
journalctl -u waagent --since "24 hours ago" --no-pager | grep -E "ERROR|WARN|CRITICAL"
If the agent is stopped, try restarting it first:
systemctl restart waagent
systemctl enable waagent
If waagent fails to start or logs errors about being unable to reach the platform endpoints, first check that the IMDS endpoint at 169.254.169.254 is reachable from inside the VM:
curl -s -H "Metadata: true" \
"http://169.254.169.254/metadata/instance?api-version=2021-02-01" | python3 -m json.tool
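If that curl fails, a host-based firewall rule is the most likely culprit. Check iptables or firewalld for rules touching 169.254.169.254 or the WireServer address 168.63.129.16:
# Check iptables rules that mention either platform endpoint
iptables -L -n | grep -E '169\.254|168\.63'
# If using firewalld
firewall-cmd --list-all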
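To narrow the iptables output down to rules that actually drop traffic to the platform endpoints, a filter along these lines can help. It is a sketch that assumes rule-specification format on stdin (as printed by iptables -S); the function name is hypothetical, and a rule that blocks a broader range not written as 169.254.x.x or 168.63.129.16 would slip past it.

```shell
# Reads `iptables -S` output on stdin and prints DROP/REJECT rules that
# mention the IMDS range (169.254.x.x) or the WireServer address (168.63.129.16).
platform_endpoint_rules() {
    grep -E '169\.254\.|168\.63\.129\.16' | grep -E -- '-j (DROP|REJECT)'
}
```

Run it as iptables -S | platform_endpoint_rules; an empty result with a non-zero exit status means no matching rule was found, so the block is probably somewhere else.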
If waagent is completely corrupted, reinstall it. For Ubuntu/Debian:
apt-get install --reinstall walinuxagent
systemctl restart walinuxagent
For RHEL/CentOS/Rocky Linux:
dnf reinstall WALinuxAgent
systemctl restart waagent
After reinstalling, verify the agent is reporting back to Azure: in the portal, refresh your VM overview page and check that Agent status now shows Ready. This can take up to 5 minutes for the heartbeat to register.
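Rather than refreshing the portal by hand, you can poll for the Ready state. The helper below is a sketch: the function name is hypothetical, and it takes the status command as arguments so the polling logic stays independent of how the status is fetched. One way to fetch it, assuming the Azure CLI is logged in, is az vm get-instance-view with --query "instanceView.vmAgent.statuses[0].displayStatus" --output tsv.

```shell
# Sketch: poll a command until it prints "Ready" or the attempts run out.
wait_for_agent_ready() {
    # $1 = max attempts, $2 = seconds between attempts,
    # remaining arguments = command that prints the agent status
    attempts=$1
    delay=$2
    shift 2
    i=1
    status=""
    while [ "$i" -le "$attempts" ]; do
        status=$("$@" 2>/dev/null) || status=""
        if [ "$status" = "Ready" ]; then
            echo "agent ready after $i check(s)"
            return 0
        fi
        sleep "$delay"
        i=$((i + 1))
    done
    echo "agent still not ready after $attempts checks (last status: ${status:-none})" >&2
    return 1
}
```

Ten attempts at 30-second intervals covers the five-minute window mentioned above.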
Advanced Troubleshooting
When the standard fixes don't cut it, these techniques go deeper. Most of them require either Azure CLI access or the ability to attach and detach managed disks.
Repair VM using az vm repair (AZ CLI extension): Microsoft's VM repair extension is genuinely excellent and underused. It automatically creates a repair VM, attaches your broken OS disk as a data disk, and lets you fix it from a working environment. Run this from the Azure Cloud Shell or your local CLI:
az extension add --name vm-repair
az vm repair create \
--resource-group <your-rg> \
--name <broken-vm-name> \
--repair-username repairadmin \
--repair-password "<SecurePassword123!>" \
--verbose
This creates a repair VM in a new resource group. SSH into that repair VM and you'll find your broken disk attached as a data disk, usually at /dev/sdc. Mount it, fix the problem (a bad fstab, missing SSH keys, whatever it turns out to be), then run:
az vm repair restore \
--resource-group <your-rg> \
--name <broken-vm-name> \
--verbose
Analyze the systemd journal (the Event Viewer equivalent): On Linux VMs, the counterpart to Windows Event Viewer is journalctl. Pull the previous boot cycle's logs via Run Command:
journalctl -b -1 --no-pager | tail -100
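The -b -1 flag pulls logs from the previous boot, which is extremely useful when the VM rebooted unexpectedly and you want to know what happened right before it went down. It only works when the journal is stored persistently, meaning a /var/log/journal directory exists.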
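A hundred lines of journal output is still a lot to read under pressure, so a filter for the usual suspects can save time. The pattern list in this sketch is my own starting point rather than an exhaustive catalog, and the function name is hypothetical.

```shell
# Reads journal text on stdin and keeps lines that commonly precede a crash or forced reboot.
crash_indicators() {
    grep -E -i 'kernel panic|out of memory|oom-kill|i/o error|watchdog|segfault'
}
```

A typical invocation is journalctl -b -1 --no-pager | crash_indicators | tail -20, which shows the last few suspicious lines before the previous shutdown.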
Domain-joined or AAD-joined VMs: If your Linux VM is joined to Azure Active Directory (using the AAD SSH Login extension), authentication issues often stem from the aad_auth PAM module or an expired conditional access policy. Check extension status in the portal: VM > Extensions + applications. If AADSSHLoginForLinux shows as Failed, try removing and re-adding it. Also verify the user has the Virtual Machine Administrator Login or Virtual Machine User Login RBAC role assigned at the VM resource level, not just the subscription.
Accelerated Networking conflicts: If your VM size supports Accelerated Networking and it's enabled, certain older kernel versions can cause the Mellanox VF (virtual function) driver to fail on boot. You'll see the VM as running but completely network-unreachable. Serial Console will show mlx5_core errors. The fix is to disable Accelerated Networking temporarily: VM > Networking > Network interface > Enable accelerated networking > Off. Update the kernel, then re-enable it.
Escalate to Microsoft Support when: (1) your VM shows Unhealthy in Azure Monitor and you're seeing host-level events like Host Node Fault or Unexpected node failure in the Activity Log, which point to a platform-side hardware issue that nothing you do inside the VM will fix; (2) the Azure VM repair extension fails repeatedly with internal errors; (3) your VM is on an Isolated SKU (like Standard_D96as_v4) and is stuck in a Stopping state for more than 30 minutes, which requires Microsoft to intervene manually at the host level; or (4) you're seeing data corruption on managed disks that fsck cannot repair, which may indicate a storage platform problem.
Prevention & Best Practices
Most Azure Linux VM disasters are preventable. I've watched teams spend entire nights recovering from issues that a handful of proactive configurations would have prevented entirely. Here's what actually matters.
Always use static private IPs for production VMs. Dynamic private IPs assigned by DHCP within the VNet can technically change when a VM is deallocated and reallocated. While Azure tries to preserve them, it's not guaranteed. Go to VM > Networking > Network interface > IP configurations and switch the private IP to Static. Do the same for the public IP if you're using one.
Configure Azure Monitor and set up alerts before you have problems. Go to VM > Monitoring > Insights and enable VM Insights. Then set up an alert rule on the metric Percentage CPU > 90% and another on Available Memory Bytes < 500 MB. Add a Disk Read Bytes/sec alert too. These take 15 minutes to configure and will save you hours of reactive firefighting.
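The CPU alert can also be created from the CLI, which makes it easy to apply the same rule to every VM. This is a sketch under a few assumptions: the alert name vm-cpu-high, the 5-minute window, and the 1-minute evaluation frequency are illustrative choices, the AZ=echo override is only there for dry runs, and the resource group and VM resource ID are placeholders.

```shell
# Sketch: create a metric alert that fires when average CPU stays above 90%.
create_cpu_alert() {
    # $1 = resource group, $2 = full resource ID of the VM (both placeholders)
    ${AZ:-az} monitor metrics alert create \
        --name vm-cpu-high \
        --resource-group "$1" \
        --scopes "$2" \
        --condition "avg Percentage CPU > 90" \
        --window-size 5m \
        --evaluation-frequency 1m \
        --description "Average CPU above 90% over 5 minutes"
}
```

Without an action group attached the alert only shows up in the portal, so you would normally add one to get notified.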
Keep the Azure Linux Agent updated. Add a cron job or use Azure Update Manager to regularly update walinuxagent. An outdated agent breaks extension compatibility and causes the "Agent not ready" state that blocks portal-based recovery operations.
Test your SSH access path at least monthly. Put a calendar reminder to SSH in from a fresh terminal window and verify key-based authentication works. If you're relying on a bastion host or VPN, test that whole path, not just the final hop. Many teams discover SSH is broken only when there's an emergency.
Use Azure Backup with a daily snapshot policy. Go to your VM > Operations > Backup. A standard policy with daily backups and 30-day retention costs almost nothing for typical VM sizes and gives you a clean recovery point for catastrophic disk corruption or accidental data deletion.
- Set private IP addresses to Static on all production VMs: it takes 2 minutes and eliminates an entire category of "VM unreachable after restart" incidents
- Enable Boot Diagnostics on every VM at creation time: if you turn it on only after a failure, the output from the boot that already failed may be missing, and you'll want it the moment something goes wrong
- Add a dedicated data disk for application logs and data: never let application workloads compete with the OS disk for space, and keep your 30 GB OS disk for the OS only
- Set up a Just-in-Time (JIT) access policy under VM > Configuration > Just-in-time VM access to limit SSH exposure: JIT opens port 22 only when you request it and auto-closes it after a configurable timeout
Frequently Asked Questions
My Azure Linux VM says "Running" but I can't SSH into it. What do I check first?
Start with the NSG rules under VM > Networking > Effective security rules. A deny rule at a lower priority number will override your SSH allow rule even if both exist. Then use the Connection troubleshoot tool in the portal (VM > Help > Connection troubleshoot) to do an IP flow verify from your public IP to port 22; Azure will tell you exactly which rule is blocking or allowing the traffic. If the NSG is clear, use Run Command (VM > Operations > Run command) to check whether sshd is actually running: systemctl status sshd. A crashed sshd can look identical to an NSG block from the outside.
How do I fix a Linux VM stuck in a boot loop after a kernel update?
Open the Azure portal Serial Console (VM > Help > Serial console). You should see a GRUB menu; if the timeout is too short, press a key quickly during boot to catch it. Select your previous kernel version (it'll say something like 5.15.0-91-generic instead of the latest one) and boot from that. Once you're in, keep the broken kernel from booting by default: edit /etc/default/grub, set GRUB_DEFAULT to the working kernel entry, then run update-grub (on Debian/Ubuntu; RHEL-family systems use grub2-mkconfig instead). If you can't catch the GRUB menu, use the az vm repair CLI extension to mount the OS disk externally and edit the GRUB configuration from a working repair VM.
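The GRUB_DEFAULT edit is a one-line change, so it scripts well. The sketch below assumes the usual KEY=value layout of the grub defaults file; the function name is hypothetical, the entry value you pass in is whatever matches your own menu, and values containing | or & would need escaping before being handed to sed.

```shell
# Sketch: set GRUB_DEFAULT in a grub defaults file, keeping a backup.
set_grub_default() {
    # $1 = path to the defaults file, $2 = menu entry to boot by default
    cp -p "$1" "$1.bak" || return 1
    if grep -q '^GRUB_DEFAULT=' "$1.bak"; then
        # Replace the existing setting in place.
        sed "s|^GRUB_DEFAULT=.*|GRUB_DEFAULT=\"$2\"|" "$1.bak" > "$1"
    else
        # No setting yet: append one.
        { cat "$1.bak"; printf 'GRUB_DEFAULT="%s"\n' "$2"; } > "$1"
    fi
}
```

For example, set_grub_default /etc/default/grub "1>2" selects the third entry of the second top-level menu item (GRUB counts from zero); regenerate the GRUB configuration afterward for the change to take effect.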
The Azure portal's "Reset Password" option isn't working on my Linux VM: it just spins and fails
Password/SSH key reset through the portal requires the Azure Linux Agent (waagent) to be running and healthy inside the VM. If the agent is down, crashed, or reporting Not Ready, portal-based resets will always fail. Connect via Serial Console (VM > Help > Serial console) to get a direct console connection and manually edit /etc/ssh/sshd_config or use passwd <username> to reset credentials. After fixing whatever caused the agent to fail, reinstall it with apt-get install --reinstall walinuxagent (Ubuntu) or dnf reinstall WALinuxAgent (RHEL/Rocky), then restart the waagent service.
My /etc/fstab has a bad entry and the VM won't boot. How do I fix it without physical access?
Use the Serial Console. Go to VM > Help > Serial console and press Enter to get a login prompt. Log in as root or your sudo user. If the system is stuck in emergency mode due to the fstab failure, it may drop you to a root shell automatically; look for a prompt asking for the root password to proceed with maintenance. Once you have a shell, remount the root filesystem as read-write with mount -o remount,rw /, then edit the fstab with nano /etc/fstab. Comment out the bad line (add # at the start). Run mount -a to verify no errors remain, then systemctl reboot. The offending line is almost always a network share or external mount point that doesn't exist in Azure's environment.
How do I troubleshoot an Azure Linux VM that has no internet access even though the NSG looks fine?
NSG rules only cover network ingress and egress at the Azure fabric layer. If your VNet uses a custom DNS server, a user-defined route table (UDR), or a forced-tunneling configuration that sends all outbound traffic through an NVA (Network Virtual Appliance) or Azure Firewall, your VM's outbound internet traffic may be getting dropped at a different layer. From inside the VM (via Run Command or Serial Console), run curl -v https://www.microsoft.com and check whether DNS resolves and the TCP handshake completes. Then inspect the routes Azure actually applies: UDRs are enforced by the platform and do not show up in the guest's ip route show output, so open the network interface's Effective routes view in the portal or run az network nic show-effective-route-table. If the 0.0.0.0/0 route has a next hop type of Virtual appliance or Virtual network gateway rather than Internet, forced tunneling is active and your NVA or Azure Firewall is probably dropping the traffic. Check the Azure Firewall or NVA logs for denied traffic matching your VM's private IP.
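Because user-defined routes are applied by the Azure platform rather than inside the guest, the NIC's effective route table is the place to look for the default route's next hop. The sketch below assumes tab-separated input with three columns (address prefix, next hop type, state), which you can get from az network nic show-effective-route-table with --query "value[].[addressPrefix[0], nextHopType, state]" --output tsv; the function name is hypothetical.

```shell
# Reads "prefix<TAB>nextHopType<TAB>state" lines and prints the next hop type
# of the active default route, or "none" if there is no active 0.0.0.0/0 route.
default_route_next_hop() {
    awk -F '\t' '
        $1 == "0.0.0.0/0" && $3 == "Active" { print $2; found = 1; exit }
        END { if (!found) print "none" }
    '
}
```

A result of Internet means outbound traffic leaves directly; anything else means it is being steered through another hop, and that hop is where to look for drops.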
How do I recover an Azure Linux VM where the root disk is 100% full and I can't run any commands?
Even on a 100%-full disk, Run Command in the Azure portal can usually still execute scripts because it doesn't depend on your SSH session. Go to VM > Operations > Run command > RunShellScript. Run journalctl --vacuum-size=50M first; this is usually the fastest way to reclaim significant space, and it only removes old journal entries, not application data. Then run apt-get clean or dnf clean all to purge package caches. If that's not enough, identify large files with du -sh /var/log/* /tmp/* /var/crash/* 2>/dev/null | sort -rh | head -20 and delete the ones you are sure you don't need. For a permanent fix, expand the OS disk through the portal (VM > Disks > select OS disk > Size + performance) and then use growpart and resize2fs to extend the filesystem.