How to Fix Azure Load Balancer

Microsoft Fix Intermediate 14 min read Official Docs Grounded Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why This Is Happening

I've spent years staring at Azure dashboards at 2 AM watching traffic mysteriously stop reaching backend VMs, and I can tell you that Azure Load Balancer troubleshooting is one of those problems that feels deceptively simple until it absolutely isn't. Your VMs are running, your app is deployed, your firewall rules look fine. And yet: nothing gets through. Users report timeouts. Your monitoring goes red. Sound familiar?

Here's what's actually going on under the hood. Azure Load Balancer, especially the Standard SKU, which is what you should be running, operates on a five-tuple hash: source IP, source port, destination IP, destination port, and protocol. When something in that chain breaks, the load balancer doesn't just slow down. It silently drops packets. No error. No alert by default. Just... silence.
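
To make the five-tuple idea concrete, here is a minimal sketch of hash-based distribution. It is an illustration only: Azure does not publish its hash function, so SHA-256 stands in for it, and the backend IPs and client addresses are hypothetical. The real load balancer also only distributes across backends that are passing their probes.

```python
import hashlib
from collections import Counter

def pick_backend(src_ip, src_port, dst_ip, dst_port, protocol, backends):
    """Map one flow's five-tuple to a backend. Same tuple in, same backend out."""
    key = f"{src_ip}|{src_port}|{dst_ip}|{dst_port}|{protocol}".encode()
    # SHA-256 is a stand-in (assumption); Azure's real hash is not public.
    digest = hashlib.sha256(key).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

backends = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]  # hypothetical backend IPs

# One client, one frontend, 1000 different ephemeral source ports.
spread = Counter(
    pick_backend("203.0.113.7", port, "198.51.100.10", 443, "tcp", backends)
    for port in range(50000, 51000)
)
print(spread)  # every new source port is a new flow, so flows spread out
```

The takeaway for troubleshooting: a single client does not stick to a single VM by default, because each new connection uses a new source port and therefore hashes independently.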

The three most common culprits I see again and again are:

Health probe failures: The load balancer's health probe is marking your backend VMs as unhealthy, so it stops sending traffic to them entirely. The VMs are up, your app is running, but the probe itself can't reach the configured port. Traffic gets blackholed.
Network Security Group (NSG) blocking: An NSG attached to your VM NIC or subnet is silently dropping the probe traffic or the data-plane traffic. This is the single most common mistake I see from teams migrating from Basic to Standard SKU.
SKU mismatch between the Load Balancer and Public IP address: If your Load Balancer is Standard SKU but your public IP is Basic (or vice versa), Azure will reject the configuration outright with a cryptic validation error, and no traffic flows at all.

There are also subtler problems: SNAT port exhaustion killing outbound connections, misconfigured backend pool membership (the VM is there but its NIC isn't in the pool), inbound NAT rules conflicting with load balancing rules, and session persistence settings behaving unexpectedly with stateful workloads.

I know this is frustrating, especially when it's blocking a production deployment or a customer-facing workload. The good news is that almost every Azure Load Balancer connectivity issue falls into a known category, and this guide walks through each one systematically. No guesswork.

The Quick Fix: Try This First

Before diving deep, run this fast triage sequence. In my experience, these four checks resolve about 65% of Azure Load Balancer troubleshooting cases without needing to go further.

Step 1: Check the health probe status right now. In the Azure Portal, go to your Load Balancer resource → Monitoring → Insights → Backend Health. If any VM shows "Unhealthy," that's your smoking gun. The load balancer has already removed it from rotation. Traffic won't reach it, period.

Step 2: Verify your NSG allows probe traffic. Azure's health probe source IP is 168.63.129.16. That's a special Azure platform IP, not a public IP you'd recognize. The default NSG rule AllowAzureLoadBalancerInBound permits it, but any custom Deny rule with a higher priority overrides that default. If inbound traffic from 168.63.129.16 to your probe port ends up denied, the probe fails, and your backend looks unhealthy even when the VM is perfectly fine.

Step 3: Confirm the Public IP SKU matches. In the Portal, open your Load Balancer → Overview. Note whether it says "Standard" or "Basic" SKU. Then go to your Frontend IP configuration and check the associated Public IP address: its SKU must match the load balancer's. Azure normally rejects a mismatched pairing when you save the configuration, so a mismatch usually shows up as a failed deployment rather than as dropped traffic.

Step 4: Check that your backend VMs are actually "Running" and passing their OS-level checks. Go to Virtual Machines in the Portal, find each backend pool member, and confirm the status is "Running", not "Stopped (deallocated)" or "Stopping." A stopped VM still appears in the backend pool membership list but receives no traffic.

If those four checks don't immediately surface the problem, keep reading: the step-by-step section covers every layer.

Pro Tip

When a backend VM starts failing health probes, the Standard Load Balancer removes it from rotation but doesn't log this as an error in the activity log by default. Set up a metric alert on the Health Probe Status metric (split by BackendIPAddress) right now, before the next outage, so you get paged the moment a probe flips to 0. This turns a mystery into a five-second diagnosis.

Step-by-Step Solution

Diagnose Health Probe Failures in the Azure Portal

The health probe is the load balancer's only way of knowing whether a backend VM can actually serve traffic. If the probe fails, traffic stops. Full stop. This is where I start every Azure Load Balancer troubleshooting session, and it catches the problem about half the time.

Navigate to your Load Balancer in the Azure Portal. In the left-hand menu, go to Settings → Health probes. Note the probe protocol (HTTP, HTTPS, or TCP), the port number, and the path (for HTTP/HTTPS probes). These must exactly match what your application is actually listening on inside the VM.

Common mistakes I see constantly:

The probe is set to HTTP port 80, but the app only listens on HTTPS port 443
The probe path is /health but the endpoint returns HTTP 404 (the probe requires 200 OK)
The probe port is correct, but Windows Firewall on the guest OS is blocking it
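
If you want to reproduce what an HTTP probe sees, the following is a rough sketch of the pass/fail rule described above: only a 200 response counts as healthy. It is not Azure's probe implementation, and the function name, host, port, and path are placeholders you would replace with your own values. It uses http.client so that redirects are not followed and a 301 or 404 counts as a failure.

```python
import http.client

def http_probe(host, port, path="/", timeout=5.0):
    """Return True only when GET <path> answers with HTTP 200 in time."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Refused connections, timeouts, and malformed replies all count as failures.
        return False
    finally:
        conn.close()

# Example call from a machine that can reach the backend (values are placeholders):
# http_probe("10.0.0.4", 80, "/health")
```

Run it from another VM in the same subnet. If it returns False there too, the problem is the application or the guest firewall, not the load balancer.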

To confirm what your VM is actually listening on, open a Bastion session or connect via serial console, then run:

# On Windows VMs:
netstat -ano | findstr LISTENING

# On Linux VMs:
ss -tlnp

Compare the output to your probe port. If the port isn't in that list, your application isn't listening, and the probe will always fail, no matter what you do to the load balancer config.
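
For a TCP probe, the check boils down to whether a connection to the port can be opened. Here is a small sketch of that check that you could run from a neighboring VM; the function name and the example address are assumptions for illustration, and the timeout is arbitrary.

```python
import socket

def tcp_probe(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False

# Example call (placeholder address and port):
# tcp_probe("10.0.0.4", 443)
```

A refused connection usually means nothing is listening on the port; a timeout more often points at a firewall or NSG dropping the packets.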

Also check Windows Firewall specifically. Even if your NSG allows the traffic, the guest-level firewall can block it independently. Run this in an elevated PowerShell session on the backend VM:

Get-NetFirewallRule -Enabled True -Direction Inbound |
  Select-Object DisplayName, Action, @{Name='LocalPort'; Expression={($_ | Get-NetFirewallPortFilter).LocalPort}}

If you see your probe port listed with Action = "Block," that's the culprit. After you fix it, you'll see the health status flip back to Healthy in the portal within two to three probe intervals (default interval is 15 seconds).

Allow Azure Probe Traffic Through NSG Rules

This is the number-one cause of an Azure Load Balancer backend pool not responding, especially for teams upgrading from Basic to Standard SKU. With Basic SKU, NSGs are optional. With Standard SKU, all VMs in the backend pool are locked down by default: no inbound traffic gets in unless NSGs explicitly permit it. That's a security feature, not a bug. But it catches teams off guard every single time.

Azure's health probes originate from a single platform IP: 168.63.129.16. This is a virtual IP used by the Azure fabric for management traffic. It doesn't show up in your routing tables as a recognizable address, which is why teams deny it without realizing what they're doing.

To fix this, go to your NSG in the Portal. Check both the NIC-level NSG and the subnet-level NSG, because either one can block the probe. Under Inbound security rules, add:

Source: Service Tag → AzureLoadBalancer
Source port ranges: *
Destination: Any
Destination port ranges: Your health probe port (e.g., 80 or 443)
Protocol: TCP
Action: Allow
Priority: Use a lower number than any Deny rule that would match this traffic (lower number = higher priority)

The AzureLoadBalancer service tag automatically includes 168.63.129.16, so you don't need to hardcode the IP, and Microsoft can update the underlying range without breaking your rule.
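
The priority point is easier to see in code. This is a deliberately simplified model of NSG evaluation, where rules are checked in ascending priority order and the first match wins. Real NSGs also match on protocol, address prefixes, and port ranges; the Rule class, the rule names, and the priorities below are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Rule:
    name: str
    priority: int   # lower number is evaluated first
    source: str     # a service tag, or "*" for any
    port: str       # a single port as text, or "*" for any
    action: str     # "Allow" or "Deny"

def evaluate(rules, source, port):
    """Return the first rule that matches, checking lowest priority number first."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.source in ("*", source) and rule.port in ("*", str(port)):
            return rule
    return None

broken = [
    Rule("DenyAllCustom", 200, "*", "*", "Deny"),
    Rule("AllowProbe", 300, "AzureLoadBalancer", "80", "Allow"),
]
fixed = [
    Rule("AllowProbe", 100, "AzureLoadBalancer", "80", "Allow"),
    Rule("DenyAllCustom", 200, "*", "*", "Deny"),
]
print(evaluate(broken, "AzureLoadBalancer", 80).name)  # DenyAllCustom
print(evaluate(fixed, "AzureLoadBalancer", 80).name)   # AllowProbe
```

In the first list the allow rule exists but never gets a chance to match, which is exactly the situation that makes a probe fail while the rule list "looks fine."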

You also need to allow the actual data-plane traffic, the traffic from end users reaching your VMs through the load balancer. Azure Load Balancer does not rewrite the source address of inbound flows, so that traffic arrives at the VM with the client's original source IP. Add an inbound rule allowing your application port from the appropriate source, such as the Internet service tag or specific client ranges.

After saving the rule, give it 30–60 seconds to propagate, then recheck Backend Health in the portal. Healthy VMs show a green checkmark.

Verify Backend Pool Configuration and VM NIC Association

The load balancer only sends traffic to VMs whose network interfaces are members of the backend pool, not just VMs that exist in the same VNet. I've seen this trip up entire teams: someone adds a new VM to the resource group, assumes it's in the pool, and wonders why traffic never reaches it.

Go to your Load Balancer → Settings → Backend pools. Click on your backend pool. You'll see a list of IP configurations. Each entry corresponds to a specific NIC IP configuration on a VM, not just the VM itself. If a VM has multiple NICs, only the NIC explicitly listed here receives load-balanced traffic.

To add a VM that's missing, click + Add, select the VM, and choose the correct NIC and IP configuration. Save and wait for the update to propagate (usually under 60 seconds for Standard SKU).

For Virtual Machine Scale Sets (VMSS), the membership is managed differently. Go to your VMSS → Networking → Load balancing and confirm the correct load balancer and backend pool are associated there. Individual VMSS instance NICs are automatically registered when the VMSS is properly associated, so you don't add them one by one.

Also confirm IP forwarding settings. If you're running a network virtual appliance (NVA) or custom routing in your backend, IP forwarding must be enabled on the NIC:

# Check IP forwarding state via Azure CLI
az network nic show \
  --resource-group MyResourceGroup \
  --name MyVMNic \
  --query "enableIpForwarding"

If that returns false and your architecture requires forwarding, set it to true. After all changes, re-verify backend health status and try sending test traffic from outside the VNet.

Audit Load Balancing Rules and Frontend IP Configuration

Even with healthy backends and correct NSGs, misconfigured load balancing rules mean traffic never makes it from the frontend to the backend. Inbound NAT rule problems and load balancing rule conflicts are the third most common issue I troubleshoot.

Go to Settings → Load balancing rules. For each rule, confirm:

Frontend IP configuration: Is the correct public or private IP selected? If you have multiple frontends, it's easy to configure a rule against the wrong one.
Frontend port and backend port: They don't have to match, but if they differ, make sure your application is listening on the backend port specified, not the frontend port.
Backend pool: Points to the correct pool containing your VMs.
Health probe: References the health probe you verified earlier in the health probe section.
Session persistence: Default is "None" (five-tuple hash). If you need sticky sessions, change it to "Client IP" (two-tuple). Mismatched persistence settings cause stateful applications like shopping carts or API sessions to break mid-session.
Floating IP (Direct Server Return): Unless you're running SQL Server Always On or a specific NVA scenario, keep this disabled. Enabling it changes how the VM receives traffic and breaks applications that aren't designed for DSR.
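
To show why the session persistence setting matters for stateful apps, here is a sketch that hashes on different fields depending on the mode. It assumes the two modes named above; the hash function (SHA-256), the field names, and the addresses are illustrative stand-ins rather than Azure's actual implementation.

```python
import hashlib

# Which flow fields feed the hash in each mode (names are illustrative).
FIELDS_BY_MODE = {
    "None": ("src_ip", "src_port", "dst_ip", "dst_port", "protocol"),  # five-tuple
    "Client IP": ("src_ip", "dst_ip"),                                 # two-tuple
}

def pick_backend(flow, backends, mode="None"):
    """Choose a backend from only the fields the persistence mode looks at."""
    key = "|".join(str(flow[field]) for field in FIELDS_BY_MODE[mode]).encode()
    digest = hashlib.sha256(key).digest()  # stand-in hash (assumption)
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]

backends = ["10.0.0.4", "10.0.0.5", "10.0.0.6"]  # hypothetical backend IPs
first = {"src_ip": "203.0.113.7", "src_port": 50123,
         "dst_ip": "198.51.100.10", "dst_port": 443, "protocol": "tcp"}
second = dict(first, src_port=50999)  # same client, new connection

sticky = (pick_backend(first, backends, "Client IP")
          == pick_backend(second, backends, "Client IP"))
print(sticky)  # True: source port is ignored, so the client stays on one VM
```

With mode "None" the two connections may or may not land on the same VM, which is why in-memory sessions break unless persistence is enabled or session state is externalized.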

Also check your Inbound NAT rules for conflicts. If you have a NAT rule forwarding port 443 to a specific VM, and a load balancing rule also uses port 443 on the same frontend, the two overlap. Each frontend IP, protocol, and port combination should belong to exactly one rule, so give the NAT rule its own frontend port (for example, 50443 mapped to backend port 443).

Run a quick validation using the Azure CLI to list all rules in one shot:

az network lb rule list \
  --resource-group MyResourceGroup \
  --lb-name MyLoadBalancer \
  --output table

Review the output for duplicate frontend ports or misconfigured backend ports. Fix any issues in the Portal and retest connectivity.
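
If you'd rather not eyeball the table, you can save the rules as JSON (--output json) and let a script find the overlaps. This is a sketch under assumptions: the field names frontendPort, protocol, and frontendIPConfiguration are what I'd expect in the CLI's JSON output, but check them against your own output, since the exact casing can differ between CLI versions. Pass load balancing rules and inbound NAT rules together to catch cross-type conflicts.

```python
from collections import defaultdict

def find_frontend_conflicts(rules):
    """Group rules by (frontend, protocol, port); return groups with more than one rule."""
    seen = defaultdict(list)
    for rule in rules:
        # Field names are assumptions; both casings are tried for the frontend reference.
        frontend = (rule.get("frontendIPConfiguration")
                    or rule.get("frontendIpConfiguration")
                    or {}).get("id", "")
        key = (frontend.lower(),
               str(rule.get("protocol", "")).lower(),
               rule.get("frontendPort"))
        seen[key].append(rule.get("name"))
    return {key: names for key, names in seen.items() if len(names) > 1}

# Hypothetical sample shaped like the CLI's JSON output.
sample = [
    {"name": "https-lb", "protocol": "Tcp", "frontendPort": 443,
     "frontendIPConfiguration": {"id": "/fe/public"}},
    {"name": "https-nat-vm1", "protocol": "Tcp", "frontendPort": 443,
     "frontendIPConfiguration": {"id": "/fe/public"}},
    {"name": "http-lb", "protocol": "Tcp", "frontendPort": 80,
     "frontendIPConfiguration": {"id": "/fe/public"}},
]
print(find_frontend_conflicts(sample))
```

An empty result means no two rules claim the same frontend, protocol, and port.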

Diagnose and Resolve SNAT Port Exhaustion

This one is sneaky. Azure Load Balancer SNAT port exhaustion kills outbound connections from your backend VMs, not inbound ones, so your service might appear healthy from the outside while your VMs can't reach external APIs, databases, or update servers. Users report intermittent failures. Logs show connection timeouts. And nothing in the health probe shows anything wrong, because the probe is inbound.

SNAT (Source Network Address Translation) is how Azure translates your VM's private IP to a public IP for outbound traffic. With a Standard Load Balancer and no explicit outbound rules, each backend VM gets a limited number of ephemeral SNAT ports, and under heavy load, or if connections aren't being properly closed, those ports get exhausted.

To check if you're hitting SNAT exhaustion, go to your Load Balancer → Monitoring → Metrics. Add the SNAT Connection Count metric and split it by Connection State. Then chart the Used SNAT Ports metric against Allocated SNAT Ports. If failed connections are spiking while used ports sit at or near the allocated ceiling, you've confirmed SNAT exhaustion.

The fix: create an explicit Outbound Rule with a higher port allocation. Go to Settings → Outbound rules → + Add:

# Alternatively via CLI:
az network lb outbound-rule create \
  --resource-group MyResourceGroup \
  --lb-name MyLoadBalancer \
  --name MyOutboundRule \
  --frontend-ip-configs myFrontendIP \
  --protocol All \
  --outbound-ports 10000 \
  --address-pool MyBackendPool

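Setting --outbound-ports to 10,000 gives each VM significantly more SNAT ports than the default allocation. The allocation has to fit the frontend: ports per instance multiplied by the number of backend instances can't exceed what the frontend IPs provide, so 10,000 ports per instance on a single frontend IP only covers about six instances. For high-throughput workloads or larger pools, add public IP addresses to your frontend; each one contributes roughly 64,000 more SNAT ports for distribution. After applying, monitor the failed count in the SNAT Connection Count metric for 10–15 minutes to confirm it drops to zero.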
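
Before picking a port number, it helps to run the arithmetic. The sketch below treats the per-IP port figure as an input, because that number should come from current Azure documentation rather than from this example; 64,000 is used here as an assumed value, and the function names are my own.

```python
def max_instances(frontend_ips, ports_per_instance, ports_per_ip):
    """How many backend instances a manual per-instance allocation can cover."""
    return (frontend_ips * ports_per_ip) // ports_per_instance

def frontend_ips_needed(instances, ports_per_instance, ports_per_ip):
    """Smallest number of frontend IPs that covers every instance (ceiling division)."""
    total_ports = instances * ports_per_instance
    return -(-total_ports // ports_per_ip)

PORTS_PER_IP = 64_000  # assumed figure; verify against Azure documentation

print(max_instances(1, 10_000, PORTS_PER_IP))         # 6
print(max_instances(2, 10_000, PORTS_PER_IP))         # 12
print(frontend_ips_needed(20, 10_000, PORTS_PER_IP))  # 4
```

If your backend pool is larger than the first number, the outbound rule needs either a smaller per-instance allocation or more frontend IPs.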

Advanced Troubleshooting

If you've worked through all five steps and traffic is still broken, you're in deeper territory. Here's how I approach Azure Load Balancer troubleshooting at the enterprise and infrastructure level.

Use Azure Monitor and Diagnostic Logs

Enable diagnostic settings on your Load Balancer. Go to Monitoring → Diagnostic settings → + Add diagnostic setting. Send all metrics and logs to a Log Analytics workspace. Then query the health probe log:

AzureDiagnostics
| where ResourceType == "LOADBALANCERS"
| where Category == "LoadBalancerProbeHealthStatus"
| where TimeGenerated > ago(1h)
| project TimeGenerated, BackendIPAddress, BackendPort, ProbeHealthStatus
| order by TimeGenerated desc

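This gives you a time-series history of exactly when each backend VM started failing probes. You can correlate this with deployment events, VM restarts, or application crashes. If the query returns nothing, check which log categories and columns your diagnostic setting actually writes to the workspace; they differ by SKU and export configuration.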

Azure Network Watcher: IP Flow Verify and Connection Troubleshoot

Azure Network Watcher is your best diagnostic tool for Azure Standard Load Balancer connectivity issues. Go to Network Watcher in the Portal → Diagnose & solve problems → Connection Troubleshoot. Set the source as your client VM, destination as your load balancer frontend IP, and the target port. Run the check. Network Watcher will trace the exact hop-by-hop path and tell you which NSG rule, UDR, or Azure policy is dropping the packet.

For more granular analysis, use IP Flow Verify under Network Watcher. This tool answers a binary question: "Would a packet with these exact source/destination parameters be allowed or denied by the NSG?" It saves enormous time compared to manually reading through NSG rules.

Domain-Joined and Enterprise Scenarios

If your backend VMs are domain-joined and you're seeing authentication-related failures through the load balancer, check Kerberos SPN configuration. Kerberos is name-sensitive: the client requests a ticket for the SPN derived from the hostname it connects to. When clients reach the service through a load-balanced DNS name instead of an individual VM's name, that name needs an SPN of its own, registered against the account the service runs under on every backend (typically a shared domain service account). A missing or duplicated SPN shows up as a fallback to NTLM or as outright authentication failures.

For environments with Azure Policy or Microsoft Defender for Cloud, policies that require JIT (Just-in-Time) VM access or deny certain inbound ports can interfere with health probes. Check the Policy compliance blade for your VMs and look for "deny" effects on network-related policies.

SKU Downgrade and Migration Issues

You cannot mix Basic and Standard SKU resources in the same load balancer configuration. Moving from Basic to Standard means rebuilding the load balancer; you can't simply change the SKU in place. Use Microsoft's AzureBasicLoadBalancerUpgrade PowerShell module (the Start-AzBasicLoadBalancerUpgrade cmdlet), check whether your Azure CLI version offers an az network lb migrate command (listed as preview as of early 2026), or manually recreate the Standard LB and update references. Plan a maintenance window for the migration, because there will be downtime.

Check for Conflicting User-Defined Routes (UDRs)

If you have UDRs in your subnet route table that redirect traffic to an NVA or VPN gateway, traffic between your VMs and the probe source 168.63.129.16 might be getting redirected away from its intended path. Azure probe traffic should never be sent through an NVA. If a broad route such as 0.0.0.0/0 points at an appliance, add a more specific route for 168.63.129.16/32 with next hop type "Internet" so probe replies bypass it, and verify the result against your topology.
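
The reason a /32 route helps is longest-prefix matching: when several routes contain the destination, the most specific one is used. Here is a minimal sketch of that selection using Python's ipaddress module. The route table is hypothetical, and real Azure route selection also considers route source (user-defined, BGP, system) when prefixes tie, which this sketch ignores.

```python
import ipaddress

def select_route(routes, destination):
    """Return (prefix, next_hop) for the most specific route containing destination."""
    dest = ipaddress.ip_address(destination)
    matches = []
    for prefix, next_hop in routes:
        network = ipaddress.ip_network(prefix)
        if dest in network:
            matches.append((network, next_hop))
    if not matches:
        return None
    network, next_hop = max(matches, key=lambda match: match[0].prefixlen)
    return str(network), next_hop

# Hypothetical route table: default route to an appliance plus a probe exception.
routes = [
    ("0.0.0.0/0", "VirtualAppliance"),
    ("168.63.129.16/32", "Internet"),
]
print(select_route(routes, "168.63.129.16"))  # ('168.63.129.16/32', 'Internet')
print(select_route(routes, "8.8.8.8"))        # ('0.0.0.0/0', 'VirtualAppliance')
```

Everything else still flows through the appliance; only the platform IP takes the exception.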

When to Call Microsoft Support

If you've confirmed healthy probes, correct NSGs, valid backend pool membership, and no SNAT exhaustion, and traffic still doesn't flow, you may be hitting a platform-level issue: fabric routing anomalies, zone-redundancy failover bugs, or a known service incident. Check the Azure Status page first. If no incident is listed, open a support case with Microsoft. Provide your Load Balancer resource ID, the approximate time the issue started, affected backend VM NIC IDs, and any Network Watcher trace results you've already collected. This accelerates the triage significantly. For critical production outages, open a Severity A case at Microsoft Support; don't start at Severity B and wait for it to escalate.

Prevention & Best Practices

Once you've fixed a load balancer issue, the last thing you want is to go through this again at 2 AM. Here's what I put in place for every production Azure environment I manage.

Set up metric alerts before you need them. The most valuable alert you can add is on the Health Probe Status metric. Filter by your load balancer, split by BackendIPAddress, and alert when the average drops below 100 (the metric is reported as a percentage, so anything under 100 means some probes to that backend are failing). Set it to page your on-call team within 5 minutes. This turns a "we've had no traffic for 45 minutes" situation into a "one VM dropped out 3 minutes ago" situation.

Document your probe port and path explicitly. Create an Azure Tag on your load balancer resource with key ProbePort and value set to the actual probe port. It sounds trivial, but six months after deployment, nobody remembers whether the probe is on 80 or 8080, and that ambiguity costs hours during an incident.

Use Standard SKU for all production workloads. Microsoft announced September 30, 2025 as the retirement date for Basic SKU Load Balancers. Standard SKU provides zone redundancy, better diagnostics, outbound rules, and a higher SLA. If you're still on Basic, migrate now, not during a future incident.

Protect long-lived connections. Under Load balancing rules, enable "TCP reset on idle" so both endpoints are told when an idle flow is dropped, and configure "Session persistence" appropriately. Also set an idle timeout value that matches your application's expected connection lifetime. The default of 4 minutes is often too short for long-lived or keep-alive connections; 15 to 30 minutes keeps quiet connections from being torn down between requests.

Regularly audit backend pool membership. When VMs are decommissioned or replaced, their NIC IP configurations sometimes remain in backend pool definitions as stale entries. Set a monthly calendar reminder to audit pool membership; it takes five minutes and prevents confusion during incidents when you're counting healthy backends.
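
The audit itself is a set comparison, so it's easy to script once you've exported the two lists. The sketch below assumes you already have the backend pool's IP configuration resource IDs and the resource IDs of NICs that still exist; how you export them is up to you, and the IDs shown are placeholders.

```python
def stale_pool_entries(pool_ip_config_ids, live_nic_ids):
    """Return pool entries whose parent NIC is not in the list of live NICs."""
    live = {nic_id.lower() for nic_id in live_nic_ids}
    stale = []
    for config_id in pool_ip_config_ids:
        # An IP configuration ID is the NIC's ID followed by /ipConfigurations/<name>.
        nic_id = config_id.lower().split("/ipconfigurations/")[0]
        if nic_id not in live:
            stale.append(config_id)
    return stale

# Placeholder resource IDs for illustration.
NIC_ROOT = "/subscriptions/SUB/resourceGroups/RG/providers/Microsoft.Network/networkInterfaces"
pool = [
    f"{NIC_ROOT}/web1-nic/ipConfigurations/ipconfig1",
    f"{NIC_ROOT}/web2-nic/ipConfigurations/ipconfig1",
]
live = [f"{NIC_ROOT}/web1-nic"]
print(stale_pool_entries(pool, live))  # lists only the web2-nic entry
```

Resource IDs are compared case-insensitively here because different tools can return them with different casing.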

Use Azure Load Balancer Insights as your dashboard. The built-in Insights view (under Monitoring in the Load Balancer blade) shows a pre-built dashboard with data path availability, health probe status, and SNAT connection counts in one view. Pin it to your Azure dashboard for always-on visibility.

Quick Wins

Enable diagnostic logs to Log Analytics now, before you need them for a postmortem
Add the AzureLoadBalancer service tag to NSG inbound rules on every new VM added to a backend pool
Set an alert on the SNAT Connection Count metric, filtered to the Failed connection state, with threshold > 0 to catch port exhaustion early
Tag your load balancer resources with probe port, backend pool owner, and last-reviewed date for faster incident triage

Frequently Asked Questions

Why is my Azure Load Balancer backend showing unhealthy even though the VM is running?

The most common reason is that an NSG is blocking Azure's health probe traffic from 168.63.129.16. The VM itself is running fine, but the load balancer can't reach the probe port to verify it, so it marks the backend as unhealthy and stops sending traffic. Go to the NIC-level NSG and the subnet-level NSG and add an inbound allow rule for the AzureLoadBalancer service tag on your probe port. Also check Windows Defender Firewall or iptables on the guest OS itself; a guest-level firewall can block probe traffic even when the NSG allows it.

What does "No healthy backends" mean in the Azure Load Balancer backend health view?

"No healthy backends" means that every VM in your backend pool is currently failing health probes, so the load balancer has no viable destination to send traffic to, all requests will be dropped. This is almost always caused by NSG blocking of probe traffic, the application port not listening on all VMs simultaneously (e.g., after a failed deployment), or all VMs being stopped or deallocated. Start by checking each VM's individual probe status, then look at NSG effective rules using Network Watcher's IP Flow Verify tool.

Can I use the same public IP address for both an Azure Load Balancer and a VM's direct public IP?

No. A public IP address can only be attached to one resource at a time, either to a load balancer frontend or directly to a VM's NIC, never both simultaneously. If you try to associate an IP that's already in use, Azure will throw a validation error and the association will fail. If you need direct VM access alongside load-balanced traffic, use separate public IPs: one for the load balancer frontend and one (or an inbound NAT rule) for direct VM management access.

My backend VMs can't reach the internet, is this an Azure Load Balancer outbound issue?

Yes, this is almost certainly Azure Load Balancer SNAT exhaustion or a missing outbound configuration. With Standard SKU, backend VMs don't get automatic outbound internet access; you must explicitly configure it via outbound rules on the load balancer, a NAT Gateway, or a public IP on each VM's NIC. Check the SNAT Connection Count metric in the load balancer's Monitoring section. If Failed connections are climbing, add an outbound rule with a higher port allocation or attach an Azure NAT Gateway to your subnet for scalable outbound connectivity.

What's the difference between Azure Load Balancer Basic SKU and Standard SKU, and does it matter for troubleshooting?

It matters a lot. Standard SKU is closed by default (all traffic is blocked unless NSGs explicitly allow it), while Basic SKU is open by default. Standard SKU supports zone redundancy, outbound rules, and has full diagnostic logging and metrics; Basic SKU has minimal monitoring. Most Azure Load Balancer issues that stump people are caused by the Standard SKU's secure-by-default stance, particularly around health probe blocking. Microsoft also set September 30, 2025 as the Basic SKU retirement date, so if you're still on it, migrate to Standard now.

How do I test Azure Load Balancer connectivity without affecting production traffic?

The best approach is to use Azure Network Watcher's Connection Troubleshoot tool. It sends synthetic test traffic from a source VM or a Network Watcher agent to your load balancer frontend and reports whether the connection succeeds, fails, or is blocked, along with the specific hop where the failure occurs. You can also spin up a test VM in the same VNet and use curl, telnet, or Test-NetConnection (PowerShell) to test specific ports without touching production clients. This gives you a controlled test plane completely separate from real user traffic.


Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.