Azure Load Balancer Not Working: Connectivity, Rules, and Routing Fixes
Why Azure Load Balancer Stops Working
Picture this: your production environment is humming along, your team just finished a planned migration from Basic to Standard Load Balancer tier, and then, silence. Your virtual machines go dark from the internet's perspective. Health probes fail, backends stop responding, and your monitoring dashboard lights up like a Christmas tree. Your Azure Load Balancer is not working, and you have no idea where to start.
I've seen this exact situation on dozens of Azure environments, and the maddening part is that Microsoft's portal error messages almost never tell you why. They just tell you something is wrong. "The resource is in a failed state." Thanks, that's very helpful.
Here's the reality: Azure Load Balancer connectivity issues almost always fall into one of five categories. First, network security group rules that silently block traffic, especially after a tier upgrade. Second, health probes failing because the backend VM isn't actually listening on the port you configured. Third, Standard Load Balancer's closed-by-default security model catching admins off guard (this one bites people constantly). Fourth, a misconfigured load-balancing rule, often one that no longer matches the actual listener on your VMs. And fifth, the load balancer resource itself landing in a failed provisioning state after an interrupted operation.
The Standard tier is where most Azure Load Balancer not working complaints originate. Unlike the Basic tier, which was more permissive and frankly less secure, Standard Load Balancer is closed to inbound traffic by default, and a Standard internal load balancer provides no outbound internet path at all unless you configure one. If an NSG rule doesn't explicitly allow an inbound flow, it is blocked. No exceptions. No warnings. Just dropped packets.
There's also a distinction between inbound failures and outbound failures that people conflate when they're panicking. Your backend VMs losing internet access is a completely different root cause from external clients failing to reach your load balancer frontend. Both result in "Azure Load Balancer not working" tickets, but they need different fixes. I'll walk you through both.
One more thing that makes Azure Load Balancer troubleshooting particularly tricky: the error symptoms are often misleading. Your health probe might show green in the portal while your actual application traffic is dead in the water. Or you'll see traffic reaching some backends but not others, pointing to a session persistence misconfiguration rather than a hard block. We'll cover all of it.
The Quick Fix: Try This First
Before you spend an hour digging through flow logs, try this. Open the Azure portal, navigate to your Load Balancer resource, and go to Monitoring > Insights > Flow Distribution. This single dashboard will immediately show you whether traffic is reaching your backends at all, and whether it's distributed evenly or piling onto one VM.
If that tab shows zero flow distribution, meaning nothing is getting through, the fastest fix in 80% of cases is checking your NSG rules. Here's the exact sequence:
- In the portal, go to your backend VM's Networking blade.
- Click Inbound port rules on the network interface card (NIC) view.
- Look for any rule with Action: Deny that sits above your allow rules in priority order. NSG rules are evaluated lowest number first, so a Deny rule at priority 100 will override an Allow rule at priority 200 every time.
- If you see no explicit Allow rule for the port your load balancer is forwarding to (say, TCP 80 or TCP 443), add one now. Set source to Any or to the specific client IP range, destination port to match your backend port, and action to Allow.
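The priority behavior described in the list above can be sketched in a few lines of Python. This is a simplified model, not Azure's implementation: the `Rule` class and `evaluate` function are hypothetical names, rules match only on source prefix and destination port, and the fall-through "Deny" stands in for the closed-by-default stance.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Rule:
    """Hypothetical, simplified stand-in for an inbound NSG rule."""
    priority: int   # lower number = evaluated first
    action: str     # "Allow" or "Deny"
    port: int       # destination port the rule covers
    source: str     # "*" for Any, or a CIDR prefix such as "203.0.113.0/24"


def evaluate(rules, source_ip, port):
    """Return the action of the first matching rule in priority order."""
    for rule in sorted(rules, key=lambda r: r.priority):
        source_ok = rule.source == "*" or ip_address(source_ip) in ip_network(rule.source)
        if source_ok and rule.port == port:
            return rule.action
    # Nothing matched: treat the flow as denied (assumed closed-by-default).
    return "Deny"


rules = [
    Rule(priority=100, action="Deny", port=80, source="*"),
    Rule(priority=200, action="Allow", port=80, source="*"),
]
print(evaluate(rules, "203.0.113.10", 80))  # Deny: the priority-100 rule matches first

# Adding an Allow rule with a lower number than the Deny fixes it.
fixed = rules + [Rule(priority=90, action="Allow", port=80, source="*")]
print(evaluate(fixed, "203.0.113.10", 80))  # Allow: priority 90 is evaluated first
```

The same model shows why the source prefix matters: an Allow rule scoped to the wrong address range never matches the client, so the flow falls through to the deny.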
For Standard Load Balancer specifically, this is important: you need an NSG on either the subnet or the NIC. Not necessarily both, but at least one. The load balancer itself does not pass traffic unless an NSG explicitly permits it. This is the single most common reason Azure Load Balancer connectivity issues appear immediately after a Basic-to-Standard upgrade.
If your issue is outbound (your VMs can't reach the internet through the load balancer), and you're on a Standard internal load balancer, the quick answer is: you need Azure NAT Gateway. Standard ILBs do not provide outbound internet access by default. That's not a bug, it's by design. The full NAT Gateway configuration walkthrough appears later in this guide.
Step 1: Check NSG Rules on the Backend Subnet and NIC
This is the starting point for virtually every Azure Load Balancer not working investigation. NSG misconfiguration is responsible for the majority of Azure Load Balancer inbound connection failures on the Standard tier, especially when teams upgrade from Basic without adjusting their security group rules.
Standard Load Balancer and standard public IP addresses are closed to inbound connections by default. Full stop. You must add NSG rules that explicitly permit traffic, or the load balancer will silently drop everything. Here's how to find and fix the problem:
Navigate to your backend VM in the Azure portal. Go to Networking in the left menu. You'll see two sections: rules applied to the subnet and rules applied to the network interface. Check both. A deny rule on the subnet applies to every VM in that subnet, so a misconfigured subnet NSG can take down your entire backend pool at once.
For public-facing load balancers, the source IP in your NSG allow rules matters. When external clients connect through the load balancer, their original IP address is what your backend VMs see, not the load balancer's IP. So if you've whitelisted only the load balancer's frontend IP, you've whitelisted the wrong thing. Add the client IP ranges (or set source to Any for public workloads) to your inbound allow rules.
To add a corrective NSG rule: in the Networking blade, click Add inbound port rule. Set priority to a number lower than any existing Deny rules (e.g., 100), set the destination port to your backend port, protocol to TCP (or UDP if applicable), and action to Allow. Click Add.
You should see connectivity restore within 30–60 seconds of saving the rule, because NSG changes propagate quickly. To confirm, open a command prompt on the backend VM and run netstat -an. If your application port shows LISTENING and connections are arriving, you're through.
Step 2: Fix Failing Health Probes
If your backend pool VMs are listed as Unhealthy in the load balancer's health probe section, traffic will never reach them, no matter how perfect your NSG rules are. The load balancer won't send requests to a backend it considers down. So fixing your health probe failure is often the critical unlock.
First, confirm what's actually happening. Go to your Load Balancer resource in the portal, navigate to Monitoring > Insights, and check the Backend Health tab. Any backend showing a red or yellow status needs attention immediately.
The most common health probe failure reason: the VM is up, but the probe is checking the wrong port. Go to Load Balancer > Settings > Health Probes and verify the probe port matches a port that is actually open and listening on your backend VMs. Then sign in to one of the backend VMs and run:
netstat -an | findstr LISTENING
If the health probe port, say, port 80, doesn't appear in that output, your web server or application isn't running or isn't binding to that port. Start (or restart) the service, then re-check the health probe status in the portal.
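A TCP health probe is, at heart, a connection attempt, so you can approximate one from a test machine. The sketch below is a minimal stand-in under that assumption: `tcp_probe` is a hypothetical helper name, and the demo probes a throwaway listener on the local machine rather than a real backend.

```python
import socket


def tcp_probe(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, timed out, or unreachable: a probe would count this as a failure.
        return False


# Demo against a throwaway local listener (stands in for your application).
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))   # port 0 lets the OS pick a free port
listener.listen()
demo_port = listener.getsockname()[1]

print(tcp_probe("127.0.0.1", demo_port))  # succeeds while something is listening
listener.close()
print(tcp_probe("127.0.0.1", demo_port))  # fails once the listener is gone
```

To check a real backend, call `tcp_probe` with the VM's private IP and the probe port from a machine in the same VNet.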
There's also a specific scenario that trips up teams using virtual machine scale sets: you cannot change the backend port of a load-balancing rule while it's associated with a health probe. If you try, the portal will throw an error and the rule update will fail. The fix is to remove the health probe first, update the port on the rule and on the scale set, then re-attach the health probe. It feels backwards, but it's the required sequence.
Health probe recovery typically takes one to two probe intervals after the underlying issue is fixed. The default interval is 15 seconds with a threshold of 2. So you're looking at 30 seconds maximum before a recovered VM rejoins the pool.
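The arithmetic above is simply interval times threshold. As a rough sketch (the function name is hypothetical, and real timing also depends on where in the probe cycle the change happens):

```python
def detection_seconds(interval_s, threshold):
    """Approximate seconds for a backend to change state: interval x threshold."""
    return interval_s * threshold


print(detection_seconds(15, 2))  # 30, using the interval and threshold quoted above
print(detection_seconds(5, 2))   # 10, with a tighter 5-second interval
```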
Step 3: Confirm the Application Is Listening on the Data Port
Here's a scenario I see constantly: health probes pass (green status in the portal), the NSG rules look fine, but your backend VMs still aren't responding to actual application traffic on the configured data port. The load balancer says everything is healthy. Your users say nothing works. What gives?
Nine times out of ten, the application on the backend VM isn't actually listening on the data port, only on the probe port. A health probe on port 80 checking for an HTTP 200 will succeed even if your actual application is supposed to be running on port 8080. The probe passes, the backend is marked healthy, but all data-port traffic goes nowhere.
Sign in to the backend VM directly (via Azure Bastion or the portal's serial console) and run this command:
netstat -an
Look through the output carefully. You're checking for your data port, whatever port your load-balancing rule forwards to, with a state of LISTENING. If it's there, great. If it's not, the application isn't running correctly on that port. Check your application logs, restart the service, and confirm it binds to the right port and network interface.
One subtle gotcha: some applications bind to 127.0.0.1 (loopback only) instead of 0.0.0.0 (all interfaces). If your app is loopback-bound, the load balancer's traffic arriving on the VM's private IP will never reach it, even though netstat shows the port as listening. Look at the local address column in the netstat output. You want to see 0.0.0.0:<port>, not 127.0.0.1:<port>.
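Reading the local address column by eye is error-prone on a busy server. The following sketch classifies a port from pasted netstat -an text. It assumes Windows-style rows (Proto, Local Address, Foreign Address, State) and uses a hypothetical helper name, `binding_for_port`; adapt the parsing if your output differs.

```python
def binding_for_port(netstat_output, port):
    """Classify how a TCP port is bound, based on `netstat -an` text.

    Returns "all-interfaces", "loopback-only", "specific-address",
    or "not-listening".
    """
    suffix = f":{port}"
    local_addrs = []
    for line in netstat_output.splitlines():
        fields = line.split()
        # Keep only TCP rows whose final column is the LISTENING state.
        if len(fields) >= 4 and fields[0].upper().startswith("TCP") and fields[-1] == "LISTENING":
            if fields[1].endswith(suffix):
                local_addrs.append(fields[1][: -len(suffix)])
    if not local_addrs:
        return "not-listening"
    if any(addr in ("0.0.0.0", "[::]") for addr in local_addrs):
        return "all-interfaces"
    if all(addr in ("127.0.0.1", "[::1]") for addr in local_addrs):
        return "loopback-only"
    return "specific-address"


sample = """
  TCP    0.0.0.0:80             0.0.0.0:0              LISTENING
  TCP    127.0.0.1:8080         0.0.0.0:0              LISTENING
  TCP    10.0.0.4:3389          10.0.0.9:50122         ESTABLISHED
"""
print(binding_for_port(sample, 80))    # all-interfaces
print(binding_for_port(sample, 8080))  # loopback-only
print(binding_for_port(sample, 443))   # not-listening
```

A "loopback-only" result on your data port is the case described above: the port is listening, but load-balanced traffic arriving on the private IP cannot reach it.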
Also worth knowing: never try to access your load balancer's frontend IP from within a VM that's in the backend pool of that same load balancer, using the same NIC. This creates an asymmetric routing loop. The packet goes out to the frontend, but the return path bypasses the load balancer entirely. The result looks like a connectivity failure when the infrastructure itself is fine. Test from an external machine or a VM on a different subnet.
Step 4: Restore Outbound Internet Access with NAT Gateway
If your Azure Load Balancer not working complaint is specifically that your backend VMs have lost internet access (they can't reach external APIs, can't download updates, can't connect to Azure services outside your VNet) and you're running a Standard internal load balancer, here's what happened: you're hitting Standard ILB's default security behavior.
Basic internal load balancers used to quietly provide outbound internet access via a hidden, system-managed public IP called the "default outbound access IP." It wasn't documented clearly, it wasn't under your control, and Microsoft explicitly tells you not to rely on it for production. Standard internal load balancers removed this behavior entirely. There's no hidden IP. There's no default outbound path. If you migrated from Basic ILB to Standard ILB and your VMs suddenly lost internet access, this is why.
The correct fix, and the one Microsoft recommends for all production outbound scenarios, is Azure NAT Gateway. Here's how to configure it:
# Create a public IP for NAT Gateway
az network public-ip create \
  --resource-group <your-rg> \
  --name myNATGatewayIP \
  --sku Standard \
  --allocation-method Static
# Create the NAT Gateway
az network nat gateway create \
  --resource-group <your-rg> \
  --name myNATGateway \
  --public-ip-addresses myNATGatewayIP \
  --idle-timeout 10
# Associate NAT Gateway with your subnet
az network vnet subnet update \
  --resource-group <your-rg> \
  --vnet-name <your-vnet> \
  --name <your-subnet> \
  --nat-gateway myNATGateway
Once NAT Gateway is attached to your subnet, all outbound internet traffic from VMs in that subnet flows through it automatically. You get a predictable, static public IP, full NSG compatibility, and no more dependency on hidden Azure infrastructure. Test outbound connectivity from a backend VM with a simple curl https://www.microsoft.com. You should get a response immediately.
Step 5: Recover a Load Balancer Stuck in a Failed Provisioning State
If your load balancer is showing Provisioning State: Failed in the Azure portal, maybe after an interrupted update, a failed deployment, or a resource conflict, this is one of the more alarming things you can see. The portal often gives you no useful path forward from this state. The fix involves Azure Resource Explorer, and it's less scary than it sounds.
Open a new browser tab and go to resources.azure.com. Sign in with the same account you use for the Azure portal. In the left panel, drill down through subscriptions > [your subscription] > resourceGroups > [your resource group] > providers > Microsoft.Network > loadBalancers > [your load balancer name].
At the top of the Resource Explorer pane, find the Read/Write toggle and switch it to Read/Write mode. This enables you to make direct API calls to Azure Resource Manager.
Now click Edit. The JSON definition of your load balancer appears in an editable panel. You don't need to change anything in the JSON, just click PUT. This sends the current configuration back to Azure Resource Manager and triggers a re-evaluation of the provisioning state.
Wait 15–30 seconds, then click GET to refresh. Check the provisioningState field in the response JSON. If it reads "Succeeded", your load balancer is out of the failed state. Return to the portal and refresh your Load Balancer blade. It should now show a healthy status, and you can resume normal operations.
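If you save the GET response to a file, or repeat the check often, the field can be read programmatically. This is a minimal sketch: it assumes the response body nests the value under properties.provisioningState, and the sample JSON is a trimmed, made-up stand-in for a real response.

```python
import json


def provisioning_state(get_response_text):
    """Pull properties.provisioningState out of a resource GET response body."""
    body = json.loads(get_response_text)
    return body.get("properties", {}).get("provisioningState", "Unknown")


# Trimmed, hypothetical response body for illustration only.
sample = '{"name": "myLoadBalancer", "properties": {"provisioningState": "Succeeded"}}'
print(provisioning_state(sample))  # Succeeded
```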
If the PUT operation returns an error or the provisioning state stays Failed after multiple attempts, the resource has a deeper conflict that the portal and Resource Explorer can't self-heal. At that point, collecting a network trace and opening a support ticket is the right call.
Advanced Troubleshooting
When the standard fixes don't work, you need to go deeper. Let me walk you through the diagnostic techniques that actually surface the root cause when you're stuck.
Diagnosing Uneven Traffic Distribution
Azure Load Balancer uses a hash-based distribution algorithm, not true round-robin. By default, it hashes on source IP, source port, destination IP, destination port, and protocol: a five-tuple hash. This means every packet of a given connection lands on the same backend VM for the life of that connection, while a new connection from the same client (with a new source port) may land elsewhere. If you have a small number of long-lived connections hammering your service, they may all hash to the same backend, making it look like the other backends are idle.
If you've configured Session Persistence to Client IP or Client IP and Protocol, you've reduced that hash to a two- or three-tuple. That makes the distribution even more concentrated: all traffic from a single IP goes to one VM, period. This is appropriate for stateful applications, but it's a common cause of apparent Azure Load Balancer traffic distribution issues when admins expect even spread.
To change session persistence, go to Load Balancer > Settings > Load Balancing Rules, click your rule, and change Session persistence to None. This restores full five-tuple hashing and distributes across all healthy backends.
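To see why the persistence setting changes the spread so sharply, here is a toy model of tuple hashing. It is an illustration only: Azure's actual hash function is not public, so SHA-256 is swapped in, and the backend names, addresses, and mode labels are hypothetical.

```python
import hashlib
from collections import Counter

BACKENDS = ["vm-1", "vm-2", "vm-3"]  # hypothetical backend names


def pick_backend(flow, mode="five-tuple"):
    """Map a flow to a backend by hashing the fields the mode selects.

    flow = (src_ip, src_port, dst_ip, dst_port, protocol)
    """
    src_ip, src_port, dst_ip, dst_port, protocol = flow
    if mode == "five-tuple":
        key = (src_ip, src_port, dst_ip, dst_port, protocol)
    elif mode == "client-ip":            # two-tuple: source and destination IP
        key = (src_ip, dst_ip)
    elif mode == "client-ip-protocol":   # three-tuple: adds the protocol
        key = (src_ip, dst_ip, protocol)
    else:
        raise ValueError(f"unknown mode: {mode}")
    digest = hashlib.sha256(repr(key).encode()).digest()
    return BACKENDS[int.from_bytes(digest[:8], "big") % len(BACKENDS)]


# One client opening 300 connections from different ephemeral source ports.
flows = [("203.0.113.10", 50000 + i, "198.51.100.1", 443, "tcp") for i in range(300)]

print(Counter(pick_backend(f, "five-tuple") for f in flows))  # spread across backends
print(Counter(pick_backend(f, "client-ip") for f in flows))   # everything on one backend
```

With the source port in the key, each new connection gets its own hash, so one busy client still spreads out. Drop the port from the key and every connection from that IP collapses onto a single backend.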
Also watch for proxy infrastructure sitting in front of your clients. If all your clients route through a corporate proxy or a CDN edge that presents a single outbound IP, then with Client IP persistence the load balancer sees one "client" and routes all that traffic to a single backend. Even with five-tuple hashing, a proxy that funnels requests over a handful of long-lived connections can concentrate load. In this case, uneven distribution is expected and correct. The fix is architectural: use per-session token-based routing or a different distribution algorithm in front of the load balancer, not behind it.
Running Network Captures for Deep Packet Analysis
When you've exhausted the portal-based diagnostics, network captures are how you prove what's actually happening on the wire. From one of your backend VMs, use PsPing to test the probe port response against another backend in the pool:
psping 10.0.0.4:3389
Run this while simultaneously collecting a Netsh trace on both the backend VM and a test VM in the same VNet:
# Start trace on backend VM
netsh trace start capture=yes tracefile=C:\backend_trace.etl
# Run your PsPing tests here
# Stop the trace
netsh trace stop
The resulting .etl file can be converted to pcapng with Microsoft's open-source etl2pcapng tool and then opened in Wireshark. (Microsoft Message Analyzer, the older option, has been retired, and netsh trace convert produces text reports rather than a Wireshark-readable capture.) Look for TCP RST packets (connection resets), ICMP unreachable messages, or simply missing return traffic. Any of these narrows down whether the block is at the VM level, the NSG level, or somewhere between.
Checking Load Balancer Health Event Logs
Azure Monitor can capture Load Balancer health events, provided a diagnostic setting on the load balancer sends its health event logs to a Log Analytics workspace. Go to Monitor > Logs, select your Load Balancer's resource scope, and run a query along these lines (the table and column names follow the ALBHealthEvent schema; confirm them in your workspace's schema pane before relying on them):
ALBHealthEvent
| where TimeGenerated > ago(1h)
| project TimeGenerated, HealthEventType, Severity, Description, FrontendIP
| order by TimeGenerated desc
The HealthEventType and Description fields are gold. They tell you whether the platform saw degraded data path availability, a backend pool with no healthy instances, or SNAT port pressure, which takes much of the guesswork out of debugging Azure Load Balancer health probe failures.
Small Traffic Still Flowing After VM Removal
You removed a VM from the backend pool and you're still seeing a trickle of traffic hitting its IP. Don't panic. Azure infrastructure (DNS resolution, blob storage access, background management plane calls) generates small amounts of network traffic to VMs independent of the load balancer. This is expected Azure platform behavior. Run nslookup against your storage account FQDN from within your VNet to see which Azure IPs are resolving there, and you'll typically find that's the source of the residual traffic.
When to Call Microsoft Support
If you've worked through every step here and your load balancer is still broken, it's time to escalate. Specifically, open a ticket when: your load balancer is stuck in Failed provisioning state after multiple PUT attempts via Resource Explorer; your health probes pass but data traffic fails and Netsh traces show the packets arriving at the VM but no response being generated; or you're seeing asymmetric routing behavior in a complex hub-spoke topology. When you open the ticket at Microsoft Support, attach your Netsh trace files and your PsPing results. In my experience, this cuts resolution time from days to hours.
Prevention & Best Practices
Most Azure Load Balancer not working incidents are preventable. I've watched teams spend entire weekends fixing issues that five minutes of pre-migration planning would have avoided. Here's what to build into your standard operating procedure.
Always deploy NSG rules before upgrading to Standard tier. If you're migrating from Basic Load Balancer to Standard, NSG configuration must happen before the migration, not after. The moment your resources become Standard SKU, the deny-by-default rules kick in. If you flip the tier and then write your NSG rules, you'll have a gap where nothing works. Even in a maintenance window, that gap causes alerts, confusion, and stress.
Use Azure NAT Gateway for all outbound internet access. Stop relying on default outbound access: Microsoft has deprecated it for new deployments, and you shouldn't count on it lasting for existing ones either. NAT Gateway gives you a static, predictable outbound IP, scales automatically, and works correctly with Standard Load Balancer. Set it up once per subnet and a whole class of outbound connectivity issues goes away.
Validate health probe ports against actual application ports before deployment. Before you go live with any load balancer rule, SSH or RDP into a backend VM and confirm the probe port is listening. Run netstat -an and look for the port in LISTENING state bound to 0.0.0.0. Thirty seconds of validation prevents hours of incident response.
Test connectivity from outside the backend pool, always. Never use a backend VM to test the load balancer's frontend. Use an external machine, a VM in a different subnet, or Azure's Network Watcher connection troubleshoot tool. Testing from inside the backend pool introduces the asymmetric routing problem that masks real issues.
Monitor the Flow Distribution tab regularly, not just during incidents. Set up an Azure Monitor alert on backend health percentage (the Health Probe Status metric). If any backend drops below 100% health, you want to know immediately, not when users start calling.
- Tag every NSG rule with a description explaining what it allows and why. Reviewing rules is much faster when you can read intent at a glance.
- Set your health probe interval to 5 seconds with a threshold of 2 in production. Backends are then marked unhealthy after 10 seconds instead of 30, cutting your failover time significantly.
- Use Azure Network Watcher's IP Flow Verify tool to test whether a specific source/destination pair would be allowed or blocked by your NSGs. It takes 10 seconds and replaces hours of manual rule-reading.
- If you're using virtual machine scale sets behind a load balancer, document the health probe removal procedure now, before you need it. You'll need it the first time you have to change a backend port under pressure.
Frequently Asked Questions
Why does my Azure Load Balancer show healthy backends but still drop traffic?
Health probes and data traffic use different ports and different paths. Your probe can succeed on port 80 while your actual application on port 8080 is down, not listening, or blocked by an NSG rule that only applies to that port. Open a session on the backend VM, run netstat -an, and verify the data port shows as LISTENING, not just the probe port. Also check whether you're testing from inside the backend pool itself, which creates an asymmetric routing loop that looks identical to a genuine traffic drop.
I upgraded from Basic to Standard Load Balancer and now nothing works. How do I fix it?
This is the classic post-upgrade scenario. Standard Load Balancer requires explicit NSG rules to pass any inbound traffic. Go to your backend VM's Networking blade in the portal and add inbound allow rules for your application ports. For outbound internet access from a Standard internal load balancer, you'll also need to configure Azure NAT Gateway on your subnet, because Standard ILBs don't provide default outbound connectivity the way Basic ILBs did. Both changes together should restore everything to normal.
Can I change the backend port on my load-balancing rule while the load balancer is running?
Yes, but if you have a virtual machine scale set in the backend pool, not while a health probe is attached to the rule. For VMSS configurations, you have to remove the health probe first, update the backend port on both the load-balancing rule and the scale set itself, and then re-attach the health probe. Trying to shortcut this sequence will result in an error that blocks the change entirely. The full sequence takes about five minutes.
My load balancer is in a Failed state in the portal. Do I need to delete and recreate it?
No, deletion is almost never the right answer here. Go to Azure Resource Explorer at resources.azure.com, find your load balancer resource, switch to Read/Write mode, and use the Edit > PUT sequence to re-submit the resource's current configuration to Azure Resource Manager. Then click GET and check that provisioningState has changed to Succeeded. This fixes the majority of failed state issues without losing any of your existing rules or configuration. If the PUT fails repeatedly with an error that Resource Explorer can't resolve, open a support ticket before you consider deleting and recreating the resource.
Why is my Standard internal load balancer blocking my VMs from reaching the internet?
Standard internal load balancers don't provide outbound internet connectivity by design. This is a security feature, not a bug. Basic ILBs used an undocumented hidden public IP to route outbound traffic, which Microsoft has since deprecated. To restore outbound internet access for your backend VMs, attach an Azure NAT Gateway to the subnet where your VMs live. NAT Gateway is Microsoft's recommended solution for all production outbound scenarios, and it gives you a static, predictable public IP you actually control.
Why is traffic only going to one backend VM even though I have three healthy backends?
Almost certainly a session persistence setting or a proxy in front of your clients. If session persistence is set to Client IP or Client IP and Protocol, all requests from a single source IP land on the same backend, always. If your clients route through a corporate proxy, the load balancer sees one source IP and routes everything to one VM. Change session persistence to None in your load-balancing rule if you don't require sticky sessions. Also remember to measure traffic distribution per connection (use Load Balancer Insights > Flow Distribution), not per packet. Per-packet counts look uneven even with perfect distribution because of how hash-based load balancing works.