How to Troubleshoot Azure NAT Gateway

Microsoft Fix | Intermediate | 14 min read | Official Docs Grounded | Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why This Is Happening

I've seen this scenario play out more times than I can count. You deploy a clean Azure environment, associate your subnet with an Azure NAT Gateway, and everything looks perfectly configured in the portal. Then your workload goes live, and suddenly your virtual machines can't reach the internet, API calls are timing out, or you're getting intermittent connection drops that make no sense. You open a support ticket, Azure says the NAT Gateway is "healthy," and you're left staring at a green status icon wondering why nothing works.

Azure NAT Gateway troubleshooting catches a lot of engineers off guard because the failure modes are subtle. Unlike a down VM or a broken load balancer, NAT Gateway issues often surface as slow degradation: connections that worked fine yesterday start failing today, or only a percentage of outbound requests fail. That pattern points directly to SNAT port exhaustion, and it's the number one reason I get pulled into NAT Gateway escalations.

Here's the core of what NAT Gateway actually does: it provides outbound-only internet connectivity for resources in a subnet, replacing the older default SNAT behavior tied to Azure Load Balancer. Each public IP you attach to the gateway gives you 64,512 SNAT ports. Those ports are shared across every VM, container, or service in the associated subnet. When your application opens many concurrent outbound TCP connections (think microservices hammering an external REST API, or a data pipeline calling Azure Storage endpoints repeatedly), those ports get consumed fast. Exhaust them, and new outbound connections fail silently with TCP timeout errors (ETIMEDOUT, errno 110, on Linux; WSAETIMEDOUT on Windows).

Beyond SNAT exhaustion, the other causes I see most often in Azure NAT Gateway connectivity issues are:

Missing or wrong subnet association: the gateway exists but isn't actually linked to the subnet your VMs live in. This is more common than you'd think after IaC deployments where Terraform or Bicep partially succeeded.
No public IP or IP prefix attached: a NAT Gateway with zero public IPs does nothing. The portal will let you create one in this state.
NSG rules blocking outbound traffic: a deny-all outbound NSG applied at the subnet level will override NAT Gateway and silently drop traffic.
User-Defined Route (UDR) conflicts: a default route (0.0.0.0/0) pointing to a Network Virtual Appliance or Azure Firewall will intercept traffic before it ever hits NAT Gateway.
TCP idle timeout mismatches: NAT Gateway has a configurable idle timeout (default 4 minutes, max 120 minutes). Application-level keep-alive settings that don't match this window cause ghost connections that consume SNAT ports long after they should have been released.

Microsoft's error messages around NAT Gateway aren't great. The portal shows a provisioning state of "Succeeded" even when SNAT ports are maxed out, because from Azure's perspective the resource is functioning; it's just out of capacity. That mismatch between "healthy" status and broken connectivity is exactly why NAT Gateway outbound traffic problems are so frustrating to diagnose without knowing where to look.


The Quick Fix: Try This First

Before you go deep on diagnostics, run through this checklist in the Azure portal. About 60% of the "NAT Gateway not working" cases I've seen are resolved by the first two checks alone.

Open the Azure portal and navigate to your NAT Gateway resource. Go to Settings > Subnets in the left blade. You should see the subnet you intended to protect listed here. If the list is empty or shows a different subnet, that's your answer. The gateway exists, but it's not associated with anything. Click + Associate, select your virtual network and the correct subnet, and hit Save. Give it 30–60 seconds to propagate, then retest your outbound connectivity.

Still not working? Go to Settings > Outbound IP. This blade shows which public IP addresses or public IP prefixes are attached to the gateway. If it says "None configured," the gateway has no way to translate your traffic. Click + Add, select a Standard SKU public IP (Basic SKU is not supported, which trips up a lot of people migrating older environments), and save.

Now open the subnet itself: go to Virtual Networks > [your VNet] > Subnets > [your subnet] and look at the Network security group field. If there's an NSG attached, click into it and check the Outbound security rules. Look for any rule with a priority number lower than 65000 that has Action: Deny applied to Destination: Internet or destination port 80/443. A rule like that will block outbound web traffic regardless of what NAT Gateway is doing. You'll need to either remove that deny rule or add an explicit Allow rule with a lower priority number (higher precedence) so that it is evaluated first.
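NSG precedence is easy to misread, so here is a small bash sketch of the evaluation order: rules are processed from the lowest priority number upward, and the first match decides. The function name first_matching_action and the "priority action" input format are hypothetical, invented for illustration. It models only priority ordering and assumes every rule you feed it already matches the flow on destination and port.

```shell
# Hypothetical helper: given "priority action" pairs (one per line) for rules
# that all match the same outbound flow, print the action that takes effect.
# The rule with the lowest priority number is evaluated first and wins.
first_matching_action() {
  sort -n -k1,1 | head -n 1 | awk '{print $2}'
}

# A custom Deny at priority 200 beats the default allow at 65001 ...
printf '%s\n' "65001 Allow" "200 Deny" | first_matching_action
# ... unless an explicit Allow sits ahead of it at priority 150.
printf '%s\n' "65001 Allow" "200 Deny" "150 Allow" | first_matching_action
```

The first call prints Deny and the second prints Allow, which is why adding an Allow rule only helps when its priority number is lower than the Deny rule's.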

If all three of those check out and you're still seeing Azure NAT Gateway connectivity issues, run this quick PowerShell check to confirm the association from the API side rather than trusting the portal display:

$natGw = Get-AzNatGateway -ResourceGroupName "your-rg" -Name "your-nat-gateway"
$natGw.Subnets | ForEach-Object { Write-Host $_.Id }
$natGw.PublicIpAddresses | ForEach-Object { Write-Host $_.Id }
$natGw.PublicIpPrefixes | ForEach-Object { Write-Host $_.Id }

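If the first command returns nothing, the subnet association is missing. If both of the last two commands return nothing, the gateway has no public IP or prefix attached (it needs at least one of the two, not both). If the subnet is listed and at least one IP or prefix comes back, move on to the step-by-step troubleshooting below: you're likely dealing with SNAT exhaustion or a routing conflict.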

Pro Tip

When you check Resource Health for a NAT Gateway (Resource health, under Help in the left menu), a "Healthy" status does not mean SNAT ports are available. Resource Health only reflects control-plane availability. To actually see port exhaustion, you must go to Metrics and plot the SNATConnectionCount metric filtered to Connection State = Failed. That's the number that tells you what's actually happening on the data plane.

Verify Subnet Association and Public IP in Azure Portal

This is always step one. I don't care how confident you are in your IaC deployment; verify it manually first. Configuration drift is real, and a failed Terraform apply can leave a NAT Gateway in a half-configured state that looks fine until you dig in.

In the Azure portal, go to All Resources and search for your NAT Gateway by name. Open the resource, then click Settings > Subnets in the left navigation panel. Confirm the correct subnet appears. If you're running multiple subnets (for example, a frontend subnet and a backend subnet), each subnet that needs outbound internet access requires explicit association. NAT Gateway doesn't apply VNet-wide, only per-subnet.

Next, click Settings > Outbound IP. Confirm at least one Standard SKU public IP or a public IP prefix is attached. Remember: you can attach up to 16 public IP addresses to a single NAT Gateway, giving you a maximum of 16 × 64,512 = 1,032,192 simultaneous SNAT ports. For high-throughput scenarios like connection-heavy microservices, you'll often need more than the default single IP.

To verify from the command line, which I always recommend as a cross-check:

# Check NAT Gateway full configuration
az network nat gateway show \
  --resource-group "your-rg" \
  --name "your-nat-gateway" \
  --output table

# List associated subnets
az network nat gateway show \
  --resource-group "your-rg" \
  --name "your-nat-gateway" \
  --query "subnets[].id" \
  --output tsv

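# List attached public IPs
az network nat gateway show \
  --resource-group "your-rg" \
  --name "your-nat-gateway" \
  --query "publicIpAddresses[].id" \
  --output tsv

# List attached public IP prefixes (a gateway may use these instead of individual IPs)
az network nat gateway show \
  --resource-group "your-rg" \
  --name "your-nat-gateway" \
  --query "publicIpPrefixes[].id" \
  --output tsv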

If the subnet list or IP list is empty, you've confirmed the misconfiguration. Use az network nat gateway update or fix it in the portal. Once you associate the subnet, expect 30–90 seconds before outbound traffic begins routing through the gateway.

Measure SNAT Port Exhaustion with Azure Monitor Metrics

If the basic association looks correct but outbound connections are still failing intermittently, SNAT port exhaustion is your next suspect. This is the leading cause of Azure NAT Gateway outbound traffic problems at scale, and it's invisible unless you know which metrics to check.

Navigate to your NAT Gateway in the portal, then click Monitoring > Metrics in the left blade. Click + Add metric and select the following metrics one by one:

SNATConnectionCount: new SNAT connections made in each interval. Split this by the Connection State dimension and compare the "Attempted" and "Failed" values.
DroppedPackets (shown as "Dropped Packets" in the portal; the underlying metric name is PacketDropCount): a rising value here alongside failed SNAT connections confirms exhaustion.
ByteCount and PacketCount: useful baselines to understand traffic volume hitting the gateway.

Set your time range to the last 24 hours and use a 1-minute granularity. If you see SNATConnectionCount (Failed) climbing above zero during peak usage, especially if it correlates with spikes in overall traffic, you are exhausting SNAT ports.
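If you prefer the command line to the Metrics blade, the same check can be sketched with az monitor metrics list. This is a minimal bash sketch, not a definitive query: the function name failed_snat_connections is hypothetical, and it assumes the dimension is exposed to the API as ConnectionState with a value of Failed, which you should confirm against what the Metrics blade shows for your gateway.

```shell
# Sketch: pull failed SNAT connection counts at 1-minute granularity.
# Assumption: the dimension name is "ConnectionState" and the value is "Failed".
failed_snat_connections() {
  local nat_gateway_id="$1"   # full ARM resource ID of the NAT Gateway
  az monitor metrics list \
    --resource "$nat_gateway_id" \
    --metric "SNATConnectionCount" \
    --filter "ConnectionState eq 'Failed'" \
    --interval PT1M \
    --aggregation Total \
    --output table
}

# Usage (placeholder resource ID):
# failed_snat_connections "/subscriptions/{sub-id}/resourceGroups/your-rg/providers/Microsoft.Network/natGateways/your-nat-gateway"
```

Add --start-time and --end-time to cover the same 24-hour window you would look at in the portal; without them the CLI returns only a recent window.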

The fix is straightforward: add more public IP addresses to the NAT Gateway. Each additional Standard SKU public IP adds another 64,512 ports. You can also use a public IP prefix (/28 gives you 16 IPs at once), which has the added benefit of giving your partners and downstream systems a predictable, contiguous IP range to allowlist.

# Create a new Standard SKU public IP
$pip = New-AzPublicIpAddress `
  -Name "nat-pip-02" `
  -ResourceGroupName "your-rg" `
  -Location "eastus" `
  -Sku "Standard" `
  -AllocationMethod "Static"

# Attach it alongside the IPs already on the gateway
$natGw = Get-AzNatGateway -ResourceGroupName "your-rg" -Name "your-nat-gateway"
$allPips = @($natGw.PublicIpAddresses | Where-Object { $_ }) + $pip
Set-AzNatGateway -InputObject $natGw -PublicIpAddress $allPips

After adding the IP, give it 2–3 minutes and then watch the DroppedPackets metric. It should fall to zero if exhaustion was the root cause.

Check NSG Rules and User-Defined Route Conflicts

Even with a perfectly configured NAT Gateway, two common Azure networking constructs can silently intercept or block your outbound traffic: Network Security Groups (NSGs) and User-Defined Routes (UDRs). I've seen both leave the subnet association looking perfect at the control plane while traffic still goes nowhere.

NSG Outbound Rules: Navigate to your virtual network, open the subnet in question, and click on any attached NSG. Go to Outbound security rules. The default rules (65000 AllowVnetOutBound and 65001 AllowInternetOutBound) should normally permit internet traffic. But if someone has added a custom Deny rule with a lower priority number (say, priority 200 with Destination = Internet and Action = Deny), it is evaluated first and blocks the traffic no matter how NAT Gateway is configured. Look for any rule that matches ports 80, 443, or any port with destination "Internet" or a specific external IP range and Action = Deny.

UDR Routing Conflicts: Go to Virtual Networks > [your VNet] > Subnets > [your subnet] and click on the Route table if one is attached. Open the route table and examine its routes. If you see a route for 0.0.0.0/0 (the default route) pointing to a next hop of Virtual Appliance or VirtualNetworkGateway, all internet-bound traffic is being redirected to that appliance, bypassing NAT Gateway entirely. This happens frequently in hub-and-spoke architectures where a forced tunneling policy exists.

# Check effective routes on a VM's NIC (shows actual routing after all UDRs applied)
Get-AzEffectiveRouteTable `
  -ResourceGroupName "your-rg" `
  -NetworkInterfaceName "your-vm-nic" `
  | Format-Table AddressPrefix, NextHopType, NextHopIpAddress, State

Look for a route with AddressPrefix 0.0.0.0/0 in the output. If NextHopType shows VirtualAppliance instead of Internet, the UDR is hijacking your traffic. You'll need to either modify the route table to let NAT Gateway handle internet traffic, or configure your NVA to pass traffic through.
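The decision you make when reading that output can be written down as a small bash sketch. The function name classify_default_route is hypothetical; it takes the NextHopType of the effective 0.0.0.0/0 route and says what that means for a subnet associated with NAT Gateway.

```shell
# Hypothetical helper: interpret the next hop type of the effective 0.0.0.0/0
# route for a subnet that is associated with a NAT Gateway.
classify_default_route() {
  case "$1" in
    Internet)
      echo "ok: default route is Internet, so outbound traffic can use NAT Gateway" ;;
    VirtualAppliance|VirtualNetworkGateway)
      echo "bypass: default route goes to $1, so NAT Gateway is skipped" ;;
    None)
      echo "dropped: default route has next hop None, so outbound traffic is discarded" ;;
    *)
      echo "unknown: review next hop type '$1' manually" ;;
  esac
}

classify_default_route "VirtualAppliance"
```

Running this check against one NIC per subnet is usually enough, since every NIC in a subnet shares the same route table.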

Diagnose TCP Idle Timeout and Connection Reuse Issues

This one is sneaky. Your application works fine most of the time, but every few minutes a batch of connections silently drops, causing retry storms and occasional failures. You check everything and it all looks healthy. The culprit is almost always a TCP idle timeout mismatch between Azure NAT Gateway and your application's connection behavior.

Azure NAT Gateway has a configurable TCP idle timeout. The default is 4 minutes. You can set it anywhere from 4 minutes up to 120 minutes. What this means in practice: if a TCP connection has no data flowing across it for longer than the idle timeout window, NAT Gateway silently drops the SNAT port mapping for that connection. If your application then tries to send data on that "connection," the packets go nowhere. From the application's perspective the socket is still open, but NAT Gateway has already forgotten about it.

To check or change the idle timeout:

# View current idle timeout (in minutes)
az network nat gateway show \
  --resource-group "your-rg" \
  --name "your-nat-gateway" \
  --query "idleTimeoutInMinutes"

# Update idle timeout to 10 minutes
az network nat gateway update \
  --resource-group "your-rg" \
  --name "your-nat-gateway" \
  --idle-timeout 10

However, simply cranking up the idle timeout is often the wrong move: it causes SNAT ports to be held longer, which can actually accelerate exhaustion under load. The right fix is to configure TCP keep-alives so that your connections send heartbeat packets before the idle timeout fires. On Linux VMs, you can set kernel-level TCP keep-alive parameters, which apply to sockets that have keep-alive enabled:

# Check current TCP keep-alive settings
sysctl net.ipv4.tcp_keepalive_time
sysctl net.ipv4.tcp_keepalive_intvl
sysctl net.ipv4.tcp_keepalive_probes

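# Set keep-alive to fire after 3 minutes of idle (before NAT Gateway's 4-minute timeout)
# Note: sysctl -w does not survive a reboot; add the same keys to a file under /etc/sysctl.d/ to persist them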
sysctl -w net.ipv4.tcp_keepalive_time=180
sysctl -w net.ipv4.tcp_keepalive_intvl=60
sysctl -w net.ipv4.tcp_keepalive_probes=3

For application-level connection pools (Java HttpClient, .NET HttpClientHandler, Node.js, etc.), enable keep-alive at the HTTP or socket level in your code. Setting Connection: keep-alive headers is not enough; the underlying TCP socket also needs periodic probes. Once your application is sending keep-alives every 2–3 minutes, you'll see the random disconnection pattern disappear.
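The rule of thumb above (first keep-alive probe before the idle timeout) is easy to check mechanically. The bash sketch below uses a hypothetical function name, keepalive_beats_idle_timeout, and compares a keep-alive idle time in seconds against the gateway's idle timeout in minutes.

```shell
# Hypothetical helper: check that the first keep-alive probe is sent before the
# NAT Gateway idle timeout expires.
#   $1 = keep-alive idle time in seconds (e.g. net.ipv4.tcp_keepalive_time)
#   $2 = NAT Gateway idle timeout in minutes
keepalive_beats_idle_timeout() {
  local keepalive_secs="$1" idle_timeout_min="$2"
  if [ "$keepalive_secs" -lt $(( idle_timeout_min * 60 )) ]; then
    echo "ok"
  else
    echo "too-slow"
  fi
}

keepalive_beats_idle_timeout 180 4    # 180 s < 240 s, prints "ok"
keepalive_beats_idle_timeout 7200 4   # the Linux default of 7200 s prints "too-slow"
```

On a Linux VM you could feed it live values, for example keepalive_beats_idle_timeout "$(sysctl -n net.ipv4.tcp_keepalive_time)" 4.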

Enable Diagnostic Logs and Analyze NAT Gateway Flow Data

At this point, if you still can't identify the root cause of your Azure NAT Gateway connectivity issues, you need actual log data. Azure Monitor diagnostic settings let you capture detailed flow information that's invisible through the Metrics blade alone.

In the Azure portal, open your NAT Gateway resource and click Monitoring > Diagnostic settings. Click + Add diagnostic setting. Give it a name, check allLogs (which captures the NatGatewayFlowLogs category), and send it to a Log Analytics Workspace, which gives you the most flexible querying options. Hit Save.

Note: flow log data can take 5–10 minutes to appear after initial configuration. Once data is flowing, go to your Log Analytics workspace and run these queries:

// Find failed outbound connections in the last hour
AzureDiagnostics
| where ResourceType == "NATGATEWAYS"
| where TimeGenerated > ago(1h)
| where operationName_s contains "FlowLog"
| where resultType_s == "Failed"
| project TimeGenerated, sourceAddress_s, destinationAddress_s, 
          destinationPort_d, protocol_s, resultDescription_s
| order by TimeGenerated desc
| take 100

// SNAT port usage breakdown by source VM IP
AzureDiagnostics
| where ResourceType == "NATGATEWAYS"
| where TimeGenerated > ago(1h)
| summarize ConnectionCount = count() by sourceAddress_s, resultType_s
| order by ConnectionCount desc

For a deeper look at which VMs are consuming the most SNAT ports, pair this with flow logs on the virtual network. Enable virtual network flow logs under Network Watcher > Flow logs (NSG flow logs are being retired and can no longer be created for new deployments), then use Traffic Analytics (requires a Log Analytics workspace) to get a visual breakdown of your top-talking VMs and external destinations.

If your logs show failures consistently targeting the same destination IP and port, the remote endpoint may be blocking your NAT Gateway's public IP. This happens with rate-limiting APIs or geo-restricted services. The fix there is to contact the remote service's ops team with your NAT Gateway public IP addresses and ask them to allowlist those ranges.

Advanced Troubleshooting

If you've worked through all five steps and still have problems, or if you're in a complex enterprise environment with Azure Firewall, hub-and-spoke topology, or domain-joined VMs, the following advanced scenarios likely apply to you.

Azure Firewall and NAT Gateway Coexistence

A common architectural question is whether Azure Firewall and NAT Gateway can work together. The short answer: yes, in hub-and-spoke, if designed correctly. Azure Firewall has its own SNAT capability built in. If your UDR sends spoke traffic to the firewall, the firewall performs the SNAT, and a NAT Gateway associated with the spoke subnet is never used for that traffic. In this setup, associate NAT Gateway with the AzureFirewallSubnet itself, not with the spoke subnets, to give the firewall more outbound SNAT ports. Associating NAT Gateway with the AzureFirewallSubnet is a supported configuration and the correct pattern for high-volume firewall egress.

Forced Tunneling and On-Premises Routing

In hybrid environments connected via ExpressRoute or VPN Gateway, many organizations apply a forced tunneling policy that routes all internet traffic (0.0.0.0/0) back through on-premises. NAT Gateway is completely bypassed in this scenario because the UDR wins. If you need to route some subnets through NAT Gateway while others go on-prem, use per-subnet route tables and deliberately leave the forced tunneling UDR off the subnets associated with NAT Gateway.

Activity Log Deep Dive

For diagnosing configuration changes that broke a previously working NAT Gateway, the Azure Activity Log is your friend. Go to your NAT Gateway resource and click Activity log in the left menu. Filter by Timespan: Last 7 days and look for any Write or Delete operations on the NAT Gateway or its associated subnets. You'll see exactly who made the change and when, which is critical for incident post-mortems.

# Pull Activity Log entries for NAT Gateway via Azure CLI
# (--resource-id already scopes the query, so --resource-group is not combined with it)
az monitor activity-log list \
  --resource-id "/subscriptions/{sub-id}/resourceGroups/your-rg/providers/Microsoft.Network/natGateways/your-nat-gateway" \
  --start-time 2026-04-13T00:00:00Z \
  --end-time 2026-04-20T00:00:00Z \
  --output table

Resource Health and Service Health Checks

Occasionally (rarely, but it happens), Azure has a platform-level issue affecting NAT Gateway in a specific region. Check Resource health (under Help in the left menu) on the NAT Gateway resource. If it shows "Degraded" or "Unavailable," that's a Microsoft-side issue and no amount of local troubleshooting will fix it. Cross-reference with the Azure Service Health dashboard (search "Service Health" in the portal) and filter by your region and the "Networking" service category.

Testing Connectivity Directly from the VM

Always validate from the VM itself, not from an external perspective. SSH or RDP into a VM in the affected subnet and run:

# Test basic outbound internet connectivity
curl -v --max-time 10 https://api.ipify.org

# The returned IP should be one of your NAT Gateway's public IPs
# If it returns a different public IP or times out, traffic isn't routing through NAT Gateway

# Test DNS resolution (a separate failure point)
nslookup microsoft.com 8.8.8.8

# On Windows, test with Test-NetConnection
Test-NetConnection -ComputerName "api.ipify.org" -Port 443

If curl returns your NAT Gateway public IP, great, routing is working. If it times out or returns nothing, the problem is in the data path between your VM and the gateway. If it returns a different IP entirely, a UDR or NVA is performing its own SNAT before your traffic exits Azure.
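That three-way outcome can be captured in a short bash sketch. The function name check_egress_ip is hypothetical, and the addresses in the example come from a documentation range rather than from any real gateway.

```shell
# Hypothetical helper: compare the egress IP a VM observes with the public IPs
# attached to the NAT Gateway.
#   $1    = IP returned by the echo service (empty if the request timed out)
#   $2... = NAT Gateway public IPs
check_egress_ip() {
  local observed="$1"; shift
  if [ -z "$observed" ]; then
    echo "no-response"; return
  fi
  local ip
  for ip in "$@"; do
    if [ "$ip" = "$observed" ]; then
      echo "via-nat-gateway"; return
    fi
  done
  echo "other-egress-path"
}

# Example with placeholder addresses:
check_egress_ip "203.0.113.10" "203.0.113.10" "203.0.113.11"
```

On a VM, the first argument could come from observed=$(curl -s --max-time 10 https://api.ipify.org), with the remaining arguments taken from the gateway's attached public IPs.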

When to Call Microsoft Support

If Resource Health shows platform degradation, if you've verified all configuration steps and still see consistent failures on a properly configured gateway, or if you're seeing packet drops that don't correlate with SNAT exhaustion, it's time to escalate. Open a support ticket at Microsoft Support with the output of your Metrics charts (especially SNATConnectionCount and DroppedPackets), your diagnostic log query results, and the output of Get-AzEffectiveRouteTable and Get-AzEffectiveNetworkSecurityGroup from the affected VM's NIC. That combination gives the support engineer everything needed to start without 10 rounds of back-and-forth.

Prevention & Best Practices

Azure NAT Gateway troubleshooting is far less fun than getting it right from the start. After handling dozens of production incidents around Azure NAT Gateway SNAT port exhaustion and misconfiguration, here's what I tell every team before they deploy:

Size your public IPs proactively, not reactively. If your subnet will host more than 50 VMs or containers that make frequent outbound connections, start with at least 2 public IPs from day one. A single IP's 64,512 ports sound like a lot, but 100 pods each holding a pool of 10 connections is already 1,000 ports at steady state, and short-lived connections that churn through ports push the real number much higher. Calculate your expected peak concurrent connections, add 20% headroom, and divide by 64,512 (rounding up) to get your minimum IP count.
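As a worked version of that sizing rule, here is a small bash sketch. The function name min_public_ips is hypothetical, and it reads the rule as "apply 20% headroom to the connection count, then divide by ports per IP and round up"; adjust if your team applies headroom differently.

```shell
# Hypothetical helper: minimum public IP count for an expected peak of
# concurrent outbound connections, with 20% headroom applied first.
PORTS_PER_IP=64512
MAX_IPS_PER_GATEWAY=16

min_public_ips() {
  local peak_connections="$1"
  # Add 20% headroom, rounding up.
  local with_headroom=$(( (peak_connections * 120 + 99) / 100 ))
  # Divide by ports per IP, rounding up.
  local ips=$(( (with_headroom + PORTS_PER_IP - 1) / PORTS_PER_IP ))
  if [ "$ips" -gt "$MAX_IPS_PER_GATEWAY" ]; then
    echo "warning: $ips IPs exceeds the ${MAX_IPS_PER_GATEWAY}-IP limit of one gateway" >&2
  fi
  echo "$ips"
}

min_public_ips 1000     # 100 pods x 10 pooled connections, prints 1
min_public_ips 100000   # prints 2
```

The result is a floor, not a target: the advice to start production gateways with at least 2 IPs still applies when the arithmetic says 1.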

Use public IP prefixes instead of individual IPs when possible. A /28 prefix gives you 16 contiguous IPs, which simplifies firewall allowlisting for your partners and gives you immediate room to grow without any application-side changes. The cost difference is minimal.

Set up SNAT exhaustion alerts before you need them. In Azure Monitor, create an alert rule on the DroppedPackets metric for your NAT Gateway. Set a threshold of > 0 with a 5-minute evaluation window. This alert fires the moment you start dropping packets, giving you time to add another public IP before users notice anything. Have the alert email your on-call distribution list or trigger a Logic App to page your team.
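The same alert can be scripted rather than clicked together. The bash sketch below wraps az monitor metrics alert create; the function name, alert name, and description text are hypothetical, and it assumes the Dropped Packets metric is exposed to the API as PacketDropCount, so verify the metric name for your gateway before using it.

```shell
# Sketch: alert when the NAT Gateway drops any packets in a 5-minute window.
# Assumption: the metric's API name is "PacketDropCount".
create_packet_drop_alert() {
  local resource_group="$1"    # resource group that will hold the alert rule
  local nat_gateway_id="$2"    # full ARM resource ID of the NAT Gateway
  local action_group_id="$3"   # action group that emails or pages on-call
  az monitor metrics alert create \
    --name "nat-gateway-dropped-packets" \
    --resource-group "$resource_group" \
    --scopes "$nat_gateway_id" \
    --condition "total PacketDropCount > 0" \
    --window-size 5m \
    --evaluation-frequency 1m \
    --action "$action_group_id" \
    --description "NAT Gateway is dropping packets; check for SNAT exhaustion"
}
```

Keeping the rule in a script means it can be recreated alongside the gateway in each environment instead of being configured by hand once.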

Review your route tables after every infrastructure change. The most common post-deployment Azure NAT Gateway connectivity issue I see comes from someone applying a new UDR to a subnet and forgetting it contains a default route that overrides NAT Gateway. Build a checklist item into your change management process: after any route table modification, run Get-AzEffectiveRouteTable on a test VM in each affected subnet and verify the 0.0.0.0/0 next hop is still Internet.

Enable connection draining and graceful shutdown in your applications. When Azure NAT Gateway maintenance events occur (rare but they happen), connections are migrated transparently, but long-lived connections may reset. Applications that retry on TCP reset recover gracefully; those that don't will crash. Build retry logic with exponential back-off into any service that makes outbound calls through NAT Gateway.
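For scripts and jobs that call out through the gateway, retry with exponential back-off can be sketched in a few lines of bash. The function name retry_with_backoff is hypothetical, and production services would normally use their HTTP client's built-in retry policy instead.

```shell
# Sketch: retry a command with exponential back-off.
#   $1 = maximum attempts, $2 = initial delay in seconds, rest = command to run
retry_with_backoff() {
  local max_attempts="$1" delay="$2"; shift 2
  local attempt=1
  until "$@"; do
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    sleep "$delay"
    delay=$(( delay * 2 ))      # double the wait before the next attempt
    attempt=$(( attempt + 1 ))
  done
}

# Example: up to 5 attempts, waiting 1s, 2s, 4s, then 8s between them.
# retry_with_backoff 5 1 curl --silent --fail --max-time 10 https://api.ipify.org
```

Capping the number of attempts matters as much as the back-off itself: unbounded retries during SNAT exhaustion open even more connections and make the exhaustion worse.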

Quick Wins

Attach at least 2 Standard SKU public IPs to every production NAT Gateway from day one
Create an Azure Monitor alert on DroppedPackets > 0 so you catch SNAT exhaustion before users do
Enable TCP keep-alive in all applications with connections lasting more than 3 minutes
After every IaC deployment, run Get-AzEffectiveRouteTable on a VM in each subnet to verify routing

Frequently Asked Questions

Why does my Azure NAT Gateway show as healthy but VMs still can't reach the internet?

Resource Health "Healthy" only reflects whether the control plane is available, not whether the data plane is functioning or has capacity. The two most common causes of this maddening discrepancy are SNAT port exhaustion (check the SNATConnectionCount Failed metric in Azure Monitor) and a missing or incorrect subnet association (go to Settings > Subnets and confirm your VM's subnet is listed). A UDR with a 0.0.0.0/0 default route pointing to a Network Virtual Appliance will also cause this: the gateway is healthy, but your traffic never reaches it.

How many concurrent connections can one Azure NAT Gateway handle?

Each Standard SKU public IP attached to a NAT Gateway provides 64,512 SNAT ports. You can attach up to 16 public IPs (or a /28 public IP prefix, also giving 16 addresses), for a theoretical maximum of 1,032,192 simultaneous SNAT-mapped connections on a single gateway. In practice, ports are held for the duration of the connection plus the idle timeout period, so the effective throughput depends heavily on your connection lifecycle: short-lived HTTP requests recycle ports far faster than long-lived database connections.

Can I use Azure NAT Gateway with Azure Kubernetes Service (AKS)?

Yes. AKS supports NAT Gateway as an outbound type, and it's a common recommendation for clusters with heavy egress. When creating an AKS cluster, you can set --outbound-type managedNATGateway or --outbound-type userAssignedNATGateway to control how egress is handled. With userAssignedNATGateway, you pre-create the NAT Gateway and associate it with the node subnet yourself, which gives you more control over the public IPs and idle timeout settings. SNAT exhaustion is especially common in AKS environments because each pod can open its own connections; monitor your SNATConnectionCount metric closely as you scale node count up.

Does Azure NAT Gateway work with IPv6?

No. As of early 2026, Azure NAT Gateway only supports IPv4 outbound traffic. It does not support IPv6 SNAT. If your workloads need to reach IPv6 destinations, you'll need an alternative outbound path such as Azure Firewall (which supports IPv6 in preview in some regions) or a Network Virtual Appliance that has dual-stack capabilities. This is a known limitation documented by Microsoft, and support for IPv6 NAT has been on the roadmap but does not have a confirmed general availability date.

What's the difference between Azure NAT Gateway and Load Balancer outbound rules for SNAT?

Load Balancer outbound rules are tied to backend pool membership: every VM in the pool shares a pool of SNAT ports, and those ports are pre-allocated per instance (configurable from 128 to 64,000 per VM). This rigid pre-allocation wastes capacity when VMs are idle. NAT Gateway uses on-demand port allocation: ports are assigned dynamically as connections open and released when they close, making it far more efficient for variable workloads. NAT Gateway also has a higher throughput ceiling and supports connection tracking at the flow level, which Load Balancer SNAT does not. For most new deployments, NAT Gateway is the right choice; Load Balancer outbound rules are mainly useful for existing architectures that already have a Load Balancer in front of the VMs.

My NAT Gateway was working fine and then stopped after a maintenance window. What happened?

Scheduled Azure platform maintenance can occasionally cause brief NAT Gateway interruptions while underlying infrastructure is updated. Most of the time this is transparent (sub-second), but long-lived TCP connections may receive a reset (RST) packet during the transition. If your application doesn't handle TCP resets gracefully, by catching the exception and reconnecting, it will appear broken after maintenance until you restart the affected service. Check the Activity Log on the NAT Gateway resource for any platform-initiated events around the time of the outage. Going forward, implement TCP keep-alive and connection retry logic so that a brief platform event doesn't require a manual service restart to recover.


Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.