How to Troubleshoot Azure Application Gateway

Microsoft Fix Intermediate 18 min read Official Docs Grounded Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why This Is Happening

I've spent years staring at Azure Application Gateway alerts at 2 AM, and I can tell you this: when AGW breaks, it breaks loudly. Users get cryptic 502 Bad Gateway errors, your backend pools go red in the portal, and the error messages Microsoft surfaces give you almost nothing useful to work with. "Unhealthy backend" is not a diagnosis , it's a symptom.

Azure Application Gateway is a Layer 7 load balancer. Unlike a basic TCP load balancer, it understands HTTP and HTTPS traffic, terminates SSL, routes based on URL paths and hostnames, and optionally runs a Web Application Firewall (WAF) in front of your backends. All of that power comes with a lot of moving parts , and any one of them can silently fail in ways that look identical from the outside.

The most common scenario I see: someone deploys a new VM or App Service into a backend pool, doesn't update the health probe path, and suddenly 100% of backends are marked unhealthy even though the app itself is running fine. The Gateway is probing / and getting a 404 or a redirect to HTTPS, and it interprets that as a dead backend. Traffic stops. Alarms fire. Chaos.

Other scenarios that trigger Azure Application Gateway troubleshooting sessions regularly:

HTTP 502 errors, the gateway received an invalid response from the backend, or the backend connection timed out entirely
HTTP 503 Service Unavailable, all backends in the pool are marked unhealthy simultaneously
SSL handshake failures, certificate mismatch, expired cert on the backend, or a missing trusted root certificate uploaded to the gateway
WAF blocking legitimate traffic, OWASP rule sets flagging real user requests as malicious (false positives)
Routing rule misconfiguration, requests intended for one listener landing on the wrong backend pool
NSG blocking health probe traffic, a Network Security Group rule silently dropping the gateway's own probes

The reason Azure's built-in error messages don't help much is that Application Gateway aggregates everything at the gateway level. By the time a 502 surfaces to the user, the actual failure, a TCP timeout, a TLS negotiation failure, a backend HTTP error, has already been swallowed. You have to dig into backend health, diagnostic logs, and sometimes packet captures to find it.

This guide walks you through every layer of that diagnosis. Browse all Microsoft fix guides →

The Quick Fix, Try This First

Before diving deep, do this one check. It catches about 60% of Azure Application Gateway issues in under five minutes.

Open the Azure Portal and navigate to your Application Gateway resource. In the left-hand sidebar, under Monitoring, click Backend health. This page shows you every backend pool, every server in each pool, and the current health probe status for each one.

Look for any backends marked Unhealthy in red. Click on the unhealthy backend, you'll see a column called Details that contains the actual probe response. This is the single most useful piece of information Azure surfaces for AGW problems. Common messages you'll see:

"The HTTP status code received from the backend server does not match the probe's success criteria", your probe expects 200, backend is returning 301, 302, 403, or 404
"The connection to the backend server was refused", NSG is blocking the probe, or the backend app isn't listening on the expected port
"The backend server did not respond within the specified timeout", timeout is too short, or the backend is genuinely overloaded
"The backend certificate is not trusted", SSL end-to-end encryption is enabled but the backend's certificate chain isn't uploaded to the gateway

Once you know which message you're dealing with, jump to the relevant step below. If all backends are suddenly unhealthy at the same time, check your Network Security Group rules first, there's a very good chance a new NSG rule is blocking the gateway's health probe source IP range (168.63.129.16 or the GatewayManager service tag).

If the Backend health page itself fails to load or shows "No data," your Application Gateway may be in a failed provisioning state. In that case, run this quick PowerShell check:

Get-AzApplicationGateway -Name "your-agw-name" -ResourceGroupName "your-rg" | Select-Object Name, OperationalState, ProvisioningState

If OperationalState is Stopped or ProvisioningState is Failed, you'll need to restart the gateway first before any other fix will work.

Pro Tip

The Backend health page in the portal has a known delay, it can lag real state by 2 to 3 minutes. If you've just made a configuration change, wait at least 3 minutes before concluding the fix didn't work. Hitting refresh every 10 seconds won't speed it up; the probe cycle runs on its own timer.

Verify and Fix Health Probe Configuration

The health probe is the heartbeat of Azure Application Gateway. If it's misconfigured, every backend in the pool gets marked unhealthy, and your users get 502s or 503s even though your app is running perfectly. This is the single most common cause of AGW issues I troubleshoot.

In the Azure Portal, navigate to your Application Gateway → Settings → Health probes. Click on your probe to inspect it. Check these fields carefully:

Protocol, must match what your backend actually listens on (HTTP vs HTTPS)
Host, if your backend is a multi-tenant service like App Service, this must be the correct hostname, not blank
Path, must return a 200 status code. If your app returns 301 redirects from /, change this to a dedicated health endpoint like /health or /ping
Timeout (seconds), should be lower than the interval. I recommend 30 seconds timeout, 30 seconds interval to start
Unhealthy threshold, set this to 3, not 1, to avoid flapping on transient failures

To update via PowerShell when you need to script changes across multiple gateways:

$gw = Get-AzApplicationGateway -Name "myAppGateway" -ResourceGroupName "myRG"
$probe = Get-AzApplicationGatewayProbeConfig -ApplicationGateway $gw -Name "myProbe"
Set-AzApplicationGatewayProbeConfig -ApplicationGateway $gw `
  -Name "myProbe" `
  -Protocol Http `
  -Path "/health" `
  -Interval 30 `
  -Timeout 30 `
  -UnhealthyThreshold 3 `
  -Host "myapp.azurewebsites.net" `
  -Match (New-AzApplicationGatewayProbeHealthResponseMatch -StatusCode "200-399")
Set-AzApplicationGateway -ApplicationGateway $gw

After saving, give the probe 2 to 3 minutes to cycle. Then check Backend health again. If backends flip to green, you found your problem.

Diagnose HTTP 502 Bad Gateway Errors

A 502 from Application Gateway specifically means the gateway established a connection attempt to the backend but either got an invalid HTTP response or the connection was reset/timed out before a valid response arrived. I know how maddening this is, the gateway is running, your app looks fine, but users are hitting a wall.

First, check whether the 502 is consistent or intermittent. Consistent 502s usually point to a configuration problem. Intermittent 502s (especially under load) usually point to backend timeouts or connection pool exhaustion.

For consistent 502s, check your HTTP Settings in the portal under Settings → HTTP settings. Specifically look at:

Request timeout, default is 20 seconds. If your backend takes longer to respond (e.g., a cold-start on App Service), increase this to 60 or 120 seconds
Override with new host name, when targeting App Service, set this to "Pick host name from backend target" to avoid hostname mismatch errors
Cookie-based affinity, if enabled but your backend doesn't support sticky sessions, this can cause session routing failures

For intermittent 502s, enable diagnostic logging and query the access log. Run this in Log Analytics:

AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS"
| where httpStatus_d == 502
| project TimeGenerated, requestUri_s, serverStatus_s, timeTaken_d, clientIP_s
| order by TimeGenerated desc
| take 100

The serverStatus_s field tells you what the backend actually returned (or if the connection was reset). If you see 0 in that field, the backend never responded, that's a timeout, not a bad response. Increase your request timeout in HTTP settings and consider scaling your backend.

If you see consistent backend response codes like 403 or 500, the problem is in your application layer, not the gateway itself.

Fix Network Security Group Rules Blocking Probes

This one catches people off guard constantly, and it's one of the first things I check when ALL backends go unhealthy at once after a network team makes changes. NSG rules silently drop health probe packets, and Application Gateway interprets that silence as backend failure.

Azure Application Gateway health probes originate from IP addresses in the gateway subnet. But the Azure infrastructure also uses the IP address 168.63.129.16 for management traffic. Your NSG on the backend subnet (or the gateway subnet itself) needs to allow inbound traffic from the gateway's IP range and from the GatewayManager service tag.

Navigate to your backend VM's or App Service's associated NSG. Go to Inbound security rules and verify these rules exist:

# Required inbound rule on backend subnet NSG
Source: Application Gateway subnet CIDR (e.g., 10.0.1.0/24)
Destination: Backend subnet CIDR
Port: Your backend app port (e.g., 80 or 443)
Action: Allow
Priority: Lower number than any Deny rules (e.g., 100)

# Required inbound rule for AGW management
Source: GatewayManager (service tag)
Destination: Any
Port: 65200-65535
Action: Allow

That second rule, ports 65200 through 65535, is specifically for Application Gateway V2 infrastructure communication. Without it, the gateway itself may fail to provision or update. This is a requirement Microsoft documents but buries deep, and I've seen enterprise teams accidentally block it when tightening security posture.

Also check your Application Gateway subnet's NSG. It needs an inbound rule from AzureLoadBalancer service tag to any port, Application Gateway V2 relies on Azure's internal load balancer for its own health.

After updating NSG rules, test backend health again. Changes to NSGs take effect almost immediately, within 30 to 60 seconds, so you don't need to wait long.

Resolve SSL/TLS Certificate Errors on the Backend

End-to-end SSL with Application Gateway is powerful for security, but it's one of the trickier configurations to get right. When you enable HTTPS on your HTTP settings (backend protocol = HTTPS), the gateway validates the backend's certificate. If that validation fails, backends go unhealthy and you get 502 errors, all while the app itself serves HTTPS just fine when accessed directly.

There are two common certificate-related failure patterns I see:

1. Self-signed or private CA certificate on the backend. Application Gateway V2 (Standard_v2 / WAF_v2 SKU) requires you to upload the trusted root certificate of your backend's certificate chain. Go to Settings → HTTP settings, open your HTTP setting, and under Use for App Service or Custom probe, find the CER certificate upload field. Export your backend server's root CA certificate in Base64-encoded X.509 (.CER) format and upload it here.

To export the root certificate from a Windows machine where you have the backend cert installed:

# Open PowerShell as admin
$cert = Get-ChildItem -Path Cert:\LocalMachine\My | Where-Object {$_.Subject -like "*yourbackend*"}
$rootCert = (New-Object System.Security.Cryptography.X509Certificates.X509Chain)
$rootCert.Build($cert) | Out-Null
$rootCA = $rootCert.ChainElements[-1].Certificate
[System.IO.File]::WriteAllBytes("C:\temp\backend-root.cer", $rootCA.Export([System.Security.Cryptography.X509Certificates.X509ContentType]::Cert))

2. Certificate CN/SAN mismatch. If your backend certificate's Common Name or Subject Alternative Name doesn't match the hostname the gateway uses to connect, the TLS handshake fails. Check the Host name override field in your HTTP settings, it must match exactly what's on the backend certificate. Wildcard certificates need the specific subdomain in the host override, not just *.domain.com.

After uploading the correct root certificate and saving HTTP settings, the gateway reprovisioned state typically takes 2 to 5 minutes to fully propagate. Check Backend health after that window.

Enable Diagnostic Logs and Query Access Logs in Log Analytics

If you're past the obvious fixes and still seeing intermittent failures, you need data. Gut feelings don't fix Azure Application Gateway, log data does. Enabling diagnostic logging is non-optional for any production AGW deployment, and I'll show you exactly how to get it set up and queried.

In the Azure Portal, go to your Application Gateway → Monitoring → Diagnostic settings. Click Add diagnostic setting. Name it something like agw-logs-la. Check these log categories:

ApplicationGatewayAccessLog, every request/response with timing, status codes, client IPs, backend IPs
ApplicationGatewayPerformanceLog, throughput, connection counts, failed requests per backend
ApplicationGatewayFirewallLog, WAF rule matches and blocks (critical if WAF is enabled)

Send logs to a Log Analytics Workspace. Once data starts flowing (allow 5 to 10 minutes for initial ingestion), use these Kusto queries to dig in:

// Find all non-200 responses and what backend served them
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS" and OperationName == "ApplicationGatewayAccess"
| where httpStatus_d != 200
| summarize Count = count() by httpStatus_d, serverRouted_s, requestUri_s
| order by Count desc

// Find slowest backend responses (potential timeout candidates)
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS" and OperationName == "ApplicationGatewayAccess"
| where timeTaken_d > 5000  // milliseconds
| project TimeGenerated, requestUri_s, timeTaken_d, serverRouted_s, httpStatus_d
| order by timeTaken_d desc
| take 50

// WAF blocks, what's being blocked and why
AzureDiagnostics
| where ResourceType == "APPLICATIONGATEWAYS" and OperationName == "ApplicationGatewayFirewall"
| where action_s == "Blocked"
| project TimeGenerated, clientIP_s, requestUri_s, ruleId_s, ruleGroup_s, message_s
| order by TimeGenerated desc

The serverRouted_s field shows you the actual backend IP and port that handled (or failed to handle) each request. Cross-reference that with your backend pool IPs to identify if one specific backend instance is responsible for errors while others are healthy. That narrows your focus dramatically.

Advanced Troubleshooting

When the standard checks don't solve it, you're dealing with something more architectural. Here's where I go when the basics haven't helped.

WAF False Positives Blocking Legitimate Traffic

If you're running Application Gateway WAF (WAF_v2 SKU), the OWASP 3.2 rule set is aggressive by default. I've seen it block legitimate API calls, multipart form uploads, and JSON payloads that contain strings matching injection patterns. Your application works fine when you bypass the WAF, but breaks when it's in the path.

Check the ApplicationGatewayFirewallLog in Log Analytics first (query shown in Step 5). Find the rule IDs that are triggering. Then navigate to Application Gateway → Web application firewall → Rules. You can disable specific rules or entire rule groups without disabling the WAF entirely. For example, Rule Group REQUEST-942-APPLICATION-ATTACK-SQLI is notorious for false positives on legitimate parameterized queries.

Always set WAF to Detection mode first when you're tuning. In Detection mode it logs but doesn't block, giving you data without breaking production. Flip to Prevention mode only after you've verified the rule exclusions are correct.

Routing Rule Conflicts and Priority Issues

Application Gateway evaluates listeners in priority order. If you have multiple path-based routing rules or multi-site listeners, a misconfigured rule can "steal" traffic intended for another listener. Check your routing rules under Settings → Rules. Each rule has a priority number, lower numbers are evaluated first. A catch-all basic listener with priority 1 will intercept everything before your path-based rules at priority 100 ever get a chance to run.

To audit your current routing configuration via PowerShell:

$gw = Get-AzApplicationGateway -Name "myAppGateway" -ResourceGroupName "myRG"
$gw.RequestRoutingRules | Select-Object Name, Priority, RuleType,
  @{N="Listener";E={$_.HttpListener.Id.Split("/")[-1]}},
  @{N="BackendPool";E={$_.BackendAddressPool.Id.Split("/")[-1]}} | Format-Table

Connection Draining and Backend Updates

When you update a backend pool during live traffic, Application Gateway can drop in-flight connections if connection draining isn't configured. Enable it under HTTP settings → Connection draining with a drain timeout of 30 seconds. This tells the gateway to stop sending new requests to a backend being removed but to let existing connections complete naturally, essential for zero-downtime deployments.

Checking the Activity Log for Provisioning Failures

Gateway configuration changes that partially fail leave the resource in a bad state. Go to Monitoring → Activity log and filter for the last 24 hours. Look for any Write ApplicationGateway operations with status Failed. Expand the failure entry to see the error detail, common causes include IP address conflicts in the subnet, exceeding subscription limits on public IPs, or Key Vault certificate permission errors.

When to Call Microsoft Support

Escalate to Microsoft Support when: your gateway is stuck in a Failed or Updating provisioning state for more than 30 minutes and the portal shows no actionable error; you're seeing platform-level errors in the Activity log referencing internal Azure fabric failures; or you have a P1 production outage and have already verified all configuration elements are correct. Create a Severity A support ticket and include your gateway Resource ID, the Activity log correlation IDs from the failed operations, and your Log Analytics workspace ID.

Prevention & Best Practices

Most Azure Application Gateway outages are preventable. I've seen the same failure patterns repeat across dozens of organizations, and they almost always come down to skipping a few setup steps that seem optional until they're not.

Design a dedicated health endpoint on every backend. Don't rely on the root path / for health probes. Create a /health route in your application that returns HTTP 200 unconditionally (or checks a database ping and returns 503 if the DB is down). This gives you precise control over when backends mark themselves unhealthy, and it avoids false failures when your root path redirects or requires authentication.

Enable diagnostic logs from day one. I can't stress this enough. When an incident happens at 3 AM, you want Log Analytics already ingesting your access logs and firewall logs. Retroactively enabling logs after an outage means you have no data from when the problem actually started. Set up a Log Analytics workspace, connect it to your AGW on first deployment, and it costs almost nothing at low traffic volumes.

Set up Azure Monitor alerts for backend health. Navigate to Monitoring → Alerts → Create alert rule. Use the metric Unhealthy Host Count with a threshold of > 0 for any 5-minute window. This fires before users start complaining. Also alert on Response Status metric filtered to 5xx codes exceeding your baseline.

Test NSG changes in staging before production. Every time your network team updates NSG rules, run a health probe verification immediately after. Create a simple Azure DevOps pipeline stage that calls the AGW backend health API and fails the deployment if any backend is unhealthy.

Document your certificate rotation schedule. Backend SSL certificate expiry is a silent killer. When the backend cert expires, every backend goes unhealthy instantly. Put certificate expiry dates in your team's calendar with 60-day and 30-day reminders. Consider using Azure Key Vault certificate integration with auto-rotation for the gateway's own frontend certificates.

Quick Wins

Set WAF to Detection mode initially and tune exclusions before enabling Prevention mode in production
Use the GatewayManager and AzureLoadBalancer service tags in NSG rules instead of hardcoded IP ranges, they update automatically as Azure evolves its infrastructure
Enable connection draining (30 seconds minimum) in HTTP settings to protect in-flight requests during backend pool updates
Pin your Application Gateway to a dedicated subnet with no other resources, mixing AGW with other workloads in the same subnet causes NSG conflicts that are painful to debug

Frequently Asked Questions

Why does my Azure Application Gateway show all backends as unhealthy after I changed an NSG rule?

This is almost certainly because your new NSG rule is blocking the health probe traffic from the Application Gateway subnet to your backend subnet, or blocking the required management port range 65200–65535. Application Gateway health probes originate from the gateway subnet's IP range. Open an inbound NSG rule on your backend subnet that explicitly allows TCP traffic from the Application Gateway subnet CIDR on your application's port. Also verify the rule allowing inbound from the GatewayManager service tag on ports 65200–65535 is still present on the AGW subnet NSG, it's easy to accidentally delete during a cleanup sweep. Changes take effect within 60 seconds, so recheck Backend health after waiting 2 to 3 minutes.

How do I fix Azure Application Gateway 502 errors that only happen under high load?

Intermittent 502s under load typically mean one of three things: your backend request timeout in HTTP settings is too short and backends are taking longer to respond under stress; your backend instances are hitting connection limits or memory pressure and resetting connections; or your Application Gateway autoscaling hasn't provisioned enough capacity units fast enough. Start by increasing the request timeout to 120 seconds in HTTP settings. Then check your backend metrics, CPU above 80% or connection queue depth rising sharply during the 502 windows is a dead giveaway. For AGW V2, set a minimum capacity unit count above 2 so there's pre-provisioned capacity before traffic spikes hit. Also consider enabling Application Gateway autoscale with a max instance count appropriate for your traffic patterns.

My WAF is blocking legitimate requests, how do I find which rule is causing it without turning WAF off?

Switch WAF to Detection mode first (under Web application firewall → Policy settings), this logs everything but blocks nothing, so production traffic flows while you gather data. Then query the ApplicationGatewayFirewallLog table in Log Analytics using the query from Step 5 above, filtering for action_s == "Matched". You'll see the specific rule IDs and rule groups triggering on your legitimate requests. Once identified, create rule exclusions in the WAF policy under Exclusions, you can scope exclusions to specific match variables, request attributes, and even specific request URLs so you're not broadly weakening your WAF posture. After adding exclusions, flip back to Prevention mode and monitor for a few hours.

How do I troubleshoot SSL certificate errors on Azure Application Gateway backends?

When you enable HTTPS on your HTTP settings (backend protocol HTTPS), Application Gateway V2 validates the backend certificate chain against trusted root certificates you upload. If nothing is uploaded, or if the wrong root is uploaded, backends go unhealthy with the message "The backend certificate is not trusted." Export the root CA certificate of your backend's cert in Base64 X.509 .CER format and upload it under the HTTP settings for that backend. If you're using App Service, you can toggle "Use for App Service" which automatically trusts App Service's certificate chain. Also verify the hostname in your HTTP settings matches the CN or a SAN entry on the backend certificate, a mismatch causes TLS handshake failures even with the correct root CA uploaded.

What's the difference between a 502 and a 503 from Azure Application Gateway?

A 502 Bad Gateway means the Application Gateway was able to select a backend to send the request to, but the backend returned an invalid response or the connection failed mid-request, the gateway tried but the backend didn't cooperate. A 503 Service Unavailable means the gateway found zero healthy backends in the pool to route the request to, it never even tried to forward it. If you're seeing 503, check Backend health immediately, all backends in your pool are marked unhealthy simultaneously, which typically points to a health probe misconfiguration, an NSG change blocking probes, or a backend application restart that temporarily fails health checks. 502 requires investigation into what the backend is actually returning; 503 is almost always a probe or availability issue.

Can Azure Application Gateway route traffic to backends in a different region or subscription?

Yes, Application Gateway supports FQDN-based backend pool members, you can add a DNS name like myapp.eastus.azurewebsites.net as a backend target regardless of region or subscription. The gateway resolves the FQDN at health probe time and routes accordingly. However, cross-region routing through AGW adds latency and is generally not recommended for latency-sensitive workloads, Azure Front Door is better suited for global load balancing with anycast routing. If you're targeting a different subscription's resources, the backend just needs to be network-reachable (via VNet peering, private endpoint, or public FQDN) from the Application Gateway subnet. There's no subscription-level restriction on backend targets, only network-level reachability requirements.

Related Microsoft Fix Guides

Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.