How to Troubleshoot Azure Traffic Manager (Complete Fix Guide)

Microsoft Fix | Intermediate | 18 min read | Official Docs Grounded | Updated April 20, 2026

Why This Is Happening

I've seen this exact scenario play out on dozens of Azure deployments: everything looks green in the portal, your Traffic Manager profile is "Online," and then suddenly your users start hitting errors, or worse, all traffic silently routes to a single region while the others sit idle. Azure Traffic Manager troubleshooting is one of those tasks that feels straightforward until you're actually in it at 2 a.m. with your CTO on a call.

Here's the thing about Traffic Manager that catches even experienced Azure engineers off guard: it works entirely at the DNS layer. It doesn't proxy traffic. It doesn't sit in your data path. It simply answers each DNS query with the endpoint it has selected, and that distinction is the root cause of most Azure Traffic Manager problems that don't have an obvious answer.

When Traffic Manager seems to stop working, the failure can come from several different directions simultaneously. Endpoint health probes start returning unexpected status codes. DNS TTL caching means your routing change doesn't take effect for minutes or hours after you made it. Clients resolve the DNS name, cache the result, and then completely bypass Traffic Manager for the duration of that TTL, even if the endpoint goes down. This is by design, not a bug, but it surprises people every single time.

The most common root causes I see in real-world Azure Traffic Manager troubleshooting scenarios fall into four buckets. First: endpoint health check misconfiguration, where the probe path doesn't exist, returns the wrong HTTP status code, or is blocked by a firewall or NSG rule. Second: DNS propagation delays combined with long TTL values that keep clients pinned to degraded endpoints far longer than intended. Third: routing method mismatches, where the engineer configured "Priority" routing but expected "Performance" behavior, or left every "Weighted" endpoint at the same weight and got an even split they did not expect. Fourth: nested profile misconfiguration in enterprise topologies, where the parent profile marks a child profile degraded based on stale health data.

I know this is frustrating, especially when you've got a multi-region architecture specifically designed for resilience and it's failing during the incident you built it to handle. The good news is that Azure Traffic Manager problems are almost always diagnosable and fixable without Microsoft Support involvement if you know where to look.


The Quick Fix: Try This First

Before you dive deep into Azure Traffic Manager troubleshooting, do this one check. Open the Azure Portal, navigate to your Traffic Manager profile, and click Endpoints in the left-hand blade. Look at the Monitor Status column for every endpoint. What do you see?

If any endpoint shows Degraded or Inactive, that's your starting point. Go back to the profile, open Configuration, and look at the Endpoint monitor settings. Check three things: the protocol (HTTP vs HTTPS), the port, and the path. In my experience, about 60% of Azure Traffic Manager health probe failures come down to one of these three values being wrong. The default probe path is /, but if your application serves a 301 redirect from / to /home, Traffic Manager will mark the endpoint Degraded: probes do not follow redirects, and a 3xx response counts as a failure unless you explicitly add it to the profile's expected status code ranges.

Change the probe path to a dedicated health endpoint that returns HTTP 200 directly, with no redirects:

GET /health HTTP/1.1
Host: your-app.azurewebsites.net
Response: 200 OK

After updating the probe path, Traffic Manager will re-evaluate the endpoint within its next probe interval. The default probe interval is 30 seconds; you can lower it to 10 seconds for fast endpoint failover. Separately, the tolerated number of failures (default: 3) controls how many consecutive failed probes are tolerated before the endpoint is marked Degraded. Give it 90 seconds and refresh the Endpoints view.
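To make the interaction between probe results and the tolerated-failures setting concrete, here is a minimal Python sketch. It assumes the endpoint is marked Degraded as soon as the run of consecutive failed probes exceeds the tolerated count; the function name and the list-of-booleans input are illustrative, not part of any Azure SDK.

```python
def first_degraded_probe(results, tolerated_failures=3):
    """Return the 1-based index of the probe that marks the endpoint Degraded.

    results: iterable of booleans, True meaning the probe succeeded.
    Returns None if the endpoint never leaves Online.
    Assumption: Degraded is declared once the number of consecutive
    failures exceeds tolerated_failures.
    """
    consecutive = 0
    for index, ok in enumerate(results, start=1):
        # A success resets the streak; a failure extends it.
        consecutive = 0 if ok else consecutive + 1
        if consecutive > tolerated_failures:
            return index
    return None


# One success followed by a run of failures, default tolerance.
outage = [True, False, False, False, False]
print(first_degraded_probe(outage))

# A single success in the middle resets the streak.
flaky = [False, False, False, True, False]
print(first_degraded_probe(flaky))
```

Multiply the returned index by your probe interval to estimate how long detection takes for a given failure pattern.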

If all endpoints show Online but traffic still isn't routing correctly, the problem is almost certainly DNS TTL caching on the client side. Run this from a Windows PowerShell session to check what DNS is resolving (Resolve-DnsName ships with the Windows DnsClient module; in Azure Cloud Shell, use dig instead):

Resolve-DnsName your-profile.trafficmanager.net -Type A

Compare the returned IP against where traffic should go. If it's pointing at the right endpoint but you're still hitting the wrong destination, something between the client and your app is caching beyond the TTL: a corporate DNS resolver, a CDN layer, or even the OS DNS cache. To clear the OS cache, run ipconfig /flushdns on Windows or sudo resolvectl flush-caches on Linux (sudo systemd-resolve --flush-caches on older releases).

Pro Tip
Never test Azure Traffic Manager behavior from the same machine that made recent DNS queries to the profile. Your OS resolver will have cached the result and you'll think nothing changed. Always test from a fresh Cloud Shell session or a machine on a different network. I keep an Azure VM in a different region specifically for this kind of validation.

1. Verify Endpoint Health Probe Configuration

This is ground zero for Azure Traffic Manager troubleshooting. Head to the Azure Portal and open your Traffic Manager profile. In the left menu, select Configuration. Under the Endpoint monitor settings section, you'll see the Protocol, Port, and Path fields.

For HTTP and HTTPS probes, Traffic Manager sends a GET request to the configured path and, by default, treats only 200 OK as healthy. Any other response, including another 2xx code, 301, 302, 401, 403, or 404, counts as a failure unless you list it under Expected status code ranges. The endpoint is marked Degraded once consecutive failures exceed the tolerated number of failures (default: 3). Here's what each status means in practice:

  • Online: Probe is succeeding. Endpoint is eligible to receive traffic.
  • Degraded: Probe is failing. Endpoint is excluded from DNS responses.
  • Disabled: You manually disabled it. Traffic Manager won't even probe it.
  • Inactive: The parent profile is disabled, or the endpoint type is invalid.
  • Checking endpoint: Traffic Manager is actively running its first probe. Wait 90 seconds.

Use Azure CLI to query the exact probe result status programmatically, because the portal sometimes lags behind the API:

az network traffic-manager endpoint show \
  --resource-group myRG \
  --profile-name myTMProfile \
  --type azureEndpoints \
  --name myEndpoint \
  --query "endpointMonitorStatus"

If you need to probe from a specific region to test reachability, use curl from an Azure VM or Container Instance in that region:

curl -I https://your-app.azurewebsites.net/health

You should see HTTP/2 200 (or HTTP/1.1 200 OK) in the response headers. If you see a 301 or 403 from that location, your probe will fail even if the app works fine through the browser (which handles redirects automatically).
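The same check can be scripted. The Python sketch below sends a single HTTPS GET without following redirects and compares the status code against whichever ranges your profile accepts. Both function names and the range-list format are illustrative assumptions; this mimics a probe from your own machine and is not how Traffic Manager's probe agents are implemented.

```python
import http.client


def probe_is_healthy(status_code, expected_ranges):
    """True if status_code falls inside any inclusive (low, high) range."""
    return any(low <= status_code <= high for low, high in expected_ranges)


def fetch_status(host, path="/health", port=443, timeout=10):
    """Send one HTTPS GET and return the raw status code.

    http.client never follows redirects, so a 301 is reported as 301.
    """
    conn = http.client.HTTPSConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()


# Evaluate a few status codes against a profile that only accepts 200.
only_200 = [(200, 200)]
for code in (200, 204, 301, 403):
    print(code, probe_is_healthy(code, only_200))
```

To test a live endpoint, call probe_is_healthy(fetch_status("your-app.azurewebsites.net"), only_200) from a VM in the region you care about.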

2. Check Firewall and NSG Rules Blocking Health Probes

This one is responsible for more head-scratching than almost anything else in Azure Traffic Manager troubleshooting. Your endpoint looks healthy in every other test (you curl it, you browse to it, everything works), but Traffic Manager keeps marking it Degraded. The culprit: a Network Security Group (NSG) or Web Application Firewall (WAF) policy that blocks the specific IP ranges Traffic Manager uses to send health probes.

Azure Traffic Manager sends health probes from a well-known set of IP address ranges. These are the same Azure datacenter IP ranges published in the weekly JSON download, but specifically from the AzureTrafficManager service tag. You need to allow inbound traffic from this service tag on your probe port.

To add the rule to an existing NSG via PowerShell:

$nsg = Get-AzNetworkSecurityGroup -Name "myNSG" -ResourceGroupName "myRG"

Add-AzNetworkSecurityRuleConfig `
  -NetworkSecurityGroup $nsg `
  -Name "Allow-TrafficManager-Probes" `
  -Protocol Tcp `
  -Direction Inbound `
  -Priority 100 `
  -SourceAddressPrefix "AzureTrafficManager" `
  -SourcePortRange * `
  -DestinationAddressPrefix * `
  -DestinationPortRange 443 `
  -Access Allow

Set-AzNetworkSecurityGroup -NetworkSecurityGroup $nsg

For Azure App Service with IP restrictions enabled, go to your App Service → Networking → Access Restrictions and add a rule for the AzureTrafficManager service tag. Without this, the App Service will return 403 Forbidden to every probe, and Traffic Manager will correctly, if infuriatingly, mark it Degraded.

After adding the NSG rule, wait one full probe cycle (30 seconds standard, 10 seconds fast) and refresh the endpoint status. If it flips to Online, you've found your issue.

3. Diagnose DNS Resolution and TTL Caching Issues

When Azure Traffic Manager isn't routing traffic the way you expect, even though all endpoints show Online, the problem is almost always living in DNS. Traffic Manager is a DNS-based load balancer, which means clients resolve the profile name to an IP, cache that IP for the duration of the TTL, and then talk directly to that IP for every subsequent request until the TTL expires. Traffic Manager is completely out of the picture until the next DNS query.

Check your profile's current TTL setting in the portal under Configuration. The default is 300 seconds (5 minutes). During a failover, this means clients can continue hitting a degraded endpoint for up to 5 minutes after Traffic Manager has already stopped returning that IP in new DNS responses. Lower the TTL to 30 seconds for critical profiles, but understand this increases DNS query volume and costs.

Run a live DNS trace to see exactly what Traffic Manager is returning right now:

# PowerShell, check what IP is being returned
Resolve-DnsName your-profile.trafficmanager.net -Server 8.8.8.8 | Select-Object Name, IPAddress, TTL

# Bash / Cloud Shell
dig @8.8.8.8 your-profile.trafficmanager.net +short

Using 8.8.8.8 (Google DNS) or 1.1.1.1 (Cloudflare) bypasses your corporate resolver and shows what a public resolver is getting from Traffic Manager. Public resolvers cache too, so an answer can still be up to one TTL old; repeat the query after the TTL has elapsed before drawing conclusions. If this returns the correct endpoint but your app is still hitting the wrong one, the issue is local DNS caching.

Flush the DNS cache on Windows clients:

ipconfig /flushdns

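On Linux/macOS:

# systemd-resolved (older releases: sudo systemd-resolve --flush-caches)
sudo resolvectl flush-caches

# nscd
sudo service nscd restart

# macOS
sudo dscacheutil -flushcache; sudo killall -HUP mDNSResponder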

One more thing to check: if you're using Azure DNS Private Zones alongside Traffic Manager, confirm there's no conflicting A record in the private zone that overrides what Traffic Manager returns for internal clients.

4. Validate Your Routing Method Configuration

Azure Traffic Manager supports six routing methods (Priority, Weighted, Performance, Geographic, MultiValue, and Subnet), and picking the wrong one, or misconfiguring the right one, produces behavior that looks like a bug but isn't. This is a very common stop in Azure Traffic Manager troubleshooting that gets overlooked because people assume the routing method they set up originally is still the right one.

Navigate to your profile's Configuration blade and look at the Routing method dropdown. Then ask yourself: is this actually what I want?

  • Priority: All traffic goes to the endpoint with the lowest priority number (Priority 1) unless it's Degraded; then it fails over to Priority 2, and so on. If you're seeing all traffic on one endpoint when you expected distribution, you likely have Priority routing instead of Weighted.
  • Weighted: Traffic is distributed proportionally based on endpoint weights. If you set all weights to 1, distribution will be roughly equal. If one endpoint has Weight 10 and another has Weight 1, that first endpoint gets about 91% of traffic.
  • Performance: Clients are routed to the endpoint with the lowest network latency from their location. This requires the endpoints to be in different Azure regions and relies on Traffic Manager's internal latency table. If all your endpoints are in the same region, Performance routing degenerates to round-robin.
  • Geographic: Routes based on the geographic location of the DNS query source. A common misconfiguration here: if a region isn't mapped to any endpoint, queries from that region will receive a "No data" DNS response and users get a connection failure.
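For Weighted routing, the expected split is simple arithmetic: each endpoint's share is its weight divided by the sum of the weights. The Python sketch below is a small illustration of that calculation. The endpoint names are made up, and you should pass only the endpoints that are currently Online, since the others are not returned in DNS responses.

```python
def expected_shares(weights):
    """Map endpoint name -> expected fraction of DNS responses.

    weights: dict of endpoint name -> weight for the Online endpoints.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one endpoint needs a positive weight")
    return {name: weight / total for name, weight in weights.items()}


# Hypothetical two-endpoint profile: weight 10 versus weight 1.
shares = expected_shares({"east": 10, "west": 1})
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

Compare these fractions with the Queries by Endpoint Returned metric over a long enough window; short windows will look noisy because the distribution is per DNS query, not per user request.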

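To check the current routing method and then update it via CLI:

az network traffic-manager profile show \
  --resource-group myRG \
  --name myTMProfile \
  --query "trafficRoutingMethod"

az network traffic-manager profile update \
  --resource-group myRG \
  --name myTMProfile \
  --routing-method Performance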

After changing the routing method, the new behavior takes effect immediately for new DNS queries, but existing cached resolutions will continue until their TTL expires.

5. Use Azure Monitor and Diagnostics to Pinpoint the Failure

When the portal isn't giving you enough information, or when you need to understand what happened during a past incident, Azure Monitor is where you go. Azure Traffic Manager emits metrics you can query to correlate endpoint health changes with traffic shifts over time.

In the portal, open your Traffic Manager profile and click Metrics under Monitoring. The key metrics for Azure Traffic Manager troubleshooting are:

  • Endpoint Status by Endpoint: Reports 1 when an endpoint is Online and 0 otherwise. Chart it over time, split by endpoint name, to pinpoint exactly when an endpoint went Degraded.
  • Queries by Endpoint Returned, Shows how many DNS queries resolved to each endpoint. If this drops to zero for an endpoint that should be receiving traffic, you've confirmed a routing problem.
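  • Probe latency: I'm not aware of a built-in platform metric for probe round-trip time, so measure it from the endpoint side. Response times for the probe path in your own access logs show whether probes are slow to complete, which points to network-level or application-level delays.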

Enable Diagnostic Settings to capture Traffic Manager logs to Log Analytics. From your profile, click Diagnostic settings → Add diagnostic setting. Select ProbeHealthStatusEvents and send to a Log Analytics Workspace. Then run this KQL query to see recent probe failures:

AzureDiagnostics
| where ResourceType == "TRAFFICMANAGERPROFILES"
| where Category == "ProbeHealthStatusEvents"
// Column names assume the logged EndpointName and Status properties with the
// AzureDiagnostics _s suffix; run "AzureDiagnostics | getschema" to confirm.
| where Status_s == "Down"
| project TimeGenerated, Resource, EndpointName_s, Status_s
| order by TimeGenerated desc
| take 50

This query returns the last 50 probe events where an endpoint wasn't Online, each with its timestamp. I've used this to prove that a backend team's deployment caused a 3-minute outage; the timestamps don't lie. If you need real-time alerting, create an Alert Rule on the Endpoint Status by Endpoint metric that fires when an endpoint's value drops below 1, and route it to an Action Group that pages your on-call team.

Advanced Troubleshooting

Nested Traffic Manager Profile Failures

Enterprise architectures frequently use nested Traffic Manager profiles: a parent profile routes to child profiles (each covering a region), and each child profile manages individual endpoints within that region. This is the recommended pattern for global + regional failover, but it adds complexity to Azure Traffic Manager troubleshooting because a child profile's health status rolls up to the parent in non-obvious ways.

A nested endpoint (child profile) is considered Degraded by the parent if fewer than the configured minimum number of child endpoints are Online. The default minimum is 1, meaning as long as at least one endpoint in the child profile is Online, the parent sees the child as healthy. If you need the child to be considered Degraded when more than one endpoint fails (for capacity reasons), increase the minimum endpoint count:

az network traffic-manager endpoint update \
  --resource-group myRG \
  --profile-name myParentProfile \
  --type nestedEndpoints \
  --name myChildProfileEndpoint \
  --min-child-endpoints 2
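The roll-up rule described above fits in a few lines. This Python sketch only illustrates the rule as stated in this section (a child is healthy when at least the minimum number of its endpoints are Online); the function name and the status strings passed in are illustrative.

```python
def child_profile_is_healthy(endpoint_statuses, min_child_endpoints=1):
    """Return True if the parent should treat the child profile as healthy.

    endpoint_statuses: list of monitor status strings for the child's endpoints.
    """
    online = sum(1 for status in endpoint_statuses if status == "Online")
    return online >= min_child_endpoints


statuses = ["Online", "Degraded", "Degraded"]
print(child_profile_is_healthy(statuses))                         # default minimum of 1
print(child_profile_is_healthy(statuses, min_child_endpoints=2))  # stricter minimum
```

Running the numbers this way before changing the minimum helps you see which failure combinations would take a whole region out of rotation.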

Geographic Routing: The "No Data" Trap

With Geographic routing, any region you leave unmapped is a gap. If a user from an unmapped region queries your Traffic Manager profile, they receive a NOERROR DNS response with no answer (NODATA), effectively a silent DNS failure. The fix: map the World geographic region to a fallback endpoint. This acts as a catch-all for any unmapped location. In the portal, edit the endpoint and under Geographic mapping, add "World (All)" as a region.
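The lookup behaves like a most-specific-match search with an optional catch-all. The Python sketch below models that behavior under simplifying assumptions: the region codes ("FR", "GEO-EU", "WORLD") and endpoint names are illustrative placeholders, and the caller supplies the chain from most to least specific.

```python
def resolve_geo(region_chain, mapping):
    """Return the endpoint mapped to the most specific matching region.

    region_chain: codes ordered from most to least specific,
                  for example ["FR", "GEO-EU", "WORLD"].
    mapping: dict of region code -> endpoint name.
    Returns None when nothing matches, modelling the empty DNS answer.
    """
    for code in region_chain:
        if code in mapping:
            return mapping[code]
    return None


mapping = {"GEO-EU": "eu-endpoint"}
print(resolve_geo(["FR", "GEO-EU", "WORLD"], mapping))  # matched by the EU grouping
print(resolve_geo(["BR", "GEO-SA", "WORLD"], mapping))  # unmapped: no answer

# Adding a World mapping closes the gap.
mapping["WORLD"] = "fallback-endpoint"
print(resolve_geo(["BR", "GEO-SA", "WORLD"], mapping))
```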

Subnet Routing Mismatches

Subnet routing maps specific client IP ranges to endpoints. If a client IP falls outside all defined subnets, Traffic Manager falls back to an endpoint that has no subnet ranges mapped to it. If you haven't left such a fallback endpoint, those clients get an empty (NODATA) DNS response. Double-check your subnet ranges for overlaps and gaps, especially after IP range expansions in your corporate network.
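Overlaps and gaps are easy to check offline with Python's standard ipaddress module. The sketch below is a local sanity check, not a call to any Azure API; the endpoint names and CIDR ranges are made-up examples, and the fallback argument stands in for whatever endpoint you have left as the catch-all.

```python
import ipaddress


def find_overlaps(subnet_map):
    """Return (endpoint_a, cidr_a, endpoint_b, cidr_b) for overlapping ranges."""
    entries = [
        (endpoint, ipaddress.ip_network(cidr))
        for endpoint, cidrs in subnet_map.items()
        for cidr in cidrs
    ]
    overlaps = []
    for i, (ep_a, net_a) in enumerate(entries):
        for ep_b, net_b in entries[i + 1:]:
            # Only compare ranges of the same IP version.
            if net_a.version == net_b.version and net_a.overlaps(net_b):
                overlaps.append((ep_a, str(net_a), ep_b, str(net_b)))
    return overlaps


def endpoint_for(client_ip, subnet_map, fallback=None):
    """Return the first endpoint whose ranges contain client_ip, else fallback."""
    address = ipaddress.ip_address(client_ip)
    for endpoint, cidrs in subnet_map.items():
        for cidr in cidrs:
            if address in ipaddress.ip_network(cidr):
                return endpoint
    return fallback


subnets = {
    "office": ["10.0.0.0/8"],
    "lab": ["10.1.0.0/16", "192.168.0.0/24"],
}
print(find_overlaps(subnets))
print(endpoint_for("192.168.0.5", subnets))
print(endpoint_for("8.8.8.8", subnets))  # falls in a gap, so the fallback (None here) is returned
```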

Checking Traffic Manager Probe Sources

Traffic Manager uses dedicated probe agents distributed globally. You can't control which probe agent tests your endpoint. If your endpoint has geo-blocking rules (only allowing traffic from certain countries), probes from other regions will fail. To see which probe source IPs hit your endpoint logs, search your app access logs for the user-agent string:

Microsoft.Azure.Traffic Manager Health Check

Any request with this user-agent is a Traffic Manager health probe. If these are being blocked in your firewall logs, that's your degraded endpoint cause.

When to Call Microsoft Support

Most Azure Traffic Manager problems are self-serviceable, but escalate to Microsoft Support if: you're seeing endpoints intermittently flip between Online and Degraded with no changes on your side (potential platform-side probe agent issue), if DNS responses from Traffic Manager contain incorrect IPs that don't match any of your endpoints, or if you've confirmed all endpoints are healthy and DNS is resolving correctly but traffic distribution is severely skewed beyond what your weights or routing method can explain. Open a Severity B support ticket via the Azure Portal (Help + Support → Create a support request, selecting Traffic Manager as the service). Include your profile resource ID, the time window of the issue, and the results of your Resolve-DnsName and Metrics exports.

Prevention & Best Practices

The best Azure Traffic Manager troubleshooting session is the one you never have to run. After spending years working through these issues in production, here are the patterns that actually prevent problems rather than just reacting to them.

Build a dedicated health probe endpoint. Never probe your root URL or a page that has dependencies (database calls, external API calls, authentication redirects). Create a dedicated /health route in every application that checks internal dependencies and returns 200 only when everything is genuinely ready to serve traffic, 503 when it isn't. Traffic Manager will respect 503 as "unhealthy" and stop routing there. This one change prevents false positives and false negatives simultaneously.
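As a rough illustration of such a route, here is a minimal sketch using only Python's standard library. The dependencies_ready function is a hypothetical placeholder for your own readiness checks, and port 8080 is an arbitrary choice; a real application would wire the same logic into its existing web framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def dependencies_ready():
    """Hypothetical readiness check; replace with real dependency checks."""
    return True


def health_response(ready):
    """Return (status_code, body): 200 when ready, 503 otherwise."""
    return (200, b"ok") if ready else (503, b"unavailable")


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/health":
            self.send_error(404)
            return
        status, body = health_response(dependencies_ready())
        # Answer directly with no redirect so the probe sees the final status.
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep frequent probe requests out of stderr


def serve(port=8080):
    """Block forever serving /health on the given port."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Call serve() to run it locally, then point curl -I at http://localhost:8080/health to confirm the status code before configuring the probe path.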

Set a realistic TTL. The default 300 second TTL means a 5-minute blast radius when an endpoint fails. For most production workloads, 60 seconds is a reasonable compromise between failover speed and DNS query cost. For very high-availability scenarios with fast probe intervals enabled, drop it to 30 seconds. Document your TTL in your runbook so your on-call team knows how long to expect degraded behavior after a failover event.

Test your failover before you need it. Manually disable an endpoint in the portal and watch how long it takes for traffic to move. Keep in mind that a manual disable takes effect immediately, so it only exercises the DNS TTL part of the window; a real outage adds probe detection time on top. A useful worst-case estimate is TTL + probe interval + tolerated failures × probe interval. For a 60-second TTL, 30-second probe interval, and 3 tolerated failures, that is 60 + 30 + (3 × 30) = 180 seconds. Know this number for your architecture before an incident forces you to learn it.
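The estimate is easy to keep in a runbook as a small helper. This Python sketch simply evaluates the formula quoted above (TTL + probe interval + tolerated failures × probe interval); the function name is illustrative, and the result is an approximation that ignores probe timeouts and resolvers that cache beyond the TTL.

```python
def worst_case_failover_seconds(ttl, probe_interval, tolerated_failures):
    """Estimate worst-case failover time in seconds.

    Detection takes one probe interval plus one interval per tolerated
    failure; clients may then keep a cached answer for up to one TTL.
    """
    detection = probe_interval + tolerated_failures * probe_interval
    return ttl + detection


# Fast-failover style settings: 30 s TTL, 10 s interval, 3 tolerated failures.
print(worst_case_failover_seconds(ttl=30, probe_interval=10, tolerated_failures=3))

# A 300 s TTL with a 30 s interval and 3 tolerated failures.
print(worst_case_failover_seconds(ttl=300, probe_interval=30, tolerated_failures=3))
```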

Use Azure Monitor Alerts proactively. Set up alerts on the Endpoint Status by Endpoint metric before there's a problem. Route alerts to a Teams channel or PagerDuty Action Group. By the time your users report an outage, your alert should have already fired.

Quick Wins
  • Switch probe path from / to a dedicated /health endpoint that returns 200 with zero dependencies
  • Lower TTL to 60 seconds on production profiles; document the expected failover window in your runbook
  • Enable fast endpoint failover (10-second probe interval) for Tier-1 services; it costs marginally more but cuts failover time dramatically
  • Add the AzureTrafficManager service tag to all NSG and App Service IP restriction allow-lists during initial deployment, not after the first outage

Frequently Asked Questions

Why does my Azure Traffic Manager endpoint keep switching between Online and Degraded every few minutes?

This flapping behavior almost always means the health probe is intermittently failing, typically because the endpoint is returning inconsistent response times that exceed Traffic Manager's probe timeout (default: 10 seconds), or because the endpoint occasionally returns an unexpected status code. Check your application's error rate and response time in Application Insights during the flapping windows. Also look at whether your endpoint is behind a load balancer that occasionally routes probes to a degraded instance; Traffic Manager probes the external IP, not individual instances. Another common cause is a probe path that triggers a slow database query, causing sporadic timeouts. Switch to a lightweight health endpoint that skips heavy operations.

Traffic Manager shows all endpoints Online but users are still getting errors, what's going on?

This is the classic DNS caching scenario. Traffic Manager's health status being Online only means the probe is succeeding; it doesn't mean client DNS has refreshed yet. If an endpoint failed and later recovered, or you changed the routing, clients that cached an earlier answer keep using that IP until their DNS TTL expires. Run Resolve-DnsName your-profile.trafficmanager.net -Server 1.1.1.1 from an uncached client to confirm Traffic Manager is now returning the correct IP. Also check: if you're using a CDN in front of Traffic Manager, the CDN may have its own caching layer that needs to be purged separately.

How do I test Azure Traffic Manager failover without taking down a real endpoint?

The safest way is to manually disable an endpoint directly in the portal. Go to your Traffic Manager profile → Endpoints → click the endpoint → set the Status toggle to Disabled. Traffic Manager will immediately stop routing new DNS queries to that endpoint (existing clients still have cached resolutions until TTL expires). Watch the Queries by Endpoint Returned metric to see traffic shift. Re-enable when done. This is non-destructive: the endpoint itself keeps running, it just gets excluded from DNS responses. I recommend running this test during low-traffic windows on your first attempt so you understand your actual failover timing before you need it in an emergency.

Can Azure Traffic Manager work with on-premises endpoints, not just Azure services?

Yes, this is what External Endpoints are for. You can point Traffic Manager at any publicly resolvable IP address or hostname, including on-premises data centers, other cloud providers (AWS, GCP), or co-location facilities. The health probe will be sent from Azure's probe agents over the public internet to whatever IP or domain you specify. The main constraint: external endpoints must be publicly reachable. If your on-premises endpoint is behind a corporate firewall with no public ingress, Traffic Manager's probes will fail and the endpoint will be marked Degraded. You'll also need to allow the AzureTrafficManager service IP ranges through your on-premises firewall for probe traffic.

Why is Azure Traffic Manager Performance routing sending everyone to the same region instead of distributing by latency?

Performance routing uses Traffic Manager's internal Internet Latency Table, a continuously updated map of network latency from various ISPs and regions to Azure datacenters. If all your endpoints are in the same Azure region, Performance routing has no differentiation to act on and will treat them identically. Also, if you have only two endpoints and one is significantly "closer" on the latency table to the majority of your user base, nearly all traffic will concentrate there; this is correct behavior. To validate where your queries actually come from, enable Traffic View on the profile: it shows the regions of the DNS resolvers querying your profile, the query volume from each, and the latency those users see to each Azure region. Remember that Traffic Manager generally sees the IP of the recursive DNS resolver rather than the end user, so users behind a distant resolver are routed by the resolver's location.

My Traffic Manager profile was working fine and then stopped after I deployed a new TLS certificate, why?

Probably not because of certificate validation. Traffic Manager's HTTPS probes do not validate server certificates, so a name mismatch, an expired certificate, or a self-signed certificate will not by itself mark an endpoint Degraded. Look at what else changed with the deployment. Common culprits are a binding change that leaves the probed hostname without a working certificate (so the TLS handshake itself fails), a newly enabled client-certificate requirement (probes cannot present client certificates), a tightened TLS version or cipher policy, or an app restart that makes the probe path return errors. Check the handshake with openssl s_client -connect your-app.azurewebsites.net:443 -servername your-app.azurewebsites.net 2>/dev/null | openssl x509 -noout -subject -dates and verify the subject and expiry for the benefit of your real clients, who do validate the certificate. Then curl the probe path directly and compare the status code with what the profile expects.

Sai Kiran Pandrala
Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.