How to Troubleshoot Azure ExpressRoute Issues

Microsoft Fix | Advanced | 18 min read | Official Docs Grounded | Updated April 20, 2026

Why Azure ExpressRoute Troubleshooting Is So Hard

I've seen this play out on dozens of enterprise engagements: a critical Azure ExpressRoute connection goes dark at 2 AM, and the on-call engineer is staring at a portal that says "Circuit Status: Enabled" while traffic is sitting completely still. Nothing is working. The Azure portal gives you green checkmarks in all the wrong places. The provider says their end is fine. And yet nothing moves.

Azure ExpressRoute troubleshooting is genuinely difficult because the failure can live in any of three distinct domains, and sometimes in more than one at once: your on-premises environment, the connectivity provider's edge infrastructure, or the Microsoft Enterprise Edge (MSEE) inside Azure itself. Unlike a VPN that you fully own, ExpressRoute is a shared-responsibility model across all three. When your ExpressRoute connection is not working, you need to interrogate each layer methodically before you'll find the real culprit.

Here's what I see causing most ExpressRoute outages and degradation in the field:

  • BGP session drops: The Border Gateway Protocol sessions between your edge router and the MSEE collapse because of authentication mismatches (MD5 keys), keepalive timer differences, or route policy changes that suddenly withdraw prefixes.
  • Circuit provisioning stuck in "Not Provisioned": The service key was exchanged but the provider hasn't completed layer-2 provisioning on their side, or there's a miscommunication about VLAN tags.
  • Asymmetric routing: Traffic goes out via ExpressRoute but comes back via the internet, causing stateful firewall drops on-premises that look like mysterious one-way connectivity failures.
  • Route advertisement issues: Either your on-premises router is not advertising the right subnets, or Azure isn't seeing the routes because of a misconfigured route filter on Microsoft Peering.
  • ExpressRoute gateway saturation: The virtual network gateway is hitting its packets-per-second or bandwidth ceiling, causing intermittent drops that look like random network instability.
  • Private peering misconfiguration: Wrong /30 subnets assigned to primary and secondary links, IP addressing overlap, or ASN conflicts with existing BGP neighbors.

The reason Microsoft's error messages don't help much here is that the portal reports control plane status (what Azure knows about its own configuration), while ExpressRoute data plane failures are often invisible to the portal until you actively probe them. A circuit can show "Provisioned" and "Enabled" while BGP is completely down and zero packets are flowing. That gap between what the UI tells you and what's actually happening is where most engineers lose an hour of their lives.

If you're dealing with an ExpressRoute circuit status showing unknown, BGP sessions that won't come up, or intermittent packet loss on a private peering, you're in the right place.

The Quick Fix: Try This First

Before you spend three hours digging through BGP route tables, run this single Azure PowerShell command. It gives you the ground truth on your circuit's actual operational state, not just the provisioning state the portal surfaces.

Open Azure Cloud Shell or a local terminal with the Az module installed and run:

Get-AzExpressRouteCircuit -Name "YourCircuitName" -ResourceGroupName "YourRG" | Select-Object -Property ServiceProviderProvisioningState, CircuitProvisioningState, Sku

You're looking for ServiceProviderProvisioningState to read Provisioned and CircuitProvisioningState to read Enabled. If ServiceProviderProvisioningState says NotProvisioned while CircuitProvisioningState says Enabled, your connectivity provider hasn't finished their side. Call them with your service key; that's the only fix available to you at that point.
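The decision rule above is small enough to encode, which helps when you are checking several circuits at once. The sketch below is only an illustration: Get-CircuitStateDiagnosis is a hypothetical helper (not an Az cmdlet), and it only knows about the states discussed in this section.

```powershell
# Hypothetical helper (not part of the Az module): maps the two provisioning
# states to a suggested next action, following the rule described above.
function Get-CircuitStateDiagnosis {
    param(
        [Parameter(Mandatory)] [string] $CircuitProvisioningState,
        [Parameter(Mandatory)] [string] $ServiceProviderProvisioningState
    )
    if ($CircuitProvisioningState -ne 'Enabled') {
        return 'Azure side is not Enabled: check the circuit resource itself.'
    }
    switch ($ServiceProviderProvisioningState) {
        'Provisioned'    { 'Both sides ready: move on to the peering and BGP checks.' }
        'NotProvisioned' { 'Provider has not finished: call them with your service key.' }
        default          { "Provider state is '$($_)': confirm the status with your provider." }
    }
}

# Offline example with literal values:
Get-CircuitStateDiagnosis -CircuitProvisioningState 'Enabled' -ServiceProviderProvisioningState 'NotProvisioned'

# Live usage (placeholder names as in the command above):
# $ckt = Get-AzExpressRouteCircuit -Name "YourCircuitName" -ResourceGroupName "YourRG"
# Get-CircuitStateDiagnosis -CircuitProvisioningState $ckt.CircuitProvisioningState `
#     -ServiceProviderProvisioningState $ckt.ServiceProviderProvisioningState
```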

Now check the peering state directly:

Get-AzExpressRouteCircuitPeeringConfig -Name "AzurePrivatePeering" -ExpressRouteCircuit (Get-AzExpressRouteCircuit -Name "YourCircuitName" -ResourceGroupName "YourRG")

Look at the State field. If it's Disabled, someone turned off the peering, possibly intentionally during maintenance, possibly by accident. You can re-enable it right there in the portal under ExpressRoute Circuit > Peerings > Azure Private > Edit > Status: Enabled > Save.

If both show healthy but traffic still isn't flowing, the problem is almost certainly in BGP or in the VNet gateway connection, and you'll need the step-by-step section below.

Pro Tip
Always check the circuit's redundancy status before assuming it's a full outage. ExpressRoute circuits have a primary and secondary link. Run Get-AzExpressRouteCircuitARPTable on both the primary and secondary paths. I've seen dozens of cases where one link silently failed months earlier, the team never noticed because failover worked, and then the second link failed during a maintenance window causing a "sudden" outage that was actually half-broken for 90 days.

Step 1: Verify Circuit and Peering Status in Azure Portal

Start here before touching anything else. Log into the Azure Portal, navigate to All Resources, and filter by type ExpressRoute Circuit. Open your circuit and look at the Overview blade immediately.

You need to confirm four specific fields:

  • Circuit Status: should read "Enabled"
  • Provider Status: should read "Provisioned" (not "Not Provisioned" or "Deprovisioning")
  • Bandwidth: confirm it matches what you contracted
  • SKU / Tier: Local, Standard, or Premium; this affects which regions and peerings are available

Now click Peerings in the left menu. For most enterprise deployments, you'll see Azure Private Peering listed. Click on it. Verify:

  • Primary peer address: the /30 subnet for your primary link (by convention your router takes the first usable address and the MSEE the second)
  • Secondary peer address: the /30 for the secondary link
  • VLAN ID: must exactly match what your provider has configured on their edge equipment
  • Peer ASN: your on-premises BGP ASN, must match your router config exactly
  • Shared Key (MD5): if configured, this must be byte-for-byte identical on both ends

A VLAN ID mismatch is far more common than people admit. Your provider configures VLAN 100 on their end, you entered 1000 in the portal three months ago during setup, and nobody caught it because the circuit appeared to work before a recent router replacement reset the dot1q tags. Check this carefully.

If everything looks correct here (the portal says the configuration is right but you still have no traffic), move to Step 2 to confirm Layer-2 connectivity before interrogating BGP.
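If you keep the expected peering values in your documentation, the field-by-field comparison above can be scripted. This is a minimal sketch under assumptions: Get-PeeringMismatch is a hypothetical helper, the sample object stands in for the output of Get-AzExpressRouteCircuitPeeringConfig, and the expected values are purely illustrative.

```powershell
# Hypothetical helper: reports every setting whose actual value differs from
# the value you expect (compared as strings to avoid numeric type mismatches).
function Get-PeeringMismatch {
    param(
        [Parameter(Mandatory)] $Peering,
        [Parameter(Mandatory)] [hashtable] $Expected
    )
    foreach ($key in $Expected.Keys) {
        $actual = $Peering.$key
        if ("$actual" -ne "$($Expected[$key])") {
            [pscustomobject]@{ Setting = $key; Expected = $Expected[$key]; Actual = $actual }
        }
    }
}

# Sample object standing in for Get-AzExpressRouteCircuitPeeringConfig output.
$peering = [pscustomobject]@{
    VlanId                     = 1000
    PeerASN                    = 65010
    PrimaryPeerAddressPrefix   = '10.0.0.0/30'
    SecondaryPeerAddressPrefix = '10.0.0.4/30'
}

# Illustrative values taken from your own provider order and router config.
$expected = @{
    VlanId                     = 100
    PeerASN                    = 65010
    PrimaryPeerAddressPrefix   = '10.0.0.0/30'
    SecondaryPeerAddressPrefix = '10.0.0.4/30'
}

Get-PeeringMismatch -Peering $peering -Expected $expected
```

With the sample data, only the VlanId row is reported, which mirrors the 100-versus-1000 typo described above. Against a live circuit, pass the object returned by the peering command from the Quick Fix section as -Peering.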

Step 2: Retrieve ARP Tables to Confirm Layer-2 Connectivity

The ARP table check is the single most powerful first diagnostic you have when an ExpressRoute BGP session is down. It tells you definitively whether Layer 2 is working between your edge router and the MSEE. If the ARP table is empty, you have a physical or Layer-2 problem. If it's populated, the issue is at Layer 3 or in BGP.

Run this in Azure PowerShell, substituting your actual circuit and resource group names:

# Set these once and reuse them for both paths
$rg      = "YourRG"
$circuit = "YourCircuitName"

# Primary path ARP table for Private Peering
Get-AzExpressRouteCircuitARPTable -ResourceGroupName $rg `
  -ExpressRouteCircuitName $circuit `
  -PeeringType AzurePrivatePeering `
  -DevicePath Primary

# Secondary path
Get-AzExpressRouteCircuitARPTable -ResourceGroupName $rg `
  -ExpressRouteCircuitName $circuit `
  -PeeringType AzurePrivatePeering `
  -DevicePath Secondary

A healthy response looks like this, with two entries, one for each side of the /30:

Age  InterfaceProperty  IpAddress       MacAddress
---  -----------------  ---------       ----------
0    On-Prem            10.0.0.1        11:22:33:44:55:66   ← Your CE router
0    Microsoft          10.0.0.2        aa:bb:cc:dd:ee:ff   ← MSEE side

If you get back an empty table or a single entry (only the MSEE IP, no CE router IP), your CE router is not sending ARP responses on that VLAN. This usually means one of three things: the wrong VLAN ID is configured on the CE router, the physical interface is down, or the encapsulation type doesn't match (dot1q vs QinQ). Take this output to your provider or your own network team; the fix lives outside Azure at this point.

If both entries are present, Layer 2 is healthy. Proceed to the BGP route table check in Step 3.
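The empty / one entry / two entries rule from this step can be wrapped in a small function so both paths get the same treatment. This is a sketch, not an official tool: Get-ArpTableDiagnosis is a hypothetical helper, and it relies only on each row having an IpAddress property, as in the sample output above.

```powershell
# Hypothetical helper: interprets an ARP table by how many addresses resolved.
function Get-ArpTableDiagnosis {
    param([object[]] $ArpEntries)
    $resolved = @($ArpEntries | Where-Object { $_ -and $_.IpAddress }).Count
    switch ($resolved) {
        0       { 'Empty table: physical or Layer-2 problem (VLAN ID, interface state, encapsulation).' }
        1       { 'One entry only: one side is not answering ARP on this VLAN.' }
        default { 'Both sides resolved: Layer 2 is healthy, continue to the BGP checks.' }
    }
}

# Sample rows standing in for Get-AzExpressRouteCircuitARPTable output.
$sample = @(
    [pscustomobject]@{ IpAddress = '10.0.0.1'; MacAddress = '11:22:33:44:55:66' }
    [pscustomobject]@{ IpAddress = '10.0.0.2'; MacAddress = 'aa:bb:cc:dd:ee:ff' }
)
Get-ArpTableDiagnosis -ArpEntries $sample

# Live usage, one call per path (placeholder names as elsewhere in this guide):
# foreach ($path in 'Primary', 'Secondary') {
#     $arp = Get-AzExpressRouteCircuitARPTable -ResourceGroupName "YourRG" `
#         -ExpressRouteCircuitName "YourCircuitName" `
#         -PeeringType AzurePrivatePeering -DevicePath $path
#     '{0}: {1}' -f $path, (Get-ArpTableDiagnosis -ArpEntries $arp)
# }
```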

Step 3: Inspect BGP Route Tables for Missing or Incorrect Prefixes

With Layer 2 confirmed healthy, the next culprit when ExpressRoute private peering won't establish is usually a BGP routing problem. Azure gives you direct visibility into what routes the MSEE is learning from your router, and this is where you'll find the smoking gun most of the time.

Pull the route table from the MSEE's perspective:

Get-AzExpressRouteCircuitRouteTable -ResourceGroupName "YourRG" `
  -ExpressRouteCircuitName "YourCircuitName" `
  -PeeringType AzurePrivatePeering `
  -DevicePath Primary

What you're looking for: every on-premises subnet that your Azure VMs need to reach should appear in this output with a valid NextHop pointing to your CE router's IP on the /30 link. If a subnet is missing, your on-premises BGP router is either not advertising it or filtering it with a route policy before it hits the MSEE.

Common issues I find here:

  • Missing prefixes: A new subnet was added on-premises but the BGP network statement or redistribution rule wasn't updated on the CE router.
  • Route deprioritized: A prefix appears in the table with a local preference of 0 or a community value that causes it to be deprioritized. Check your BGP policy on the CE router.
  • Summarization mismatch: You're advertising a /16 summary, but Azure VMs are trying to reach a /24 inside it that your on-premises network has no actual route for, so the traffic is black-holed.
  • AS path prepending gone wrong: An engineer added AS-path prepending to influence traffic but applied it to the wrong neighbor, making the primary path look worse than the secondary.

Also run the same command with -DevicePath Secondary and compare both outputs. They should be symmetric. If the secondary shows routes the primary doesn't, you have a CE router config inconsistency between your primary and secondary interfaces.
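Comparing the two outputs by eye gets tedious once you advertise more than a handful of prefixes. The sketch below shows one way to diff them. Compare-RouteTablePath is a hypothetical helper; it assumes each row exposes the prefix in a Network property, which is how the route table cmdlet labels that column.

```powershell
# Hypothetical helper: lists prefixes that appear on only one of the two paths.
function Compare-RouteTablePath {
    param([object[]] $PrimaryRoutes, [object[]] $SecondaryRoutes)
    $primary   = @($PrimaryRoutes   | ForEach-Object { $_.Network } | Sort-Object -Unique)
    $secondary = @($SecondaryRoutes | ForEach-Object { $_.Network } | Sort-Object -Unique)
    [pscustomobject]@{
        OnlyOnPrimary   = @($primary   | Where-Object { $_ -notin $secondary })
        OnlyOnSecondary = @($secondary | Where-Object { $_ -notin $primary })
    }
}

# Sample rows standing in for Get-AzExpressRouteCircuitRouteTable output.
$primaryRoutes = @(
    [pscustomobject]@{ Network = '10.1.0.0/16'; NextHop = '10.0.0.1' }
    [pscustomobject]@{ Network = '10.2.0.0/24'; NextHop = '10.0.0.1' }
)
$secondaryRoutes = @(
    [pscustomobject]@{ Network = '10.1.0.0/16'; NextHop = '10.0.0.5' }
    [pscustomobject]@{ Network = '10.3.0.0/24'; NextHop = '10.0.0.5' }
)

$diff = Compare-RouteTablePath -PrimaryRoutes $primaryRoutes -SecondaryRoutes $secondaryRoutes
$diff.OnlyOnPrimary     # 10.2.0.0/24
$diff.OnlyOnSecondary   # 10.3.0.0/24
```

For a live check, capture the output of the route table command once with -DevicePath Primary and once with -DevicePath Secondary, then pass both to the function. Two empty lists mean the paths carry the same prefixes.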

Step 4: Diagnose the Virtual Network Gateway Connection

Here's a scenario I see constantly with Azure ExpressRoute gateway connection issues: the circuit is healthy, BGP is up, routes are being exchanged, but VMs in the VNet still can't reach on-premises. The gap is almost always in the virtual network gateway or the connection object that links the gateway to the circuit.

In the portal, navigate to Virtual Network Gateways > [Your Gateway] > Connections. Check that the ExpressRoute connection is listed and shows Status: Connected. A status of "Not Connected" or "Unknown" here means the association between the gateway and the circuit is broken.

To check this via PowerShell and get richer diagnostics:

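# Get the gateway details
$gw = Get-AzVirtualNetworkGateway -Name "YourGatewayName" -ResourceGroupName "YourRG"

# List the connections on this gateway, then re-fetch each one by name:
# the list call does not reliably populate ConnectionStatus or the byte counters
Get-AzVirtualNetworkGatewayConnection -ResourceGroupName "YourRG" |
  Where-Object { $_.VirtualNetworkGateway1.Id -eq $gw.Id } |
  ForEach-Object { Get-AzVirtualNetworkGatewayConnection -Name $_.Name -ResourceGroupName "YourRG" } |
  Select-Object Name, ConnectionStatus, EgressBytesTransferred, IngressBytesTransferred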

Look at the EgressBytesTransferred and IngressBytesTransferred values. If they're stuck at zero while your circuit's BGP is healthy, there's a routing policy problem inside the VNet, possibly a User-Defined Route (UDR) on the GatewaySubnet that is overriding ExpressRoute-learned routes. Avoid UDRs on the GatewaySubnet unless you have a deliberate, tested design for them (a 0.0.0.0/0 route there is not supported). This is one of the most common self-inflicted ExpressRoute wounds I've seen in enterprise environments.
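A quick way to rule the GatewaySubnet in or out is to look at what is attached to it. The following is a sketch under assumptions: Get-GatewaySubnetFinding is a hypothetical helper, and the sample object mimics the shape returned by Get-AzVirtualNetwork (a Subnets collection whose items carry Name, RouteTable, and NetworkSecurityGroup).

```powershell
# Hypothetical helper: reports route tables or NSGs attached to the GatewaySubnet.
function Get-GatewaySubnetFinding {
    param([Parameter(Mandatory)] $VirtualNetwork)
    $gw = $VirtualNetwork.Subnets | Where-Object { $_.Name -eq 'GatewaySubnet' }
    if (-not $gw) { return 'No GatewaySubnet found in this VNet.' }

    $findings = @()
    if ($gw.RouteTable)           { $findings += "Route table attached: $($gw.RouteTable.Id)" }
    if ($gw.NetworkSecurityGroup) { $findings += "NSG attached: $($gw.NetworkSecurityGroup.Id)" }
    if ($findings.Count -eq 0)    { return 'GatewaySubnet has no route table or NSG attached.' }
    $findings
}

# Sample object standing in for Get-AzVirtualNetwork output (IDs are made up).
$vnet = [pscustomobject]@{
    Subnets = @(
        [pscustomobject]@{ Name = 'workload'; RouteTable = $null; NetworkSecurityGroup = $null }
        [pscustomobject]@{
            Name                 = 'GatewaySubnet'
            RouteTable           = [pscustomobject]@{ Id = '/example/routeTables/rt-gateway' }
            NetworkSecurityGroup = $null
        }
    )
}
Get-GatewaySubnetFinding -VirtualNetwork $vnet

# Live usage (placeholder names):
# Get-GatewaySubnetFinding -VirtualNetwork (Get-AzVirtualNetwork -Name "YourVNetName" -ResourceGroupName "YourRG")
```

If a route table shows up, inspect its routes before anything else, since any of them can override what the gateway learned over BGP.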

Also confirm the gateway SKU. If you're on an UltraPerformance or ErGw3AZ SKU and you enabled FastPath, verify that it is configured correctly. FastPath bypasses the gateway for data traffic, and a misconfiguration there causes the connection to appear healthy while data doesn't flow through.

Step 5: Use Network Watcher and Connection Monitor for End-to-End Validation

Once you've confirmed the circuit and gateway are healthy, you need to validate actual end-to-end connectivity. Azure Network Watcher is your friend here, specifically the Connection Troubleshoot and Connection Monitor features, which can tell you exactly where in the path traffic is being dropped.

In the portal: Network Watcher > Connection Troubleshoot. Select your source VM (in Azure), set destination to an on-premises IP address, and run the test. This performs an actual probe and tells you which hop is failing and what the latency looks like at each step.

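For persistent monitoring, set up Connection Monitor. Use the current version, not the deprecated classic one. Note that the command below uses the simple source/destination parameter set; if your Az.Network version rejects it, run Get-Help New-AzNetworkWatcherConnectionMonitor to see the endpoint and test-group parameters it expects instead:

# Monitor TCP 443 from an Azure VM to an on-premises endpoint, probing every 30 seconds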
New-AzNetworkWatcherConnectionMonitor `
  -NetworkWatcherName "NetworkWatcher_eastus" `
  -ResourceGroupName "NetworkWatcherRG" `
  -Name "ExpressRoute-Health-Monitor" `
  -SourceResourceId "/subscriptions/[subid]/resourceGroups/[rg]/providers/Microsoft.Compute/virtualMachines/[vmname]" `
  -DestinationAddress "10.1.0.50" `
  -DestinationPort 443 `
  -MonitoringIntervalInSeconds 30

This gives you a continuous health signal, not just a one-time check. When Azure ExpressRoute latency problems are intermittent, this monitor will catch the spikes that you'd miss by running manual tests.

Also check Azure Monitor metrics for your ExpressRoute circuit directly. Navigate to ExpressRoute Circuit > Metrics and add these charts:

  • BitsInPerSecond / BitsOutPerSecond: confirms traffic is actually flowing
  • ArpAvailability: drops below 100% indicate Layer-2 instability
  • BgpAvailability: any value below 100% means BGP sessions are flapping
  • GlobalReachBitsInPerSecond: if you use Global Reach, this confirms that path

Set alert rules on BgpAvailability < 95% and ArpAvailability < 95% so you know about degradation before your users do.
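Before committing to an alert threshold, it helps to see how often the metric would have crossed it. The sketch below filters a series of samples for dips. Get-AvailabilityDip is a hypothetical helper, and the live usage in the comments assumes your Az.Monitor version returns the samples on a Data property with TimeStamp and Average fields.

```powershell
# Hypothetical helper: returns the samples whose average fell below a threshold.
# Samples with no value are skipped, because $null compares as "less than" any number.
function Get-AvailabilityDip {
    param(
        [object[]] $DataPoints,
        [Parameter(Mandatory)] [double] $ThresholdPercent
    )
    $DataPoints |
        Where-Object { $null -ne $_.Average -and $_.Average -lt $ThresholdPercent } |
        Select-Object TimeStamp, Average
}

# Sample series standing in for BgpAvailability metric data.
$points = @(
    [pscustomobject]@{ TimeStamp = [datetime]'2026-04-20T02:00:00'; Average = 100 }
    [pscustomobject]@{ TimeStamp = [datetime]'2026-04-20T02:05:00'; Average = 62.5 }
    [pscustomobject]@{ TimeStamp = [datetime]'2026-04-20T02:10:00'; Average = $null }
)
Get-AvailabilityDip -DataPoints $points -ThresholdPercent 95

# Live usage (placeholder names; pass whatever threshold your alert rule uses):
# $ckt = Get-AzExpressRouteCircuit -Name "YourCircuitName" -ResourceGroupName "YourRG"
# $m = Get-AzMetric -ResourceId $ckt.Id -MetricName 'BgpAvailability' -TimeGrain 00:05:00 `
#     -StartTime (Get-Date).AddHours(-24) -EndTime (Get-Date) -AggregationType Average
# Get-AvailabilityDip -DataPoints $m.Data -ThresholdPercent 95
```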

Advanced Troubleshooting

When the standard checks don't surface the problem, you're dealing with one of the nastier categories of ExpressRoute failures. Here's how I approach the scenarios that stump most engineers.

Route Filter Issues on Microsoft Peering

If you're using ExpressRoute Microsoft Peering to reach Microsoft 365 or Azure PaaS services over the private circuit, route filters are mandatory. A missing or misconfigured route filter is the number-one cause of "Microsoft Peering is configured but I can't reach Exchange Online" tickets I see. In the portal, navigate to Route Filters > [Your Filter] > Rules. Confirm the BGP community values for the services you need are listed. For Microsoft 365, you typically need community 12076:5010 (Exchange Online), 12076:5020 (SharePoint), and 12076:5030 (Skype/Teams). If the filter exists but a community value you need isn't in the rules list, add it and wait 5 to 10 minutes for propagation.
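The same check can be done without clicking through the portal. This is a minimal sketch: Get-MissingCommunity is a hypothetical helper, the required list contains only the three communities named above, and the configured list is sample data standing in for what your route filter actually contains.

```powershell
# Hypothetical helper: returns the required communities that are not configured.
function Get-MissingCommunity {
    param(
        [string[]] $Configured,
        [Parameter(Mandatory)] [System.Collections.IDictionary] $Required
    )
    foreach ($community in $Required.Keys) {
        if ($Configured -notcontains $community) {
            [pscustomobject]@{ Community = $community; Service = $Required[$community] }
        }
    }
}

# The three Microsoft 365 communities listed in this section.
$required = [ordered]@{
    '12076:5010' = 'Exchange Online'
    '12076:5020' = 'SharePoint'
    '12076:5030' = 'Skype/Teams'
}

# Sample: a filter that only allows Exchange Online.
Get-MissingCommunity -Configured @('12076:5010') -Required $required

# Live usage (placeholder names):
# $configured = (Get-AzRouteFilter -Name "YourFilter" -ResourceGroupName "YourRG").Rules.Communities
# Get-MissingCommunity -Configured $configured -Required $required
```

With the sample input, the SharePoint and Skype/Teams communities are reported as missing.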

Diagnosing Asymmetric Routing with Azure Monitor Logs

Asymmetric routing causes stateful firewall drops that look completely random. Enable NSG flow logs on the NSGs protecting the subnets involved, with Traffic Analytics sending the data to a Log Analytics workspace (the AzureNetworkAnalytics_CL table used below is populated by Traffic Analytics). Then run this KQL query to list denied flows from your on-premises range:

AzureNetworkAnalytics_CL
| where SubType_s == "FlowLog"
| where FlowStatus_s == "D"  // Denied flows
| where SrcIP_s startswith "10."  // Your on-prem range
| project TimeGenerated, SrcIP_s, DestIP_s, DestPort_d, FlowDirection_s
| order by TimeGenerated desc
| take 100

If you see on-premises IPs getting denied at Azure NSGs, but the connection was initiated from Azure, that's asymmetric routing. Your on-premises traffic is returning via a different path (internet or a secondary circuit) and hitting an Azure NSG that has no matching allow rule for the return traffic.

Domain-Joined and Hybrid Identity Scenarios

For organizations using Azure AD DS or hybrid identity over ExpressRoute, a BGP flap of even 30 seconds can cause Kerberos ticket failures that persist for hours afterward, far longer than the actual outage. If users are reporting authentication errors that started around the time of a suspected circuit event, check the Azure AD Connect sync logs and look for Event ID 6311 or 6900 in the Application event log on your AADC server. These indicate that sync failed during the connectivity window, and a manual delta sync (Start-ADSyncSyncCycle -PolicyType Delta) is usually needed to clear the backlog.

ExpressRoute Global Reach Connectivity Failures

Global Reach lets two on-premises sites communicate through the Microsoft backbone. When Global Reach connectivity fails, the most common issue is that the /29 IP address space you provided for the interconnect overlaps with an existing route in either site's BGP table. Run Get-AzExpressRouteCircuitConnectionConfig on both circuits involved and verify the AddressPrefix field points to a clean /29 that doesn't collide with any advertised prefixes on either side.
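Checking a /29 against a list of advertised prefixes is simple range arithmetic. The sketch below is IPv4-only and uses hypothetical helpers (ConvertTo-PrefixRange and Test-PrefixOverlap); the prefixes are sample values, not recommendations.

```powershell
# Hypothetical helper: converts an IPv4 CIDR string to its first and last address as integers.
function ConvertTo-PrefixRange {
    param([Parameter(Mandatory)] [string] $Cidr)
    $address, $length = $Cidr -split '/'
    $value = [long]0
    foreach ($octet in [System.Net.IPAddress]::Parse($address).GetAddressBytes()) {
        $value = $value * 256 + $octet          # build the 32-bit value octet by octet
    }
    $size  = [long][math]::Pow(2, 32 - [int]$length)
    $start = $value - ($value % $size)          # align to the network boundary
    [pscustomobject]@{ Cidr = $Cidr; Start = $start; End = $start + $size - 1 }
}

# Hypothetical helper: two prefixes overlap when their address ranges intersect.
function Test-PrefixOverlap {
    param(
        [Parameter(Mandatory)] [string] $First,
        [Parameter(Mandatory)] [string] $Second
    )
    $a = ConvertTo-PrefixRange -Cidr $First
    $b = ConvertTo-PrefixRange -Cidr $Second
    ($a.Start -le $b.End) -and ($b.Start -le $a.End)
}

# Sample values: the Global Reach /29 and the prefixes both sites advertise.
$globalReachPrefix = '192.168.100.0/29'
$advertised        = '10.0.0.0/8', '192.168.0.0/16'

# Lists every advertised prefix that collides with the /29.
$advertised | Where-Object { Test-PrefixOverlap -First $globalReachPrefix -Second $_ }
```

Here the /29 sits inside 192.168.0.0/16, so that prefix is returned. In practice you would feed in the AddressPrefix value from the connection config and the prefixes from both circuits' route tables.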

When to Call Microsoft Support

Escalate to Microsoft Support when:

  • ArpAvailability or BgpAvailability metrics are degraded and your provider confirms their infrastructure is healthy. This points to the MSEE itself, which only Microsoft can fix.
  • You're seeing packet loss inside the Microsoft backbone, confirmed by traceroutes showing drops at hops in AS 12076.
  • A circuit shows "Provisioned" in the portal but has never actually passed traffic, and the service key has been active for more than 10 business days.

Open a Severity A ticket and reference your circuit's Resource ID and the specific metric graph showing the degradation window. Screenshots from Azure Monitor save significant back-and-forth with the support team.

Prevention & Best Practices

I know this is the section people skip when they're in crisis mode, but read it once you're back up, because the best ExpressRoute troubleshooting session is the one you never have to run.

The single biggest thing you can do is set up proactive alerting before something breaks. Most of the outages I get called into were preceded by warning signals in Azure Monitor that nobody had configured alerts for. By the time the NOC noticed, BGP had been flapping intermittently for 48 hours and the on-call engineer walked into a full outage instead of a maintenance window.

Maintain configuration documentation in version control. Your CE router BGP config, the VLAN IDs, and the /30 subnets should all be in a git repository, with the MD5 keys kept in a secrets manager and only referenced from the repo, never committed as plain text. When a router is replaced and the config is "restored from memory," that's where VLAN ID and peer ASN mismatches creep in. I've seen this exact scenario cause a 4-hour outage that could have been a 10-minute fix with proper documentation.

Test your secondary link regularly. Deliberately fail over to the secondary path once a quarter in a maintenance window. This confirms the secondary is actually working and prevents the "we had redundancy but never knew one side was broken" scenario. You can do this by shutting down the BGP session on your primary CE interface and confirming traffic shifts as expected. Disabling the peering in the portal is not a failover test, because it takes down both links.

Consider a higher-resiliency design. When you create or redesign ExpressRoute connectivity, check whether the Maximum Resiliency option is available to you. It pairs circuits in two different peering locations rather than relying on two links into a single location, which is a meaningfully stronger guarantee against site-level failures.

Quick Wins
  • Set Azure Monitor alerts on BgpAvailability < 95% and ArpAvailability < 95%, so you get a heads-up before a flap becomes a full outage
  • Enable ExpressRoute circuit diagnostics and stream logs to a Log Analytics workspace so you have historical data when a post-incident review comes up
  • Tag all ExpressRoute-related resources (circuits, gateways, connections) with a consistent criticality: high tag so they surface immediately in cost and change-management reviews
  • Document your failover runbook: who calls the provider, what PowerShell commands to run first, who has portal access. Test this annually so the 2 AM call isn't the first time someone runs through it

Frequently Asked Questions

My ExpressRoute circuit shows "Enabled" in Azure but the provider says their side is "Not Provisioned". Who's right?

Both are correct, and this is actually expected in a specific ordering scenario. When you create an ExpressRoute circuit in Azure, the portal immediately sets CircuitProvisioningState to "Enabled"; that just means Azure has allocated the resource. ServiceProviderProvisioningState won't flip to "Provisioned" until your connectivity provider completes their physical and logical provisioning on their edge equipment, which can take anywhere from a few hours to a few business days depending on the provider. Give your provider the service key from the circuit's Overview blade and follow up with them. There's nothing you can do in Azure to accelerate their side.

How long does it take for BGP route changes to propagate through ExpressRoute?

In practice, BGP route advertisements over ExpressRoute typically propagate within 30 to 90 seconds for straightforward prefix changes. However, if you're changing route filters on Microsoft Peering or modifying community values that affect traffic engineering inside the Microsoft backbone, plan for up to 10 minutes before changes are fully reflected. During that window, you may see intermittent connectivity to affected services. For larger changes, such as introducing a new summary route, I'd allow a full 5 minutes before declaring something broken, and I'd check the BgpAvailability metric in the portal to confirm the session stayed stable throughout.

Can I use ExpressRoute and a VPN gateway at the same time for the same VNet?

Yes, and this is actually a supported high-availability design called ExpressRoute/VPN coexistence. Both an ExpressRoute gateway and a VPN gateway can exist in the same VNet simultaneously. The key things to get right: you need to use the VpnGw SKU (not Basic) for the VPN gateway, and you must configure the routing carefully. By default, ExpressRoute routes take priority over VPN routes, which is what you want for the primary/backup pattern. Use Local Network Gateway weights or BGP route preferences to ensure failover happens automatically if ExpressRoute goes down. The GatewaySubnet must be at least /27 to accommodate both gateways.

Why is my ExpressRoute bandwidth utilization showing 100% but I only bought a 1 Gbps circuit?

You may be hitting what Microsoft calls "burst" territory. Most ExpressRoute circuits support bursting above the contracted bandwidth for short periods, but sustained saturation at or above your circuit's rated capacity will cause queuing and latency spikes that look like random packet loss. Check the BitsInPerSecond and BitsOutPerSecond metrics in Azure Monitor and set the time grain to 1-minute intervals to see if you're consistently exceeding your contracted bandwidth. If you are, the options are: upgrade the circuit bandwidth in the portal (this can often be done without a service interruption for most providers), add a second circuit and implement ECMP load-balancing across both, or work with your network team to move certain traffic classes back to direct internet connectivity using split tunneling.
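Turning a BitsInPerSecond or BitsOutPerSecond reading into a utilization percentage is a one-line calculation. The helper below is a hypothetical sketch; the bandwidth figure would normally come from the circuit's ServiceProviderProperties.BandwidthInMbps value, and the numbers shown are sample inputs.

```powershell
# Hypothetical helper: utilization of the contracted bandwidth, as a percentage.
function Get-CircuitUtilizationPercent {
    param(
        [Parameter(Mandatory)] [double] $BitsPerSecond,
        [Parameter(Mandatory)] [double] $BandwidthInMbps
    )
    [math]::Round(100 * $BitsPerSecond / ($BandwidthInMbps * 1e6), 1)
}

# Sample: 950 Mbps of traffic on a 1 Gbps (1000 Mbps) circuit.
Get-CircuitUtilizationPercent -BitsPerSecond 950e6 -BandwidthInMbps 1000   # 95
```

Running this over the 1-minute samples makes it easier to tell a short burst from sustained saturation.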

What's the difference between Azure Private Peering and Microsoft Peering, and which one should I be troubleshooting?

Private Peering connects your on-premises network to your Azure Virtual Networks: your VMs, your internal load balancers, your Azure Kubernetes clusters. If VMs in a VNet can't reach on-premises resources (or vice versa), Private Peering is what you troubleshoot. Microsoft Peering is specifically for accessing Microsoft cloud services over the private circuit: think Microsoft 365 (Exchange Online, SharePoint, Teams), Azure Storage, Azure SQL, and other PaaS endpoints via their public IP addresses. If your users are trying to use Outlook or Teams through the ExpressRoute circuit instead of the internet and it's not working, that's a Microsoft Peering problem, and route filters are almost always the culprit.

After fixing the ExpressRoute issue, VMs still can't reach on-premises. What am I missing?

Nine times out of ten, this comes down to one of three things that persist even after the circuit is healthy again. First, check for User-Defined Routes (UDRs) on the subnets. If a UDR has a 0.0.0.0/0 or a specific on-premises prefix pointing to an NVA or internet gateway, it overrides the ExpressRoute-learned routes entirely, regardless of circuit health. Second, check your on-premises firewall or stateful inspection device. If it dropped state during the BGP flap, existing sessions may be stuck in a half-open state; a connection reset from the client usually clears it. Third, check DNS. If your DNS servers are on-premises and were unreachable during the outage, VMs may have cached negative DNS responses that survive the network recovery. Flush the DNS cache on affected VMs with ipconfig /flushdns (Windows) or systemd-resolve --flush-caches (Linux; resolvectl flush-caches on newer systemd releases).
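For the first of those three checks, the effective route table of an affected VM's network interface shows exactly which routes win. The sketch below filters it for active user-defined routes. Get-UserRouteOverride is a hypothetical helper, and the sample rows mimic the Source, State, AddressPrefix, NextHopType, and NextHopIpAddress fields that Get-AzEffectiveRouteTable returns.

```powershell
# Hypothetical helper: lists active user-defined routes from an effective route table.
function Get-UserRouteOverride {
    param([object[]] $EffectiveRoutes)
    $EffectiveRoutes |
        Where-Object { $_.Source -eq 'User' -and $_.State -eq 'Active' } |
        Select-Object @{ Name = 'Prefix';  Expression = { $_.AddressPrefix -join ',' } },
                      NextHopType,
                      @{ Name = 'NextHop'; Expression = { $_.NextHopIpAddress -join ',' } }
}

# Sample rows standing in for Get-AzEffectiveRouteTable output.
$routes = @(
    [pscustomobject]@{ Source = 'VirtualNetworkGateway'; State = 'Active'
                       AddressPrefix = @('10.1.0.0/16'); NextHopType = 'VirtualNetworkGateway'
                       NextHopIpAddress = @('10.0.0.2') }
    [pscustomobject]@{ Source = 'User'; State = 'Active'
                       AddressPrefix = @('0.0.0.0/0'); NextHopType = 'VirtualAppliance'
                       NextHopIpAddress = @('10.10.0.4') }
)
Get-UserRouteOverride -EffectiveRoutes $routes

# Live usage (placeholder names; the VM must be running):
# $effective = Get-AzEffectiveRouteTable -NetworkInterfaceName "YourNicName" -ResourceGroupName "YourRG"
# Get-UserRouteOverride -EffectiveRoutes $effective
```

Any row returned here deserves a look: it is a UDR that is currently active on that interface and may be steering on-premises traffic away from the gateway.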

Sai Kiran Pandrala
Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.