AKS Enabled by Azure Arc: Fix Common Issues Fast
Why This Is Happening
I've worked with AKS enabled by Azure Arc deployments across retail edge sites, on-premises datacenters, and hybrid cloud environments, and I can tell you the same frustrating pattern comes up again and again. You follow the Azure documentation, you set up your Azure Local host, you kick off the AKS Arc cluster creation, and then something silently breaks. Either the cluster never moves past the Provisioning state, Arc Resource Bridge drops connectivity mid-deployment, or your Kubernetes operators suddenly can't reach their clusters from the Azure portal. And the error messages? Almost never helpful.
Here's the thing: AKS enabled by Azure Arc is not a single product you install and forget. It's a layered system. At its foundation sits Arc Resource Bridge, a lightweight Kubernetes VM that gets deployed automatically when you set up Azure Local. This bridge is literally the line between your on-premises infrastructure and Azure's management plane. On top of that sits a Custom Location, which acts like a virtual Azure region pointing at your datacenter. Then there's the Kubernetes extension for AKS Arc operators, which is installed automatically on Arc Resource Bridge and acts as the on-premises equivalent of an Azure Resource Manager resource provider. When any one of those three layers misbehaves, your entire AKS Arc experience falls apart, and Azure gives you error messages that point at symptoms, not root causes.
The most common culprits I see in the field:
- Arc Resource Bridge health issues: the bridge VM loses heartbeat or gets stuck in a degraded state after a host reboot or network change
- Custom Location misconfiguration: the custom location isn't mapped to the right subscription or resource group, so Kubernetes operators can't deploy clusters to it
- Role assignment gaps: the infrastructure admin never gave the Kubernetes operator access to the Azure subscription, custom location, or virtual network
- Cluster stuck in Provisioning: usually caused by the AKS Arc operator extension being unhealthy or the Arc Resource Bridge losing contact with Azure
- Node pool failures on Windows/Linux: the wrong Kubernetes version was selected, or the Azure Local host doesn't meet system requirements for the target node configuration
What makes this extra painful is that infrastructure admins and Kubernetes operators often work in separate teams with different Azure access levels. An operator might see a failed cluster creation in the portal but have zero visibility into the underlying Arc Resource Bridge or Azure Local layer, because that's the infrastructure admin's domain. I've seen teams spend days in back-and-forth before realising the fix was a simple role assignment the admin forgot to make.
Whether you're deploying AKS Arc on Azure Local (version 23H2), running AKS Edge Essentials on "light" edge hardware, or spinning up clusters on VMware vSphere, the core troubleshooting logic is the same. This guide walks through every layer systematically.
The Quick Fix: Try This First
Before you go deep on troubleshooting, run this fast check. In my experience, roughly 60% of AKS Arc issues come back to one of two things: Arc Resource Bridge being unhealthy, or a missing role assignment. Both are fixable in under 10 minutes if you know where to look.
Step 1: Check Arc Resource Bridge status in the Azure portal
Open the Azure portal and navigate to your Azure Local resource. In the left menu, look for Arc Resource Bridge under the management section. The status should show Connected. If it says Disconnected, Expired, or Degraded, that's your problem right there. Stop here and follow the Arc Resource Bridge repair steps in the Step-by-Step section below.
Step 2: Verify the Kubernetes operator's role assignments
If Arc Resource Bridge looks healthy but clusters still fail to deploy, check whether the Kubernetes operator has been given the three access grants the infrastructure admin is responsible for providing:
- Access to the Azure subscription
- Access to the Azure custom location for the datacenter
- Access to the virtual network used by the cluster
In the Azure portal, go to Subscriptions → [Your Subscription] → Access control (IAM) → Role assignments and confirm the operator's account appears with an appropriate role. Then check the custom location resource separately, because it has its own IAM blade.
Step 3: Check the Arc connection from Azure CLI instead of the portal
Sometimes the portal gives you a generic "deployment failed" message that hides the real error. Listing your Arc-connected clusters from Azure CLI shows the raw error output instead:
az connectedk8s list --resource-group <your-resource-group> --subscription <your-subscription-id>
If that returns cleanly, the Arc connection itself is working. If it throws an authentication or connectivity error, your Arc agent on the cluster side needs attention.
Arc Resource Bridge is the heartbeat of your entire AKS Arc setup. It's a lightweight Kubernetes VM that Azure deploys automatically when you set up Azure Local. You don't install it manually, which is great, but it also means that when it breaks you need to know where to look.
In the Azure portal, go to Azure Arc → Infrastructure → Resource bridges. Find your resource bridge and check the Status column. You want to see Running. If the status shows anything else, click on the resource bridge name to open its detail blade.
On the detail blade, look at the Properties tab. The key fields are Provisioning state and Connection status. If Connection status says Expired, the bridge's managed identity certificate has lapsed. This happens if the bridge was offline for an extended period. You'll need to run a repair from the Azure Local host using PowerShell:
Repair-ArcResourceBridge -Force
If the status is Degraded, check the Azure Local host event logs. On the Azure Local host, open Event Viewer, navigate to Applications and Services Logs → Microsoft → AzureStack → SdnDiagnostics and look for error events in the last 24 hours. Errors with source HciSdnDiagnostics often point to networking issues between the bridge VM and Azure endpoints.
If you can confirm the bridge VM is running on the host but not showing as Connected in Azure, confirm that outbound HTTPS (port 443) traffic is allowed from the bridge VM's IP to *.arc.azure.com, *.his.arc.azure.com, and management.azure.com. A firewall rule blocking these endpoints is one of the most common silent killers in enterprise deployments.
When the repair is successful, the Connection status in the Azure portal will update to Connected within a few minutes. Once you see that, move on to verifying the custom location.
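If you'd rather script this check than click through the portal, here's a minimal sketch. It assumes the arcappliance Azure CLI extension is installed and that its show command exposes a status field; the function names and action labels are my own illustrations, and the status-to-action mapping simply restates the guidance above.

```shell
# Map an Arc Resource Bridge status string to the next action from this guide.
# The action labels are illustrative; unknown statuses fall through to "investigate".
classify_bridge_status() {
  case "$1" in
    Running|Connected) echo "healthy" ;;
    Expired)           echo "repair" ;;
    Degraded)          echo "check-host-logs" ;;
    Disconnected)      echo "check-firewall" ;;
    *)                 echo "investigate" ;;
  esac
}

# Fetch the bridge status via the arcappliance CLI extension.
# $1 = resource group, $2 = resource bridge name (both placeholders you supply).
bridge_status() {
  az arcappliance show \
    --resource-group "$1" \
    --name "$2" \
    --query "status" \
    --output tsv
}
```

On a machine that is signed in to Azure, `classify_bridge_status "$(bridge_status <resource-group> <bridge-name>)"` prints the suggested next step.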
A Custom Location is the on-premises equivalent of an Azure region. When Azure Local is deployed correctly, a custom location is created automatically alongside Arc Resource Bridge. This custom location is what Kubernetes operators select as the target when creating AKS Arc clusters. It tells Azure "deploy this cluster to my datacenter, not to a public Azure region."
To check your custom location, go to Azure Arc → Infrastructure → Custom locations in the Azure portal. You should see an entry corresponding to your Azure Local deployment. Click on it and verify:
- Namespace: Should point to your Arc Resource Bridge cluster
- Extensions: The AKS Arc operator extension must appear here with a status of Succeeded
- Host resource: Should link back to your Arc Resource Bridge
If the AKS Arc operator Kubernetes extension is missing or shows a failed status, that's why cluster deployments are failing. The Kubernetes extension for AKS Arc operators is supposed to install automatically on Arc Resource Bridge during Azure Local setup. If it didn't, you can install it manually via Azure CLI:
az k8s-extension create \
--resource-group <resource-group> \
--cluster-name <arc-resource-bridge-name> \
--cluster-type appliances \
--name <extension-name> \
--extension-type Microsoft.HybridAKSOperator \
--config Microsoft.CustomLocation.ServiceAccount=default
After running this, wait about 5 minutes and refresh the custom location blade in the portal. The extension status should flip to Succeeded. If it fails again, check the extension's error message. It usually specifies whether the issue is a version conflict, a missing dependency, or a permission problem on the Arc Resource Bridge cluster itself.
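To check extension status without refreshing the portal, something like the following sketch can work. It assumes the k8s-extension CLI extension is installed; `failed_extensions` and `list_bridge_extensions` are hypothetical helper names, and the parsing expects the three tab-separated columns produced by the query shown.

```shell
# Read tab-separated "name, extensionType, provisioningState" rows on stdin and
# print every extension whose provisioning state is not Succeeded.
failed_extensions() {
  awk -F'\t' '$3 != "Succeeded" { print $1 " (" $3 ")" }'
}

# List extensions installed on the Arc Resource Bridge as TSV rows.
# $1 = resource group, $2 = Arc Resource Bridge name (placeholders you supply).
list_bridge_extensions() {
  az k8s-extension list \
    --resource-group "$1" \
    --cluster-name "$2" \
    --cluster-type appliances \
    --query "[].[name, extensionType, provisioningState]" \
    --output tsv
}
```

Running `list_bridge_extensions <resource-group> <bridge-name> | failed_extensions` prints nothing when every extension reports Succeeded.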
One thing I've seen trip up teams: if the custom location was created with the wrong subscription scope, operators in a different subscription can't see it when creating clusters. Make sure the custom location's subscription matches where your operators expect to work.
I know this is frustrating, especially when everything looks correctly configured but clusters still won't deploy. Nine times out of ten, when an operator can see the custom location in the portal but gets an authorization error during cluster creation, the issue is a missing role assignment that the infrastructure admin forgot to set up.
The infrastructure admin needs to give Kubernetes operators three specific access grants. Here's exactly how to do each one:
Grant 1: Subscription access
In the Azure portal, go to Subscriptions → [Your Subscription] → Access control (IAM) → Add role assignment. Assign the operator account the Contributor role (or a custom role with AKS cluster create/read/write permissions if your org uses least-privilege policies).
Grant 2: Custom Location access
Navigate to the custom location resource under Azure Arc → Custom locations → [Your location] → Access control (IAM). Add the operator account with the Custom Location Contributor built-in role.
Grant 3: Virtual network access
Navigate to the logical network or virtual network resource that AKS Arc clusters will use for node networking. Under Access control (IAM), add the operator account with the Network Contributor role or at minimum with read and join permissions on the network.
You can verify all three assignments are in place with this Azure CLI command:
az role assignment list \
--assignee <operator-object-id> \
--all \
--output table
Once all three grants are confirmed, have the operator try creating a cluster again. If the portal deployment still fails, check the Activity Log on the cluster resource group. It shows the exact ARM operation that failed and the error code, which is far more useful than the portal's generic failure message.
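The table output is fine for eyeballing, but if you want a yes/no answer per grant, here's a rough sketch that checks one scope at a time. The helper names are hypothetical, the role names are the ones this guide suggests (your org may use custom roles), and `--include-inherited` is there so a subscription-level role also counts on child resources.

```shell
# Succeed if the exact role name in $1 appears in the list of role names on stdin.
has_role() {
  grep -Fxq -- "$1"
}

# Print the role names an assignee holds at a scope, one per line.
# $1 = operator object ID, $2 = scope resource ID (placeholders you supply).
list_roles_at_scope() {
  az role assignment list \
    --assignee "$1" \
    --scope "$2" \
    --include-inherited \
    --query "[].roleDefinitionName" \
    --output tsv
}

# Report whether one expected grant is present.
# $1 = operator object ID, $2 = scope resource ID, $3 = expected role name.
check_grant() {
  if list_roles_at_scope "$1" "$2" | has_role "$3"; then
    echo "OK      $3 at $2"
  else
    echo "MISSING $3 at $2"
  fi
}
```

Call `check_grant` three times, once each for the subscription, the custom location, and the network resource IDs, with the role you expect at each scope.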
You kicked off an AKS Arc cluster creation, the portal showed it moving to Creating, and now it's been sitting in Provisioning for 20+ minutes with no movement. This is one of the most common and annoying issues with AKS enabled by Azure Arc on Azure Local.
First, pull the cluster's provisioning status directly to get the underlying error:
az hybridaks show \
--name <cluster-name> \
--resource-group <resource-group> \
--query "properties.provisioningState" \
--output tsv
If that returns Failed (even though the portal shows Provisioning), get the full status object:
az hybridaks show \
--name <cluster-name> \
--resource-group <resource-group> \
--query "properties" \
--output json
Look at the errorMessage field in the output. The two most common errors here are:
- "InsufficientCapacity": The Azure Local host doesn't have enough CPU or memory for the requested node pool size. Check your system requirements and either reduce the VM size in the cluster configuration or free up resources on the host.
- "NetworkConfigurationError": The virtual network or IP address pool assigned to the cluster has a configuration issue. Verify the logical network in Azure Local has enough available IP addresses for the control plane nodes and worker nodes combined.
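For the IP address check, the arithmetic is simple enough to sanity-check before you deploy. This sketch counts one IP per node plus some headroom; the one spare address by default is my own assumption for illustration, so check the AKS Arc networking requirements for the real number in your setup.

```shell
# Estimate how many free IPs a cluster needs from the logical network.
# $1 = control plane nodes, $2 = worker nodes,
# $3 = spare IPs kept as headroom (assumed default of 1, adjust to your requirements).
ips_required() {
  local control_plane="$1" workers="$2" spare="${3:-1}"
  echo $(( control_plane + workers + spare ))
}

# Succeed if the pool has enough free addresses.
# $1 = free IPs in the pool, remaining arguments are passed to ips_required.
has_enough_ips() {
  local free="$1"
  shift
  [ "$free" -ge "$(ips_required "$@")" ]
}
```

For example, a cluster with 3 control plane nodes and 5 workers needs 9 addresses under this assumption, so a pool with 8 free IPs would fail the check.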
If the cluster is genuinely stuck (not failed, just not progressing), you can force a reconciliation by patching the cluster resource:
az hybridaks update \
--name <cluster-name> \
--resource-group <resource-group>
This triggers the AKS Arc operator to re-evaluate the cluster's desired state and attempt to reconcile. Give it another 10 minutes after running this. If the cluster is still stuck, delete it and recreate it. When a stuck provisioning state doesn't respond to a reconcile, starting fresh is almost always cleaner than fighting through it.
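Rather than refreshing by hand for those 10 minutes, you can poll. Below is a minimal sketch of a polling helper; the function names are hypothetical, `cluster_state` just wraps the same az hybridaks show query used above, and treating Succeeded and Failed as the only terminal states is an assumption.

```shell
# Poll a state-fetching command until it prints a terminal state or attempts run out.
# $1 = max attempts, $2 = seconds to sleep between attempts,
# remaining arguments = command (and its arguments) that prints the current state.
wait_for_terminal_state() {
  local attempts="$1" interval="$2" i=1 state
  shift 2
  while [ "$i" -le "$attempts" ]; do
    state="$("$@")"
    case "$state" in
      Succeeded|Failed) echo "$state"; return 0 ;;
    esac
    sleep "$interval"
    i=$((i + 1))
  done
  echo "TimedOut"
  return 1
}

# Print the provisioning state of a cluster.
# $1 = cluster name, $2 = resource group (placeholders you supply).
cluster_state() {
  az hybridaks show \
    --name "$1" \
    --resource-group "$2" \
    --query "properties.provisioningState" \
    --output tsv
}
```

Something like `wait_for_terminal_state 20 30 cluster_state <cluster-name> <resource-group>` checks every 30 seconds for up to 10 minutes and prints the final state, or TimedOut if the cluster never settles.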
Your cluster shows Running in the portal, great. But now you need to actually connect to it and make sure the nodes are healthy. Because AKS enabled by Azure Arc clusters connect to Arc automatically when they're created, you can authenticate using your Microsoft Entra ID credentials from anywhere, which is one of the genuinely excellent things about this setup.
To open an authenticated connection to your AKS Arc cluster:
az hybridaks proxy \
--name <cluster-name> \
--resource-group <resource-group> \
--token-expiry-minutes 60
This opens a local proxy and sets your current context to the remote cluster. Once connected, check node status:
kubectl get nodes -o wide
Every node should show Ready in the STATUS column. If a node shows NotReady, describe it for details:
kubectl describe node <node-name>
In the Conditions section, look at the Message field for the Ready condition. Common messages that indicate specific problems:
- "container runtime network not ready": The CNI plugin failed to initialize. Restart the node or check the AKS Arc networking configuration.
- "PLEG is not healthy": The container runtime on the node is unhealthy. This often requires a node restart via the Azure Local management layer.
- "node lease expired": The node lost contact with the API server. Check whether the node VM is still running on the Azure Local host.
For Windows node pools specifically, check that the Windows version on the node VM matches a supported version for the Kubernetes version you selected. Version mismatches between Windows OS builds and Kubernetes releases are a known source of node instability in AKS Arc deployments on Azure Local. The system requirements documentation specifies the supported matrix. Always check it before deploying a new node pool version.
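On clusters with more than a handful of nodes, scanning the STATUS column by eye gets old fast. Here's a small sketch that filters the node list down to the ones that need a closer look; `not_ready_nodes` is a hypothetical helper, and it deliberately treats a cordoned node (Ready,SchedulingDisabled) as ready.

```shell
# Read `kubectl get nodes --no-headers` output on stdin and print the names of
# nodes whose STATUS column does not begin with "Ready".
not_ready_nodes() {
  awk '$2 !~ /^Ready(,|$)/ { print $1 }'
}
```

Pipe `kubectl get nodes --no-headers` into it, then run kubectl describe node on each name it prints.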
Advanced Troubleshooting
If you've worked through all five steps above and things are still broken, you're into the territory where Azure Arc's internal logging and enterprise network configuration need to be examined carefully. Here's where to dig.
Reading AKS Arc Operator Logs on Arc Resource Bridge
The AKS Arc operator extension runs as a set of pods inside Arc Resource Bridge's internal Kubernetes cluster. To read its logs, you first need to get credentials to the Arc Resource Bridge cluster itself, which requires running commands on the Azure Local host as the infrastructure admin. From the Azure Local host PowerShell:
Get-ArcResourceBridgeCredentials | kubectl get pods -n kube-system
Once you're in, find the AKS Arc operator pods:
kubectl get pods -n azure-arc --field-selector=status.phase!=Running
Any pod not in Running state is a problem. Get logs from a failing pod:
kubectl logs <pod-name> -n azure-arc --previous
The --previous flag gets logs from the last failed container instance, which is where the actual error will be.
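If several pods are unhealthy at once, a loop saves some typing. This is a sketch only: `collect_failing_pod_logs` is a hypothetical name, the azure-arc namespace default comes from the commands above, and limiting output to the last 50 lines is an arbitrary choice.

```shell
# Print the tail of the previous container's logs for every pod in a namespace
# that is not in the Running phase.
# $1 = namespace (defaults to azure-arc, as used in the commands above).
collect_failing_pod_logs() {
  local ns="${1:-azure-arc}" pod
  kubectl get pods -n "$ns" --field-selector=status.phase!=Running -o name |
  while read -r pod; do
    echo "=== $pod ==="
    # A pod that never restarted has no previous container, so tolerate failures.
    kubectl logs "$pod" -n "$ns" --previous --tail=50 || true
  done
}
```

Run it from the same session where you already have credentials to the Arc Resource Bridge cluster.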
Event Viewer Analysis for Azure Local
On the Azure Local host, Event Viewer is your best friend for infrastructure-level issues. Open Event Viewer and go to Applications and Services Logs → Microsoft → Windows → Hyper-V-VMMS. Event IDs 13001 and 13003 indicate VM lifecycle problems that could explain why cluster nodes are failing to start. For networking specifically, check Microsoft-Windows-Hyper-V-VmSwitch/Operational. Event ID 107 there indicates a failed VM network adapter attachment, which will break node provisioning silently.
Network-Level Fixes for Enterprise Environments
In domain-joined and enterprise firewall environments, AKS Arc has a set of required outbound endpoints that must be allowed. Beyond the Azure Arc endpoints, AKS Arc clusters also need outbound access to:
*.mcr.microsoft.com # Container image pulls
*.data.mcr.microsoft.com # MCR CDN
aksrepos.azurecr.io # AKS component images
*.blob.core.windows.net # Diagnostic data and logs
*.servicebus.windows.net # Service Bus (Arc relay)
If your organization uses SSL inspection (TLS break-and-inspect), you'll need to add the Arc Resource Bridge VM's certificate to the trusted certificate store. SSL inspection has broken Arc connectivity in multiple enterprise deployments I've seen.
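A quick way to tell whether a firewall or TLS inspection is in the way is to probe the endpoints from a machine on the same network segment. This sketch uses curl, which fails on DNS, TCP, or certificate errors but treats any HTTP response as reachable. The function names are hypothetical, and wildcard entries are skipped because they can't be probed directly.

```shell
# Succeed if the host pattern contains a wildcard and so cannot be probed as-is.
is_wildcard_host() {
  case "$1" in
    *'*'*) return 0 ;;
    *)     return 1 ;;
  esac
}

# Attempt an HTTPS connection on port 443 and report the result.
# Any HTTP status counts as reachable; DNS, TCP, and TLS errors count as blocked.
probe_https() {
  if curl --silent --output /dev/null --connect-timeout 5 --max-time 10 "https://$1"; then
    echo "reachable $1"
  else
    echo "blocked   $1"
  fi
}

# Probe each host given as an argument, skipping wildcard patterns.
check_endpoints() {
  local host
  for host in "$@"; do
    if is_wildcard_host "$host"; then
      echo "skipped   $host (wildcard: probe a concrete hostname instead)"
    else
      probe_https "$host"
    fi
  done
}
```

For example, `check_endpoints management.azure.com aksrepos.azurecr.io` probes two concrete hosts from the lists in this guide. A "blocked" result on a host your firewall supposedly allows is a strong hint that TLS inspection is rewriting the certificate.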
Checking for Version Conflicts
AKS Arc has strict compatibility requirements between the Azure Local version (22H2 vs 23H2), the Arc Resource Bridge version, and the supported Kubernetes versions. If you recently updated Azure Local or Arc components, check the AKS Arc release notes in the "What's new" section of the official documentation to confirm your combination is supported. Mismatched component versions cause subtle failures that are hard to trace without knowing to look for version conflicts specifically.
Before you open a ticket with Microsoft Support, gather the output of az hybridaks show, the Arc Resource Bridge status, and any Event Viewer errors from the Azure Local host. That'll save you 30 minutes on the call. Reach Microsoft Support through the Azure portal by opening a support request from your subscription's Help + Support blade.
Prevention & Best Practices
I've seen enough AKS Arc disasters to know that most of them were preventable. Here's what separates teams that run AKS Arc smoothly long-term from teams that are constantly firefighting.
Keep Arc Resource Bridge healthy proactively. Arc Resource Bridge doesn't send you proactive alerts if it's drifting toward a degraded state. Set up an Azure Monitor alert rule on the Arc Resource Bridge resource to notify you when its connectionStatus property changes away from Connected. Catching this early, before it impacts running clusters, gives you time to repair during a maintenance window instead of during an outage.
Document and automate role assignments. The infrastructure admin / Kubernetes operator split is one of AKS Arc's design strengths, but it's also a common operational failure point. Document exactly which roles each operator account needs (subscription Contributor, Custom Location Contributor, Network Contributor) and automate the assignment via ARM templates or Azure Policy so new operators are provisioned correctly from day one.
Check system requirements before each cluster deployment. AKS Arc on Azure Local has specific CPU, memory, and storage requirements per node VM size. Before deploying a new cluster or adding a node pool, verify your Azure Local host has the capacity. Provisioning failures caused by insufficient resources are entirely avoidable and extremely annoying to diagnose after the fact.
Use Azure Monitor for Arc-enabled Kubernetes clusters. The official documentation specifically calls out Azure Monitor integration as a key AKS Arc feature for a reason. Enable Container Insights for your AKS Arc clusters. It gives you node-level CPU/memory metrics, pod health visibility, and log aggregation in Log Analytics without any manual instrumentation. This makes diagnosing future issues dramatically faster.
Test your Entra ID cluster access from multiple locations. One of the best things about AKS enabled by Azure Arc is that developers and operators can connect to clusters from anywhere using Microsoft Entra ID credentials. But this only works if Entra ID RBAC is configured correctly on the cluster. After creating a new cluster, test connectivity from outside your corporate network to confirm the Arc-based authentication path is working end-to-end before your developers try to use it.
- Set up Azure Monitor alerts for Arc Resource Bridge connection status changes to catch degradation before it causes cluster failures
- Use ARM templates or Bicep to codify your AKS Arc cluster configurations so deployments are repeatable and version-controlled
- Schedule regular checks of the AKS Arc "What's new" page, because component version compatibility requirements change with releases
- Run the AKS Arc pre-deployment system requirements checker before every new cluster deployment to avoid capacity-related provisioning failures
Frequently Asked Questions
What exactly is AKS enabled by Azure Arc and how is it different from regular AKS?
Regular AKS runs in Azure's public cloud regions. AKS enabled by Azure Arc takes that same managed Kubernetes experience and brings it to on-premises infrastructure, edge locations, and other environments outside Azure, like retail stores, manufacturing floors, or datacenters running Azure Local. The key difference is where the cluster nodes actually run: with AKS Arc, the workloads run on your hardware, but you manage everything through the same Azure portal, Azure CLI, and ARM templates you'd use for cloud AKS. Arc Resource Bridge is what makes the connection between your on-premises environment and Azure's management plane possible. You still get features like Azure Monitor integration, Azure Policy governance, and Microsoft Entra ID authentication; they just work against clusters that live on your infrastructure.
Why does my AKS Arc cluster keep getting stuck in Provisioning and never finishing?
A cluster stuck in Provisioning is almost always one of three things: Arc Resource Bridge has lost connectivity to Azure and the operator extension can't communicate back to the ARM control plane; the Azure Local host doesn't have enough free CPU or memory for the node VM size you requested; or the logical network assigned to the cluster is running out of available IP addresses. Start by checking Arc Resource Bridge status in the Azure portal. If it's not Connected, that's your problem. If the bridge looks healthy, run az hybridaks show and look at the errorMessage field in the properties JSON for the specific reason. Most stuck provisioning states resolve after fixing the underlying cause and running az hybridaks update to trigger a reconciliation.
What's the difference between an infrastructure administrator and a Kubernetes operator in AKS Arc?
These are the two main personas in the AKS Arc model, and understanding the split saves a lot of confusion. The infrastructure administrator owns the Azure Local hardware, sets up Arc Resource Bridge and the Custom Location, and configures networking and storage: the plumbing that everything else depends on. The Kubernetes operator is the person who actually creates and manages Kubernetes clusters and deploys applications, but they don't need to touch the underlying infrastructure. The admin gives the operator three things: access to the Azure subscription, access to the custom location, and access to the virtual network. Once those three grants are in place, the operator can work entirely through the Azure portal or CLI without ever needing to SSH into the Azure Local host. When cluster creation fails due to authorization errors, the breakdown is almost always in one of those three access grants.
Can I run Windows node pools alongside Linux node pools in AKS Arc on Azure Local?
Yes, AKS enabled by Azure Arc supports both Windows and Linux node pools, and you can mix them within the same cluster, the same way you would with cloud AKS. The Kubernetes operator chooses the node pool OS type when creating the cluster or adding a node pool, without needing to coordinate with the infrastructure admin for that choice. The important thing to watch is version compatibility: the Windows OS build on the node VMs must match a supported version for the Kubernetes version you've selected. Microsoft publishes a supported version matrix in the AKS Arc documentation, and running a combination that's outside that matrix is a fast path to unstable node behavior. Always check the matrix before deploying Windows node pools, especially after an Azure Local host update.
What are the deployment options available for AKS enabled by Azure Arc?
There are three main deployment paths. AKS on Azure Local is the full-featured option for datacenter and enterprise edge deployments. It uses Azure Arc to create and manage clusters on Azure Local directly from Azure using familiar tools like the portal and ARM templates. AKS Edge Essentials is designed for PC-class or "light" edge hardware where you want a minimal Kubernetes footprint with a simple installation; think single-purpose edge devices at retail or manufacturing sites. AKS on VMware (currently in preview) extends the same Arc-based management to Kubernetes clusters running on VMware vSphere, which is useful for organizations with existing VMware investments. The right choice depends on your hardware: Azure Local for server-class infrastructure, Edge Essentials for lightweight edge devices, and VMware for existing vSphere environments.
Do I need to be an expert in Kubernetes to get started with AKS enabled by Azure Arc?
Not really, and that's genuinely one of the design goals. AKS enabled by Azure Arc reduces the operational complexity of Kubernetes by offloading much of the management burden to Azure. If you're coming from an Azure background and have created cloud AKS clusters before, you'll feel at home: the portal experience, CLI commands, and ARM/Bicep templates are intentionally consistent with cloud AKS. The infrastructure administrator role does need deeper knowledge: setting up Azure Local, configuring networking, and managing Arc Resource Bridge all require solid infrastructure skills. But the Kubernetes operator role is accessible to teams that know how to deploy containerized applications without being Kubernetes infrastructure experts. That said, if you're dealing with the troubleshooting scenarios in this guide, some familiarity with kubectl and Azure CLI helps significantly.