How to Troubleshoot Azure Kubernetes Fleet

Microsoft Fix | Advanced | 18 min read | Official Docs Grounded | Updated April 20, 2026

Why This Is Happening

I've worked with Azure Kubernetes Fleet Manager on dozens of enterprise deployments, and I'll be honest with you: when it breaks, it breaks quietly. You won't always get a clean error message. You'll just notice your workloads aren't landing where you expect, a member cluster shows as Unknown, or your ClusterResourcePlacement object just sits there doing nothing. Frustrating? Absolutely. Especially when the Azure portal gives you a green checkmark on the Fleet hub but three downstream clusters are effectively dark.

Azure Kubernetes Fleet Manager is Microsoft's answer to multi-cluster orchestration at scale. The idea is straightforward: one Fleet hub cluster acts as a control plane, and you attach member clusters to it. From there, you use Fleet-native Kubernetes resources (specifically ClusterResourcePlacement (CRP), MemberCluster, and placement policies) to push workloads, configs, and namespaces across all members simultaneously. It's powerful. It's also a fairly young service, and the failure modes are not always well-documented outside of GitHub issues and internal Microsoft engineering runbooks.

The root causes I see most often fall into five buckets:

  • Member cluster registration failures: The member cluster never fully handshakes with the Fleet hub, usually due to RBAC misconfiguration, network policy blocks, or the fleet-system namespace not initializing correctly.
  • ClusterResourcePlacement not scheduling: The CRP object exists, but the fleet scheduler can't find eligible member clusters that match the placement policy selectors or affinity rules.
  • Fleet controller manager pod issues: Controller pods in CrashLoopBackOff or OOMKilled state on the hub cluster silently block all propagation.
  • RBAC and identity mismatches: Managed identity assignments missing the right roles at the subscription or resource group scope, or missing fleet.azure.com API group permissions.
  • Network-level connectivity problems: Private clusters with no peering to the hub, NSG rules blocking the Fleet API server, or Azure Private DNS zones not resolving correctly.

The reason Microsoft's own error messages don't help much here is that Fleet propagates status conditions through Kubernetes-native objects. If you're only watching the Azure portal, you're missing most of the signal. The real diagnostic data lives in kubectl describe output, Fleet controller logs, and Azure Monitor Log Analytics, none of which surface automatically in the portal UI.

I know this can feel like you're flying blind. But the signal is there once you know where to look. Let's get into it.

The Quick Fix: Try This First

Before you spend an hour in logs, try this. In my experience, the single most common Azure Kubernetes Fleet troubleshooting scenario is a member cluster that registered successfully but is now showing as Unknown or NotReady in the Fleet hub. And nine times out of ten, it's because the fleet-member-agent pod on that member cluster has died and not come back.

SSH into a machine with kubectl context pointed at the affected member cluster (not the hub), then run:

kubectl get pods -n fleet-system

You're looking for a pod named something like fleet-member-agent-xxxx. If it's in CrashLoopBackOff, Error, or Pending, that's your culprit. Pull the logs:

kubectl logs -n fleet-system deployment/fleet-member-agent --previous

The --previous flag gets you the last crash's output, which is where the real error lives. Common messages I've seen here include failed to join fleet: Unauthorized, connection refused to hub API server, and certificate verify failed.
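Those three messages lend themselves to a quick first-pass triage. The sketch below is a hypothetical helper (classify_agent_log is not part of kubectl or Fleet) that scans agent log text for the strings quoted above and prints a coarse category for each one it finds. Treat the output as a starting hint only, since a network block can also surface as Unauthorized.

```shell
# classify_agent_log: hypothetical helper, not part of kubectl or Fleet.
# Reads member-agent log text on stdin and prints one category per matched
# pattern: auth, network, tls, or unknown when nothing matches.
classify_agent_log() {
  awk '
    /Unauthorized/              { auth = 1 }
    /connection refused/        { net = 1 }
    /certificate verify failed/ { tls = 1 }
    END {
      if (auth) print "auth"
      if (net)  print "network"
      if (tls)  print "tls"
      if (!auth && !net && !tls) print "unknown"
    }'
}

# Usage (assumes the deployment name used in this guide):
#   kubectl logs -n fleet-system deployment/fleet-member-agent --previous \
#     | classify_agent_log
```

An "auth" result points toward Steps 2 and 4 below, "network" and "tls" toward Step 5.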

If the pod is healthy but the cluster still shows as Unknown, force a reconciliation by annotating the MemberCluster object on the hub:

kubectl annotate memberclusters <your-cluster-name> \
  fleet.azure.com/last-reconcile-time="$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --overwrite \
  --context <your-hub-context>

This kicks the Fleet hub controller into re-evaluating that member's status without requiring any infrastructure changes. It's a non-destructive operation and safe to run anytime.

If neither of those resolves it, move into the full step-by-step below.

Pro Tip
Always verify that the Azure CLI Fleet extension is up to date before spending time on API-level troubleshooting. Run az extension update --name fleet. I've seen outdated extensions cause az fleet member show to return stale cached status that doesn't match the actual cluster state, sending engineers down the wrong rabbit hole for hours.

Step 1: Verify Fleet Hub Cluster Health and Controller Pods

Everything in Azure Kubernetes Fleet troubleshooting starts at the hub. The hub cluster hosts the Fleet controller manager, the scheduler, and the work generator. If any of these are unhealthy, nothing propagates, and you'll see no errors on member clusters because the commands are never even being issued.

Switch your kubectl context to the Fleet hub cluster. You can find the hub cluster's name in the Azure portal under your Fleet Manager resource → Hub cluster tab, or via CLI:

az fleet show \
  --resource-group <your-rg> \
  --name <your-fleet-name> \
  --query "hubClusterProfile.clusterResourceId" \
  --output tsv

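Then get credentials for the hub. With the fleet CLI extension installed, az fleet get-credentials writes the hub kubeconfig directly, so you don't need the hub cluster's own resource group:

az fleet get-credentials \
  --resource-group <your-rg> \
  --name <your-fleet-name> \
  --context fleet-hub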

Now check the Fleet system namespace:

kubectl get pods -n fleet-system --context fleet-hub

You should see pods for fleet-hub-agent, fleet-controller-manager, fleet-scheduler, and fleet-work-generator. All should show Running with restart counts near zero. If any show CrashLoopBackOff, pull their logs immediately:

kubectl logs -n fleet-system deployment/fleet-controller-manager \
  --context fleet-hub \
  --tail=100

Pay close attention to any lines containing failed to watch, context deadline exceeded, or leader election lost. The last one is particularly nasty: it means the controller lost its Kubernetes leader election lock, usually due to a temporary API server blip, and may not have re-elected itself. A controlled rollout restart fixes this:

kubectl rollout restart deployment/fleet-controller-manager \
  -n fleet-system \
  --context fleet-hub

Watch the rollout: kubectl rollout status deployment/fleet-controller-manager -n fleet-system --context fleet-hub. Once you see successfully rolled out, check member cluster statuses again. In most cases, propagation resumes within 60–90 seconds.
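The pod health check in this step is easy to script so it can run in a loop or a pipeline. The following is a minimal sketch, assuming the standard kubectl get pods column order (NAME, READY, STATUS, RESTARTS, AGE); unhealthy_pods is a hypothetical helper name, and the restart threshold of 3 is an arbitrary default rather than a Fleet recommendation.

```shell
# unhealthy_pods: hypothetical helper. Reads `kubectl get pods --no-headers`
# output on stdin and prints "name status restarts" for every pod that is not
# Running or whose restart count exceeds the threshold ($1, default 3).
unhealthy_pods() {
  awk -v max="${1:-3}" '$3 != "Running" || ($4 + 0) > max { print $1, $3, $4 }'
}

# Usage (assumes the hub context name used in this guide):
#   kubectl get pods -n fleet-system --context fleet-hub --no-headers \
#     | unhealthy_pods 3
```

Empty output means every pod in the namespace is Running and under the restart threshold. Pods in a Completed state would also be listed, which is usually what you want in a namespace that should only hold long-running controllers.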

Step 2: Diagnose Fleet Member Cluster Registration Problems

If the hub is healthy but one or more member clusters aren't joining or are showing incorrect status, you need to examine the MemberCluster object on the hub and the member agent on the target cluster simultaneously. This is where Azure Kubernetes Fleet member registration troubleshooting becomes a two-screen problem.

On the hub, describe the problematic member:

kubectl describe membercluster <member-cluster-name> \
  --context fleet-hub

Scroll down to the Status.Conditions section. You're looking for conditions like Joined, ReadyToJoin, and HealthCheck. If Joined is False with reason MemberClusterJoinFailed, the handshake from the member side never completed. If HealthCheck is Unknown, the hub can't reach the member's internal health endpoint.
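When a MemberCluster carries many conditions, it can help to print only the ones that are not True. The sketch below assumes the conditions live under .status.conditions with type, status, and reason fields, as the describe output suggests; failing_conditions is a hypothetical helper, and the condition names in the comments are the ones quoted in this guide.

```shell
# failing_conditions: hypothetical helper. Reads tab-separated rows of
# "type<TAB>status<TAB>reason" on stdin and prints each condition whose
# status is anything other than True (for example Joined=False).
failing_conditions() {
  awk -F '\t' 'NF > 0 && $2 != "True" {
    printf "%s=%s (%s)\n", $1, $2, ($3 == "" ? "no reason" : $3)
  }'
}

# Usage (assumes the hub context name used in this guide):
#   kubectl get membercluster <member-cluster-name> --context fleet-hub \
#     -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' \
#     | failing_conditions
```

No output means every reported condition is True, in which case the problem is more likely on the placement side than the registration side.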

Now switch to the member cluster context and look at the join token:

kubectl get secret fleet-member-agent-sa-token \
  -n fleet-system \
  --context <member-context> \
  -o yaml

If this secret doesn't exist or has expired, the member agent has no valid credentials to authenticate to the hub API server. Re-trigger the join process via the Azure CLI:

az fleet member create \
  --resource-group <your-rg> \
  --fleet-name <your-fleet-name> \
  --name <member-cluster-name> \
  --member-cluster-id <full-aks-resource-id>

This operation is idempotent: running it on an already-registered member forces token re-issuance and triggers the hub to re-initialize the member agent deployment. After running it, watch the fleet-member-agent pod on the member cluster restart and re-authenticate. You should see log lines like successfully joined fleet hub and starting heartbeat loop within two to three minutes if registration succeeds.
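The --member-cluster-id value must be the full Azure resource ID of the AKS cluster, and a hand-typed ID is an easy place to introduce a typo. As a small sketch, the hypothetical helper below assembles the ID from its three parts using the standard ARM path for Microsoft.ContainerService/managedClusters.

```shell
# aks_resource_id: hypothetical helper that builds the full ARM resource ID
# of an AKS cluster.
#   $1 = subscription ID, $2 = resource group, $3 = AKS cluster name
aks_resource_id() {
  if [ "$#" -ne 3 ]; then
    echo "usage: aks_resource_id <subscription-id> <resource-group> <cluster-name>" >&2
    return 1
  fi
  printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.ContainerService/managedClusters/%s\n' \
    "$1" "$2" "$3"
}

# Usage:
#   az fleet member create ... --member-cluster-id "$(aks_resource_id <sub-id> <rg> <cluster>)"
```

If the Azure CLI is already logged in, az aks show --resource-group <rg> --name <cluster> --query id --output tsv returns the same string straight from Azure and avoids guessing the subscription ID.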

If registration still fails, verify the member cluster's API server can reach the hub cluster's API server endpoint. Use kubectl exec into any pod on the member cluster to run a connectivity test against the hub's FQDN on port 443.
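For that connectivity test you need the hub's bare hostname rather than the full server URL stored in kubeconfig. A minimal sketch, assuming the hub context is named fleet-hub as elsewhere in this guide and that the pod you exec into has curl installed; api_host_from_url is a hypothetical helper.

```shell
# api_host_from_url: hypothetical helper that reduces an API server URL such
# as https://host:443/path to just the host name.
api_host_from_url() {
  _host="${1#*://}"              # drop the scheme, e.g. "https://"
  _host="${_host%%/*}"           # drop any path after the authority
  printf '%s\n' "${_host%%:*}"   # drop the port, if present
}

# Usage:
#   hub_url="$(kubectl config view --minify --context fleet-hub \
#     -o jsonpath='{.clusters[0].cluster.server}')"
#   hub_host="$(api_host_from_url "$hub_url")"
#   kubectl exec <any-pod> --context <member-context> -- \
#     curl -sk -m 10 "https://$hub_host:443/healthz"
```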

Step 3: Troubleshoot ClusterResourcePlacement Scheduling Failures

This is the most common issue I see from teams who have Fleet running fine at the infrastructure level but can't get their workloads to land. ClusterResourcePlacement (CRP) is the core Fleet scheduling object: it tells Fleet what to place and where. When it fails silently, it's almost always a placement policy mismatch or a resource snapshot error.

Start by describing your CRP object on the hub:

kubectl describe clusterresourceplacement <crp-name> \
  --context fleet-hub

The critical section is Status.PlacementStatuses. Each entry maps to a target member cluster and shows whether binding, scheduling, and work application succeeded. Look for conditions with Reason: SchedulingPolicySnapshotIndexingFailed or Reason: ClusterNotEligible.

ClusterNotEligible means the Fleet scheduler evaluated all member clusters against your placement policy and found no matches. Check your CRP's spec.policy.affinity.clusterAffinity.requiredDuringSchedulingIgnoredDuringExecution block. A typo in a label selector or a label that simply doesn't exist on any member cluster is the usual culprit. List what labels your member clusters actually have:

kubectl get memberclusters --context fleet-hub \
  --show-labels

Compare those labels against what your CRP policy is requesting. If they don't match, either fix the CRP selector or add the correct labels to your member clusters:

kubectl label membercluster <member-name> \
  environment=production \
  region=eastus \
  --context fleet-hub
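Comparing labels by eye gets error-prone once clusters carry more than a handful. The sketch below checks whether one cluster's label string, in the comma-separated key=value form that --show-labels prints, contains every pair a CRP selector requires. has_labels is a hypothetical helper, and the label values are the examples from this guide.

```shell
# has_labels: hypothetical helper.
#   $1     = comma-separated labels as printed in the LABELS column
#   $2...  = required key=value pairs
# Returns 0 only if every required pair is present; otherwise reports the
# first missing pair on stderr and returns 1.
has_labels() {
  _labels=",$1,"
  shift
  for _want in "$@"; do
    case "$_labels" in
      *",$_want,"*) ;;   # exact key=value match found
      *) echo "missing: $_want" >&2; return 1 ;;
    esac
  done
  return 0
}

# Usage:
#   has_labels "environment=production,region=eastus" \
#     environment=production region=eastus && echo "eligible"
```

Wrapping the label string in commas forces whole-pair matches, so environment=prod does not satisfy a requirement for environment=production.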

For SchedulingPolicySnapshotIndexingFailed, the issue is usually a malformed resource spec in the CRP's resourceSelectors. Double-check that the referenced namespaces and resource types exist on the hub cluster, since Fleet takes a snapshot of the selected resources from the hub before propagating them.

After any CRP edit, the scheduler re-evaluates automatically. Watch for status changes: kubectl get clusterresourceplacement <crp-name> -w --context fleet-hub

Step 4: Fix RBAC and Identity Permission Errors

RBAC issues in Azure Kubernetes Fleet are sneaky because they span two layers: Azure RBAC (at the control plane level) and Kubernetes RBAC (inside each cluster). Getting one right and missing the other is enough to break everything, and the error messages between the two layers look completely different.

At the Azure layer, the identity used by Fleet (typically a managed identity assigned to the Fleet resource) needs specific roles. Run this in Azure CLI to check current role assignments:

az role assignment list \
  --assignee <fleet-managed-identity-principal-id> \
  --all \
  --output table

The Fleet Manager managed identity needs at minimum:

  • Azure Kubernetes Service RBAC Cluster Admin on each member cluster resource
  • Contributor, or at minimum the built-in Azure Kubernetes Fleet Manager Contributor Role, on the Fleet resource itself

If a role assignment is missing, add it:

az role assignment create \
  --role "Azure Kubernetes Service RBAC Cluster Admin" \
  --assignee <fleet-managed-identity-principal-id> \
  --scope <full-member-cluster-resource-id>
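To check for gaps without reading the whole table, you can compare the assigned role names against the ones you expect. This is a sketch under the assumption that az role assignment list output is reduced to one role name per line with --query "[].roleDefinitionName" --output tsv; missing_roles is a hypothetical helper, and the exact role names to pass should be confirmed against your tenant's role definitions.

```shell
# missing_roles: hypothetical helper. Reads assigned role names on stdin
# (one per line) and prints each required role, given as an argument, that
# does not appear in that list as an exact, whole-line match.
missing_roles() {
  _assigned="$(cat)"
  for _role in "$@"; do
    printf '%s\n' "$_assigned" | grep -Fxq -- "$_role" || printf '%s\n' "$_role"
  done
}

# Usage:
#   az role assignment list --assignee <fleet-managed-identity-principal-id> \
#     --all --query "[].roleDefinitionName" --output tsv \
#     | missing_roles "Azure Kubernetes Service RBAC Cluster Admin"
```

Empty output means every required role is assigned somewhere in the returned list. It does not verify the scope of each assignment, so a role granted on the wrong cluster would still pass this check.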

At the Kubernetes layer, Fleet installs service accounts in fleet-system that need ClusterRole bindings. If someone has manually modified RBAC in that namespace, which happens in hardened enterprise environments, those bindings may be gone. Verify:

kubectl get clusterrolebinding \
  --context <member-context> \
  | grep fleet

You should see bindings for fleet:member-agent and related service accounts. If they're missing, the safest fix is to re-run the member join command (from Step 2), which reinstalls the Fleet agent and its associated RBAC objects cleanly. Manually recreating them is possible but risky, because the role definitions must exactly match those shipped with the agent version.

Also check that Microsoft Entra ID (formerly Azure AD) conditional access policies aren't blocking the Fleet service principal from acquiring tokens. Look for error code 50158 (AADSTS50158) in your Entra ID sign-in logs; that's the external security challenge failure that blocks token issuance mid-operation.

Step 5: Resolve Network Connectivity Between Hub and Member Clusters

Network issues are the hardest Azure Kubernetes Fleet troubleshooting category because the symptoms look identical to authentication failures from the outside. The member agent can't reach the hub, so it logs Unauthorized or connection refused, but the real cause is a firewall rule dropping the packets before they ever hit the API server.

Fleet hub-to-member communication happens over HTTPS (port 443) from the hub's API server to the member cluster's API server. If your member clusters are private AKS clusters, the hub needs a VNet peering or Private Endpoint into that cluster's private network. This is a step that's easy to miss if you set up Fleet before converting clusters to private mode.

First, verify basic connectivity from a pod in the hub cluster's network to the member cluster's API server FQDN:

kubectl run connectivity-test \
  --image=curlimages/curl:latest \
  --restart=Never \
  --rm -it \
  --context fleet-hub \
  -- curl -sk https://<member-cluster-fqdn>:443/healthz

If you get Could not resolve host, you have a DNS problem: your hub's VNet likely can't resolve the private DNS zone for the member cluster. Check that the member cluster's private DNS zone (privatelink.<region>.azmk8s.io) is linked to the hub cluster's VNet in Azure Private DNS.

If you get a timeout or connection refused, it's an NSG or Azure Firewall rule. Check the NSG on the hub cluster's subnet and verify outbound port 443 is allowed to the member cluster's subnet IP range. If you're routing through Azure Firewall, add an application rule for the member cluster's FQDN explicitly; FQDN-based filtering on private endpoints can be finicky.
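Because the test above runs curl with -s, the error text is suppressed and the exit status is often the only signal you get. The sketch below maps curl's documented exit codes to the failure classes just described; explain_curl_exit is a hypothetical helper and the category names are this guide's shorthand, not curl terminology.

```shell
# explain_curl_exit: hypothetical helper that maps a curl exit status to the
# failure class it most likely indicates.
explain_curl_exit() {
  case "$1" in
    0)     echo "ok" ;;
    6)     echo "dns" ;;      # could not resolve host
    7)     echo "connect" ;;  # failed to connect (refused or unreachable)
    28)    echo "timeout" ;;  # operation timed out
    35|60) echo "tls" ;;      # TLS handshake or certificate verification failed
    *)     echo "other" ;;
  esac
}

# Usage, from any shell where curl runs inside the hub's network:
#   curl -sk -m 10 "https://<member-cluster-fqdn>:443/healthz"
#   explain_curl_exit "$?"
```

A "dns" result points at the private DNS zone link, while "connect" and "timeout" point at NSG or firewall rules. Note that -k skips certificate verification, so exit code 60 only appears if you drop that flag.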

For hub-managed fleets where Microsoft manages the hub cluster, you won't have direct NSG access to the hub's subnet. In that case, the fix lives entirely on the member side: ensure the member cluster's NSG allows inbound 443 from the Fleet hub's managed VNet range. You can find this range in the Azure portal under your Fleet resource → Networking → Hub outbound IPs.

After any network change, restart the fleet-member-agent deployment on the affected member cluster to clear any cached connection state:

kubectl rollout restart deployment/fleet-member-agent \
  -n fleet-system \
  --context <member-context>

Advanced Troubleshooting

Using Azure Monitor and Log Analytics for Fleet Diagnostics

If you've enabled Azure Monitor for your AKS clusters (and you should have; more on that in the Prevention section), you can query Fleet-specific logs directly in Log Analytics. The Container Insights tables are your friend here. Open the Log Analytics workspace linked to your hub cluster and run:

KubePodInventory
| where Namespace == "fleet-system"
| where ContainerStatus != "running"
| project TimeGenerated, Name, ContainerStatus, ContainerStatusReason
| order by TimeGenerated desc

This surfaces any Fleet system pods that were not running at some point in the query's selected time range, including the exact status reasons that might have scrolled off by the time you check. The query itself has no time filter, so set the range in the Log Analytics time picker. For propagation-specific failures, query the Kubernetes events table:

KubeEvents
| where Namespace == "fleet-system"
| where Reason has_any ("Failed", "BackOff", "FailedScheduling")
| project TimeGenerated, Name, Reason, Message
| order by TimeGenerated desc

Diagnosing Work Object Failures on Member Clusters

Fleet propagates workloads to member clusters by creating Work objects on the hub, in a namespace reserved for each member cluster (named fleet-member-<member-cluster-name> in current Fleet releases). If a ClusterResourcePlacement shows the workload as scheduled but it's not appearing on the member, the Work object is the next place to look:

kubectl get work -n fleet-member-<member-cluster-name> --context fleet-hub
kubectl describe work <work-name> -n fleet-member-<member-cluster-name> --context fleet-hub

Look at the Status.ManifestConditions field. Each condition maps to one Kubernetes manifest being propagated. A condition of Applied: False with reason AppliedManifestFailedReason means the resource reached the member cluster but failed to apply. The usual causes are a namespace that doesn't exist, a CRD missing on the member, or a resource quota blocking creation.

Enterprise and Domain-Joined Scenarios

In enterprise environments with Azure Policy enforcing specific configurations on AKS clusters, Fleet can run into policy denial loops. If your organization has an Azure Policy definition that enforces specific labels, annotations, or network policies on all AKS resources, Fleet's internally created namespaces and deployments in fleet-system may be blocked. Check Azure Policy compliance for your member clusters in the portal: Policy → Compliance → filter by your cluster scope. Any Deny effects on the fleet-system namespace will silently prevent agent installation.

For domain-joined scenarios where workload identity is configured, verify the workload identity federation is set up for the Fleet managed identity. The federation subject must match the format system:serviceaccount:fleet-system:fleet-member-agent for the member cluster's OIDC issuer.

When to Call Microsoft Support

Escalate to Microsoft Support if you're seeing Fleet hub cluster API server errors you can't access (hub-managed mode), if MemberCluster objects are stuck in a Terminating state for more than 15 minutes despite force-deleting finalizers, or if you suspect a Fleet control plane regression after a recent platform update. Before opening a ticket, collect: the Fleet and AKS cluster resource IDs, output of kubectl describe membercluster for all affected members, and at least 30 minutes of controller manager logs. This dramatically speeds up triage time on the Microsoft side.

Prevention & Best Practices

Most Azure Kubernetes Fleet issues I've dealt with were preventable. Not all of them: Fleet is a fast-evolving service and sometimes regressions happen. But the majority of outages I've been called in on trace back to infrastructure drift, missing monitoring, or configuration shortcuts taken during the initial setup.

The single biggest prevention win is enabling Azure Monitor Container Insights on every cluster in your fleet, hub and all members. Without it, you're flying blind when things go wrong at 2am. Turn it on as early in each cluster's life as you can; on an existing cluster the command is:

az aks enable-addons \
  --resource-group <rg> \
  --name <cluster-name> \
  --addons monitoring \
  --workspace-resource-id <log-analytics-workspace-id>

Second: pin your Fleet extension version and test upgrades in a non-production fleet first. Fleet Manager updates can change CRP schema fields or scheduler behavior. A CRP that worked in version 0.3.x of the Fleet API may behave differently after an automatic platform upgrade. Use update rings: that's what Fleet's ClusterStagedUpdateRun resource is for, rolling changes through dev, staging, and production in sequence with built-in validation gates between stages.

Third: document every label applied to member clusters. Fleet placement policies live or die by label selectors, and when engineers add or remove labels from clusters for other reasons (cost management, security scanning, compliance tagging), they inadvertently break Fleet scheduling. Treat member cluster labels as infrastructure configuration, managed through Terraform or Bicep, not ad-hoc CLI commands.

Finally, set up Azure Alerts on the fleet-system namespace pod restart count. A simple alert rule that fires when any pod in fleet-system restarts more than three times in 10 minutes gives you early warning before propagation failures cascade into a full outage.
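The alert condition itself is simple enough to prototype before committing it to an alert rule. The sketch below is a stand-in for the rule's logic, not an Azure Monitor API: restart_burst is a hypothetical helper that takes restart timestamps as epoch seconds and reports whether more than three of them fall inside any 10-minute span.

```shell
# restart_burst: hypothetical helper. Reads restart times as epoch seconds on
# stdin, one per line in ascending order. Exits 0 if more than LIMIT restarts
# ($1, default 3) fall within any WINDOW-second span ($2, default 600);
# exits 1 otherwise.
restart_burst() {
  awk -v limit="${1:-3}" -v window="${2:-600}" '
    { t[NR] = $1 }
    # Restart NR and the one "limit" positions earlier bracket limit+1
    # restarts; if they are close enough together, the threshold is crossed.
    NR > limit && t[NR] - t[NR - limit] <= window { hit = 1 }
    END { exit hit ? 0 : 1 }'
}

# Usage:
#   printf '%s\n' 1700000000 1700000120 1700000240 1700000360 \
#     | restart_burst && echo "alert: restart burst in fleet-system"
```

Checking each restart against the one three positions earlier gives a sliding window, so a burst is caught even when it straddles a fixed 10-minute boundary.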

Quick Wins
  • Enable Container Insights on hub and all member clusters at Fleet setup time, not retroactively
  • Use Fleet update runs with defined stages so Kubernetes and node image upgrades never hit all clusters simultaneously
  • Store all ClusterResourcePlacement manifests in Git and apply via CI/CD; manual CRP edits are the #1 source of accidental scheduling breaks
  • Set resource requests and limits on fleet-controller-manager pods in the hub cluster to prevent OOM evictions during high-propagation bursts

Frequently Asked Questions

Why does my ClusterResourcePlacement show "Scheduled" but nothing appears on the member cluster?

This almost always means the Work object was created on the hub for that member cluster, but the fleet-member-agent on that cluster failed to apply it. Switch context to the hub and run kubectl get work -n fleet-member-<member-cluster-name>, then describe the work object to see per-manifest apply conditions. Common causes are missing CRDs on the member, namespace pre-existence conflicts, or resource quotas blocking the applied objects. Fix the underlying cluster configuration and the work agent will retry automatically.

My member cluster shows as "Unknown" in the Fleet hub. How do I fix it without re-registering?

First check if the fleet-member-agent pod on that cluster is running; that's the heartbeat mechanism. If the pod is healthy, the issue is usually network or a stale lease. Try annotating the MemberCluster object on the hub to force reconciliation: kubectl annotate membercluster <name> fleet.azure.com/last-reconcile-time="$(date -u +%Y-%m-%dT%H:%M:%SZ)" --overwrite --context fleet-hub. If Unknown persists beyond five minutes after the annotation, check the hub's fleet-controller-manager logs for health check errors against that specific member.

Can I use Azure Kubernetes Fleet with private AKS clusters?

Yes, but it requires extra network setup that catches a lot of teams off guard. The Fleet hub cluster (or Microsoft's managed hub endpoint if you're using hub-managed mode) needs network-level access to each member cluster's API server on port 443. For private member clusters, this means VNet peering between the hub's VNet and each member's VNet, plus Azure Private DNS zone linking so the hub can resolve the member's private FQDN. Without both of these, the member agent can't complete the join handshake even if credentials are correct.

How do I remove a member cluster from a Fleet without breaking the cluster itself?

Run az fleet member delete --resource-group <rg> --fleet-name <fleet-name> --name <member-name>. This triggers the Fleet hub to clean up the MemberCluster object and removes the fleet-system namespace from the member cluster, including the member agent. Workloads that were placed on the cluster via Fleet will remain running; Fleet doesn't clean up propagated resources on removal, by design. If you want placed workloads removed before de-registering, delete or update the relevant ClusterResourcePlacement objects first and verify they've been cleaned up on the member.

What's the difference between a Fleet "hub-managed" and "hub-less" topology?

In hub-managed mode (a Fleet created with a hub cluster), Microsoft provisions and manages the hub cluster for you. You don't manage the hub's underlying AKS infrastructure, and you interact with it through the Fleet API. In hub-less mode (also called "without hub cluster"), there is no hub cluster at all: the Fleet resource only groups member clusters for coordinated upgrades, so hub-side objects such as ClusterResourcePlacement and MemberCluster are not available. Hub-managed is what you need for workload propagation, but it gives you limited visibility into hub internals, which can make troubleshooting harder. If you only need staged Kubernetes and node image upgrades across clusters, hub-less is simpler to operate.

Fleet propagation worked fine yesterday but stopped after an AKS upgrade. What happened?

AKS node pool upgrades can temporarily disrupt the fleet-member-agent if the node it's running on drains before the pod is rescheduled. This is usually self-healing within a few minutes. However, if you upgraded the Kubernetes API version and there are deprecated API resources in your ClusterResourcePlacement manifests (for example, using policy/v1beta1 PodDisruptionBudgets on a cluster now running Kubernetes 1.25+), those resources will fail to apply post-upgrade. Check the Work object conditions for no matches for kind errors, then update your CRP manifests to use current API versions.

Sai Kiran Pandrala
Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.