Azure Kubernetes Fleet CrashLoopBackOff, ImagePullBackOff & Pod Fix Guide

Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why This Is Happening

Here's a scenario I see constantly in enterprise Azure environments: you've set up Azure Kubernetes Fleet Manager, your hub cluster looks healthy, your member clusters joined without complaint, and then workloads start falling apart. Pods on member clusters go into CrashLoopBackOff. Some are stuck in ImagePullBackOff. Others just never appear at all, even though you know you propagated the deployment. You check the pod logs, you stare at kubectl describe pod, and the error messages give you almost nothing useful to work with.

I know this is frustrating, especially when it blocks your work and your team is waiting on you.

The thing most engineers miss is that pod-level failures in Azure Kubernetes Fleet Manager almost always trace back to a layer above the pod itself. The pod is just where the pain surfaces. The actual problem is usually sitting in the resource propagation pipeline, specifically inside ClusterResourcePlacement (CRP) or ResourcePlacement (RP) objects that didn't schedule, didn't sync, or couldn't apply resources to your member clusters correctly.

When the Fleet scheduler can't find the right clusters based on your placement policy, or when namespace prerequisites aren't met, your workload resources never reach the member cluster in a valid state. The kubelet on the member cluster then tries to start pods from partial or missing configuration: wrong image registry credentials, missing ConfigMaps, incomplete Secrets. That's what causes ImagePullBackOff (the image pull secret wasn't propagated) and CrashLoopBackOff (the app starts but crashes because environment variables, mounted volumes, or config files are missing).

There's also a subtler failure mode with PlacementScheduled: False. This happens when your placement policy uses PickN and requests more clusters than actually satisfy the label selector, or when PickFixed references a cluster name that doesn't match any joined member in the fleet. In both cases, resources never propagate at all. Your pods simply don't exist on the member cluster, which shows up as an application outage with no obvious kubectl error to chase.

The Fleet resource propagation chain has four key objects you need to understand to troubleshoot any of this effectively: ResourceSnapshot, SchedulingPolicySnapshot, ResourceBinding, and Work. Each one is a checkpoint. When the chain breaks at any point, pods fail downstream. This guide walks you through each checkpoint so you can find exactly where things went wrong.


The Quick Fix: Try This First

Before digging into the full propagation chain, start with this 60-second check. It catches the most common Azure Kubernetes Fleet scheduling failure in one command.

Run this against your hub cluster (not a member cluster):

kubectl describe clusterresourceplacement <your-crp-name>

Scroll to the Conditions section. You're looking for a condition named ClusterResourcePlacementScheduled. If its status reads False, you've found your problem. The message field will say something like: "couldn't find all the clusters needed as specified by the scheduling policy".

That single condition status tells you the scheduler gave up before any work resources were ever created on member clusters, which explains every pod-level symptom you're seeing. No pod can run if Fleet never sent the resource definition to the member cluster in the first place.

From here, the fix depends on why the scheduler failed:

If your policy is PickFixed: double-check that the cluster name in your CRP spec exactly matches the name shown in kubectl get memberclusters. One character off and the scheduler silently fails.
If your policy is PickN: count how many member clusters actually satisfy your label selectors. If you asked for 3 clusters labeled env: prod but only 2 have that label, the scheduler reports SchedulingPolicyUnfulfilled and stops.
If you're using ResourcePlacement (namespace-scoped): the target namespace must already exist on every member cluster before RP runs. If it doesn't, apply fails silently at the work layer.

For ResourcePlacement, run the equivalent check:

kubectl describe resourceplacement -n <namespace> <your-rp-name>

Look for ResourcePlacementScheduled instead. The same logic applies; the condition names just carry the ResourcePlacement prefix rather than ClusterResourcePlacement.
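If you'd rather script this check than scroll through describe output, kubectl's JSONPath output can pull out just the Scheduled condition. This is a minimal sketch, not an official Fleet tool: the function name placement_scheduled_status is hypothetical, and it assumes the condition types named above.

```shell
# Hypothetical helper: print the status (True/False/Unknown) of the
# Scheduled condition for a placement object. Run it against the hub cluster.
# Usage: placement_scheduled_status <name> [namespace]
#   - no namespace: <name> is treated as a ClusterResourcePlacement
#   - with namespace: <name> is treated as a ResourcePlacement
placement_scheduled_status() {
  name="$1"
  namespace="${2:-}"
  if [ -z "$namespace" ]; then
    kubectl get clusterresourceplacement "$name" \
      -o jsonpath='{.status.conditions[?(@.type=="ClusterResourcePlacementScheduled")].status}'
  else
    kubectl get resourceplacement "$name" -n "$namespace" \
      -o jsonpath='{.status.conditions[?(@.type=="ResourcePlacementScheduled")].status}'
  fi
}
```

Calling placement_scheduled_status <your-crp-name> prints the condition's status value. An empty result most likely means the condition hasn't been written yet.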

Pro Tip

Always run troubleshooting commands from the hub cluster context, not a member cluster. The placement objects, bindings, and snapshots all live on the hub. Switching to a member cluster context and wondering why you can't find a CRP object is one of the most common time-wasters I've seen. Fleet-level control plane objects are hub-only.

Confirm Your Hub Cluster Context and Check Placement Status

Every troubleshooting session for Azure Kubernetes Fleet Manager pod failures starts at the hub cluster. This is non-negotiable. Member clusters don't carry the control-plane objects you need to inspect.

Verify you're on the hub:

kubectl config current-context

Switch if needed, then check the status of your ClusterResourcePlacement:

kubectl get clusterresourceplacement <CRPName> -o yaml

In the output, find the status.conditions block. You're scanning for three conditions:

ClusterResourcePlacementScheduled: did the scheduler find and assign clusters?
ClusterResourcePlacementSynchronized (shown as ClusterResourcePlacementWorkSynchronized in newer API versions): were the work resources synced to member clusters?
ClusterResourcePlacementApplied: were the resources actually applied on member clusters?

If Scheduled is False, you never reach the sync or apply stages. Fix scheduling first. If Scheduled is True but Applied is False, the problem is at the member cluster layer: the work resource reached the cluster but something prevented application. That narrows your search considerably.

For ResourcePlacement (namespace-scoped), the same check applies but scope your command:

kubectl get resourceplacement -n <namespace> <RPName> -o yaml

Condition names shift to the ResourcePlacement prefix: ResourcePlacementScheduled, ResourcePlacementApplied, ResourcePlacementAvailable. The logic behind each condition is identical; only the scope and naming differ.

If all three conditions show True but pods are still in CrashLoopBackOff, the placement chain is healthy and the issue is application-level (missing env vars, bad image tag, crashing entrypoint). Move to pod log inspection in that case.
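The "find the first stage that failed" logic above can be sketched as a small filter. It's an illustration under assumptions: the function name first_unhealthy_condition is hypothetical, and it relies on the conditions arriving in pipeline order, which the API does not guarantee.

```shell
# Hypothetical helper: read "Type=Status" lines on stdin and print the first
# condition whose status is not True. Prints nothing and returns 0 when every
# condition is True; returns 1 as soon as it finds an unhealthy one.
first_unhealthy_condition() {
  while IFS='=' read -r cond_type cond_status || [ -n "$cond_type" ]; do
    [ -z "$cond_type" ] && continue
    if [ "$cond_status" != "True" ]; then
      printf '%s=%s\n' "$cond_type" "$cond_status"
      return 1
    fi
  done
  return 0
}

# Example wiring (hub cluster context); <CRPName> is a placeholder:
# kubectl get clusterresourceplacement <CRPName> \
#   -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' \
#   | first_unhealthy_condition
```

An empty result with a zero exit code is the signal to stop looking at Fleet and start reading pod logs.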

List and Inspect ClusterResourceBindings to Find Your Target Cluster

Once you've confirmed the placement object's condition status, the next layer down is the ClusterResourceBinding (or ResourceBinding for namespace-scoped placements). These bindings are what actually tie a ClusterResourcePlacement to a specific member cluster.

List all bindings associated with your CRP:

kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=<CRPName>

The output shows one binding per member cluster the scheduler selected. Binding names follow a predictable pattern: {CRPName}-{clusterName}-{suffix}. So if your CRP is named deploy-webapp and it's bound to kind-cluster-1, you'll see something like deploy-webapp-kind-cluster-1-be990c3e.

Check the WorkCreated and ResourcesApplied columns. Both should be True. If WorkCreated is False, the scheduler made a selection but the work controller never created the work resource in the member cluster's namespace on the hub. This points to a controller issue or a namespace-level permission problem. If ResourcesApplied is False, the work resource exists but the member cluster's apply controller rejected it.

For ResourcePlacement bindings, the command looks slightly different:

kubectl get resourcebinding -n <namespace> -l kubernetes-fleet.io/parent-CRP=<RPName>

The binding name format follows the same {RPName}-{clusterName}-{suffix} pattern. The columns shift to WorkSynchronized and ResourcesApplied: same meaning, different label.

Once you've identified the binding for your target cluster, describe it for full detail:

kubectl describe clusterresourcebinding <binding-name>

This is where you'll often find the exact error message that's causing pod failures, such as image pull secret references that don't exist on the target cluster, or resource quotas blocking apply.
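With many member clusters, picking the right binding out of the list by eye gets tedious. Here's a rough sketch that filters binding names using the naming pattern described above. The function name bindings_for_cluster is hypothetical, and the match is a plain prefix match, so a cluster whose name extends another cluster's name with a hyphen (for example kind-cluster-1-eu next to kind-cluster-1) would also match.

```shell
# Hypothetical helper: read binding names on stdin (one per line, as printed
# by `kubectl get ... -o name`) and print those matching the
# {placementName}-{clusterName}-{suffix} pattern.
# Usage: bindings_for_cluster <placementName> <clusterName>
bindings_for_cluster() {
  placement="$1"
  cluster="$2"
  while read -r line || [ -n "$line" ]; do
    binding="${line##*/}"   # drop the "kind.group/" prefix added by -o name
    case "$binding" in
      "${placement}-${cluster}-"*) printf '%s\n' "$binding" ;;
    esac
  done
}

# Example wiring (hub cluster context); <CRPName> and <clusterName> are placeholders:
# kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=<CRPName> -o name \
#   | bindings_for_cluster <CRPName> <clusterName>
```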

Pull the Latest ResourceSnapshot to Verify What Was Propagated

The ResourceSnapshot (or ClusterResourceSnapshot for cluster-scoped placements) is Fleet's record of exactly what it tried to propagate. If your pods are running the wrong config, have a bad image reference, or are missing environment variables, this is where you verify whether the source resource was correct to begin with.

To find the most recent ClusterResourceSnapshot for a CRP:

kubectl get clusterresourcesnapshot \
  -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=<CRPName>

This label selector is the key. Without is-latest-snapshot=true, you'll get every historical snapshot and have to manually sort through them. The label makes this fast and deterministic.

For ResourcePlacement, scope the command to the namespace:

kubectl get resourcesnapshot -n <namespace> \
  -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=<RPName>

Replace <RPName> with your ResourcePlacement name and <namespace> with the namespace where your RP object lives.

Once you have the snapshot name, inspect it fully:

kubectl get clusterresourcesnapshot <snapshot-name> -o yaml

Look at the spec.selectedResources field (if your API version lays the snapshot out differently, look for the list of embedded objects under spec). This is a literal copy of every Kubernetes object Fleet planned to push to member clusters. If your Deployment has the wrong image tag here, that's what got propagated. If a Secret is missing from this snapshot, that's why pods are failing with authentication errors on image pull.

This step is especially useful when your CRP selected resources via a label selector and you're not sure exactly what got picked up. The snapshot removes all ambiguity: you see the real propagated state, not what you thought you configured.
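The two lookups above (find the latest snapshot, then dump it) can be folded into one call. This is a sketch only; the function name latest_snapshot_yaml is hypothetical, and the label keys are the ones used in the commands above. Large placements may return more than one snapshot object.

```shell
# Hypothetical helper: print the latest snapshot(s) for a placement as YAML.
# Usage: latest_snapshot_yaml <placementName> [namespace]
#   - no namespace: looks up ClusterResourceSnapshot objects (for a CRP)
#   - with namespace: looks up ResourceSnapshot objects (for an RP)
latest_snapshot_yaml() {
  selector="kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=$1"
  if [ -z "${2:-}" ]; then
    kubectl get clusterresourcesnapshot -l "$selector" -o yaml
  else
    kubectl get resourcesnapshot -n "$2" -l "$selector" -o yaml
  fi
}
```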

Inspect the Work Resource on the Member Cluster Namespace

The Work resource is Fleet's delivery mechanism. It lives on the hub cluster inside a special namespace named after each member cluster, in the format fleet-member-{clusterName}. The work controller on the hub writes Work objects into these namespaces; the member cluster agent then pulls and applies them.

To find the Work resource for a specific CRP and member cluster:

kubectl get work -n fleet-member-<clusterName> \
  -l kubernetes-fleet.io/parent-CRP=<CRPName>

For ResourcePlacement, the label key on the Work resource still uses parent-CRP, even though you're working with an RP object, not a CRP. This is consistent across both placement types:

kubectl get work -n fleet-member-<clusterName> \
  -l kubernetes-fleet.io/parent-CRP=<RPName>

Once you have the Work resource name, describe it:

kubectl describe work <work-name> -n fleet-member-<clusterName>

The Work resource's status conditions tell you whether the member cluster agent successfully applied each resource. Look for conditions like Applied and check their message fields. Common failure messages here include permission errors (the member cluster's service account can't create the resource type), namespace-not-found errors (especially with ResourcePlacement when the target namespace wasn't pre-created), and conflict errors (another controller or existing resource is blocking apply).

The Work resource's spec.workload.manifests field also shows you the exact Kubernetes YAML that was delivered to the member cluster. Cross-reference this against your ResourceSnapshot to confirm the pipeline delivered what the snapshot recorded. If they differ, you have a controller bug worth escalating.
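Because the Work namespace is derived from the member cluster name, the lookup is easy to wrap. This is a minimal sketch with hypothetical function names (work_namespace, list_work); it assumes the fleet-member-{clusterName} format and the parent-CRP label key described in this section.

```shell
# Hypothetical helpers for locating Work objects on the hub cluster.

# Print the hub-side namespace that holds Work objects for a member cluster.
work_namespace() {
  printf 'fleet-member-%s\n' "$1"
}

# List the Work objects a placement produced for one member cluster.
# Usage: list_work <placementName> <clusterName>
list_work() {
  kubectl get work \
    -n "$(work_namespace "$2")" \
    -l "kubernetes-fleet.io/parent-CRP=$1"
}
```

The same call works for a ResourcePlacement name, since the label key is shared across both placement types.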

Fix the Scheduling Policy and Validate Member Cluster Labels

If your root cause is PlacementScheduled: False, which covers a big chunk of pod failures, you need to fix either the placement policy or the member cluster configuration. Here's how to work through each case.

Case 1: PickFixed policy with wrong cluster name. Get the exact names of all joined member clusters:

kubectl get memberclusters

Compare what you see to what's in your CRP or RP spec under policy.placementType: PickFixed and policy.clusterNames. Copy and paste the names; don't retype them. One wrong character means the scheduler silently discards the cluster from consideration.
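If you want to take eyeballing out of the comparison, a small filter can report which requested names aren't joined. Treat this as a sketch: the function name missing_member_clusters is hypothetical, and you supply the names from your policy's clusterNames list yourself.

```shell
# Hypothetical helper: given the cluster names from a PickFixed policy as
# arguments and the joined member cluster names on stdin (one per line),
# print every requested name that is not joined. Returns 1 if any are missing.
missing_member_clusters() {
  joined=$(sed 's|.*/||')   # strip the "kind.group/" prefix from -o name output
  missing=0
  for wanted in "$@"; do
    if ! printf '%s\n' "$joined" | grep -Fxq -- "$wanted"; then
      printf '%s\n' "$wanted"
      missing=1
    fi
  done
  return "$missing"
}

# Example wiring (hub cluster context); cluster-a and cluster-b are placeholders:
# kubectl get memberclusters -o name | missing_member_clusters cluster-a cluster-b
```

The match is exact and whole-line (grep -Fx), so a near-miss name is reported as missing rather than silently accepted.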

Case 2: PickN policy with insufficient matching clusters. First, find how many clusters currently carry the required labels:

kubectl get memberclusters -l env=prod

If the output shows 1 cluster but your CRP asks for numberOfClusters: 2, the scheduler reports SchedulingPolicyUnfulfilled and stops. Either label a second cluster correctly or lower numberOfClusters in your CRP spec to match reality.

Case 3: ResourcePlacement with missing target namespace. ResourcePlacement can only place resources into namespaces that already exist on the member cluster. If the namespace doesn't exist, apply fails silently at the Work layer. The fix is to create the namespace first, either manually or via a ClusterResourcePlacement in namespace-only mode that runs before your ResourcePlacement. After the namespace is confirmed present on all target member clusters, re-trigger your RP by patching or reapplying it.

After any policy fix, watch the CRP conditions update in real time:

kubectl get clusterresourceplacement <CRPName> -w

You should see ClusterResourcePlacementScheduled flip to True, followed by Synchronized and Applied within a few seconds to a couple of minutes depending on cluster count and resource size.

Advanced Troubleshooting for Azure Kubernetes Fleet Manager

When the standard propagation chain checks don't surface the problem, you need to go deeper. Here's what I reach for in complex fleet deployments.

Read the Fleet Scheduler Logs Directly

The scheduler's decision log is incredibly detailed about why it rejected clusters. To find the scheduler pod on your hub cluster:

kubectl get pods -n fleet-system -l app=fleet-scheduler

Then tail its logs:

kubectl logs -n fleet-system <scheduler-pod-name> --tail=200 | grep -i "<CRPName>"

The scheduler emits structured log entries for every placement decision: which clusters it evaluated, which it skipped and why, and what policy constraints caused rejections. This is the fastest way to debug PickN failures where the label math isn't obvious.

Check the Fleet Member Agent Logs on Member Clusters

Switch to a member cluster context and inspect the fleet member agent:

kubectl get pods -n fleet-system -l app=fleet-member-agent
kubectl logs -n fleet-system <member-agent-pod> --tail=300

The member agent is responsible for pulling Work resources and applying them. If you're seeing CrashLoopBackOff or ImagePullBackOff and the Work resource status looks clean, the member agent logs often show apply-time errors that don't bubble back up to the hub's Work status conditions reliably. Things like webhook admission rejections, OPA/Gatekeeper policy blocks, or resource quota exceeded errors appear here first.

Namespace-Scoped ResourcePlacement: The Silent Failure Mode

ResourcePlacement is namespace-scoped and has a hard requirement: the target namespace must already exist on the member cluster before ResourcePlacement tries to apply resources there. This surprises engineers who are used to deployment tooling that creates the namespace automatically on apply.

If you're using both ClusterResourcePlacement and ResourcePlacement together, the correct pattern is:

Create a CRP in namespace-only mode to propagate just the namespace object to all member clusters.
Wait for the namespace CRP's Applied condition to go True.
Then create your ResourcePlacement to propagate workloads into that namespace.
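Between steps 2 and 3, it can be worth confirming directly that the namespace exists on each member cluster. The sketch below does that across kubeconfig contexts. The function name namespace_missing_in is hypothetical, and it assumes you have a kubeconfig context for each member cluster. A failed lookup can also mean a connectivity or permission problem, not just a missing namespace.

```shell
# Hypothetical helper: check that a namespace exists in each member cluster.
# Usage: namespace_missing_in <namespace> <kube-context>...
# Prints each context where the lookup failed; returns 1 if any failed.
namespace_missing_in() {
  ns="$1"
  shift
  rc=0
  for ctx in "$@"; do
    if ! kubectl --context "$ctx" get namespace "$ns" >/dev/null 2>&1; then
      printf '%s\n' "$ctx"
      rc=1
    fi
  done
  return "$rc"
}
```

A non-zero exit code makes this usable as a gate before the ResourcePlacement is created.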

Also be careful that your CRP and RP don't select overlapping resources. If both a CRP and an RP try to manage the same Deployment, you'll get conflict errors in Work status and unpredictable apply behavior.

Reserved Namespace Rejections

One edge case that trips people up: Fleet's scheduler will reject any placement that selects a reserved namespace. These are system namespaces such as kube-system, fleet-system, or (in some configurations) default. If your resource selector accidentally picks up resources from one of these namespaces, the PlacementScheduled condition will be False with a message about reserved namespaces. Tighten your resource selector label to exclude system-managed resources.

When to Call Microsoft Support

Escalate to Microsoft Support when: (1) the Work resource status shows Applied: True but pods still don't appear on the member cluster after 10+ minutes; (2) scheduler logs show internal errors unrelated to your policy configuration; (3) ClusterResourceBindings are stuck in a terminating state and won't delete; or (4) you're on a preview API version and hitting behavior that contradicts the published docs. Have your hub cluster logs, CRP YAML, and binding names ready when you open a ticket. Support will ask for all of these immediately.

Prevention & Best Practices for Azure Kubernetes Fleet

Most Azure Kubernetes Fleet Manager pod failures are preventable. After working through enough of these incidents, certain patterns become obvious. Here's what you should build into your Fleet deployment process from the start.

Always pre-validate your label selectors before applying a placement policy. Before you create a CRP with a PickN policy, run kubectl get memberclusters -l <your-labels> and count the results. If the count is less than your numberOfClusters, you already know the scheduler will fail. Make this a pre-flight check in your GitOps pipeline or deployment script. It takes two seconds and saves minutes of debugging.
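One way to write that pre-flight check is as a small gate function. It's a sketch, not a prescribed implementation: the function name enough_matching_clusters is hypothetical, and it just counts the names that kubectl prints.

```shell
# Hypothetical pre-flight check: compare the number of member clusters read
# from stdin (one name per line) with the numberOfClusters value you plan to
# request. Returns 0 when enough clusters match, 1 otherwise.
enough_matching_clusters() {
  required="$1"
  matching=$(grep -c . || true)   # count non-empty lines on stdin
  printf 'matching=%s required=%s\n' "$matching" "$required"
  [ "$matching" -ge "$required" ]
}

# Example wiring (hub cluster context); the label and count are placeholders:
# kubectl get memberclusters -l env=prod -o name | enough_matching_clusters 3
```

The summary line goes to stdout for the pipeline log, and the exit code is what the pipeline gates on.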

For ResourcePlacement, always provision namespaces first via ClusterResourcePlacement. Treat namespace creation as infrastructure provisioning, not application deployment. A ClusterResourcePlacement in namespace-only mode is the right tool. Make namespace readiness a gate in your CD pipeline before any RP that targets that namespace is allowed to run.

Use PickAll for internal tooling workloads where cluster count doesn't matter. With PickAll, the PlacementScheduled condition is always True. Scheduling can't fail on a count because you're not specifying one. For monitoring agents, log collectors, and other fleet-wide utilities, PickAll avoids an entire class of scheduling failures.

Keep your image pull secrets propagated as part of the same CRP as your workload. The most common cause of ImagePullBackOff in fleet environments is a Deployment landing on a member cluster without its corresponding pull secret. Group the Secret and the Deployment in the same resource selector, or use a separate CRP for pull secrets that's applied before the workload CRP.

Monitor Work resource status continuously, not just at deploy time. Member clusters can drift: nodes get replaced, namespaces get manually deleted, quotas change. A Work resource that was Applied: True at deploy time might silently fail to re-apply after a member cluster event. Build alerts on ResourcesApplied: False conditions across your bindings.

Quick Wins

Add a kubectl get clusterresourceplacement readiness check to your CI/CD pipeline that gates on ClusterResourcePlacementApplied: True before marking a deploy successful
Label all member clusters consistently at join time. Retroactive label changes after you've written placement policies cause unexpected scheduling shifts
Keep ClusterResourcePlacement and ResourcePlacement resource selectors mutually exclusive. Overlapping selectors on the same resources create apply conflicts that are hard to diagnose
Pin image tags in your propagated Deployments. Using :latest combined with different pull policies on different member clusters leads to version skew that causes intermittent CrashLoopBackOff
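For the CI/CD gate in the first quick win, here's one possible shape. It swaps the kubectl get check for kubectl wait, which blocks until a named condition becomes True or the timeout expires. The function name wait_for_crp_applied and the 300s default are assumptions for illustration.

```shell
# Hypothetical CI gate: block until the CRP reports Applied=True, or fail
# when the timeout expires. Run it against the hub cluster.
# Usage: wait_for_crp_applied <CRPName> [timeout]
wait_for_crp_applied() {
  kubectl wait "clusterresourceplacement/$1" \
    --for=condition=ClusterResourcePlacementApplied \
    --timeout="${2:-300s}"
}
```

kubectl wait exits non-zero on timeout, so the pipeline step fails without extra parsing.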

Frequently Asked Questions

How do I find the ClusterResourceBinding for a specific member cluster?

Run kubectl get clusterresourcebinding -l kubernetes-fleet.io/parent-CRP=<CRPName> on your hub cluster. The output lists every binding associated with that CRP. Binding names follow the format {CRPName}-{clusterName}-{suffix}, so you can identify the one you want by looking for your cluster name in the binding name. The two key columns to check are WorkCreated and ResourcesApplied. Both should read True for healthy propagation.

How do I find the latest ClusterResourceSnapshot for my CRP?

Use this label selector to get only the most recent snapshot without wading through historical ones: kubectl get clusterresourcesnapshot -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=<CRPName>. Replace <CRPName> with your actual ClusterResourcePlacement name. The is-latest-snapshot=true label is the critical filter. Without it, you'll see every snapshot revision Fleet has retained for that CRP. Once you have the snapshot name, run kubectl get clusterresourcesnapshot <name> -o yaml to see the exact resources that were queued for propagation.

How do I find the work resource associated with a ClusterResourcePlacement?

First, identify the member cluster namespace on the hub, which always follows the format fleet-member-{clusterName}. Then run: kubectl get work -n fleet-member-<clusterName> -l kubernetes-fleet.io/parent-CRP=<CRPName>. This scopes the work lookup to both the correct member cluster namespace and the correct placement. Describing the Work resource gives you its apply status and, if apply failed, the exact error message from the member cluster agent, which is often the most direct explanation for why pods are in a bad state.

How do I find the latest ResourceSnapshot for a ResourcePlacement?

The command is similar to the CRP version but you need to scope it to a namespace: kubectl get resourcesnapshot -n <namespace> -l kubernetes-fleet.io/is-latest-snapshot=true,kubernetes-fleet.io/parent-CRP=<RPName>. Replace <namespace> with the namespace where your ResourcePlacement object lives, and <RPName> with your ResourcePlacement name. The ResourceSnapshot for an RP is always namespace-scoped, which reflects the fact that ResourcePlacement itself is namespace-scoped and can only manage resources within its own namespace.

Why does PlacementScheduled show False even though my member clusters are joined?

Three things cause this most often. First, with PickFixed, the cluster name in your policy doesn't exactly match the name returned by kubectl get memberclusters. Even a casing difference breaks matching. Second, with PickN, you've asked for more clusters than currently satisfy your label selector. Run kubectl get memberclusters -l <your-labels> and count the results against your numberOfClusters value. Third, your resource selector is accidentally picking resources from a reserved namespace (like kube-system), which the scheduler automatically rejects. Note: PickAll is the one policy type where PlacementScheduled is always True by design, regardless of cluster count.

What's the difference between ResourcePlacement and ClusterResourcePlacement troubleshooting?

The underlying architecture and troubleshooting logic are identical: one-to-one condition mapping, same propagation chain (snapshot → binding → work), same scheduler behavior. The main practical differences are: ResourcePlacement is namespace-scoped, so all your commands need -n <namespace> added; condition names use the ResourcePlacement prefix instead of ClusterResourcePlacement (e.g., ResourcePlacementScheduled vs ClusterResourcePlacementScheduled); and ResourcePlacement has an additional hard prerequisite: the target namespace must already exist on member clusters or apply will fail. Start with the ClusterResourcePlacement troubleshooting flow and substitute the appropriate namespace-scoped commands and condition names as you go.


Sai Kiran Pandrala
