Azure Service Fabric Not Working: Diagnosed and Fixed (2026 Guide)

Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why Azure Service Fabric Stops Working

I've seen this scenario play out more times than I can count. You've got a production cluster running perfectly, microservices humming along, update domains cycling cleanly, and then one morning everything grinds to a halt. Nodes won't reactivate. Maintenance jobs sit frozen. Your on-call engineer is staring at Service Fabric Explorer watching a repair task stuck in the Preparing state like it's never going to move again. Nobody panics immediately, but after 20 minutes everyone starts panicking.

Azure Service Fabric not working usually comes down to one of three root causes, and Microsoft's error messages help with almost none of them. The cluster UI tells you something is wrong. It doesn't tell you why.

Root cause 1: A repair task is stuck in the Preparing state. This is the most common scenario I see in enterprise clusters. The Infrastructure Service creates a repair task when Azure signals a maintenance operation (think VM updates, platform repairs, or tenant-level changes). That task has to move through a defined sequence of states before anything gets executed. If it stalls at Preparing, the Repair Manager is still trying to prepare the environment, typically by deactivating nodes, but something is blocking that process. Nine times out of ten, an unhealthy entity in the cluster is holding everything hostage.

Root cause 2: Diagnostic logs aren't reaching your storage account. You can't fix what you can't see. Service Fabric continuously uploads diagnostic data to the storage account attached to your cluster. When that upload breaks (an expired SAS token, a misconfigured endpoint, a storage firewall rule), you lose your visibility right at the moment you need it most. This is where Azure Service Fabric diagnostic log collection failures compound every other problem.

Root cause 3: The repair job sequence is broken across update domains. Service Fabric processes maintenance operations UD by UD, starting at UD0, stepping through UD1, UD2, and so on. If one domain gets stuck waiting for acknowledgement, the entire queue backs up. Jobs sit in WaitingForAcknowledgement state indefinitely. Your Infrastructure Service is waiting for Service Fabric to give the green light, and Service Fabric is waiting for a health check that will never pass.

What makes this genuinely frustrating is that the error messages you do get, when you get any at all, rarely point to the actual bottleneck. You might see a vague cluster health warning, or nothing at all in the Azure portal. That's why knowing exactly where to look is half the battle.

The Quick Fix: Try This First

Before you dig into logs or PowerShell sessions, open Service Fabric Explorer (SFX). I know that sounds obvious, but you'd be surprised how many engineers spend an hour in the Azure portal before they actually pull up SFX. Go to your cluster's overview in the Azure portal and click the Service Fabric Explorer link. It opens directly in your browser, no installation needed.

Once you're in, look at the left navigation pane. Click Repair Jobs. This single view tells you the state of every repair task in your cluster: pending, completed, canceled, everything. If you see any task sitting in the Created, Claimed, or Preparing state, that task has not yet been approved. It's waiting. And while it waits, it may be blocking other operations across your entire cluster.

If you see a task in Preparing specifically, check whether it's a health check failure or a safety check failure. The Preparing state is where the Repair Manager is actively deactivating nodes to prepare for maintenance. An unhealthy entity, even a single service partition that's reporting warnings, can cause the safety check to fail and freeze the task right there.

Now switch to the Infrastructure Jobs view. Look at the Acknowledgement Status column. Any job showing WaitingForAcknowledgement is waiting for Service Fabric to approve it. If the corresponding repair task in the Repair Jobs view is also stuck, you've confirmed the link. The job exists. The repair task exists. Something in the cluster is preventing approval.

The fastest path out: identify and resolve the unhealthy cluster entity. In SFX, look at the cluster health view; any red or yellow entities are your starting point. Fix those first. The repair task will often unstick itself once health checks pass.

If the cluster looks healthy and the task is still frozen, you may need to cancel the stuck repair task and let the Infrastructure Service resubmit it. That process is covered in the step-by-step section below.

Pro Tip

When you click on a specific repair task in SFX's Repair Jobs view, note the full repair task name. It follows the format Azure/[RepairType]/[JobID]/[UD]/[IncarnationNumber]. That Job ID matches exactly what you'll see in the Infrastructure Jobs view and in your Azure backend logs. Keep it handy. Every time you call Microsoft Support about Azure Service Fabric issues, that Job ID is the first thing they'll ask for, and having it ready shortens the call considerably.

Collect Diagnostic Logs from Your Service Fabric Storage Account

Everything starts with logs. Service Fabric automatically captures several categories of diagnostic data and uploads them to the storage account you configured when you set up the cluster. If you're troubleshooting Azure Service Fabric not working, this is your source of truth, not the portal, not SFX alone.

The log files are organized by type. Here's what you're looking for and where to find them on the underlying VMs:

# Service Fabric System Diagnostic Logs (binary format)
D:\SvcFab\Log\*.dtr
D:\SvcFab\Log\*.etl

# Service Fabric Installation and Trace Logs
D:\SvcFab\Log\*.trace

# Performance Counter Logs
D:\SvcFab\Log\*.blg

These are collected recursively from the log root directory and uploaded to your diagnostic storage account. To access them, navigate to the Azure portal, find your storage account associated with the cluster, and look inside the blob containers. The containers are named by log category.

If you've opened a Microsoft support case, they'll ask your permission to access this storage account directly. Under KB 3200697, Microsoft support engineers are authorized to download the .dtr, .etl, .trace, and .blg files, and even make temporary copies, to assist with your incident. If you're working with support, explicitly granting that access speeds resolution dramatically.

For self-service analysis, the .blg files are Windows Performance Counter logs; open them directly with Performance Monitor (perfmon.exe). The .etl and .dtr files are Service Fabric's internal binary trace format. You'll need the Service Fabric SDK and the TraceViewer tool to read them meaningfully. Once you have the traces open, filter by timestamp around when the issue started. Look for repeated failure patterns, specifically around node deactivation, health report submissions, and repair task state transitions.

If the log files aren't appearing in storage at all, that's a separate problem worth fixing immediately. Check your storage firewall rules, verify the connection endpoint in the cluster manifest, and confirm the SAS token or managed identity used for log upload hasn't expired.

Analyze the Infrastructure Jobs View in Service Fabric Explorer

Open Service Fabric Explorer and navigate to the cluster-level view. In the left pane, select Infrastructure Jobs. This view is specifically designed to show you the Azure-initiated maintenance operations that Service Fabric has received and needs to process.

Each entry in this view includes several key fields that you need to understand:

Job ID: The stable identifier that persists both inside and outside Service Fabric. It's the same ID you'll see in Azure backend logs and in support communications.
Acknowledgement Status: Either WaitingForAcknowledgement (Service Fabric hasn't approved it yet) or Acknowledged (approved and processing).
Impact Types: Describes what kind of impact the job will have, such as a reboot, a reimage, or a data disk detach.
Current Repair Task: Shows which active repair task is associated with this job's approval on the Service Fabric side.

Jobs only appear in this view if they exist in the current received document from Azure. If a job disappears from the list, it means Azure has retracted it or it was superseded by a new document incarnation.

Click on All Repair Tasks within a specific job entry to see every repair task associated with it. In a five-node cluster organized into five update domains, you may see five separate repair tasks, one per UD, each moving through the state machine independently. They run sequentially: UD0 completes before UD1 starts, UD1 before UD2, and so on. If UD1's task is stuck, UD2 through UD4 are waiting in line behind it.

This sequential dependency is why a single stuck repair task can make your entire Azure Service Fabric cluster appear broken. Nothing's crashed. The sequencing mechanism is just blocked.
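To make the queueing concrete, here is a minimal Python sketch that takes the per-UD states you read off the All Repair Tasks view and reports which update domain is holding up the rest. The function name find_blocking_ud and the dictionary shape are illustrative assumptions, not part of any Service Fabric API. It only encodes the rule described above: each UD must finish before the next one proceeds.

```python
from typing import Dict, List, Optional, Tuple


def find_blocking_ud(states_by_ud: Dict[int, str]) -> Tuple[Optional[int], List[int]]:
    """Return (blocking_ud, queued_uds) for one infrastructure job.

    states_by_ud maps an update domain number to the state of its repair
    task, e.g. {0: "Completed", 1: "Preparing", 2: "Created"}.
    The first UD (in ascending order) that is not Completed is treated as
    the blocker; every higher-numbered UD is queued behind it.
    """
    ordered = sorted(states_by_ud)
    for ud in ordered:
        if states_by_ud[ud] != "Completed":
            queued = [other for other in ordered if other > ud]
            return ud, queued
    # Every UD finished: nothing is blocking.
    return None, []


blocker, queued = find_blocking_ud(
    {0: "Completed", 1: "Preparing", 2: "Created", 3: "Created", 4: "Created"}
)
print(f"UD{blocker} is blocking; waiting behind it: {queued}")
```

Whatever state the blocking UD is in tells you where to look next; the queued UDs need no attention of their own.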

Identify the Exact Repair Task State and Ownership

Switch to the Repair Jobs view in Service Fabric Explorer. Here you'll see every repair task, past and present, with its current state. The state tells you exactly who "owns" the task right now and what's supposed to be happening.

Here's the ownership map you need to understand when Azure Service Fabric repair jobs aren't progressing:

Created: Repair Manager owns it. Waiting for a Repair Executor to claim it. If it's stuck here, the Infrastructure Service isn't claiming tasks, so check Infrastructure Service health.
Claimed: Repair Executor owns it. Impact not yet specified. Stuck here usually means the executor is having trouble determining repair scope.
Preparing: Repair Manager owns it again. This is the dangerous one. The Repair Manager is deactivating nodes and running health and safety checks. If your task is stuck here, something in the cluster is failing those checks.
Approved: Repair Executor takes over. The task is cleared to execute and the repair will actually happen. If it's stuck at Approved without moving to Executing, the executor isn't picking it up.
Executing: Repair Executor is actively performing the repair. The executor must complete all potentially disruptive actions before reporting completion.
Restoring: Repair Manager reactivates nodes. No cancellation possible at this stage.
Completed: Terminal state. Final status is Succeeded, Cancelled, Interrupted, or Failed.
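If you keep a runbook, the ownership map above is easy to carry as a small lookup table. The following is a rough Python sketch for offline use only; the names REPAIR_TASK_STATES and triage are made up for illustration, and the hints are just the list above condensed.

```python
# Owner and first-look hint per repair task state, transcribed from the
# ownership list in this guide. Names here are illustrative, not an API.
REPAIR_TASK_STATES = {
    "Created": ("Repair Manager", "no executor has claimed it; check Infrastructure Service health"),
    "Claimed": ("Repair Executor", "executor has not specified the impact yet"),
    "Preparing": ("Repair Manager", "health or safety checks are failing; inspect cluster health"),
    "Approved": ("Repair Executor", "executor is not picking the task up"),
    "Executing": ("Repair Executor", "repair in progress; let it finish"),
    "Restoring": ("Repair Manager", "nodes are being reactivated; cannot be cancelled"),
    "Completed": (None, "terminal; check the result status"),
}


def triage(state: str) -> str:
    """Return a one-line 'who owns it and where to look' summary."""
    try:
        owner, hint = REPAIR_TASK_STATES[state]
    except KeyError:
        raise ValueError(f"unknown repair task state: {state!r}") from None
    return f"{state}: owned by {owner or 'nobody'}; {hint}"


print(triage("Preparing"))
```

Because Python dictionaries preserve insertion order, iterating over REPAIR_TASK_STATES also walks the lifecycle in order.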

For a task stuck in Preparing specifically, look at the cluster health tree in SFX. Any partition in Warning or Error status, any replica that's not ready, and any node with a flagged health report are all candidates for blocking the safety check. The Repair Manager won't approve the repair task until it's confident the cluster can safely absorb the upcoming disruption.

Use PowerShell to get the repair task list programmatically if SFX is slow or unavailable:

Connect-ServiceFabricCluster -ConnectionEndpoint "your-cluster.region.cloudapp.azure.com:19000"
Get-ServiceFabricRepairTask -State Active

Resolve Unhealthy Cluster Entities Blocking Safety Checks

This is almost always the real fix when a Service Fabric repair task is stuck in Preparing. The Repair Manager is not broken. It's doing exactly what it's designed to do: refusing to approve a disruptive repair until the cluster is healthy enough to absorb it. Your job is to find and fix whatever entity is unhealthy.

In Service Fabric Explorer's cluster health view, look for anything that isn't green. Work top-down: cluster → application → service → partition → replica. A single replica in an unhealthy state can propagate warnings all the way up to the cluster level and block every pending repair task.
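The top-down walk is worth spelling out, because the entity you need to fix is usually the deepest unhealthy one, not the cluster-level warning it bubbles up to. Here is a small Python sketch of that traversal. The nested-dictionary shape (name, health, children) is a hypothetical export format chosen for the example, not something SFX or the Service Fabric SDK produces as-is.

```python
from typing import Dict, Iterator, Tuple


def unhealthy_leaves(entity: Dict, path: Tuple[str, ...] = ()) -> Iterator[Tuple[str, str]]:
    """Yield (path, health) for the deepest entities that are not 'Ok'.

    An unhealthy entity with no unhealthy children is reported as a root
    cause candidate; otherwise the walk descends into the unhealthy
    children and skips the healthy ones.
    """
    if entity["health"] == "Ok":
        return
    here = path + (entity["name"],)
    bad_children = [c for c in entity.get("children", []) if c["health"] != "Ok"]
    if not bad_children:
        yield " > ".join(here), entity["health"]
        return
    for child in bad_children:
        yield from unhealthy_leaves(child, here)


# Hypothetical health tree: cluster -> application -> service -> partition -> replica
tree = {"name": "cluster", "health": "Error", "children": [
    {"name": "fabric:/Shop", "health": "Error", "children": [
        {"name": "fabric:/Shop/Cart", "health": "Error", "children": [
            {"name": "partition-1", "health": "Error", "children": [
                {"name": "replica-A", "health": "Error"},
                {"name": "replica-B", "health": "Ok"},
            ]},
        ]},
    ]},
    {"name": "fabric:/Other", "health": "Ok"},
]}

for where, state in unhealthy_leaves(tree):
    print(state, where)
```

In this made-up tree the only thing to fix is replica-A; every Error above it is a rollup of that one replica.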

Common unhealthy entity patterns I've seen block Service Fabric repair approvals:

A stateful service partition that's lost quorum, often caused by a previous VM restart that took a primary replica offline.
A node stuck in Disabling state from a previous (now long-completed) maintenance window.
A health report with an expired TTL that was never cleared. The service that reported it may have recovered, but the report is still sitting in the health store as an error.
System services (like fabric:/System/FailoverManagerService) reporting replica health warnings after a partial cluster restart.

For a node stuck in Disabling state, you can activate it manually through SFX by right-clicking the node and selecting Activate, or via PowerShell:

Enable-ServiceFabricNode -NodeName "NodeName"

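For stale health reports, find the source system and either fix the underlying problem or overwrite the report. There is no dedicated cmdlet for deleting a node health report; instead, send an OK report with the same source ID and property and a short TTL, so that it replaces the stale one and is removed when it expires:

Send-ServiceFabricNodeHealthReport -NodeName "NodeName" -HealthState Ok -SourceId "YourSource" -HealthProperty "YourProperty" -TimeToLiveSec 60 -RemoveWhenExpired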

Once every entity in the cluster health tree is green, watch the Repair Jobs view. The stuck Preparing task should transition to Approved within a few minutes as the Repair Manager detects the improved health state and completes its safety checks.

Cancel a Stuck Repair Task and Force Resubmission

Sometimes the cluster is healthy, the health tree is all green, and the repair task is still glued to Preparing. I've seen this happen after a cluster upgrade, after a storage account connectivity blip, and occasionally for no apparent reason at all. In these cases, you need to manually cancel the stuck task and let the Infrastructure Service resubmit it.

Before you cancel anything, understand the state rules. You can cancel a repair task in the Created, Claimed, or Preparing state relatively cleanly. In Preparing, the Repair Manager has ownership, and cancellation causes the task to skip directly to Restoring (which reactivates any nodes that were deactivated during preparation). After Approved, cancellation requires cooperation from the Repair Executor and is harder to guarantee.
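As a pre-flight check for a runbook, those rules can be written down as a tiny Python function. This is only a sketch of the rules as this guide states them; cancel_outlook and its category labels are invented for the example, and nothing here calls the cluster.

```python
from typing import Tuple


def cancel_outlook(state: str) -> Tuple[str, str]:
    """Classify how cancellation is expected to behave in a given state.

    Returns (category, note). Categories are illustrative labels:
    'clean', 'needs-executor', 'not-possible', 'nothing-to-cancel'.
    """
    if state in ("Created", "Claimed"):
        return "clean", "no nodes have been deactivated yet"
    if state == "Preparing":
        return "clean", "task skips to Restoring so deactivated nodes are reactivated"
    if state in ("Approved", "Executing"):
        return "needs-executor", "the Repair Executor must cooperate with the cancel request"
    if state == "Restoring":
        return "not-possible", "nodes are already being reactivated"
    if state == "Completed":
        return "nothing-to-cancel", "task is already in its terminal state"
    raise ValueError(f"unknown repair task state: {state!r}")


category, note = cancel_outlook("Preparing")
print(f"Preparing -> {category}: {note}")
```

A runbook script could refuse to proceed unless the category is 'clean', which keeps an on-call engineer from cancelling an Executing task by reflex.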

To cancel a stuck repair task via PowerShell:

Connect-ServiceFabricCluster -ConnectionEndpoint "your-cluster.region.cloudapp.azure.com:19000"

# List active repair tasks to find the name
Get-ServiceFabricRepairTask -State Active | Select-Object TaskId, State, Flags

# Cancel the specific task
Stop-ServiceFabricRepairTask -TaskId "Azure/TenantUpdate/addfb79e-1e8c-42c8-a967-b0e2e0afd6b4/0/110"

The TaskId follows the exact format Azure/[RepairType]/[JobID]/[UD]/[IncarnationNumber]. Get it exactly right; even one character off will return an error.
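Since a single wrong character is enough to fail the command, it can help to validate a pasted TaskId before using it. Below is a minimal Python sketch that checks a string against the Azure/[RepairType]/[JobID]/[UD]/[IncarnationNumber] shape and splits out the parts. It assumes the Job ID is a GUID and that the UD and incarnation are plain integers, which matches the example above but is an assumption beyond it; the names are illustrative.

```python
import re
from typing import NamedTuple

# Assumed shape: Azure/<letters>/<GUID>/<integer UD>/<integer incarnation>
TASK_ID_PATTERN = re.compile(
    r"^Azure/(?P<repair_type>[A-Za-z]+)/"
    r"(?P<job_id>[0-9a-fA-F]{8}(?:-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12})/"
    r"(?P<ud>\d+)/(?P<incarnation>\d+)$"
)


class RepairTaskId(NamedTuple):
    repair_type: str
    job_id: str
    update_domain: int
    incarnation: int


def parse_repair_task_id(task_id: str) -> RepairTaskId:
    """Split a repair task ID into its parts, or raise ValueError."""
    match = TASK_ID_PATTERN.match(task_id.strip())
    if match is None:
        raise ValueError(
            "expected Azure/[RepairType]/[JobID]/[UD]/[IncarnationNumber], "
            f"got {task_id!r}"
        )
    return RepairTaskId(
        repair_type=match["repair_type"],
        job_id=match["job_id"].lower(),
        update_domain=int(match["ud"]),
        incarnation=int(match["incarnation"]),
    )


parsed = parse_repair_task_id(
    "Azure/TenantUpdate/addfb79e-1e8c-42c8-a967-b0e2e0afd6b4/0/110"
)
print(parsed.job_id, parsed.update_domain, parsed.incarnation)
```

The extracted job_id is the same value to search for in the Infrastructure Jobs view or to hand to support.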

After cancellation, the task moves to Restoring (Repair Manager reactivates nodes) and then Completed with a Cancelled status. The Infrastructure Service will detect that the job hasn't been processed and will resubmit a new repair task, typically within a few minutes. Watch the Infrastructure Jobs view in SFX for the new task to appear.

In some enterprise scenarios with Azure Service Fabric cluster issues spanning multiple update domains, you may need to cancel tasks for several UDs. Cancel them one at a time, confirm each reaches Completed state before canceling the next, and keep an eye on the overall cluster health throughout the process.

Advanced Troubleshooting for Azure Service Fabric Issues

If the steps above haven't resolved your Azure Service Fabric cluster problems, you're likely dealing with one of the deeper infrastructure-layer issues that requires a more systematic approach. Let me walk you through the advanced diagnostic paths.

Reading Service Fabric Performance Counter Logs

The .blg files uploaded to your diagnostic storage account contain Windows Performance Counter data captured directly from your cluster VMs. Open them with Performance Monitor (perfmon.exe) by going to File → Open → and selecting the .blg file. The counters you want to focus on for Service Fabric troubleshooting are in the Service Fabric counter category, specifically queue depths for the Reliable Messaging layer, node deactivation timing, and health store operation latency. Spikes in any of these around the time your repair task got stuck are diagnostic gold.

Event Viewer Analysis for Service Fabric Nodes

RDP into one of the affected VMs and open Event Viewer. Navigate to Applications and Services Logs → Microsoft → ServiceFabric. The operational and analytic channels here contain events that never make it into SFX. Look for event IDs in the 23000-25000 range; these cover the Repair Manager, node deactivation coordinator, and health subsystem. Filter by time to narrow down to the window when your repair task stopped progressing.

Interpreting the Document Incarnation Number

Every repair task name includes a document incarnation number: the last integer in the repair task ID format. This number is monotonically increasing. It represents the version of the update document that Azure sent to Service Fabric. If you see repair tasks with very high incarnation numbers being created while older ones are still pending, it means Azure has sent multiple updated documents. Service Fabric needs to reconcile these. In this situation, the Infrastructure Service may create new repair tasks that supersede older ones. The older tasks may complete with an Interrupted status. That's expected and not an error.
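When a long list of task IDs is hard to eyeball, a few lines of Python can point out which ones have probably been superseded. This sketch groups task IDs by job ID and update domain and flags any whose incarnation number is lower than the newest in its group. That grouping is my working assumption about how supersession shows up in the names, not documented behavior, and find_superseded is a made-up helper name.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def find_superseded(task_ids: Iterable[str]) -> List[str]:
    """Return task IDs that have a newer incarnation for the same job and UD.

    Each ID is expected in the form
    Azure/[RepairType]/[JobID]/[UD]/[IncarnationNumber].
    """
    groups: Dict[Tuple[str, str], List[Tuple[int, str]]] = defaultdict(list)
    for task_id in task_ids:
        _prefix, _repair_type, job_id, ud, incarnation = task_id.split("/")
        groups[(job_id, ud)].append((int(incarnation), task_id))

    superseded: List[str] = []
    for entries in groups.values():
        newest = max(incarnation for incarnation, _ in entries)
        superseded.extend(tid for inc, tid in sorted(entries) if inc < newest)
    return superseded


job = "addfb79e-1e8c-42c8-a967-b0e2e0afd6b4"
tasks = [
    f"Azure/TenantUpdate/{job}/0/110",
    f"Azure/TenantUpdate/{job}/0/114",
    f"Azure/TenantUpdate/{job}/1/114",
]
print(find_superseded(tasks))
```

Tasks this flags are the ones you would expect to end as Interrupted, so they usually don't need manual cleanup.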

Domain-Joined and Enterprise Cluster Scenarios

In enterprise Azure Service Fabric deployments where clusters are domain-joined or managed through Azure Policy, I've seen repair task approval failures caused by Group Policy objects that restrict node deactivation commands. Specifically, GPOs that enforce strict service account permissions can prevent the Repair Manager from issuing deactivation calls to nodes it doesn't "own" from an RBAC perspective. If you're in a domain-joined cluster, check with your AD team; a GPO refresh or temporary policy exemption during the maintenance window may be needed.

Operator Force-Approve as a Last Resort

In the Preparing state, an operator with sufficient permissions can force-approve a repair task, bypassing certain safety checks. This capability exists specifically for emergency situations where waiting for health checks to pass isn't an option. Only use this if you've explicitly verified that the safety check failure is a false positive and you understand the risk of proceeding with a repair while the cluster is partially unhealthy. The force-approve option appears in SFX when you drill into the specific repair task details, or via PowerShell with the -Force flag on Approve-ServiceFabricRepairTask. Use it carefully.

When to Call Microsoft Support

Call in the professionals when: repair tasks are cycling through Created and Completed repeatedly without ever reaching Approved; the Infrastructure Service itself is showing as unhealthy in the system services view; you're seeing repair task names with a Platform repair type rather than Tenant (platform-level repairs are Azure-initiated and require Microsoft involvement to resolve); or your diagnostic logs show the same error pattern repeating for more than 48 hours. In all these cases, open a Sev B or Sev A ticket with Microsoft Support and provide your cluster resource ID, the stuck repair task IDs, and a time range for the issue. Grant them access to your diagnostic storage account; it will cut resolution time significantly.

Prevention & Best Practices for Azure Service Fabric Clusters

The best Service Fabric cluster is one that handles repair tasks so smoothly you barely notice them. Here's how to get there.

Keep your cluster health green at all times, not just during incidents. I mean this literally. Build health alerting into your monitoring stack so that any cluster entity entering Warning state pages your team within five minutes, even if services are still running fine. The Repair Manager uses that same health state data to decide whether it's safe to approve maintenance. If you let warnings accumulate over days, they'll eventually block a repair task at the worst possible moment, during a critical platform update at 2am.

Set appropriate health policies on your applications. Service Fabric lets you configure per-application health policies that determine how many unhealthy partitions or replicas are tolerated before the application reports as unhealthy at the cluster level. Default policies are strict. In environments where some services are intentionally non-critical, loosening their health policy prevents a single flapping stateless service from blocking VM maintenance for the entire cluster.

Monitor your diagnostic storage account connectivity separately. Don't discover that logs stopped uploading when you need them most. Set up an Azure Monitor alert on the storage account's transaction error rate. If log uploads start failing, you want to know immediately, before an incident makes those logs critical.

Document your repair task format and keep a runbook. The Azure/[RepairType]/[JobID]/[UD]/[IncarnationNumber] naming format is easy to forget under pressure. A simple operations runbook with the PowerShell commands for listing active repair tasks, canceling stuck ones, and checking cluster health covers 80% of the Service Fabric on-call scenarios I've seen.

Test maintenance scenarios regularly. Many teams only discover their cluster's repair task handling is broken during a real Azure platform maintenance event. You can manually trigger repair tasks in a test cluster to verify the full Created → Claimed → Preparing → Approved → Executing → Restoring → Completed lifecycle works end to end. Do this after every significant cluster configuration change.

Quick Wins

Configure Azure Monitor alerts on cluster health state changes; Warning or Error level should page your team within 5 minutes
Set a calendar reminder to rotate or validate your diagnostic storage account SAS token 30 days before expiry
Add the Service Fabric PowerShell commands for repair task management to your team's shared runbook; don't rely on memory during an incident
After any cluster upgrade, verify all system services (visible in SFX under System) return to healthy state before considering the upgrade complete
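For the SAS token reminder in the list above, a calendar entry can be backed by a small check. A SAS token carries its expiry in the se (signed expiry) query parameter as a UTC timestamp, so a script can read it without contacting Azure. The following Python sketch does that; the function name, the example URL, and the idea of comparing against 30 days are illustrative assumptions.

```python
from datetime import datetime, timezone
from typing import Optional
from urllib.parse import parse_qs, urlparse


def sas_days_remaining(sas: str, now: Optional[datetime] = None) -> int:
    """Return whole days until the SAS token's 'se' expiry (negative if past).

    Accepts either a full URL with a SAS query string or the bare token.
    """
    query = urlparse(sas).query or sas.lstrip("?")
    expiry_values = parse_qs(query).get("se")
    if not expiry_values:
        raise ValueError("no 'se' (signed expiry) parameter found")
    # 'se' is ISO 8601 in UTC, e.g. 2026-06-01T00:00:00Z or 2026-06-01
    expiry = datetime.fromisoformat(expiry_values[0].replace("Z", "+00:00"))
    if expiry.tzinfo is None:
        expiry = expiry.replace(tzinfo=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expiry - now).days


# Placeholder URL; the account name and signature are not real.
example = (
    "https://example.blob.core.windows.net/logs"
    "?sv=2022-11-02&se=2026-06-01T00%3A00%3A00Z&sp=rw&sig=REDACTED"
)
days = sas_days_remaining(example, now=datetime(2026, 5, 2, tzinfo=timezone.utc))
if days <= 30:
    print(f"SAS token expires in {days} days; rotate it now")
```

Run on a schedule, a check like this turns the 30-day reminder into an alert that fires even if the calendar entry gets lost.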

Frequently Asked Questions

Why is my Service Fabric repair task stuck in Preparing for hours?

A repair task stuck in Preparing almost always means a health check or safety check is failing. The Repair Manager owns the task during the Preparing state; it's in the process of deactivating nodes and validating that the cluster can safely absorb the repair. If any entity in your cluster is unhealthy (even a partition in Warning state), the safety check will not pass and the task won't advance. Open Service Fabric Explorer, go to the cluster health view, and find the red or yellow entities. Fix those first. The repair task will typically self-advance to Approved within minutes of the cluster returning to a healthy state. In rare cases where the cluster looks healthy but the task is still frozen, you can cancel the task using Stop-ServiceFabricRepairTask in PowerShell and let the Infrastructure Service resubmit it.

What does WaitingForAcknowledgement mean in the Infrastructure Jobs view?

It means Azure has sent a maintenance job to Service Fabric (things like VM updates, tenant updates, or platform repairs) and Service Fabric hasn't approved it yet. Service Fabric needs to create a repair task, run through the state machine, and get the task to the Approved state before it will acknowledge the job back to Azure. If a job stays in WaitingForAcknowledgement for an extended period, it's a sign that the corresponding repair task is stuck somewhere in the Created, Claimed, or Preparing states. Use the "Current Repair Task" field in the Infrastructure Jobs view to jump directly to the blocked task and diagnose it in the Repair Jobs view. Once the repair task reaches Approved, the Acknowledgement Status changes to Acknowledged.

How do I find the diagnostic log files for my Azure Service Fabric cluster?

Service Fabric uploads diagnostic logs automatically to the storage account you associated with the cluster at creation time. On the underlying VMs, the source files live under D:\SvcFab\Log\ and are collected recursively: system diagnostic logs end in .dtr or .etl, installation traces end in .trace, and performance counter logs end in .blg. You can access the uploaded versions via Azure Storage Explorer or the Azure portal by navigating to your diagnostic storage account and browsing the blob containers. If you've opened a Microsoft support case under KB 3200697, you can grant Microsoft permission to access these files directly, which significantly speeds up root cause analysis.

Can I cancel an Azure Service Fabric repair task that's in Approved or Executing state?

Technically yes, but it requires cooperation from the Repair Executor; you can't just force-cancel it the way you can a task in the Created or Preparing state. Once a task reaches Approved, the Repair Executor has taken ownership, and cancellation depends on the executor acknowledging the cancel signal and confirming it's safe to stop. During the Executing state, the Repair Executor is actively performing the repair and must complete all potentially disruptive actions before it can acknowledge cancellation. Forcing a cancel mid-execution risks leaving VMs in an inconsistent state. Unless you're dealing with a runaway repair that's actively causing damage, I'd strongly recommend letting Executing tasks complete rather than trying to cancel them.

What's the difference between a repair job and a repair task in Service Fabric?

A repair job is the Azure-side concept: the maintenance operation initiated by Azure, with a Job ID that's recognized outside Service Fabric. A repair task is the Service Fabric-side entity that gets created in response to a repair job. Its name combines the repair type (like TenantUpdate or PlatformUpdate), the Job ID, the targeted update domain, and the document incarnation number in the format Azure/[RepairType]/[JobID]/[UD]/[IncarnationNumber]. A single repair job can generate multiple repair tasks, one per update domain. The repair task is what moves through the Created → Claimed → Preparing → Approved → Executing → Restoring → Completed lifecycle inside Service Fabric.

My Service Fabric cluster has five update domains. Will all five repair tasks run at the same time?

No, they run sequentially, one update domain at a time. The Infrastructure Service creates separate repair tasks for each UD when a domain-level update is required, but they process in order: UD0 first, then UD1, then UD2, UD3, and UD4. You can track their progress in Service Fabric Explorer, and each task must reach the Completed state before the next UD's task advances past Created. This sequential design is intentional. It's how Service Fabric ensures that a maintenance operation never takes down more than one update domain's worth of nodes simultaneously, protecting cluster availability. If a task for UD2 is stuck, tasks for UD3 and UD4 are queued behind it and won't progress until you resolve the blockage.

Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.