How to Troubleshoot Azure Service Fabric Errors
Why This Is Happening
I've worked Azure Service Fabric troubleshooting cases at every scale, from three-node dev clusters all the way to 100-node production rings running financial services workloads. The honest truth is this: Service Fabric is one of the most powerful distributed systems platforms Microsoft offers, but when something goes wrong, the error messages it gives you are unhelpful unless you know exactly where to look. You'll see things like System.FM reported Error, SourceId='System.FM', Property='State' and have no idea what broke or why.
I know this is frustrating, especially when your services are down, your cluster health dashboard is lit up red, and your stakeholders are pinging you every five minutes. Let me explain what's actually happening under the hood.
Azure Service Fabric cluster health errors fall into a handful of root causes that I see over and over again:
Node failures and VM-level issues. The most common cause of node failures is the underlying Virtual Machine Scale Set (VMSS) losing a node, whether through Azure infrastructure maintenance, a VM restart policy misfiring, or a node running out of disk space under the Service Fabric data root (written as C:\SFRoot throughout this guide; confirm the FabricDataRoot on your own nodes, since Azure-deployed Windows clusters commonly use D:\SvcFab). When the node count drops below the minimum for the cluster's reliability tier, the Failover Manager (FM) starts reporting HealthState=Error across multiple system services simultaneously.
Upgrade rollback failures. Service Fabric application deployment failures during upgrades are brutally common. After each upgrade domain completes, Service Fabric waits HealthCheckWaitDuration (0 seconds by default, meaning it starts checking immediately) and then evaluates health. If your application doesn't pass those checks before the health check retry timeout expires, the upgrade rolls back, and you can be left with an upgrade stuck in the RollingBackInProgress state that won't resolve on its own.
Certificate expiration. This one causes middle-of-the-night pages. Cluster certificates in Service Fabric are used for node-to-node communication and client authentication. When the cluster certificate expires, the federation layer can't establish trust between nodes, and you'll see System.FederationSubsystem reporting errors in Service Fabric Explorer alongside HTTP 403 responses when connecting with sfctl or the SDK.
Quorum loss on stateful services. If you have stateful services with a target replica set size of 3 and two replicas go down simultaneously, say, during a VMSS scale-in event that removes two nodes, you hit quorum loss. The partition enters the InQuorumLoss state and stops accepting writes. This is one of the scariest Service Fabric states to recover from because Microsoft's tooling for it is not intuitive.
Misconfigured load balancer probes. Service Fabric sits behind an Azure Load Balancer, and if the health probe endpoint (default port 19080 for the management endpoint, or your custom service's port) becomes unreachable, even temporarily, the load balancer pulls the backend out of rotation. Your services are healthy internally but completely unreachable externally. A load balancer probe failure looks exactly like a cluster outage from the outside.
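The quorum loss case above comes down to majority arithmetic. Here is a minimal sketch of that arithmetic in plain PowerShell. Get-WriteQuorumSize and Test-PartitionHasQuorum are hypothetical helper names, not Service Fabric cmdlets, and the model is simplified: it ignores MinReplicaSetSize and in-flight reconfigurations.

```powershell
# Hypothetical helpers (not part of the Service Fabric module).
# Simplified model: a partition can accept writes while a majority of its replica set is up.
function Get-WriteQuorumSize {
    param([Parameter(Mandatory)][ValidateRange(1, 1000)][int]$ReplicaSetSize)
    # Majority of n replicas: floor(n / 2) + 1
    return [int][math]::Floor($ReplicaSetSize / 2) + 1
}

function Test-PartitionHasQuorum {
    param(
        [Parameter(Mandatory)][int]$ReplicaSetSize,
        [Parameter(Mandatory)][int]$ReplicasUp
    )
    return $ReplicasUp -ge (Get-WriteQuorumSize -ReplicaSetSize $ReplicaSetSize)
}

# Target replica set size 3, two replicas lost, one still up:
Test-PartitionHasQuorum -ReplicaSetSize 3 -ReplicasUp 1   # False
```

With three replicas the quorum is two, so losing one replica is survivable and losing two is not.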
The Quick Fix: Try This First
Before diving into deep diagnostics, run this 60-second cluster health check. It tells you exactly which entity is unhealthy and gives you a structured starting point. Open PowerShell with the Service Fabric SDK installed and connect to your cluster:
# Connect to your cluster (replace with your cluster endpoint and cert thumbprint)
Connect-ServiceFabricCluster -ConnectionEndpoint "yourcluster.eastus.cloudapp.azure.com:19000" `
-X509Credential `
-ServerCertThumbprint "AABBCCDD1122..." `
-FindType FindByThumbprint `
-FindValue "AABBCCDD1122..." `
-StoreLocation CurrentUser `
-StoreName My
# Get the overall cluster health, this is your map
Get-ServiceFabricClusterHealth | Select-Object AggregatedHealthState, NodeHealthStates, ApplicationHealthStates
What you get back is a structured health tree. If AggregatedHealthState shows Error, look at NodeHealthStates first. If every node is Ok but an application is Error, that's an application-level problem. If nodes are Error, that's infrastructure.
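The decision rule in the paragraph above can be sketched as a small function. Get-HealthTriage is a hypothetical name, and the function only assumes an object with the three properties selected in the query (AggregatedHealthState, NodeHealthStates, ApplicationHealthStates). It treats any node that is not Ok as an infrastructure signal, which is a slightly broader reading than "nodes are Error".

```powershell
# Hypothetical helper: classify a cluster health result as node-level or application-level.
# Nodes are checked first, mirroring the order recommended in the text.
function Get-HealthTriage {
    param([Parameter(Mandatory)]$ClusterHealth)

    # Compare as strings so the function works with enum values and plain text alike.
    if ("$($ClusterHealth.AggregatedHealthState)" -eq 'Ok') { return 'Healthy' }

    $badNodes = @($ClusterHealth.NodeHealthStates |
        Where-Object { $_ -and "$($_.AggregatedHealthState)" -ne 'Ok' })
    if ($badNodes.Count -gt 0) { return 'Infrastructure' }

    $badApps = @($ClusterHealth.ApplicationHealthStates |
        Where-Object { $_ -and "$($_.AggregatedHealthState)" -ne 'Ok' })
    if ($badApps.Count -gt 0) { return 'Application' }

    return 'Unknown'
}
```

On a connected session you would call it as Get-HealthTriage -ClusterHealth (Get-ServiceFabricClusterHealth). An 'Unknown' result means the cluster is unhealthy for a reason that is not visible in the node or application lists, such as a system service report.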
For a faster visual snapshot, open Service Fabric Explorer directly at https://yourcluster.eastus.cloudapp.azure.com:19080/Explorer. The tree view on the left shows every node, application, service, partition, and replica with color-coded health states. Red means Error, yellow means Warning, green is healthy. Click any red item and you'll see the exact health event description, timestamp, and source system service that reported it.
In about 80% of the cases I've handled, the fix starts right here: you find a single unhealthy node, clear it or restart the Service Fabric service on it, and the cluster self-heals within 5–10 minutes as replicas move and health reports propagate.
# Restart the Service Fabric service on a specific node (run on the node itself via RDP)
Restart-Service FabricHostSvc -Force
# Or use PowerShell remoting if you have it configured
Invoke-Command -ComputerName "NodeName_0" -ScriptBlock { Restart-Service FabricHostSvc -Force }
When you click a red item, note the SourceId on the health event, for example System.RAP (Reconfiguration Agent Proxy), System.NamingService, or System.Hosting. The SourceId tells you which internal subsystem to focus on.
Step 1: Read the Health Tree in Service Fabric Explorer
Service Fabric Explorer (SFX) is your primary diagnostic interface. Open it at https://<your-cluster-endpoint>:19080/Explorer in a browser. If you can't reach it, that's a separate problem; see Diagnosing Network-Level Service Fabric Issues under Advanced Troubleshooting for connectivity fixes.
Once inside, you'll see a tree structure on the left: the cluster at the root, then nodes, then applications. Here's how to read it efficiently:
Check nodes first. Expand the Nodes section. Each node will show its status: Up, Down, or Disabled. A node that's Down is the most urgent signal. How many down nodes Service Fabric can tolerate depends on your reliability tier, which sets the replica count for the system services (Bronze = 3, Silver = 5, Gold = 7, Platinum = 9); a majority of those replicas has to stay up. If you've exceeded that tolerance, system services like Failover Manager (FM) and Cluster Manager (CM) will themselves report errors.
Click on any red or yellow node and look at the Health Events section. You'll see structured events like:
SourceId: System.FM
Property: State
Description: Partition is in quorum loss.
HealthState: Error
TimeToLive: Infinite
Then check applications. Expand the Applications section. Find your application, expand down to the service, then to partitions. A replica stuck in the InBuild state for more than a few minutes usually means it is struggling to copy state or be placed. A partition in InQuorumLoss needs immediate action; you'll handle that in Step 3.
After reading the health tree, you should have a clear picture of whether your issue is node-level or application-level. That distinction drives everything else in this guide. If all your nodes are Up and green, skip directly to Step 3. If nodes are showing as Down or Error, continue to Step 2.
Step 2: Fix Node Failures
Fixing a node failure starts with understanding why the node went down. There are three distinct scenarios.
Scenario A: VM is running but Service Fabric is unhealthy. RDP or SSH into the node. On Windows, open Event Viewer and navigate to Applications and Services Logs > Microsoft > ServiceFabric > Admin (Event Log Channel ID: Microsoft-ServiceFabric/Admin). Look for events with Level = Error or Critical. Common culprits are disk full conditions (Event ID 25624, "Failed to open file") or certificate validation failures (Event ID 4349).
Check disk space immediately:
# Run on the node via RDP or Invoke-Command
Get-PSDrive C | Select-Object Used, Free
# Service Fabric needs at least 10 GB free on the data drive
# Default data root is C:\SFRoot, check its size:
Get-ChildItem C:\SFRoot -Recurse -File -Force -ErrorAction SilentlyContinue | Measure-Object -Property Length -Sum
If disk is the issue, clear the Service Fabric crash dump folder at C:\SFRoot\CrashDumps. Dumps accumulate fast and can fill a 128 GB OS disk within days on a busy node.
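Rather than deleting the whole folder, you may prefer to remove only older dumps and keep recent ones for analysis. The following is a minimal sketch of that idea. Remove-OldCrashDump is a hypothetical helper, and the 7-day default and the *.dmp filter are assumptions to adjust for your environment.

```powershell
# Hypothetical helper: delete .dmp files older than a cutoff and report what matched.
# Supports -WhatIf so you can preview the effect before deleting anything.
function Remove-OldCrashDump {
    [CmdletBinding(SupportsShouldProcess)]
    param(
        [Parameter(Mandatory)][string]$Path,
        [int]$OlderThanDays = 7
    )
    $cutoff = (Get-Date).AddDays(-$OlderThanDays)
    Get-ChildItem -LiteralPath $Path -File -Filter '*.dmp' |
        Where-Object { $_.LastWriteTime -lt $cutoff } |
        ForEach-Object {
            if ($PSCmdlet.ShouldProcess($_.FullName, 'Delete crash dump')) {
                Remove-Item -LiteralPath $_.FullName -Force
            }
            $_.FullName   # emit the path so callers can log what matched
        }
}

# Preview first, then run for real once the list looks right:
# Remove-OldCrashDump -Path 'C:\SFRoot\CrashDumps' -OlderThanDays 7 -WhatIf
```

Copy at least one recent dump off the node before cleaning up if you expect to open a support case.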
Scenario B: VM itself is down or deallocated. Go to the Azure Portal, navigate to your VMSS resource (it's named something like nt1vm or nt0 depending on your node type name), and check the instance list. A stopped/deallocated instance won't run Service Fabric. Start the instance and give Service Fabric 3–5 minutes to rejoin the cluster. Watch SFX: the node should transition from Down → Up → OK.
Scenario C: Node is stuck in Disabling state. This happens when a deactivation request was issued (often by Azure infrastructure for maintenance) but never completed. You'll see the node status as "Disabling" indefinitely in SFX. Fix it with:
# Cancel the stuck deactivation so the node can come back
# First, check current deactivation status and any pending safety checks:
Get-ServiceFabricNode -NodeName "NodeName_3" | Select-Object NodeStatus, NodeDeactivationInfo
# If it is stuck, re-enable the node; this cancels a deactivation that has not completed.
# Do not re-issue Disable-ServiceFabricNode -Intent RemoveData here: that intent tells
# Service Fabric the node's state is going to be discarded.
Enable-ServiceFabricNode -NodeName "NodeName_3"
After re-enabling, the node will rejoin and replicas will begin rebuilding. This can take anywhere from 2 minutes to 30 minutes depending on how much state needs to transfer.
Step 3: Recover from Quorum Loss
This is the most serious state you'll troubleshoot in Service Fabric. When a stateful service partition goes not ready or into quorum loss, your application stops writing. Here's exactly how to handle it.
First, confirm the partition ID and state:
# Get all partitions that are not Ready across all applications
Get-ServiceFabricApplication | ForEach-Object {
Get-ServiceFabricService -ApplicationName $_.ApplicationName
} | ForEach-Object {
# Capture the service name so it can be attached to each partition row
$serviceName = $_.ServiceName
Get-ServiceFabricPartition -ServiceName $serviceName |
Where-Object { $_.PartitionStatus -ne "Ready" } |
Select-Object PartitionId, PartitionStatus, @{ Name = "ServiceName"; Expression = { $serviceName } }
}
If a partition shows InQuorumLoss, a majority of its replicas are missing (two or more for a replica set of 3) and writes are blocked. Your options in order of preference:
Option 1: Bring the missing nodes back online (preferred). This is always the right first move. Quorum loss resolves automatically once enough replicas rejoin. If you brought a node back in Step 2 and you're still seeing quorum loss after 10 minutes, the replica on that node may need to rebuild.
Option 2: Force recovery and accept data loss as a last resort. Only do this if you've lost replicas permanently (VM disk corruption, accidental deletion). This will cause data loss, so your application must handle state recovery through its own backup/restore mechanism:
# This is destructive; only use it if the missing replicas are permanently gone
# Replace with your actual partition GUID
$partitionId = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
# Repair-ServiceFabricPartition tells Service Fabric to stop waiting for the lost replicas.
# Invoke-ServiceFabricPartitionDataLoss is a fault-injection cmdlet for testing, not a recovery tool.
Repair-ServiceFabricPartition -PartitionId $partitionId
For cases where a partition has no primary but its replicas are otherwise intact, you can trigger a manual reconfiguration by temporarily reducing the target replica count to 1, letting a new primary be elected, then restoring it:
Update-ServiceFabricService -Stateful -ServiceName "fabric:/MyApp/MyService" -TargetReplicaSetSize 1
# Wait for partition to become Ready
# Then restore original replica count:
Update-ServiceFabricService -Stateful -ServiceName "fabric:/MyApp/MyService" -TargetReplicaSetSize 3
Step 4: Resolve Upgrades Stuck Rolling Back
An upgrade stuck rolling back is one of the most common support tickets I see, and the messages SFX shows perpetually, such as "ApplicationTypeVersion 'v2.0.0' is being deleted" or "Upgrade is rolling back", don't tell you which health check failed.
Get the actual reason the upgrade failed:
# Get upgrade details including failure reason
Get-ServiceFabricApplicationUpgrade -ApplicationName "fabric:/MyApp" |
Select-Object ApplicationName, UpgradeState, UpgradeDomains,
FailureReason, UpgradeStatusDetails
The FailureReason field will say something like HealthCheck, UpgradeDomainTimeout, or OverallUpgradeTimeout. For health check failures, get the specific health events:
Get-ServiceFabricApplicationHealth -ApplicationName "fabric:/MyApp" |
Select-Object -ExpandProperty UnhealthyEvaluations
Common fixes for stuck Service Fabric application deployment failures:
Fix 1: Health check policy too strict. Your ApplicationHealthPolicy in the upgrade description may be set to 0% max unhealthy, meaning a single unhealthy service (or a service in Warning, if ConsiderWarningAsError is enabled) blocks the whole upgrade. Roll back cleanly first:
Start-ServiceFabricApplicationRollback -ApplicationName "fabric:/MyApp"
Then resubmit the upgrade with a more lenient health policy, increasing MaxPercentUnhealthyServices to 25 while you investigate the underlying health issue separately.
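To get a feel for what a given MaxPercentUnhealthyServices value tolerates, here is a rough sketch of the percentage check. Test-WithinUnhealthyPolicy is a hypothetical helper, and the round-up behavior reflects my reading of the health policy documentation (it rounds up so that small service counts can tolerate one failure); treat it as an approximation, not the platform's exact evaluation.

```powershell
# Hypothetical helper: approximate the MaxPercentUnhealthy* evaluation.
# Assumption: the tolerated count is the percentage of the total, rounded up.
function Test-WithinUnhealthyPolicy {
    param(
        [Parameter(Mandatory)][int]$TotalCount,
        [Parameter(Mandatory)][int]$UnhealthyCount,
        [ValidateRange(0, 100)][int]$MaxPercentUnhealthy = 0
    )
    $tolerated = [int][math]::Ceiling([double]($TotalCount * $MaxPercentUnhealthy) / 100)
    return $UnhealthyCount -le $tolerated
}

# Four services, one unhealthy:
Test-WithinUnhealthyPolicy -TotalCount 4 -UnhealthyCount 1 -MaxPercentUnhealthy 0    # False
Test-WithinUnhealthyPolicy -TotalCount 4 -UnhealthyCount 1 -MaxPercentUnhealthy 25   # True
```

Under this model, a 0% policy fails on the first unhealthy service, while 25% lets one of four services be unhealthy without stopping the upgrade.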
Fix 2: Package validation failure. If the upgrade fails immediately at the first upgrade domain, your package may have a manifest inconsistency. Check with:
Test-ServiceFabricApplicationPackage -ApplicationPackagePath "C:\MyAppPackage" -ImageStoreConnectionString "fabric:ImageStore"
After resolving the rollback, verify the application type is fully removed from the image store before re-uploading:
Get-ServiceFabricApplicationType -ApplicationTypeName "MyAppType"
# If the failed version still shows, remove it:
Unregister-ServiceFabricApplicationType -ApplicationTypeName "MyAppType" -ApplicationTypeVersion "v2.0.0" -Force
Step 5: Read the Event Logs
When SFX doesn't give you enough detail and PowerShell health queries only show cascading symptoms, you need the raw event logs. Service Fabric emits structured ETW (Event Tracing for Windows) events across several channels, and reading them correctly is what separates a 4-hour debugging session from a 20-minute one.
On-node Event Viewer channels: Open Event Viewer on any cluster node. Navigate to Applications and Services Logs > Microsoft > ServiceFabric. There are three channels:
Admin: Operational events, errors, and warnings. Start here.
Operational: Higher-volume events, useful for sequence-of-events analysis.
Analytic: Very verbose; only enable when directed by Microsoft Support.
Key Service Fabric event log errors and their meanings:
Event ID 23029, "FabricNode: Unexpected exception" → Node process crash, check for .dmp files in C:\SFRoot\CrashDumps
Event ID 4349, "Certificate validation failed" → Certificate expired or CN mismatch
Event ID 23482, "Service host process exited" → Your service code crashed; check your application logs
Event ID 18601, "Partition ReconfigurationCompleted" → Normal, replica reconfiguration finished
Azure Monitor / Log Analytics: Your cluster should have Azure Diagnostics configured (go to your cluster resource in the Azure Portal, select Diagnostics settings, and confirm logs are flowing to a Log Analytics workspace). If it does, query Service Fabric events directly:
// KQL query for Service Fabric errors in the last 24 hours
ServiceFabricReliableServiceEvent
| where TimeGenerated > ago(24h)
| where Level == "Error" or Level == "Warning"
| project TimeGenerated, ServiceTypeName, OperationName, Message
| order by TimeGenerated desc
| take 100
Also query the operational channel:
ServiceFabricOperationalEvent
| where TimeGenerated > ago(1h)
| where EventId in (23029, 4349, 23482)
| project TimeGenerated, NodeName, EventId, TaskName, Message
This gives you a timeline across all nodes simultaneously, something you can't get by RDP-ing into each node individually. If you see Event ID 23029 on multiple nodes within seconds of each other, that points to a cluster-wide configuration issue, not a single bad node.
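If you export those query results, the "same event on multiple nodes within seconds" check can be automated. The sketch below works on plain objects with TimeGenerated, NodeName, and EventId properties, matching the columns projected above. Find-ClusterWideEvent is a hypothetical helper, and the 30-second window and two-node threshold are assumptions.

```powershell
# Hypothetical helper: flag event IDs that appear on several distinct nodes within a short window.
function Find-ClusterWideEvent {
    param(
        [Parameter(Mandatory)][object[]]$Record,   # objects with TimeGenerated, NodeName, EventId
        [int]$WindowSeconds = 30,
        [int]$MinNodes = 2
    )
    $Record | Group-Object EventId | ForEach-Object {
        $group = $_
        $sorted = @($group.Group | Sort-Object TimeGenerated)
        for ($i = 0; $i -lt $sorted.Count; $i++) {
            # Look at every occurrence that falls inside the window starting at record $i.
            $windowEnd = $sorted[$i].TimeGenerated.AddSeconds($WindowSeconds)
            $nodes = @($sorted[$i..($sorted.Count - 1)] |
                Where-Object { $_.TimeGenerated -le $windowEnd } |
                Select-Object -ExpandProperty NodeName -Unique)
            if ($nodes.Count -ge $MinNodes) {
                [pscustomobject]@{
                    EventId   = [int]$group.Name
                    FirstSeen = $sorted[$i].TimeGenerated
                    Nodes     = $nodes
                }
                break   # one hit per event ID is enough
            }
        }
    }
}
```

An event ID that shows up in the output deserves a cluster-level look (configuration, certificates, a recent upgrade) before you spend time on any single node.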
Advanced Troubleshooting
Fixing Service Fabric Certificate Expired Issues
Certificate problems break the entire cluster fast. When your cluster certificate expires, nodes can't authenticate to each other, and the federation layer collapses. You'll see System.FederationSubsystem reporting errors in SFX and be unable to connect with sfctl or the PowerShell SDK.
To rotate a cluster certificate without downtime, you need to add the new certificate as a secondary before the primary expires:
# Add new certificate thumbprint as secondary (do this BEFORE primary expires)
# Navigate in Azure Portal: Service Fabric Cluster > Security > + Add Certificate
# Or via ARM template update, change the "thumbprintSecondary" field in the cluster resource
# After nodes have picked up the new cert (check in SFX that all nodes show new cert), swap primary/secondary:
# Portal: Service Fabric Cluster > Security > Swap Primary/Secondary
If the certificate already expired and your cluster is down, you're in a break-glass scenario. You'll need to directly modify each node's VMSS model to inject the new certificate from Key Vault, then restart the Service Fabric service on each node sequentially, respecting the upgrade domain order (never restart all nodes simultaneously).
Group Policy and Domain-Joined Cluster Nodes
Enterprise clusters with domain-joined VMSS instances sometimes have GPOs that interfere with Service Fabric. Specifically, policies that set aggressive firewall rules can block the following ports that Service Fabric requires internally:
TCP 1025-1027, Dynamic ports for internal RPC
TCP 4443, Reverse proxy HTTPS
TCP 19000, Client connections
TCP 19080, HTTP management endpoint
TCP 20001-20031, Application port range (configurable)
Check for GPO-imposed firewall rules on a node with:
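# Ports and addresses live on separate filter objects, so look them up per rule
Get-NetFirewallRule -Enabled True -Direction Inbound -Action Block |
Select-Object DisplayName,
@{ Name = "LocalPort"; Expression = { ($_ | Get-NetFirewallPortFilter).LocalPort } },
@{ Name = "RemoteAddress"; Expression = { ($_ | Get-NetFirewallAddressFilter).RemoteAddress } }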
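Once you have the blocked port values, you still need to compare them against the Service Fabric port list in this section. Here is a minimal sketch of that comparison. Get-BlockedRequiredPort is a hypothetical helper, and it assumes port values arrive as strings such as "19000", "1025-1027", or "Any"; other keywords are skipped.

```powershell
# Port list taken from this section, expressed as inclusive ranges.
$requiredPorts = @(
    [pscustomobject]@{ Name = 'Internal RPC';        From = 1025;  To = 1027 }
    [pscustomobject]@{ Name = 'Reverse proxy HTTPS'; From = 4443;  To = 4443 }
    [pscustomobject]@{ Name = 'Client connections';  From = 19000; To = 19000 }
    [pscustomobject]@{ Name = 'HTTP management';     From = 19080; To = 19080 }
    [pscustomobject]@{ Name = 'Application ports';   From = 20001; To = 20031 }
)

# Hypothetical helper: report which required ports overlap a blocked port specification.
function Get-BlockedRequiredPort {
    param(
        [Parameter(Mandatory)][string[]]$BlockedPortSpec,
        [Parameter(Mandatory)][object[]]$RequiredPort
    )
    foreach ($spec in $BlockedPortSpec) {
        if ($spec -eq 'Any') { $from = 0; $to = 65535 }
        elseif ($spec -match '^(\d+)-(\d+)$') { $from = [int]$Matches[1]; $to = [int]$Matches[2] }
        elseif ($spec -match '^\d+$') { $from = $to = [int]$spec }
        else { continue }   # skip non-numeric keywords
        foreach ($req in $RequiredPort) {
            # Two inclusive ranges overlap when each starts before the other ends.
            if ($from -le $req.To -and $to -ge $req.From) {
                [pscustomobject]@{ BlockedSpec = $spec; Requirement = $req.Name }
            }
        }
    }
}

Get-BlockedRequiredPort -BlockedPortSpec '19000', '1000-1026', '8080' -RequiredPort $requiredPorts
```

In this example the first two specifications collide with the client connection port and the internal RPC range, while 8080 is harmless.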
Cluster Scaling and Node Type Constraints
If you're hitting Azure Service Fabric cluster health errors during a scale-out or scale-in operation, check that your VMSS instance count never drops below the reliability tier minimum. A Silver-tier primary node type needs at least 5 nodes, and Gold needs 7. Scale in below that and you'll see FM errors immediately.
To check your reliability tier programmatically:
Get-ServiceFabricClusterManifest | Select-String "ReliabilityLevel"
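A pre-flight check before any scale-in can be as simple as the sketch below. Test-ScaleInAllowed is a hypothetical helper, and the minimums it uses (Silver 5, Gold 7, Platinum 9) are the ones quoted in the FAQ of this guide; verify them against the current Azure documentation before relying on them.

```powershell
# Hypothetical helper: would this scale-in leave the primary node type at or above its minimum?
function Test-ScaleInAllowed {
    param(
        [Parameter(Mandatory)][ValidateSet('Silver', 'Gold', 'Platinum')][string]$ReliabilityTier,
        [Parameter(Mandatory)][int]$CurrentNodeCount,
        [int]$NodesToRemove = 1
    )
    # Minimum primary node type sizes as quoted in this guide's FAQ (assumption: still current).
    $minimumNodes = @{ Silver = 5; Gold = 7; Platinum = 9 }
    return ($CurrentNodeCount - $NodesToRemove) -ge $minimumNodes[$ReliabilityTier]
}

Test-ScaleInAllowed -ReliabilityTier Silver -CurrentNodeCount 5   # False
Test-ScaleInAllowed -ReliabilityTier Silver -CurrentNodeCount 6   # True
```

Wiring a check like this into the script or pipeline that performs scale-in gives you a guard that does not depend on someone remembering the number.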
Diagnosing Network-Level Service Fabric Issues
To diagnose a load balancer probe failure, go to the Azure Portal: navigate to your Load Balancer resource (same resource group as your cluster), then Monitoring > Metrics. Plot "Health Probe Status" per backend pool. A probe status of 0 means the probe endpoint is unreachable. Check whether port 19080 is open in your NSG (Network Security Group) and that the FabricGateway service is running on the seed nodes.
Escalate to Microsoft Support when: (1) your cluster is in quorum loss and you cannot bring lost replicas back (Microsoft has internal tools to safely recover state that aren't exposed publicly); (2) your cluster has been in UpgradePending state for more than 6 hours with no progress (this indicates a stuck platform upgrade that only the Azure fabric team can unblock); (3) your certificate is expired and you cannot connect to the cluster at all (Microsoft Support can assist with emergency certificate rotation procedures on the backend VM infrastructure). Always open a Severity A (Critical) ticket for production outages. Don't start with Severity C and escalate; it wastes time you don't have.
Prevention & Best Practices
After handling hundreds of Azure Service Fabric troubleshooting escalations, the clusters I never hear from again all have one thing in common: they invest in observability and operational hygiene before something breaks.
Monitor certificate expiration proactively. The single biggest preventable outage type I see is expired cluster certificates. Set up an Azure Monitor alert on your Key Vault certificate expiration at 60, 30, 14, and 7 days out. The alert rule is under Key Vault > Alerts > + New Alert Rule > Certificate About to Expire. Never let this be a surprise.
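As a local complement to portal alerts, the same 60/30/14/7-day ladder can be checked from a script. This is a minimal sketch: Get-CertificateExpiryAlert is a hypothetical helper that takes a certificate's NotAfter date and returns which threshold, if any, has been crossed.

```powershell
# Hypothetical helper: map a certificate expiry date onto the alert thresholds from the text.
function Get-CertificateExpiryAlert {
    param(
        [Parameter(Mandatory)][datetime]$NotAfter,
        [datetime]$Now = (Get-Date),
        [int[]]$ThresholdDays = @(60, 30, 14, 7)
    )
    $daysLeft = [int][math]::Floor(($NotAfter - $Now).TotalDays)
    if ($daysLeft -lt 0) { return 'Expired' }
    # Pick the tightest threshold that the remaining lifetime has already crossed.
    $crossed = $ThresholdDays | Where-Object { $daysLeft -le $_ } | Sort-Object | Select-Object -First 1
    if ($null -ne $crossed) { return "ExpiresWithin${crossed}Days" }
    return 'Ok'
}

# Example against a local store (store location is an assumption; use wherever your cert lives):
# Get-ChildItem Cert:\CurrentUser\My | ForEach-Object {
#     [pscustomobject]@{ Thumbprint = $_.Thumbprint; Status = Get-CertificateExpiryAlert -NotAfter $_.NotAfter }
# }
```

Running something like this on a schedule gives you a second signal in case the portal alert is misconfigured or routed to an unmonitored mailbox.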
Configure health watch policies explicitly. Don't rely on Service Fabric defaults for your production upgrade health policies. Define explicit ApplicationHealthPolicy in your upgrade scripts with clear thresholds for MaxPercentUnhealthyServices and MaxPercentUnhealthyDeployedApplications. Know what "healthy enough to continue upgrading" means for your specific application before an upgrade fails at 2 AM and you're making that call under pressure.
Run chaos testing in pre-production. Service Fabric has a built-in Chaos service that deliberately introduces faults to validate cluster resilience. Run it regularly in your staging environment, not just once during initial validation:
# Start Chaos with controlled parameters (staging only)
# Start-ServiceFabricChaos takes these settings as direct parameters
Start-ServiceFabricChaos -TimeToRunMinute 10 `
-MaxConcurrentFaults 2 `
-MaxClusterStabilizationTimeoutSec 60 `
-WaitTimeBetweenIterationsSec 10 `
-WaitTimeBetweenFaultsSec 5
Never run primary node types below minimum instance count. Document your cluster's reliability tier minimum node count in your runbooks and make it a hard gate on VMSS scale-in operations via Azure Policy. A single accidentally triggered scale-in that drops a Silver cluster from 5 to 4 nodes will cause an outage.
Enable Azure Diagnostics from day one. If you're not flowing Service Fabric ETW events to Log Analytics, you are flying blind during incidents. Set this up at cluster creation, not retroactively during an outage. In the ARM template, that means diagnosticsStorageAccountConfig on the cluster resource plus the Azure Diagnostics (IaaSDiagnostics) extension on your VMSS.
- Set Azure Monitor alerts for cluster health state changes: Error = PagerDuty/phone call, Warning = Slack notification
- Rotate cluster certificates every 12 months. Put it in your team calendar as a recurring event, not a ticket that gets deprioritized
- Keep at least one "golden" application package version tagged in your image store, so you always have a known-good version you can roll back to within 2 minutes
- Test your disaster recovery runbook quarterly, specifically the quorum loss recovery procedure, because that's the one you'll blank on at 3 AM
Frequently Asked Questions
My Service Fabric cluster is showing Error health state but all nodes are Up. What does that mean?
This almost always means an application-level issue rather than infrastructure. When all nodes are Up but the cluster aggregated state is Error, expand the Applications section in Service Fabric Explorer and look for a partition in InQuorumLoss or NotReady, or a service in Error due to repeated crashes. Run Get-ServiceFabricPartition -ServiceName "fabric:/YourApp/YourService" in PowerShell to see partition status. The SourceId on the health event (e.g., System.RAP, System.FM) tells you which internal component detected the failure. Search for that SourceId in Microsoft's documentation for targeted guidance.
How do I fix "The cluster is currently in upgrade" error when I try to deploy my application?
This error means a platform-level cluster upgrade (not your application upgrade) is currently in progress, and Service Fabric is blocking concurrent application changes to protect cluster stability. Run Get-ServiceFabricClusterUpgrade in PowerShell to see the current upgrade state and which upgrade domain it's on. If it shows RollingForwardPending or RollingForwardInProgress, you just need to wait; typical cluster upgrades take 30–90 minutes. If it's been stuck in the same upgrade domain for more than 2 hours, check the cluster health for errors in that upgrade domain's nodes, as an unhealthy node is likely blocking the upgrade from advancing.
Service Fabric Explorer shows a partition as "InBuild" for over an hour. Is this normal?
No, that's not normal. InBuild means a replica is copying state from the primary to build a new secondary, and that should complete within minutes for reasonably sized state. If it's been stuck for an hour, first check the target node for disk space, because replica builds stop if the destination node runs low. Then run Get-ServiceFabricReplicaHealth on the stuck replica to see whether a health event explains the delay. Restarting the FabricHostSvc service on the destination node often clears stuck in-build conditions.
Can I reduce the node count on my primary node type without causing an outage?
Yes, but you must stay above your reliability tier minimum. Silver reliability needs at least 5 nodes, Gold needs 7, and Platinum needs 9 (confirm these against the current Azure documentation for your runtime version). Scale in one node at a time, wait for the cluster to stabilize between each reduction (watch SFX until all partitions return to Ready state), and never scale in more than one node per upgrade domain in the same operation. Going below the minimum, even briefly, will cause system service quorum loss and a cluster outage that can take hours to recover from.
How do I connect to Service Fabric Explorer when the cluster certificate is expired?
This is a painful situation. If you have client certificates configured separately from the cluster certificate, try connecting with a client cert that hasn't expired; the cluster may still respond to management requests even if node-to-node communication is broken. If you have no valid certificate at all, you'll need to access the Service Fabric management endpoint directly from within the VNET by RDP-ing into a cluster node and opening a browser to http://localhost:19080/Explorer (note HTTP, not HTTPS, from localhost only). This bypasses external TLS validation and lets you see cluster state. For actual certificate rotation in this emergency state, open a Microsoft Support ticket immediately. The recovery procedure involves direct VM-level access that requires Azure backend assistance.
My Service Fabric service keeps crashing on startup with no useful error message. How do I debug it?
The fastest path to a real error message is enabling application-level crash dumps and checking Event Viewer on the host node. In Service Fabric Explorer, find the specific replica that's crashing, note which node it's hosted on, then RDP into that node. Open Event Viewer and navigate to Windows Logs > Application. You'll typically find a .NET runtime error or an unhandled exception logged there with the full stack trace. Also check C:\SFRoot\Log\Traces for the fabric_traces*.etl files, which can be read with PerfView. For containerized services, run docker logs <container-id> on the node, because Service Fabric container logs don't flow to ETW by default and are only visible through Docker directly.