How to Troubleshoot System Center SCOM (Fix Agent & Server Issues)
Why This Is Happening
I've seen this exact situation on dozens of enterprise deployments: you open the SCOM Operations Console first thing Monday morning, and half your monitored servers are showing gray health states. Or maybe the agent installed fine last week, and now it's silently stopped reporting. Or worse: your management server itself is throwing errors and you're flying blind across a 2,000-node environment. I know how stressful that is, especially when your SLA clock is ticking.
System Center Operations Manager is a genuinely powerful monitoring platform, but it's also one of the more architecturally complex products Microsoft ships. You've got agents, management servers, a gateway server layer, two SQL databases (OperationsManager and OperationsManagerDW), the SDK service, the Health Service, Run As accounts, management packs, and certificates, and every single one of those pieces can be the source of your pain. Microsoft's own error messages in the Operations Console are notoriously unhelpful. "Health Service Heartbeat Failure" tells you almost nothing about why it's happening.
The most common root causes I encounter when troubleshooting System Center SCOM fall into a few buckets. First, there's the Health Service failing to start or crashing on the agent machine. This is often caused by a corrupt health service store, a certificate mismatch, or a port conflict on TCP 5723. Second is management server connectivity problems, where agents can't reach the primary management server due to DNS resolution failures, firewall rules blocking bidirectional communication, or an overloaded server queue. Third is Run As account permission failures: the account SCOM uses to run workflows simply doesn't have rights to the target resource anymore, often because a password expired or an AD policy changed. Fourth, you'll see database-related slowdowns when the OperationsManager database grows unchecked and grooming hasn't run properly, causing SDK timeouts and alert delivery lag.
SCOM 2019 and SCOM 2022 introduced some architectural changes around TLS 1.2 enforcement and modern certificate requirements that have caused a whole new wave of connectivity errors, particularly Event ID 21016 and Event ID 21006 in the Operations Manager event log, on machines that were running fine under SCOM 2016.
The good news? Almost every SCOM problem has a clear diagnostic trail if you know where to look. Event Viewer, PowerShell, and the SCOM shell give you everything you need. Let's work through it systematically.
The Quick Fix: Try This First
When an agent goes gray or stops reporting, the single fastest fix, one that resolves probably 40% of cases I see, is restarting the Microsoft Monitoring Agent service on the affected machine and flushing the Health Service store cache. Don't just restart the service blindly. Clear the store first, or you'll restart into the same corrupted state.
Here's the exact sequence. RDP into the agent machine (or run this remotely via PSRemoting), then open an elevated PowerShell prompt and run:
# Stop the Health Service
Stop-Service -Name HealthService -Force
# Clear the Health Service store (safe: the agent re-downloads its configuration from the management server)
Remove-Item -Path "C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State\*" -Recurse -Force
# Restart the service
Start-Service -Name HealthService
Give it 3–5 minutes and watch the Operations Console. If the agent turns green, you're done. If it stays gray, you're dealing with something deeper and the rest of this guide has you covered.
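For the PSRemoting route mentioned above, the same stop, clear, start sequence can be wrapped in a small function. This is a minimal sketch, not a vetted tool: `Reset-ScomAgentCache` is a hypothetical name, and it assumes PowerShell Remoting is enabled on the agent and that the agent uses the default install path shown earlier.

```powershell
# Hypothetical helper: run the stop / clear / start sequence on a remote agent.
# Assumes PowerShell Remoting is enabled and the default agent install path.
function Reset-ScomAgentCache {
    param(
        [Parameter(Mandatory)]
        [string]$ComputerName,
        [string]$StatePath = 'C:\Program Files\Microsoft Monitoring Agent\Agent\Health Service State'
    )

    Invoke-Command -ComputerName $ComputerName -ArgumentList $StatePath -ScriptBlock {
        param($Path)
        Stop-Service -Name HealthService -Force
        # Only clear the store if it exists at the assumed location
        if (Test-Path -LiteralPath $Path) {
            Remove-Item -Path (Join-Path $Path '*') -Recurse -Force
        }
        Start-Service -Name HealthService
        # Return the service state so the caller can confirm it is Running
        Get-Service -Name HealthService | Select-Object Name, Status
    }
}

# Example with a placeholder host name:
# Reset-ScomAgentCache -ComputerName 'agent01.domain.com'
```

Returning the service status at the end gives you an immediate sanity check before you go back to the console and wait for the agent to turn green.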
For management server-side issues (SDK service not responding, console hanging on "Connecting…"), your quick first move is to restart the three core SCOM services on the management server itself in this specific order:
# -Name expects the service name, not the display name:
# OMSDK = System Center Data Access Service, cshost = System Center Management Configuration
Stop-Service -Name OMSDK -Force
Stop-Service -Name cshost -Force
Stop-Service -Name HealthService -Force
Start-Service -Name HealthService
Start-Service -Name cshost
Start-Service -Name OMSDK
Order matters here. Starting the Data Access Service (the SDK service) before HealthService can cause the SDK to fail to register properly. Always bring HealthService up first on a management server.
Step 1: Read the Operations Manager Event Log
Before touching anything, your first step when you troubleshoot System Center SCOM should always be to read the event log. Not the System log, but the dedicated Operations Manager event log. On the agent machine, open Event Viewer and navigate to Applications and Services Logs → Operations Manager. On the management server, check the same location.
Filter for Warning and Error events. The event IDs that matter most:
- Event ID 21016: "OpsMgr was unable to set up a communications channel to [server]." In practice this is frequently a TLS or certificate handshake failure. In SCOM 2019+, it is often triggered by TLS 1.2 not being properly enabled in the Windows registry.
- Event ID 21006: "The OpsMgr Connector could not connect to [management server]:5723." Port 5723 is blocked or the management server is unreachable. Check firewall rules immediately.
- Event ID 4000: A management group failed to connect.
- Event ID 2115: "A Bind Data Source in Management Group [name] has posted items to the workflow, but has not received a response in [n] seconds." Your management server is overloaded: too many workflows are running simultaneously, or the SQL server is slow.
- Event ID 7000 (System log): The HealthService failed to start. Usually a permissions issue on the Health Service store directory or a corrupted service binary.
Write down the exact event IDs and timestamps you see. Cross-reference the timestamps against any recent changes (patch deployments, Group Policy updates, certificate renewals), because the correlation is almost always there once you look for it.
If you see Event ID 21016 consistently, jump straight to Step 4. If you see Event ID 21006, work through Step 2 first. If Event ID 2115 appears, you need Step 5 and the database checks in the Advanced section.
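The triage above can also be done from PowerShell instead of clicking through Event Viewer. The sketch below groups recent warnings and errors from the Operations Manager log by event ID and attaches the step this guide points to. The function names and the lookup table are hypothetical helpers written for this example; `Get-WinEvent` with a filter hashtable is the standard cmdlet.

```powershell
# Event IDs named in this guide, mapped to the step that deals with them.
$scomEventSteps = @{
    21016 = 'Step 4'
    21006 = 'Step 2'
    2115  = 'Step 5 and the database checks in the Advanced section'
}

function Get-ScomNextStep {
    param([Parameter(Mandatory)][int]$EventId)
    if ($scomEventSteps.ContainsKey($EventId)) { $scomEventSteps[$EventId] }
    else { 'No specific step; correlate the timestamp with recent changes' }
}

function Get-ScomRecentErrors {
    param([int]$Hours = 24)
    # Level 2 = Error, Level 3 = Warning
    Get-WinEvent -FilterHashtable @{
        LogName   = 'Operations Manager'
        Level     = 2, 3
        StartTime = (Get-Date).AddHours(-$Hours)
    } -ErrorAction SilentlyContinue |
        Group-Object Id |
        Sort-Object Count -Descending |
        Select-Object Count,
            @{ n = 'EventId';  e = { [int]$_.Name } },
            @{ n = 'NextStep'; e = { Get-ScomNextStep -EventId ([int]$_.Name) } }
}

# Example: summarize the last 48 hours on the local machine
# Get-ScomRecentErrors -Hours 48 | Format-Table -AutoSize
```

Grouping by ID first makes the dominant failure obvious when the log holds thousands of entries.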
Step 2: Verify Port 5723 and Firewall Rules
SCOM agents communicate with management servers exclusively over TCP port 5723. I've seen this port blocked by a firewall policy update more times than I can count. It's an easy thing to miss in a quarterly rule review, and it takes down monitoring silently. The agent machines don't always surface this clearly.
From the agent machine, test connectivity to the management server:
# Test TCP port 5723 connectivity
Test-NetConnection -ComputerName "YourManagementServer.domain.com" -Port 5723
# If using SCOM Gateway, also test the gateway server
Test-NetConnection -ComputerName "YourGatewayServer.domain.com" -Port 5723
You want to see TcpTestSucceeded : True. If you get False, the problem is at the network level: a firewall, routing, or the management server's Windows Firewall blocking inbound connections on 5723.
On the management server, verify the inbound rule exists in Windows Firewall. Open Windows Defender Firewall with Advanced Security, go to Inbound Rules, and look for rules named "Microsoft Operations Manager, Agent to Management Server" or similar. If they're missing, you can re-add them:
# Re-create the SCOM agent inbound firewall rule
New-NetFirewallRule -DisplayName "SCOM Agent to MS - TCP 5723" `
    -Direction Inbound `
    -Protocol TCP `
    -LocalPort 5723 `
    -Action Allow `
    -Profile Domain,Private
Also check DNS. An agent that can't resolve the management server FQDN will fail with a generic connectivity error. Run Resolve-DnsName YourManagementServer.domain.com from the agent machine and confirm you get the correct IP address back. Split-brain DNS and stale DNS cache entries are a surprisingly common cause of SCOM agent connectivity failures in large environments.
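The DNS and port checks can be combined so that a stale or split-brain DNS answer shows up next to the TCP result. This is an illustrative sketch: `Test-ScomEndpoint` and its `-ExpectedIPAddress` parameter are hypothetical, while `Resolve-DnsName` and `Test-NetConnection` are the standard Windows cmdlets used elsewhere in this step.

```powershell
function Test-ScomEndpoint {
    param(
        [Parameter(Mandatory)][string]$ManagementServer,
        [string]$ExpectedIPAddress,   # optional: the address you expect DNS to return
        [int]$Port = 5723
    )

    # Collect A-record addresses; the list is empty if the name cannot be resolved
    $addresses = @(
        Resolve-DnsName -Name $ManagementServer -Type A -ErrorAction SilentlyContinue |
            Where-Object { $_.IPAddress } |
            Select-Object -ExpandProperty IPAddress
    )

    $tcp = Test-NetConnection -ComputerName $ManagementServer -Port $Port -WarningAction SilentlyContinue

    [pscustomobject]@{
        ManagementServer   = $ManagementServer
        ResolvedAddresses  = $addresses -join ', '
        DnsMatchesExpected = if ($ExpectedIPAddress) { $addresses -contains $ExpectedIPAddress } else { $null }
        TcpTestSucceeded   = $tcp.TcpTestSucceeded
    }
}

# Example with placeholder values:
# Test-ScomEndpoint -ManagementServer 'YourManagementServer.domain.com' -ExpectedIPAddress '10.0.0.10'
```

If DnsMatchesExpected is False while the port test succeeds, the agent may be talking to the wrong host entirely, which is worth ruling out before you touch firewall rules.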
Step 3: Check Run As Account Credentials
Run As accounts are the mechanism SCOM uses to run monitoring workflows under specific credentials, particularly for monitoring SQL Server, Active Directory, network devices, and any resource that requires elevated access. When a Run As account's password expires or the account gets locked, you'll see a cascade of gray monitors and missed alerts, often with no obvious error surfaced in the console.
In the Operations Console, navigate to Administration → Run As Configuration → Accounts. Look for accounts flagged with a warning icon. But don't rely on the console alone; it doesn't always surface expired credentials proactively.
The faster check is PowerShell using the SCOM shell. On your management server, open the Operations Manager Shell and run:
# Load SCOM module if not auto-loaded
Import-Module OperationsManager
# Connect to management group
New-SCOMManagementGroupConnection -ComputerName "YourManagementServer"
# List all Run As accounts, oldest modification first
Get-SCOMRunAsAccount | Select-Object Name, AccountType, LastModified | Sort-Object LastModified
Cross-check the accounts listed against your Active Directory password expiry policies. Any account whose last-modified date is older than your password expiry window is a candidate for investigation. Update the password in the Operations Console under Administration → Run As Configuration → Accounts → [Right-click account] → Properties → Credentials tab.
After updating credentials, verify distribution. Go to Administration → Run As Configuration → Accounts → [Right-click] → Properties → Distribution tab. If it's set to "More secure," ensure every agent that needs to use it is listed. Missing distribution is one of the most overlooked causes of workflow failures in enterprise SCOM deployments.
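The cross-check against your password expiry window can be scripted. The sketch below flags accounts whose LastModified date is older than a chosen number of days. `Find-StaleRunAsAccount` is a hypothetical helper, and the 90-day default is only a placeholder for your own policy. A stale date is a hint that the stored password may be out of date, not proof that it has expired.

```powershell
function Find-StaleRunAsAccount {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]$Account,
        [int]$MaxAgeDays = 90,            # placeholder: use your own password expiry window
        [datetime]$Now = (Get-Date)
    )
    process {
        # Whole days since the account was last modified in SCOM
        $ageDays = [int][math]::Floor(($Now - $Account.LastModified).TotalDays)
        if ($ageDays -gt $MaxAgeDays) {
            [pscustomobject]@{
                Name         = $Account.Name
                LastModified = $Account.LastModified
                AgeDays      = $ageDays
            }
        }
    }
}

# Example, run from the Operations Manager Shell:
# Get-SCOMRunAsAccount | Find-StaleRunAsAccount -MaxAgeDays 90 | Sort-Object AgeDays -Descending
```

Because the function only reads the Name and LastModified properties, it works on any object that carries them, which makes it easy to dry-run against exported data.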
Step 4: Fix TLS 1.2 and Certificate Errors
This is the big one for organizations that upgraded from SCOM 2016 to SCOM 2019 or 2022. Microsoft enforced TLS 1.2 as the minimum required protocol, and agents on older OS versions, or systems where TLS 1.2 isn't properly configured in the registry, will fail the handshake with Event ID 21016. The fix requires registry changes on both the agent machine and the management server.
On each affected system, verify and set the following registry keys:
# Enable TLS 1.2 for .NET Framework (32-bit and 64-bit)
$regPaths = @(
    "HKLM:\SOFTWARE\Microsoft\.NETFramework\v4.0.30319",
    "HKLM:\SOFTWARE\Wow6432Node\Microsoft\.NETFramework\v4.0.30319"
)
foreach ($path in $regPaths) {
    if (-not (Test-Path $path)) { New-Item -Path $path -Force | Out-Null }
    Set-ItemProperty -Path $path -Name "SchUseStrongCrypto" -Value 1 -Type DWord
    Set-ItemProperty -Path $path -Name "SystemDefaultTlsVersions" -Value 1 -Type DWord
}
# Enable the TLS 1.2 protocol in Schannel for both roles:
# Client covers outbound connections, Server covers inbound ones (needed on the management server)
$tlsRoot = "HKLM:\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.2"
foreach ($role in "Client", "Server") {
    $tlsPath = Join-Path $tlsRoot $role
    if (-not (Test-Path $tlsPath)) { New-Item -Path $tlsPath -Force | Out-Null }
    Set-ItemProperty -Path $tlsPath -Name "Enabled" -Value 1 -Type DWord
    Set-ItemProperty -Path $tlsPath -Name "DisabledByDefault" -Value 0 -Type DWord
}
For untrusted domain or workgroup agent scenarios, SCOM also requires mutual certificate authentication. Agents outside the trust boundary need a certificate issued from a CA that both the agent and management server trust. If you're seeing Event ID 21016 specifically on DMZ or workgroup machines, the certificate chain is the likely culprit.
Bind the agent's certificate to the HealthService:
# On the agent machine, bind the certificate whose subject matches the agent FQDN
MOMCertImport.exe /SubjectName "FQDN-of-agent-machine"
After making registry changes, restart the HealthService. A reboot is more reliable than a service restart alone when Schannel registry changes are involved. Give it 10 minutes after restart before checking the console.
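Before and after changing anything, it helps to see what the registry currently holds. The following read-only sketch reports the .NET and Schannel client values discussed in this step next to the values this guide sets. `Get-Tls12RegistryState` is a hypothetical name. A missing value is reported as not matching, which does not by itself prove TLS 1.2 is disabled, since recent Windows versions enable it by default.

```powershell
function Get-Tls12RegistryState {
    $net  = 'HKLM:\SOFTWARE\Microsoft\.NETFramework\v4.0.30319'
    $wow  = 'HKLM:\SOFTWARE\Wow6432Node\Microsoft\.NETFramework\v4.0.30319'
    $schc = 'HKLM:\SYSTEM\CurrentControlSet\Control\SecurityProviders\SCHANNEL\Protocols\TLS 1.2\Client'

    # Each entry: registry path, value name, and the value this guide sets
    $checks = @(
        @{ Path = $net;  Name = 'SchUseStrongCrypto';       Expected = 1 }
        @{ Path = $net;  Name = 'SystemDefaultTlsVersions'; Expected = 1 }
        @{ Path = $wow;  Name = 'SchUseStrongCrypto';       Expected = 1 }
        @{ Path = $wow;  Name = 'SystemDefaultTlsVersions'; Expected = 1 }
        @{ Path = $schc; Name = 'Enabled';                  Expected = 1 }
        @{ Path = $schc; Name = 'DisabledByDefault';        Expected = 0 }
    )

    foreach ($c in $checks) {
        # $actual stays $null when the key or the value does not exist
        $actual = (Get-ItemProperty -Path $c.Path -Name $c.Name -ErrorAction SilentlyContinue).($c.Name)
        [pscustomobject]@{
            Path     = $c.Path
            Name     = $c.Name
            Expected = $c.Expected
            Actual   = $actual
            Matches  = ($actual -eq $c.Expected)
        }
    }
}

# Example: show only the values that differ from what this guide sets
# Get-Tls12RegistryState | Where-Object { -not $_.Matches } | Format-Table -AutoSize
```

Running it on both the agent and the management server gives you a quick side-by-side view of where the two ends disagree.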
Step 5: Relieve an Overloaded Management Server
If you're seeing Event ID 2115 (a Bind Data Source that has posted items to the workflow but has not received a response), or your Operations Console is consistently slow to load alerts, your management server is overwhelmed. This happens when the workflow queue backs up, usually due to slow SQL response times, too many management packs loaded simultaneously, or management pack rules and discoveries that are running on too short an interval.
Start in the SCOM shell by listing your management servers and the most recent critical alerts that mention a threshold. This doesn't read the queue depth directly, but it shows which servers and objects are involved:
# List management servers and their roles
Get-SCOMManagementServer | Select-Object DisplayName, IsGateway, IsRootManagementServer
# List the 20 most recent critical alerts whose name mentions a threshold
Get-SCOMAlert -Severity 2 | Where-Object { $_.Name -like "*threshold*" } |
    Select-Object Name, TimeRaised, MonitoringObjectDisplayName |
    Sort-Object TimeRaised -Descending | Select-Object -First 20
On the SQL Server hosting your OperationsManager database, run this query to check for database bloat and table sizes:
-- Run in SQL Server Management Studio against the OperationsManager database
SELECT
    t.name AS TableName,
    s.name AS SchemaName,
    p.rows AS RowCounts,
    SUM(a.total_pages) * 8 / 1024 AS TotalSpaceMB,
    SUM(a.used_pages) * 8 / 1024 AS UsedSpaceMB
FROM sys.tables t
    INNER JOIN sys.indexes i ON t.object_id = i.object_id
    INNER JOIN sys.partitions p ON i.object_id = p.object_id AND i.index_id = p.index_id
    INNER JOIN sys.allocation_units a ON p.partition_id = a.container_id
    INNER JOIN sys.schemas s ON t.schema_id = s.schema_id
WHERE t.name NOT LIKE 'dt%' AND t.is_ms_shipped = 0 AND i.object_id > 255
GROUP BY t.name, s.name, p.rows
ORDER BY TotalSpaceMB DESC;
If tables such as Alert, ManagedEntity, or the event and state change tables are enormous, your grooming settings are too conservative. In the Operations Console, go to Administration → Settings → Database Grooming and reduce retention on closed alerts and resolved states. A typical healthy environment keeps closed alerts for 7 days, not 30+. After adjusting grooming, you can kick it off manually in SSMS by executing the p_PartitioningAndGrooming stored procedure against the OperationsManager database. Grooming of this database is driven by a SCOM rule that calls that procedure rather than by SQL Agent jobs, and each run is recorded in the InternalJobHistory table.
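To confirm that grooming is actually running, you can read the most recent entries from the InternalJobHistory table in the OperationsManager database, which is where, as far as I know, the built-in partitioning and grooming procedure records its runs. The sketch below is read-only. `Get-ScomGroomingHistory` is a hypothetical helper, and it selects all columns rather than assuming specific column names beyond InternalJobHistoryId.

```powershell
function Get-ScomGroomingHistory {
    param(
        [Parameter(Mandatory)][string]$SqlServer,
        [string]$Database = 'OperationsManager',
        [int]$Top = 20
    )

    $conn = New-Object System.Data.SqlClient.SqlConnection(
        "Server=$SqlServer;Database=$Database;Integrated Security=True;")
    $cmd = $conn.CreateCommand()
    # $Top is typed as [int], so it is safe to embed in the query text
    $cmd.CommandText = "SELECT TOP ($Top) * FROM dbo.InternalJobHistory ORDER BY InternalJobHistoryId DESC;"

    $table = New-Object System.Data.DataTable
    try {
        $conn.Open()
        $table.Load($cmd.ExecuteReader())
        # Emit the rows so they can be piped to Format-Table or Where-Object
        $table.Rows
    }
    finally {
        $conn.Dispose()
    }
}

# Example with a placeholder server name:
# Get-ScomGroomingHistory -SqlServer 'YourSQLServer' | Format-Table -AutoSize
```

If the newest rows are days old or show a failure status, grooming is the first thing to fix before tuning anything else on the database side.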
Advanced Troubleshooting
Once you've worked through the standard steps and things still aren't right, it's time to go deeper. Here's where to look for the harder-to-find issues when you troubleshoot System Center SCOM at the enterprise level.
Group Policy Conflicts with the SCOM Agent
Group Policy can quietly break SCOM in ways that are maddening to diagnose. The most common culprit: a GPO that enforces Windows Firewall settings and overwrites the rules you created manually, or a GPO that restricts the LocalSystem or NetworkService account from running services. Run gpresult /H gpresult.html on the affected agent machine and look for firewall policies, service account restrictions, or audit policies that conflict with SCOM's requirements. The HealthService runs as LocalSystem by default, so any GPO that limits LocalSystem's network access will break agent communication.
Registry Investigation for Health Service Failures
The HealthService stores its management group configuration here:
HKEY_LOCAL_MACHINE\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups
Under each management group subkey, you'll find the management server name, port, and heartbeat interval. If these values are wrong, particularly if the management server FQDN changed after an upgrade or rename, the agent will never connect. Update the NetworkName and Port values to match the current management server, then restart HealthService. This is faster than uninstalling and reinstalling the agent in most cases.
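To inspect these values without opening regedit, a read-only sketch like the following walks the Agent Management Groups key and prints every subkey that carries a NetworkName value. `Get-ScomAgentManagementGroup` is a hypothetical helper. It deliberately searches recursively rather than assuming the exact subkey layout, which can differ between agent versions.

```powershell
function Get-ScomAgentManagementGroup {
    $root = 'HKLM:\SOFTWARE\Microsoft\Microsoft Operations Manager\3.0\Agent Management Groups'
    if (-not (Test-Path -LiteralPath $root)) {
        Write-Warning "Key not found: $root"
        return
    }

    Get-ChildItem -LiteralPath $root -Recurse | ForEach-Object {
        $props = Get-ItemProperty -LiteralPath $_.PSPath
        # Only report keys that actually hold a management server name
        if ($props -and $props.PSObject.Properties['NetworkName']) {
            [pscustomobject]@{
                RegistryKey = $_.Name
                NetworkName = $props.NetworkName
                Port        = $props.Port
            }
        }
    }
}

# Example: run on the agent machine from an elevated prompt
# Get-ScomAgentManagementGroup | Format-List
```

Compare the NetworkName it reports with the FQDN you tested in the connectivity step; a mismatch after a server rename is exactly the situation described above.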
Event Viewer: System Center Specific Channels
Beyond the Operations Manager channel, also check Applications and Services Logs → Microsoft → Windows → Windows Remote Management for WinRM errors if you're using agentless monitoring. For management pack workflow failures, the Operations Manager channel on the management server will show Event ID 1102 (workflow failed) and Event ID 1103 (module failed to load) with specific module names. These point directly at which management pack rule is broken and often which target system is causing the failure.
SQL Server Connectivity From the Management Server
If your SDK service (System Center Data Access Service) fails to start, nine times out of ten it's a SQL connectivity problem. Verify the management server can reach SQL:
# Test SQL connectivity from management server
$connectionString = "Server=YourSQLServer;Database=OperationsManager;Integrated Security=True;"
$conn = New-Object System.Data.SqlClient.SqlConnection($connectionString)
Try { $conn.Open(); Write-Host "SQL Connection: SUCCESS" }
Catch { Write-Host "SQL Connection FAILED: $_" }
Finally { $conn.Close() }
Also check that the SCOM service account has the sdk_users and configsvc_users database roles in the OperationsManager database. These roles are required and are sometimes stripped after a SQL security hardening pass.
Pending Management and Agent Push Failures
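When agents are stuck in "Pending Management" in the console (Administration → Device Management → Pending Management), it's almost always a WMI or RPC connectivity issue between the management server and the target machine. From the management server, test the ports that push installation relies on: Test-NetConnection -ComputerName targetmachine -Port 135 (RPC endpoint mapper) and Test-NetConnection -ComputerName targetmachine -Port 445 (SMB). If you also manage the target over WinRM, Test-WSMan -ComputerName targetmachine confirms that it responds; if that fails, run winrm quickconfig on the target. For push installation, also verify that File and Printer Sharing is enabled and that the account performing the push (the management server's action account, or its machine account when that runs as Local System) has local admin rights on the target.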
If you need to escalate to Microsoft Support, gather the following first: the output of Get-SCOMManagementServer and Get-SCOMAlert -Severity 2, SQL error logs, and the SCOM installation version (check Help → About in the console). Having these ready cuts your support call time dramatically. Visit Microsoft Support to open a case or find your Premier contact.
Prevention & Best Practices
I've seen SCOM environments that run smoothly for years and ones that are in constant firefighting mode. The difference almost always comes down to a handful of operational habits. Once you've got your current issues resolved, these practices will keep you out of trouble when you troubleshoot System Center SCOM in the future.
Set up SCOM to monitor itself. This sounds obvious but most environments skip it. Import the System Center Operations Manager Management Pack for Operations Manager itself. It ships free with SCOM and monitors management server health, database size, queue depth, and workflow failures. Configure alerts on these monitors. You want to know your database is approaching capacity or your queue is backing up before it becomes a crisis, not after.
Control your management pack sprawl. Every management pack you import adds workflows, discoveries, and rules, all of which consume management server CPU, memory, and database I/O. Before importing any community or vendor management pack, read the override recommendations in its guide. Tune discovery intervals. Most out-of-the-box management packs run discoveries every 4 hours. If you have 5,000 monitored objects, that's an enormous and often unnecessary load. Bump non-critical discovery intervals to 12 or 24 hours.
Maintain a certificate rotation calendar. If you're using certificate-based authentication for gateway servers or untrusted-domain agents, put certificate expiry dates in your team calendar with a 60-day advance warning. An expired SCOM certificate causes a complete monitoring blackout for all agents behind that gateway: no alerts, no heartbeats, nothing. It's preventable with a simple calendar reminder.
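The calendar reminder can be backed by a scheduled check. The sketch below flags certificates that expire within a warning window, using the 60 days suggested above as the default. `Find-ExpiringCertificate` is a hypothetical helper, and it assumes the SCOM certificate lives in the local machine's Personal store, which is the usual location but worth confirming in your environment.

```powershell
function Find-ExpiringCertificate {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]$Certificate,
        [int]$WarningDays = 60,
        [datetime]$Now = (Get-Date)
    )
    process {
        # Whole days until the certificate's NotAfter date (negative if already expired)
        $daysLeft = [int][math]::Floor(($Certificate.NotAfter - $Now).TotalDays)
        if ($daysLeft -le $WarningDays) {
            [pscustomobject]@{
                Subject    = $Certificate.Subject
                Thumbprint = $Certificate.Thumbprint
                NotAfter   = $Certificate.NotAfter
                DaysLeft   = $daysLeft
            }
        }
    }
}

# Example: run on a gateway or untrusted-domain agent
# Get-ChildItem Cert:\LocalMachine\My | Find-ExpiringCertificate -WarningDays 60
```

Scheduling this on each gateway and mailing any output turns the calendar entry into a backstop rather than the only line of defense.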
Schedule weekly database grooming verification. Set a recurring task to check that SCOM grooming ran successfully each week. For the OperationsManager database, query the InternalJobHistory table (newest entries first) and confirm that recent runs completed without errors. Grooming failures silently allow database bloat that eventually causes performance degradation across your entire monitoring infrastructure.
- Create a dedicated SCOM maintenance mode PowerShell script and schedule it around your monthly patch cycle. Agents rebooted without maintenance mode generate hundreds of false alerts that erode trust in your monitoring data.
- Set the OperationsManager and OperationsManagerDW databases to "Simple" recovery model only if you don't need point-in-time SQL restores. This prevents transaction log runaway growth, which is one of the most common SCOM disk space emergencies.
- Document every management pack version you have installed and create a test management group (even a single-VM lab) where you validate new management pack imports before pushing to production. A bad MP import can destabilize a production management server within hours.
- Enable SCOM agent proxy for all monitored nodes that also host monitored applications (SQL, IIS, AD). Without the proxy setting enabled, distributed application health models won't resolve correctly and you'll get misleading gray states.
Frequently Asked Questions
Why are my SCOM agents showing gray health state even though the machines are online and responding to ping?
Gray health state means the management server hasn't received a heartbeat from the agent within the heartbeat failure threshold (default: 3 missed heartbeats at 60-second intervals). The machine being pingable proves network reachability at ICMP level, but SCOM communicates over TCP 5723, and that port might be blocked even if ping works. Check Event ID 21006 or 21016 in the Operations Manager event log on the agent machine. The most common causes are a blocked firewall port, the HealthService not running on the agent, or a TLS mismatch in SCOM 2019/2022 environments. Work through Step 2 in this guide first, then Step 4 if you see TLS-related event IDs.
My SCOM Operations Console keeps hanging on "Connecting to server...". How do I fix it?
This almost always means the System Center Data Access Service (SDK service) on the management server is either stopped, slow to respond, or having trouble reaching the OperationsManager SQL database. RDP to your management server and check that all three core services are running: HealthService, System Center Management Configuration, and System Center Data Access Service. If the SDK service is stopped, check the Application event log for errors before starting it; starting a crashing service repeatedly without diagnosing it first wastes time. Then test SQL connectivity using the PowerShell snippet in the Advanced section above. If SQL is unreachable, fix that first before touching the services.
How do I push SCOM agents to remote machines without using the console's built-in push installer?
The console's push installer requires WMI and RPC access between the management server and the target, which is often blocked in hardened environments. The alternative is manual agent deployment using the MOMAgent.msi installer from your SCOM installation media, deployed via your software distribution system (SCCM, Intune, or a simple PowerShell script via GPO startup script). Run the MSI with these properties: MANAGEMENT_GROUP="YourMGName" MANAGEMENT_SERVER_DNS="YourMSFQDN" SECURE_PORT=5723 USE_MANUALLY_SPECIFIED_SETTINGS=1. After installation, approve the agent in the console under Administration → Device Management → Pending Management if manual approval is required in your management group settings.
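A scripted install along these lines might look like the sketch below. The function name is hypothetical. The four properties from the answer above are used as given; `/qn`, `USE_SETTINGS_FROM_AD=0`, `ACTIONS_USE_COMPUTER_ACCOUNT=1`, and `AcceptEndUserLicenseAgreement=1` are additions that, to my knowledge, the documented MOMAgent.msi command line uses for a silent install, so verify them against your SCOM version before rolling this out.

```powershell
function Get-ScomAgentInstallArguments {
    param(
        [Parameter(Mandatory)][string]$MsiPath,
        [Parameter(Mandatory)][string]$ManagementGroup,
        [Parameter(Mandatory)][string]$ManagementServer,
        [int]$Port = 5723
    )

    # Build the msiexec argument list; quoted values tolerate spaces
    @(
        '/i', "`"$MsiPath`"",
        '/qn',                                  # silent install (assumption, see lead-in)
        'USE_SETTINGS_FROM_AD=0',               # assumption, see lead-in
        'USE_MANUALLY_SPECIFIED_SETTINGS=1',
        "MANAGEMENT_GROUP=`"$ManagementGroup`"",
        "MANAGEMENT_SERVER_DNS=$ManagementServer",
        "SECURE_PORT=$Port",
        'ACTIONS_USE_COMPUTER_ACCOUNT=1',       # assumption, see lead-in
        'AcceptEndUserLicenseAgreement=1'       # assumption, see lead-in
    )
}

# Example with placeholder values:
# $installArgs = Get-ScomAgentInstallArguments -MsiPath 'C:\Temp\MOMAgent.msi' `
#     -ManagementGroup 'YourMGName' -ManagementServer 'YourMSFQDN'
# Start-Process -FilePath msiexec.exe -ArgumentList $installArgs -Wait
```

Keeping the argument list in its own function makes it easy to log or review the exact command line before your distribution system runs it on hundreds of machines.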
We get thousands of SCOM alerts every day and our team ignores most of them. How do we fix alert fatigue?
Alert fatigue in SCOM is almost always a management pack tuning problem. The default thresholds that ship with most management packs, Microsoft's included, are calibrated for average environments and will fire constantly in busy production systems. The fix is creating overrides: in the Operations Console, right-click any monitor generating noise, select Overrides → Override the Monitor → For all objects of class [X], and raise the thresholds or extend the consecutive sample requirements before an alert fires. Also audit your alert resolution policy: alerts that auto-resolve should have their Resolution State configured properly. Consider creating a dedicated weekly "Alert Review" process where the team systematically tunes the top 10 noisiest rules until the alert volume is manageable.
Can I upgrade from SCOM 2016 to SCOM 2022 in-place without rebuilding my management servers?
Yes, Microsoft supports an in-place upgrade path from SCOM 2016 through 2019 to 2022, but there are important prerequisites. Your SQL Server must be a version that SCOM 2022 supports; confirm this against Microsoft's system requirements before you start, because support for older SQL versions was dropped. Your OS on the management servers must be Windows Server 2019 or 2022. TLS 1.2 must be enforced in the registry on all components before you run the SCOM 2022 installer, or the installer itself will fail with a cryptic error. Back up the OperationsManager and OperationsManagerDW databases before starting. Upgrade the root management server first, then secondary management servers, then gateways, then agents. Never upgrade agents before the management servers; an agent on a newer version than its management server will fail to connect.
How do I put a server in SCOM maintenance mode using PowerShell so I can automate it around patch reboots?
This is something every SCOM admin should have scripted. Here's the pattern using the SCOM PowerShell module: first load the module and connect (New-SCOMManagementGroupConnection -ComputerName "YourMSFQDN"), then get the monitoring object for your server ($instance = Get-SCOMClassInstance -Name "YourServer.domain.com"), then start maintenance ($instance | Start-SCOMMaintenanceMode -EndTime (Get-Date).AddHours(2) -Reason "PlannedOther" -Comment "Monthly patching"). The -Reason parameter accepts values like PlannedOther, PlannedHardwareMaintenance, or PlannedOperatingSystemReconfiguration. Wrap this in a scheduled task that fires 15 minutes before your patch window starts and you'll eliminate the alert storm that currently follows every Patch Tuesday reboot.
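Put together, the pattern described in this answer could look like the following sketch. `Start-PatchMaintenance` is a hypothetical wrapper. The cmdlets and parameters inside it are the ones named above, and the two-hour default is only a placeholder. Note that `Get-SCOMClassInstance -Name` can return more than one object for a server, and this sketch puts all of them into maintenance mode.

```powershell
function Start-PatchMaintenance {
    param(
        [Parameter(Mandatory)][string]$ManagementServer,
        [Parameter(Mandatory)][string[]]$ComputerName,
        [double]$Hours = 2,                       # placeholder window length
        [string]$Comment = 'Monthly patching'
    )

    Import-Module OperationsManager
    New-SCOMManagementGroupConnection -ComputerName $ManagementServer

    $endTime = (Get-Date).AddHours($Hours)
    foreach ($name in $ComputerName) {
        $instances = Get-SCOMClassInstance -Name $name
        if (-not $instances) {
            Write-Warning "No monitoring object found for $name"
            continue
        }
        # Every matching monitoring object enters maintenance mode until $endTime
        $instances | Start-SCOMMaintenanceMode -EndTime $endTime -Reason PlannedOther -Comment $Comment
    }
}

# Example with placeholder names, suitable for a scheduled task:
# Start-PatchMaintenance -ManagementServer 'YourMSFQDN' -ComputerName 'YourServer.domain.com' -Hours 2
```

Passing a list of server names lets one scheduled task cover an entire patch group instead of one task per machine.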