How to Troubleshoot Exchange Server Problems
Why Exchange Server Problems Keep Happening
I've seen this scene play out more times than I can count: it's Monday morning, someone walks in holding a coffee mug and says "email is down." The next four hours are a blur of PowerShell windows, Event Viewer scrolls, and a growing queue of angry Slack messages asking when Outlook will work again. Exchange Server troubleshooting isn't glamorous work, but it's some of the most urgent work in enterprise IT.
Exchange Server is a genuinely complex piece of infrastructure. It's not just a mail server. It's a stack of interdependent services (transport, mailbox, client access, and, in Exchange 2016 and earlier, unified messaging), each talking to Active Directory, DNS, and IIS, and in many deployments to a Database Availability Group (DAG) spread across multiple servers. When any one of those layers hiccups, the symptom you see at the surface (email not sending, OWA throwing a 503, Outlook prompting for credentials in a loop) rarely points directly at the actual cause.
The most common root causes I encounter fall into four broad families. First, service failures: a Windows service like MSExchangeTransport or MSExchangeIS crashes and doesn't restart cleanly. Second, mail flow disruptions: the transport pipeline backs up, a connector loses its smart host, or a spam filter starts deferring everything. Third, database problems: a mailbox database refuses to mount after a dirty shutdown, usually following a power event or an abrupt failover. Fourth, client connectivity failures: Autodiscover breaks, an SSL certificate expires, or an IIS application pool gets recycled and never comes back.
What makes Exchange Server troubleshooting particularly frustrating is that Microsoft's own error messages are often maddeningly vague. You'll get an event ID 1003 in the Application log with the text "The Microsoft Exchange Transport service encountered a non-transient failure", which tells you almost nothing about what to actually fix. This guide cuts through that noise.
Whether you're dealing with Exchange Server 2016, Exchange Server 2019, or still maintaining a 2013 environment (no judgment, I've been there), the diagnostic patterns are consistent. The commands change slightly between versions, but the mental model is the same.
The Quick Fix: Try This First
Before you spend an hour digging through logs, run a quick service health check. This single PowerShell command saves enormous amounts of time because it gives you a structured snapshot of what's actually broken right now. Open the Exchange Management Shell as Administrator (not a regular PowerShell window, which doesn't have the Exchange cmdlets loaded) and run:
Test-ServiceHealth | Where-Object {$_.RequiredServicesRunning -eq $false} | Format-List
That command checks the Windows services that each installed Exchange role requires and returns only the roles where something isn't running when it should be. If you see any output at all, that's your starting point. Make a note of every service name listed under ServicesNotRunning.
Next, check the mail queue in under 30 seconds:
Get-Queue | Where-Object {$_.MessageCount -gt 0} | Sort-Object MessageCount -Descending | Format-Table -AutoSize
If you see a queue called Submission sitting at hundreds or thousands of messages, the transport service has a problem and email is backing up. If you see an Unreachable queue growing, it's a routing or DNS issue. If the queue looks fine but users still can't send, the problem is upstream, likely in Active Directory or the client access layer.
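If you'd rather script this first look, the rules of thumb above can be applied to an exported queue snapshot. This is a minimal Python sketch, not an Exchange tool. It assumes you've run Get-Queue | Select-Object Identity, Status, MessageCount | Export-Csv queues.csv -NoTypeInformation in the Exchange Management Shell. The 100-message threshold and the function name triage_queues are illustrative assumptions, not Exchange defaults.

```python
import csv
import io

# Illustrative threshold, not an Exchange default: a Submission queue at or
# above this many messages is treated as backed up.
BACKLOG_THRESHOLD = 100


def triage_queues(csv_text: str, threshold: int = BACKLOG_THRESHOLD) -> list[str]:
    """Apply the quick-check rules of thumb to a Get-Queue CSV export.

    Expects at least the Identity and MessageCount columns.
    """
    findings = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        identity = row["Identity"]
        count = int(row["MessageCount"])
        # Identities look like EX01\Submission or EX01\Unreachable; the part
        # after the last backslash names the queue.
        queue_name = identity.rsplit("\\", 1)[-1]
        if queue_name == "Submission" and count >= threshold:
            findings.append(f"{identity}: {count} messages, transport service problem")
        elif queue_name == "Unreachable" and count > 0:
            # One snapshot can't show growth, so any non-empty Unreachable queue is flagged.
            findings.append(f"{identity}: {count} messages, routing or DNS problem")
    if not findings:
        findings.append("Queues look normal; if users still can't send, check AD and client access")
    return findings
```

To use it, read the exported file with encoding="utf-8-sig", which also copes with a byte-order mark if one is present, and pass the text to triage_queues. Because a single snapshot can't show a trend, run the export twice a few minutes apart if you need to confirm that the Unreachable queue is actually growing.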
Now restart the core services in the right order. Order matters with Exchange; don't blast-restart everything at once:
Restart-Service MSExchangeTransport -Force
Restart-Service MSExchangeFrontEndTransport -Force
Restart-Service W3SVC -Force
Once those services are back (give them 60–90 seconds), check whether OWA loads at https://[yourserver]/owa and whether a test message moves through the queue. If it does, you've resolved the most common class of Exchange failures in under five minutes. If it doesn't, keep reading; the sections below dig into each failure type systematically.
Check Which Exchange Services Are Running
Every Exchange Server troubleshooting session should start by establishing a baseline of what's actually running. The Services console (services.msc) is fine for a visual check, but it doesn't understand Exchange's service dependencies. Use the shell.
Run this in Exchange Management Shell to get a complete service health matrix:
Test-ServiceHealth
This returns a table showing every Exchange server role and whether its required services are running. Look for any row where RequiredServicesRunning is False. Common culprits include:
- MSExchangeTransport: the Microsoft Exchange Transport service (the successor to the Hub Transport role). If this is down, email is not moving. Period.
- MSExchangeIS: the Information Store. If this is down, no mailboxes are accessible.
- MSExchangeRPC: RPC Client Access. If this is down, Outlook MAPI profiles will fail to connect.
- MSExchangeADTopology: the Active Directory Topology service. If this fails, Exchange can't read its own configuration from AD.
For any stopped service, check the event logs before restarting it. In Event Viewer, look under Windows Logs > System for Service Control Manager events 7031 and 7034, which record that a service terminated unexpectedly. Then check Windows Logs > Application for entries from that service's MSExchange source around the same time; those usually contain the actual stop reason. An Event ID 1000 (source Application Error) in the Application log records an application crash and lists the faulting module. That tells you whether an Exchange DLL or a third-party AV product caused the crash.
Once you've reviewed the logs, restart stopped services one at a time and watch whether each stays running for at least two minutes. A service that starts and then stops again within 30 seconds usually has an unresolved dependency problem, most often AD connectivity or a missing database file.
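The two-minute and 30-second rules above can be written as a small classifier. This is a minimal Python sketch under a stated assumption: you've sampled the service's state after restarting it, for example by polling Get-Service MSExchangeTransport every few seconds. Each sample is recorded as a (seconds_since_restart, is_running) pair. The sampling approach and the function name classify_restart are hypothetical.

```python
# Thresholds taken from this guide's rule of thumb.
STABLE_AFTER_SECONDS = 120     # should stay up for at least two minutes
FLAPPING_WITHIN_SECONDS = 30   # stopping this soon suggests a dependency problem


def classify_restart(samples: list[tuple[float, bool]]) -> str:
    """Classify a restarted service from (seconds_since_restart, is_running) samples.

    "flapping": seen stopped within 30 seconds of the restart
    "unstable": seen stopped later than that
    "stable": running in every sample, watched for at least two minutes
    "inconclusive": running so far, but not watched for two minutes yet
    """
    ordered = sorted(samples)
    for elapsed, running in ordered:
        if not running:
            return "flapping" if elapsed <= FLAPPING_WITHIN_SECONDS else "unstable"
    if ordered and ordered[-1][0] >= STABLE_AFTER_SECONDS:
        return "stable"
    return "inconclusive"
```

A "flapping" result is the case where you go back to AD connectivity and the database files rather than restarting the service again.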
You'll know this step worked when: Test-ServiceHealth returns all rows with RequiredServicesRunning: True.
Test End-to-End Mail Flow
Running services don't guarantee that email is flowing. Exchange has a dedicated cmdlet for end-to-end mail flow testing that sends a real test message through the transport pipeline and measures the round trip. This is the most reliable way to confirm whether your Exchange Server is actually delivering email.
Run this from Exchange Management Shell:
Test-Mailflow -TargetMailboxServer [YourExchangeServerName] -ExecutionTimeout 120
Replace [YourExchangeServerName] with the actual hostname of your Exchange server (not a load balancer or DNS alias; the actual server name). -ExecutionTimeout is in seconds, so 120 gives the test two minutes. The result shows TestMailflowResult (either Success or *FAILED*) plus a MessageLatencyTime. If latency is over 30 seconds on an internal test, your transport pipeline has a bottleneck, often an antivirus or anti-spam scan agent adding overhead to every message.
If Test-Mailflow fails outright, check your Send and Receive connectors. Many mail flow failures trace back to a misconfigured connector, particularly after an IP address change, a firewall rule modification, or a certificate renewal that wasn't applied to the connector properly.
Check your connectors:
Get-SendConnector | Format-List Name, Enabled, SmartHosts, AddressSpaces
Get-ReceiveConnector | Format-List Name, Enabled, Bindings, RemoteIPRanges
Look for any connector where Enabled is False; unless it was disabled on purpose, that's an immediate mail flow killer. Also verify that your outbound Send Connector's SmartHosts entry still resolves in DNS and is reachable on port 25. This applies if you're relaying through a smart host such as a spam filter appliance.
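The smart host check involves two conditions: the name resolves in DNS, and the host accepts a TCP connection on port 25. Both can be probed with Python's standard library. This sketch only opens and closes a TCP session and doesn't speak SMTP. Run it on the Exchange server itself so it sees the same DNS servers and firewall path that the transport service does. The function name check_smart_host is hypothetical.

```python
import socket


def check_smart_host(host: str, port: int = 25, timeout: float = 5.0) -> str:
    """Report whether host resolves in DNS and accepts a TCP connection on port."""
    try:
        infos = socket.getaddrinfo(host, port, type=socket.SOCK_STREAM)
    except socket.gaierror as exc:
        return f"{host}: DNS lookup failed ({exc})"
    addresses = sorted({info[4][0] for info in infos})
    for address in addresses:
        try:
            # Open a TCP session and close it straight away; no SMTP is spoken.
            with socket.create_connection((address, port), timeout=timeout):
                return f"{host}: reachable at {address} port {port}"
        except OSError:
            continue  # try the next address, if any
    return f"{host}: resolves to {', '.join(addresses)} but port {port} is unreachable"
```

For example, print(check_smart_host("smarthost.example.com")) reports on a host. smarthost.example.com is a placeholder for the value in your connector's SmartHosts field. A DNS failure points at name resolution. A name that resolves but is unreachable on port 25 points at a firewall or at the appliance itself.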
You'll know this step worked when: Test-Mailflow returns TestMailflowResult: Success with a latency under 10 seconds.
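If you run Test-Mailflow on a schedule, the pass/fail rule can be applied in code. This Python sketch assumes you've captured TestMailflowResult and MessageLatencyTime as text, with the latency in the hh:mm:ss(.fffffff) form that the shell displays (for example, 00:00:03.3446540). The 10- and 30-second thresholds come from this guide, not from Exchange, and the function names are hypothetical.

```python
import re

GOOD_LATENCY_S = 10.0   # this guide's target for an internal test
SLOW_LATENCY_S = 30.0   # above this, suspect a scanning agent in the pipeline

# Optional days, then hours:minutes:seconds, then up to 7 fractional digits.
_TIMESPAN = re.compile(r"^(?:(\d+)\.)?(\d{1,2}):(\d{2}):(\d{2})(?:\.(\d{1,7}))?$")


def timespan_to_seconds(text: str) -> float:
    """Convert a TimeSpan-style string such as 00:00:03.3446540 to seconds."""
    match = _TIMESPAN.match(text.strip())
    if not match:
        raise ValueError(f"not a TimeSpan: {text!r}")
    days, hours, minutes, seconds, fraction = match.groups()
    total = int(days or 0) * 86400 + int(hours) * 3600 + int(minutes) * 60 + int(seconds)
    if fraction:
        total += int(fraction) / 10 ** len(fraction)
    return total


def evaluate_mailflow(result: str, latency: str) -> str:
    """Apply this guide's thresholds to one Test-Mailflow result."""
    if result.strip() != "Success":
        return "FAILED: check Send and Receive connectors"
    seconds = timespan_to_seconds(latency)
    if seconds < GOOD_LATENCY_S:
        return f"OK ({seconds:.1f}s)"
    if seconds > SLOW_LATENCY_S:
        return f"SLOW ({seconds:.1f}s): look for scanning transport agents"
    return f"BORDERLINE ({seconds:.1f}s): keep an eye on transport load"
```

The guide doesn't say what to do between 10 and 30 seconds. The sketch reports that range as borderline rather than guessing at a cause.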
Mount a Dismounted Mailbox Database
A dismounted mailbox database is one of the most disruptive Exchange problems you can face. Affected users get an Outlook error like "Cannot expand the folder. The set of folders cannot be opened. The information store could not be opened", or they simply can't log into OWA at all. The mailbox database didn't disappear; it's just not accessible.
First, check which databases are mounted and which aren't:
Get-MailboxDatabase -Status | Format-Table Name, Mounted, Server
Any database showing Mounted: False is your problem. Try mounting it:
Mount-Database -Identity "Mailbox Database Name"
If that command fails with the error MapiExceptionCallFailed: Unable to mount database, the database has likely suffered a dirty shutdown. Exchange didn't detach it cleanly, and some committed transactions exist only in the transaction logs, not yet in the database file. This is common after a power outage, a forced server restart, or a storage array failover.
Check the database state with ESEUTIL:
cd "C:\Program Files\Microsoft\Exchange Server\V15\Mailbox\[Database Name]"
eseutil /mh "Mailbox Database.edb" | findstr "State"
If the output shows State: Dirty Shutdown, run a soft recovery, which replays the transaction logs into the database (this is safe and non-destructive):
eseutil /r E0n /l "C:\path\to\logs" /d "C:\path\to\database"
Replace E0n with the actual log prefix for your database, such as E00. It's the start of the log file names in the database's log folder, and Get-MailboxDatabase shows it as LogFilePrefix. After recovery, retry Mount-Database. If replay fails because logs are missing, you're looking at restoring from backup or, in a DAG environment, activating a healthy passive copy.
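Before running eseutil /r, it helps to confirm that every log the database needs is actually on disk. The eseutil /mh header includes a State line and a Log Required line that gives the first and last log generations replay needs. For a cleanly shut down database, that line reads 0-0. This Python sketch parses those two lines and checks a list of file names from the log folder. It assumes the usual log naming: the three-character prefix followed by the generation number as eight hexadecimal digits (for example, E00 plus 000000C5 gives E00000000C5.log). The function names are hypothetical.

```python
import re

# Patterns for two lines of `eseutil /mh` output, e.g.
#            State: Dirty Shutdown
#     Log Required: 12-14 (0xc-0xe)
_STATE = re.compile(r"^[ \t]*State:[ \t]*(.+?)[ \t]*$", re.MULTILINE)
_LOG_REQUIRED = re.compile(r"^[ \t]*Log Required:[ \t]*(\d+)-(\d+)", re.MULTILINE)


def parse_header(mh_output: str) -> tuple[str, int, int]:
    """Return (state, first_required_generation, last_required_generation)."""
    text = mh_output.replace("\r", "")  # tolerate Windows CRLF line endings
    state = _STATE.search(text)
    required = _LOG_REQUIRED.search(text)
    if not state or not required:
        raise ValueError("State or Log Required line not found")
    return state.group(1), int(required.group(1)), int(required.group(2))


def missing_logs(prefix: str, first: int, last: int, present_names: list[str]) -> list[str]:
    """List required log files that are absent from present_names.

    Assumes prefix + generation as 8 hex digits + ".log", e.g.
    E00 + 0000000C -> E000000000C.log. A 0-0 range means no logs are needed.
    """
    if first == 0 and last == 0:
        return []
    present = {name.upper() for name in present_names}  # NTFS names are case-insensitive
    expected = [f"{prefix}{generation:08X}.log" for generation in range(first, last + 1)]
    return [name for name in expected if name.upper() not in present]
```

In practice you would pass os.listdir() of the database's log folder as present_names. An empty result means replay has everything it needs. Any names returned are the logs you'd have to recover from backup before soft recovery can succeed.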
You'll know this step worked when: Get-MailboxDatabase -Status shows Mounted: True for all databases and affected users can open their mailboxes.
Fix Outlook and OWA Connectivity
When email is flowing internally but users can't connect, the problem is almost always in the Client Access layer: IIS, Autodiscover, or a certificate. Typical symptoms are Outlook prompting for a password over and over, OWA throwing a 503 or a blank page, or Outlook for Mac refusing to set up the account.
Start with the built-in connectivity test:
Test-OutlookWebServices -Identity user@yourdomain.com -ClientAccessServer [ServerName] | Format-List
This tests Autodiscover, the OAB, EWS, and the Availability service in one shot. Look for any scenario whose Result isn't Success.
If Autodiscover is failing, check the URL configuration; this is the single most common cause of Outlook connectivity problems after a server rename, an IP change, or a migration. Domain-joined Outlook clients find Autodiscover through the service connection point (SCP) in Active Directory. The SCP comes from each server's AutoDiscoverServiceInternalUri setting; the InternalUrl and ExternalUrl on the Autodiscover virtual directory itself aren't what clients use. External clients find Autodiscover through DNS. Check the SCP value alongside the OWA and EWS virtual directories:
Get-ClientAccessService | Format-List Name, AutoDiscoverServiceInternalUri
Get-OWAVirtualDirectory | Format-List InternalUrl, ExternalUrl
Get-WebServicesVirtualDirectory | Format-List InternalUrl, ExternalUrl
If the URLs are wrong or blank, set them correctly. For example (on Exchange 2013, use Get-ClientAccessServer and Set-ClientAccessServer instead):
Set-ClientAccessService -Identity [ServerName] -AutoDiscoverServiceInternalUri "https://mail.yourdomain.com/Autodiscover/Autodiscover.xml"
After fixing virtual directories, recycle the IIS application pools:
Import-Module WebAdministration
Get-ChildItem IIS:\AppPools | Where-Object { $_.Name -like "MSExchange*" } | ForEach-Object { Restart-WebAppPool -Name $_.Name }
Also verify your Exchange SSL certificate covers all the names users are connecting to. Run Get-ExchangeCertificate | Format-List and check that your certificate's CertificateDomains includes your mail hostname and the Autodiscover record.
You'll know this step worked when: OWA loads without errors, Outlook connects without credential prompts, and Test-OutlookWebServices returns Success for every scenario.
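The certificate check in this section comes down to comparing host names: every host in your virtual directory URLs must appear in the certificate's CertificateDomains list. This Python sketch reports any URL whose host the certificate doesn't cover. It follows the usual TLS rule that a wildcard such as *.yourdomain.com matches exactly one extra label. The URLs and names in the usage line are placeholders, and the function names are hypothetical.

```python
from urllib.parse import urlsplit


def host_covered(host: str, cert_names: list[str]) -> bool:
    """True if host matches a certificate name, allowing a single-label wildcard."""
    host = host.lower().rstrip(".")
    for name in cert_names:
        name = name.lower().rstrip(".")
        if name == host:
            return True
        if name.startswith("*."):
            # *.yourdomain.com covers mail.yourdomain.com, but not
            # a.b.yourdomain.com and not yourdomain.com itself.
            suffix = name[1:]  # ".yourdomain.com"
            label, _, rest = host.partition(".")
            if label and "." + rest == suffix:
                return True
    return False


def uncovered_urls(urls: list[str], cert_names: list[str]) -> list[str]:
    """Return the URLs whose host the certificate does not cover.

    Blank URLs (unset virtual directory values) are skipped; fix those first.
    """
    return [u for u in urls if u and not host_covered(urlsplit(u).hostname or "", cert_names)]
```

For example, uncovered_urls(["https://mail.yourdomain.com/owa"], ["mail.yourdomain.com", "autodiscover.yourdomain.com"]) returns an empty list. Any URL that does come back needs either a corrected virtual directory setting or a certificate that includes that name.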
Clear a Stuck Mail Queue
A growing message queue that isn't draining is one of those Exchange problems that creates escalating panic: every minute it grows, more users are affected, and the backlog becomes harder to process. I've seen queues hit 50,000 messages on a server that just needed one connector fixed.
Get a current view of all queues:
Get-Queue | Format-Table -AutoSize
Focus on queues whose Status shows Retry; this means Exchange is actively trying to deliver but hitting an error on each attempt. Drill into the specific error:
Get-Queue -Filter {Status -eq "Retry"} | Format-List Identity, LastError, NextHopDomain, MessageCount
The LastError field is your diagnostic gold mine. Common values and what they mean:
- 451 4.4.0 DNS query failed: Exchange can't resolve the destination domain. Check your internal DNS server, and make sure the Exchange server's NIC DNS settings point to your internal DNS, not a public resolver.
- 421 4.3.2 SERVICE NOT AVAILABLE: the receiving server is refusing connections. That could be a problem on their side or a firewall blocking outbound port 25.
- 550 5.7.1 Unable to relay: the receiving server won't relay for the sender. When that receiving server is your own (for example, an application relaying through Exchange), its Receive Connector is misconfigured and is rejecting a legitimate source. Check the connector's PermissionGroups and RemoteIPRanges.
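When you're reviewing many queues, the LastError values above can be matched mechanically. This Python sketch looks for a three-digit SMTP reply code followed by an enhanced status code. Real LastError strings often wrap the reply in extra text, so the pattern searches anywhere in the string rather than anchoring at the start. The advice strings simply restate this guide, and the function name explain_last_error is hypothetical.

```python
import re

# Advice restates the LastError list in this guide; extend as needed.
KNOWN_ERRORS = {
    ("451", "4.4.0"): "DNS lookup failed: make sure the server's DNS settings point at internal DNS",
    ("421", "4.3.2"): "Receiving server unavailable: check the remote side and outbound port 25",
    ("550", "5.7.1"): "Relay refused: check the receiving connector's PermissionGroups and RemoteIPRanges",
}

# A reply code such as 451 followed by an enhanced status code such as 4.4.0.
_SMTP_CODE = re.compile(r"\b([245]\d\d) ([245]\.\d{1,3}\.\d{1,3})\b")


def explain_last_error(last_error: str) -> str:
    """Map the first SMTP status pair found in a LastError string to advice."""
    match = _SMTP_CODE.search(last_error)
    if not match:
        return "No SMTP status code found; read the full LastError text"
    code = (match.group(1), match.group(2))
    return KNOWN_ERRORS.get(code, f"{code[0]} {code[1]}: not in this guide's list")
```

Pair it with the LastError column from Get-Queue -Filter {Status -eq "Retry"}. Codes outside the list fall through with a neutral message instead of a guessed cause.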
Once you've identified and fixed the root cause, force a retry on stuck queues:
Retry-Queue -Filter {Status -eq "Retry"} -Resubmit $true
If you need to suspend the queue while you work on a fix (to stop Exchange from hammering a failing destination):
Suspend-Queue -Filter {Status -eq "Retry" -and NextHopDomain -eq "problemdomain.com"}
When the fix is in place, resume with:
Resume-Queue -Filter {Status -eq "Suspended"}
You'll know this step worked when: Get-Queue shows the queue's MessageCount dropping toward zero and its Status moving from Retry back to Active or Ready.
Advanced Exchange Server Troubleshooting
Reading Exchange Events in Event Viewer
Event Viewer is underused in Exchange troubleshooting because most admins don't know which event IDs to care about. Here's the short list that actually matters:
- Event ID 1022 (MSExchangeIS): database copy status changed. On a DAG, this fires when a passive copy falls behind or loses sync. Watch for it alongside Event ID 1121 (database copy suspended).
- Event ID 4999 (MSExchange Common): Watson minidump. Exchange crashed hard enough to generate a crash report. The faulting module name tells you whether it's an Exchange DLL or third-party code.
- Event ID 2080 (MSExchange ADAccess): Active Directory topology discovery. If you see failures here, Exchange has lost reliable access to a domain controller. This is critical for multi-site deployments.
- Event ID 9646 (MSExchangeIS): a client exceeded MAPI connection limits. This can cause Outlook disconnections across the board when one misbehaving client (often a third-party app) opens too many simultaneous connections.
Group Policy and Exchange Permissions
In domain-joined environments, Group Policy can silently break Exchange. The usual culprits are policies that modify Security Options under Computer Configuration, such as disabling NTLMv1, restricting anonymous access to named pipes, or changing the LAN Manager authentication level. These can break MAPI, OWA Kerberos authentication, or transport service communication.
If Exchange started breaking after a Group Policy refresh or a Windows Update that included security baseline changes, run:
gpresult /h C:\GPReport.html /f
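Run it from an elevated prompt so the computer settings are included. Open the resulting HTML report and review the policies applied to the Exchange server's OU, specifically under Computer Configuration > Policies > Windows Settings > Security Settings > Local Policies > Security Options.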
DAG (Database Availability Group) Failover Issues
If you're running Exchange in a DAG and one node is failing to host active databases, check the cluster health first:
Get-DatabaseAvailabilityGroup -Status | Format-List
Get-MailboxDatabaseCopyStatus * | Where-Object {$_.Status -ne "Mounted" -and $_.Status -ne "Healthy"} | Format-Table
If a copy shows ContentIndexState: Failed, the search index is corrupt. Rebuild it:
Update-MailboxDatabaseCopy -Identity "DatabaseName\ServerName" -DeleteExistingFiles -CatalogOnly
If a copy shows CopyQueueLength growing continuously, replication between DAG members is falling behind, usually because the network path carrying it is saturated or broken. Check the DAG network configuration with Get-DatabaseAvailabilityGroupNetwork. If you use a dedicated replication network, it should sit on its own non-routed VLAN, separate from client traffic.
Transport Pipeline Agent Debugging
Third-party transport agents (antivirus, DLP, archiving) can silently break mail flow without logging obvious errors. Get a list of installed agents and their state:
Get-TransportAgent | Format-Table Identity, Enabled, Priority
If you suspect an agent is the problem, disable it temporarily for testing:
Disable-TransportAgent -Identity "AgentName"
Restart-Service MSExchangeTransport
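Then re-run Test-Mailflow. If mail flows cleanly without the agent, contact your AV or archiving vendor; you likely need an updated version that's compatible with your current Exchange CU level. Re-enable the agent with Enable-TransportAgent (and restart MSExchangeTransport) once you've finished testing or installed the fix.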
When to Call Microsoft Support
Some Exchange problems are genuinely beyond what self-service troubleshooting can fix, and knowing when to escalate saves you from making things worse. Call Microsoft Support when: your mailbox database is in a Dirty Shutdown state and log replay has failed; you're seeing repeated Event ID 4999 crashes with an Exchange-owned DLL in the faulting module; your DAG cluster has lost quorum and manual failover attempts are failing; or you've applied a Cumulative Update and Exchange services won't start at all. Microsoft Support has access to the Exchange debug symbols and internal telemetry that simply aren't available to outside admins. Don't waste 8 hours trying to heroically solve a database corruption that needs Microsoft's database repair tools.
Prevention & Best Practices
The best Exchange Server troubleshooting session is the one you never have to run. I know that sounds obvious, but most Exchange outages I've responded to were predictable and preventable. Here's what separates shops that have Exchange crises from shops that don't.
Keep your Cumulative Updates current. Exchange CUs aren't optional security patches; they're cumulative packages that include critical bug fixes for the exact kinds of transport, database, and AD connectivity problems this guide covers. Running Exchange 2019 CU11 when CU14 is available means you're carrying bugs that Microsoft already fixed. Check your current build with Get-ExchangeServer | Format-Table Name, AdminDisplayVersion and compare it to the Exchange build number reference page.
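Get-ExchangeServer reports each server's build in its AdminDisplayVersion property, in a form such as Version 15.2 (Build 1118.7). That value reflects the CU level and may not change when you install a Security Update. Once it's parsed, comparing it with the build you're aiming for is a tuple comparison. In this Python sketch, the target build is a parameter you look up on Microsoft's build number reference page. The numbers in the example are illustrative only, and the function names are hypothetical.

```python
import re

_ADMIN_VERSION = re.compile(r"Version (\d+)\.(\d+) \(Build (\d+)\.(\d+)\)")


def parse_admin_display_version(text: str) -> tuple[int, ...]:
    """Turn 'Version 15.2 (Build 1118.7)' into (15, 2, 1118, 7)."""
    match = _ADMIN_VERSION.search(text)
    if not match:
        raise ValueError(f"unrecognized AdminDisplayVersion: {text!r}")
    return tuple(int(part) for part in match.groups())


def is_behind(current: str, target_build: str) -> bool:
    """True if the server's build is older than a dotted target such as '15.2.1500.0'.

    Only meaningful when both values belong to the same Exchange version.
    """
    target = tuple(int(part) for part in target_build.split("."))
    return parse_admin_display_version(current) < target
```

Python compares tuples element by element, so a server on build 1118 is reported as behind a target of 1500 within the same major and minor version.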
Monitor your mailbox database white space and disk usage proactively. Exchange doesn't handle a full volume gracefully: when the database or log volume runs out of space, the database dismounts, and it won't mount again until you free space. Set up alerts in your monitoring system for when your Exchange volumes reach 80% capacity. Check white space weekly with Get-MailboxDatabase -Status | Format-List Name, DatabaseSize, AvailableNewMailboxSpace.
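If your monitoring system doesn't already watch the Exchange volumes, the 80% rule can be checked with Python's standard library. This sketch uses shutil.disk_usage on the volume holding each path you pass in. It's meant to run from a scheduled task as a stopgap, not to replace real monitoring. The function names are hypothetical.

```python
import shutil

ALERT_THRESHOLD = 0.80  # alert at 80% of capacity, per this guide


def volume_usage(path: str) -> float:
    """Fraction (0.0 to 1.0) of the volume holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def volumes_over_threshold(paths: list[str], threshold: float = ALERT_THRESHOLD) -> dict[str, float]:
    """Return {path: fraction_used} for each path whose volume is at or above threshold."""
    results = {}
    for path in paths:
        fraction = volume_usage(path)
        if fraction >= threshold:
            results[path] = fraction
    return results
```

For example, volumes_over_threshold([r"D:\ExchangeDatabases", r"E:\ExchangeLogs"]) checks a database volume and a log volume. Those paths are placeholders for your own database and log folders. Check both, because either one filling up takes the database offline.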
Test your backup and restore process quarterly. A backup you've never tested is a backup you can't trust. At least once per quarter, restore a database to an alternate location and mount it as a recovery database: New-MailboxDatabase -Recovery -Name RecoveryDB -Server [ServerName] -EdbFilePath [path] -LogFolderPath [path], then Mount-Database RecoveryDB. This tells you whether your backups are actually usable before you need them in a crisis.
Watch your certificate expiration dates. An expired Exchange SSL certificate kills OWA, ActiveSync, Autodiscover, and federated sharing simultaneously. Set calendar reminders 60 days and 30 days before each certificate expires. You can see all current certificates and their expiry dates with Get-ExchangeCertificate | Sort-Object NotAfter | Format-Table Thumbprint, NotAfter, Services, Subject.
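The 60- and 30-day reminders are easy to compute from each certificate's NotAfter date. This Python sketch assumes you've exported the thumbprints and dates as ISO strings, for example with Get-ExchangeCertificate | Select-Object Thumbprint, @{n='NotAfter';e={$_.NotAfter.ToString('yyyy-MM-dd')}}. Today's date is passed in explicitly so the result is predictable. The thumbprints in the example are made up, and the function names are hypothetical.

```python
from datetime import date

REMINDER_DAYS = (60, 30)  # reminder points from this guide


def expiry_status(not_after: date, today: date) -> str:
    """Classify a certificate by days remaining until NotAfter."""
    remaining = (not_after - today).days
    if remaining < 0:
        return f"EXPIRED {-remaining} days ago"
    if remaining <= min(REMINDER_DAYS):
        return f"URGENT: {remaining} days left"
    if remaining <= max(REMINDER_DAYS):
        return f"RENEW SOON: {remaining} days left"
    return f"OK: {remaining} days left"


def report(certs: dict[str, str], today: date) -> dict[str, str]:
    """Map thumbprint -> status, given NotAfter dates as yyyy-mm-dd strings."""
    return {thumb: expiry_status(date.fromisoformat(d), today) for thumb, d in certs.items()}
```

For example, report({"AB12": "2025-01-11"}, date(2025, 1, 1)) returns {"AB12": "URGENT: 10 days left"}. Anything flagged RENEW SOON or URGENT is a certificate to renew and reassign now.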
- Schedule a weekly automated run of Test-Mailflow and email the results to your admin mailbox; you'll catch degraded mail flow before users do.
- Set Windows Server to automatically restart failed Exchange services: open Services, right-click each MSExchange service, choose Properties, go to the Recovery tab, and set all three failure actions to "Restart the Service".
- Enable circular logging only if you're running a DAG with at least one healthy copy; on standalone Exchange, losing your transaction logs means losing your recovery point.
- Keep the Exchange Emergency Mitigation Service (EEMS) enabled. Starting with Exchange 2019 CU11 and Exchange 2016 CU22, it runs as the MSExchangeMitigation service and automatically applies Microsoft's mitigations for known exploit vectors without waiting for a full CU release.
Frequently Asked Questions
Why does Outlook keep asking for my password even though I'm entering the right credentials?
This is almost always an authentication negotiation failure, not an actual wrong password. The most common cause is that Outlook tries to authenticate with Kerberos, but something in the environment makes Kerberos fail. Typical triggers are a missing or duplicate service principal name (SPN) or a clock skew of more than 5 minutes between client and domain controller. Outlook then falls back to NTLM, which fails in turn because a policy or proxy strips NTLM headers. Run setspn -L [ExchangeServer] to list the SPNs registered on the server's computer account. The HOST/ entries for both the short name and the FQDN should be there, and they cover HTTP by default. If clients connect through a load-balanced namespace such as mail.yourdomain.com, the http/ SPN for that name belongs on the alternate service account (ASA), and setspn -X will reveal any duplicates. Also check that the client computer's clock is within 5 minutes of the DC. If you're in an environment using Modern Authentication (OAuth), verify that hybrid Modern Auth is configured correctly in your Exchange and Azure AD settings; a misconfigured OAuth endpoint produces exactly this symptom.
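Both Kerberos checks in this answer can be scripted. This Python sketch does two things. It compares two clock readings against the 5-minute tolerance mentioned above, which is the default Kerberos tolerance in Active Directory. It also checks saved setspn -L output for whichever SPNs you expect to see. The expected SPN list is yours to supply, the server names in the example are placeholders, and the function names are hypothetical.

```python
from datetime import datetime, timedelta

MAX_SKEW = timedelta(minutes=5)  # default Kerberos clock tolerance in AD


def skew_ok(client_time: datetime, dc_time: datetime) -> bool:
    """True if the two clocks are within the Kerberos tolerance."""
    return abs(client_time - dc_time) <= MAX_SKEW


def missing_spns(setspn_output: str, expected: list[str]) -> list[str]:
    """Return expected SPNs that are absent from `setspn -L` output.

    SPN lines are indented under a 'Registered ServicePrincipalNames for ...'
    header; comparison is case-insensitive, as SPN matching is.
    """
    registered = {
        line.strip().lower()
        for line in setspn_output.splitlines()
        if line.startswith((" ", "\t")) and line.strip()
    }
    return [spn for spn in expected if spn.lower() not in registered]
```

For example, missing_spns(output, ["HOST/EX01", "HOST/EX01.contoso.com"]) returns only the entries that setspn didn't list. Here output is the captured setspn -L text, and EX01 and contoso.com stand in for your server and domain.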
Exchange mail queue is growing and email isn't going out: what do I check first?
Start with Get-Queue | Format-Table Identity, Status, MessageCount, LastError -AutoSize and read the LastError column for queues in Retry status; that error message is almost always the fastest path to the root cause. The most common reasons for a backed-up outbound queue are:
- your smart host relay (if you use one) is rejecting connections
- port 25 outbound is blocked by a firewall or your ISP
- your sending IP has been blocklisted
- a transport agent is stuck processing messages

Check your public IP against the major blocklists at MXToolbox. To verify port 25 connectivity, run Test-NetConnection -ComputerName [DestinationMXHost] -Port 25 from the Exchange server. If that connection is refused or times out, the block is at the network level, not in Exchange.
How do I fix Exchange OWA showing a blank page or 503 error?
A blank OWA page or 503 typically means an IIS application pool serving the OWA virtual directory has crashed or stopped. Open IIS Manager on the Exchange server, go to Application Pools, and look for any pool whose name contains "MSExchange" showing a Stopped state; right-click it and choose Start. Then check the System event log for Event ID 5002 from the WAS (Windows Process Activation Service) source. That event means IIS disabled the pool after a series of worker process failures, and the WAS events logged just before it describe those failures. If the pool keeps stopping, the underlying problem is usually a missing Exchange DLL (from a failed CU install) or an incompatible third-party ISAPI filter. Run %windir%\system32\inetsrv\appcmd list config "Default Web Site/OWA" /section:system.webServer/isapiFilters to see what's loaded.
My Exchange database won't mount and ESEUTIL says Dirty Shutdown. Is my data gone?
Not necessarily. A Dirty Shutdown means Exchange didn't finish writing committed transactions from the transaction logs into the database file before it stopped. Those transactions still exist in the logs; you just need to bring the database back to a "Clean Shutdown" state before Exchange will mount it. First try replaying the existing transaction logs: eseutil /r [LogPrefix] /l [LogPath] /d [DatabasePath]. The Log Required line in eseutil /mh output shows which log generations replay needs. If all of them are present, this usually brings the database back to clean shutdown in minutes with zero data loss. If logs are missing, you have two options. You can restore from a backup, which may mean losing data back to the backup point. Or, if you're on a DAG, you can reseed the copy from a healthy node by running Update-MailboxDatabaseCopy.
How do I check if Exchange is properly connected to Active Directory?
Exchange depends heavily on Active Directory for configuration data, recipient lookups, and authentication. AD connectivity problems show up as everything from sporadic mail delivery failures to complete service outages. Run Get-ExchangeServer -Identity [ServerName] -Status | Format-List Name, CurrentDomainControllers, CurrentGlobalCatalogs, CurrentConfigDomainController to see which domain controllers Exchange is currently using. On a DAG member, Test-ReplicationHealth also checks the cluster and replication components Exchange relies on. For the AD side specifically, run dcdiag /test:replications /v on a domain controller in the same site as your Exchange server. Event ID 2114 in the Application log from the MSExchange ADAccess source points to a topology discovery problem with your domain controllers; look at the event detail to see which DC and why.
ActiveSync stopped working for mobile devices after I renewed the SSL certificate. How do I fix it?
When you renew or replace an Exchange SSL certificate, you must explicitly assign the new certificate to Exchange services; the renewal alone doesn't do this. Check which certificate is currently assigned to the IIS and SMTP services: Get-ExchangeCertificate | Format-List Thumbprint, Services, Subject. Look at the Services field: for the certificate covering your mail hostname, it should include IIS and SMTP. If your new certificate doesn't show those services, assign it with Enable-ExchangeCertificate -Thumbprint [NewCertThumbprint] -Services IIS,SMTP, adding POP and IMAP only if you use those protocols. After that, restart IIS with iisreset /noforce and have a mobile device user try to sync. Also verify the ActiveSync virtual directory URL with Get-ActiveSyncVirtualDirectory | Format-List InternalUrl, ExternalUrl. Its host name should match your certificate's CN or a SAN entry exactly.
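To check several certificates at once, the Services test can be applied to exported Get-ExchangeCertificate data. This Python sketch assumes you've captured each certificate's thumbprint and Services value as text. In Format-List output, the Services field appears as a comma-separated list such as IMAP, POP, IIS, SMTP, or None when nothing is assigned. The required set defaults to IIS and SMTP, as in this answer. The thumbprints in the example are made up, and the function names are hypothetical.

```python
REQUIRED_SERVICES = frozenset({"IIS", "SMTP"})


def parse_services(services: str) -> set[str]:
    """Split a flags string like 'IMAP, POP, IIS, SMTP' into a set of names."""
    names = {part.strip().upper() for part in services.split(",")}
    names.discard("")
    names.discard("NONE")  # 'None' means no services are assigned
    return names


def certs_missing_services(certs: dict[str, str], required=REQUIRED_SERVICES) -> dict[str, set[str]]:
    """Map thumbprint -> required services the certificate is not assigned to."""
    result = {}
    for thumbprint, services in certs.items():
        missing = set(required) - parse_services(services)
        if missing:
            result[thumbprint] = missing
    return result
```

If your renewed certificate shows up in the result, it's the one to pass to Enable-ExchangeCertificate. Certificates that legitimately serve other purposes will appear too, so read the output against your certificate inventory.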