How to Fix Azure Monitor

Microsoft Fix | Intermediate | 18 min read | Official Docs Grounded | Updated April 20, 2026

What's in This Guide

Why This Happens
The Quick Fix
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why Azure Monitor Troubleshooting Is So Painful

I've seen this exact scenario play out on dozens of Azure environments: you set up Azure Monitor, configure your alerts, wire up a Log Analytics workspace, and then nothing. Logs aren't flowing. Alerts aren't firing. Metrics look wrong or stop appearing entirely. You're staring at an empty chart at 2 AM wondering if the whole monitoring stack is broken, or if your VM is just quietly on fire with no one watching.

Azure Monitor troubleshooting is genuinely hard because the product is not a single service; it's a constellation of loosely coupled components. You've got the Azure Monitor Agent (AMA), the older Log Analytics Agent (also called MMA or OMS agent), Data Collection Rules (DCRs), Log Analytics workspaces, metric pipelines, action groups, alert rules, and Diagnostic Settings, all of which can fail independently and all of which have different failure modes with different symptoms.

The error messages don't help. "Heartbeat data not received" tells you the agent is offline but not why. "Alert rule evaluation failed" gives you nothing actionable. Missing data in a Log Analytics query could mean the agent stopped, the DCR was misconfigured, the workspace ran out of ingestion capacity, network rules are blocking the agent endpoint, or your Kusto query itself has a time-range issue. Azure won't tell you which one.

Common root causes I see in practice:

Agent connectivity issues: The Azure Monitor Agent or Log Analytics Agent can't reach the ingestion endpoint (*.ods.opinsights.azure.com or *.oms.opinsights.azure.com) due to NSG rules, firewall policies, or missing private endpoints.
Misconfigured or missing Data Collection Rules: Since AMA replaced MMA, DCRs are required. If no DCR is associated with a machine, no data flows regardless of agent health.
Workspace permission errors: Contributor access on a VM doesn't grant access to write to its associated workspace. This catches people constantly.
Expired or rotated workspace keys: Agents using primary/secondary keys (not managed identity) fail silently when keys are regenerated.
Diagnostic Settings not configured for new resource types: Platform logs and metrics for PaaS resources require Diagnostic Settings to be explicitly turned on per resource.
Action group delivery failures: Alerts fire correctly but email, webhook, or Logic App targets fail, making it look like alerting is broken.

The good news: virtually every Azure Monitor issue is diagnosable and fixable if you know where to look. This guide walks you through every layer, from agent health to alert delivery.

The Quick Fix: Try This First

Before you go deep, run this triage sequence. It resolves about 60% of Azure Monitor troubleshooting cases in under ten minutes.

Step 1: Check if the Azure Monitor Agent is reporting a heartbeat. Open the Azure Portal, navigate to your Log Analytics workspace, then go to Logs. Run this query:

Heartbeat
| where TimeGenerated > ago(1h)
| summarize LastHeartbeat = max(TimeGenerated) by Computer, OSType
| order by LastHeartbeat desc

If your machine doesn't appear at all, the agent is either not installed, not running, or can't reach the workspace. If it appears but its LastHeartbeat is well behind the current time, the agent was working within the last hour and has since stopped or lost connectivity.
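If you export the query results, the same triage can be scripted. The sketch below is only an illustration of the missing/stale/healthy split described above: the function name, the machine names, and the 5-minute staleness threshold are my own assumptions, not anything Azure defines.

```python
from datetime import datetime, timedelta, timezone


def classify_heartbeats(last_seen, expected, now, stale_after=timedelta(minutes=5)):
    """Label each expected machine as 'missing', 'stale', or 'healthy'.

    last_seen maps Computer -> LastHeartbeat (timezone-aware datetime),
    i.e. the rows returned by the Heartbeat query.
    stale_after is an assumed threshold; tune it to your environment.
    """
    status = {}
    for computer in expected:
        seen = last_seen.get(computer)
        if seen is None:
            status[computer] = "missing"   # no row at all in the query window
        elif now - seen > stale_after:
            status[computer] = "stale"     # reported earlier, then went quiet
        else:
            status[computer] = "healthy"
    return status


if __name__ == "__main__":
    # Hypothetical machines and timestamps, for illustration only.
    now = datetime(2026, 4, 20, 2, 0, tzinfo=timezone.utc)
    rows = {
        "web-01": now - timedelta(minutes=1),
        "web-02": now - timedelta(minutes=48),
    }
    print(classify_heartbeats(rows, ["web-01", "web-02", "db-01"], now))
```

A machine labelled "missing" points you at installation or connectivity; "stale" points you at whatever changed since the last heartbeat.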

Step 2: Check agent status directly on the machine. For Windows VMs, open PowerShell as Administrator and run:

Get-Service -Name AzureMonitorAgent
# Or for the older Log Analytics agent:
Get-Service -Name HealthService

If the service is stopped, start it:

Start-Service -Name AzureMonitorAgent

For Linux, run:

sudo systemctl status azuremonitoragent
sudo systemctl start azuremonitoragent

Step 3: Validate your Data Collection Rule association. In the Azure Portal, go to Monitor → Data Collection Rules. Find your DCR, click Resources, and verify your VM is listed. If it's missing, that's your problem right there: no DCR association means no data, full stop.

Step 4: Check for Diagnostic Settings gaps. For any PaaS resource (App Service, SQL Database, Key Vault, etc.), go to the resource in the Portal, click Diagnostic settings in the left menu under Monitoring, and confirm a setting exists that points to your workspace. If the list is empty, nothing is being logged.

Those four steps will identify the issue in the majority of cases. If you're still stuck, work through the full step-by-step below.

Pro Tip

The Azure Monitor Agent and the old Log Analytics Agent (MMA) can coexist on a machine but will cause data duplication and occasional conflicts. If you're seeing double entries in your workspace or weird heartbeat gaps, run Get-Service | Where-Object {$_.Name -like "*Health*" -or $_.Name -like "*Azure*"} to see which agents are installed. Pick one and remove the other; Microsoft's official position since 2024 is to migrate fully to AMA.

Diagnose Azure Monitor Agent Health on the VM

This is where I start every Azure Monitor troubleshooting session. The agent is the foundation: if it's broken, everything downstream is meaningless.

On Windows, the Azure Monitor Agent writes its operational logs to two locations. First, check the Windows Event Log. Open Event Viewer, navigate to Applications and Services Logs → Microsoft → Azure Monitor Agent → Agent. Look for error events. Event ID 3000 typically means a DCR download failure. Event ID 5003 means a data upload failure, often network-related.

Second, check the agent's own log files at:

C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.Monitor.AzureMonitorWindowsAgent\

Open the most recent .log file and look for lines containing ERROR or FATAL. A common one is:

Failed to download DCR configuration. StatusCode: 403. Ensure managed identity has correct RBAC assignment.

That 403 tells you the VM's managed identity doesn't have the Monitoring Metrics Publisher or Log Analytics Contributor role on the workspace. Fix it in IAM on the workspace resource.
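When the log file is large, a small script can pull out just the lines that matter. This is a rough sketch, assuming the log is plain text with the words ERROR or FATAL on the relevant lines and a "StatusCode: NNN" fragment like the message above; the timestamp prefix in the sample is invented for illustration and the real layout may differ.

```python
import re

# Matches fragments such as "StatusCode: 403"
STATUS_RE = re.compile(r"StatusCode:\s*(\d{3})")


def find_agent_errors(log_text):
    """Return (line_number, status_code_or_None, line) for ERROR/FATAL lines."""
    hits = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        if "ERROR" in line or "FATAL" in line:
            match = STATUS_RE.search(line)
            code = int(match.group(1)) if match else None
            hits.append((number, code, line.strip()))
    return hits


if __name__ == "__main__":
    # Hypothetical log excerpt; only the message text comes from this guide.
    sample = (
        "2026-04-20T02:00:01Z INFO Agent started\n"
        "2026-04-20T02:00:05Z ERROR Failed to download DCR configuration. "
        "StatusCode: 403. Ensure managed identity has correct RBAC assignment.\n"
    )
    for number, code, line in find_agent_errors(sample):
        print(number, code, line)
```

Grouping the hits by status code is usually enough to tell an authorization problem (403) from a network one.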

On Linux, check:

sudo cat /var/log/azure/Microsoft.Azure.Monitor.AzureMonitorLinuxAgent/extension.log

Then read the agent's service logs with:

sudo journalctl -u azuremonitoragent -n 100 --no-pager

If the agent is healthy in logs but still not sending data, move to the network layer. From a Windows VM, test connectivity to the agent's global control endpoint (the ingestion endpoints listed earlier need the same check):

Test-NetConnection -ComputerName "global.handler.control.monitor.azure.com" -Port 443

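If TcpTestSucceeded comes back False, you've found your issue: a firewall or NSG is blocking the agent. You'll need to allow outbound HTTPS to the Azure Monitor service tags: AzureMonitor and AzureActiveDirectory. Microsoft's network documentation for AMA also lists AzureResourceManager as required, so allow that tag as well.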
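Test-NetConnection only exists on Windows. On Linux, or anywhere Python is available, a plain TCP connect gives you roughly the same signal. This is a minimal sketch: the function name is mine, and it only proves that a TCP session opens on port 443. It does not validate TLS, proxies, or authentication.

```python
import socket


def can_reach(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts alike.
        return False


if __name__ == "__main__":
    endpoint = "global.handler.control.monitor.azure.com"
    state = "reachable" if can_reach(endpoint) else "blocked or unresolvable"
    print(endpoint, state)
```

A False result here means the same thing as TcpTestSucceeded being False: look at the firewall, NSG, or DNS before touching the agent.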

Audit and Repair Your Data Collection Rules

Data Collection Rules are the most common source of Azure Monitor troubleshooting headaches in environments that migrated from MMA to AMA after 2023. People install the new agent but forget that DCRs are now required for every data source; the agent collects nothing without them.

Navigate to Azure Monitor → Data Collection Rules in the Portal. You'll see a list of all DCRs in your subscription. Click on the one targeting your affected VMs and check three things:

Data sources: Click Data sources in the left menu. Confirm the data sources you expect are listed (Windows Event Logs, Performance Counters, Syslog, etc.). If this list is empty, the DCR exists but collects nothing. Add the data sources you need.

Destinations: Click Destinations. Confirm your Log Analytics workspace is listed as a destination. If the workspace was deleted and recreated, the resource ID here will be stale and the association silently broken.

Resources: Click Resources. Every VM or Arc-enabled server that should use this DCR must appear here. If your VM is missing, click Add and associate it.

You can also verify and fix DCR associations via PowerShell, which is much faster when dealing with tens of machines:

# Parameter names differ between Az.Monitor versions; confirm with Get-Help before running
# List all DCR associations for a VM
$vmResourceId = "/subscriptions/{subId}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachines/{vmName}"
Get-AzDataCollectionRuleAssociation -TargetResourceId $vmResourceId

# Create a new association
New-AzDataCollectionRuleAssociation `
  -TargetResourceId $vmResourceId `
  -AssociationName "myDCRAssociation" `
  -RuleId "/subscriptions/{subId}/resourceGroups/{rg}/providers/Microsoft.Insights/dataCollectionRules/{dcrName}"

After fixing the DCR, wait up to 5 minutes and re-run the Heartbeat query from Step 1. You should start seeing data flow.
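When you're auditing many machines, it helps to diff the VMs you expect against the ones that actually have an association. The sketch below assumes you've already exported both lists of resource IDs (for example from the PowerShell output above); the function name and the sample IDs are hypothetical. Azure resource IDs are case-insensitive, so the comparison lowercases them first.

```python
def missing_associations(expected_vm_ids, associated_vm_ids):
    """Return expected VM resource IDs that have no DCR association.

    Azure resource IDs are case-insensitive, so compare them lowercased.
    """
    associated = {rid.strip().lower() for rid in associated_vm_ids}
    return [rid for rid in expected_vm_ids if rid.strip().lower() not in associated]


if __name__ == "__main__":
    # Placeholder IDs for illustration only.
    prefix = "/subscriptions/{subId}/resourceGroups/{rg}/providers/Microsoft.Compute/virtualMachines/"
    expected = [prefix + "vm-a", prefix + "vm-b", prefix + "vm-c"]
    associated = [prefix.upper() + "VM-A", prefix + "vm-c"]
    for rid in missing_associations(expected, associated):
        print("No DCR association:", rid)
```

Whatever this prints is the list to feed back into New-AzDataCollectionRuleAssociation.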

Fix Log Analytics Workspace Permissions and Access Issues

I know this is frustrating, especially when you're certain you have access, but workspace permissions are one of the sneakiest causes of data gaps in Azure Monitor troubleshooting. Access to a VM and access to its Log Analytics workspace are completely separate RBAC hierarchies.

Go to your Log Analytics workspace in the Portal. Click Access control (IAM) in the left menu. Check that the following identities have appropriate roles:

VM Managed Identity: needs Monitoring Metrics Publisher on the workspace, or Log Analytics Contributor at minimum.
Users querying Logs: need Log Analytics Reader to view data, or Log Analytics Contributor to modify saved queries and alerts.
Action Groups using a Runbook or Function: the service principal behind the action needs Contributor access to the target resource.

Also check the workspace's Access control mode. Go to Properties in the workspace left menu. Under Access control mode, you'll see either Require workspace permissions or Use resource or workspace permissions. If it's set to the latter, users may have access to logs for specific resources (via resource-level RBAC) but not the workspace itself. This causes confusing scenarios where some tables are visible and others aren't.

For workspaces used across multiple teams, I'd recommend switching to Require workspace permissions and managing access at the workspace level. It's less granular but dramatically easier to troubleshoot.

Finally, verify the workspace isn't hitting its daily cap. Go to Usage and estimated costs in the workspace left menu. If you see a warning that the daily cap was reached, data ingestion silently stopped for the remainder of that 24-hour window. Either raise the cap or wait for the reset. The reset hour is fixed per workspace and shown on the daily cap page, so don't assume it falls at midnight UTC.

// Check ingestion volume via KQL
Usage
| where TimeGenerated > ago(7d)
| summarize TotalGB = sum(Quantity) / 1024 by bin(TimeGenerated, 1d), DataType
| order by TimeGenerated desc
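The query returns one row per day and data type, so you still have to add the rows up before comparing against the cap. Here's a small sketch of that arithmetic, assuming you've exported the rows as (day, data type, TotalGB) tuples; the function name, the 90% warning ratio, and the sample numbers are assumptions of mine.

```python
from collections import defaultdict


def flag_days_near_cap(rows, cap_gb, warn_ratio=0.9):
    """Sum TotalGB per day and label each day against the daily cap.

    rows: iterable of (day, data_type, total_gb), one per query result row.
    warn_ratio: assumed fraction of the cap at which to start worrying.
    """
    totals = defaultdict(float)
    for day, _data_type, total_gb in rows:
        totals[day] += total_gb

    report = {}
    for day, gb in sorted(totals.items()):
        if gb >= cap_gb:
            report[day] = "at or over cap"
        elif gb >= warn_ratio * cap_gb:
            report[day] = "near cap"
        else:
            report[day] = "ok"
    return report


if __name__ == "__main__":
    # Invented sample data for a workspace with a 5 GB daily cap.
    sample = [
        ("2026-04-18", "Perf", 1.5),
        ("2026-04-18", "Event", 0.5),
        ("2026-04-19", "Perf", 4.0),
        ("2026-04-19", "Syslog", 0.75),
        ("2026-04-20", "Perf", 5.0),
    ]
    print(flag_days_near_cap(sample, cap_gb=5.0))
```

Days flagged "near cap" are the ones worth raising before they turn into a silent ingestion stop.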

Debug Alert Rules That Are Not Firing

Your logs are flowing, your metrics look fine, but the alert never fired when it should have. This is one of the most common Azure Monitor troubleshooting scenarios and it has several distinct causes.

First, check the alert rule's own history. Go to Azure Monitor → Alerts → Alert rules. Click on the specific rule. In the left menu, click History. This shows you every evaluation, including evaluations that ran but didn't trigger. Look for the pattern: is it evaluating at all, or are there gaps?

If evaluations are happening but not triggering, your threshold, aggregation type, or time window may be wrong. A common mistake: setting a metric alert to trigger when average CPU exceeds 90% over a 5-minute window. A short spike hits 95%, but the 5-minute average never crosses 90% because the quiet minutes around the spike dilute it. Switch the aggregation type to Maximum, or shorten the window, and the rule will catch it.
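The arithmetic is easy to check by hand. The sketch below is not how Azure evaluates rules internally; it just aggregates one window of made-up per-minute CPU samples two ways to show why an average can hide a spike that a maximum catches. The function name and the sample values are mine.

```python
from statistics import mean


def would_fire(samples, threshold, aggregation):
    """Evaluate one alert window: aggregate the samples, then compare."""
    aggregate = {"Average": mean, "Maximum": max}[aggregation]
    return aggregate(samples) > threshold


if __name__ == "__main__":
    # Invented data: one sample per minute, with a single spike to 95%.
    window = [40, 42, 95, 41, 43]
    print("Average fires:", would_fire(window, 90, "Average"))
    print("Maximum fires:", would_fire(window, 90, "Maximum"))
```

The longer the window, the more quiet samples the spike has to outweigh, which is why widening an Average window makes short spikes harder to catch, not easier.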

If evaluations have gaps or errors, open the rule and look at the Condition. For log search alerts, click See last run results. You'll often see:

Query evaluation failed: The query could not be executed within the time limit.

This means your KQL query is too expensive. Optimize it: add where clauses early, avoid join on large tables without filters, and ensure you're using indexed columns like TimeGenerated and Computer.

Second issue: the rule fires but the action group fails. Check the action group separately. Go to Azure Monitor → Action groups, select your group, and click Test. This sends a test notification through every configured action (email, SMS, webhook, Logic App). If an email test doesn't arrive, check the email address, spam filters, and whether the action group is in the correct Azure region for the alert rule; cross-region action group calls have higher latency and occasional failures during regional incidents.

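# List activity log alert rules via Azure CLI (this lists rules, not fired alerts)
az monitor activity-log alert list --output table

# Send a test notification through an action group.
# Syntax varies between CLI versions; confirm with:
#   az monitor action-group test-notifications create --help
# The email receiver below is a placeholder.
az monitor action-group test-notifications create \
  --resource-group "myRG" \
  --action-group "myActionGroup" \
  --alert-type "metricstaticthreshold" \
  --add-action email oncall oncall@example.com usecommonalertschema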

Restore Missing Metrics and Fix Diagnostic Settings Gaps

Platform metrics for Azure resources (things like App Service CPU, SQL DTU consumption, or Storage Account transactions) don't require an agent. They're emitted automatically by the platform. But they only flow to your Log Analytics workspace if you configure Diagnostic Settings for each resource. If metrics are missing from the workspace, this is almost always the cause.

Go to the affected resource (say, an App Service Plan). In the left menu under Monitoring, click Diagnostic settings. If the page is empty, no settings are configured. Click Add diagnostic setting.

Name it something meaningful. Then select which log categories and metrics to send. Check AllMetrics at minimum. Under Destination details, check Send to Log Analytics workspace and pick your workspace. Click Save.

Metrics typically appear in the workspace within 3–5 minutes. They land in the AzureMetrics table:

AzureMetrics
| where ResourceProvider == "MICROSOFT.WEB"
| where MetricName == "CpuPercentage"
| where TimeGenerated > ago(1h)
| summarize avg(Average) by bin(TimeGenerated, 5m), Resource
| render timechart

For large environments where you have dozens or hundreds of resources without Diagnostic Settings, doing this manually is impractical. Use Azure Policy to enforce it automatically. The built-in policy "Deploy Diagnostic Settings for App Service to Log Analytics workspace" (Policy ID: b4e2cfce-4be8-4dd7-a2d8-3cc3dbf7eaad; confirm the ID under Policy → Definitions before assigning) can be assigned at subscription level with a DeployIfNotExists effect. It will automatically create Diagnostic Settings on any App Service that doesn't have them.

# Assign the diagnostic settings policy via PowerShell
# DeployIfNotExists assignments need a managed identity, which in turn needs a location
$definition = Get-AzPolicyDefinition -Id "/providers/Microsoft.Authorization/policyDefinitions/b4e2cfce-4be8-4dd7-a2d8-3cc3dbf7eaad"
$assignment = New-AzPolicyAssignment `
  -Name "enforce-diag-appservice" `
  -PolicyDefinition $definition `
  -Scope "/subscriptions/{subscriptionId}" `
  -IdentityType "SystemAssigned" `
  -Location "{region}" `
  -PolicyParameterObject @{
    logAnalytics = "/subscriptions/{subId}/resourceGroups/{rg}/providers/Microsoft.OperationalInsights/workspaces/{workspaceName}"
  }

Once assigned, existing non-compliant resources need a remediation task. Go to Policy → Remediation and create a task for the assignment. Azure will automatically deploy Diagnostic Settings to every affected resource in scope.

Advanced Azure Monitor Troubleshooting

If the steps above didn't solve it, you're dealing with something deeper. These scenarios come up in enterprise environments, domain-joined machines, and locked-down network configurations.

Private Link and Network Isolation Issues

If your VMs are in a virtual network with no public internet access, the Azure Monitor Agent needs to reach its endpoints via Azure Private Link. Without it, every call to global.handler.control.monitor.azure.com or *.ods.opinsights.azure.com will silently fail.

You need an Azure Monitor Private Link Scope (AMPLS) resource connected to your workspace. Go to Azure Monitor → Settings → Network Isolation. Create an AMPLS, associate your workspace with it, and create a Private Endpoint from your VNet to the AMPLS. The DNS resolution for the monitor endpoints must then resolve to private IPs. Check this from the VM:

Resolve-DnsName global.handler.control.monitor.azure.com

You should see a private IP from your VNet's address space, typically in the 10.x.x.x, 172.16.x.x to 172.31.x.x, or 192.168.x.x ranges, if private link is working. A public IP means DNS isn't resolving to the private endpoint. Check your private DNS zones for privatelink.monitor.azure.com and privatelink.ods.opinsights.azure.com.
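If you're checking many hosts, eyeballing addresses gets error-prone, especially in the 172.x range where only 172.16.0.0/12 is private. The sketch below tests an address against the three RFC 1918 blocks using Python's standard ipaddress module. The function name is mine, and it assumes your VNet uses RFC 1918 space, which is typical but not mandatory.

```python
import ipaddress

# The three RFC 1918 private IPv4 blocks.
RFC1918 = [
    ipaddress.ip_network("10.0.0.0/8"),
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]


def is_rfc1918(address):
    """Return True if the IPv4 address falls inside an RFC 1918 block."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in RFC1918)


if __name__ == "__main__":
    # Sample addresses for illustration; paste in what your DNS lookup returned.
    for candidate in ("10.1.2.3", "172.20.0.7", "172.32.0.1", "20.50.1.1"):
        label = "private" if is_rfc1918(candidate) else "public"
        print(candidate, label)
```

Any monitor endpoint that comes back "public" from inside an isolated VNet is a DNS zone problem, not an agent problem.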

Arc-Enabled Server Agent Problems

For on-premises machines using Azure Arc, the Azure Connected Machine Agent must be healthy before AMA can function. Run:

azcmagent check
azcmagent show

Look for connectivity failures in the output. The Arc agent communicates through a different set of endpoints than AMA itself: *.his.arc.azure.com and *.guestconfiguration.azure.com must also be reachable.

Event Viewer Deep Dive for Windows Agent

For Windows-specific Azure Monitor troubleshooting, Event Viewer is your best friend. The three channels that matter most:

Applications and Services Logs\Microsoft\Azure Monitor Agent\Agent: Agent operational events
System: Look for Event ID 7034 (service crashed unexpectedly) or 7031 (service terminated unexpectedly) with source Service Control Manager and description matching AzureMonitorAgent
Application: Crash reports from AzureMonitorAgent.exe

Registry Check for Workspace Configuration

For the legacy Log Analytics Agent (MMA), workspace configuration is stored in the registry. If the agent is connecting to the wrong workspace, check:

HKLM:\SYSTEM\CurrentControlSet\Services\HealthService\Parameters\Service Connector Services\Log Analytics - {WorkspaceID}\

Look at the Workspace ID and Workspace Key values. If the key was rotated in the portal but not updated here, the agent will fail authentication with a 403 on every upload attempt.

Kusto Query Debugging

If your data is in the workspace but your dashboards and alerts look wrong, the issue may be in your queries rather than the data pipeline. Use search to verify data exists without a pre-structured query:

search in (Heartbeat, Perf, Event) "YourComputerName"
| where TimeGenerated > ago(24h)
| summarize count() by $table

This sidesteps your own query logic and shows raw record counts by table, confirming whether the data is present at all.

When to Call Microsoft Support

Escalate to Microsoft Support when: the Azure Monitor Agent is healthy, network connectivity is confirmed, DCRs are correctly associated, permissions are correct, but data still isn't flowing after 30+ minutes. This pattern suggests a backend ingestion pipeline issue on Microsoft's end. Open a support ticket with category Azure Monitor / Logs / Data not ingesting and include the agent log files, your workspace resource ID, and, for Arc-enabled servers, the output of azcmagent check. Microsoft's internal tooling can see ingestion errors that are invisible from the portal.

Prevention & Best Practices

Once you've fixed the immediate issue, the real win is making sure you don't have to go through this again. Azure Monitor troubleshooting is almost always reactive. Here's how to make it proactive.

Monitor your monitoring. This sounds circular but it's the most important thing. Create an alert rule on the Heartbeat table that fires when any machine stops sending heartbeats for more than 15 minutes:

Heartbeat
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(15m)

Point this alert at an action group that pages your ops team. This means you'll know about a broken agent before you need the data, not after an incident when you're frantically discovering the logs are blank.

Use Azure Policy to enforce agent deployment. The built-in initiative "Enable Azure Monitor for VMs" (also called VM Insights) contains multiple policies that deploy AMA, configure DCRs, and enable dependency agent across all VMs in a subscription or management group. Assign it with DeployIfNotExists so new VMs are automatically onboarded.

Standardize on Azure Monitor Agent, not MMA. Microsoft officially retired the Log Analytics Agent in August 2024. Any machine still running HealthService (MMA) is running unsupported software and will have increasing reliability issues. Plan your migration to AMA; it's a one-day project for most environments and the new agent is significantly more reliable.

Document your workspace architecture. Keep a record of which workspaces serve which purpose (security, operations, application, etc.), which subscriptions feed them, and what the daily cap is set to. I've seen outages caused purely by someone increasing VM count without adjusting the workspace cap, and nobody knew what the cap was because nobody had written it down.

Test your action groups monthly. The Test button in every action group takes 10 seconds to use. Make it part of your monthly ops checklist. Webhook URLs expire, email addresses change, Logic Apps get disabled. You don't want to discover this during an actual incident.

Quick Wins

Enable the VM Insights solution on your workspace. It adds pre-built dashboards and maps that make performance issues immediately visible without any custom query work.
Set your workspace retention to at least 90 days. The default 30 days is insufficient for most incident post-mortems and compliance requirements.
Configure a second action group as a backup for every critical alert, so that if the primary email group fails, the backup SMS or webhook still fires.
Use Azure Monitor Workbooks instead of static dashboards. They support parameters, conditional formatting, and drill-down, making ongoing operational visibility far easier to maintain.

Frequently Asked Questions

Why is my Azure Monitor data delayed by hours sometimes?

Azure Monitor ingestion typically completes within 2–5 minutes for agent-collected data, but delays of up to 90 minutes are within the SLA for log data. Longer delays usually mean one of two things: the workspace is approaching its daily ingestion cap (which throttles incoming data), or there's an active Azure regional incident affecting the ingestion pipeline. Check the Azure Service Health dashboard under Monitor → Service Health for active incidents affecting Azure Monitor in your region. For cap-related delays, go to the workspace's Usage and estimated costs blade and look for throttling warnings.

My alert fired but I never got the email. Where did it go?

First, confirm the alert actually fired by checking Monitor → Alerts → Alert history. If it shows "Fired" with a timestamp, the problem is in the action group delivery, not the alert rule itself. Go to the action group, click the email action, and verify the address. Then test the action group using the Test action group button. If the test email doesn't arrive, check your spam/junk folder, and verify that Azure's alert sender addresses (for example, azure-noreply@microsoft.com) aren't blocked by your email gateway or Microsoft 365 mail flow rules.

How do I know if my Log Analytics workspace is out of storage or hit its daily cap?

Navigate to your workspace in the Portal and click Usage and estimated costs in the left menu. You'll see a bar chart showing daily ingestion volume versus your cap. If today's bar is cut off flat at the cap line, ingestion stopped at that point. You can also query this directly: run _LogOperation | where Category == "Ingestion" | where Detail contains "OverQuota" in the Logs blade. Any results mean the cap was hit. To raise the cap, click Daily volume cap on that same page and increase the GB limit.

Can I use Azure Monitor troubleshooting for containers running in AKS?

Yes, but the setup is different. AKS uses the Container Insights add-on (enabled via az aks enable-addons --addons monitoring), which deploys a containerized version of the monitoring agent as a DaemonSet named omsagent or ama-logs. Check its health with kubectl get pods -n kube-system | grep ama. If pods are CrashLooping or in Error state, check their logs with kubectl logs -n kube-system {pod-name}. The most common issue in AKS is the Container Insights add-on being installed but pointing to a workspace in a different region than the cluster; this causes high latency and intermittent failures.

What's the difference between Azure Monitor Metrics and Azure Monitor Logs, and why does it matter for troubleshooting?

Metrics are lightweight numeric time-series data (CPU %, memory bytes, request count) stored in a dedicated metrics database; they're retained for 93 days and available near-real-time with no workspace required. Logs are structured records (events, traces, diagnostics) stored in a Log Analytics workspace and queried with KQL. This matters for troubleshooting because they have completely separate pipelines that fail independently. If a metric chart in the Portal shows data but the AzureMetrics table in your workspace is empty, the Diagnostic Settings for that resource may be sending logs but not metrics (or vice versa). Always check both pipelines when diagnosing data gaps.

After reinstalling the Azure Monitor Agent, how long before data starts showing up?

After a clean reinstall, expect 5–10 minutes before the first Heartbeat record appears in the workspace. Performance counters and event logs typically begin flowing within 5 minutes of the first successful heartbeat. If you don't see a Heartbeat within 15 minutes of reinstallation, the agent is either failing to start (check Services.msc for AzureMonitorAgent status) or failing to reach the workspace endpoint (run the Test-NetConnection check to port 443 on the ingestion endpoint). Also confirm the DCR is still associated with the machine; a reinstall doesn't automatically re-create DCR associations that were manually configured.

Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.