How to Troubleshoot Azure
Why Azure Troubleshooting Is So Hard
I've spent years inside Azure subscriptions: everything from scrappy startup tenants with two virtual machines to sprawling enterprise environments running thousands of resources across a dozen regions. And the one thing that never changes? When something breaks in Azure, the error messages seem almost designed to confuse you. You get a red banner that says "Deployment failed" with a correlation ID, a bunch of nested JSON, and absolutely no clear indication of what went wrong or where to start.
That's what this guide is for. Whether you're staring at error code AuthorizationFailed, your Azure CLI keeps throwing AADSTS70011, your Virtual Machine won't start, or your Resource Group just vanished from the portal, I'm going to walk you through every serious fix in order.
Azure issues broadly fall into five buckets. First, authentication and identity problems: expired tokens, misconfigured service principals, multi-factor authentication timeouts, and Entra ID (formerly Azure Active Directory) Conditional Access policies blocking your login. Second, subscription and quota problems: your subscription is disabled, you've hit a regional vCPU quota, or your spending limit kicked in. Third, resource deployment failures: ARM template errors, policy violations, and SKU availability in specific regions. Fourth, networking and connectivity issues: NSG rules blocking traffic, private endpoint misconfigurations, and ExpressRoute outages. Fifth, Azure platform-level incidents: actual Microsoft outages that have nothing to do with your configuration.
Microsoft's error messages rarely tell you which bucket you're in. That's the core frustration. The portal might say "The resource operation completed with terminal provisioning state 'Failed'", which could mean literally any of the five things above. That's why having a systematic Azure troubleshooting process matters more than any individual fix.
The good news: almost every Azure problem leaves a trail. Activity logs, resource health events, diagnostic settings, and the Service Health dashboard all contain the real answer. You just have to know where to look and in what order. I'll show you exactly that.
The Quick Fix: Try This First
Before you go deep into logs and diagnostics, run through this fast checklist. I'd say 40% of the Azure troubleshooting cases I've seen are solved in under three minutes by one of these steps.
Step 1: Check Azure Service Health right now. Go to portal.azure.com, type "Service Health" into the top search bar, and open it. Switch to the Service Issues tab. Filter by your subscription and the regions you care about. If there's an active incident, you're done: wait for Microsoft to resolve it. There is nothing wrong with your configuration. I've seen engineers spend four hours debugging "their" network problems during an Azure outage they didn't know about. Don't be that person.
Step 2: Clear your Azure credentials and re-authenticate. Stale tokens cause a shocking number of Azure portal and CLI failures. In the Azure CLI, run:
az account clear
az login
If you're using Azure PowerShell, run:
Disconnect-AzAccount
Connect-AzAccount
Then re-select your subscription:
az account set --subscription "Your Subscription Name"
Step 3: Verify your subscription is active. In the portal, navigate to Subscriptions (search for it in the top bar). Check the Status column. It should say "Active". If it says "Disabled", "Warned", or "Deleted", that's your root cause, and Step 2 in the detailed section below covers exactly how to fix it.
Step 4: Check resource-level health. Navigate to the specific resource that's failing, then click Resource health in the left sidebar under the "Help" section. This is one of the most underused features in all of Azure. It gives you a direct Azure platform assessment of whether the resource itself is healthy, degraded, or experiencing a platform incident, separate from anything you did.
If none of these quick steps resolved your Azure issue, keep reading. The full step-by-step section covers every major scenario with exact commands and portal paths.
Pro tip: whenever an operation fails, copy its Correlation ID, a GUID that looks like 3f2e4a1b-8c7d-4e9f-b2a1-6d5c8e3f1a2b. Microsoft Support needs this exact ID to pull the backend logs for your specific failed operation. Without it, support cases can drag on for days while they search for your incident.
Step 1: Check Azure Service Health and the Activity Log
This is always Step 1. I don't care how confident you are that the issue is on your end; check this first. Azure Service Health gives you real-time visibility into platform incidents, planned maintenance windows, and health advisories affecting your specific subscriptions and regions.
In the Azure portal, search for Service Health in the top search bar and open it. You'll see four tabs: Service Issues, Health Advisories, Security Advisories, and Planned Maintenance. Click Service Issues first. Use the filters at the top to narrow by your subscription, the affected region (East US, West Europe, etc.), and the service type (Virtual Machines, Storage, Azure SQL, etc.).
If you see an active incident, click it to read the impact statement and estimated resolution time. Microsoft posts updates there every 30–60 minutes during active incidents. You can also configure alerts so you get an email or webhook notification the moment an incident is declared. Do this now; don't wait until the next outage:
az monitor activity-log alert create \
--name "AzureServiceHealthAlert" \
--resource-group YourResourceGroup \
--condition category=ServiceHealth \
--action-group /subscriptions/{sub-id}/resourceGroups/{rg}/providers/Microsoft.Insights/actionGroups/{ag-name}
Next, check the Activity Log. Navigate to your resource group or subscription, click Activity log in the left menu, and filter the timeframe to "Last 1 hour" or "Last 6 hours" depending on when the issue started. Look for entries marked with a red circle; those are failed operations. Click any failed entry to expand it and see the full JSON response, which almost always contains the actual error code buried in the statusMessage field. That's the real error. Whatever the portal showed you on the surface was just a summary.
What you should see if it's working: Either a green "Available" status in Service Health with no active incidents, or a clear error code in the Activity Log that you can now look up specifically.
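If you'd rather not click through every red entry, the same digging can be scripted. Here's a minimal sketch, assuming you've exported failed entries as JSON (for example with az monitor activity-log list --resource-group YourResourceGroup --status Failed --offset 6h --output json) and that statusMessage holds a JSON string with an error object, which is common but not guaranteed. The helper name extract_error and the sample entry are mine, purely for illustration.

```python
import json


def extract_error(entry):
    """Return (code, message) from one activity-log entry.

    Falls back to (None, raw_text) when statusMessage is not JSON.
    """
    raw = (entry.get("properties") or {}).get("statusMessage")
    if not raw:
        return None, None
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        # Plain-text (or already-parsed) status message: return it untouched.
        return None, raw
    if not isinstance(payload, dict):
        return None, raw
    # Assumption: some entries wrap details in an "error" object, others do not.
    error = payload.get("error", payload)
    if not isinstance(error, dict):
        return None, raw
    return error.get("code"), error.get("message")


# Hypothetical entry, trimmed to the fields this sketch reads.
sample = {
    "operationName": {"value": "Microsoft.Compute/virtualMachines/write"},
    "properties": {
        "statusMessage": '{"error": {"code": "QuotaExceeded", "message": "Core quota exceeded"}}'
    },
}
print(extract_error(sample))
```

Run it over every entry in the exported list and you get a short table of real error codes to look up, instead of a wall of nested JSON.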
Step 2: Verify Your Subscription Status and Quotas
A disabled, over-quota, or spending-limited subscription is one of the most common reasons Azure resources fail to deploy or start, and one of the most confusing, because the error messages don't always say "subscription problem" directly. You might see QuotaExceeded, SubscriptionNotFound, BillingAccountNotFound, or just generic deployment failures.
Go to Subscriptions in the portal. Check the Status column. If it says anything other than "Active", here's what to do:
- Disabled (spending limit reached): Click the subscription, then click Spending limit at the top. Click Remove spending limit. You'll need a credit card on file. This only applies to free trial and MSDN/Visual Studio subscriptions; pay-as-you-go subscriptions don't have spending limits.
- Disabled (payment failed): Go to Cost Management + Billing, click Payment methods, and update your card. It can take up to 24 hours for the subscription to re-enable after payment is processed.
- Disabled (policy): This usually happens in enterprise agreements. Contact your Azure billing admin or open a support ticket with Microsoft.
For quota errors, run this command to see your current vCPU usage and limits in a region:
az vm list-usage --location "eastus" --output table
If you're hitting quota, you can request an increase directly in the portal. Go to Help + Support, click New support request, select Service and subscription limits (quotas) as the issue type, choose your subscription, then select Compute-VM (cores-vCPUs) subscription limit increases. Standard increases are usually approved within a few hours. Larger increases (say, 500+ cores) may require a business justification and can take 1–3 business days.
What success looks like: Subscription status shows "Active" and your deployment retries succeed without a quota-related error code.
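The usage table gets long, so it helps to filter it down to the quotas you're actually close to. The sketch below assumes you've saved the output of az vm list-usage --location "eastus" --output json and that each item carries currentValue, limit, and name.value fields; the function name near_quota and the 80% threshold are my own choices, not anything Azure prescribes.

```python
def near_quota(usages, threshold=0.8):
    """Return (quota_name, current, limit) for quotas at or above the threshold."""
    flagged = []
    for usage in usages:
        # Cast defensively: some CLI versions emit these numbers as strings.
        limit = int(usage["limit"])
        current = int(usage["currentValue"])
        if limit > 0 and current / limit >= threshold:
            flagged.append((usage["name"]["value"], current, limit))
    return flagged


# Hypothetical data shaped like the CLI's JSON output.
usages = [
    {"currentValue": "18", "limit": "20", "name": {"value": "cores"}},
    {"currentValue": 1, "limit": 100, "name": {"value": "standardDSv3Family"}},
    {"currentValue": 0, "limit": 0, "name": {"value": "lowPriorityCores"}},
]
for name, current, limit in near_quota(usages):
    print(f"{name}: {current}/{limit}")
```

Anything this flags is worth a quota request before your next deployment, not after it fails.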
Step 3: Fix Authentication and Authorization Errors
Authentication failures are the second most common Azure troubleshooting scenario I deal with. The error codes are frustratingly cryptic (AADSTS50076, AADSTS70011, AADSTS65001, AuthorizationFailed), and the fix depends entirely on which one you have. Let me break down the most frequent ones.
AADSTS50076 (MFA required): Your account requires multi-factor authentication but the token request didn't include it. This often happens when automation scripts sign in non-interactively with a user account. If this is an interactive login, just complete the MFA prompt. If it's an automated sign-in, check your Conditional Access policies in Entra ID: a policy may have been applied that now requires MFA for that account in certain contexts. Service principals can't complete an MFA prompt, so a script that hits this error is usually signing in as a user.
AADSTS70011 (invalid scope): The application is requesting a permission scope that doesn't exist or isn't consented to. Check your app registration in Entra ID under App registrations → [Your App] → API permissions. Verify the requested scopes exist and click Grant admin consent if needed.
AuthorizationFailed (RBAC issue): Your account doesn't have the right role on the resource. Run this to see your current role assignments:
az role assignment list --assignee your@email.com --output table
If you need to assign a role (and you have Owner or User Access Administrator on the scope), run:
az role assignment create \
--assignee your@email.com \
--role "Contributor" \
--scope /subscriptions/{subscription-id}/resourceGroups/{resource-group-name}
Token cache corruption: If your Azure CLI commands started failing suddenly after a long session, clear the token cache entirely:
az account clear
rm -f ~/.azure/msal_token_cache.json
az login
On Windows, the token cache lives under C:\Users\{username}\.azure\ (current CLI versions store it encrypted as msal_token_cache.bin rather than .json); delete that file and re-authenticate.
What success looks like: az account show returns your subscription details without error, and you can list resources with az resource list --output table successfully.
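When these errors land in CI logs or wrapper scripts, I find it handy to pull the code out of the message and print the matching first move. This is only a sketch: the HINTS table restates the fixes described above, the function name hint_for is hypothetical, and any code not in the table just gets a pointer to look it up.

```python
import re

# Hints restate the fixes covered in this step; extend the table as needed.
HINTS = {
    "AADSTS50076": "MFA required: complete the MFA prompt or review Conditional Access policies.",
    "AADSTS70011": "Invalid scope: check API permissions on the app registration and grant admin consent if needed.",
    "AuthorizationFailed": "RBAC issue: list role assignments and add the missing role at the right scope.",
}

_CODE = re.compile(r"\b(AADSTS\d+|AuthorizationFailed)\b")


def hint_for(error_text):
    """Return (code, hint) for the first known-looking code in error_text, else None."""
    match = _CODE.search(error_text)
    if match is None:
        return None
    code = match.group(1)
    default = "No hint recorded for this code; look it up in the Entra ID error reference."
    return code, HINTS.get(code, default)


print(hint_for("AADSTS70011: The provided value for the input parameter 'scope' is not valid."))
```

It won't diagnose anything for you, but it keeps the "which code do I actually have?" step from being skipped.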
Step 4: Diagnose Deployment Failures
Deployment failures are where Azure troubleshooting gets genuinely technical. You've clicked Deploy (or run your Bicep/ARM template), and you get back "The resource operation completed with terminal provisioning state 'Failed'." Here's how to actually find out what went wrong.
In the portal, go to your Resource Group and click Deployments in the left menu. You'll see a list of all deployments, click the failed one. On the overview page, you'll see which specific resources failed. Click the failed resource name, then click Operation details. This is where the real error lives, nested in the response body JSON.
From the CLI, you can get deployment operation details with:
az deployment group show \
--resource-group YourResourceGroup \
--name YourDeploymentName \
--query properties.error
Common deployment error codes and their fixes:
- InvalidTemplateDeployment / RequestDisallowedByPolicy: An Azure Policy is blocking the deployment. Run az policy assignment list --scope /subscriptions/{sub-id} --output table to see active policies. Contact your Azure admin if you need an exemption.
- SkuNotAvailable: The VM size or SKU you requested isn't available in that region. Run az vm list-skus --location eastus --size Standard_D --output table to find available SKUs. Consider switching to a different region or a comparable SKU.
- ResourceQuotaExceeded: You've hit a per-resource-type quota. See Step 2 for quota increase instructions.
- ParentResourceNotFound: You're trying to deploy a child resource before the parent exists. Check your template's dependsOn declarations.
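The codes in that list are usually not the ones at the top of the error: ARM tends to wrap the real cause in a generic outer error with nested details arrays. Here's a rough sketch that walks the properties.error object down to its innermost entries. It assumes the usual code / message / details shape; the function name leaf_errors and the sample error are hypothetical.

```python
def leaf_errors(error):
    """Return [(code, message)] for the innermost errors in a nested ARM error object."""
    details = error.get("details") or []
    if not details:
        return [(error.get("code"), error.get("message"))]
    leaves = []
    for child in details:
        leaves.extend(leaf_errors(child))
    return leaves


# Hypothetical error, shaped like the properties.error of a failed deployment.
deployment_error = {
    "code": "DeploymentFailed",
    "message": "At least one resource deployment operation failed.",
    "details": [
        {
            "code": "Conflict",
            "message": "Resource operation failed.",
            "details": [
                {"code": "SkuNotAvailable", "message": "The requested size is not available."}
            ],
        }
    ],
}
print(leaf_errors(deployment_error))
```

One caveat: some inner messages are themselves JSON strings with another error inside, so if a leaf message starts with a curly brace, parse it and repeat.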
For Bicep-specific issues, run a what-if analysis before deploying to catch problems ahead of time:
az deployment group what-if \
--resource-group YourResourceGroup \
--template-file main.bicep \
--parameters @parameters.json
What success looks like: The deployment shows "Succeeded" in green on the Deployments blade, and all expected resources appear in the resource group.
Step 5: Troubleshoot Networking and Connectivity
Networking problems are the sneakiest Azure troubleshooting category. Everything looks right (the resource is running, your credentials are valid, there's no service incident), but you still can't connect. Nine times out of ten, the culprit is an NSG rule, a missing route, or a DNS resolution failure inside a Virtual Network.
Start with Network Watcher. In the portal, search for Network Watcher and open it. The two tools you'll use most are IP flow verify and Connection troubleshoot.
IP flow verify tells you if a specific traffic flow (source IP, destination IP, port, protocol) is allowed or denied by your NSG rules, and which exact rule is responsible:
az network watcher test-ip-flow \
--direction Inbound \
--local 10.0.0.4:3389 \
--protocol TCP \
--remote 203.0.113.10:* \
--vm MyVM \
--resource-group MyRG \
--nic MyNIC
Connection troubleshoot checks end-to-end connectivity from a VM to a target, including latency and hop analysis:
az network watcher test-connectivity \
--source-resource MyVM \
--dest-address 8.8.8.8 \
--dest-port 443 \
--resource-group MyRG
NSG rule audit: View all rules on an NSG sorted by priority:
az network nsg rule list \
--nsg-name MyNSG \
--resource-group MyRG \
--output table
Remember: NSG rules are evaluated in priority order (lowest number first). A Deny rule at priority 100 beats an Allow rule at priority 200. If your traffic is being blocked, look for high-priority deny rules covering your port range.
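If the priority logic feels abstract, this toy model may help. It is a deliberately simplified sketch, not how Azure implements NSGs: it matches on destination port only and ignores direction, protocol, addresses, and the default rules. The Rule class and evaluate function are invented for the illustration.

```python
from dataclasses import dataclass


@dataclass
class Rule:
    name: str
    priority: int   # lower number = evaluated first
    access: str     # "Allow" or "Deny"
    port_low: int
    port_high: int


def evaluate(rules, port):
    """Return the first rule (in priority order) whose port range covers port."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.port_low <= port <= rule.port_high:
            return rule
    return None  # nothing matched; in real NSGs the default rules would decide


rules = [
    Rule("allow-rdp", 200, "Allow", 3389, 3389),
    Rule("deny-high-ports", 100, "Deny", 3000, 4000),
]
winner = evaluate(rules, 3389)
print(winner.name, winner.access)
```

The Allow rule for 3389 never gets a say, because the broader Deny at priority 100 matches first. That is exactly the pattern to look for in the rule list output.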
DNS issues inside VNets: If your VMs can't resolve internal names, check your Virtual Network's DNS server configuration. Go to Virtual Networks → [Your VNet] → DNS servers. If you're using Azure-provided DNS (168.63.129.16), make sure your NSG isn't blocking UDP port 53 to that address. That's a mistake I've seen more times than I can count.
What success looks like: IP flow verify returns "Allow" for your expected traffic flows, and connection troubleshoot shows "Reachable" for your target endpoints.
Advanced Azure Troubleshooting
If the steps above didn't resolve your issue, you're dealing with something more complex. Here's how to go deeper.
Azure Monitor and Log Analytics
For anything production-critical, Azure Monitor is your primary investigation tool. If you have a Log Analytics workspace connected to your resources, you can query logs directly using KQL (Kusto Query Language). To find all failed operations on a resource in the last 24 hours:
AzureActivity
| where TimeGenerated > ago(24h)
| where ActivityStatusValue == "Failure"
| project TimeGenerated, ResourceGroup, ResourceId, OperationNameValue, Properties
| order by TimeGenerated desc
To find authentication failures in Entra ID sign-in logs:
SigninLogs
| where TimeGenerated > ago(1h)
| where ResultType != "0"
| project TimeGenerated, UserPrincipalName, AppDisplayName, ResultType, ResultDescription, IPAddress
| order by TimeGenerated desc
Azure Diagnostics and Resource Health Events
For VM-level issues, enable boot diagnostics on your Virtual Machine (VM → Boot diagnostics → Enable). This lets you see the serial log output and a screenshot of the VM's display, which is essential for diagnosing VMs that boot-loop or get stuck at startup. With boot diagnostics on, you can also open a text console to the VM through Serial Console directly from the portal, even when the VM is unreachable over SSH or RDP.
For App Service failures, check the Diagnose and Solve Problems blade. Go to your App Service, click Diagnose and solve problems in the left menu. This runs automated checks across a dozen categories (availability, performance, configuration) and surfaces specific findings. It's genuinely useful and underused.
Group Policy and Enterprise Environments
In domain-joined or Entra ID-joined enterprise environments, Conditional Access policies and Privileged Identity Management (PIM) add complexity. If you're getting blocked in an enterprise tenant:
- Check Conditional Access in Entra ID admin center (Protection → Conditional Access → Policies) for policies that might be blocking your account or device.
- If your account uses PIM, you may need to activate an eligible role before it becomes effective. Go to Entra ID → Privileged Identity Management → My roles and activate the required role.
- Check the Entra ID Sign-in logs (Monitoring → Sign-in logs) and filter for Failure status to see exactly which Conditional Access policy is blocking the login and why.
Azure Resource Manager Throttling
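If you're running automation at scale, you may hit ARM API throttling. The error is HTTP 429 TooManyRequests, returned with a Retry-After header. Under ARM's long-standing hourly limits, each subscription gets 12,000 read requests, 1,200 write requests, and 15,000 delete requests per hour, counted per Azure Resource Manager instance in a region. Microsoft has been moving subscriptions to a token-bucket model, so confirm the current numbers in the ARM throttling documentation. If you're hitting this, implement exponential backoff in your automation, honor Retry-After, and consider spreading operations across multiple subscriptions or staggering deployments.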
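Exponential backoff is easy to get subtly wrong, so here's a minimal sketch of the pattern. It is not tied to any Azure SDK: the Throttled exception stands in for however your client surfaces a 429, and call_with_backoff, its parameters, and the jitter strategy are all my own assumptions. The rule it follows is simple: if the service told you how long to wait, wait that long; otherwise double the ceiling on each attempt and pick a random delay under it.

```python
import random
import time


class Throttled(Exception):
    """Stand-in for a 429 response; retry_after is the Retry-After value in seconds."""

    def __init__(self, retry_after=None):
        super().__init__("429 TooManyRequests")
        self.retry_after = retry_after


def call_with_backoff(operation, max_attempts=5, base_delay=1.0,
                      max_delay=60.0, sleep=time.sleep):
    """Call operation(), retrying on Throttled with exponential backoff."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Throttled as exc:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling error
            if exc.retry_after is not None:
                delay = exc.retry_after  # the service's own hint wins
            else:
                ceiling = min(max_delay, base_delay * 2 ** attempt)
                delay = random.uniform(0, ceiling)  # jitter avoids synchronized retries
            sleep(delay)


# Usage with a fake operation that is throttled twice, then succeeds.
attempts = {"count": 0}


def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise Throttled(retry_after=2)
    return "ok"


waits = []
result = call_with_backoff(flaky_operation, sleep=waits.append)
print(result, waits)
```

Passing sleep in as a parameter is just a convenience so the retry logic can be exercised without actually waiting; in real automation you'd leave the default.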
Escalate to Microsoft Support when: (1) Azure Service Health shows no active incident but your resource health shows "Degraded" or "Unavailable" without any configuration change on your end; (2) you have a Correlation ID from a failed operation that you can't explain through Activity Logs or diagnostics; (3) a billing or subscription issue isn't resolving through self-service; or (4) you're experiencing data loss or a security incident. Go to Microsoft Support, create a new support request, and always include the Correlation ID, your Subscription ID, the affected Resource ID, and a precise UTC timestamp of when the failure occurred. This cuts resolution time significantly.
Prevention & Best Practices
The best Azure troubleshooting session is the one you never have. I've seen the same preventable problems show up repeatedly across different organizations. Here's what separates the teams that barely notice incidents from the ones that scramble every time something breaks.
Set up Azure Service Health alerts before you need them. Go to Service Health → Health alerts → Add service health alert. Configure it to fire on Service Issues, Planned Maintenance, and Health Advisories for your active regions and services. Route it to an Action Group that sends email, SMS, and a Teams/Slack webhook. This takes 15 minutes to set up and will save you hours of confusion during the next Azure incident.
Enable diagnostic settings on all production resources. Every resource in Azure has a Diagnostic settings blade. Turn on all logs and route them to a Log Analytics workspace. Yes, it costs money, but 30 days of logs for a small environment is typically under $20/month. Not knowing why something failed costs far more. For Virtual Machines, also enable the Azure Monitor agent and configure data collection rules to capture system and security event logs.
Use Azure Policy to enforce a consistent baseline. Create policies that require tags on all resources, enforce allowed VM SKUs, require diagnostic settings to be enabled, and restrict resource creation to approved regions. This prevents the configuration drift that makes Azure troubleshooting so much harder in mature environments. Start with the built-in policy initiative for the "Azure Security Benchmark" (since renamed the Microsoft cloud security benchmark); it covers a huge amount of ground with minimal effort.
Implement a proper tagging strategy today. Resources without tags are nearly impossible to attribute during an incident. At minimum, tag every resource with Environment (prod/staging/dev), Owner (team email), CostCenter, and Application. You can enforce this with Azure Policy. When something breaks at 2am, you want to know immediately which team owns it and what application it belongs to.
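Until a policy is enforcing this, a quick audit tells you how far off you are. The sketch below assumes a resource list exported with az resource list --output json, where each item has a name and a tags object (null when untagged). The four required tags come from the paragraph above; the function name missing_tags is hypothetical. Tag names are compared case-insensitively, since Azure treats them that way.

```python
REQUIRED_TAGS = ("Environment", "Owner", "CostCenter", "Application")


def missing_tags(resources, required=REQUIRED_TAGS):
    """Map resource name -> list of required tags that are absent or empty."""
    report = {}
    for resource in resources:
        tags = resource.get("tags") or {}
        present = {key.lower() for key, value in tags.items() if value}
        missing = [tag for tag in required if tag.lower() not in present]
        if missing:
            report[resource["name"]] = missing
    return report


# Hypothetical resources shaped like the CLI's JSON output.
resources = [
    {"name": "vm-web-01", "tags": {"environment": "prod", "Owner": "web@example.com",
                                   "CostCenter": "4200", "Application": "storefront"}},
    {"name": "stlogs01", "tags": {"Environment": "prod", "Owner": ""}},
    {"name": "vnet-core", "tags": None},
]
for name, missing in missing_tags(resources).items():
    print(name, "is missing", ", ".join(missing))
```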
Test your RBAC assignments regularly. Run quarterly access reviews in Entra ID (Identity Governance → Access reviews). Accumulated permissions (service principals with Contributor access that were granted "temporarily," developers with Owner on production) are both a security risk and a troubleshooting nightmare. The principle of least privilege makes the blast radius of any misconfiguration much smaller.
- Enable Resource Health alerts for every production resource: a 5-minute setup that gives you instant notification when Azure detects platform-level degradation
- Turn on Azure Advisor recommendations and review them weekly; it flags security gaps, performance issues, and cost waste with specific remediation steps
- Set subscription spending alerts at 80% and 100% of your expected monthly budget under Cost Management → Budgets
- Lock production resource groups with a CanNotDelete or ReadOnly management lock to prevent accidental deletion
Frequently Asked Questions
Why does Azure keep saying "Authorization failed" even though I'm the subscription owner?
Being a subscription Owner means you have the Owner role at the subscription scope, but some operations also require specific Entra ID directory roles, not just Azure RBAC roles. For example, managing Entra ID users, app registrations, or conditional access requires roles like Global Administrator or Application Administrator in Entra ID, which is separate from your Azure subscription role. Check if the failing operation involves an Entra ID resource (not an Azure resource) and verify you have the appropriate directory role. Also double-check that your role assignment hasn't been scoped to a resource group rather than the full subscription.
My Azure VM shows as "Running" in the portal but I can't RDP or SSH into it, what's going on?
"Running" in the portal means the Azure hypervisor reports the VM is powered on at the infrastructure level; it says nothing about whether the OS is healthy or whether your network path is clear. Start with Network Watcher IP Flow Verify to confirm your NSG rules allow the traffic on port 3389 (RDP) or 22 (SSH). Then check whether a User Defined Route or Azure Firewall is intercepting the traffic. If the network path looks clean, use the Serial Console (VM → Serial console) to check whether the OS has booted completely or is stuck; you may see a failed service or a file system error blocking the login screen.
How do I find out which Azure Policy is blocking my deployment?
When a deployment fails with RequestDisallowedByPolicy, the error message in the Activity Log almost always includes the specific policy definition display name and assignment name. Go to Activity Log, find the failed deployment operation, and expand the Status Message in the JSON; look for the policyDefinitionDisplayName and policyAssignmentName fields. With those names, go to Policy → Assignments and filter to find the exact assignment. You can then request an exemption from your Azure admin, or if you own the policy, add an exception for your specific resource or subscription.
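If you have the error JSON in hand, pulling out those two fields takes a few lines. Treat this as a sketch under assumptions: it expects the error object to carry an additionalInfo array with entries of type PolicyViolation, each with an info object holding the two fields named above. That matches the payloads I've seen, but verify it against your own. The function name policy_violations and the sample error are hypothetical.

```python
def policy_violations(error):
    """Return the policy definition and assignment names found in an error object."""
    found = []
    for item in error.get("additionalInfo") or []:
        # Assumption: policy blocks are reported with type "PolicyViolation".
        if item.get("type") != "PolicyViolation":
            continue
        info = item.get("info") or {}
        found.append({
            "definition": info.get("policyDefinitionDisplayName"),
            "assignment": info.get("policyAssignmentName"),
        })
    return found


# Hypothetical error object, trimmed to the fields this sketch reads.
policy_error = {
    "code": "RequestDisallowedByPolicy",
    "message": "Resource 'vm-test' was disallowed by policy.",
    "additionalInfo": [
        {
            "type": "PolicyViolation",
            "info": {
                "policyDefinitionDisplayName": "Allowed locations",
                "policyAssignmentName": "allowed-locations-prod",
            },
        }
    ],
}
print(policy_violations(policy_error))
```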
What's the difference between Azure Resource Health and Azure Service Health?
Azure Service Health covers the Azure platform broadly: it tells you about datacenter incidents, planned maintenance, and regional outages that affect Microsoft's infrastructure. It's about what Microsoft is experiencing. Azure Resource Health is specific to your individual resources: it tells you whether your particular VM, SQL database, or App Service is healthy from Azure's perspective, and whether any recent platform events affected that specific resource. Think of Service Health as the weather forecast for the whole region, and Resource Health as whether your specific house got hit by the storm. Always check both when troubleshooting, because they surface different types of problems.
How do I fix "The subscription is not registered to use namespace Microsoft.Compute" errors?
This error means the resource provider for the service you're trying to use isn't registered on your subscription. It's common on brand-new subscriptions or when using less common Azure services for the first time. Fix it with one command: az provider register --namespace Microsoft.Compute (replace Microsoft.Compute with whatever namespace appears in your error). Registration usually completes within 1–2 minutes. You can check the status with az provider show --namespace Microsoft.Compute --query registrationState. In the portal, you can also do this at Subscriptions → [Your Sub] → Resource providers, find the namespace and click Register.
Azure deployment worked yesterday but is failing today with the same template, why?
There are three common explanations for this. First, an Azure Policy was added or changed since yesterday that now blocks something your template does; check the Activity Log for RequestDisallowedByPolicy errors. Second, a SKU you're deploying became unavailable in that region due to capacity constraints; try a different VM size or a nearby region. Third, your service principal or Managed Identity lost a role assignment; someone may have cleaned up permissions. Run az role assignment list --assignee [your-sp-object-id] --output table to verify the assignments are still in place. Checking the Activity Log's failed operations will usually point you to the exact cause within minutes.