How to Troubleshoot Azure Automation, Full Fix Guide
Why Azure Automation Troubleshooting Is So Painful
I've seen this exact scenario more times than I can count: a runbook that worked perfectly last Tuesday suddenly fails on Wednesday morning with a cryptic error like "The term 'Get-AzVM' is not recognized as the name of a cmdlet", or a job sits in Queued status for 45 minutes and never runs. You're staring at the Azure portal wondering what changed, because you didn't change anything. I know this is frustrating, especially when it's blocking an automated patch cycle or a nightly backup job that your whole team depends on.
Azure Automation is one of those services where failures tend to be silent or deeply buried. The job output pane shows a red status, but the actual error is three clicks deep in a stream log. And Microsoft's portal error messages? They're often generic to the point of being useless. "Job failed" tells you nothing about whether the problem is a missing module, a broken Hybrid Worker connection, a permissions issue on your Run As account, or a PowerShell version mismatch.
Here's what's actually going on under the hood. Azure Automation runbooks execute either on Azure sandboxes (Microsoft-managed, shared compute) or on Hybrid Runbook Workers (VMs you manage yourself). These two execution environments behave completely differently, and a runbook that works in one often breaks in the other. Sandbox jobs are subject to a hard 3-hour time limit, memory caps around 400 MB, and module restrictions. Hybrid Workers pull from whatever PowerShell environment is installed on your on-premises or Azure-hosted VM, which means outdated modules, stale credentials, and network policies all come into play.
The most common root causes I see in Azure Automation troubleshooting cases are: module version conflicts (especially after the Az module updates), expired or broken Run As accounts (certificates expire after one year by default), Hybrid Worker connectivity failures (Log Analytics agent goes offline silently), insufficient permissions on the Automation Account's managed identity or service principal, and runbook logic errors that only surface at runtime against real data.
There's also a growing class of Azure Automation job failures tied to the deprecation of the classic Run As account. Microsoft ended support for it in September 2023, and accounts that hadn't migrated to Managed Identity started breaking silently. If your automation stopped working sometime in late 2023 or early 2024 and you haven't migrated yet, that's almost certainly your culprit.
The good news: most Azure Automation failures are fixable without opening a support ticket. Let's go through them systematically.
The Quick Fix, Try This First
Before you go deep into diagnostics, do this one check first. It resolves roughly 40% of the Azure Automation troubleshooting cases I see.
Go to your Automation Account in the Azure portal. In the left sidebar, click Modules (under Shared Resources), then click Az.Accounts. Look at the version number. If it's below 2.12.0 or there's a red "Failed" status next to it, that's your problem. A corrupted or outdated Az.Accounts module causes cascading failures in every other Az.* module because they all depend on it for authentication context.
To fix it, click Browse Gallery at the top of the Modules blade, search for Az.Accounts, select it, and click Import. Set the runtime version to PowerShell 7.2 (5.1 is legacy at this point), unless the failing runbook still targets 5.1: modules are imported per runtime version, so the import has to match the runbook's runtime. The import takes 3–5 minutes. After it completes, go back and re-import any other Az.* modules that show failed status, starting with Az.Compute, Az.Resources, and Az.Storage in that order.
While the import runs, check one more thing. Go to Run As Accounts (or Identity if you've migrated to Managed Identity). If you see a certificate expiry date in the past, or a red warning, your authentication is completely broken and every single runbook will fail at the Connect-AzAccount step, regardless of anything else you try.
To renew the Run As certificate without losing the account: click on the Run As account entry, click Renew Certificate, and confirm. This regenerates a fresh 1-year certificate and updates the associated service principal in Entra ID automatically. Takes about 2 minutes. Then retry your failing runbook job.
If the runbook still fails after fixing modules and the Run As certificate, move into the full step-by-step process below.
The Azure portal's job status page shows you a summary, but the real diagnostic information lives in the job streams. Here's how to get to it.
Navigate to: Automation Account → Jobs (left sidebar under Process Automation). Click the failed job. You'll land on a summary page. Now click the Output tab, which shows Write-Output calls and final return values. Then click Errors, where exception stack traces appear. Then click All Logs, which shows the complete chronological stream including verbose and warning messages.
Look specifically for lines tagged [ERROR] or Exception. Common patterns and what they mean:
- AuthorizationFailed, the identity running the runbook lacks RBAC permissions on the target resource
- The term 'Connect-AzAccount' is not recognized, Az.Accounts module is missing or failed to import
- AADSTS700016, the service principal (Run As account) doesn't exist in the tenant, likely deleted from Entra ID
- Job was evicted, the sandbox job hit the 3-hour execution limit or 400 MB memory cap
- The pipeline has been stopped, usually a -ErrorAction Stop caught a non-fatal error and terminated execution
Write down the exact error text. You'll need it for the next steps. If the Errors tab is empty but the job still shows Failed, switch to All Logs and filter by Error stream type. Sometimes errors get logged to the combined stream instead of the dedicated Errors tab.
If you see no logs at all, that's a different problem: the sandbox worker may have crashed before it could write output. In that case, check the Automation Account's Diagnostic Settings. If Log Analytics integration isn't configured, you're flying blind and need to set that up first (covered in Step 5).
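The error patterns listed above can be turned into a quick lookup. This is a minimal sketch, assuming you have copied the error text out of the job stream. Get-LikelyCause is a hypothetical helper name (not part of any Az module), and the mapping only restates the list above; it also generalizes the "is not recognized" pattern to any cmdlet name, which is my assumption.

```powershell
# Hypothetical helper: maps error text from a job stream to the likely cause
# described in the list above. Not part of any Az module.
function Get-LikelyCause {
    param(
        [Parameter(Mandatory)]
        [string]$ErrorText
    )

    # Ordered so patterns are checked top to bottom; the first match wins.
    $patterns = [ordered]@{
        'AuthorizationFailed'           = 'Identity lacks RBAC permissions on the target resource'
        'is not recognized'             = 'A required module (often Az.Accounts) is missing or failed to import'
        'AADSTS700016'                  = 'Service principal (Run As account) not found in the tenant'
        'Job was evicted'               = 'Sandbox job hit the 3-hour limit or the 400 MB memory cap'
        'The pipeline has been stopped' = '-ErrorAction Stop terminated execution on an error'
    }

    foreach ($pattern in $patterns.Keys) {
        # -like is case-insensitive by default
        if ($ErrorText -like "*$pattern*") {
            return $patterns[$pattern]
        }
    }
    return 'Unknown: check the All Logs stream for more context'
}

Get-LikelyCause -ErrorText "The term 'Get-AzVM' is not recognized as the name of a cmdlet"
```

Anything that falls through to "Unknown" is a signal to read the full All Logs stream rather than a diagnosis in itself.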
Permission issues on the identity that the runbook authenticates as are the single most common cause of Azure Automation runbook failures in 2025 and 2026. Let's nail this down systematically.
First, identify what authentication method your runbook uses. Open the runbook code and look for the authentication call. If you see:
Connect-AzAccount -Identity
then it's using a System-assigned or User-assigned Managed Identity. If you see:
$connection = Get-AutomationConnection -Name 'AzureRunAsConnection'
Connect-AzAccount -ServicePrincipal -Tenant $connection.TenantID ...
then it's using the legacy Run As account (service principal with certificate).
For Managed Identity: Go to Automation Account → Identity. Note the Object (principal) ID. Then go to the target resource (e.g., a subscription, resource group, or specific VM) → Access control (IAM) → Role assignments. Search for the Object ID. If it's not there, or it has a role that's too restrictive for what the runbook is trying to do, that's your problem. Add the appropriate RBAC role. Contributor at the resource group scope is typically enough for most automation scenarios, though you should follow least privilege and use something like Virtual Machine Contributor if the runbook only manages VMs.
For Run As account: Go to Automation Account → Run As Accounts. Check the certificate expiry date. Also verify the associated app registration still exists: copy the Application ID shown and search for it in Entra ID → App registrations → All applications. If it's gone, someone deleted it and you'll need to delete and recreate the Run As account entirely.
After fixing permissions, always wait about 5 minutes before retesting. Azure RBAC changes propagate with a short delay, and immediate retests can give false negatives.
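A role assignment only helps if its scope actually covers the resource the runbook touches. The sketch below checks that with plain string logic. Get-CoveringAssignment is a hypothetical helper, the subscription ID and resource names are placeholders, and the input objects are only assumed to look like the output of Get-AzRoleAssignment (Scope and RoleDefinitionName properties). It does not handle management-group scopes.

```powershell
# Hypothetical helper: returns the role assignments whose scope covers a target
# resource. An assignment covers the target when its scope equals the target's
# resource ID or is a parent of it (subscription > resource group > resource).
function Get-CoveringAssignment {
    param(
        [Parameter(Mandatory)]
        [AllowEmptyCollection()]
        [object[]]$Assignment,   # objects with Scope and RoleDefinitionName properties

        [Parameter(Mandatory)]
        [string]$TargetScope     # full resource ID of the thing the runbook touches
    )

    $target = $TargetScope.TrimEnd('/')
    foreach ($a in $Assignment) {
        $scope  = ([string]$a.Scope).TrimEnd('/')
        $isSame = $target -ieq $scope
        # Require a '/' after the prefix so 'rg-app' does not match 'rg-app2'
        $isParent = $target.StartsWith("$scope/", [System.StringComparison]::OrdinalIgnoreCase)
        if ($isSame -or $isParent) { $a }
    }
}

# Placeholder data shaped like Get-AzRoleAssignment -ObjectId <principal-id> output
$sub = '/subscriptions/00000000-0000-0000-0000-000000000000'
$assignments = @(
    [pscustomobject]@{ RoleDefinitionName = 'Contributor'; Scope = "$sub/resourceGroups/rg-app" }
)
$vm = "$sub/resourceGroups/rg-other/providers/Microsoft.Compute/virtualMachines/vm01"

# Contributor on rg-app does not cover a VM in rg-other, so this returns nothing
Get-CoveringAssignment -Assignment $assignments -TargetScope $vm
```

An empty result for a resource the runbook needs is the scope mismatch that produces AuthorizationFailed even though a role "is assigned".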
Module issues are the silent killers of Azure Automation. The portal shows modules as "Imported" but they may actually be broken, outdated, or conflicting with each other. Here's how to do a proper module audit.
Go to Automation Account → Modules. In the Modules list, look for any module showing status Failed or Creating (a module stuck in Creating usually means the import timed out). Note them all.
The critical thing to understand about Az module dependencies: you must import them in the right order. Az.Accounts must be version-compatible with every other Az.* module you use. If Az.Accounts is 2.15.0 but Az.Compute was imported when Az.Accounts was 2.8.0, they may conflict. The safest fix is to update all Az modules to their latest versions simultaneously.
Run this PowerShell locally (with Az module installed) to see the latest compatible versions:
Find-Module -Name Az.* | Select-Object Name, Version | Sort-Object Name
Then in the portal, for each Az module you use: delete the existing version (click the module → Delete), wait for deletion to complete, then re-import from the gallery at the latest version targeting Runtime version: PowerShell 7.2.
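Before deleting anything, it can help to list which imported modules are actually below the version you need. The following is a rough sketch: Get-OutdatedModule is a hypothetical helper, the sample inventory and the minimum versions are illustrative values rather than official compatibility data, and in practice the inventory could plausibly come from Get-AzAutomationModule.

```powershell
# Hypothetical helper: flags modules whose imported version is below a minimum.
function Get-OutdatedModule {
    param(
        [Parameter(Mandatory)]
        [object[]]$Module,          # objects with Name and Version properties

        [Parameter(Mandatory)]
        [hashtable]$MinimumVersion  # module name -> minimum acceptable version
    )

    foreach ($m in $Module) {
        if (-not $MinimumVersion.ContainsKey($m.Name)) { continue }

        # Cast to [version] so 2.8.0 sorts below 2.12.0; comparing the raw
        # strings would go character by character and misorder them.
        $current  = [version]$m.Version
        $required = [version]$MinimumVersion[$m.Name]
        if ($current -lt $required) {
            [pscustomobject]@{ Name = $m.Name; Current = $current; Required = $required }
        }
    }
}

# Illustrative inventory and thresholds (assumed values)
$imported = @(
    [pscustomobject]@{ Name = 'Az.Accounts'; Version = '2.8.0' }
    [pscustomobject]@{ Name = 'Az.Compute';  Version = '5.7.0' }
)
$minimums = @{ 'Az.Accounts' = '2.12.0'; 'Az.Compute' = '5.0.0' }

Get-OutdatedModule -Module $imported -MinimumVersion $minimums
```

The [version] cast is the important design choice here: version strings with two-digit components are exactly where a naive text comparison goes wrong.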
For custom or third-party modules (anything not in the gallery), you need to upload a ZIP file. The ZIP must contain the module folder at the root level, not nested inside another folder, otherwise the import silently fails but shows "Imported" status. Package it like this:
# Correct structure inside the ZIP:
MyModule/
    MyModule.psd1
    MyModule.psm1
# NOT: MyModule/MyModule/MyModule.psd1, this breaks silently
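Because the failure mode described here is silent, it is worth checking the archive layout before uploading. This is a small sketch under the layout rule stated above. Test-ModuleZipLayout is a hypothetical helper, and the entry names are passed in as strings so the logic can be tried without a real archive.

```powershell
# Hypothetical helper: checks that a module ZIP follows the layout shown above,
# i.e. <ModuleName>/<ModuleName>.psd1 or .psm1 exactly one folder deep.
function Test-ModuleZipLayout {
    param(
        [Parameter(Mandatory)]
        [string[]]$EntryName,   # entry paths as stored in the ZIP

        [Parameter(Mandatory)]
        [string]$ModuleName
    )

    # Normalize separators: some tools write backslashes into ZIP entries
    $normalized = $EntryName | ForEach-Object { $_ -replace '\\', '/' }

    $expected = @("$ModuleName/$ModuleName.psd1", "$ModuleName/$ModuleName.psm1")
    foreach ($entry in $normalized) {
        if ($expected -contains $entry) { return $true }
    }
    return $false
}

# To list the entries of a real archive, something like this should work
# (the path is a placeholder):
#   Add-Type -AssemblyName System.IO.Compression.FileSystem
#   $zip = [System.IO.Compression.ZipFile]::OpenRead('C:\temp\MyModule.zip')
#   $names = $zip.Entries.FullName; $zip.Dispose()

# The doubly nested layout that the text warns about fails the check
Test-ModuleZipLayout -ModuleName 'MyModule' -EntryName @(
    'MyModule/MyModule/MyModule.psd1',
    'MyModule/MyModule/MyModule.psm1'
)
```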
After all modules show green "Completed" status, re-run the runbook in the Test pane. Module fixes take effect immediately; no restart is needed.
If your runbooks are configured to run on a Hybrid Runbook Worker (you'll see the worker group name in the job details instead of "Azure"), you have a completely different set of failure modes to check. The Azure portal has almost no visibility into what's happening on the worker VM, so you have to go look at the machine directly.
First, check connectivity. On the Hybrid Worker VM, open PowerShell as Administrator and run:
Test-NetConnection -ComputerName "<your-workspace-id>.agentsvc.azure-automation.net" -Port 443
Replace <your-workspace-id> with your Log Analytics workspace ID (found in Log Analytics workspace → Agents). If this returns TcpTestSucceeded: False, the worker can't reach the Azure Automation service endpoint. Check firewall rules, NSG rules, and proxy settings.
Next, check the Hybrid Worker service status:
Get-Service -Name "healthservice" # Log Analytics agent
Get-Service -Name "HybridWorkerService" # Extension-based Hybrid Worker service (newer workers)
If either service is stopped, start it:
Start-Service -Name "healthservice"
Then check the agent log at C:\ProgramData\Microsoft\System Center\Orchestrator\7.2\SMA\Sandboxes. For newer extension-based Hybrid Workers, check the Windows Event Log under Applications and Services Logs → Microsoft → Automation → Operational. Event ID 15000 means the worker registered successfully. Event ID 15002 or 15006 means authentication failure with the Automation account.
Also verify the PowerShell execution policy on the worker:
Get-ExecutionPolicy -List
It should be at minimum RemoteSigned at the LocalMachine scope. If it's Restricted, runbooks can't execute:
Set-ExecutionPolicy -ExecutionPolicy RemoteSigned -Scope LocalMachine -Force
Once these checks pass, go back to the portal and re-queue the job targeting the same Hybrid Worker group. If it now shows Running instead of Queued indefinitely, the connectivity was the issue.
If you've been troubleshooting Azure Automation without Log Analytics integration, you've been doing it the hard way. Enabling diagnostic settings gives you queryable logs, job history beyond the 30-day portal retention, and the ability to set up alerts so you know about failures before someone opens a ticket.
Go to Automation Account → Diagnostic settings → Add diagnostic setting. Name it something like AutomationToLogAnalytics. Check these log categories:
- JobLogs, job start, end, and final status
- JobStreams, all output, error, verbose, and warning streams
- AuditEvent, who changed what in the Automation account
Send to your Log Analytics workspace. Click Save.
Within 15 minutes, logs start flowing. Now you can query them. Go to Log Analytics workspace → Logs and run:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.AUTOMATION"
| where Category == "JobStreams"
| where StreamType_s == "Error"
| project TimeGenerated, RunbookName_s, ResultDescription, ResultType
| order by TimeGenerated desc
| take 50
This gives you the last 50 error-stream entries across all runbooks, with timestamps and runbook names. It's dramatically faster than clicking through individual jobs in the portal.
Set up an alert so failures don't go unnoticed. In Log Analytics, click New alert rule. Use this query as the signal:
AzureDiagnostics
| where ResourceProvider == "MICROSOFT.AUTOMATION"
| where Category == "JobLogs"
| where ResultType == "Failed"
Set the threshold to Greater than 0, evaluation period to 5 minutes, and send it to an Action Group that emails your team. Now Azure Automation job failures generate an alert within 5 minutes instead of being discovered the next morning when someone notices the nightly job didn't run.
Advanced Azure Automation Troubleshooting
If the steps above didn't resolve your issue, you're likely dealing with something more complex: a domain-joined machine policy blocking execution, a network-level restriction, or a multi-tenancy configuration problem. Here's how to dig deeper.
Group Policy Blocking Script Execution
On Hybrid Worker machines that are domain-joined, Group Policy can override local PowerShell execution policy. Even if you set RemoteSigned locally, a GPO applying Restricted policy at the domain level will win at the next policy refresh. Check the effective policy:
gpresult /H C:\temp\gpresult.html
# Open the HTML file and search for "PowerShell"
If you find a GPO enforcing a restrictive execution policy, work with your AD team to either exempt the Hybrid Worker machines or configure the GPO to allow at least RemoteSigned. Alternatively, you can run runbooks with an explicit bypass by calling them through a scheduled task wrapper, but that's a workaround, not a fix.
Registry-Level PowerShell Policy
Sometimes execution policy is set via registry rather than GPO. Check these keys on the Hybrid Worker:
Get-ItemProperty -Path "HKLM:\SOFTWARE\Policies\Microsoft\Windows\PowerShell" -ErrorAction SilentlyContinue
Get-ItemProperty -Path "HKCU:\SOFTWARE\Policies\Microsoft\Windows\PowerShell" -ErrorAction SilentlyContinue
If ExecutionPolicy is set here, a policy-level setting is in force, usually from Group Policy. A local Set-ExecutionPolicy call won't override it; the fix requires a GPO change.
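The reason a GPO wins over the local setting is scope precedence: PowerShell evaluates MachinePolicy, then UserPolicy, Process, CurrentUser, and LocalMachine, and the first scope that is not Undefined takes effect. The sketch below walks that order. Get-EffectivePolicySource is a hypothetical helper, and the sample list is made-up data shaped like Get-ExecutionPolicy -List output.

```powershell
# Hypothetical helper: given entries shaped like Get-ExecutionPolicy -List output,
# returns the first scope (in precedence order) that actually sets a policy.
function Get-EffectivePolicySource {
    param(
        [Parameter(Mandatory)]
        [object[]]$PolicyList   # objects with Scope and ExecutionPolicy properties
    )

    # Precedence order; the first non-Undefined scope wins
    $precedence = @('MachinePolicy', 'UserPolicy', 'Process', 'CurrentUser', 'LocalMachine')

    foreach ($scope in $precedence) {
        # Stringify so this works for both enum values and plain strings
        $entry = $PolicyList | Where-Object { "$($_.Scope)" -eq $scope } | Select-Object -First 1
        if ($entry -and "$($entry.ExecutionPolicy)" -ne 'Undefined') {
            return [pscustomobject]@{
                Scope            = $scope
                Policy           = "$($entry.ExecutionPolicy)"
                SetByGroupPolicy = ($scope -in @('MachinePolicy', 'UserPolicy'))
            }
        }
    }
}

# Made-up sample: LocalMachine says RemoteSigned, but a GPO enforces Restricted
$sample = @(
    [pscustomobject]@{ Scope = 'MachinePolicy'; ExecutionPolicy = 'Restricted' }
    [pscustomobject]@{ Scope = 'UserPolicy';    ExecutionPolicy = 'Undefined' }
    [pscustomobject]@{ Scope = 'Process';       ExecutionPolicy = 'Undefined' }
    [pscustomobject]@{ Scope = 'CurrentUser';   ExecutionPolicy = 'Undefined' }
    [pscustomobject]@{ Scope = 'LocalMachine';  ExecutionPolicy = 'RemoteSigned' }
)
Get-EffectivePolicySource -PolicyList $sample
```

On a worker, the same function could be fed with (Get-ExecutionPolicy -List). If SetByGroupPolicy comes back true, local changes will not stick.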
Analyzing Failures with Event Viewer
For Hybrid Worker failures that don't produce any job stream output at all, go to the worker VM's Event Viewer. Navigate to: Applications and Services Logs → Microsoft → Automation → Operational. Key event IDs to look for:
- Event 4000, Runbook job started on the worker
- Event 4001, Runbook job completed
- Event 4010, Worker couldn't contact Azure Automation service (network issue)
- Event 4020, Job sandbox process crashed (usually a PowerShell crash, check Windows Error Reporting)
TLS 1.2 Enforcement
Azure Automation service endpoints require TLS 1.2. Older Windows Server versions (2012 R2, some 2016 configurations) don't enable TLS 1.2 for .NET by default. If your Hybrid Worker is on an older OS and jobs are failing with connection timeout errors, force TLS 1.2 for the current PowerShell session, or enable strong cryptography for .NET permanently with these registry keys:
# Current session only:
[Net.ServicePointManager]::SecurityProtocol = [Net.SecurityProtocolType]::Tls12
# Or set it permanently via registry:
reg add "HKLM\SOFTWARE\Microsoft\.NETFramework\v4.0.30319" /v SchUseStrongCrypto /t REG_DWORD /d 1 /f
reg add "HKLM\SOFTWARE\WOW6432Node\Microsoft\.NETFramework\v4.0.30319" /v SchUseStrongCrypto /t REG_DWORD /d 1 /f
Restart the healthservice after making this change.
Multi-Tenant and CSP Scenarios
If you manage multiple tenants (CSP or multi-tenant app scenarios), Connect-AzAccount -Identity won't work across tenant boundaries. You need an explicit service principal in the target tenant. Verify you're connecting to the right tenant by adding -TenantId explicitly in your Connect-AzAccount call and checking that the app registration exists in that specific tenant's Entra ID, not just the home tenant of the Automation Account.
Escalate to Microsoft Support when you've confirmed all permissions, modules, and connectivity are correct but jobs still fail silently with no error output; when the Azure Automation service itself shows degraded status on the Azure Status Dashboard (status.azure.com) for your region; or when you're seeing InternalServerError responses from the Automation service API rather than application-level errors. These are platform-side issues that only Microsoft can resolve. Before opening the ticket, collect the job IDs of the failing jobs, the Automation Account resource ID, and the exact timestamps. This dramatically speeds up the triage process on their end.
Prevention & Best Practices for Azure Automation
The best Azure Automation troubleshooting is the kind you never have to do. Most of the failures I see in production environments were predictable and preventable. Here's what separates teams that operate automation reliably from teams that constantly fight fires.
Migrate off Run As accounts completely. Microsoft deprecated them in September 2023. If you're still using Get-AutomationConnection -Name 'AzureRunAsConnection' anywhere in your runbooks, you're running on borrowed time. The migration to System-assigned Managed Identity takes about 30 minutes per runbook, and the new pattern is simpler: just Connect-AzAccount -Identity with proper RBAC assignments. Do this migration proactively, not reactively when the certificate expires at 3 AM.
Pin module versions in automation-critical environments. The "always use latest" approach works fine for development. In production, an unexpected Az module update can change cmdlet behavior or deprecate parameters your runbooks depend on. Create a private module repository (a simple Azure Storage blob works) and control when you pull updates. Test new module versions in a dev Automation Account first.
Set runbook job alerts for every production automation. The Log Analytics alert query from Step 5 takes 10 minutes to set up and will save you hours of reactive troubleshooting. Treat an automation failure the same way you'd treat a monitoring alert, with a defined response procedure, not an ad-hoc investigation.
Use source control integration. Automation Account supports GitHub and Azure DevOps integration natively (under Source Control in the left sidebar). When runbooks are in version control, you can see who changed what and when, roll back bad changes in seconds, and run code review on automation logic the same way you would application code. I've seen plenty of "random" automation failures that were actually caused by someone editing a runbook directly in the portal without telling anyone.
Document your Hybrid Worker dependencies. For each Hybrid Worker group, maintain a simple runbook or wiki page listing: the worker VMs, the PowerShell version installed, all non-gallery modules and their versions, and any local dependencies (local files, databases, network shares) the runbooks access. When a worker fails, this document tells you exactly what needs to be re-installed or reconfigured on the replacement.
- Enable diagnostic logging to Log Analytics on every Automation Account; it costs pennies per month and saves hours per incident
- Set a calendar reminder to check Run As certificate expiry 60 days before expiration (or migrate to Managed Identity and eliminate this concern entirely)
- Use -ErrorAction Stop consistently in runbooks so errors terminate cleanly with clear messages rather than silently continuing into a broken state
- Tag Automation Account resources with the owning team so Azure Cost Management and incident routing work correctly
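To make the -ErrorAction Stop point concrete, here is a minimal sketch. Many cmdlet errors are non-terminating by default, so a try/catch around them never fires unless the error is promoted to a terminating one. Test-PathStrict is a hypothetical function name, and the file name is a placeholder assumed not to exist.

```powershell
# Without -ErrorAction Stop, Get-Item writes a non-terminating error and the
# catch block never runs. With it, the error becomes terminating and catchable.
function Test-PathStrict {
    param(
        [Parameter(Mandatory)]
        [string]$Path
    )

    try {
        $null = Get-Item -Path $Path -ErrorAction Stop
        return 'found'
    }
    catch {
        # $_ is the ErrorRecord for the failure that was just caught
        Write-Warning "Lookup failed: $($_.Exception.Message)"
        return 'missing'
    }
}

# Placeholder file name that is assumed not to exist in the temp folder
Test-PathStrict -Path (Join-Path ([System.IO.Path]::GetTempPath()) 'no-such-file-7f3a.txt')
```

In a real runbook you would typically rethrow from the catch block after logging, so the job ends in Failed status instead of reporting success.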
Frequently Asked Questions
My Azure Automation runbook job stays in "Queued" status and never starts, what's wrong?
A job stuck in Queued almost always means the execution target isn't available. If you're running on Azure (no Hybrid Worker), this rarely happens but can occur during regional service disruptions; check status.azure.com for your region. If you're targeting a Hybrid Runbook Worker group, it means no worker in that group is online and able to pick up the job. Go to Automation Account → Hybrid Worker Groups → [Your Group] and check if any workers show as Online. If they all show Offline, the Log Analytics agent on those VMs has stopped or lost connectivity. RDP into the worker VM, check the healthservice status (Get-Service healthservice), and restart it if stopped. The job will pick up within 60 seconds of the worker coming back online.
How do I fix "AuthorizationFailed" errors in my runbook even though I assigned Contributor role?
This trips people up all the time. The RBAC assignment might be correct, but you need to verify it's assigned to the right identity. If your runbook uses Connect-AzAccount -Identity, go to Automation Account → Identity and get the Object ID shown there. That's what needs the RBAC assignment, not the Automation Account resource itself. Also check the scope: if you assigned Contributor on a specific resource group but the runbook is trying to operate on a resource in a different group or at subscription level, it will still fail with AuthorizationFailed. Additionally, RBAC changes take 3–5 minutes to propagate. If you just made the assignment, wait and try again before concluding it didn't work.
Can I run a PowerShell 7 runbook and a PowerShell 5.1 runbook in the same Automation Account?
Yes. Azure Automation supports both PowerShell 5.1 and PowerShell 7.2 (and Python 3.8) simultaneously within the same account. Each runbook has its own runtime version setting, configured when you create the runbook or changeable in the runbook properties. The critical thing to understand is that modules are imported per runtime version: a module imported for PowerShell 7.2 is not available to PowerShell 5.1 runbooks, and vice versa. Go to Modules and use the Runtime version filter to make sure your modules are imported under the correct runtime. This is one of the most common sources of "module not found" errors when mixing runtimes in the same account.
My runbook works fine in the Test pane but fails when triggered by a schedule, why?
The Test pane and scheduled jobs run in the same execution environment, so if one works and the other doesn't, the difference is almost always in how parameters are passed. The Test pane lets you type parameter values interactively. A schedule passes the parameter values configured when you linked the schedule to the runbook. Go to Automation Account → Schedules, find the schedule, and click View linked runbooks to see the parameter values that are actually being passed at runtime. Null or empty parameters where the runbook expects a specific value are the most common cause. Also check that the schedule is configured for the correct timezone. Azure Automation schedules use UTC by default, and a schedule that appears to fire "at midnight" may actually be firing at midnight UTC, which could be mid-day in your local timezone.
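One way to make the null-or-empty parameter case fail loudly instead of halfway through the job is to validate parameters at the top of the runbook. This is a sketch only: Invoke-NightlyBackup and its parameters are hypothetical, and the runbook body is wrapped in a function so the validation can be tried locally.

```powershell
# Hypothetical runbook entry point, wrapped in a function for local testing.
function Invoke-NightlyBackup {
    param(
        [Parameter(Mandatory)]
        [ValidateNotNullOrEmpty()]
        [string]$ResourceGroupName,

        [ValidateRange(1, 365)]
        [int]$RetentionDays = 30
    )

    # Real work would go here; the string stands in for it
    "Backing up '$ResourceGroupName' with ${RetentionDays}-day retention"
}

# Simulates a schedule that was linked with an empty parameter value:
# binding fails before any of the function body runs.
try {
    Invoke-NightlyBackup -ResourceGroupName ''
}
catch {
    "Rejected at parameter binding: $($_.Exception.Message)"
}
```

The point of the design is that the failure shows up as the first line of the error stream, with the parameter name in the message, rather than as a confusing downstream error.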
How long does Azure Automation keep job history and logs?
By default, the Azure portal retains job history for 30 days. After that, the job records disappear from the portal UI and you can no longer retrieve job output. This is a hard platform limit that you can't extend without connecting to Log Analytics, which is exactly why enabling diagnostic settings (covered in Step 5) is so important for production environments. With Log Analytics, you control the retention period independently; the default workspace retention is 30 days, but you can extend it to 730 days (2 years) at a modest additional cost. Set this up before you need it, because you can't retroactively capture logs that were never sent to a workspace.
Is there a way to test Azure Automation runbooks locally before deploying them?
Yes, and you absolutely should do this before pushing runbooks to production. Install the Az PowerShell module locally (Install-Module Az -Scope CurrentUser), then run your runbook script directly in VS Code or PowerShell ISE, substituting the Get-AutomationVariable and Get-AutomationConnection calls with local equivalents (you can mock these with simple variable assignments). For Managed Identity authentication specifically, you'll authenticate with Connect-AzAccount using your own credentials during local testing. The VS Code Azure Automation extension also lets you edit and publish runbooks directly from VS Code, with syntax highlighting and IntelliSense. For Hybrid Worker scenarios, you can register a local developer machine as a Hybrid Worker in a non-production worker group and test against it directly, which catches module dependency issues before they reach production workers.
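The mocking approach mentioned above can be as small as a conditional function definition. The sketch below is one possible shape, not an official pattern: the variable names and values are made-up placeholders, and the stand-in is only defined when the real Get-AutomationVariable is absent, so the same script can run in both places.

```powershell
# Local stand-in values for Automation variables (made-up placeholders)
$LocalAutomationVariables = @{
    'TargetResourceGroup' = 'rg-dev-sandbox'
    'RetentionDays'       = 14
}

# Define the mock only when the real cmdlet is absent, i.e. outside Azure Automation
if (-not (Get-Command -Name Get-AutomationVariable -ErrorAction SilentlyContinue)) {
    function Get-AutomationVariable {
        param(
            [Parameter(Mandatory)]
            [string]$Name
        )

        if (-not $LocalAutomationVariables.ContainsKey($Name)) {
            throw "No local value defined for Automation variable '$Name'"
        }
        $LocalAutomationVariables[$Name]
    }
}

# Runbook code from here on is the same locally and in Azure Automation
$rg = Get-AutomationVariable -Name 'TargetResourceGroup'
"Operating on resource group: $rg"
```

Throwing on an undefined name is deliberate: it surfaces a variable you forgot to create before the runbook reaches the Automation Account.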