Fix Azure Automation: Runbooks, Hybrid Workers & Errors

Microsoft Fix | Intermediate | 14 min read | Official Docs Grounded | Updated April 20, 2026

Why Azure Automation Breaks, and Why the Error Messages Are Useless

Here's the scene I see constantly: you've spent two hours wiring up an Azure Automation runbook to stop non-production VMs every night and cut cloud costs. You schedule it, test it in the portal, it runs green. You go home. The next morning you find out it silently failed at 11 PM, your VMs ran all night, and your cost alert fired at 2 AM. The job status says "Failed", and the error message is something like "The subscription is not registered to use namespace 'Microsoft.Automation'" or a blank output with no trace at all.

I know that's infuriating. Especially when Azure Automation is supposed to be the tool that removes manual toil from your operations.

Azure Automation problems fall into a handful of repeating categories, and once you know which bucket you're in, fixing them becomes a lot more predictable.

The most common root causes I see:

  • Managed Identity or Run As account misconfiguration: The runbook authenticates with the wrong identity or a certificate that expired quietly.
  • Hybrid Runbook Worker connectivity issues: Your on-premises machine can't reach the Azure Automation service endpoints, usually because of proxy or firewall rules blocking *.azure-automation.net.
  • Module version conflicts: A runbook imports Az.Compute but the Automation account has an old version of the module, causing CommandNotFoundException failures mid-execution.
  • State Configuration drift not being caught: Your PowerShell DSC pull server configuration looks fine in the portal but nodes are actually non-compliant because the agent isn't reporting back correctly.
  • RBAC misconfiguration: The Automation account's Managed Identity has Contributor on the wrong scope, or a new team member's runbook fails because their identity lacks the right role on the subscription.
  • Runbook sandbox limits hit silently: Azure Automation sandboxes enforce a 3-hour job time limit and a 400 MB memory cap. Jobs that approach these ceilings often fail without a clear message.

Microsoft's portal error messages for Azure Automation tend to be either extremely generic or cryptically specific without context. They'll tell you a job failed but not why the authentication token expired. They'll show a DSC node as "Unresponsive" without explaining the agent registration problem underneath it. That's why reading job output streams and digging into the Activity log manually is non-negotiable.

The fixes below are ordered by how frequently I see each issue. Start from the top and work down; most people hit the first two steps and are done.


The Quick Fix: Try This First

Before you spend an hour digging through logs, try this. Roughly 60% of Azure Automation failures I've seen come back to one thing: the Managed Identity either isn't enabled or doesn't have the right role assignment on the target resource scope.

Here's exactly what to check right now:

  1. Go to the Azure portal and open your Automation Account.
  2. In the left sidebar, scroll down to Account Settings and click Identity.
  3. On the System assigned tab, confirm the Status toggle shows On. If it's Off, flip it on and click Save. This generates a new service principal automatically.
  4. Click Azure role assignments (the button appears once the identity is enabled).
  5. Check that there's a role assignment (at minimum Contributor or a custom role) on either the subscription or the specific resource group your runbook targets. If nothing is listed, click Add role assignment, choose the correct scope, and assign Contributor.
  6. Go back to your runbook, click Start, and watch the Output and Error streams live.
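The same check can be scripted from a workstation. This is a minimal sketch, assuming you are already signed in with Connect-AzAccount and have a recent Az.Automation version that exposes the account's Identity property; the account and resource group names are placeholders.

```powershell
# Placeholder names: replace with your own.
$resourceGroupName     = "MyResourceGroup"
$automationAccountName = "MyAutomationAccount"

# Read the system-assigned identity's principal ID from the Automation account.
$account     = Get-AzAutomationAccount -ResourceGroupName $resourceGroupName -Name $automationAccountName
$principalId = $account.Identity.PrincipalId

if (-not $principalId) {
    Write-Warning "No system-assigned identity found. Enable it under Account Settings > Identity first."
}
else {
    # List every role the identity currently holds and the scope each one applies to.
    Get-AzRoleAssignment -ObjectId $principalId |
        Select-Object RoleDefinitionName, Scope

    # Uncomment to grant Contributor on a single resource group.
    # New-AzRoleAssignment -ObjectId $principalId -RoleDefinitionName "Contributor" -ResourceGroupName $resourceGroupName
}
```

An empty list from Get-AzRoleAssignment corresponds to the "nothing is listed" case in step 5.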

If your runbook was using the old Run As account (a service principal with certificate auth) and that cert expired, the fix is the same: migrate to a System-assigned Managed Identity. Microsoft deprecated the Run As account feature, and the Managed Identity approach is both simpler and more secure.

To update an existing runbook to use Managed Identity, replace the old Add-AzureRmAccount -ServicePrincipal block with:

Connect-AzAccount -Identity

That single line is all you need. The Managed Identity token is injected automatically by the sandbox runtime.

Pro Tip
When a runbook fails with no error output at all (completely blank), the cause is almost always that the Connect-AzAccount -Identity call itself failed silently because the identity has no role assignment. Add -ErrorAction Stop to that line during debugging so it throws a catchable exception and you actually see the auth failure in the Error stream.
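One way to make that failure loud is to wrap the sign-in in a small preamble at the top of the runbook. The sketch below is an assumption about how you might structure it, not a required pattern; the subscription check is an extra guard added here for illustration.

```powershell
# Keep this job's Azure context separate from any saved context.
Disable-AzContextAutosave -Scope Process | Out-Null

try {
    # -ErrorAction Stop turns a silent sign-in failure into a catchable exception.
    $context = (Connect-AzAccount -Identity -ErrorAction Stop).Context

    # A sign-in that sees no subscription usually means the identity has no role assignment.
    if (-not $context.Subscription) {
        throw "Signed in, but no subscription is visible. Check the identity's role assignments."
    }

    Write-Output "Signed in as $($context.Account.Id) on subscription $($context.Subscription.Name)"
}
catch {
    # Write-Error lands in the job's Error stream; rethrowing marks the job as Failed.
    Write-Error "Managed Identity sign-in failed: $($_.Exception.Message)"
    throw
}
```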

Step 1: Diagnose the Failure Using Job Streams and the Activity Log

When a runbook job fails, the Azure portal's top-level "Failed" status tells you almost nothing. You need to dig into the job streams. Here's how.

In your Automation Account, go to Process Automation > Jobs. Find the failed job and click it. You'll see four stream tabs: Output, Error, Warning, and Verbose. Start with Error: this is where authentication failures, module errors, and PowerShell exceptions land.

Common errors you'll see and what they mean:

  • AADSTS700016: Application with identifier '...' was not found in the directory: Your Run As account service principal got deleted. Migrate to Managed Identity.
  • The term 'Get-AzVM' is not recognized as the name of a cmdlet: The Az.Compute module isn't imported in your Automation account, or the version is too old to include that cmdlet.
  • AuthorizationFailed: The client '...' does not have authorization to perform action 'Microsoft.Compute/virtualMachines/read': Classic RBAC problem. Your Managed Identity exists but lacks the right role on the target scope.
  • Job was evicted and subsequently reached a Failed state: You hit the 3-hour sandbox time limit or the 400 MB memory cap.
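If you triage failures often, the mapping above can be scripted. The sketch below is illustrative: Get-FailureCategory and Get-FailedJobSummary are hypothetical helper names, and the patterns are simply the error fragments listed above. The second function assumes the Az.Automation module and an authenticated session.

```powershell
function Get-FailureCategory {
    param([Parameter(Mandatory)][AllowEmptyString()][string]$ErrorText)

    # Match the error text against the known fragments (case-insensitive).
    switch -Regex ($ErrorText) {
        'AADSTS700016'                              { return 'Identity: Run As service principal missing, migrate to Managed Identity' }
        'is not recognized as the name of a cmdlet' { return 'Modules: import or update the module that provides the cmdlet' }
        'AuthorizationFailed'                       { return 'RBAC: Managed Identity lacks a role on the target scope' }
        'Job was evicted'                           { return 'Sandbox limits: job exceeded the time or memory ceiling' }
        default                                     { return 'Unclassified: read the full Error stream' }
    }
}

function Get-FailedJobSummary {
    param(
        [Parameter(Mandatory)][string]$ResourceGroupName,
        [Parameter(Mandatory)][string]$AutomationAccountName,
        [int]$LookbackHours = 24
    )

    $common = @{ ResourceGroupName = $ResourceGroupName; AutomationAccountName = $AutomationAccountName }

    # Pull failed jobs from the lookback window, then read each job's Error stream.
    Get-AzAutomationJob @common -Status Failed -StartTime ((Get-Date).AddHours(-$LookbackHours)) |
        ForEach-Object {
            $job = $_
            Get-AzAutomationJobOutput @common -Id $job.JobId -Stream Error |
                ForEach-Object {
                    [pscustomobject]@{
                        Runbook  = $job.RunbookName
                        JobId    = $job.JobId
                        Category = Get-FailureCategory -ErrorText ([string]$_.Summary)
                        Summary  = $_.Summary
                    }
                }
        }
}
```

Calling Get-FailedJobSummary -ResourceGroupName "MyResourceGroup" -AutomationAccountName "MyAutomationAccount" returns one row per error record, which is usually enough to decide which step below to jump to.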

Next, check the Activity log in your Automation Account. Go to Monitoring > Activity log and filter by "Failed" operations in the last 24 hours. This catches account-level issues like failed module imports or configuration changes that don't surface in job streams.

If everything looks clean there, enable verbose logging on your runbook. On the runbook's page, open Settings > Logging and tracing and set both Log verbose records and Log progress records to On. Re-run the job and check the Verbose stream. It shows each command as it executes, which makes it obvious where execution stops.
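The same logging switches can be set from PowerShell, which is handy when several runbooks need them. A minimal sketch, assuming the Az.Automation module and placeholder names (the runbook name is borrowed from the scheduling example later in this guide):

```powershell
# Placeholder names: replace with your own.
$common = @{
    ResourceGroupName     = "MyResourceGroup"
    AutomationAccountName = "MyAutomationAccount"
}

# Turn on verbose and progress records for one runbook while debugging.
Set-AzAutomationRunbook @common -Name "Stop-AzureVMs" -LogVerbose $true -LogProgress $true

# Turn them off again afterwards; verbose streams add noise to every job.
# Set-AzAutomationRunbook @common -Name "Stop-AzureVMs" -LogVerbose $false -LogProgress $false
```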

Once you've identified the error type, move to the relevant fix step below.

Step 2: Fix Module Conflicts and Missing Cmdlets in Your Automation Account

Module version problems are the second most common Azure Automation issue I troubleshoot. The symptom is almost always a CommandNotFoundException or a cryptic type-mismatch error mid-runbook, even though the same script works perfectly on your local machine.

Here's why: your local machine has the latest Az module. Your Automation account might have a 2-year-old version. The cmdlet you're calling was added in a newer version. They're completely out of sync.

To fix this, go to your Automation Account > Shared Resources > Modules. You'll see a list of every module currently imported. Check the version of Az.Accounts. This is the base module and must be updated before any other Az module. Click on it, then click Update.

Critical ordering rule: Always update Az.Accounts first and wait for it to finish before updating any other Az.* module. Violating this causes partial import failures that are annoying to clean up.

To import a new module that isn't listed, click Browse gallery and search by name. For modules not in the gallery (private or third-party), upload the .zip directly.

You can also import modules via PowerShell if you're automating account setup:

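# Placeholder names: replace with your own account and resource group.
$automationAccountName = "MyAutomationAccount"
$resourceGroupName = "MyResourceGroup"

# Imports the latest Az.Compute from the PowerShell Gallery.
# Append /<version> to the URI to pin a specific release instead.
New-AzAutomationModule `
  -AutomationAccountName $automationAccountName `
  -ResourceGroupName $resourceGroupName `
  -Name "Az.Compute" `
  -ContentLinkUri "https://www.powershellgallery.com/api/v2/package/Az.Compute"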

After the import completes (the status will show "Available"; refresh the page, because it doesn't update live), go back to your runbook and re-run the job. If the cmdlet was the only issue, it'll pass that step now.

One thing that catches people: Python runbooks use packages, not modules. If you're running Python 3 runbooks, go to Shared Resources > Python packages to manage dependencies there. It's a separate section entirely.
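The ordering rule and the wait can be combined in a script. The sketch below is one possible approach: Import-GalleryModuleAndWait is a hypothetical helper, and it assumes the cmdlet reports a ProvisioningState of Succeeded or Failed when the import finishes, which is worth confirming against your Az.Automation version.

```powershell
function Import-GalleryModuleAndWait {
    param(
        [Parameter(Mandatory)][string]$ResourceGroupName,
        [Parameter(Mandatory)][string]$AutomationAccountName,
        [Parameter(Mandatory)][string]$ModuleName,
        [int]$TimeoutMinutes = 15
    )

    $common = @{ ResourceGroupName = $ResourceGroupName; AutomationAccountName = $AutomationAccountName }

    # Start the import from the PowerShell Gallery (latest version).
    New-AzAutomationModule @common -Name $ModuleName `
        -ContentLinkUri "https://www.powershellgallery.com/api/v2/package/$ModuleName" | Out-Null

    # Poll until the import finishes, fails, or the timeout passes.
    $deadline = (Get-Date).AddMinutes($TimeoutMinutes)
    do {
        Start-Sleep -Seconds 15
        $module = Get-AzAutomationModule @common -Name $ModuleName
        if ($module.ProvisioningState -eq 'Failed') {
            throw "Import of $ModuleName failed."
        }
    } until ($module.ProvisioningState -eq 'Succeeded' -or (Get-Date) -gt $deadline)

    if ($module.ProvisioningState -ne 'Succeeded') {
        throw "Import of $ModuleName did not finish within $TimeoutMinutes minutes."
    }
    $module
}

# Az.Accounts must finish before any module that depends on it.
foreach ($name in 'Az.Accounts', 'Az.Compute') {
    Import-GalleryModuleAndWait -ResourceGroupName "MyResourceGroup" `
        -AutomationAccountName "MyAutomationAccount" -ModuleName $name
}
```

Because each call blocks until the previous module is ready, a dependent module never starts importing against a half-installed Az.Accounts.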

Step 3: Repair a Broken Hybrid Runbook Worker Connection

If your runbooks need to touch on-premises resources (SQL Server, Active Directory, file shares, anything in your corporate network), you've almost certainly set up a Hybrid Runbook Worker. And at some point it goes silent. The portal shows your Hybrid Worker Group as present, but jobs routed to it hang in "Queued" forever or fail immediately with a connectivity error.

The extension-based Hybrid Runbook Worker (the current, recommended type) runs as an Azure VM extension or an Arc-enabled server extension. Here's how to diagnose it.

On the worker machine itself, open PowerShell as Administrator and check the service status:

Get-Service -Name "HybridWorkerService" | Select-Object Name, Status, StartType

If the service is stopped, start it:

Start-Service -Name "HybridWorkerService"

Check the event log for specific errors:

Get-WinEvent -LogName "Microsoft-Automation/Operational" -MaxEvents 50 |
  Where-Object { $_.LevelDisplayName -eq "Error" } |
  Select-Object TimeCreated, Message

The most common event IDs to look for: Event ID 4502 (worker registered successfully; if you're NOT seeing this, registration failed), Event ID 4504 (worker heartbeat sent), and Event ID 15011 (certificate validation failure against the Automation service endpoint).

For network connectivity, the Hybrid Worker needs outbound HTTPS (port 443) access to these endpoints. Confirm your firewall isn't blocking them:

Test-NetConnection -ComputerName "<your-automation-account-id>.agentsvc.azure-automation.net" -Port 443
Test-NetConnection -ComputerName "<your-workspace-id>.ods.opinsights.azure.com" -Port 443
Test-NetConnection -ComputerName "<your-workspace-id>.oms.opinsights.azure.com" -Port 443

Replace the placeholders with real host names first. Test-NetConnection resolves the name through DNS, so a wildcard such as *.ods.opinsights.azure.com can't be tested directly. All three should return TcpTestSucceeded: True. If any fail, work with your network team to open those egress rules. Proxies need to be configured in the Hybrid Worker configuration file, not just at the OS level.

If connectivity is fine but the worker still isn't picking up jobs, try re-registering it from the portal: go to Process Automation > Hybrid worker groups, find your group, and check the worker status. A "Disconnected" worker that's been offline for more than 30 days may need to be deleted and re-onboarded.

Step 4: Resolve Azure Automation State Configuration (DSC) Node Compliance Issues

Azure Automation State Configuration (the cloud-hosted DSC pull server) is genuinely powerful for keeping VM configurations consistent across dozens or hundreds of machines. But when nodes show up as "Non-Compliant" or "Unresponsive" in the portal, it's often unclear whether the problem is with the configuration itself or with the node's ability to report back.
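Before working on individual machines, it can help to see how many nodes are affected. The sketch below lists problem nodes across the account; it assumes the Az.Automation module, placeholder names, and the status strings Unresponsive, NotCompliant, and Failed as returned by the cmdlet.

```powershell
# Placeholder names: replace with your own.
$common = @{
    ResourceGroupName     = "MyResourceGroup"
    AutomationAccountName = "MyAutomationAccount"
}

# List nodes in a problem state, oldest check-in first.
Get-AzAutomationDscNode @common |
    Where-Object { $_.Status -in 'Unresponsive', 'NotCompliant', 'Failed' } |
    Sort-Object LastSeen |
    Select-Object Name, Status, LastSeen, NodeConfigurationName
```

A long run of Unresponsive nodes with similar LastSeen times points at a shared cause such as a network change, while scattered NotCompliant nodes usually point at drift on the machines themselves.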

Unresponsive nodes almost always mean the node stopped sending heartbeats to the pull server. The default reporting interval is every 30 minutes. If a node misses several cycles, the portal marks it Unresponsive. On the node, check:

Get-DscLocalConfigurationManager | Select-Object -Property *

Look at RefreshMode (should be Pull), ConfigurationMode, and especially ReportServerURL. Confirm it's pointing at your Automation account's DSC endpoint, not an old URL from a previous setup.

To force an immediate configuration check and report:

Update-DscConfiguration -Wait -Verbose

For Non-Compliant nodes, click the node name in the portal and scroll down to Configuration compliance. This shows you exactly which DSC resources have drifted. For example, a Service resource might show the Windows Firewall as stopped when your config requires it to be running.

To re-apply the configuration immediately without waiting for the pull interval:

Start-DscConfiguration -UseExisting -Force -Wait -Verbose

If you're seeing LCM_SummaryMessage errors in the DSC event log (Applications and Services Logs > Microsoft > Windows > Desired State Configuration), Event ID 4262 specifically indicates the node couldn't download the MOF from the pull server, usually a credential or certificate issue on the node's registration.

To re-register a node with the Automation DSC pull server by hand (for example, a machine outside Azure), you'll need the registration key and URL from the portal (Account Settings > Keys). For an Azure VM, Register-AzAutomationDscNode looks up those details itself, so the VM and node configuration names are enough:

Register-AzAutomationDscNode `
  -AutomationAccountName "MyAutomationAccount" `
  -ResourceGroupName "MyResourceGroup" `
  -AzureVMName "MyVM" `
  -NodeConfigurationName "MyConfig.WebServer"

Step 5: Fix Schedule and Webhook Failures Blocking Automated Triggers

Your runbook works fine when you click Start manually. But the schedule fires at midnight and nothing happens. Or the webhook from your monitoring system triggers but the job never appears. This is a distinct class of Azure Automation problem and it's separate from runbook execution errors.

For broken schedules: Go to your Automation Account > Shared Resources > Schedules. Find your schedule and check two things. First, is Enabled set to Yes? Schedules can get accidentally disabled. Second, check the Next run time. If it shows a past date or "Never", the schedule has expired (one-time schedules don't repeat) or was created with an incorrect timezone offset.

Delete and recreate the schedule if the timezone is wrong. Azure Automation stores schedules in UTC internally, but the UI lets you specify a timezone. A common mistake: creating a schedule for "9 PM Eastern" but forgetting to set the timezone, so it actually runs at 9 PM UTC, which is 4 PM Eastern Standard Time (UTC-5) or 5 PM Eastern Daylight Time (UTC-4).
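To check the arithmetic for your own zone, .NET's TimeZoneInfo can do the conversion. This is a small illustrative helper (Convert-UtcToZone is a hypothetical name) that uses Windows time zone IDs; on Windows it needs no extra modules.

```powershell
function Convert-UtcToZone {
    param(
        [Parameter(Mandatory)][datetime]$UtcTime,
        [Parameter(Mandatory)][string]$TimeZoneId
    )

    # Mark the input as UTC so the conversion starts from the right offset.
    $utc = [datetime]::SpecifyKind($UtcTime, [DateTimeKind]::Utc)
    [TimeZoneInfo]::ConvertTimeBySystemTimeZoneId($utc, $TimeZoneId)
}

# 9 PM UTC on a winter date and on a summer date, shown in US Eastern time.
$winter = Convert-UtcToZone -UtcTime '2026-01-15 21:00' -TimeZoneId 'Eastern Standard Time'
$summer = Convert-UtcToZone -UtcTime '2026-07-15 21:00' -TimeZoneId 'Eastern Standard Time'

$winter.ToString('HH:mm')   # 16:00 (UTC-5, standard time)
$summer.ToString('HH:mm')   # 17:00 (UTC-4, daylight saving time)
```

The one-hour difference between the two dates is why a schedule that looked right in July can appear to drift in January.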

To create a reliable recurring schedule via PowerShell:

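# First run: tomorrow at 11 PM. With -TimeZone set, the schedule uses that
# zone rather than the UTC offset of the machine running this script.
$startTime = (Get-Date).Date.AddDays(1).AddHours(23)

# Repeat every day at the same time.
$schedule = New-AzAutomationSchedule `
  -AutomationAccountName "MyAutomationAccount" `
  -ResourceGroupName "MyResourceGroup" `
  -Name "NightlyVMShutdown" `
  -StartTime $startTime `
  -DayInterval 1 `
  -TimeZone "Eastern Standard Time"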

Register-AzAutomationScheduledRunbook `
  -AutomationAccountName "MyAutomationAccount" `
  -ResourceGroupName "MyResourceGroup" `
  -RunbookName "Stop-AzureVMs" `
  -ScheduleName "NightlyVMShutdown"
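After creating the schedule, it's worth reading it back rather than trusting the portal view. A minimal sketch, assuming the same placeholder names as above:

```powershell
# Placeholder names: replace with your own.
$common = @{
    ResourceGroupName     = "MyResourceGroup"
    AutomationAccountName = "MyAutomationAccount"
}

# Confirm the schedule is enabled and that the next run falls where you expect.
Get-AzAutomationSchedule @common -Name "NightlyVMShutdown" |
    Select-Object Name, IsEnabled, TimeZone, StartTime, NextRun, ExpiryTime

# Confirm the runbook is actually linked to the schedule.
Get-AzAutomationScheduledRunbook @common -RunbookName "Stop-AzureVMs"
```

If the second command returns nothing, the schedule exists but will never start the runbook.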

For broken webhooks: Webhooks in Azure Automation have an expiry date; the default is one year from creation. Go to Process Automation > Webhooks and check the Expires column. An expired webhook returns HTTP 404 to the caller but doesn't log anything useful in the Automation Account itself, which makes it look like the trigger never fired.

The webhook URL is also shown only once, at creation time, and you must save it then because Azure never shows it again. If the URL was lost, you must delete the webhook and create a new one, then update every system that calls it. Build this into your runbook operations checklist: set a calendar reminder 30 days before each webhook expires so you can rotate it.

When a webhook successfully triggers a job, you'll see the job appear in Jobs within seconds with a source of "Webhook". If the job never appears, the issue is upstream: the calling system isn't reaching the webhook URL, or it's sending an incorrectly formatted JSON body.
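A calendar reminder can be backed up with a scripted check. The sketch below filters webhook objects by their ExpiryTime property; Select-ExpiringWebhook is a hypothetical helper name, and the 30-day window is just the default chosen here.

```powershell
function Select-ExpiringWebhook {
    param(
        [Parameter(Mandatory, ValueFromPipeline)]$Webhook,
        [int]$WithinDays = 30,
        [datetimeoffset]$Now = [datetimeoffset]::UtcNow
    )
    process {
        # Pass through any webhook that expires inside the window (or has already expired).
        if (([datetimeoffset]$Webhook.ExpiryTime) -le $Now.AddDays($WithinDays)) {
            $Webhook
        }
    }
}

# Usage against a real account (requires Az.Automation and an authenticated session):
# Get-AzAutomationWebhook -ResourceGroupName "MyResourceGroup" -AutomationAccountName "MyAutomationAccount" |
#     Select-ExpiringWebhook -WithinDays 30 |
#     Select-Object Name, RunbookName, ExpiryTime
```

Running this from a scheduled runbook and alerting on non-empty output is one way to avoid finding out about an expired webhook from the calling system.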
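Malformed bodies are easier to spot if the runbook validates the payload explicitly. The sketch below shows one way to do that; ConvertFrom-WebhookBody is a hypothetical helper, and the VMName and Action fields are made-up examples of a payload.

```powershell
function ConvertFrom-WebhookBody {
    param([Parameter(Mandatory)][AllowEmptyString()][string]$RequestBody)

    if ([string]::IsNullOrWhiteSpace($RequestBody)) {
        throw "Webhook body is empty."
    }
    try {
        # Fails loudly on malformed JSON instead of continuing with a null payload.
        $RequestBody | ConvertFrom-Json -ErrorAction Stop
    }
    catch {
        throw "Webhook body is not valid JSON: $($_.Exception.Message)"
    }
}

# Inside the runbook, with param([object]$WebhookData) declared at the top:
# $payload = ConvertFrom-WebhookBody -RequestBody $WebhookData.RequestBody
# Write-Output "Requested action: $($payload.Action) on $($payload.VMName)"

# On the calling side ($webhookUrl is the URL saved when the webhook was created):
# $body = @{ VMName = 'vm-01'; Action = 'Stop' } | ConvertTo-Json
# Invoke-RestMethod -Method Post -Uri $webhookUrl -Body $body -ContentType 'application/json'
```

With this in place, a bad payload shows up as a failed job with a clear message in the Error stream rather than a runbook that quietly does nothing.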

Advanced Troubleshooting for Azure Automation

If the steps above haven't fixed your issue, you're likely dealing with something at the infrastructure or enterprise configuration layer. Here's what to look at next.

RBAC and Managed Identity Scope Problems in Large Organizations

In enterprise environments, Azure subscriptions are often nested under Management Groups with inherited deny policies. Your Automation Account's Managed Identity might have Contributor on the subscription while a deny policy at the Management Group level blocks specific resource actions. Check this in Azure Policy > Compliance and look for assignments with a "Deny" effect that cover the resources your runbook modifies. You'll need to work with your cloud governance team to add an exemption or adjust the policy scope.

For cross-subscription runbooks, where your Automation Account is in subscription A but managing resources in subscription B, you must explicitly assign the Managed Identity a role on the target subscription. The identity's home subscription doesn't automatically grant cross-subscription access.
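A cross-subscription setup has two parts: a one-time role assignment and a context switch inside the runbook. The sketch below uses placeholder IDs and assumes Contributor is the role you want; a narrower built-in role may be enough for your workload.

```powershell
# Placeholders: the identity's object (principal) ID and the target subscription ID.
$principalId          = "<managed-identity-object-id>"
$targetSubscriptionId = "<subscription-B-id>"

# One-time setup, run by someone with rights to assign roles on subscription B.
New-AzRoleAssignment -ObjectId $principalId `
    -RoleDefinitionName "Contributor" `
    -Scope "/subscriptions/$targetSubscriptionId"

# Inside the runbook: sign in, then point the session at subscription B
# before calling any cmdlet that touches its resources.
Connect-AzAccount -Identity -ErrorAction Stop | Out-Null
Set-AzContext -Subscription $targetSubscriptionId -ErrorAction Stop | Out-Null
```

If Set-AzContext fails here, the role assignment on subscription B is the first thing to check.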

Runbook Hitting Sandbox Resource Limits

Azure Automation sandboxes are shared infrastructure. Each job gets a 3-hour max wall-clock time limit and roughly 400 MB of available memory. Long-running operations like large-scale VM inventories or bulk resource tagging jobs frequently hit these. The symptom is a job that just stops with status "Failed" and the error stream shows Job was evicted.

The fix: break the workload into smaller batches. Use Azure Automation Variables or an Azure Storage Table to checkpoint progress between runs, then trigger the next batch with a webhook or schedule. For jobs that genuinely need to run longer than 3 hours, move them to a Hybrid Runbook Worker. Hybrid Workers run directly on your machine and aren't constrained by the shared sandbox limits.
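The batching idea can be sketched as a small slicing helper plus a stored offset. Get-NextBatch is a hypothetical function, the variable name TagJobOffset is made up, and the commented section assumes an Automation Variable of that name already exists in the account.

```powershell
function Get-NextBatch {
    param(
        [Parameter(Mandatory)][AllowEmptyCollection()][object[]]$Items,
        [Parameter(Mandatory)][int]$Offset,
        [int]$BatchSize = 50
    )

    # Return the next slice of work, starting where the previous run stopped.
    $Items | Select-Object -Skip $Offset -First $BatchSize
}

# Inside a runbook (Get-/Set-AutomationVariable exist only in the Automation sandbox):
# $offset = [int](Get-AutomationVariable -Name 'TagJobOffset')
# $batch  = @(Get-NextBatch -Items $allVms -Offset $offset -BatchSize 50)
# foreach ($vm in $batch) { <# process one VM #> }
# Set-AutomationVariable -Name 'TagJobOffset' -Value ($offset + $batch.Count)
```

Each run processes one slice and records how far it got, so a job that is cut off loses at most one batch of work.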

Source Control Integration Sync Failures

Azure Automation's source control integration (supporting GitHub, Azure DevOps, and generic Git) will fail silently if the Personal Access Token (PAT) used to authenticate with the repo expires. In your Automation Account, go to Source Control, find your source control configuration, and check the last sync time. If it's stale, click into the configuration, update the PAT, and trigger a manual sync.

For Azure DevOps integrations, make sure the PAT has at minimum Code (Read) permissions on the repository. If your organization enforces conditional access policies on Azure DevOps, the sync service principal may be blocked. Check the Azure DevOps organization's audit log for blocked sign-in events from the Automation account's service identity.

Python Runbook Package Dependency Resolution

Python 3 runbooks fail when a package has transitive dependencies that aren't explicitly imported into the Automation account. Unlike local pip installs that resolve the entire dependency tree automatically, Azure Automation requires you to import each dependency package individually. Run pip show <your-package> locally to see the Requires field, then import each one listed into Shared Resources > Python packages.
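If you're collecting dependencies for several packages, the Requires field can be pulled out of the pip show output with a few lines of PowerShell. Get-PipRequires is a hypothetical helper; it only reads the text pip prints and does not resolve dependencies of dependencies.

```powershell
function Get-PipRequires {
    param([Parameter(Mandatory)][AllowEmptyString()][string[]]$PipShowOutput)

    foreach ($line in $PipShowOutput) {
        # Match only the "Requires:" line, not "Required-by:".
        if ($line -match '^Requires:\s*(.*)$') {
            return @($Matches[1] -split ',' | ForEach-Object { $_.Trim() } | Where-Object { $_ })
        }
    }
    return @()
}

# Usage on a machine with pip installed:
# Get-PipRequires -PipShowOutput (pip show requests)
```

Run it again on each package it returns to walk further down the dependency tree.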

When to Call Microsoft Support

Escalate to Microsoft Support when your Automation Account shows as healthy but jobs never reach "Running" state (possible back-end platform issue), when DSC node registration fails consistently despite correct credentials and network access (possible service-side certificate trust issue), or when the Hybrid Worker extension installs successfully on an Arc-enabled server but the worker never appears in the portal after 30 minutes. For these, open a support case under the Azure Automation > Runbook Authoring and Execution category and include the Automation Account resource ID, the approximate UTC timestamps of failures, and any job IDs. This dramatically speeds up support triage.

Prevention & Best Practices for Azure Automation

Most of the Azure Automation fires I've helped put out were completely preventable. Here's what I'd put in place on day one of any Automation account setup.

Use Managed Identity from the start, not Run As accounts. The Run As account is deprecated. It relies on a certificate that expires annually and requires manual renewal. Managed Identity is token-based, automatically rotated, and tied directly to Azure RBAC. It's the right architecture for new runbooks and worth migrating existing ones to.

Pin your module versions. Don't leave modules on auto-update. When Az.Compute releases a breaking change and your runbook imports it on the next run, you'll have a production outage before you know what changed. Keep a tested set of module versions and update them through a change-controlled process. Tag your Automation Account resources with the last-tested module versions so you know what you validated.

Store secrets in Automation Credentials and Variables, never hardcoded. Azure Automation's Credentials asset type encrypts username/password pairs at rest. Variables marked as Encrypted are also encrypted. This isn't just a security best practice: it means you can rotate credentials without touching a single line of runbook code. Just update the credential asset.
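Inside a runbook, reading those assets takes one line each. The asset names below are made up for illustration, and both cmdlets are only available when the code runs as an Azure Automation job.

```powershell
# Hypothetical asset names: create them under Shared Resources first.
$credential = Get-AutomationPSCredential -Name 'SqlServiceAccount'   # returns a PSCredential
$apiBaseUrl = Get-AutomationVariable -Name 'ApiBaseUrl'              # returns the stored value

# Use the credential object directly; avoid writing the password to any stream.
Write-Output "Connecting to $apiBaseUrl as $($credential.UserName)"
```

Rotating the password then means updating the SqlServiceAccount asset, with no change to the runbook.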

Test runbooks in a non-production Automation Account first. I know it sounds obvious, but many teams deploy runbooks directly to the account that manages production workloads. A separate "dev" Automation Account costs almost nothing and saves you from a runbook that accidentally stops production VMs during a test run.

Set up Azure Monitor alerts on Automation job failures. Go to Monitoring > Alerts in your Automation Account and create an alert rule on the Total Jobs metric, filtered to the Failed value of its Status dimension. Route it to an Action Group that sends an email or Teams message. Don't wait to discover runbook failures the next morning. Get paged in real time.

Quick Wins
  • Enable Managed Identity on your Automation Account and assign roles via Azure Policy so new accounts get the right permissions automatically at creation.
  • Set a calendar reminder 30 days before any webhook's expiry date. Rotating webhooks proactively takes 5 minutes; discovering an expired one in a production incident takes hours.
  • Use -ErrorAction Stop throughout your runbooks so exceptions always surface in the Error stream instead of getting swallowed.
  • For Hybrid Runbook Worker machines, enable Azure Monitor Agent and onboard them to Change Tracking and Inventory. You'll get visibility into software, registry, and service changes that affect worker health before they cause job failures.
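To see why -ErrorAction Stop matters, compare how try/catch behaves with and without it. The sketch below uses a file path that is assumed not to exist; Test-ErrorIsCaught is a hypothetical helper written only for this demonstration.

```powershell
function Test-ErrorIsCaught {
    param([string]$Preference = 'Continue')

    # A path assumed not to exist, so Get-Item always reports an error.
    $missing = Join-Path ([IO.Path]::GetTempPath()) 'no-such-file-for-errordemo.txt'
    try {
        Get-Item -Path $missing -ErrorAction $Preference 2>$null | Out-Null
        return $false   # the error was non-terminating, so catch never ran
    }
    catch {
        return $true    # the error was terminating and reached catch
    }
}

Test-ErrorIsCaught -Preference Continue   # False: the error is reported but execution continues
Test-ErrorIsCaught -Preference Stop       # True: the error becomes a catchable exception
```

Setting $ErrorActionPreference = 'Stop' at the top of a runbook applies the same behavior to every cmdlet without repeating the parameter.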

Frequently Asked Questions

Why does my Azure Automation runbook work when I run it manually but fail on a schedule?

The most common reason is timezone misconfiguration on the schedule itself: the time looks right in the portal, but the underlying UTC offset is wrong, so the job runs at an unexpected time and hits resources that are in a different state than you expect. The second cause is a mismatch between what you tested and what the schedule runs. The Test pane runs the draft version of the runbook, while a schedule runs the last published version with the parameter values saved on the schedule link. And if you validated the script locally under your own account, remember that scheduled jobs authenticate as the Automation Account's Managed Identity; if that identity lacks a permission your user account has, the scheduled job fails with an authorization error. Check the schedule's timezone, the published version, and the Managed Identity role assignments.

What is Azure Automation and how is it different from Azure Logic Apps or Azure Functions?

Azure Automation is built specifically for IT operations and infrastructure management: starting and stopping VMs, applying system configurations, running maintenance scripts, and managing resources at scale using PowerShell or Python runbooks. Azure Logic Apps is a low-code integration platform designed for connecting SaaS apps and orchestrating business workflows. Azure Functions is a general-purpose event-driven compute platform for running application code. The big differentiator for Azure Automation is the Hybrid Runbook Worker, which lets your runbooks reach into on-premises environments and manage non-Azure machines, something Logic Apps and Functions don't natively support for server-level management tasks.

My Azure Automation Hybrid Runbook Worker shows as "Disconnected". How do I fix it?

First check the HybridWorkerService Windows service on the machine. If it's stopped, start it and check Event Viewer under Applications and Services Logs > Microsoft-Automation/Operational for errors. The most frequent causes are expired certificates (check the cert in the Automation extension configuration), firewall rules blocking port 443 outbound to *.azure-automation.net and related Azure endpoints, or the machine being offline long enough that the heartbeat lease expired on the Azure side. For machines offline more than 30 days, you'll likely need to delete the worker from the Hybrid Worker Group in the portal and re-onboard it using the extension-based deployment method.

How do I fix "CommandNotFoundException" errors in my Azure Automation runbooks?

This error means the cmdlet you're calling exists in a module that either isn't imported in your Automation account or is imported in a version too old to include that cmdlet. Go to Automation Account > Shared Resources > Modules and check whether the required module (e.g., Az.Compute, Az.Network) is listed and what version it shows. If the module is missing, click Browse gallery to import it. If it's present but outdated, update Az.Accounts first (wait for it to complete), then update the specific module. Wait for each update to finish before re-running your runbook, because module imports can take 3–5 minutes to fully propagate.

What does "Job was evicted and subsequently reached a Failed state" mean in Azure Automation?

This means your runbook hit one of Azure Automation's sandbox resource limits, either the 3-hour maximum job execution time or the approximately 400 MB memory cap on the shared sandbox infrastructure. It's most common in runbooks that iterate over large collections of Azure resources (hundreds of VMs, thousands of storage objects) or that do heavy in-memory data processing. The fix is either to break the workload into smaller batches across multiple scheduled jobs using checkpointing, or to move the runbook to a Hybrid Runbook Worker where these sandbox constraints don't apply since the job runs directly on your own machine.

Can I use Azure Automation to manage non-Azure machines, like on-premises Windows servers?

Yes, this is exactly what the Hybrid Runbook Worker feature is designed for. You deploy the Hybrid Runbook Worker extension on your on-premises Windows or Linux server (or on machines in other cloud providers), and that machine registers with your Azure Automation Account. Runbooks routed to that Hybrid Worker Group execute directly on the on-premises machine with full access to local resources like Active Directory, SQL Server, file systems, and network shares. For non-Azure machines without a full Azure agent, Microsoft recommends onboarding them through Azure Arc, which provides a consistent management plane and simplifies the Hybrid Worker deployment significantly.

Sai Kiran Pandrala
Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.