Azure Chaos Studio: Setup, Config & Troubleshooting Guide

Updated April 20, 2026

Why Azure Chaos Studio Experiments Fail (And Why the Error Messages Don't Help)

I've watched engineers spend entire afternoons staring at a chaos experiment that's stuck in a "Failed" or "Cancelled" state with nothing more helpful in the portal than a generic status badge. Azure Chaos Studio is a genuinely powerful managed service for chaos engineering: the practice of deliberately injecting real-world faults into your cloud applications to measure how they hold up under stress. But getting it configured correctly the first time? That's a different story.

Here's the scenario I see constantly: a team spins up Azure Chaos Studio for the first time, maybe ahead of a big Game Day drill or a pre-release resilience validation. They create an experiment, hit Start, and... nothing happens. Or the experiment starts and immediately errors out. The portal gives them a red status icon and a vague message like "The experiment action failed." That's it. No fault code, no pointer to which resource caused the problem, no next step. I know how frustrating that is, especially when your entire reliability team is waiting on results.

The root causes almost always fall into one of four buckets:

  1. Missing or misconfigured targets and capabilities. Every Azure resource you want to run faults against must first be registered as a target in Chaos Studio, with the specific capabilities (fault types) explicitly enabled. Skip this step and experiments will silently fail or refuse to start entirely.
  2. Role assignment gaps. Azure Chaos Studio needs a managed identity assigned to the experiment, and that identity must have the correct RBAC permissions on the target resources. Without this, the service literally cannot interact with your VMs, AKS clusters, or Redis caches, even if the experiment definition looks perfect.
  3. Agent-based fault prerequisites not met. When you're running in-guest faults on VMs or virtual machine scale sets (things like memory pressure or process kill), you need the Chaos Agent installed and healthy on the target machine. A lot of teams skip agent health validation and then wonder why their fault never fires.
  4. Experiment logic conflicts. Steps run sequentially, and the branches inside each step run in parallel. If you've set up timing dependencies incorrectly (for instance, a fault with a 10-minute duration inside a step that a downstream step is waiting on), experiments can stall, time out, or produce misleading results.

The underlying architecture of Azure Chaos Studio is not complicated once you understand it, but the setup surface area is wide. You're dealing with Azure Resource Manager targets, capability registrations, managed identities, RBAC, and optionally a VM-resident agent, all of which have to be correctly wired together before a single fault fires. When any link in that chain is broken, the error messages you get back rarely tell you which link snapped.

This guide walks you through every layer, from first-time experiment setup failures to advanced agent health debugging and CI/CD pipeline integration issues.

The Quick Fix: Try This First

If your Azure Chaos Studio experiment is failing to start, failing immediately after start, or sitting indefinitely in a "Running" state with no fault activity, run through this checklist before anything else. This resolves the majority of cases I see.

Step 1: Verify targets are enabled. Go to Azure portal → Chaos Studio → Targets. Find the resource you're running faults against. If the target shows "Not enabled" or no capabilities are listed, that's your problem right there. Click the resource, select Manage actions, and enable the capabilities your experiment requires (for example, Virtual Machine Shutdown or CPU Pressure).

Step 2: Check the experiment's managed identity. Open your experiment in the portal. Under Identity, confirm that a system-assigned or user-assigned managed identity is attached. Then go to the target resource's Access control (IAM) blade. Verify the experiment's identity has a role assignment on that resource that covers the fault you're running: a narrow built-in role such as Virtual Machine Contributor for VM faults, or a purpose-built custom role. No role assignment means no fault execution, full stop.

Step 3: Look at the experiment run details. In the portal, open your experiment and click on a past run. Expand each step and branch to see which specific action failed. This is the most useful troubleshooting view in the entire portal, and most people miss it because the top-level status view doesn't drill down. The action-level detail often shows you the exact failure reason, such as "Target not found" or "Capability not enabled on resource".

Step 4: For agent-based faults, check agent health. Navigate to Chaos Studio → Agents and check the health status of agents on your target VMs. A "Degraded" or "Offline" status means the in-guest fault cannot run. The agent communicates over outbound HTTPS on port 443, so if your VM has restrictive NSG rules, the agent can install fine but fail to phone home.

If all four of those check out and you're still stuck, move into the step-by-step section below. The problem is almost certainly in RBAC granularity, network configuration, or experiment logic structure.

Pro Tip
Before running any chaos experiment for the first time against a production or preproduction environment, always do a dry run in a shift-left development environment first. Azure Chaos Studio explicitly supports both shift-left (dev/test, no real customer traffic) and shift-right (production with real load) scenarios. Running shift-left first lets you validate your experiment structure, selector groupings, and timing logic without any risk to live workloads, and it means your first real Game Day drill isn't also your debugging session.
1. Enable Targets and Capabilities on Every Resource You Want to Fault

This is the step that trips up the most first-time Chaos Studio users. The service doesn't automatically see all your Azure resources. You have to explicitly onboard each one as a target and then enable the specific capabilities (fault types) you want to use.

In the Azure portal, navigate to Chaos Studio → Targets. You'll see a filterable list of resources in your subscription. Look for resources showing "Not enabled" in the status column; those cannot be used in any experiment yet.

To enable a target:

  1. Select the resource checkbox.
  2. Click Enable targets in the top menu bar.
  3. Choose either Enable service-direct targets (for faults that run directly against the Azure control plane, no agent needed) or Enable agent-based targets (for in-guest faults on VMs or VMSS).

After enabling the target, click into it and select Manage actions. Here you'll see individual capabilities like Shutdown, CPU Pressure, Virtual Memory Pressure, Network Latency, and so on. Enable only the capabilities your experiment will use; you can always add more later.

Via Azure CLI, this looks like:

az rest --method put \
  --url "https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/{providerNamespace}/{resourceType}/{resourceName}/providers/Microsoft.Chaos/targets/Microsoft-VirtualMachine?api-version=2023-11-01" \
  --body '{"properties":{}}'

If the command succeeds, you'll get a 200 response with the target object. Note that this call only registers the target; each capability needs its own PUT to the capabilities path under that target before an experiment can use it. If your experiment was previously failing with "Target resource not found in experiment selectors", enabling the target and its capabilities and then re-running the experiment typically clears the error.
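
For bulk onboarding, it can help to generate the request list up front. Here's a minimal sketch that builds the target PUT and one PUT per capability. The helper names are my own, and the capability name used in the usage note (Shutdown-1.0) is an assumption; check the fault library for the exact capability names of your target type.

```python
ARM = "https://management.azure.com"
API_VERSION = "2023-11-01"  # same api-version as the az rest call above


def target_url(resource_id: str, target_type: str) -> str:
    """URL of the PUT that onboards a resource as a Chaos Studio target."""
    return (f"{ARM}{resource_id}/providers/Microsoft.Chaos/targets/"
            f"{target_type}?api-version={API_VERSION}")


def capability_url(resource_id: str, target_type: str, capability: str) -> str:
    """URL of the PUT that enables one capability on an onboarded target."""
    return (f"{ARM}{resource_id}/providers/Microsoft.Chaos/targets/"
            f"{target_type}/capabilities/{capability}?api-version={API_VERSION}")


def onboarding_requests(resource_id, target_type, capabilities):
    """Return (method, url, body) tuples: the target first, then each capability."""
    body = {"properties": {}}
    requests = [("PUT", target_url(resource_id, target_type), body)]
    for capability in capabilities:
        requests.append(
            ("PUT", capability_url(resource_id, target_type, capability), body))
    return requests
```

Calling onboarding_requests(vm_resource_id, "Microsoft-VirtualMachine", ["Shutdown-1.0"]) yields two requests in the order they need to be sent; each one can be passed to az rest or any HTTP client that carries an ARM bearer token.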

2. Assign the Correct RBAC Role to Your Experiment's Managed Identity

Azure Chaos Studio experiments run under a managed identity, either system-assigned (created automatically when you create the experiment) or user-assigned (one you bring yourself). That identity must have permissions to actually interact with the target resources at fault-injection time. Without the right role, the experiment starts, attempts to inject the fault, gets an authorization error from ARM, and fails. The portal often shows this as a generic action failure rather than an explicit "403 Forbidden."

Here's how to set it up correctly:

  1. Open your experiment in the portal.
  2. Go to Identity in the left menu. Enable System assigned if it isn't already on, and copy the Object ID shown.
  3. Navigate to the target resource (e.g., the VM, AKS cluster, or Redis cache).
  4. Click Access control (IAM) → Add role assignment.
  5. Choose a role. For VM-based faults, Virtual Machine Contributor is typically sufficient. For AKS Chaos Mesh faults, Azure Kubernetes Service Cluster Admin Role may be required depending on your cluster configuration.
  6. Under Members, select Managed identity, then find and select your experiment's identity by the Object ID you copied.

If you're operating in a large enterprise environment with a locked-down subscription, you may not have permission to assign roles yourself. In that case, you need to raise a request to your Azure platform team or subscription owner. When you do, ask them specifically for the minimum required role on each target resource the experiment references, not broad Contributor on the resource group.

After the role assignment propagates (usually a couple of minutes, occasionally longer), re-run the experiment. You should see the action move past its previous failure point.
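
If you'd rather verify the assignment than wait and hope, here's a rough sketch of the check. It assumes you've exported the identity's assignments as JSON (for example from az role assignment list) into a list of dicts with principalId, roleDefinitionName, and scope fields. The function names are hypothetical. The key idea is that an assignment applies when its scope is the target resource itself or any parent scope.

```python
def covers(assignment_scope: str, resource_id: str) -> bool:
    """True if a role assigned at assignment_scope is inherited by resource_id."""
    scope = assignment_scope.rstrip("/").lower()
    rid = resource_id.rstrip("/").lower()
    # ARM IDs are case-insensitive; require a full path-segment match so that
    # ".../rg1" does not accidentally cover ".../rg10".
    return rid == scope or rid.startswith(scope + "/")


def roles_on_target(assignments, principal_id: str, resource_id: str):
    """Role names the given identity holds on the target, directly or inherited."""
    return sorted({
        a["roleDefinitionName"]
        for a in assignments
        if a["principalId"] == principal_id and covers(a["scope"], resource_id)
    })
```

An empty result for the experiment identity's Object ID means the fault has nothing to authorize it, which matches the generic action failure described above.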

3. Install and Verify the Chaos Agent for In-Guest Faults

Service-direct faults don't need anything installed on the target resource; they call the Azure control plane directly. But agent-based faults (things like CPU pressure, virtual memory pressure, disk I/O stress, or process kill) require the Chaos Agent to be installed and running inside the VM or VMSS instance.

The agent ships as a VM extension, which makes installation straightforward via the portal or CLI. In the portal:

  1. Go to Chaos Studio → Agents.
  2. Click Install agent.
  3. Select the target VM or VMSS and follow the prompts. The portal installs the agent as a VM extension automatically.

Via CLI:

az vm extension set \
  --resource-group myRG \
  --vm-name myVM \
  --name ChaosLinuxAgent \
  --publisher Microsoft.Azure.Chaos \
  --version 1.0 \
  --settings '{"profile": "https://management.azure.com/..."}'

After installation, go back to Chaos Studio → Agents and check the Health Status column. You want to see "Healthy". If you see "Degraded" or "Offline", the most common culprits are:

  • NSG rules blocking outbound port 443 from the VM to the Chaos Studio service endpoint.
  • The VM's system-assigned identity not having the correct profile URL in the agent settings.
  • The VM extension installed but the agent service not starting. Check the VM's extension logs at /var/log/azure/Microsoft.Azure.Chaos.ChaosLinuxAgent/ on Linux.

Once the agent is healthy, agent-based fault actions in your experiments will execute correctly.

4. Build Your Experiment Structure Correctly: Steps, Branches, and Selectors

A lot of Chaos Studio experiments fail not because of permissions or agents, but because the experiment logic itself is wrong. Understanding how steps, branches, and selectors work is essential to getting reliable results.

The structure works like this: an experiment has one or more steps that execute sequentially, one after another. Inside each step, you have one or more branches that run simultaneously. Inside each branch, you define one or more actions, each either a fault or a time delay. Selectors are named groups of target resources that you reference inside actions. They let you reuse the same set of targets across multiple faults without redefining it each time.
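
To sanity-check timing before you hit Start, you can estimate how long an experiment should run from its structure alone. This is a minimal sketch, assuming actions inside a branch run one after another, durations are simple ISO 8601 strings like PT10M, and actions without a duration (discrete faults) take no measurable time. The function names are my own.

```python
import re

_DURATION = re.compile(r"^PT(?:(\d+)H)?(?:(\d+)M)?(?:(\d+)S)?$")


def seconds(iso_duration: str) -> int:
    """Convert a simple ISO 8601 duration such as 'PT10M' or 'PT1H30M' to seconds."""
    match = _DURATION.match(iso_duration)
    if not match or not any(match.groups()):
        raise ValueError(f"unsupported duration: {iso_duration!r}")
    hours, minutes, secs = (int(g or 0) for g in match.groups())
    return hours * 3600 + minutes * 60 + secs


def branch_seconds(branch: dict) -> int:
    """Actions in a branch are summed; actions without a duration count as zero."""
    return sum(seconds(a["duration"]) for a in branch["actions"] if "duration" in a)


def experiment_seconds(steps: list) -> int:
    """Steps run in sequence; each step lasts as long as its slowest branch."""
    return sum(max(branch_seconds(b) for b in step["branches"]) for step in steps)
```

If the estimate is far from what you intended, the usual cause is a long fault sitting in a parallel branch that holds up the whole step.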

Common structural mistakes I've diagnosed:

  • Forgetting time delays between fault steps. If you apply CPU pressure in Step 1 and want to measure recovery before applying memory pressure in Step 2, you need to either use a delay action or ensure Step 1's fault duration is long enough. Without a delay, Step 2 starts immediately after Step 1's action finishes, which may not give your application time to recover and stabilize.
  • Selectors pointing to disabled targets. If you define a selector called AllProdVMs but one of the VMs in that group hasn't had its target capability enabled, the entire action that references that selector can fail.
  • Mixing incompatible fault types in a single branch. Agent-based and service-direct faults can coexist in an experiment, but make sure you're not trying to run an agent-based fault against a resource that only has service-direct capabilities enabled, and vice versa.

When editing experiment JSON directly (via REST API or ARM template), the selector IDs referenced in actions must exactly match the IDs defined in the selectors section. They're case-sensitive strings (the portal generates GUIDs by default, but any unique name works). A mismatch here produces a validation error when you try to start the experiment.
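
A quick pre-flight check catches selector mismatches before the service does. Here's a small sketch, assuming the experiment JSON keeps selectors and steps under properties and that each fault action carries a selectorId field; the function name is hypothetical.

```python
def unknown_selector_refs(experiment: dict) -> list:
    """Return (step, branch, action, selectorId) for every action whose
    selectorId is not defined in the experiment's selectors section."""
    props = experiment["properties"]
    defined = {selector["id"] for selector in props["selectors"]}
    problems = []
    for step in props["steps"]:
        for branch in step["branches"]:
            for action in branch["actions"]:
                selector_id = action.get("selectorId")  # delay actions have none
                # Exact, case-sensitive comparison on purpose.
                if selector_id is not None and selector_id not in defined:
                    problems.append(
                        (step["name"], branch["name"], action["name"], selector_id))
    return problems
```

An empty list means every reference resolves. Anything else tells you exactly which step, branch, and action to fix.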

5. Integrate Chaos Experiments into Your CI/CD Pipeline

One of the most powerful uses of Azure Chaos Studio is running chaos experiments as deployment gates in your continuous integration and continuous delivery pipelines. This way, every new deployment gets automatically resilience-tested before it goes further in the pipeline, catching regressions before they hit production.

The general pattern is: trigger an experiment via the Chaos Studio REST API after a successful deployment, poll for completion, and then evaluate the experiment outcome as a pass/fail gate.

Start the experiment via REST:

POST https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.Chaos/experiments/{experimentName}/start?api-version=2023-11-01
Authorization: Bearer {token}

Poll for status:

GET https://management.azure.com/subscriptions/{subscriptionId}/resourceGroups/{resourceGroup}/providers/Microsoft.Chaos/experiments/{experimentName}/executions/{executionId}?api-version=2023-11-01

The response includes a status field. Watch for "Success", "Failed", or "Cancelled". In your pipeline, treat anything other than "Success" as a gate failure and halt the promotion.
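
The polling loop itself is easy to get subtly wrong (no timeout, or treating "Cancelled" as a pass), so here's a minimal sketch of the gate logic. The get_status callable is a placeholder I've made up: it stands for whatever code performs the GET above and returns the status string. The timeout and interval defaults are arbitrary assumptions.

```python
import time

TERMINAL = {"Success", "Failed", "Cancelled"}  # status values named above


def wait_for_execution(get_status, timeout_s=1800, interval_s=15,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll until the execution reaches a terminal status or the timeout passes.

    get_status is a hypothetical callable that performs the GET on the
    execution URL and returns the status string from the response.
    """
    deadline = clock() + timeout_s
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if clock() >= deadline:
            raise TimeoutError(f"execution still '{status}' after {timeout_s}s")
        sleep(interval_s)


def gate_passed(status: str) -> bool:
    """Only an explicit Success lets the deployment continue."""
    return status == "Success"
```

In a pipeline step you'd exit non-zero when gate_passed returns False or when the TimeoutError is raised, so a hung experiment blocks promotion instead of silently passing.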

For Azure DevOps, you can wrap these REST calls in a pipeline task using the Azure CLI task with a service connection that has Contributor rights on the Chaos Studio experiment resource. For GitHub Actions, use the azure/login action and then call the REST API via curl or the Azure CLI.

One important note: make sure your CI/CD pipeline's service principal has the Chaos Studio Experiment Contributor built-in role (or an equivalent custom role) on the experiment resource, not just on the target resources. The experiment identity and the pipeline identity are separate things.

Advanced Troubleshooting for Azure Chaos Studio

Diagnosing Experiment Failures with Azure Monitor and Event Logs

When experiments fail in ways that the portal's run detail view doesn't fully explain, Azure Monitor is your next stop. Azure Chaos Studio can emit fault events directly to Azure Monitor and to Application Insights. These give you timestamp-correlated fault injection events alongside your application telemetry, which is invaluable when you're trying to understand whether your app actually detected and responded to a fault.

To configure fault event emission to Azure Monitor, go to your experiment's diagnostic settings and add a Log Analytics workspace as a destination. Once configured, fault start and stop events appear in the AzureDiagnostics table with a ResourceType of EXPERIMENTS.

You can query them like this:

AzureDiagnostics
| where ResourceType == "EXPERIMENTS"
| where TimeGenerated > ago(24h)
| project TimeGenerated, OperationName, ResultType, ResultDescription
| order by TimeGenerated desc

The ResultDescription column is where the real detail lives. It often contains the specific ARM error code that the portal swallows, like AuthorizationFailed, TargetNotFound, or CapabilityNotEnabled.
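
If you export those rows, a tiny triage helper can point each failure back at the section of this guide that fixes it. This is only a sketch: the mapping covers the three codes mentioned above, and the hint wording and function name are my own.

```python
NEXT_CHECK = {
    "AuthorizationFailed": "RBAC: check the experiment identity's role on the target",
    "TargetNotFound": "Targets: confirm the resource is onboarded as a target",
    "CapabilityNotEnabled": "Capabilities: enable the fault's capability on the target",
}


def triage(result_description: str) -> list:
    """Return follow-up hints for every known error code found in the text."""
    text = result_description.lower()
    hints = [hint for code, hint in NEXT_CHECK.items() if code.lower() in text]
    return hints or ["No known code found: read the full description"]
```

The match is case-insensitive substring matching, which is deliberately forgiving because the code is often embedded in a longer message.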

Private Link and Network-Restricted Environments

Enterprise environments that use Azure Private Link or restrict outbound internet access from VMs are a common source of agent health problems. The Chaos Agent communicates with the Chaos Studio service over HTTPS (port 443), and it needs to reach the service endpoint for your region. If your VM sits in a subnet with a restrictive NSG or a UDR that forces all traffic through a firewall appliance, the agent will install successfully but stay in a "Degraded" or "Offline" state because it can't establish its outbound connection.
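
To reason about whether an NSG is the culprit, it helps to remember that rules are evaluated in priority order (lowest number first) and the first match wins. The sketch below applies that to outbound TCP 443 only. It is deliberately simplified: it assumes rule dicts shaped like the JSON from az network nsg rule list, it ignores source and destination address prefixes and the plural destinationPortRanges field, and it treats "no custom rule matched" as allowed because Azure's default outbound rules permit internet traffic. It says nothing about UDRs or firewall appliances.

```python
def port_matches(spec: str, port: int) -> bool:
    """Match a single port, a 'low-high' range, or the '*' wildcard."""
    if spec == "*":
        return True
    if "-" in spec:
        low, high = spec.split("-")
        return int(low) <= port <= int(high)
    return int(spec) == port


def outbound_443_verdict(rules: list) -> tuple:
    """Return (access, rule_name) for the first outbound rule matching TCP 443."""
    candidates = [
        r for r in rules
        if r["direction"].lower() == "outbound"
        and r["protocol"].lower() in ("tcp", "*")
    ]
    for rule in sorted(candidates, key=lambda r: r["priority"]):
        if port_matches(rule["destinationPortRange"], 443):
            return rule["access"], rule["name"]
    # Nothing custom matched, so the platform default rules apply.
    return "Allow", "default rules"
```

A "Deny" verdict here is a strong hint for a Degraded or Offline agent; an "Allow" verdict only rules out this one layer.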

To fix this without opening broad internet access, configure a Private Endpoint for Azure Chaos Studio in your VNet. This routes agent communication through the Azure backbone rather than the public internet. The private endpoint configuration is available in the Chaos Studio portal blade under Networking.

AKS Chaos Mesh Integration Failures

Azure Chaos Studio supports AKS faults through Chaos Mesh, a chaos engineering framework that runs inside your Kubernetes cluster. If your AKS Chaos Mesh fault actions are failing, the most common cause is that the Chaos Mesh CRDs (Custom Resource Definitions) haven't been installed in the cluster, or the Chaos Studio managed identity doesn't have Azure Kubernetes Service Cluster Admin Role on the AKS resource.

Verify Chaos Mesh is running in your cluster:

kubectl get pods -n chaos-testing

You should see pods for chaos-controller-manager, chaos-daemon, and chaos-dashboard. If that namespace doesn't exist, Chaos Mesh hasn't been installed. Chaos Studio can install it for you when you enable AKS targets, but only if the managed identity has sufficient cluster admin permissions at the time of target enablement.
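
If you want this check in a script rather than by eye, here's a small sketch that takes the pod names from that command's output and reports which of the three components is missing. The function name is hypothetical, and it matches by name prefix because pod names carry generated suffixes.

```python
EXPECTED_COMPONENTS = (
    "chaos-controller-manager",
    "chaos-daemon",
    "chaos-dashboard",
)


def missing_components(pod_names: list) -> list:
    """Return the expected Chaos Mesh components with no matching pod name."""
    return [
        component for component in EXPECTED_COMPONENTS
        if not any(name.startswith(component) for name in pod_names)
    ]
```

An empty list means all three are present; it does not confirm the pods are Running, so still check the STATUS column.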

Experiment Runs Scheduled via Azure Scheduler

Azure Chaos Studio supports scheduling experiments to run on a defined cadence, useful for recurring BCDR drills or weekly resilience validation runs. If a scheduled experiment isn't firing, verify that the schedule is still active in the experiment's Schedule blade and that the experiment resource hasn't been accidentally moved to a different resource group, which can break the scheduler's reference to it.

When to Call Microsoft Support
If you've confirmed that targets are enabled, capabilities are correct, RBAC is properly assigned, and the Chaos Agent is healthy, but experiments are still failing with cryptic errors in Azure Monitor logs like InternalServerError or ServiceUnavailable, that's a platform-side issue and you should open a support ticket. Also escalate if you're seeing discrepancies between what the Chaos Studio portal reports and what you observe in your application (e.g., the portal says the fault ran successfully but your monitoring shows no disruption occurred). Contact Microsoft Support with your experiment resource ID, subscription ID, and the specific execution ID from the failed run. This dramatically speeds up the investigation.

Prevention & Best Practices for Azure Chaos Studio

I've seen teams get a lot of value from Chaos Studio when they treat it as a continuous practice rather than a one-time drill. The teams that get the most out of chaos engineering (better uptime, fewer major incidents, faster recovery times) all share a few habits that are worth building into your own workflow.

Start with shift-left, earn the right to shift-right. Don't go straight to production chaos experiments. Build your experiment library in a dev or test environment first. Validate that your selectors are correct, your timing makes sense, and your monitoring actually detects the injected faults. Only move to preproduction or production (shift-right) once you have high confidence in the experiment structure. Running a badly formed experiment against production isn't chaos engineering; it's just causing an outage on purpose.

Document your hypothesis before each experiment. A chaos experiment without a hypothesis is just vandalism. Before you run any fault injection, write down what you expect to happen: "I expect the application to fail over to the secondary region within 30 seconds when the primary VM is shut down." Then compare the actual outcome to the hypothesis. This is what turns chaos experiments into real resilience data, and it's what makes post-incident analysis actually useful.

Pair every experiment with monitoring and alerting. Use Chaos Studio's Azure Monitor integration to emit fault events, and make sure those events appear on the same timeline as your application health metrics. If your on-call alerting doesn't fire when your chaos experiment injects a fault that should trigger it, that's a finding: your observability has a gap that a real incident would exploit.

Gate production deployments on chaos experiment results. Once you have a stable experiment library, integrate it into your CI/CD pipeline as a deployment gate. This way, any change that regresses resilience gets caught before it reaches production, not after a real incident exposes it.

Quick Wins
  • Enable targets and capabilities in bulk using the Azure CLI or ARM templates. Don't do it resource-by-resource in the portal for large environments.
  • Use descriptive selector names like ProdEastUSWebTier instead of auto-generated GUIDs. It makes experiment JSON far easier to read and maintain.
  • Set up an Azure Monitor alert rule that fires whenever a chaos experiment run completes with a "Failed" status, so you know immediately when a scheduled drill breaks unexpectedly.
  • Regularly verify Chaos Agent health on all target VMs using the Chaos Studio → Agents view. Agents can go offline after OS updates or NSG changes without any notification.

Frequently Asked Questions

What is Azure Chaos Studio and what is it actually used for?

Azure Chaos Studio is a managed Azure service that lets you deliberately inject faults into your cloud applications and services so you can measure how well they handle disruptions. In practice, teams use it for things like Game Day drills before a major launch, business continuity and disaster recovery testing, high-availability validation when a region or zone goes down, and as an automated gate in CI/CD pipelines to catch resilience regressions before they reach production. It supports two types of faults: service-direct (which hit the Azure control plane with no agent required) and agent-based (which run inside VMs for in-guest failures like CPU or memory pressure). Think of it as a controlled way to find out whether your app's recovery playbooks actually work before a real incident proves they don't.

How do I create and run my first chaos experiment in Azure Chaos Studio?

Start in the Azure portal: search for "Chaos Studio" and open the service. First, go to Targets and enable the Azure resources you want to run faults against, then enable the specific capabilities (fault types) on each target. Next, go to Experiments → Create and build your experiment: add a selector pointing to your target resources, then define steps and branches containing the fault actions you want to run. Assign a managed identity to the experiment and give that identity the appropriate RBAC role on your target resources. Hit Save, then Start. Watch the run detail view, expanding each step and branch, to see the real-time status of each action. If this is your first experiment, start with a service-direct fault like VM Shutdown on a non-production VM so there's no agent setup needed.

Why does my Azure Chaos Studio experiment fail immediately after I start it?

Immediate failure almost always means one of three things: the target resource hasn't been enabled in Chaos Studio (or the required capability hasn't been turned on), the experiment's managed identity doesn't have the correct RBAC role on the target resource, or the experiment references a selector that contains a resource that isn't properly onboarded. Open the failed experiment run in the portal, expand the step and branch that failed, and read the action-level error message; it's usually much more specific than the top-level status. Then cross-check against the Targets view to confirm the resource is enabled, and check the target resource's IAM blade to confirm the managed identity has a role assignment. Fix whichever of those is missing and re-run.

What's the difference between service-direct and agent-based faults in Chaos Studio?

Service-direct faults work by calling Azure's control plane APIs directly, so no software needs to be installed on your resources. Examples include rebooting an Azure Cache for Redis cluster, shutting down a VM via the Azure compute API, or adding network latency to AKS pods via Chaos Mesh. Agent-based faults, on the other hand, run inside the guest operating system of a VM or virtual machine scale set, so they require the Chaos Agent to be installed on the target machine. Examples of agent-based faults include applying CPU pressure (simulating a CPU spike), virtual memory pressure (simulating a memory leak), killing a specific process, or introducing disk I/O stress. The choice between them depends on whether you're testing at the infrastructure layer (service-direct) or at the application/OS layer (agent-based).

The Chaos Agent shows "Degraded" or "Offline". How do I fix it?

The most common cause is network connectivity. The agent communicates over outbound HTTPS (port 443) to the Chaos Studio service endpoint, and if your VM's NSG or firewall blocks that traffic, the agent will install but fail to maintain its connection. Start by checking the VM's NSG rules for any outbound deny rules on port 443, and check if there's a UDR forcing traffic through a firewall appliance that might be dropping it. If your environment requires private connectivity, set up a Private Endpoint for Chaos Studio in your VNet. For Linux VMs, check the agent logs at /var/log/azure/Microsoft.Azure.Chaos.ChaosLinuxAgent/ for specific error messages. For Windows VMs, check the agent logs in C:\WindowsAzure\Logs\Plugins\Microsoft.Azure.Chaos.ChaosWindowsAgent\. A simple NSG rule adjustment or firewall allowlist entry resolves this in most cases.

Can I schedule Azure Chaos Studio experiments to run automatically, like for weekly BCDR drills?

Yes, Azure Chaos Studio has built-in experiment scheduling so you can run recurring drills without manual intervention. In the portal, open your experiment and look for the Schedule option to set up a recurring cadence. You can also trigger experiments programmatically via the Chaos Studio REST API, which is how teams integrate them into CI/CD pipelines as deployment gates: the pipeline starts an experiment after deployment, polls for the result, and uses the outcome as a pass/fail gate before promoting the build further. For scheduled drills specifically, make sure you have Azure Monitor alerts set up to notify your team when a scheduled experiment run fails unexpectedly, since unmonitored scheduled experiments can silently stop working after resource or network changes.

Sai Kiran Pandrala
Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.