Fix Terraform on Azure: Setup & Configuration Errors Solved
Why This Is Happening
I've seen this scenario play out dozens of times: a developer sits down to spin up their first Azure resource with Terraform (maybe a Linux VM, maybe an AKS cluster), and within ten minutes they're staring at a wall of red error text that tells them almost nothing useful: Error: building AzureRM Client: obtain subscription(), or Error: A resource with the ID already exists, or the infuriating Backend initialization required message. You followed the docs, you installed the CLI, you ran terraform init, and yet here we are.
Terraform on Azure is genuinely powerful once it's working. HashiCorp's open-source Infrastructure-as-Code tool lets you define your entire Azure topology (virtual machines, storage accounts, networking, AKS clusters, API Management) in declarative configuration files. That means repeatable, version-controlled, predictable deployments instead of click-ops nightmares. But the path to "working" has several very specific potholes that Microsoft's error messages do almost nothing to help you navigate.
The root causes usually fall into one of four buckets. Authentication failures are the most common: Terraform can't figure out how to talk to your Azure subscription because your service principal is misconfigured, your Azure CLI session has expired, or your environment variables are set to conflicting values. Second is provider version conflicts: the AzureRM provider releases frequently, and older .terraform.lock.hcl files or hard-pinned version constraints cause cryptic failures when you pull someone else's code. Third is state file problems: either you're not using a remote backend yet (you should be) or the Azure Storage backend configuration has a permissions gap. Fourth, and increasingly common as Azure adds preview features, is choosing the wrong provider between AzureRM and AzAPI.
The error messages Azure and Terraform surface almost never point directly at the real problem. StatusCode=403 could mean your service principal is missing a role, your subscription ID is wrong, or your tenant ID doesn't match. That's why you need a systematic approach, and that's exactly what this guide gives you.
The Quick Fix: Try This First
Before you go deep on provider configs and backend state, let's rule out the most common culprit: a stale or missing Azure CLI authentication session. This single issue accounts for roughly 60% of the "it was working yesterday" Terraform on Azure complaints I see.
Open a terminal (PowerShell or Bash, it doesn't matter) and run these three commands in sequence:
az login
az account show
az account set --subscription "YOUR_SUBSCRIPTION_ID_OR_NAME"
The az login command opens a browser window and forces a fresh token. The az account show output confirms exactly which subscription Terraform is going to act on; check that the id field matches what you expect. The az account set command locks in the right subscription if you have access to more than one.
Now do a clean provider re-initialization:
terraform init -upgrade
The -upgrade flag tells Terraform to pull the latest allowed version of every provider in your configuration, rather than using whatever's cached from a previous run. Then run:
terraform validate
terraform plan
If terraform validate passes cleanly and terraform plan shows your expected resource diff without errors, you're good. Go ahead and apply. If you're still hitting errors after this sequence, keep reading: the step-by-step section covers every scenario systematically.
Always run az account show before terraform plan in any troubleshooting session. I can't count how many hours engineers have spent debugging provider configs when the real problem was Terraform silently defaulting to a sandbox subscription rather than the production one. That one command tells you immediately whether authentication is even pointed at the right place.
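If you want that check to be automatic rather than a habit, here is a minimal sketch of a guard you could run before terraform plan. The function name check_subscription and the idea of passing the expected subscription ID as an argument are my own assumptions, not part of any Azure or Terraform tooling; the az account show --query id -o tsv call just prints the active subscription's GUID.

```shell
# Hypothetical helper (not part of az or terraform): compare the active
# Azure CLI subscription with the one you expect before planning.
check_subscription() {
  expected="$1"
  # --query id -o tsv prints only the subscription GUID
  actual="$(az account show --query id -o tsv 2>/dev/null)" || {
    echo "Not logged in; run 'az login' first" >&2
    return 2
  }
  if [ "$actual" != "$expected" ]; then
    echo "Active subscription is $actual, expected $expected" >&2
    return 1
  fi
  echo "Subscription OK: $actual"
}
```

A typical use would be check_subscription "YOUR_SUBSCRIPTION_ID" && terraform plan, so the plan never runs against the wrong subscription.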
Your providers.tf (or equivalent) file is the foundation of everything. A misconfigured provider block is where most Terraform on Azure setup problems originate, especially when you're working from a cloned repo or a tutorial that was written against an older AzureRM provider version.
Open your provider configuration and confirm it matches this structure:
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100"
    }
  }
  required_version = ">= 1.5.0"
}

provider "azurerm" {
  features {}
  subscription_id = var.subscription_id
}
Two things to check immediately. First, the features {} block inside the provider "azurerm" block is not optional. If that block is missing entirely, even as an empty block, you'll get Error: Insufficient features blocks and the provider will refuse to initialize. It's a common gotcha when hand-writing configs from memory. Second, the version constraint. Using ~> 3.100 means "3.100 or any later 3.x release, but not 4.0," which lets you receive bug fixes and new features without accidentally pulling in a major version with breaking changes.
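To make the ~> 3.100 rule concrete, here is a small sketch that mimics the check for that one constraint. The function name in_pessimistic_3_100 is made up for illustration, and it assumes a plain numeric MAJOR.MINOR[.PATCH] version string; Terraform's real constraint solver handles far more cases than this.

```shell
# Illustration only: returns success when a version satisfies "~> 3.100",
# i.e. >= 3.100.0 and < 4.0.0. Assumes numeric MAJOR.MINOR[.PATCH] input.
in_pessimistic_3_100() {
  major="${1%%.*}"     # text before the first dot
  rest="${1#*.}"       # text after the first dot
  minor="${rest%%.*}"  # text before the next dot
  [ "$major" -eq 3 ] && [ "$minor" -ge 100 ]
}
```

So 3.100.0 and 3.117.1 pass, while 3.99.0 (too old) and 4.0.0 (next major) do not.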
After any change to provider version constraints, always run terraform init -upgrade and commit the updated .terraform.lock.hcl file to your repository. Your team will thank you when they clone the repo and get a consistent provider version instead of mysterious drift.
If it worked: terraform init completes without errors and shows "Terraform has been successfully initialized!"
Authentication is the most fragile part of the Terraform on Azure setup, and it has the most ways to go wrong. The AzureRM provider supports several authentication methods: Azure CLI, service principal with client secret, service principal with certificate, managed identity, and OpenID Connect. Most local development uses Azure CLI. CI/CD pipelines use a service principal. Getting them mixed up is a classic source of errors.
For Azure CLI authentication (local dev), the sequence from the Quick Fix section handles it. But also check that your CLI version is recent enough:
az version
If you're on anything older than 2.50.0, run az upgrade. Older CLI versions have token format issues that AzureRM provider 3.x doesn't handle gracefully.
For service principal authentication (the right choice for any automated pipeline), you need four environment variables set before Terraform runs:
export ARM_CLIENT_ID="your-app-registration-client-id"
export ARM_CLIENT_SECRET="your-client-secret-value"
export ARM_TENANT_ID="your-azure-ad-tenant-id"
export ARM_SUBSCRIPTION_ID="your-target-subscription-id"
On Windows PowerShell, replace export with $env:, for example, $env:ARM_CLIENT_ID = "your-app-registration-client-id".
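A missing or misspelled variable is easy to overlook, so a quick pre-flight check can save a confusing authentication error later. This is a minimal sketch; check_arm_env is a hypothetical helper name, and it only verifies that the four variables are non-empty, not that their values are valid.

```shell
# Hypothetical pre-flight check: report which ARM_* variables are unset or empty.
check_arm_env() {
  missing=0
  for name in ARM_CLIENT_ID ARM_CLIENT_SECRET ARM_TENANT_ID ARM_SUBSCRIPTION_ID; do
    # Indirect lookup of the variable named in $name (POSIX-compatible)
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "Missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

In a pipeline you might run check_arm_env && terraform plan so the job fails fast with a readable message.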
If you're getting an AADSTS7000222 error, your client secret has expired; rotate it in Azure Portal under App Registrations > your app > Certificates & secrets. (AADSTS7000215 means the secret is invalid, which often happens when the secret ID is pasted instead of the secret value.) If you see StatusCode=403, the service principal exists but is missing the right Azure role assignment. Go to your subscription in Azure Portal > Access control (IAM) > Role assignments and confirm your service principal has at least Contributor on the target subscription or resource group.
If it worked: terraform plan outputs a resource diff without any authentication-related errors.
Storing your Terraform state file locally is fine for a solo experiment, but it's a disaster waiting to happen on any real project. If two engineers run terraform apply simultaneously against a local state file, you'll end up with conflicting infrastructure. State corruption is painful and sometimes irreversible. The official Microsoft guidance on this is clear: store Terraform state in Azure Storage.
First, create the storage resources. You can do this once manually (or with a bootstrap script) since the backend itself can't manage its own state:
az group create --name tfstate-rg --location eastus

az storage account create \
  --name mytfstateaccount \
  --resource-group tfstate-rg \
  --location eastus \
  --sku Standard_LRS \
  --encryption-services blob

az storage container create \
  --name tfstate \
  --account-name mytfstateaccount
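One caveat about the commands above: mytfstateaccount is a placeholder. Storage account names must be globally unique, 3 to 24 characters long, and contain only lowercase letters and digits. Here is a rough sketch of a local format check; valid_storage_account_name is a hypothetical helper, and it cannot tell you whether the name is already taken.

```shell
# Hypothetical format check for a storage account name:
# 3-24 characters, lowercase letters and digits only.
valid_storage_account_name() {
  name="$1"
  len=${#name}
  [ "$len" -ge 3 ] && [ "$len" -le 24 ] || return 1
  case "$name" in
    # Reject any character outside the explicit allowed set
    *[!abcdefghijklmnopqrstuvwxyz0123456789]*) return 1 ;;
  esac
  return 0
}
```

For the uniqueness part, az storage account check-name --name YOUR_NAME asks Azure whether the name is available.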
Then add a backend block to your Terraform configuration:
terraform {
  backend "azurerm" {
    resource_group_name  = "tfstate-rg"
    storage_account_name = "mytfstateaccount"
    container_name       = "tfstate"
    key                  = "prod.terraform.tfstate"
  }
}
Run terraform init after adding the backend block. Terraform will detect the new backend and ask if you want to migrate your existing local state. Type yes. If you get Error: Failed to get existing workspaces with a 403, your Azure identity (CLI session or service principal) is most likely missing the Storage Blob Data Contributor role on the storage account. Assign it in Azure Portal under the storage account's IAM settings, wait 2-3 minutes for propagation, then retry.
If it worked: Terraform state is now stored as a blob in your Azure Storage container and you can see it at Storage Account > Containers > tfstate.
This is the issue that trips up developers working with newer Azure services or preview features. The AzureRM provider covers stable, generally available Azure resources. But Azure releases preview features constantly (new VM SKUs, networking capabilities, security controls), and AzureRM can lag behind by weeks or months waiting for provider updates. If you need a resource type or API version that AzureRM doesn't support yet, that's what the AzAPI provider is for.
The AzAPI provider sits directly on top of Azure Resource Manager REST APIs, which means it can access any API version immediately without waiting for HashiCorp to release a provider update. You can use both providers in the same Terraform configuration; they're designed to work together.
If you're getting Error: Invalid resource type on a resource that you know exists in Azure, check whether AzureRM supports it. If it doesn't, add the AzAPI provider:
terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "~> 3.100"
    }
    azapi = {
      source  = "azure/azapi"
      version = "~> 1.13"
    }
  }
}

provider "azapi" {}

resource "azapi_resource" "example" {
  type      = "Microsoft.ContainerService/managedClusters/trustedAccessRoleBindings@2024-02-01"
  name      = "my-binding"
  parent_id = azurerm_kubernetes_cluster.aks.id
  body = jsonencode({
    properties = {
      roles            = ["Microsoft.MachineLearningServices/workspaces/mlworkload"]
      sourceResourceId = azurerm_machine_learning_workspace.ml.id
    }
  })
}
The type field in azapi_resource takes the format ResourceProvider/ResourceType@APIVersion. You can find valid API versions in the Azure REST API reference. The AzAPI provider also gives you azapi_update_resource for patching properties on existing AzureRM-managed resources, and azapi_resource_action for one-off operations like gracefully shutting down a VM without Terraform taking over its lifecycle.
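If the ResourceProvider/ResourceType@APIVersion format feels abstract, this small sketch splits a type string into its two halves. The function names azapi_resource_type and azapi_api_version are invented for illustration and have nothing to do with the AzAPI provider itself.

```shell
# Hypothetical helpers: split an azapi type string at the "@" separator.
azapi_resource_type() {
  printf '%s\n' "${1%@*}"   # everything before the last "@"
}

azapi_api_version() {
  case "$1" in
    *@*) printf '%s\n' "${1##*@}" ;;   # everything after the last "@"
    *) echo "missing @APIVersion suffix: $1" >&2; return 1 ;;
  esac
}
```

Applied to the type from the example above, the first function prints Microsoft.ContainerService/managedClusters/trustedAccessRoleBindings and the second prints 2024-02-01.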
If it worked: terraform plan shows your AzAPI resource in the diff without any type validation errors.
Sometimes terraform validate passes, authentication is fine, your provider is correct, and then terraform apply fails mid-run with an Azure API error. These are the most context-specific errors to debug, because they depend entirely on what you're deploying and what permissions your identity has. Here's how to get real information out of a failing apply.
First, enable detailed logging. Set these environment variables before running any Terraform command:
export TF_LOG=DEBUG
export TF_LOG_PATH=./terraform-debug.log
On Windows PowerShell: $env:TF_LOG = "DEBUG" and $env:TF_LOG_PATH = ".\terraform-debug.log". The debug log will show you the exact HTTP requests Terraform sends to Azure Resource Manager and the full response, including the actual Azure error body, which is almost always more descriptive than what Terraform surfaces in the console.
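Debug logging is noisy, so you may not want it exported for your whole shell session. As a sketch, the wrapper below turns it on for a single Terraform command only. The name tf_debug and the TF_DEBUG_LOG override variable are my own inventions; TF_LOG and TF_LOG_PATH are the real Terraform settings described above.

```shell
# Hypothetical wrapper: run one terraform command with debug logging enabled.
# TF_DEBUG_LOG is an assumed, optional override for the log file location.
tf_debug() {
  log_file="${TF_DEBUG_LOG:-./terraform-debug.log}"
  (
    # The subshell keeps TF_LOG and TF_LOG_PATH out of the calling shell
    export TF_LOG=DEBUG
    export TF_LOG_PATH="$log_file"
    terraform "$@"
  )
}
```

For example, tf_debug apply runs a single apply with full logging and leaves later commands quiet.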
Common apply errors and their fixes:
Error: A resource with the ID ... already exists: Something was created outside of Terraform (manually in the portal or by another pipeline). Use terraform import to bring that resource into state management, or delete it manually before applying.
terraform import azurerm_resource_group.example /subscriptions/SUB_ID/resourceGroups/my-rg
Error: compute.VirtualMachinesClient#CreateOrUpdate: Failure sending request: StatusCode=409: The resource already exists in Azure but not in your state. Same solution: import it.
Error: waiting for ... to be deleted: Code="ResourceGroupNotEmpty": You're trying to delete a resource group that still has resources in it that are not managed by this Terraform configuration. Go to Azure Portal, identify what's still in the resource group, and either import those resources or delete them manually.
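Since importing is the fix for two of the errors above, here is a rough sketch of an import helper for resource groups that skips the import when the address is already in state. The function names rg_id and import_rg_if_missing are hypothetical; terraform state list and terraform import are the real commands doing the work.

```shell
# Hypothetical helper: build the Azure resource ID of a resource group.
rg_id() {
  printf '/subscriptions/%s/resourceGroups/%s\n' "$1" "$2"
}

# Hypothetical helper: import a resource group unless it is already in state.
# Usage: import_rg_if_missing ADDRESS SUBSCRIPTION_ID RESOURCE_GROUP
import_rg_if_missing() {
  address="$1"; sub="$2"; rg="$3"
  if terraform state list 2>/dev/null | grep -Fxq "$address"; then
    echo "$address already in state; skipping"
    return 0
  fi
  terraform import "$address" "$(rg_id "$sub" "$rg")"
}
```

Calling import_rg_if_missing azurerm_resource_group.example SUB_ID my-rg is equivalent to the import command shown earlier, but safe to re-run.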
After the apply succeeds: you should see Apply complete! Resources: X added, Y changed, Z destroyed. with no trailing error lines.
Advanced Troubleshooting
If the step-by-step section didn't fully resolve your Terraform on Azure problems, the issues are likely either permissions-related at the Azure AD level, network-level blocks between your Terraform execution environment and Azure endpoints, or CI/CD pipeline configuration problems. Let's go through each.
Service Principal Permissions and Azure Role Assignments
A service principal needs the right Azure RBAC role on the right scope. Go to Azure Portal > Subscriptions > [your subscription] > Access control (IAM) > Role assignments. Find your service principal (search by application name or client ID). If it's missing or only has Reader, that explains your 403s. For most Terraform operations, Contributor on the subscription is sufficient. For operations involving role assignments, like assigning a managed identity to a resource, your service principal also needs User Access Administrator or a custom role with Microsoft.Authorization/roleAssignments/write.
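You can run the same check from the command line instead of clicking through the portal. The sketch below wraps az role assignment list; the has_role function name is an assumption of mine, and the arguments (client ID, scope, role name) are whatever applies to your setup.

```shell
# Hypothetical helper: succeed if ASSIGNEE holds ROLE at SCOPE (or a parent scope).
# Usage: has_role CLIENT_ID /subscriptions/SUB_ID Contributor
has_role() {
  assignee="$1"; scope="$2"; role="$3"
  az role assignment list \
    --assignee "$assignee" \
    --scope "$scope" \
    --include-inherited \
    --query "[].roleDefinitionName" -o tsv | grep -Fxq "$role"
}
```

For example, has_role "$ARM_CLIENT_ID" "/subscriptions/$ARM_SUBSCRIPTION_ID" Contributor || echo "missing Contributor" gives you a quick yes or no before you dig into a 403.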
Terraform Azure DevOps Pipeline Authentication
If you're running Terraform in Azure DevOps pipelines, the ARM environment variables need to be set as pipeline secrets, not plain variables. In Azure DevOps, go to Pipelines > Library > Variable Group > add your ARM_CLIENT_ID, ARM_CLIENT_SECRET, ARM_TENANT_ID, ARM_SUBSCRIPTION_ID as secret variables. Then reference them in your pipeline YAML:
variables:
  - group: terraform-secrets

steps:
  - script: terraform init && terraform apply -auto-approve
    env:
      ARM_CLIENT_ID: $(ARM_CLIENT_ID)
      ARM_CLIENT_SECRET: $(ARM_CLIENT_SECRET)
      ARM_TENANT_ID: $(ARM_TENANT_ID)
      ARM_SUBSCRIPTION_ID: $(ARM_SUBSCRIPTION_ID)
Network Restrictions and Firewall Rules
If your storage account (used for Terraform state) has network restrictions enabled (public network access disabled, or access limited to specific VNets), Terraform will fail with authorization or connection errors when trying to read or write state. Either add your pipeline's agent IP range to the storage account's firewall allowlist (Storage Account > Networking > Firewall and virtual networks) or use a private endpoint with a self-hosted agent that's on the same VNet.
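The allowlist option can also be scripted. This is a minimal sketch around az storage account network-rule add; the allow_state_ip function name is hypothetical, and you have to supply the agent's public IP or CIDR range yourself.

```shell
# Hypothetical helper: allow one IP address or CIDR range through the
# firewall of the storage account that holds Terraform state.
# Usage: allow_state_ip RESOURCE_GROUP ACCOUNT_NAME IP_OR_CIDR
allow_state_ip() {
  resource_group="${1:-}"; account="${2:-}"; ip="${3:-}"
  if [ -z "$resource_group" ] || [ -z "$account" ] || [ -z "$ip" ]; then
    echo "usage: allow_state_ip RESOURCE_GROUP ACCOUNT_NAME IP_OR_CIDR" >&2
    return 2
  fi
  az storage account network-rule add \
    --resource-group "$resource_group" \
    --account-name "$account" \
    --ip-address "$ip"
}
```

With the names used earlier in this guide, that would be allow_state_ip tfstate-rg mytfstateaccount followed by your agent's address.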
Terraform State Locking Failures
Azure Blob Storage provides state locking via blob leases. If a previous Terraform run crashed mid-apply, the lease may still be held. You'll see Error: Error acquiring the state lock with a lock ID. To break the lock (only do this if you're certain no other process is actually running):
terraform force-unlock LOCK_ID_FROM_ERROR
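If you capture the failing command's output to a file, you can pull the lock ID out instead of copying it by hand. This sketch assumes the error output contains a line of the form "ID: <lock-id>" under a "Lock Info:" heading, which is how the lock error has looked in my experience; treat that format, and the extract_lock_id name, as assumptions.

```shell
# Hypothetical helper: read Terraform error output on stdin and print the
# first lock ID found. Assumes a line shaped like "  ID:        <lock-id>".
extract_lock_id() {
  sed -n 's/^.*ID:[[:space:]]*\([^[:space:]]\{1,\}\).*$/\1/p' | head -n 1
}
```

Something like terraform plan 2>&1 | extract_lock_id would then print the ID to pass to terraform force-unlock, once you've confirmed nothing else is running.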
Importing Existing Azure Infrastructure with Azure Export for Terraform
If you're adopting Terraform for an existing Azure environment, use the Azure Export for Terraform tool to generate Terraform configurations and state for resources already deployed. In the Azure Portal, navigate to any resource group, click Export template > select the Terraform option, and the tool will generate both HCL configuration files and the corresponding terraform import commands. This eliminates the manual work of reverse-engineering your existing infrastructure into Terraform syntax.
If you keep getting InternalServerError or ServiceUnavailable on resource types that should be generally available, that's a platform-side issue, not something you can fix in your config. Similarly, if Azure Export for Terraform generates code that fails to apply on resources you know exist and are healthy, escalate via Microsoft Support. Capture the TF_LOG=DEBUG output and the Azure Correlation ID from the error (it looks like a GUID in the error message) before you open a ticket; it cuts resolution time dramatically.
Prevention & Best Practices
Getting Terraform on Azure working is one thing. Keeping it working across team members, multiple environments, and Azure's regular API evolution requires a few deliberate habits. I've watched teams skip these and pay for it with mysterious production failures six months later.
Pin your provider versions with a lock file. Always commit .terraform.lock.hcl to version control. This file records the exact provider version and checksums used in your last successful terraform init. Without it, different team members or pipeline runs may pull different provider versions and get different behavior from the same config. Run terraform init -upgrade deliberately when you want to move to a newer provider version, then commit the updated lock file and test before merging.
Use workspaces or separate state files per environment. Never share a single Terraform state file between your dev, staging, and production environments. Either use Terraform workspaces (terraform workspace new staging) or, better, maintain entirely separate backend configurations with different storage containers. Shared state between environments is how a terraform apply in dev accidentally destroys production resources.
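One way to keep state separate without duplicating the whole backend block is to pass the state key at init time. The sketch below assumes a backend block like the one shown earlier and three environment names (dev, staging, prod) that I picked for illustration; init_env is a hypothetical helper. The -backend-config and -reconfigure flags are standard terraform init options.

```shell
# Hypothetical helper: initialise the backend with a per-environment state key.
# The environment names are assumptions; adjust them to your own setup.
init_env() {
  env_name="${1:-}"
  case "$env_name" in
    dev|staging|prod) ;;
    *) echo "usage: init_env dev|staging|prod" >&2; return 2 ;;
  esac
  # -reconfigure switches backends without attempting to migrate state
  terraform init -reconfigure \
    -backend-config="key=${env_name}.terraform.tfstate"
}
```

Running init_env staging points the working directory at staging.terraform.tfstate; a value passed with -backend-config takes precedence over the key written in the backend block.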
Validate and plan in CI before every merge. Your Azure DevOps or GitHub Actions pipeline should run terraform validate and terraform plan on every pull request and post the plan output as a PR comment. Engineers should read the plan before approving the merge. This catches unintended resource replacements (the dreaded "forces replacement" lines) before they hit production.
Regularly review which resources need AzAPI vs AzureRM. As Azure releases new GA features, resources that previously required AzAPI may get promoted to AzureRM with cleaner configuration syntax and better Terraform state handling. Review your AzAPI resources quarterly against the AzureRM changelog to see if migration makes sense.
- Install the Azure Terraform Visual Studio Code extension. It gives you real-time validation, auto-complete for resource types, and direct links to provider documentation while you write HCL
- Use terraform plan -out=tfplan and terraform apply tfplan in pipelines. This ensures the apply executes the exact plan that was reviewed, not a new one generated seconds later against a potentially changed state
- Enable Azure Storage soft delete on your Terraform state container (Storage Account > Data protection > Enable soft delete for blobs). With a 30-day retention period, it gives you a 30-day recovery window if state gets accidentally deleted or corrupted
- Use Terraform's built-in lifecycle rules (prevent_destroy = true) on your most critical resources like key vaults and databases to stop accidental destruction during refactors
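The saved-plan habit from the list above can be sketched as a small pipeline step. The plan_and_apply function name is hypothetical; -detailed-exitcode is a real terraform plan flag that exits with 0 when there are no changes, 1 on error, and 2 when changes are present. In a real pipeline you would normally put a review or approval gate between the plan and the apply.

```shell
# Hypothetical pipeline step: save a plan, then apply exactly that plan.
# Usage: plan_and_apply [PLAN_FILE]   (PLAN_FILE defaults to "tfplan")
plan_and_apply() {
  plan_file="${1:-tfplan}"
  plan_rc=0
  terraform plan -input=false -detailed-exitcode -out="$plan_file" || plan_rc=$?
  case "$plan_rc" in
    0) echo "No changes; nothing to apply" ;;
    2) terraform apply -input=false "$plan_file" ;;   # changes present
    *) echo "terraform plan failed (exit $plan_rc)" >&2; return "$plan_rc" ;;
  esac
}
```

Because the apply consumes the saved file, it cannot silently pick up changes that appeared after the plan was generated.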
Frequently Asked Questions
Why does terraform init keep failing with "Failed to query available provider packages"?
This is almost always a network connectivity issue between your machine or pipeline and the Terraform registry at registry.terraform.io. Check if your environment has a corporate proxy or firewall blocking outbound HTTPS to that domain. You can test with curl -v https://registry.terraform.io/v1/providers/hashicorp/azurerm/versions. If that fails, you need to configure proxy settings for Terraform via the HTTPS_PROXY environment variable, or use a private mirror. If you're in Azure Cloud Shell, this error typically means a temporary service blip; just retry after a minute.
What's the difference between AzureRM and AzAPI, and which one should I actually use?
Start with AzureRM for anything that's generally available and well-established: virtual machines, storage accounts, AKS clusters, Key Vault, networking. It has better Terraform state handling, cleaner resource schemas, and stronger community documentation. Switch to AzAPI when you need a preview feature or a new API version that AzureRM hasn't caught up to yet, or when you need to manage a resource type that AzureRM simply doesn't cover. The two providers work together in the same configuration, so you're not locked into one or the other. Use the right tool for each resource.
My terraform apply worked but now terraform plan shows changes I didn't make, why?
This is called "config drift" and it happens when something outside of Terraform (the Azure Portal, an Azure Policy enforcement, an automated script, or another team member) modifies a resource that Terraform manages. Run terraform plan -refresh-only to see how the real-world state of your resources differs from your state file, and terraform apply -refresh-only to record it (the older terraform refresh command does the same without a review step and is deprecated). Then run terraform plan again to see what changes Terraform would make to bring things back in line with your configuration. If the drift is intentional and you want to accept it, update your Terraform config to match.
How do I move an existing Azure resource into Terraform state without destroying and recreating it?
Use terraform import. You first write the Terraform resource block in your config to describe the resource, then run terraform import azurerm_resource_type.local_name /full/azure/resource/id. The resource ID is in the format visible in the Azure Portal under the resource's Properties blade; it starts with /subscriptions/. After importing, run terraform plan to see if there's any difference between your HCL and the real resource. Adjust your config until the plan shows no changes, and then the resource is fully under Terraform management. Azure Export for Terraform can also generate both the HCL and the import commands automatically for an entire resource group.
Can I use Terraform with Azure Cloud Shell without installing anything locally?
Yes, Azure Cloud Shell comes with Terraform pre-installed. Open Cloud Shell from the Azure Portal (the >_ icon in the top navigation bar), choose either Bash or PowerShell, and Terraform is already available. Run terraform version to confirm what version is installed. Cloud Shell also automatically authenticates using your Azure portal session, so you don't need to configure service principals or ARM environment variables for personal development use. The main limitation is that Cloud Shell sessions time out after 20 minutes of inactivity, so for long-running terraform apply operations against large infrastructure sets, a local environment or a pipeline agent is more reliable.
Terraform says my state is locked and gives me a lock ID. Is it safe to force-unlock it?
It depends. A lock means an active or recently crashed Terraform process holds an Azure Blob Storage lease on your state file. First, confirm no other engineer or pipeline is currently running terraform apply against that same state: check your CI/CD pipeline runs and ask your team. If you're certain nothing is actively running, it's a stale lock from a crashed process and it is safe to run terraform force-unlock LOCK_ID using the exact lock ID from the error message. Never force-unlock if there's any chance another apply is mid-execution. Doing so can cause two applies to run simultaneously and corrupt your infrastructure state.