Fix Azure Machine Learning API v2 Setup & Config Errors
Why This Is Happening
You've just started a new ML project, your team lead said "just use the Azure Machine Learning API v2," and now you're staring at a terminal full of red text. I've seen this exact situation on dozens of enterprise onboarding calls and late-night Slack threads. The frustrating part? The errors are rarely about the model itself. They're almost always about the plumbing: the Azure Machine Learning CLI v2 extension isn't installed correctly, the workspace isn't configured, or your YAML job definition has a subtle syntax problem that the error message completely fails to explain.
Here's the core issue. Microsoft's Azure Machine Learning platform went through a significant architectural shift with the v2 release. The Azure Machine Learning CLI v2 and Python SDK v2 were redesigned together to introduce a consistent feature set and unified terminology across both interfaces. That's great for long-term maintainability, but it means if you've got any muscle memory from the v1 API, the v2 syntax will trip you up in unexpected ways. The commands changed. In some cases they changed significantly.
The Azure Machine Learning API v2 operates on a YAML-first philosophy for the CLI path. When you run a command like az ml job create, Azure isn't just executing a script; it's reading a YAML file that describes what the job is, where it runs, what compute it targets, and what environment it needs. If any piece of that YAML is wrong, misconfigured, or references a resource that doesn't exist in your workspace, the whole thing fails. And the error messages? They'll often tell you something vague like "resource not found" without telling you which resource, in which workspace, under which subscription.
Who sees these errors most? Data scientists who are new to Azure, ML engineers migrating from v1 tooling, platform engineers setting up CI/CD MLOps pipelines for the first time, and application developers trying to integrate a deployed model endpoint into a service. I know this is frustrating, especially when it blocks your work and your deadline isn't moving. The good news is that every single one of the most common Azure Machine Learning v2 errors has a clear, documented fix. Let's get through them.
The Quick Fix: Try This First
Before you dig into YAML files or workspace configurations, do this first. The vast majority of Azure Machine Learning API v2 setup failures trace back to one of three things: the wrong CLI extension version, a stale login token, or no default workspace set. This sequence fixes all three in under two minutes.
Open your terminal (PowerShell, Bash, and Azure Cloud Shell all work) and run these four steps in order:
# Step 1: make sure Azure CLI itself is current
az upgrade
# Step 2: remove the old ml extension if present, then install v2
az extension remove -n azure-cli-ml
az extension add -n ml
# Step 3: fresh login (this clears stale tokens)
az login
# Step 4: set your default workspace so every az ml command knows where to look
az configure --defaults group=<your-resource-group> workspace=<your-workspace-name>
After running those four steps, test with a simple read-only command that should succeed if the connection is healthy:
az ml compute list
If you get a JSON list back (even an empty one), your CLI v2 connection to Azure Machine Learning is working. If you still get an error, note the exact error code and move into the step-by-step section below.
For Python SDK v2 users, the equivalent quick check is:
pip install --upgrade azure-ai-ml azure-identity
python -c "from azure.ai.ml import MLClient; print('SDK v2 import OK')"
A clean SDK v2 import OK printout means your Python environment is set up correctly and you can move on to workspace authentication.
Warning: don't keep the old azure-cli-ml extension and the new ml extension installed at the same time. Both register the az ml command group, so having both present causes silent conflicts. Always remove azure-cli-ml explicitly before adding the new ml extension; don't assume that installing the new one replaces the old one automatically.
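To make the extension conflict easy to spot in a setup script, here is a minimal sketch that classifies the JSON printed by az extension list --output json. The function name and the four status labels are hypothetical, and the version numbers in the sample are illustrative only; the sketch relies only on the "name" field of each entry.

```python
import json

OLD_EXTENSION = "azure-cli-ml"  # v1 extension name, as described in the text
NEW_EXTENSION = "ml"            # v2 extension name, as described in the text


def diagnose_extensions(extension_list_json: str) -> str:
    """Classify installed ML extensions from `az extension list --output json`."""
    names = {ext.get("name") for ext in json.loads(extension_list_json)}
    has_old = OLD_EXTENSION in names
    has_new = NEW_EXTENSION in names
    if has_old and has_new:
        return "conflict"   # both installed: remove azure-cli-ml
    if has_old:
        return "v1-only"    # remove azure-cli-ml, then add ml
    if has_new:
        return "ok"
    return "missing"        # run: az extension add -n ml


if __name__ == "__main__":
    # Illustrative sample; real output contains more fields per entry.
    sample = '[{"name": "azure-cli-ml", "version": "1.41.0"}, {"name": "ml", "version": "2.30.0"}]'
    print(diagnose_extensions(sample))
```

In practice you would pipe the real CLI output into the function, for example by capturing it with subprocess.run in a CI startup script.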
The Azure Machine Learning CLI v2 is not part of the base Azure CLI install. It's a separate extension, and the extension name matters. The old v1 extension was called azure-cli-ml. The current v2 extension is simply called ml. Running az ml --help on a machine that still has the v1 extension will show you v1 commands, and those commands will fail against a v2 workspace configuration.
Here's the clean install sequence for the v2 extension:
# Check what you currently have
az extension list --output table
# Remove v1 extension if it shows up
az extension remove --name azure-cli-ml
# Install the current v2 extension
az extension add --name ml --upgrade
# Confirm the version installed
az extension show --name ml --query version --output tsv
As of mid-2026, you want to see a version number in the 2.x.x range. If you see 1.x.x, the old extension is still active somewhere in your extension path.
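If you want to automate that version check, a small sketch like the following could run in a CI startup script. It assumes you have captured the output of az extension show --name ml --output json, which includes a "version" field; the function names are hypothetical and the sample version strings are illustrative.

```python
import json


def extension_major_version(show_json: str) -> int:
    """Extract the major version from `az extension show --name ml --output json`."""
    version = json.loads(show_json)["version"]
    return int(version.split(".")[0])


def is_v2_extension(show_json: str) -> bool:
    """True when the installed ml extension reports a 2.x.x version."""
    return extension_major_version(show_json) == 2


if __name__ == "__main__":
    # Illustrative payload; real output has additional fields.
    print(is_v2_extension('{"name": "ml", "version": "2.30.0"}'))
```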
On corporate machines, IT policy sometimes blocks extension installs from the default extension index. If az extension add fails with a permissions error or a network timeout, try installing offline using a wheel file downloaded separately and pointing the CLI at it with --source <path-to-whl>. Alternatively, Azure Cloud Shell comes with the ml extension pre-installed and is a solid fallback for teams where local installs are locked down.
When the install succeeds, run az ml job list --help. You should see a clean help page with options like --workspace-name and --resource-group. That confirms the extension is wired up correctly.
Authentication is the second most common failure point, and it shows up differently depending on whether you're using the CLI or the Python SDK v2. The error message is rarely helpful: you'll often see something like AuthenticationError, WorkspaceNotFound, or a generic 403 that doesn't tell you whether it's your credentials, your subscription, or your workspace name that's wrong.
For CLI v2, the cleanest path is interactive login followed by explicit scope setting:
# Interactive login
az login
# List your subscriptions to confirm which one you're on
az account list --output table
# Set the right subscription if you have multiple
az account set --subscription "<subscription-id-or-name>"
# Set workspace defaults so you don't have to pass them every command
az configure --defaults group=myResourceGroup workspace=myMLWorkspace
For Python SDK v2, workspace authentication uses DefaultAzureCredential from the azure-identity package, which tries a chain of credential sources in order. In a local dev environment, it typically picks up your az login session through the Azure CLI credential (interactive browser login is excluded from the chain by default). In a CI/CD pipeline, it should pick up a service principal via environment variables. Here's the standard connection pattern:
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<your-subscription-id>",
    resource_group_name="<your-resource-group>",
    workspace_name="<your-workspace-name>",
)
# Test the connection
print(ml_client.workspaces.get(ml_client.workspace_name))
If DefaultAzureCredential fails in a pipeline context, check that the environment variables AZURE_CLIENT_ID, AZURE_TENANT_ID, and AZURE_CLIENT_SECRET are all set and that the service principal has at least Contributor access on the ML workspace resource. Role assignment errors show up as 403s and are managed through Azure IAM, not the ML workspace itself.
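A quick pre-flight check for those three variables can save a failed pipeline run. The sketch below uses only the standard library; the function name is hypothetical, and it only verifies that the variables are present and non-empty, not that the credentials are valid.

```python
import os
from typing import List, Mapping

# The three variables named in the text for service principal authentication.
REQUIRED_VARS = ("AZURE_CLIENT_ID", "AZURE_TENANT_ID", "AZURE_CLIENT_SECRET")


def missing_service_principal_vars(env: Mapping[str, str] = os.environ) -> List[str]:
    """Return the required variables that are unset or blank in `env`."""
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_service_principal_vars()
    if missing:
        print("Missing service principal variables: " + ", ".join(missing))
    else:
        print("All service principal variables are set")
```

Running this as the first step of a pipeline job turns a vague authentication failure into a specific list of what is missing.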
This is where most data scientists lose an hour. The Azure Machine Learning CLI v2 uses YAML files to define assets and workflows: jobs, environments, components, pipelines, models, and endpoints. The YAML file is not just configuration metadata; it's the primary way you tell Azure Machine Learning what your job is, what compute it should run on, what environment it needs, and where your scripts live.
A minimal working job YAML looks like this:
$schema: https://azuremlschemas.azureedge.net/latest/commandJob.schema.json
type: command
display_name: my-first-job
experiment_name: my-experiment
code: ./src
command: python train.py --data ${{inputs.training_data}}
inputs:
  training_data:
    type: uri_folder
    path: azureml:my-dataset@latest
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
compute: azureml:my-compute-cluster
The most common YAML errors I see are:
- Wrong $schema URL: if the schema URL is outdated or mistyped, validation fails silently or with a confusing error
- Missing azureml: prefix on compute or environment references: without it, Azure treats the value as a literal string, not a workspace asset reference
- Indentation errors: YAML is whitespace-sensitive; mixing tabs and spaces is the fastest way to get a parse error that points to the wrong line
- Referencing a compute or dataset that doesn't exist yet: references are validated against your actual workspace
Validate before submitting with:
az ml job validate --file my_job_definition.yaml
The validate command checks the job definition without actually submitting it. If validation passes, the actual job submission is far more likely to succeed, although workspace-side problems such as quota can still surface later.
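Several of the common YAML errors listed above can also be caught locally before anything is sent to Azure. The following is a rough, standard-library-only sketch of such a lint; it is a heuristic based on the list above rather than a real schema validator, the function name is hypothetical, and it only inspects top-level compute and environment keys written on a single line.

```python
import re
from typing import List

# Top-level keys that the article says should reference workspace assets.
ASSET_KEYS = ("compute", "environment")


def lint_job_yaml(text: str) -> List[str]:
    """Return human-readable problems found in a job YAML string."""
    problems = []
    lines = text.splitlines()
    if not any(line.startswith("$schema:") for line in lines):
        problems.append("missing $schema line")
    for number, line in enumerate(lines, start=1):
        indent = line[: len(line) - len(line.lstrip())]
        if "\t" in indent:
            problems.append(f"line {number}: tab used for indentation")
        # Only matches unindented `key: value` pairs on one line.
        match = re.match(r"^(\w+):\s*(\S.*)$", line)
        if match and match.group(1) in ASSET_KEYS:
            value = match.group(2).strip("'\"")
            if not value.startswith("azureml:"):
                problems.append(f"line {number}: {match.group(1)} lacks the azureml: prefix")
    return problems


if __name__ == "__main__":
    sample = "type: command\ninputs:\n\ttraining_data:\ncompute: my-compute-cluster\n"
    for problem in lint_job_yaml(sample):
        print(problem)
```

A lint like this does not replace server-side validation, because it cannot know whether a referenced asset exists in your workspace; it only removes the cheapest class of mistakes.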
Compute errors are one of the most disruptive Azure Machine Learning API v2 problems because they often don't surface until after job submission: your job sits in a "queued" state for minutes before finally failing with a vague compute-related message.
The two most common compute errors are ComputeNotFound and QuotaExceeded. Here's how to address each.
For ComputeNotFound, confirm the compute target actually exists in your workspace:
az ml compute list --output table
If your compute cluster isn't in that list, you need to create it. The CLI v2 way to provision a basic CPU cluster:
az ml compute create --name my-compute-cluster \
--type AmlCompute \
--min-instances 0 \
--max-instances 4 \
--size Standard_DS3_v2
Set --min-instances 0 so the cluster scales to zero when idle; this prevents unnecessary charges when you're not running jobs.
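The ComputeNotFound check can be scripted so a pipeline fails fast instead of queueing. This sketch assumes you have captured the output of az ml compute list --output json, where each entry has a "name" field; the function names are hypothetical. It also strips the azureml: prefix, since the YAML refers to azureml:my-compute-cluster while the listing shows the bare name.

```python
import json


def normalize_compute_name(name: str) -> str:
    """Drop the azureml: prefix used in job YAML, if present."""
    prefix = "azureml:"
    return name[len(prefix):] if name.startswith(prefix) else name


def compute_exists(compute_list_json: str, name: str) -> bool:
    """True when `name` appears in `az ml compute list --output json` output."""
    wanted = normalize_compute_name(name)
    return any(item.get("name") == wanted for item in json.loads(compute_list_json))


if __name__ == "__main__":
    # Illustrative listing; real output has many more fields per entry.
    listing = '[{"name": "my-compute-cluster", "type": "amlcompute"}]'
    print(compute_exists(listing, "azureml:my-compute-cluster"))
```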
For QuotaExceeded, the fix is either to request a quota increase through the Azure portal at Subscriptions → Usage + Quotas, or to switch your YAML to target a VM size that you have quota for. Run az ml compute show --name <cluster-name> to see what size is provisioned. You can also target serverless compute, which uses Azure-managed infrastructure and avoids the need to manage cluster provisioning yourself. That's an option worth knowing about when you're in development and don't want to babysit compute resources.
When the right compute is available and quota is sufficient, jobs should move from "queued" to "running" within a few minutes for an already-warm cluster, or within about 5-10 minutes if the cluster needs to scale up from zero.
Getting your model trained is half the battle. The other half is deploying it so something can actually call it. Azure Machine Learning v2 uses managed online endpoints for real-time inference and batch endpoints for batch inference. Both are configured and deployed through the CLI v2 or SDK v2, and both have their own failure modes.
A common real-time endpoint deployment YAML:
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineEndpoint.schema.json
name: my-endpoint
auth_mode: key
And the deployment YAML that attaches a model to it:
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:my-model@latest
environment: azureml:AzureML-sklearn-1.0-ubuntu20.04-py38-cpu@latest
code_configuration:
  code: ./score
  scoring_script: score.py
instance_type: Standard_DS3_v2
instance_count: 1
Deploy with:
az ml online-endpoint create --file endpoint.yaml
az ml online-deployment create --file deployment.yaml --all-traffic
The most frequent deployment failures are a scoring_script that throws an unhandled exception at startup, or an environment that's missing a package the scoring script imports. To debug, pull the deployment logs immediately after a failed deployment:
az ml online-deployment get-logs \
--name blue \
--endpoint-name my-endpoint \
--lines 100
Look for Python tracebacks in those logs; they'll tell you exactly what failed in your scoring script. A missing init() function or a run() function with the wrong signature are the two most common culprits. Once you see the traceback, fix the script, update the environment if a package is missing, and redeploy.
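For reference, a scoring script with the expected init() and run() shape might look like the sketch below. The placeholder model, the "data" request key, and the model.pkl file name are all assumptions for illustration; a real script would load your registered model from the directory that Azure Machine Learning exposes through the AZUREML_MODEL_DIR environment variable.

```python
import json
import os

model = None  # populated once by init()


def init():
    """Called once when the deployment container starts."""
    global model
    # AZUREML_MODEL_DIR points at the registered model files inside the container.
    model_dir = os.getenv("AZUREML_MODEL_DIR", ".")
    model_path = os.path.join(model_dir, "model.pkl")  # hypothetical file name
    # Placeholder instead of real loading code such as joblib.load(model_path).
    model = lambda rows: [sum(row) for row in rows]


def run(raw_data):
    """Called per request; raw_data is the request body as a JSON string."""
    try:
        rows = json.loads(raw_data)["data"]  # "data" key is an assumed payload shape
        return {"predictions": model(rows)}
    except Exception as exc:
        # Returning the message keeps bad requests visible to the caller.
        return {"error": str(exc)}


if __name__ == "__main__":
    init()
    print(run('{"data": [[1, 2], [3, 4]]}'))
```

Calling init() and run() locally like this, before deploying, catches import errors and signature mistakes without waiting for a container to start.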
Advanced Troubleshooting
If the steps above didn't resolve your Azure Machine Learning API v2 issue, you're likely dealing with something at the enterprise configuration layer: role-based access control, network policy, private endpoints, or a workspace that's been set up with non-default security settings. These issues are less common but significantly harder to diagnose without knowing where to look.
Role-Based Access Control (RBAC) Errors
Azure Machine Learning workspaces use Azure RBAC for access control. To submit jobs and manage assets you need a role such as Contributor on the workspace resource (the narrower built-in AzureML Data Scientist role also covers job submission). If your service principal or user account only has Reader, commands like az ml job create will fail with a 403. Check assignments in the Azure portal at your workspace resource → Access control (IAM) → Role assignments, or via CLI:
az role assignment list --scope /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.MachineLearningServices/workspaces/<ws> --output table
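To turn that listing into a yes/no answer, you could parse its JSON form. The sketch below assumes the output of az role assignment list ... --output json, which carries principalName and roleDefinitionName fields per entry; the function names and the set of roles treated as sufficient are assumptions you should adjust to your own policy.

```python
import json
from typing import Set

# Assumption: roles treated as sufficient for submitting jobs.
SUFFICIENT_ROLES = {"Owner", "Contributor", "AzureML Data Scientist"}


def workspace_scope(subscription_id: str, resource_group: str, workspace: str) -> str:
    """Build the --scope value for a workspace resource."""
    return (
        f"/subscriptions/{subscription_id}/resourceGroups/{resource_group}"
        f"/providers/Microsoft.MachineLearningServices/workspaces/{workspace}"
    )


def roles_for_principal(assignments_json: str, principal_name: str) -> Set[str]:
    """Collect role names assigned to one principal in the listing."""
    return {
        entry["roleDefinitionName"]
        for entry in json.loads(assignments_json)
        if entry.get("principalName") == principal_name
    }


def can_submit_jobs(assignments_json: str, principal_name: str) -> bool:
    """True when the principal holds at least one sufficient role."""
    return bool(roles_for_principal(assignments_json, principal_name) & SUFFICIENT_ROLES)


if __name__ == "__main__":
    print(workspace_scope("<sub-id>", "<rg>", "<ws>"))
```

Note that a plain listing at workspace scope may not show roles inherited from the resource group or subscription, so an empty result is not proof that access is missing.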
Private Endpoint and Network Policy Issues
Enterprise workspaces are often deployed with private endpoints, which means the workspace is not accessible from the public internet. If you're running CLI v2 commands from a machine that's not on the same virtual network (or not connected via VPN), every command will time out or return a DNS resolution error. The fix is to run your commands from within the network boundary: a jump box, a dev VM on the same VNet, or Azure Cloud Shell configured with private network access.
Diagnosing with Azure Activity Logs
Azure Machine Learning doesn't have a Windows Event Viewer equivalent, but Azure Monitor and Activity Logs serve the same purpose. Navigate to your resource group in the Azure portal, click Activity Log, and filter by the timeframe when the failure occurred. Failed operations show up with a red status. Click any failed entry to see the full JSON error payload, which usually contains far more detail than what the CLI surfaces.
For job-level diagnostics, the Azure Machine Learning studio UI at Jobs → [your job] → Outputs + logs → user_logs is where your script's stdout and stderr land. This is the first place to check for runtime errors that happen after a job successfully starts but before it completes.
SDK v2 Diagnostic Logging
When the Python SDK v2 is behaving unexpectedly, enable HTTP-level logging to see exactly what API calls are being made and what Azure is returning:
import logging
logging.basicConfig(level=logging.DEBUG)
# Now run your MLClient calls; you'll see full HTTP request/response traces
This produces verbose output but it's invaluable when you need to see whether the SDK is sending a malformed request or whether Azure is returning a valid error that the SDK is swallowing.
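If the global DEBUG setting is too noisy, one option is to raise the level only for the "azure" logger namespace, which the Azure SDK for Python libraries log under. This is a sketch with a hypothetical helper name; other libraries in your process keep their normal log levels.

```python
import logging
import sys


def enable_azure_debug_logging(stream=sys.stderr) -> logging.Logger:
    """Send DEBUG output from Azure SDK loggers to `stream`, leaving others alone."""
    logger = logging.getLogger("azure")  # parent of azure.core, azure.identity, etc.
    logger.setLevel(logging.DEBUG)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(
        logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s")
    )
    logger.addHandler(handler)
    return logger


if __name__ == "__main__":
    enable_azure_debug_logging()
    logging.getLogger("azure.example").debug("Azure SDK debug logging is on")
```

Call the helper once, before constructing MLClient, so the first requests are captured too.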
If you see a UserError with a message about internal service state, or any error containing the words "workspace in a bad state", it's time to escalate. Workspace provisioning failures and corrupted resource states are not self-recoverable. Open a support ticket through the Azure portal (Help + Support → Create a support request) with the Activity Log JSON attached. You can also reach Microsoft Support directly for Azure-tier issues.
Prevention & Best Practices
The best Azure Machine Learning API v2 troubleshooting session is the one you never have to do. Most of the pain I described above is avoidable with a few habits baked into your team's workflow from the start.
First, pin your extension and SDK versions in your project dependencies. Uncontrolled upgrades are a major source of "it worked yesterday" breakage. In your requirements.txt or pyproject.toml, specify azure-ai-ml==<version> rather than relying on latest. For the CLI extension, document the expected version in your team wiki and include an extension version check in your CI pipeline startup script.
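A pinning rule is easier to keep when CI enforces it. The sketch below scans the text of a requirements.txt for exact == pins; the function name and the default package list are assumptions, the version numbers in the sample are illustrative, and the parser is deliberately simple (it ignores extras and environment markers).

```python
import re
from typing import List, Sequence


def unpinned_packages(
    requirements_text: str,
    packages: Sequence[str] = ("azure-ai-ml", "azure-identity"),
) -> List[str]:
    """Return the packages from `packages` that lack an exact == pin."""
    pinned = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        match = re.match(r"^([A-Za-z0-9_.\-]+)\s*==\s*\S+", line)
        if match:
            pinned.add(match.group(1).lower())
    return [name for name in packages if name.lower() not in pinned]


if __name__ == "__main__":
    # Illustrative file contents; the version numbers are placeholders.
    sample = "azure-ai-ml==1.2.3\nazure-identity>=1.0\n"
    print(unpinned_packages(sample))
```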
Second, keep your YAML files under version control alongside your training scripts. The YAML file defines "what it is and where it runs". That's not boilerplate; it's a critical part of your ML pipeline definition. When something breaks in production, you want to be able to diff your YAML files just like you'd diff your Python code.
Third, validate in CI before you submit. Run az ml job validate on every job YAML, and where Docker is available, test deployments locally with az ml online-deployment create --local before creating them in Azure. This catches YAML validation errors and scoring-script failures before they cause actual job or deployment failures, which saves both time and compute costs.
Fourth, set up a shared development workspace that mirrors production. Authentication problems, compute quota issues, and network policy mismatches all show up at development time if your dev and prod environments are configured identically. The Azure Machine Learning platform makes this straightforward through ARM template exports of workspace configurations.
- Always run az extension update --name ml at the start of a new project to avoid version drift
- Use az configure --defaults to set workspace and resource group once per session; this stops "workspace not found" errors from missing --workspace-name flags
- Set compute clusters to min-instances: 0 to avoid paying for idle hardware between experiments
- Store your workspace connection string as an environment variable in CI/CD and use DefaultAzureCredential in Python; never hardcode subscription IDs or secrets in YAML or code
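One way to follow the last habit is to read workspace coordinates from the environment and fail loudly when any are missing. In this sketch the three environment variable names are hypothetical (choose your own convention); the returned keys match the subscription_id, resource_group_name, and workspace_name arguments used in the MLClient pattern earlier.

```python
import os
from typing import Dict, Mapping

# Hypothetical variable names; map each MLClient argument to an env var.
SETTINGS = {
    "subscription_id": "AZURE_SUBSCRIPTION_ID",
    "resource_group_name": "AZURE_RESOURCE_GROUP",
    "workspace_name": "AZUREML_WORKSPACE_NAME",
}


def load_workspace_settings(env: Mapping[str, str] = os.environ) -> Dict[str, str]:
    """Read workspace coordinates from the environment, raising if any are missing."""
    missing = [var for var in SETTINGS.values() if not env.get(var)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {key: env[var] for key, var in SETTINGS.items()}


if __name__ == "__main__":
    try:
        print(load_workspace_settings())
    except RuntimeError as exc:
        print(exc)
```

With the SDK installed, the result can be unpacked straight into the client, as in MLClient(credential=DefaultAzureCredential(), **load_workspace_settings()).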
Frequently Asked Questions
What exactly is Azure Machine Learning and who is it designed for?
Azure Machine Learning is a cloud service from Microsoft designed to accelerate the full ML project lifecycle, from training and experiment tracking through to model deployment and ongoing MLOps management. It's built for data scientists who want to run experiments, ML engineers who need to deploy and monitor models in production, and platform developers who are building internal ML tooling on top of Azure's infrastructure. Application developers also use it to pull deployed model endpoints into their applications or services. Enterprises on Azure get the added benefit of familiar RBAC security controls and audit trail compliance built directly into the platform.
What's the real difference between the Azure ML CLI v2 and the Python SDK v2?
Functionally, there is no difference: both the CLI v2 and SDK v2 expose the same features and operate on the same underlying Azure Machine Learning workspace resources. The difference is in how you interact with them. The CLI v2 uses commands in the format az ml <noun> <verb> paired with YAML files, making it a natural fit for CI/CD pipelines and automation scripts where you want to invoke workflows from any platform without requiring Python. The SDK v2 is more convenient for interactive development, notebook-based workflows, and when your control logic itself is Python code. Pick the one that fits your existing workflow, or use both, since they're fully compatible.
Why does "az ml" show command not found even after I installed Azure CLI?
Because the ml extension isn't included in the base Azure CLI installation; it's a separate add-on you have to install explicitly. Run az extension add --name ml to install it. If you previously used the v1 extension (azure-cli-ml), you also need to remove that first with az extension remove --name azure-cli-ml before adding the new one, otherwise you'll have a namespace conflict. After installing, confirm with az extension show --name ml --query version and you should see a version in the 2.x.x range.
My "az ml job create" command keeps failing with a YAML validation error, what am I missing?
The most common causes are: a wrong or outdated $schema URL at the top of your YAML, a missing azureml: prefix on workspace asset references like compute or environment names, indentation errors (YAML treats tabs and spaces differently and mixing them breaks parsing), or a reference to a compute cluster or dataset that doesn't actually exist in your workspace. Use az ml job validate --file my_job.yaml to validate without submitting; this will surface most validation errors with a clearer message than the live submission path. Also double-check that your workspace defaults are set with az configure --defaults so the CLI knows which workspace to validate against.
How do I connect Foundry Tools and the model catalog to my Azure Machine Learning workspace?
Azure AI Foundry (formerly Azure AI Studio) and Azure Machine Learning are part of the same Microsoft Azure AI platform and share underlying infrastructure. The model catalog in Foundry gives you access to open-source models from HuggingFace, Meta, Mistral, and others, which you can deploy directly into an Azure Machine Learning managed endpoint or fine-tune using the AzureML training pipeline. Within the Azure Machine Learning studio, you'll find a Model Catalog section that mirrors what's available in Foundry. For agent-based workflows, Foundry Agent Service and Azure Machine Learning work together, you can host the model in an AzureML endpoint and orchestrate calls to it through the Foundry agent orchestration layer. The Python SDK v2 for Azure Machine Learning and the Foundry SDK can both be used in the same project.
Can I use Azure Machine Learning v2 without writing Python code?
Yes, this is one of the deliberate design goals of the v2 release. With the CLI v2 path, your custom logic stays in script files (Python, R, Java, Julia, or C# are all supported), but the ML infrastructure configuration (what the job is, where it runs, what environment it needs) is all defined in YAML. You only need to learn YAML syntax and the az ml command format to manage the ML platform side of things. Beyond the CLI, Azure Machine Learning studio also offers a visual designer where you can build and deploy ML pipelines through drag-and-drop without writing any code at all, and the AutoML UI lets you run automated machine learning experiments through a guided interface.