How to Fix Azure Open Datasets AutoML Errors
Why This Is Happening
I've seen this exact situation on dozens of Azure ML setups: you pull an Azure Open Dataset (the NYC Taxi dataset is the classic culprit), kick off an AutoML training run, and then nothing works the way it should. Maybe your iterations aren't completing. Maybe your get_output() call throws an exception. Maybe the Jupyter widget just shows a spinning loader forever. It's maddening, especially when your model accuracy targets are on the line.
Here's the core issue. Azure Open Datasets are pre-registered, cleansed public datasets hosted in Azure Blob Storage and surfaced through the azureml-opendatasets SDK package. They're genuinely convenient: no manual download, no wrangling raw CSVs. But that convenience layer introduces a whole class of setup and configuration errors that Microsoft's generic error messages do almost nothing to explain. "Resource not found." "Authentication failed." "Experiment run failed." Thanks for nothing, Azure.
The most common scenario I see is this: a data scientist follows the official AutoML tutorial for regression (usually the taxi fare prediction walkthrough), and the training pipeline stalls or produces nonsense results because the data loading step silently failed upstream. Your AutoML iterations spin up, they complete, the accuracy metrics look fine, but the model was trained on empty or malformed data. You won't know until you actually test it on real inputs.
There are three root causes behind the overwhelming majority of Azure Open Datasets AutoML problems:
1. Package version mismatch. The azureml-opendatasets package, the azureml-sdk core, and your azureml-train-automl package must all be on compatible versions. One outdated package and the entire pipeline breaks silently.
2. Workspace and compute target misconfiguration. AutoML local runs (local_run) behave differently from remote compute cluster runs. Many errors only surface when you switch between the two without updating your run configuration.
3. Feature engineering and scaler mismatches. AutoML tries different scalers (StandardScalerWrapper, MinMaxScaler, RobustScaler) in combination with different algorithms across iterations. If your input data has unexpected NaN values, outliers, or wrong dtypes, specific iterations will fail while others pass, giving you misleading best-metric output.
The good news: every one of these is fixable without contacting Microsoft Support, and I'm going to walk you through exactly how.
The Quick Fix: Try This First
Before you dig into anything complicated, do this one thing: verify your environment package versions and re-run with a clean experiment name. This single step resolves about 60% of Azure Open Datasets AutoML failures I've seen in the wild.
Open your terminal or a notebook cell and run:
pip show azureml-sdk azureml-opendatasets azureml-train-automl azureml-widgets
All four packages need to be on the same release family. As of April 2026, the stable set is azureml-sdk==1.57.0, azureml-opendatasets==1.57.0, and azureml-train-automl==1.57.0, with azureml-widgets on the matching release. If any of them are on a different minor version, do a clean upgrade:
pip install --upgrade azureml-sdk azureml-opendatasets azureml-train-automl azureml-widgets
Then, and this is critical, restart your Jupyter kernel completely. Don't just re-run cells from the top; actually go to Kernel > Restart & Clear Output. Stale in-memory objects from a mismatched SDK version are one of the sneakiest sources of "works sometimes, fails sometimes" behavior.
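To confirm the upgrade actually took once the kernel is back, you can check the installed versions from a notebook cell instead of eyeballing pip show output. This is a minimal sketch using only the standard library. The helper names (installed_versions, release_family, find_mismatches) are my own, not part of any Azure SDK, and I'm assuming "same release family" means a matching major.minor version.

```python
from importlib import metadata

PACKAGES = [
    "azureml-sdk",
    "azureml-opendatasets",
    "azureml-train-automl",
    "azureml-widgets",
]


def installed_versions(packages=PACKAGES):
    """Return {package: version or None} for the given distribution names."""
    found = {}
    for name in packages:
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            found[name] = None  # not installed in this environment
    return found


def release_family(version):
    """Reduce a version such as '1.57.0' to its major.minor family, '1.57'."""
    return ".".join(version.split(".")[:2])


def find_mismatches(versions):
    """Return packages that are missing or off the most common major.minor family."""
    families = [release_family(v) for v in versions.values() if v]
    if not families:
        return sorted(versions)  # nothing installed at all
    expected = max(set(families), key=families.count)
    return sorted(
        name
        for name, version in versions.items()
        if version is None or release_family(version) != expected
    )
```

Calling find_mismatches(installed_versions()) should give you an empty list when everything lines up; any names it returns are the packages to upgrade or install.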
Next, change your experiment name before re-running. This forces Azure ML to create a fresh experiment context rather than resuming a potentially corrupted one:
from azureml.core.experiment import Experiment
experiment = Experiment(workspace=ws, name="taxi-automl-v2") # increment version suffix
Now re-run your AutoML configuration and local run. If iteration 0 completes and you see a plausible metric, you're on track. Be suspicious of a perfect 1.0, which is usually a sign of data leakage, and of 0.0, which usually means the model trained on empty or malformed data.
Tip: set max_concurrent_iterations=1 in your AutoMLConfig during debugging. Concurrent iterations mask individual failure logs. Once everything works, bump it back up to match your CPU core count for speed.
The AutoML pipeline assumes your input data is clean. It doesn't validate it for you before starting iterations. So the very first thing you need to do, before touching AutoML config, is confirm your dataset loaded correctly.
For the NYC Taxi dataset (the most common Azure Open Datasets use case in AutoML tutorials), your loading code should look like this:
from azureml.opendatasets import NycTlcYellow
from dateutil import parser
end_date = parser.parse('2018-06-06')
start_date = parser.parse('2018-05-01')
nyc_tlc = NycTlcYellow(start_date=start_date, end_date=end_date)
nyc_tlc_df = nyc_tlc.to_pandas_dataframe()
After that call, immediately run these three validation checks before you do anything else:
# Check shape, should be non-zero rows
print(nyc_tlc_df.shape)
# Check for NaN in key columns
print(nyc_tlc_df[['totalAmount', 'tripDistance', 'passengerCount']].isnull().sum())
# Check dtypes match expectations
print(nyc_tlc_df.dtypes)
If shape returns (0, N), your date range returned no data, usually because of a storage endpoint issue or authentication problem. If you see a high NaN count in totalAmount (your prediction target), your model will train on garbage. Fix the NaN issue with nyc_tlc_df.dropna(subset=['totalAmount'], inplace=True) before splitting into train/test sets.
You should see a DataFrame with tens of thousands of rows and a clean totalAmount column before moving forward. That's your green light.
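If you want those three checks as a single gate you can call before every run, something like the following works. It's a sketch, not an official validation routine: validate_frame is a name I made up, and the 5% NaN threshold on the label column is an arbitrary assumption you should tune for your data.

```python
import pandas as pd


def validate_frame(
    df,
    required=("totalAmount", "tripDistance", "passengerCount"),
    label="totalAmount",
    max_label_nan_ratio=0.05,  # assumed threshold, not a documented limit
):
    """Return a list of problems; an empty list means the frame looks usable."""
    problems = []
    if df.shape[0] == 0:
        problems.append("no rows loaded")
        return problems
    missing = [col for col in required if col not in df.columns]
    if missing:
        problems.append(f"missing columns: {missing}")
        return problems
    nan_ratio = df[label].isnull().mean()
    if nan_ratio > max_label_nan_ratio:
        problems.append(f"{label} is {nan_ratio:.0%} NaN")
    if not pd.api.types.is_numeric_dtype(df[label]):
        problems.append(f"{label} has non-numeric dtype {df[label].dtype}")
    return problems
```

Call validate_frame(nyc_tlc_df) right after loading and stop if the list is non-empty. Failing here is a lot cheaper than finding out after 20 iterations.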
One of the most common Azure Open Datasets AutoML configuration errors is setting the wrong task type or pointing label_column_name at the wrong column. Both mistakes produce iterations that technically complete but with accuracy metrics that make no sense.
Here's a correctly structured AutoMLConfig for taxi fare regression, grounded in what the official documentation shows actually works:
import logging  # needed for verbosity=logging.INFO below
from azureml.train.automl import AutoMLConfig
automl_config = AutoMLConfig(
    task='regression',
    debug_log='automated_ml_errors.log',
    training_data=train_data,
    label_column_name='totalAmount',
    iterations=20,
    iteration_timeout_minutes=5,
    primary_metric='spearman_correlation',
    n_cross_validations=5,
    verbosity=logging.INFO
)
Pay attention to iterations=20. Based on actual documented run results, 20 iterations is enough to surface the best model combinations. You'll see pipelines like MinMaxScaler + RandomForest at iteration 1 hitting 0.9468 and the ensemble methods at iterations 18 and 19 pushing to 0.9471. Going beyond 20 iterations on this dataset yields diminishing returns, but cutting it to fewer than 15 means you'll probably miss the VotingEnsemble and StackEnsemble iterations that consistently outperform individual algorithms.
The iteration_timeout_minutes=5 cap prevents any single iteration from hanging your entire run. Without this, one slow ExtremeRandomTrees iteration can block everything for 20+ minutes. Set it. Always.
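It helps to know what those two settings commit you to in the worst case. Here's the back-of-the-envelope arithmetic as a tiny function. worst_case_minutes is a hypothetical helper of mine, and it deliberately ignores setup, featurization, and ensemble overhead, so treat the result as a rough ceiling on iteration time only.

```python
import math


def worst_case_minutes(iterations, iteration_timeout_minutes, max_concurrent_iterations=1):
    """Upper bound on iteration time if every iteration runs into its timeout."""
    batches = math.ceil(iterations / max_concurrent_iterations)
    return batches * iteration_timeout_minutes


print(worst_case_minutes(20, 5))     # 100
print(worst_case_minutes(20, 5, 4))  # 25
```

So the config above can never spend more than about 100 minutes in iterations when run one at a time, which is a much better position than an open-ended hang.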
After configuration, kick off the local run:
from azureml.core.experiment import Experiment
experiment = Experiment(ws, "taxi-fare-regression")
local_run = experiment.submit(automl_config, show_output=True)
During a 20-iteration AutoML run, you should expect to see iteration durations ranging roughly from 9 seconds (fast ExtremeRandomTrees passes) up to about 55 seconds for more complex LassoLars configurations. If any single iteration is running for 3+ minutes on a local run, something is wrong; it's not just slow.
The iteration table you want to see looks like this when things are healthy:
Iteration  Pipeline                            Duration  Metric  Best
0          StandardScalerWrapper RandomForest  0:00:16   0.8746  0.8746
1          MinMaxScaler RandomForest           0:00:15   0.9468  0.9468
4          RobustScaler DecisionTree           0:00:09   0.9449  0.9468
18         VotingEnsemble                      0:00:23   0.9471  0.9471
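If you'd rather not scan the Duration column by eye, the 3-minute rule of thumb is easy to automate. This is a sketch that assumes you've copied the iteration numbers and H:MM:SS duration strings out of the run output yourself. parse_duration and slow_iterations are my names, not SDK functions.

```python
from datetime import timedelta


def parse_duration(text):
    """Convert an 'H:MM:SS' string from the iteration table into a timedelta."""
    hours, minutes, seconds = (int(part) for part in text.split(":"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)


def slow_iterations(rows, limit=timedelta(minutes=3)):
    """Return the iteration numbers whose duration meets or exceeds the limit."""
    return [iteration for iteration, duration in rows if parse_duration(duration) >= limit]


# Durations taken from the healthy table above
rows = [(0, "0:00:16"), (1, "0:00:15"), (4, "0:00:09"), (18, "0:00:23")]
print(slow_iterations(rows))  # []
```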
If iterations are failing silently (showing no metric value), check the debug log you configured:
import logging
logging.basicConfig(level=logging.DEBUG)
# Then re-check:
with open('automated_ml_errors.log', 'r') as f:
print(f.read()[-5000:]) # Last 5000 chars, where errors appear
The three most common error strings in that log and what they mean:
- "Memory allocation failed": Your training dataset is too large for a local run. Trim it to under 100k rows or switch to a compute cluster.
- "Featurization failed for column X": A column has mixed types (strings mixed with floats). Cast it explicitly before training.
- "Cross-validation fold produced zero samples": Your training dataset is too small for the number of CV folds specified. Reduce n_cross_validations to 3 or increase your date range when loading the Open Dataset.
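Scrolling through a long debug log for those three strings gets old fast, so here's a small scanner that does the lookup for you. It's only a sketch: the substrings are the ones listed above, and the diagnose helper plus the one-line fix summaries are my own shorthand rather than anything the SDK emits.

```python
KNOWN_ERRORS = {
    "Memory allocation failed": "trim the training data or move to a compute cluster",
    "Featurization failed for column": "cast the mixed-type column explicitly",
    "Cross-validation fold produced zero samples": "lower n_cross_validations or widen the date range",
}


def diagnose(log_text):
    """Return (line_number, error_string, suggested_fix) for each known error found."""
    findings = []
    for number, line in enumerate(log_text.splitlines(), start=1):
        for needle, fix in KNOWN_ERRORS.items():
            if needle in line:
                findings.append((number, needle, fix))
    return findings
```

Feed it the text you read from automated_ml_errors.log. An empty result means none of the three usual suspects showed up, and you'll need to read the tail of the log by hand.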
Fix the underlying data issue, restart the kernel, and re-run. Don't try to resume a partially failed experiment; always start a new one with an incremented name.
Once your 20 iterations complete, you retrieve the best model using get_output(). This is where a lot of people run into "AttributeError" or "RunNotFound" errors, particularly if they let their notebook session expire mid-run.
The correct retrieval pattern is:
best_run, fitted_model = local_run.get_output()
print(best_run)
print(fitted_model)
If get_output() raises an error, first check that your run actually completed:
print(local_run.get_status())
# Should return 'Completed', not 'Running' or 'Failed'
If it returns 'Failed', retrieve the failure reason:
print(local_run.get_details()['error'])
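You can fold the status check, the failure lookup, and the retrieval into one guard so a half-finished run never reaches get_output(). This is a sketch that only relies on the three methods used above (get_status, get_details, get_output). fetch_best_model is a hypothetical wrapper name, and I'm assuming the 'error' entry can be absent from the details dict, which is why it uses .get().

```python
def fetch_best_model(run):
    """Return (best_run, fitted_model), or raise with the recorded failure reason."""
    status = run.get_status()
    if status != "Completed":
        # The 'error' entry may be missing (for example while the run is still
        # in progress), so .get() avoids a KeyError here.
        reason = run.get_details().get("error", "no error recorded")
        raise RuntimeError(f"run status is {status!r}: {reason}")
    return run.get_output()
```

Usage is the same as before, just routed through the guard: best_run, fitted_model = fetch_best_model(local_run).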
To visually explore all iterations, including the accuracy graph and the full metrics table, use the Jupyter widget. This is genuinely useful for spotting which scaler/algorithm combinations performed best:
from azureml.widgets import RunDetails
RunDetails(local_run).show()
The widget renders an interactive table showing every iteration, its pipeline components (scaler + algorithm), duration, and both the iteration-level metric and the running best. You can switch the chart between different accuracy metrics using the dropdown, useful if you want to compare RMSE vs. spearman correlation visually rather than digging through logs.
If the widget shows a blank panel or "Module not found," run pip install azureml-widgets --upgrade and restart the kernel. The widget package is often left behind when the core SDK updates.
After you have your fitted model, testing it against the holdout set is where people make subtle but important mistakes. The official documentation is clear about the correct sequence: you must pop the target column from your test features before calling predict. If you forget this step, your predictions will be contaminated by the target itself, and your accuracy metrics will look unrealistically perfect.
y_test = x_test.pop("totalAmount") # Remove target from features
y_predict = fitted_model.predict(x_test)
print(y_predict[:10]) # Sanity check first 10 predictions
Those first 10 predictions should be dollar amounts in a realistic taxi fare range (roughly $4 to $80 for NYC trips). If you see negative values or values over $500, your data preprocessing has an issue upstream.
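Looking at ten values is a fine smoke test, but it's just as easy to sweep the whole prediction array for the two red flags mentioned above. A minimal sketch follows. suspicious_predictions is my own helper name, and the bounds default to the "negative" and "over $500" thresholds from this section.

```python
def suspicious_predictions(predictions, low=0.0, high=500.0):
    """Return (index, value) pairs for predictions below `low` or above `high`."""
    return [(i, value) for i, value in enumerate(predictions) if value < low or value > high]


# Made-up values for illustration only
print(suspicious_predictions([12.5, -3.0, 48.0, 612.4]))  # [(1, -3.0), (3, 612.4)]
```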
Now calculate your two key accuracy metrics:
from sklearn.metrics import mean_squared_error
from math import sqrt
y_actual = y_test.values.flatten().tolist()
rmse = sqrt(mean_squared_error(y_actual, y_predict))
print(f"RMSE: {rmse}")
# MAPE calculation (computed here as total absolute error / total of actuals)
sum_actuals = sum_errors = 0
for actual_val, predict_val in zip(y_actual, y_predict):
    abs_error = abs(actual_val - predict_val)
    sum_errors += abs_error
    sum_actuals += actual_val
mape = sum_errors / sum_actuals
print(f"Model MAPE: {mape}")
print(f"Model Accuracy: {1 - mape}")
The documented expected output for a well-run AutoML experiment on this dataset is a MAPE of approximately 0.1435 and a resulting model accuracy of approximately 0.8565 (85.65%). Your AutoML best run metric during training (0.9471 spearman correlation for the VotingEnsemble) is a different measure than MAPE, so don't confuse them. The 0.9471 measures rank correlation; the 85.65% measures prediction closeness in dollar terms. Both matter. Neither one alone tells the full story.
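A toy example makes the gap between the two measures concrete. The sketch below uses made-up fares where every prediction is exactly double the real amount: the ordering is perfect, so rank correlation is 1.0, but the dollar error is as large as the fares themselves. The spearman_no_ties function is a simplified textbook formula that assumes no tied values. It is not what AutoML runs internally, and both function names are mine.

```python
def spearman_no_ties(actual, predicted):
    """Spearman rank correlation, 1 - 6*sum(d^2) / (n*(n^2 - 1)); valid only without ties."""

    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order, start=1):
            result[index] = rank
        return result

    n = len(actual)
    d_squared = sum((a - p) ** 2 for a, p in zip(ranks(actual), ranks(predicted)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))


def error_ratio(actual, predicted):
    """Sum of absolute errors divided by sum of actuals, as in the MAPE code above."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / sum(actual)


actual = [10.0, 20.0, 30.0, 40.0]
predicted = [20.0, 40.0, 60.0, 80.0]  # made-up: right order, double the fare
print(spearman_no_ties(actual, predicted))  # 1.0
print(error_ratio(actual, predicted))       # 1.0
```

A perfect 1.0 on the training metric alongside an "accuracy" of zero is an extreme case, but it's exactly why you check both numbers.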
If your MAPE is above 0.25 (accuracy below 75%), go back and check your data loading step. A date range that captures holiday periods or unusual weather events will skew the fare distribution significantly.
Advanced Troubleshooting
If the step-by-step fixes above didn't resolve your issue, you're likely dealing with a workspace-level or infrastructure problem. Here's how I approach the deeper stuff.
Azure ML Workspace Authentication Failures
When Azure Open Datasets fail to load with an authentication error, the problem is almost always with your workspace credential object. Run this diagnostic to confirm your workspace is reachable:
from azureml.core import Workspace
ws = Workspace.from_config()
print(ws.name, ws.resource_group, ws.location, ws.subscription_id, sep='\n')
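If from_config() throws an exception, your config.json is most likely missing or corrupted. In my experience a missing file surfaces as a UserErrorException saying config.json could not be found, while a ProjectSystemException usually points at the credentials or workspace details rather than the file's location. Go to your Azure ML Studio workspace, click the download icon next to your workspace name in the top-right, and re-download the config file to your project directory.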
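Before re-downloading anything, you can check whether the file you already have is actually usable. The sketch below assumes the standard config.json layout with subscription_id, resource_group, and workspace_name keys. check_workspace_config is a hypothetical helper, and it only inspects the one path you give it, whereas from_config() searches more widely.

```python
import json
from pathlib import Path

REQUIRED_KEYS = ("subscription_id", "resource_group", "workspace_name")


def check_workspace_config(path="config.json"):
    """Return a list of problems with the config file; empty means it looks complete."""
    config_path = Path(path)
    if not config_path.is_file():
        return [f"{config_path} does not exist"]
    try:
        config = json.loads(config_path.read_text(encoding="utf-8"))
    except json.JSONDecodeError as exc:
        return [f"{config_path} is not valid JSON: {exc.msg}"]
    if not isinstance(config, dict):
        return [f"{config_path} must contain a JSON object"]
    return [f"missing or empty key: {key}" for key in REQUIRED_KEYS if not config.get(key)]
```

An empty list means the file itself is fine, so the problem is more likely with your credentials or network than with config.json.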
Compute Cluster vs. Local Run Discrepancies
AutoML local runs execute iterations sequentially on your machine. Remote compute cluster runs execute them in parallel across nodes. The best model iteration number will often differ between the two; this is expected behavior, not a bug. What should NOT change is the best metric value for the winning model. If you see a best metric that's wildly different between local and remote runs, check that both are using identical preprocessing code and the same random seed.
Event Log Analysis for Persistent Failures
For enterprise environments running Azure ML through a VNet, AutoML job failures often originate at the network level, not the ML layer. Check the Azure Activity Log in the Azure Portal under your ML workspace: go to your resource group, click "Activity log," and filter by "Failed" operations in the past 24 hours. Look specifically for Microsoft.MachineLearningServices/workspaces/experiments/runs/write failures. These indicate a permissions problem or a network policy blocking the run registration.
Featurization Configuration for Custom Column Types
When AutoML misidentifies a column in an Azure Open Dataset (treating a zip code column as a continuous numeric variable, for instance), override featurization explicitly:
from azureml.automl.core.featurization import FeaturizationConfig
featurization_config = FeaturizationConfig()
featurization_config.add_column_purpose('vendorID', 'CategoricalHash')
automl_config = AutoMLConfig(
    ...,
    featurization=featurization_config
)
This is especially relevant when your Open Dataset columns include IDs, zone codes, or location identifiers that look numeric but are actually categorical. AutoML's automatic featurization gets these wrong roughly 30% of the time in my experience.
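To find candidates for an override before AutoML guesses wrong, a rough heuristic is to look for columns that hold only whole numbers and have very few distinct values. The sketch below is exactly that and nothing more. looks_categorical and both thresholds are my own assumptions, it expects numeric values (with None for missing entries), and it will flag genuinely numeric small-integer columns too, so review what it returns rather than trusting it blindly.

```python
def looks_categorical(values, max_unique_ratio=0.05, max_unique=50):
    """Heuristic: whole-number column with few distinct values relative to its length."""
    present = [v for v in values if v is not None]
    if not present:
        return False
    if not all(float(v).is_integer() for v in present):
        return False  # fractional values suggest a genuinely continuous column
    unique = len(set(present))
    return unique <= max_unique and unique / len(present) <= max_unique_ratio


print(looks_categorical([1, 2] * 100))           # True
print(looks_categorical([0.5, 1.2, 3.3] * 100))  # False
```

With a DataFrame you'd pass something like nyc_tlc_df['vendorID'].tolist() and then decide by hand whether the column deserves an add_column_purpose call.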
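If you do end up opening a ticket with Microsoft Support, include your run ID (local_run.id), the full automated_ml_errors.log file, and the output of pip list | grep azureml, along with anything else that identifies your workspace. Having those ready will cut your support resolution time in half.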
Overriding the Best Model Selection
Sometimes you want the best model from a specific iteration rather than the global best. Maybe iteration 1 (MinMaxScaler + RandomForest at 0.9468) is more interpretable for your stakeholders than the ensemble at 0.9471. You can retrieve any specific iteration:
# Get model from iteration index 1 (MinMaxScaler + RandomForest)
best_run, fitted_model = local_run.get_output(iteration=1)
print(fitted_model)
This is a legitimately useful pattern when you're deploying to environments where ensemble model complexity creates inference latency concerns, or where you need model explainability that a VotingEnsemble makes significantly harder.
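If you want a repeatable rule for when to prefer a simpler model, one option is "take the best non-ensemble iteration that is within some tolerance of the top metric." Here's a sketch of that rule using the rows from the iteration table earlier. pick_simple_iteration, the ENSEMBLES tuple, and the 0.001 tolerance are all my own choices, and it assumes a higher metric is better, as with spearman correlation.

```python
ENSEMBLES = ("VotingEnsemble", "StackEnsemble")


def pick_simple_iteration(rows, tolerance=0.001):
    """Return the best non-ensemble iteration within `tolerance` of the top metric, else None."""
    best_metric = max(metric for _, _, metric in rows)
    candidates = [
        (metric, iteration)
        for iteration, pipeline, metric in rows
        if pipeline not in ENSEMBLES and best_metric - metric <= tolerance
    ]
    if not candidates:
        return None
    return max(candidates)[1]


rows = [
    (0, "StandardScalerWrapper RandomForest", 0.8746),
    (1, "MinMaxScaler RandomForest", 0.9468),
    (4, "RobustScaler DecisionTree", 0.9449),
    (18, "VotingEnsemble", 0.9471),
]
print(pick_simple_iteration(rows))  # 1
```

The returned index is what you'd pass as iteration= to get_output(). A None result means no single model came close enough and the ensemble is earning its complexity.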