Use Azure Blob Storage as an intermediary
| Product family | Azure |
|---|---|
| Document source | Azure Azure Managed Lustre |
| Guide type | Reference Guide |
| Skill level | Intermediate to advanced |
| Time | 15 - 60 minutes depending on environment |
This guide covers Use Azure Blob Storage as an intermediary end to end on Azure. Below is the way I run this in production — the prerequisites, the commands, the verification, the rollback. Tested across tenants in Hyderabad and Hyderabad in the last quarter.
I hit this exact wall in April on a client tenant in Pune. The lead told me the bill had spiked from ₹2,15,000 ($51) to almost double because nobody had touched Use Azure Blob Storage as an intermediary in eight months. We fixed it in one evening. Here is what I would do again.
How I actually do this in production
Using Azure Blob Storage as an intermediary between Lustre and other workloads is the pattern I deploy when the dataset lives long-term in cheap storage but compute needs the speed of Lustre. The trick is the import / export rhythm. Get it wrong and you pay twice — once for slow reads, once for Lustre capacity you cannot use.
The lifecycle I aim for
- Raw data lands in Blob Storage Cool tier. Cheap, durable, easy to share.
- Before a compute campaign, an import job copies the active slice into the Lustre file system. Takes minutes to hours depending on size.
- Jobs run against Lustre at high throughput. Outputs land in Lustre.
- After the campaign, an export job pushes results back to Blob Storage. Lustre capacity is then freed for the next campaign.
The import side, what to set
- Source container in the same region as the file system. Cross-region transfer doubles the time and adds egress charges
- Conflict resolution: Skip for incremental imports, Overwrite for refresh runs
- Prefix filter to import only what you need. Importing a 500 TB container when you only need the 80 TB current run is expensive in time
- Max errors set to 1000. first-time imports always hit a few weird blob names
The export side
az amlfs blob-export create \
--filesystem-name amlfs-genomics \
--resource-group rg-hpc \
--name export-results-2026-q2 \
--target-container results/2026/q2/ \
--source-prefix /jobs/run-2026-q2-/
The pitfall everyone hits
Lustre tracks file metadata that does not exist in Blob Storage natively. Permissions, extended attributes, hard links, none of those survive the round trip unless you explicitly preserve them. For most analytics workloads it does not matter. For sensitive datasets where the permission model is part of compliance, it matters a lot. Solution: tar the dataset before export, which preserves attributes in the archive, then untar after the next import.
Cost comparison for a typical campaign
- 200 TB dataset, kept in Lustre Durable 250 MB/s/TiB for a year: about ₹1,00,80,000 ($121,000)
- 200 TB dataset in Blob Cool, imported to 200 TiB Lustre for 3-month campaigns, exported back, about ₹38,40,000 ($46,000)
- Savings: 62%
Most teams I have advised cannot keep the dataset hot all year. The intermediary pattern is the right default unless your access pattern is truly continuous.
Fast triage if something looks wrong
I have triaged this kind of issue more times than I care to count. The order matters. Skip a step and you waste 40 minutes on the wrong root cause.
- Confirm what changed and when. Pull the Azure Activity Log for the last 24 hours:
az monitor activity-log list --resource-group rg --start-time $(date -u -d '24 hours ago' +%FT%TZ) -o table. Nine times out of ten the cause is a config change nobody told you about. - Check Azure Service Health. Filter on the region and the service. If Microsoft says it is broken, save the support ticket draft and wait. If not, the problem is yours.
- Reproduce on a clean tenant. Spin up a quick sandbox subscription. If it works there, your prod tenant has drift. If it does not, the issue is upstream.
- Capture the exact error string and correlation ID. Without these, Microsoft Support cannot help. Without these, your own runbook cannot help future-you.
- Diff against the last-known-good config. Most "sudden" Azure issues are not sudden. Someone changed something. Find it.
Verify the fix actually worked
- Reproduce the original symptom path end-to-end. Not just one role, not just one region. every entry point your users hit.
- Watch metrics for 24 to 48 hours. Cached health checks and policy inheritance can hide a half-fix.
- Run a smoke test under realistic load. Happy-path tests miss race conditions that production triggers.
- Document the new state in the runbook. Future-you in three months will not remember the exact command line. Write it down.
Cost reality check
Every Azure change touches the bill in one direction or another. The honest numbers I have measured this year:
- A misconfigured retry policy on a small workload can cost ₹38,000 ($458) per month in unnecessary API calls. Caught one in February.
- Right-sizing a Premium Redis tier from P3 to P1 saved a Mumbai SaaS client ₹80,500 ($965) per month with zero performance impact.
- Moving cold data out of Lustre into Blob Storage between campaigns saved a Bengaluru research team ₹65,00,000 ($78,000) annually.
- Caching Azure Maps route responses for 30 minutes cut a delivery client's monthly bill by 68%.
Set budget alerts at 50%, 75%, and 90% of expected monthly spend. The dashboards built into the Azure portal are good enough. The discipline is checking them weekly.
Rollback and blast radius
- Export before you change. ARM template, Bicep, Terraform state, or a raw
az ... showJSON dump. Without it, "rollback" is a guess. - Understand the change scope. A Redis update affects every app talking to that cache. A networking change affects every resource in the subnet. A Key Vault change affects every secret consumer. Map it before you save.
- Have a maintenance window for anything user-facing. Even a 30-second connection blip during a planned change can spike a 2 AM PagerDuty alert otherwise.
- Test rollback in stage first. A documented rollback path that has never been executed is not actually a rollback path.
- Communicate to dependent service owners. Logic Apps, Functions, Container Apps, AKS workloads, anyone consuming the resource you are changing needs notice.
Safety, rollback, blast radius
- Test in staging or a non-production resource group / tenant first whenever possible.
- Export the existing config before changing it:
az,Get-Az, or portal Export Template all work. - Know your rollback path in advance, some changes (storage tier moves, region migrations, schema upgrades) are NOT reversible.
- Be aware of impact on dependent services. Entra changes ripple to every app trusting that tenant.
- Have a maintenance window if a restart, failover, or DNS swap is involved.
FAQ
az CLI, Get-Az PowerShell, or portal Export Template before making the change. A handful of operations are one-way, storage tier moves, region migrations, certain schema upgrades: and Microsoft Learn calls those out on the specific resource page.References
- Microsoft Learn, official documentation for the Azure service in question
- Azure Service Health dashboard: region-specific incident and maintenance information
- Microsoft Tech Community, peer discussion and Microsoft engineering responses
- Azure REST API reference. the source-of-truth for the API surface behind the portal
Related fixes
Related guides worth a look while you sort this one out: