Fix Azure Cosmos DB Errors: Auth, Queries & Setup Guide

Microsoft Fix · Intermediate · 18 min read · Official Docs Grounded · Updated April 20, 2026

Why Azure Cosmos DB Issues Happen (And Why They're So Hard to Diagnose)

I've worked with dozens of teams who hit a wall with Azure Cosmos DB, and the frustrating part is almost always the same: the error messages are vague, the portal gives you almost nothing useful, and Microsoft's own SDK exceptions can feel like they were written for someone who already knows the answer.

Azure Cosmos DB is a genuinely powerful, globally distributed NoSQL database service. It supports six different API models (NoSQL, MongoDB, Apache Cassandra, Apache Gremlin, Table, and PostgreSQL), and each one has its own quirks, wire protocol specifics, and authentication surface area. When something breaks, you first have to figure out which layer broke: is it authentication? The API selection? Your query syntax? A throughput limit? The partitioning strategy? That's a lot of suspects.

The most common Azure Cosmos DB problems I see fall into four buckets:

  • Authentication and RBAC misconfiguration: Your app can't connect at all, or it connects but gets permission-denied errors on data operations. This is especially common after migrating to Microsoft Entra ID (formerly Azure AD) authentication from the old primary key approach.
  • Query language confusion: Developers coming from SQL Server backgrounds expect standard T-SQL behavior and run into walls when aggregation syntax, JOIN semantics, or subquery support differs from what they expect.
  • HTTP 412 Precondition Failed on writes: This one is sneaky. Your writes start failing under load, and the error code doesn't make it obvious that you're hitting an optimistic concurrency conflict on ETag values.
  • Performance degradation and request unit (RU) exhaustion: Bulk import jobs that worked fine in dev fall apart in production because nobody accounted for how RU provisioning interacts with partition key distribution.

What makes these problems genuinely hard is that Azure Cosmos DB's multi-model architecture means the fix for a Cassandra API issue looks completely different from the fix for a NoSQL API issue, even when the symptom on the surface appears identical. A "connection refused" error in the MongoDB API and the NoSQL API require entirely different remediation paths.

The good news: every single one of these issues has a documented fix. I'm going to walk you through them in order of frequency, starting with the fastest resolution and working toward the more complex scenarios.

The Quick Fix: Try This First

If your Azure Cosmos DB application is throwing connection errors or permission-denied responses right now and you need it working fast, start here before going through the full step-by-step.

The number one cause of sudden Azure Cosmos DB failures, especially in enterprise environments, is a mismatch between control-plane and data-plane role assignments. Here's what that means in plain language: Azure Cosmos DB uses two separate authorization systems. The control plane handles account-level operations like creating databases and containers, and it uses standard Azure role-based access control (RBAC). The data plane handles your actual data (running queries, inserting items, patching documents), and it uses a completely separate, Cosmos DB-native RBAC implementation.

Teams routinely assign someone the "Cosmos DB Account Reader" or even "Owner" Azure RBAC role and then wonder why their application can't run queries. Those control-plane roles don't automatically grant data-plane access. You need both.

To verify this in the Azure Portal:

  1. Open your Azure Cosmos DB account in the portal.
  2. In the left sidebar, click Access control (IAM). This covers control-plane roles.
  3. Data-plane role assignments are stored separately and may not be listed under Access control (IAM) or Keys. To list them, run az cosmosdb sql role assignment list --account-name your-cosmos-account --resource-group your-resource-group.
  4. Confirm your service principal or managed identity appears in both places with appropriate roles assigned.

If the data-plane role assignment is missing, that's your fix. Use the Azure CLI to add it:

az cosmosdb sql role assignment create \
  --account-name your-cosmos-account \
  --resource-group your-resource-group \
  --role-definition-name "Cosmos DB Built-in Data Contributor" \
  --principal-id your-service-principal-object-id \
  --scope "/subscriptions/your-sub-id/resourceGroups/your-rg/providers/Microsoft.DocumentDB/databaseAccounts/your-cosmos-account"

Give it 2–3 minutes to propagate, then retry your connection. This resolves the majority of sudden "works in dev, fails in prod" Azure Cosmos DB authentication issues I've seen.

Pro Tip
When you're testing data-plane auth with Microsoft Entra ID, always check the token audience. Your client must request a token with the scope https://cosmos.azure.com/.default, not the general Azure management scope. Using the wrong audience is a silent failure: you get a valid token, but Cosmos DB rejects it anyway, and the SDK error message often just says "Unauthorized" without explaining why.
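When I suspect a wrong-audience token, I decode the token payload and look at the aud claim before touching anything else. The sketch below does that with the Python standard library only. It does not verify the signature, the _fake_token helper is purely hypothetical so the example runs offline, and the expected audience value is my assumption (the scope with "/.default" removed), so confirm it against a token that is known to work in your tenant.

```python
import base64
import json


def token_audience(jwt: str) -> str:
    """Return the `aud` claim of a JWT without verifying its signature."""
    payload_b64 = jwt.split(".")[1]
    # JWT segments are base64url-encoded with padding stripped; restore it.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    claims = json.loads(base64.urlsafe_b64decode(payload_b64))
    return claims["aud"]


def _fake_token(aud: str) -> str:
    """Build an unsigned, illustration-only token (hypothetical helper)."""
    def encode(obj):
        raw = json.dumps(obj).encode("utf-8")
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"{encode({'alg': 'none'})}.{encode({'aud': aud})}."


# Assumption: the audience is the requested scope minus "/.default".
EXPECTED_AUDIENCE = "https://cosmos.azure.com"

good = _fake_token("https://cosmos.azure.com")
wrong = _fake_token("https://management.azure.com")

print(token_audience(good) == EXPECTED_AUDIENCE)
print(token_audience(wrong) == EXPECTED_AUDIENCE)
```

In practice you would paste the access token your credential actually returned into token_audience instead of using the fake tokens.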

Step 1: Verify and Configure Microsoft Entra ID Authentication

Azure Cosmos DB fully supports Microsoft Entra ID (formerly Azure Active Directory) for both control-plane and data-plane authentication. If you're still using primary keys for production workloads, I'd strongly recommend migrating, but that transition is where most teams introduce auth errors.

First, confirm whether local (key-based) authentication is enabled or disabled on your Azure Cosmos DB account, according to your policy. Some security teams disable primary key access entirely, which means your app must use Entra ID. Check this in the portal under your Cosmos DB account > Settings > Keys, where you'll see whether "Disable local authentication" is toggled on.

For control-plane operations (creating databases, updating throughput, reading metadata), you use standard Azure RBAC built-in roles. The relevant built-in roles include "Cosmos DB Account Reader Role" for read-only access and "DocumentDB Account Contributor" for write access. Assign these at the subscription, resource group, or account scope depending on your security requirements.

For data-plane operations (running queries, inserting items, executing transactions), you need the Cosmos DB-specific native RBAC. The two built-in data-plane roles are:

  • Cosmos DB Built-in Data Reader: read-only queries and item fetches
  • Cosmos DB Built-in Data Contributor: full read/write/delete on items

You can also create custom roles with granular permissions. Once your role assignments are in place, update your application to construct the client from the account endpoint and a DefaultAzureCredential from the Azure Identity SDK, rather than from a connection string or primary key. In .NET, that looks like:

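using Azure.Identity;
using Microsoft.Azure.Cosmos;

// No key material: the credential is resolved from the environment (managed identity, CLI login, and so on).
var client = new CosmosClient(
    accountEndpoint: "https://your-account.documents.azure.com:443/",
    tokenCredential: new DefaultAzureCredential()
);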

If it worked, your application will connect without any key material. If it still fails with a 401 or 403, double-check that the role assignment scope matches the resource your app is accessing. Scope mismatches are common and produce the same error code as a missing role assignment.
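To reason about scope mismatches, I find it helps to model scopes as nested paths. The following is a simplified sketch, not how the service evaluates permissions: it assumes scopes relative to the account look like "/", "/dbs/<database>", or "/dbs/<database>/colls/<container>", compares whole path segments, and treats names as case-sensitive. The database and container names are hypothetical.

```python
def scope_covers(assignment_scope: str, resource_path: str) -> bool:
    """True if an assignment made at `assignment_scope` applies to `resource_path`.

    Both arguments are paths relative to the account, for example "/",
    "/dbs/sales", or "/dbs/sales/colls/orders". Matching is done on whole
    path segments, so "/dbs/sales" does not cover "/dbs/salesarchive".
    """
    scope_parts = [part for part in assignment_scope.split("/") if part]
    resource_parts = [part for part in resource_path.split("/") if part]
    return resource_parts[: len(scope_parts)] == scope_parts


# An account-level assignment covers everything underneath it.
print(scope_covers("/", "/dbs/sales/colls/orders"))
# A container-level assignment does not cover a sibling container.
print(scope_covers("/dbs/sales/colls/orders", "/dbs/sales/colls/invoices"))
```

If the second kind of check comes back false for the container your app actually reads, the assignment exists but is scoped too narrowly, which matches the "role is there but I still get 403" symptom.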

Step 2: Diagnose and Fix Azure Cosmos DB NoSQL Query Errors

This step is for anyone hitting query failures, unexpected result sets, or "query is not supported" errors. The Azure Cosmos DB for NoSQL API uses a custom query language derived from SQL, but it's not T-SQL, and the differences trip people up constantly.

The NoSQL query language is a subset of standard SQL combined with NoSQL-specific extensions. It includes a rich set of hierarchical and relational operators, supports JavaScript-based user-defined functions (UDFs), and works against JSON documents modeled as trees with labeled nodes. JSON grammar is at the core of how Cosmos DB indexes and queries data.

Common query failures and their causes:

  • "Query execution is not enabled on this endpoint": You're sending queries to the wrong endpoint. NoSQL queries must go to the document endpoint, not the Gremlin or Cassandra endpoint. Verify your connection string points to https://your-account.documents.azure.com:443/.
  • Cross-partition query timeouts: If your query doesn't filter on the partition key, Cosmos DB fans out the query to every physical partition. On large containers this causes massive RU consumption and timeouts. Always include your partition key in the WHERE clause when you know it.
  • Subquery not supported errors: Not all SQL subquery patterns are supported. The NoSQL query language includes a subset, not the full SQL Server subquery surface. Restructure the query as a correlated subquery (for example with EXISTS), or use IN or ARRAY_CONTAINS against known values where possible.
  • UDF execution errors: JavaScript UDFs run in a sandboxed environment. They can't call external APIs or use Node.js modules. If your UDF is throwing, check that it's pure JavaScript computation only.

To test queries quickly without deploying code, use the Azure Portal's Data Explorer. Navigate to your Cosmos DB account > Data Explorer, select your container, click New SQL Query, and run queries interactively. The portal shows you RU charge per query, a critical metric for diagnosing performance problems.

SELECT c.id, c.name, c.status
FROM c
WHERE c.partitionKey = "region-us-east"
AND c.status = "active"
ORDER BY c.createdAt DESC
OFFSET 0 LIMIT 50

If the query returns results in the portal but fails in your application, the issue is almost certainly in how your SDK is constructing the query parameters, not the query itself.
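When the SDK side is the suspect, the usual culprit is string concatenation. A minimal sketch of the parameterized alternative is below. It only builds the query specification as a plain dictionary (a query string plus name/value parameter pairs, which is the shape the REST API documents), so it runs without an SDK. The function name build_query_spec is hypothetical, and how you pass the spec to your particular SDK is left as an assumption to check against its reference.

```python
import json


def build_query_spec(partition_key: str, status: str) -> dict:
    """Build a parameterized query spec instead of concatenating values into SQL."""
    return {
        "query": (
            "SELECT c.id, c.name, c.status FROM c "
            "WHERE c.partitionKey = @pk AND c.status = @status "
            "ORDER BY c.createdAt DESC OFFSET 0 LIMIT 50"
        ),
        "parameters": [
            {"name": "@pk", "value": partition_key},
            {"name": "@status", "value": status},
        ],
    }


spec = build_query_spec("region-us-east", "active")
print(json.dumps(spec, indent=2))
```

Because the values travel separately from the query text, quotes or special characters in a value can't change the query's structure, which removes a whole class of "works in the portal, fails in code" bugs.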

Step 3: Resolve HTTP 412 Precondition Failed (Concurrency Conflicts)

I know this one is frustrating, especially when it happens intermittently under load. HTTP 412 is Azure Cosmos DB's way of telling you that an optimistic concurrency conflict has occurred, and your write was rejected because the item changed between the time you read it and the time you tried to write it.

Here's how the mechanism works: every item in Azure Cosmos DB for NoSQL has an _etag property. Every time that item is updated, the server assigns a new ETag value. When you read an item, the ETag comes back in the response headers. If you then send a write operation with an If-Match header containing the ETag you read, the server checks whether the item's current ETag matches. If another process updated the item in the meantime, the ETags won't match and you get HTTP 412.

This is intentional behavior: it prevents lost updates in concurrent systems. The fix is to implement a retry loop with re-fetch:

// C# example, retry on 412
async Task UpdateItemWithRetry(string id, string partitionKey)
{
    int maxRetries = 3;
    for (int i = 0; i < maxRetries; i++)
    {
        // Always re-read to get current ETag
        ItemResponse<MyItem> readResponse = await container.ReadItemAsync<MyItem>(
            id, new PartitionKey(partitionKey));

        MyItem item = readResponse.Resource;
        string currentEtag = readResponse.ETag;

        // Apply your changes
        item.Status = "updated";

        try
        {
            ItemRequestOptions options = new ItemRequestOptions
            {
                IfMatchEtag = currentEtag
            };
            await container.ReplaceItemAsync(item, id,
                new PartitionKey(partitionKey), options);
            return; // Success
        }
        catch (CosmosException ex) when (ex.StatusCode ==
            System.Net.HttpStatusCode.PreconditionFailed)
        {
            if (i == maxRetries - 1) throw;
            await Task.Delay(TimeSpan.FromMilliseconds(100 * (i + 1)));
        }
    }
}
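To see why the re-read inside the loop matters, here is a small in-memory simulation of the ETag check. FakeContainer and PreconditionFailed are hypothetical stand-ins written for this sketch, not SDK types; they only mimic the behavior described above (a new ETag on every write, a rejection when the supplied ETag is stale).

```python
import uuid


class PreconditionFailed(Exception):
    """Stands in for an HTTP 412 response in this simulation."""


class FakeContainer:
    """Toy in-memory store that mimics ETag checking; not the Cosmos DB SDK."""

    def __init__(self):
        self._items = {}  # item id -> (etag, body)

    def upsert(self, item_id, body):
        etag = uuid.uuid4().hex  # a new ETag on every write
        self._items[item_id] = (etag, dict(body))
        return etag

    def read(self, item_id):
        etag, body = self._items[item_id]
        return etag, dict(body)

    def replace(self, item_id, body, if_match):
        current_etag, _ = self._items[item_id]
        if if_match != current_etag:
            raise PreconditionFailed(item_id)
        return self.upsert(item_id, body)


container = FakeContainer()
container.upsert("order-1", {"status": "new", "notes": ""})

# Two writers read the same version of the item.
etag_a, copy_a = container.read("order-1")
etag_b, copy_b = container.read("order-1")

copy_a["status"] = "shipped"
container.replace("order-1", copy_a, if_match=etag_a)  # first writer wins

copy_b["notes"] = "gift wrap"
try:
    container.replace("order-1", copy_b, if_match=etag_b)  # stale ETag
    conflict = False
except PreconditionFailed:
    conflict = True
    # Re-read, re-apply the change, and retry with the fresh ETag.
    fresh_etag, fresh = container.read("order-1")
    fresh["notes"] = "gift wrap"
    container.replace("order-1", fresh, if_match=fresh_etag)

final = container.read("order-1")[1]
print(conflict, final)
```

Without the ETag check, the second writer would have silently overwritten "shipped" with its stale copy. With it, both changes survive.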

If you're getting 412 errors consistently rather than intermittently, that suggests multiple application instances are writing to the same items at high frequency. Consider redesigning your write pattern to use the patch operation (PatchItemAsync), which supports atomic field-level updates and reduces the surface area for conflicts. The Cosmos DB SDK also supports If-None-Match headers for cache validation. Use this when checking whether a resource needs to be re-fetched rather than always pulling the full item.

After implementing the retry, you should see failed writes drop in your application logs; 412 responses can still appear in server-side metrics whenever a conflict is detected and retried. Monitor via Azure Monitor > Metrics > select your Cosmos DB account, then chart the "Total Requests" metric and split or filter it by status code to isolate 412 responses.

Step 4: Fix Azure Cosmos DB Bulk Import Failures and Throughput Errors

Bulk insert failures are one of the most common Azure Cosmos DB pain points during initial data migration or during high-volume ingestion pipelines. You write a job that works beautifully against a 10,000-item test dataset, then run it against 10 million items and it either throttles out, times out, or starts silently dropping records.

The right approach depends on your SDK. For .NET, use the CosmosClient with bulk mode enabled. This is separate from the legacy bulk executor library. In the .NET SDK v3+, enabling bulk execution is a single flag at client construction time:

CosmosClientOptions options = new CosmosClientOptions
{
    AllowBulkExecution = true
};
CosmosClient client = new CosmosClient(connectionString, options);

With bulk execution enabled, the SDK batches your individual CreateItemAsync calls into groups automatically. This dramatically reduces the per-operation overhead and optimizes how your provisioned throughput is consumed.

For the Java SDK, the bulk API is exposed through the CosmosBulkOperations class. Both implementations optimize how throughput is consumed during large data loads.

For massive datasets (we're talking hundreds of millions of records), consider Apache Spark with the Azure Cosmos DB Spark connector. You can run bulk imports using Python or Scala against your existing Spark cluster, and the connector handles parallelism, partitioning, and retry logic for you.

Common bulk import errors and what they mean:

  • 429 Too Many Requests: You've exhausted your provisioned Request Units. Temporarily increase your container's throughput before the migration, then scale it back down after. Navigate to your container > Scale & Settings > adjust the RU/s slider.
  • Request entity too large: Individual items exceed the 2 MB document size limit. Split oversized documents before importing.
  • Partition key not found: Your import data doesn't include the partition key field, or the field name doesn't match the container's partition key path exactly (it's case-sensitive).
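The last two errors can be caught before the import even starts. Below is a minimal pre-flight check, assuming each item is already a Python dict and the partition key path looks like "/tenantId" or "/address/country" (both hypothetical). It approximates item size with the UTF-8 length of the JSON text, which is my assumption and may not match exactly how the service measures an item.

```python
import json

MAX_ITEM_BYTES = 2 * 1024 * 1024  # the 2 MB item limit mentioned above


def find_import_problems(item: dict, partition_key_path: str) -> list:
    """Return human-readable problems; an empty list means the item looks importable."""
    problems = []

    # Walk the partition key path segment by segment (names are case-sensitive).
    node = item
    for segment in partition_key_path.strip("/").split("/"):
        if not isinstance(node, dict) or segment not in node:
            problems.append(f"missing partition key {partition_key_path}")
            break
        node = node[segment]

    # Approximate the stored size with the UTF-8 JSON encoding (an assumption).
    size = len(json.dumps(item).encode("utf-8"))
    if size > MAX_ITEM_BYTES:
        problems.append(f"item is {size} bytes, over the 2 MB limit")

    return problems


ok_item = {"id": "1", "tenantId": "t-1"}
wrong_case = {"id": "2", "TenantId": "t-1"}
too_big = {"id": "3", "tenantId": "t-1", "blob": "x" * MAX_ITEM_BYTES}

for candidate in (ok_item, wrong_case, too_big):
    print(candidate["id"], find_import_problems(candidate, "/tenantId"))
```

Running a check like this over the source data first means the bulk job only has to deal with throttling, not with items that could never have been written.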

Always run a small test batch of 1,000–5,000 items first, check the RU charge per item in Azure Monitor, and project your total RU cost before scaling up. This saves you from provisioning 50,000 RU/s for a migration that only needed 5,000.
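The projection itself is simple arithmetic: total RUs divided by the time you're willing to wait. Here's a sketch with made-up numbers (10 million items, 6 RU per write, a 4-hour window); substitute the per-item charge you actually measured in the test batch, and add whatever headroom you want for other traffic.

```python
import math


def required_ru_per_second(item_count: int, ru_per_item: float, window_seconds: int) -> int:
    """RU/s needed to write `item_count` items within `window_seconds`, with no headroom."""
    total_ru = item_count * ru_per_item
    return math.ceil(total_ru / window_seconds)


# Hypothetical inputs: 10 million items, 6 RU per write, a 4-hour migration window.
needed = required_ru_per_second(10_000_000, 6.0, 4 * 3600)
print(needed)  # 4167
```

With these inputs the job needs roughly 4,200 RU/s, which is why measuring first can save you from provisioning ten times more than the migration requires.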

Step 5: Select and Configure the Right Azure Cosmos DB API for Your Workload

A misconfigured API choice causes a specific class of Azure Cosmos DB problems that are hard to recognize because the connection often succeeds but behavior is wrong or features are missing. Getting the API selection right from the start, or migrating correctly when you're on the wrong one, matters more than most teams realize.

Azure Cosmos DB offers six API models, and choosing incorrectly is a very common setup mistake:

  • NoSQL API: Best choice for greenfield applications. Uses native SQL queries with automatic indexing and schema flexibility for document workloads. This is the "native" Cosmos DB experience and has the deepest feature support, including vector search, change feed, and the most sophisticated RBAC implementation.
  • MongoDB API: Choose this if you have an existing MongoDB application you want to migrate. Wire protocol compatibility means most MongoDB drivers and tools work without code changes. Check the supported features and syntax list in the official docs before migrating, because not every MongoDB operator is supported.
  • Apache Cassandra API: For highly scalable workloads already using Cassandra Query Language. Wire protocol support means existing CQL-based applications migrate with minimal changes. Partitioning behavior differs slightly from native Cassandra, so test your partition key strategies before production migration.
  • Apache Gremlin API: For graph data. If your data has complex relationships that are expensive to query in a document model, Gremlin lets you traverse those relationships efficiently. Use graph data modeling guidance from the official docs to avoid common property-key mistakes.
  • Table API: Choose this if you're migrating from Azure Table Storage and want premium capabilities like global distribution and automatic indexing without rewriting your application.
  • PostgreSQL API: For distributed relational workloads where you need horizontal scaling but want to keep the PostgreSQL wire protocol. Connect using psql or any PostgreSQL client, then run SQL commands as normal.

Once you've created an Azure Cosmos DB account with a specific API, you cannot change it. If you realize you've chosen the wrong API, you need to create a new account and migrate your data. Plan your API choice carefully: evaluate your existing tooling, driver ecosystem, query patterns, and team familiarity before committing.

To confirm which API your existing account uses: Azure Portal > your Cosmos DB account > Overview. The API type appears prominently in the account details. If it says "Core (SQL)", that's the NoSQL API. If you're connecting with a MongoDB driver to a "Core (SQL)" account, nothing will work. That's a very common setup error.

Advanced Azure Cosmos DB Troubleshooting

Network and Private Endpoint Issues

Enterprise deployments frequently restrict Azure Cosmos DB access to virtual network (VNet) service endpoints or private endpoints. If your application connects fine from a developer laptop but fails from an Azure VM, App Service, or AKS pod, network policy is almost certainly the culprit.

In the Azure Portal, navigate to your Cosmos DB account > Networking. You'll see whether public network access is enabled and whether specific VNet subnets or IP ranges are allowed. A common mistake is adding a VNet firewall rule but forgetting to enable the "Microsoft.AzureCosmosDB" service endpoint on the subnet itself. The firewall rule won't work until the subnet has the service endpoint enabled; these are two separate settings in two separate places.

For private endpoint configurations, check that your private DNS zone (privatelink.documents.azure.com) is properly linked to the VNet your application lives in. A missing DNS zone link means your app resolves the Cosmos DB hostname to the public IP even when a private endpoint exists, and the private endpoint firewall rules then block that traffic.
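A quick way to confirm the DNS side is to resolve the account hostname from the machine that's failing and check whether the answer is a private address. The sketch below uses only the standard library. The resolver is injectable so the demonstration runs offline with stand-in addresses; the hostname in the comment is a placeholder, and the function name is hypothetical.

```python
import ipaddress
import socket


def resolves_to_private_ip(hostname: str, resolve=socket.gethostbyname) -> bool:
    """True if `hostname` resolves to a private address (for example 10.x.x.x)."""
    return ipaddress.ip_address(resolve(hostname)).is_private


# Offline demonstration with stand-in resolvers. For a real check, run this on the
# affected VM or pod: resolves_to_private_ip("your-account.documents.azure.com")
via_private_endpoint = resolves_to_private_ip("example.invalid", resolve=lambda _: "10.1.2.4")
via_public_dns = resolves_to_private_ip("example.invalid", resolve=lambda _: "8.8.8.8")
print(via_private_endpoint, via_public_dns)
```

If the real check returns False from inside the VNet, the private DNS zone link is the first thing I'd look at.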

Diagnosing Performance Issues with Azure Monitor

When Azure Cosmos DB is slow, the diagnostic data is all there; you just need to know where to look. Open Azure Monitor > Insights > Azure Cosmos DB. The key metrics to examine are:

  • Total Request Units Consumed vs. Normalized RU Consumption (%): If normalized RU consumption is consistently above 80%, you're at risk of throttling; at 100%, at least one partition has used its full throughput budget and requests are likely to be throttled. Either increase throughput or optimize queries to consume fewer RUs.
  • Server Side Latency: High server-side latency with normal RU consumption often points to cross-partition queries or hot partitions. Examine your partition key distribution.
  • Http 429 count: Every 429 is a throttled request. The SDK retries these automatically, but too many throttles degrade application performance and mask the real issue.
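The SDKs already retry 429s for you, but if you call the service through your own HTTP layer, or want to understand what the SDK is doing, the pattern looks roughly like this. It's a sketch under assumptions: Throttled is a hypothetical exception standing in for a 429 response, and retry_after_ms stands in for the value the service returns in the x-ms-retry-after-ms response header. The sleep function is injectable so the demonstration runs instantly.

```python
import time


class Throttled(Exception):
    """Stands in for an HTTP 429 response; carries the server's retry hint."""

    def __init__(self, retry_after_ms: int):
        super().__init__(f"throttled, retry after {retry_after_ms} ms")
        self.retry_after_ms = retry_after_ms


def call_with_throttle_retry(operation, max_attempts: int = 5, sleep=time.sleep):
    """Run `operation`, waiting as long as the server asks after each 429."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Throttled as err:
            if attempt == max_attempts:
                raise  # give up and surface the throttle to the caller
            sleep(err.retry_after_ms / 1000)


# Demonstration: a fake operation that is throttled twice, then succeeds.
responses = [Throttled(20), Throttled(40), "ok"]
waits = []


def fake_write():
    outcome = responses.pop(0)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_throttle_retry(fake_write, sleep=waits.append)
print(result, waits)
```

The point of honoring the server's hint rather than a fixed delay is that the service knows when the partition's budget will free up; retrying sooner just burns more requests.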

Partition Key Hot Spots

A hot partition is one of the hardest Azure Cosmos DB performance problems to spot without monitoring because the overall throughput metrics look fine; it's just one partition that's melting down. Hot partitions happen when too many requests target the same partition key value. For example, if your partition key is "region" and 90% of your traffic is "us-east-1", that partition gets all the load while others sit idle.

Fix hot partitions by choosing a higher-cardinality partition key, one with many distinct values distributed evenly across your data. If the natural key on a production container is too coarse, consider a synthetic partition key that concatenates multiple fields to increase cardinality. Keep in mind that moving existing data onto a different key still means copying it into a new container.
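A synthetic key is nothing more than a property you compute before writing the item. Here's a minimal sketch with hypothetical field names (region and customerId) that shows how concatenation turns one overloaded key value into several distinct ones.

```python
from collections import Counter


def synthetic_partition_key(item: dict, fields: list) -> str:
    """Join several property values into one higher-cardinality key value."""
    return "-".join(str(item[name]) for name in fields)


# Hypothetical items: "region" alone would put all three in one logical partition.
items = [
    {"id": "1", "region": "us-east-1", "customerId": "c-100"},
    {"id": "2", "region": "us-east-1", "customerId": "c-200"},
    {"id": "3", "region": "us-east-1", "customerId": "c-300"},
]

by_region = Counter(item["region"] for item in items)
by_synthetic = Counter(
    synthetic_partition_key(item, ["region", "customerId"]) for item in items
)
print(len(by_region), len(by_synthetic))
```

The trade-off is on the read side: queries now need both fields to target a single partition, so pick fields your queries already filter on.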

Emulator Troubleshooting

The Azure Cosmos DB Emulator is available for local development. If the emulator won't start or your local app can't connect to it, the most common cause is a port conflict on 8081 or an expired or untrusted TLS certificate. On Windows, the tray icon's Reset Data option clears local data but does not issue a new certificate; to regenerate the certificate, run Microsoft.Azure.Cosmos.Emulator.exe /GenCert. If you run the emulator in a container, re-import the certificate from https://localhost:8081/_explorer/emulator.pem.

When to Call Microsoft Support

If you've gone through every step in this guide and still can't resolve the issue, there are specific scenarios that genuinely require Microsoft engineering involvement: persistent regional outages (check the Azure Status page first at status.azure.com), data corruption or missing documents that aren't explained by your write logic, or sudden dramatic throughput degradation with no change in your workload pattern. Before opening a support ticket, gather your Cosmos DB account name, the affected database and container, your SDK version, the exact error messages with status codes, and a 30-minute window of Azure Monitor metrics. Microsoft Support can escalate to the Cosmos DB product team with that information.

Prevention & Best Practices for Azure Cosmos DB

Most Azure Cosmos DB problems I've seen are preventable. The teams that run Cosmos DB without drama all share the same set of habits, and they're not complicated once you know what to look for.

Plan your partition key before you create the container. I can't say this enough. Changing a partition key after the fact means migrating to a new container, which is painful and risky. Your partition key should have high cardinality (many distinct values), distribute write load evenly, and appear in most of your queries as a filter predicate. Time-based fields like "date" or "hour" are tempting but almost always create hot partitions. User IDs or customer IDs are usually much better choices.

Always use managed identity or service principals, never primary keys in production. Primary keys are single points of failure. If a key leaks, you have to rotate it immediately and everything that used the old key breaks simultaneously. Managed identities don't have keys to leak, rotate automatically, and are the pattern Microsoft explicitly recommends for production workloads.

Set up Azure Monitor alerts proactively. Configure alerts for Normalized RU Consumption above 80%, Http 429 count above your threshold, and Server Side Latency above your SLA target. Getting paged before your users notice is dramatically less stressful than diagnosing incidents after the fact. Navigate to your Cosmos DB account > Alerts > New alert rule to configure these.

Test your consistency level choice before going live. Azure Cosmos DB offers five consistency levels: Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual. Reads at Strong and Bounded Staleness cost about 2x the RU charge of reads at the other levels, and Strong introduces higher latency. Most applications do fine with Session consistency. Don't default to Strong without understanding the RU cost impact at your expected query volume.

Use the built-in indexing policy wisely. Cosmos DB indexes everything by default, which is convenient but burns extra RU on every write. For write-heavy containers, explicitly excluding paths you never query can cut your write costs significantly. Navigate to your container > Scale & Settings > Indexing Policy to customize it with JSON.
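For reference, this is roughly what a policy that excludes one never-queried path looks like. I've written it as a Python dictionary and printed it as JSON so you can paste the output into the Indexing Policy editor. The /rawPayload path is hypothetical; replace it with properties your own queries never filter or sort on.

```python
import json

# Index everything except one large property that is never queried.
indexing_policy = {
    "indexingMode": "consistent",
    "automatic": True,
    "includedPaths": [{"path": "/*"}],
    "excludedPaths": [
        {"path": "/rawPayload/*"},  # hypothetical property, never used in queries
        {"path": "/\"_etag\"/?"},
    ],
}

policy_json = json.dumps(indexing_policy, indent=2)
print(policy_json)
```

Measure the write RU charge before and after the change on a test container; the saving depends heavily on how large the excluded subtree is.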

Quick Wins
  • Enable diagnostic logs (Azure Monitor) on your Cosmos DB account from day one; retroactive log enablement doesn't recover historical data during incidents.
  • Use the Azure Cosmos DB Emulator for local development to avoid burning production RUs and accumulating unexpected costs during testing.
  • Set resource-level cost alerts in Azure Cost Management to catch unexpected RU/s scaling or runaway query costs before the monthly bill arrives.
  • Review your indexing policy after your query patterns stabilize; the default "index everything" policy is rarely optimal for mature production workloads.

Frequently Asked Questions About Azure Cosmos DB

Can I use my existing Microsoft Entra ID accounts to authenticate to Azure Cosmos DB?

Yes. Azure Cosmos DB supports Microsoft Entra authentication for both control-plane and data-plane operations. Control-plane tasks like creating databases and managing accounts use standard Azure role-based access control, where you can assign built-in or custom roles to your Entra ID users and service principals. Data-plane operations (running queries, inserting and updating items) use a separate Cosmos DB-native RBAC system with its own built-in roles like "Cosmos DB Built-in Data Contributor." The critical thing to understand is that these are two independent role assignment systems. Having control-plane access does not automatically grant data-plane access, so you need to configure both separately. Once configured, your applications can authenticate using the Azure Identity SDK with DefaultAzureCredential, eliminating the need for primary keys entirely.

Does Azure Cosmos DB for NoSQL actually support SQL queries, or is it something different?

It supports SQL queries, but not standard T-SQL; it uses a custom NoSQL query language that's derived from SQL. Think of it as a meaningful subset of SQL combined with NoSQL-specific extensions built for JSON documents. You get familiar constructs like SELECT, FROM, WHERE, ORDER BY, JOIN, and OFFSET/LIMIT. What you won't find is everything from SQL Server: certain subquery patterns work differently, and the JOIN semantics are document-centric rather than relational. The query language also supports JavaScript-based user-defined functions (UDFs) for custom computation inline in queries. If you're coming from a SQL Server background, most of your basic query intuitions will transfer, but you'll hit edges if you try to port complex stored procedures or server-side logic directly.

Does Azure Cosmos DB support SQL aggregation functions like COUNT, SUM, and AVG?

Yes. The NoSQL query language supports aggregation functions including COUNT, MAX, MIN, AVG, and SUM. You use them in the same position you would in standard SQL: in the SELECT list, with optional GROUP BY clauses. The important practical consideration is that cross-partition aggregations (where the query spans multiple physical partitions) consume significantly more Request Units than single-partition aggregations. If you're running a COUNT over a massive container without a partition key filter, you'll see both high RU consumption and potentially long query execution times. Where possible, filter on your partition key before aggregating, and test your aggregation queries in the Data Explorer to see the RU charge before running them at scale in production.

Why do my Azure Cosmos DB writes fail with HTTP 412 and what does it mean?

HTTP 412 Precondition Failed means an optimistic concurrency conflict occurred: your write was rejected because the item you tried to update had already been changed by another operation since you last read it. Azure Cosmos DB uses ETags for optimistic concurrency control: every item has an _etag property that the server updates on every write. When you send an If-Match header with the ETag you read earlier, and that ETag no longer matches (because another process wrote to the item), you get 412. The correct fix is to re-read the item to get its current ETag, apply your changes, and retry the write with the fresh ETag. Most applications should implement a retry loop with exponential backoff that handles this automatically. If you're seeing 412 errors at high volume, consider using the patch operation for atomic field-level updates, which reduces the conflict window significantly.

What's the best way to bulk-insert millions of documents into Azure Cosmos DB without it failing?

For .NET applications, enable bulk execution on the CosmosClient by setting AllowBulkExecution = true in the client options. This lets the SDK batch your operations automatically and is far more efficient than individual item inserts. For Java, use the Cosmos DB bulk operations API exposed through the Java SDK. For very large datasets (hundreds of millions of records), Apache Spark with the Azure Cosmos DB Spark connector is the most scalable approach, with support for Python and Scala. Before running any large import, temporarily scale up your container's provisioned throughput to handle the write surge, and monitor for HTTP 429 throttling errors. After the import completes, scale throughput back down to avoid unnecessary cost. Also make sure every document in your import set contains the partition key field; missing partition keys cause hard failures that the SDK won't automatically retry.

Does Azure Cosmos DB cache resource links, and does my app need to handle that?

Yes. Azure Cosmos DB is a RESTful service and resource links (the URIs that identify databases, containers, and items) are immutable, which means they're safe to cache indefinitely. The Cosmos DB SDKs handle this caching automatically in most cases, so you don't need to implement it yourself. Where you do want to pay attention is when polling for updated resources: use the If-None-Match header with the ETag from your last read. If the resource hasn't changed, the server returns HTTP 304 Not Modified with no body, saving you bandwidth and RU charges. If it has changed, you get the full updated resource. The SDKs expose this pattern through their response objects, and it's particularly useful for scenarios where you're periodically checking container metadata or configuration without wanting to always pull the full document payload.

Sai Kiran Pandrala
Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.