Azure Cosmos DB Troubleshooting: Fix Every Common Error
Why Azure Cosmos DB Troubleshooting Is So Confusing
You're staring at your application logs at 2 AM. Requests are failing. Users are complaining. And Azure Cosmos DB is throwing errors that, honestly, don't tell you much beyond "something went wrong." I've seen this exact scenario play out on dozens of enterprise deployments, and the maddening part is that Cosmos DB errors almost never point at the real root cause on the surface.
Here's the thing about Azure Cosmos DB: it's a globally distributed, multi-model NoSQL database with a pricing model built entirely around Request Units (RUs). That combination means there are roughly five distinct failure categories you could be hitting at any given moment, and they all look similar from the application layer. A connection timeout, a 429 status code, a suddenly slow cross-partition query, a consistency violation: all of them can surface as generic "request failed" errors in your app if you're not logging carefully.
The most common culprit I see in production is RU throttling: HTTP 429 ("Too Many Requests"), which Cosmos DB returns when your workload burns through provisioned request units faster than the database can replenish them. This is not a bug. It's by design. Cosmos DB is a metered system, and 429s are its way of enforcing the throughput limits you've set. But if you haven't sized your RU/s correctly or your partition key is causing a hot partition, you'll hit this constantly.
Second on the list: partition key design issues. A poorly chosen partition key creates an uneven distribution of data and requests. One logical partition gets hammered while the others sit idle. The scary part is that Cosmos DB won't tell you "your partition key is bad"; it just starts throttling the hot partition while your metrics look fine at the account level. I've seen teams spend weeks on this before realizing the partition was the problem the whole time.
Then there's the SDK layer. Outdated SDK versions, missing retry policies, and incorrect connection mode settings cause connection timeouts and sporadic failures that are incredibly hard to reproduce in staging. If you're running .NET SDK v2 or Java SDK v3 without tuning the retry options, you're fighting with one hand tied behind your back.
Finally, consistency level misconfigurations bite teams working across multi-region deployments. Choosing Session consistency but then using clients that don't propagate session tokens correctly leads to reads that return stale data, and that's a category of bug that's almost impossible to catch in a single-region development environment.
The good news: every one of these problems is fixable. And the Azure Portal metrics blade gives you the data to diagnose them precisely, if you know where to look.
The Quick Fix: Try This First
Before you touch any code or configuration, open the Azure Portal and navigate directly to your Cosmos DB account. In the left menu, click Metrics (Classic) or, if you're on a newer account, go to Monitoring > Metrics. This is your single most important diagnostic screen.
Set the time range to the last 1 hour and pull up these three metrics simultaneously:
- Total Requests, filtered by Status Code 429
- Normalized RU Consumption (%), broken down by PartitionKeyRangeId
- Server Side Latency, to separate network issues from database-side issues
If your 429 count is above zero and your Normalized RU Consumption is hitting 100% on any partition key range, you've confirmed throttling. The immediate fix is to increase your provisioned throughput. Go to Settings > Scale, bump your RU/s by 25–50%, and save. The change usually takes effect within a minute or so, and the throttling should subside if under-provisioning was the cause.
If your 429 count is zero but latency is spiking, the problem is almost certainly a slow cross-partition query or an indexing issue. Jump straight to Steps 4 and 5 below.
If both metrics look normal but your app is still failing, the culprit is almost certainly the SDK configuration or a network connectivity issue between your application and the Cosmos DB endpoint. That points you to Step 3.
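The three-way triage above can be written down as a small decision function. This is only a sketch of the reasoning, not a tool that reads the Portal: the function name, its inputs, and the returned labels are all hypothetical, and you would fill in the numbers by hand from the metrics chart.

```python
def triage(throttled_429_count: int, max_normalized_ru_pct: float, latency_spiking: bool) -> str:
    """Map the three Portal readings to the next place to look (labels are illustrative)."""
    if throttled_429_count > 0 and max_normalized_ru_pct >= 100:
        return "throttling"          # raise RU/s, then check for a hot partition
    if throttled_429_count == 0 and latency_spiking:
        return "query-or-indexing"   # partition key design and indexing policy
    if throttled_429_count == 0 and not latency_spiking:
        return "sdk-or-network"      # client configuration or connectivity
    return "inconclusive"            # 429s without saturated RU: dig into the logs


print(triage(throttled_429_count=120, max_normalized_ru_pct=100.0, latency_spiking=False))
```

The "inconclusive" branch is there on purpose: 429s without any partition key range at 100% don't fit the simple picture, and that case deserves a look at the diagnostic logs rather than a guess.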
One more quick check: go to Diagnose and solve problems in the left menu of your Cosmos DB account. Click Availability and Connectivity. Azure runs automated health checks here and will often surface region-specific outages or endpoint issues that you'd otherwise spend hours hunting down manually.
Good Azure Cosmos DB troubleshooting starts with data, not guesswork. The metrics blade is your ground truth. Open the Azure Portal, navigate to your Cosmos DB account, and go to Monitoring > Metrics in the left navigation pane.
Create a custom chart with the following metrics pinned side by side:
- Total Requests split by StatusCode: immediately shows you the ratio of 200 (success) vs 429 (throttled) vs 408 (timeout) responses
- Normalized RU Consumption split by CollectionName and Region
- Physical Partition Throughput (under the "Throughput" namespace): shows hot partition skew
- Data & Index Storage: useful if you suspect an indexing explosion
If you're sending diagnostic logs to Log Analytics in resource-specific mode, you can also query the CDBDataPlaneRequests log table. This query counts throttled requests per partition and container in five-minute buckets:
CDBDataPlaneRequests
| where TimeGenerated > ago(1h)
| where StatusCode == 429
| summarize ThrottledCount = count() by PartitionId, CollectionName, bin(TimeGenerated, 5m)
| order by ThrottledCount desc
This single query has saved me hours on complex deployments. You'll see exactly which partitions are generating throttles, which tells you whether you have a hot partition problem or a global throughput sizing issue.
If a 429 carries sub-status code 3200, the request exceeded the provisioned throughput for the container or one of its partitions. If you're getting 503 (Service Unavailable) instead, the service is temporarily unreachable, usually because of a transient disruption or a regional failover.
What success looks like: Your 429 rate drops to zero, Normalized RU Consumption stays below 70% across all partition key ranges, and latency holds steady under 10ms for point reads.
Azure Cosmos DB 429 throttling is the single most common error I deal with, and it almost always has one of three causes: you're under-provisioned, you have a hot partition, or a burst of traffic is outpacing your autoscale configuration. Let's address each.
If you're under-provisioned globally: Go to your container in the Portal, click Settings > Scale & Settings, and increase the Manual throughput RU/s value. A rough starting estimate for a production read-heavy workload is: average item size in KB × reads per second × replication factor, assuming about 1 RU per KB read. For a 5 KB item read at 500 RPS with 3 replicas, that gives roughly 7,500 RU/s as a floor. Treat it as a heuristic and confirm it against the request charge your real operations report.
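To make the arithmetic in that sizing heuristic explicit, here is a minimal sketch. The function name and parameters are made up for illustration, and the formula is just the rule of thumb from the paragraph above, not an official sizing method.

```python
def estimate_read_ru_per_second(item_size_kb: float, reads_per_second: float,
                                replication_factor: int = 1) -> float:
    """Rule-of-thumb estimate: about 1 RU per KB read, scaled by the replication factor."""
    return item_size_kb * reads_per_second * replication_factor


# The example from the text: 5 KB items, 500 reads per second, 3 replicas.
print(estimate_read_ru_per_second(5, 500, 3))  # 7500
```

Measured request charges from your own workload should replace this estimate as soon as you have them.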
If autoscale is enabled but still throttling: Check whether your max RU/s ceiling is too low. Cosmos DB autoscale scales between 10% and 100% of your configured maximum. If your max is set to 4,000 RU/s, the minimum floor is 400 RU/s, and if traffic spikes before autoscale can catch up (it typically responds within seconds, not milliseconds), you'll get a burst of 429s. Consider switching to manual provisioned throughput if your workload is predictable, or raise your autoscale max.
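The 10%-to-100% relationship is easy to get wrong in your head when you're picking a ceiling, so here is a tiny sketch of it. The helper name is hypothetical; it only encodes the ratio described above.

```python
def autoscale_range(max_ru_per_second: int) -> tuple[int, int]:
    """Return the (floor, ceiling) RU/s implied by an autoscale maximum."""
    return max_ru_per_second // 10, max_ru_per_second


floor, ceiling = autoscale_range(4000)
print(floor, ceiling)  # 400 4000
```

If your steady-state traffic already sits near the ceiling, the ceiling is the number to raise.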
If it's a hot partition: Increasing RU/s won't fully help because Cosmos DB distributes throughput evenly across physical partitions. A single hot logical partition is limited to its physical partition's share. The real fix is redesigning your partition key (see Step 4). As a short-term band-aid, you can enable Burst Capacity, which lets a physical partition bank its own unused throughput during idle periods and spend it on short spikes.
Also check whether your SDK is implementing exponential backoff on 429 responses. The x-ms-retry-after-ms response header (exposed as RetryAfter on CosmosException in the .NET SDK) tells your client exactly how long to wait before retrying. If your code is ignoring this and retrying immediately, you're amplifying the problem.
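If you ever need to handle throttling yourself (for example, around a raw REST call), the shape of a well-behaved retry loop looks roughly like this. It is a generic sketch, not SDK code: ThrottledError and call_with_backoff are hypothetical names, and the exception simply stands in for whatever your client raises on a 429 along with the server's retry-after hint.

```python
import random
import time


class ThrottledError(Exception):
    """Hypothetical stand-in for a 429 response carrying a retry-after hint in ms."""

    def __init__(self, retry_after_ms=None):
        super().__init__("429 Too Many Requests")
        self.retry_after_ms = retry_after_ms


def call_with_backoff(operation, max_attempts=5, base_delay_s=0.1,
                      max_delay_s=5.0, sleep=time.sleep):
    """Run operation, waiting between throttled attempts instead of retrying at once."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ThrottledError as err:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttle to the caller
            if err.retry_after_ms is not None:
                # Prefer the server's own hint when it is present.
                delay = err.retry_after_ms / 1000.0
            else:
                # Otherwise back off exponentially, with a little jitter.
                delay = min(max_delay_s, base_delay_s * (2 ** attempt))
                delay += random.uniform(0, base_delay_s)
            sleep(delay)
```

The sleep parameter is injectable only so the loop can be exercised in tests without actually waiting. In practice the official SDKs already do this for you, so reach for a hand-rolled loop only where the SDK isn't in the path.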
What success looks like: The 429 count in your metrics drops to near zero. Server-side latency returns to single-digit milliseconds for point reads.
If your Cosmos DB metrics show healthy RU consumption and low latency but your application is still throwing timeout errors, the SDK configuration is almost always responsible. This is one of those Azure Cosmos DB troubleshooting scenarios that's particularly nasty because the database itself is perfectly fine; the problem lives entirely in your application layer.
For the .NET SDK v3, here's a properly tuned client initialization:
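using System;
using System.Collections.Generic;
using Microsoft.Azure.Cosmos;

CosmosClientOptions options = new CosmosClientOptions
{
    // Connect straight to backend nodes over TCP instead of via the Gateway
    ConnectionMode = ConnectionMode.Direct,
    RequestTimeout = TimeSpan.FromSeconds(10),
    OpenTcpConnectionTimeout = TimeSpan.FromSeconds(5),
    // Let the SDK retry 429s, waiting up to 30 seconds in total
    MaxRetryAttemptsOnRateLimitedRequests = 9,
    MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30),
    // Regions are tried in the order listed
    ApplicationPreferredRegions = new List<string>
    {
        "East US",
        "West US 2"
    }
};
CosmosClient client = new CosmosClient(connectionString, options);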
Pay close attention to ConnectionMode.Direct. This uses the TCP protocol to connect directly to Cosmos DB backend nodes instead of routing through the Gateway. For applications running inside Azure (App Service, AKS, VMs), Direct mode reduces latency significantly. For on-premises apps or apps behind strict firewalls, stick with ConnectionMode.Gateway, because Direct mode requires outbound TCP access on ports 10000–20000.
For Java SDK v4, the equivalent configuration looks like this:
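import com.azure.cosmos.CosmosAsyncClient;
import com.azure.cosmos.CosmosClientBuilder;
import com.azure.cosmos.DirectConnectionConfig;
import com.azure.cosmos.ThrottlingRetryOptions;
import java.time.Duration;

CosmosClientBuilder builder = new CosmosClientBuilder()
    .endpoint(endpoint)
    .key(key)
    .directMode(DirectConnectionConfig.getDefaultConfig())
    // Retry throttled (429) requests up to 9 times, waiting at most 30 seconds
    .throttlingRetryOptions(new ThrottlingRetryOptions()
        .setMaxRetryAttemptsOnThrottledRequests(9)
        .setMaxRetryWaitTime(Duration.ofSeconds(30)));
CosmosAsyncClient client = builder.buildAsyncClient();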
One thing I see constantly: teams creating a new CosmosClient instance for every request. Don't do this. CosmosClient is expensive to initialize because it opens and maintains a connection pool. Instantiate it once at application startup and treat it as a singleton for the lifetime of your process. In .NET, register it as a singleton in your DI container.
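The singleton idea is language-independent, so here is a minimal sketch of it in Python. ExpensiveClient is a hypothetical stand-in for a real database client, and get_client is just one way to cache it; the point is only that request handlers ask for the shared instance instead of constructing their own.

```python
from functools import lru_cache


class ExpensiveClient:
    """Hypothetical stand-in for a client that is costly to construct."""

    instances_created = 0

    def __init__(self, endpoint: str):
        ExpensiveClient.instances_created += 1
        self.endpoint = endpoint


@lru_cache(maxsize=None)
def get_client(endpoint: str) -> ExpensiveClient:
    """Build the client on first use, then hand back the same instance every time."""
    return ExpensiveClient(endpoint)


def handle_request(endpoint: str) -> ExpensiveClient:
    # Each request reuses the cached client rather than creating a new one.
    return get_client(endpoint)
```

One caveat with this particular approach: if many threads call get_client for the first time at the same moment, the constructor can run more than once before the cache is populated, so creating the client explicitly at startup is the safer variant.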
If you're seeing HTTP 408 (Request Timeout) specifically, and the Portal shows server-side latency is fine, the timeout is occurring on the client side before the server even responds. Increase RequestTimeout as a temporary measure, then investigate whether your query is simply returning too much data.
What success looks like: Your application logs stop showing timeout exceptions. P99 latency as measured by your APM tool aligns with the server-side latency shown in the Cosmos DB Portal metrics.
This is the step most teams skip, and it's the one that causes the most long-term pain. Azure Cosmos DB partition key design is not just a performance concern: a bad partition key creates structural throttling that no amount of RU/s increase will fully resolve. I've watched teams throw money at this problem for months before finally biting the bullet and redesigning.
A hot partition forms when a disproportionate number of requests target a single logical partition. Common examples:
- Using userId as a partition key when you have a handful of "power users" generating 90% of activity
- Using date or timestamp as a partition key, so all writes always go to "today"
- Using a boolean or low-cardinality field like isActive or country when most documents are from the same country
The diagnostic test: in Log Analytics, run this query to calculate partition key distribution:
CDBPartitionKeyStatistics
| where TimeGenerated > ago(1h)
| where CollectionName == "your-container-name"
| project PartitionKey, SizeKb
| order by SizeKb desc
| take 20
If the top 5 partition keys account for more than 50% of total storage, you have a skew problem.
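Once you've exported the query results, the 50% check is a one-liner. The sketch below assumes you've loaded the rows into a plain dictionary of partition key to size in KB; the function name and the sample numbers are invented for illustration.

```python
def top_n_share(sizes_kb: dict[str, float], n: int = 5) -> float:
    """Fraction of total storage held by the n largest partition keys."""
    total = sum(sizes_kb.values())
    if total == 0:
        return 0.0
    largest = sorted(sizes_kb.values(), reverse=True)[:n]
    return sum(largest) / total


# Made-up sizes: one dominant key and several small ones.
sizes = {"a": 600, "b": 100, "c": 100, "d": 50, "e": 50, "f": 50, "g": 50}
share = top_n_share(sizes)
print(f"top 5 keys hold {share:.0%} of storage; skewed: {share > 0.5}")
```

Keep in mind that the log table only reports a sample of the largest keys, so the share you compute is relative to that sample rather than the whole container.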
Synthetic partition key approach: If you can't change your data model, append a random suffix to your existing partition key to spread load. For example, instead of /userId, use /userId_suffix where suffix is a random number between 0 and 9. You then write to 10 logical partitions per user and fan out reads across all 10. This is a common pattern called partition key suffixing.
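// Write: pick one of 10 suffixes (0-9) at random
string partitionKey = $"{userId}_{random.Next(0, 10)}";

// Read: fan out point reads across all 10 suffixes. ReadItemStreamAsync
// returns a 404 response instead of throwing when a suffix has no match.
var tasks = Enumerable.Range(0, 10).Select(suffix =>
    container.ReadItemStreamAsync(id, new PartitionKey($"{userId}_{suffix}")));
ResponseMessage[] responses = await Task.WhenAll(tasks);
ResponseMessage found = responses.FirstOrDefault(r => r.IsSuccessStatusCode);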
For new containers, consider a hierarchical partition key (available in Cosmos DB for NoSQL). This lets you specify up to three levels, for example /tenantId as the first level and /userId as the second. It gives you logical grouping without sacrificing distribution.
What success looks like: The Physical Partition Throughput metric shows balanced consumption across all partition key ranges, no single range consistently above 60–70% while others are under 20%.
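That balance criterion can be checked mechanically once you've read the per-range percentages off the chart. This is a sketch under that assumption: the function name, the input dictionary, and the default thresholds (taken from the 70% and 20% figures above) are illustrative only.

```python
def find_hot_ranges(ru_pct_by_range: dict[str, float],
                    hot_pct: float = 70.0, idle_pct: float = 20.0) -> list[str]:
    """Return ranges running hot while at least one other range sits nearly idle."""
    has_idle_range = any(pct < idle_pct for pct in ru_pct_by_range.values())
    if not has_idle_range:
        return []  # uniformly busy is a sizing question, not a skew question
    return sorted(r for r, pct in ru_pct_by_range.items() if pct > hot_pct)


print(find_hot_ranges({"0": 100.0, "1": 15.0, "2": 18.0}))  # ['0']
print(find_hot_ranges({"0": 65.0, "1": 60.0}))              # []
```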
Azure Cosmos DB indexes every property on every document by default. That sounds helpful, but on write-heavy workloads or documents with large, deeply nested schemas, the default indexing policy becomes a silent RU drain. Every write has to update the index for every indexed path, and if you have 50 properties per document, that's 50 index updates on each insert or update.
Go to your container in the Portal and click Settings > Indexing Policy. You'll see JSON that looks something like this for the default policy:
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    {
      "path": "/*"
    }
  ],
  "excludedPaths": [
    {
      "path": "/\"_etag\"/?"
    }
  ]
}
The /* wildcard is the culprit. For a write-heavy workload, switch to an opt-in model where you only index the paths you actually query on:
{
  "indexingMode": "consistent",
  "automatic": true,
  "includedPaths": [
    { "path": "/userId/?" },
    { "path": "/status/?" },
    { "path": "/createdAt/?" }
  ],
  "excludedPaths": [
    { "path": "/*" }
  ]
}
This can reduce write RU cost by 40–60% on complex document schemas. I've seen this single change drop a team's monthly Cosmos DB bill by thousands of dollars.
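If you manage several containers, it can help to generate the opt-in policy from the list of properties you actually filter or sort on, rather than hand-editing JSON each time. The helper below is a hypothetical convenience, not part of any SDK; it only handles top-level scalar properties and produces the same shape as the opt-in policy shown above.

```python
import json


def opt_in_indexing_policy(queried_properties: list[str]) -> dict:
    """Build an indexing policy that indexes only the named top-level properties."""
    return {
        "indexingMode": "consistent",
        "automatic": True,
        "includedPaths": [{"path": f"/{name}/?"} for name in queried_properties],
        "excludedPaths": [{"path": "/*"}],
    }


policy = opt_in_indexing_policy(["userId", "status", "createdAt"])
print(json.dumps(policy, indent=2))
```

You can paste the printed JSON into the Indexing Policy editor in the Portal. Nested properties and arrays need their own path syntax, which this sketch deliberately leaves out.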
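For slow query performance specifically, you may need to add a composite index. Take a query that filters on one property and sorts on another, such as SELECT * FROM c WHERE c.status = 'active' ORDER BY c.createdAt DESC. Cosmos DB can serve it from single-property indexes, but often at a high RU cost. A composite index on (status, createdAt) is usually far cheaper, provided the ORDER BY lists both properties in index order (ORDER BY c.status ASC, c.createdAt DESC):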
{
  "compositeIndexes": [
    [
      { "path": "/status", "order": "ascending" },
      { "path": "/createdAt", "order": "descending" }
    ]
  ]
}
Without this composite index, Cosmos DB has to read and sort far more index entries to satisfy the ORDER BY, which is both slow and RU-expensive. On large containers, adding it can drop query cost from thousands of RUs to tens of RUs on the same dataset.
Also check: if indexingMode is set to "none", only point reads by document ID are possible. All queries will fail with a 400 error unless you pass the partition key explicitly. This sometimes happens after a misconfigured ARM template deployment.
What success looks like: Write RU cost per document drops noticeably in your metrics. Slow queries that previously showed 10,000+ RU charges in the Query Stats pane now run in under 100 RUs. The query execution time drops from seconds to milliseconds.
Advanced Azure Cosmos DB Troubleshooting
Diagnosing Multi-Region Replication Lag and Consistency Issues
If you're running a multi-region Cosmos DB deployment and reads are returning stale data, the problem is almost always a consistency level mismatch or a session token propagation failure. Cosmos DB offers five consistency levels (Strong, Bounded Staleness, Session, Consistent Prefix, and Eventual), and the one you choose has direct consequences on what you read after a write.
The most common mistake: teams configure Session consistency expecting read-your-own-writes behavior, but then use multiple CosmosClient instances (or stateless serverless functions) that don't share a session token. Session consistency guarantees are scoped to a single client session. If your write goes through Client A and your read goes through Client B, Client B has no session token from Client A's write, so it might return stale data perfectly legally within the Session consistency model.
Fix: capture the session token from each write response and pass it explicitly to subsequent reads:
// After a write:
ItemResponse<MyItem> writeResponse = await container.CreateItemAsync(item);
string sessionToken = writeResponse.Headers.Session;

// On the subsequent read, pass the session token:
ItemRequestOptions readOptions = new ItemRequestOptions { SessionToken = sessionToken };
ItemResponse<MyItem> readResponse = await container.ReadItemAsync<MyItem>(
    id, partitionKey, readOptions);
Network-Level Connectivity
In enterprise environments, I frequently see Cosmos DB connection failures caused by network security groups (NSGs) or Azure Firewall rules blocking outbound TCP traffic on ports 10000–20000 (required for Direct mode) or HTTPS on port 443 (required for Gateway mode). If you've recently migrated your app to a new virtual network or tightened your NSG rules, check this first.
Run this from your application host to verify that the account's gateway endpoint is reachable:
Test-NetConnection -ComputerName your-account.documents.azure.com -Port 443
Direct mode connects to backend nodes whose addresses and ports (within 10000–20000) are discovered at runtime, so a single-port probe can't confirm it. If port 443 succeeds but Direct mode requests still time out, switch to Gateway mode temporarily and file a ticket with your network team to open the Direct mode port range.
Azure Monitor Alert Rules
For ongoing production monitoring, set up Azure Monitor alert rules on your Cosmos DB account. Navigate to Monitoring > Alerts > Create alert rule and configure alerts for:
- Normalized RU Consumption > 80% for more than 5 minutes
- Total Requests with StatusCode = 429 > 100 per minute
- Server Side Latency > 50ms average over 5 minutes
- Replication Latency > 500ms (for multi-region accounts)
For deeper Log Analytics queries, enable Diagnostic Settings on your Cosmos DB account (under Monitoring > Diagnostic settings) and stream CDBDataPlaneRequests, CDBQueryRuntimeStatistics, and CDBPartitionKeyStatistics to a Log Analytics workspace. The CDBQueryRuntimeStatistics log is particularly valuable: it records the text of each query, which you can join to CDBDataPlaneRequests on ActivityId to see that query's RU charge and duration.
Handling Cosmos DB Service Unavailability (503 Errors)
HTTP 503 responses from Cosmos DB typically indicate a transient service disruption or a regional failover in progress. The Cosmos DB SDK handles these automatically with built-in retries, but if you're seeing sustained 503s, check the Azure Service Health dashboard for your region. Also verify that your failover priority settings are correctly configured: go to Settings > Replicate data globally and confirm your read regions and failover priorities are set intentionally.
Escalate to Microsoft Support if you're seeing: consistent 503 errors that aren't on the Azure Status page; data loss or corruption that isn't explained by your consistency level; RU charges that don't match what your workload should generate (possible billing anomaly); or if a container's data has become inaccessible despite the account showing as healthy. Open a support ticket with Severity A or B depending on production impact, and include your Cosmos DB account resource ID, the affected container name, and a 24-hour export from your diagnostic logs.
Prevention & Best Practices for Azure Cosmos DB
The best Azure Cosmos DB troubleshooting is the kind you never have to do. After working through these issues with dozens of teams, here are the practices that reliably prevent the most common problems from ever appearing in production.
Design your partition key before you write a single line of code. The partition key decision is the hardest thing to change later. Migrating data to a new partition key in a production container is genuinely painful and requires Azure Data Factory or a custom migration script built on the change feed. Get it right upfront. Target high cardinality (thousands of unique values, not dozens) and even distribution of reads and writes, and consider future growth patterns. A good rule: if your top 5 partition key values account for more than 20% of your traffic, the key is too low-cardinality.
Enable autoscale for unpredictable workloads, manual throughput for predictable ones. Autoscale adds overhead on extremely latency-sensitive workloads because the scale-up isn't instantaneous. If you have a scheduled batch job that hits the database hard at 3 AM and then goes quiet, manual throughput with a script that adjusts RU/s on a schedule is often more cost-effective than autoscale.
Use the Cosmos DB Emulator for local development. The emulator (available for Windows, Docker on Linux/Mac) runs locally and gives you a full Cosmos DB environment without burning real Azure budget during development. It also surfaces partition key and indexing issues early, before they reach production. Download it from the Microsoft docs site and wire your local connection string to https://localhost:8081.
Implement the circuit breaker pattern in your application. If Cosmos DB starts returning 429s, a naive retry loop amplifies the problem by sending even more requests. A circuit breaker opens after a threshold of consecutive failures and stops sending requests entirely for a cool-down period. The Polly library in .NET makes this straightforward to implement alongside the built-in SDK retries.
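In .NET you'd reach for Polly, but the mechanics are simple enough to show in a few lines. The class below is a bare-bones illustration of the pattern, not a production implementation: the names, the thresholds, and the injectable clock are all choices made for this sketch.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._consecutive_failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.cooldown_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: fall through and allow one trial call.
        try:
            result = operation()
        except Exception:
            self._consecutive_failures += 1
            trial_failed = self._opened_at is not None
            if trial_failed or self._consecutive_failures >= self.failure_threshold:
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and resets the failure count.
        self._consecutive_failures = 0
        self._opened_at = None
        return result
```

Wrap your database calls as breaker.call(lambda: do_query()) and treat CircuitOpenError as a signal to serve a fallback or fail fast. A real deployment would also want thread safety and a narrower definition of what counts as a failure than "any exception".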
Monitor your index storage growth. If you're storing binary blobs, large text fields, or Base64-encoded data in Cosmos DB documents and indexing them, your index can grow larger than your actual data, driving up storage costs and write latency. Set exclusion paths for any property that will never appear in a WHERE clause.
- Set up Azure Monitor alerts for 429 rate and Normalized RU Consumption > 80%, so you know before your users do
- Use the Cosmos DB Change Feed to offload heavy analytical queries to a separate read replica or Azure Synapse Analytics instead of hammering your production container
- Pin your SDK version in your dependency manifest and review the Cosmos DB SDK changelog before upgrading, because major versions sometimes change default connection behavior
- Enable Time-to-Live (TTL) on containers that hold transient data like sessions or cache entries, so expired documents are deleted automatically without consuming your write RU budget
Frequently Asked Questions
Why does my Cosmos DB keep throwing 429 errors even after I increased the RU/s?
Increasing RU/s helps with global throttling, but if you have a hot partition, the extra throughput gets distributed evenly across all physical partitions, so the hot one still hits its ceiling. Check the Normalized RU Consumption metric broken down by PartitionKeyRangeId. If one range is at 100% while others are at 20%, you have a hot partition problem that requires a partition key redesign, not more RU/s. As a short-term fix, look into enabling Burst Capacity on your account, which lets a partition spend throughput it left unused during earlier idle periods to absorb short spikes.
What's the difference between 408 and 429 errors in Azure Cosmos DB?
These are genuinely different problems. HTTP 429 means you've exceeded your provisioned request units, so the database is throttling you on purpose. HTTP 408 is a request timeout, which means the request didn't complete within the allowed time window (the default depends on the SDK and version). A 408 can happen because a query is running too long, the server is under unusual load, or there's a network issue between your client and the Cosmos DB endpoint. Check Server Side Latency in the Portal metrics first. If that's normal, the timeout is client-side and you should look at SDK configuration and query efficiency.
My Cosmos DB reads are returning old data even though I just wrote new data, what's wrong?
This is almost always a consistency level issue. If your account is set to Session, Consistent Prefix, or Eventual consistency, reads are not guaranteed to reflect the most recent write immediately or across different client instances. The most common culprit: you're using Session consistency but your serverless functions or containers are creating new CosmosClient instances on each invocation, which means each instance has its own session token, so reads on a fresh client don't "know" about writes from a previous client. Either switch to Strong or Bounded Staleness consistency for scenarios that require read-your-own-writes, or propagate session tokens manually between your write and read operations.
How do I find out which queries are costing the most RUs?
Enable the CDBQueryRuntimeStatistics and CDBDataPlaneRequests diagnostic logs on your Cosmos DB account and stream them to a Log Analytics workspace. The query text lives in the first table and the RU charge in the second, so join them on ActivityId: CDBDataPlaneRequests | project ActivityId, RequestCharge | join kind=inner (CDBQueryRuntimeStatistics | project ActivityId, QueryText) on ActivityId | summarize TotalRU = sum(todouble(RequestCharge)) by QueryText | order by TotalRU desc | take 10. This surfaces your ten most RU-expensive query patterns. For an individual query, compare the retrieved document count with the output document count in its query metrics (the Query Stats pane in Data Explorer shows both). That ratio tells you how many documents the database scanned versus how many it returned, and a high ratio usually means a missing index or a cross-partition scan that could be avoided with a better partition key or composite index.
Can I change the partition key on an existing Cosmos DB container without losing data?
Not directly. The partition key is immutable once a container is created, so you have to migrate your data to a new container with the correct partition key. The standard approach is to use the Cosmos DB Change Feed to stream documents from the old container to the new one while both are live. Azure Data Factory has a built-in Cosmos DB connector that makes this feasible at scale without writing custom migration code. Plan for a dual-write period where your application writes to both containers, then cut over reads to the new container after verifying consistency, and finally decommission the old container. This process is disruptive but necessary, and it's one more reason to get the partition key right at design time.
Is it safe to use the Cosmos DB free tier for production?
The free tier gives you 1,000 RU/s and 25 GB of storage at no cost, per subscription, not per account. For genuine production workloads, 1,000 RU/s is almost certainly not enough; a single complex query can cost hundreds of RUs, and under any meaningful traffic load you'll hit throttling immediately. I'd use the free tier for development environments, proof-of-concept work, and very low-traffic internal tools. For anything user-facing or revenue-critical, size your RU/s properly based on expected workload. Start with the Cosmos DB capacity calculator in the Azure Portal under your account's Settings > Scale blade, which lets you model your expected document sizes, reads, writes, and queries to get a realistic RU/s estimate.