Azure Managed Instance for Apache Cassandra: Fix Guide

Microsoft Fix Intermediate 14 min read Official Docs Grounded Updated April 20, 2026

Why This Is Happening

I've seen this pattern more times than I can count: a team gets excited about Azure Managed Instance for Apache Cassandra, spins up their first cluster, and then hits a wall. Nodes won't join the ring, hybrid connectivity refuses to stabilize, or scaling operations hang without explanation. The error messages you get from the Azure portal are often vague, and Cassandra's own logs can feel like reading tea leaves if you're not already deep in the ecosystem.

Here's the thing. Azure Managed Instance for Apache Cassandra is a fully managed service, which means Microsoft's platform is doing a lot of heavy lifting underneath: VM scale set provisioning, automatic patching, certificate rotation, snapshot backups, and weekly maintenance runs using nodetool repair via a tool called Reaper. But "fully managed" doesn't mean "zero configuration." The service gives you genuine control over configurations to accommodate specific workload needs, and that flexibility is exactly where most problems sneak in.

The most common failure points I see break down into a few distinct categories:

  • Network misconfiguration: The service deploys datacenters as virtual machine scale sets inside your Azure Virtual Network. If your VNet subnets, NSG rules, or private DNS zones aren't set up correctly before you try to provision, the cluster creation will fail, sometimes silently, sometimes with a generic ARM deployment error that doesn't point you to the actual cause.
  • Hybrid ring membership failures: When you're connecting an existing on-premises Cassandra ring to Azure via ExpressRoute, the new managed nodes need to be able to gossip with your existing nodes. Routing issues at the ExpressRoute level, wrong seed node configuration, or firewall rules blocking port 7000 (inter-node communication) will prevent the hybrid datacenter from ever joining your ring properly.
  • Compaction strategy mismatches: Cassandra's compaction behavior is driven by the strategy you set at the table level using CQL. The managed service's weekly repair cycle compares replicas using Merkle trees, but it does not correct a poor compaction choice. If your table-level strategy is misconfigured for your write patterns, you'll end up with runaway disk usage or read amplification that looks like a platform problem but is actually a configuration problem.
  • Version assumption errors: The service currently supports Cassandra versions 3.11 and 4.0. If you're expecting a newer version to be available or you didn't specify the version during CLI deployment, you may end up with a version that doesn't match your application's driver expectations.
  • Scaling confusion: Scaling in the managed instance is intentionally hands-off. You specify the node count, and the orchestrator handles ring membership. But if you trigger a scaling operation while a patch cycle is running, or while Reaper is mid-repair, the operation can queue in unexpected ways and look hung.

I know this is frustrating, especially when these problems are blocking a production migration. The good news is that almost every issue I've listed has a clear fix path once you know what to look at. Let's get into it.

The Quick Fix: Try This First

If your Azure Managed Instance for Apache Cassandra cluster is failing to provision, the single most impactful thing you can do right now is verify your Virtual Network configuration before touching anything else. This resolves the majority of first-time provisioning failures.

Open the Azure portal, navigate to Virtual Networks → [your VNet] → Subnets, and confirm the following:

  1. The subnet delegated to the managed Cassandra service has at least a /24 address space (256 addresses). Smaller subnets will cause VM scale set provisioning to fail when the cluster tries to allocate addresses for each node.
  2. The subnet has Microsoft.AzureCosmosDB/cassandraClusters listed as a service delegation. Without this delegation, the platform cannot inject the necessary management agents into your VNet.
  3. Your Network Security Group (NSG) allows inbound and outbound on ports 7000 (inter-node gossip), 7001 (TLS inter-node), 9042 (CQL native), and 9142 (TLS CQL). Blocking any of these is the single most common silent failure in hybrid setups.
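
The subnet-size check in item 1 is easy to script before you open the portal. Here is a minimal Python sketch, assuming you already know the subnet's CIDR; the function names and the example ranges are mine, not part of any Azure tooling. It relies on the fact that Azure reserves five addresses in every subnet.

```python
import ipaddress

# Azure reserves 5 addresses in every subnet (network, broadcast, 3 platform IPs).
AZURE_RESERVED_IPS = 5


def usable_ips(cidr: str) -> int:
    """Return the number of addresses left for nodes in an Azure subnet."""
    network = ipaddress.ip_network(cidr, strict=True)
    return max(network.num_addresses - AZURE_RESERVED_IPS, 0)


def meets_minimum(cidr: str, minimum_prefix: int = 24) -> bool:
    """True if the subnet is at least as large as a /<minimum_prefix>."""
    return ipaddress.ip_network(cidr, strict=True).prefixlen <= minimum_prefix


# Example ranges are placeholders, not values from a real deployment.
for cidr in ("10.1.0.0/24", "10.1.1.0/27"):
    verdict = "OK" if meets_minimum(cidr) else "too small"
    print(cidr, usable_ips(cidr), verdict)
```

A /24 leaves 251 usable addresses; a /27 leaves 27 and fails the check.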

Once you've confirmed the VNet is correct, go to Azure Managed Instance for Apache Cassandra → [your cluster] → Overview and check the Provisioning State. If it still shows Failed, click into the cluster resource and look at the Activity Log, filtered to "Failed" operations. The ARM operation details in that log will tell you the actual error, which is almost always more specific than what the main portal blade shows.

If the cluster is provisioned but nodes aren't healthy, run the following Azure CLI command to get node-level status:

az managed-cassandra cluster invoke-command \
  --resource-group <your-rg> \
  --cluster-name <your-cluster> \
  --host <node-ip> \
  --command-name "nodetool" \
  --arguments "status"=""

The output shows the Cassandra ring from that node's perspective. Look for nodes in DN (Down/Normal) or DL (Down/Leaving) status. Any node showing DN is one the ring knows about but can't reach, which points to a network or VM health issue rather than a Cassandra configuration issue.
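
If you're checking more than a handful of nodes, it helps to pull the down nodes out of the output programmatically. This is a rough Python sketch that assumes you have the plain nodetool status text in hand (the CLI may wrap it in its own response format, so you may need to unwrap it first). The sample output below is made up for illustration.

```python
import re

# Status letter (U/D) followed by state letter (N/L/J/M), then the node address.
NODE_LINE = re.compile(r"^([UD][NLJM])\s+(\S+)")


def down_nodes(status_output: str) -> list[str]:
    """Return addresses of nodes whose status column starts with D (down)."""
    down = []
    for line in status_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if match and match.group(1).startswith("D"):
            down.append(match.group(2))
    return down


# Hypothetical sample in the usual nodetool status layout.
SAMPLE = """\
Datacenter: dc1
===============
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address    Load      Tokens  Owns   Host ID   Rack
UN  10.0.1.4   1.1 GiB   16      33.3%  aaaa-1    rack1
DN  10.0.1.5   1.0 GiB   16      33.4%  aaaa-2    rack2
UN  10.0.1.6   1.2 GiB   16      33.3%  aaaa-3    rack3
"""

print(down_nodes(SAMPLE))
```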

Pro Tip
When you're looking at provisioning failures in the Activity Log, always expand the "Raw JSON" view of the failed operation rather than reading the portal's human-readable summary. The message field inside the JSON error object almost always contains the actual error reason, while the portal UI often only surfaces the top-level status code like DeploymentFailed, which tells you nothing actionable on its own.
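
Once you've copied the Raw JSON out of the portal, a few lines of Python will flatten the nested error so the inner message is easy to spot. This sketch assumes the common ARM error shape (an error object with code, message, and an optional details list). The payload here is invented for illustration, and real payloads may nest differently.

```python
import json


def error_messages(error: dict, depth: int = 0) -> list[tuple[int, str, str]]:
    """Flatten a nested ARM-style error into (depth, code, message) tuples."""
    found = [(depth, error.get("code", ""), error.get("message", ""))]
    for detail in error.get("details") or []:
        found.extend(error_messages(detail, depth + 1))
    return found


# Hypothetical payload; the inner code and message are placeholders.
RAW = json.loads("""
{
  "error": {
    "code": "DeploymentFailed",
    "message": "At least one resource deployment operation failed.",
    "details": [
      {"code": "ExampleInnerCode", "message": "Hypothetical inner reason."}
    ]
  }
}
""")

for depth, code, message in error_messages(RAW["error"]):
    print("  " * depth + f"{code}: {message}")
```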

Step 1: Validate Virtual Network Prerequisites Before Cluster Creation

The Azure Managed Instance for Apache Cassandra service places managed datacenters, deployed as virtual machine scale sets, directly into your Azure Virtual Network. This is not optional architecture; it's how the service works. Getting the VNet wrong before you create a cluster means debugging provisioning failures after the fact, which is much harder than checking upfront.

Navigate to Azure Portal → Virtual Networks → [your target VNet] → Subnets. Select the subnet you plan to use for Cassandra nodes. In the subnet configuration blade:

  • Under Subnet delegation, set the delegation to Microsoft.AzureCosmosDB/cassandraClusters. This is required: the service cannot manage the VMs without this delegation in place.
  • Confirm the address range. Each Cassandra node will consume one IP address in this subnet, plus additional IPs for Azure management infrastructure. A /24 is the documented safe minimum for production clusters.
  • If you have a private DNS zone attached to this VNet, verify it's not conflicting with the internal DNS entries the managed service creates for node-to-node communication.

For the NSG attached to this subnet, open Network Security Groups → [your NSG] → Inbound Security Rules and confirm you have rules permitting traffic on ports 7000, 7001, 7199 (JMX), 9042, 9142, and 9443 (metrics, the Metric Collector endpoint that Prometheus scrapes). For hybrid setups where your on-premises nodes need to reach Azure nodes, these same ports need to be open across your ExpressRoute circuit as well.
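
Eyeballing NSG rules for gaps gets error-prone once you have ranges and wildcards in the mix. Here's a small Python sketch that checks a required port list against NSG-style destination port ranges ("*", a single port, or "low-high"). It assumes you've already exported the allowed ranges as strings; the function names are mine, and the required list covers only the Cassandra ports, so add whatever metrics port your deployment uses.

```python
def port_allowed(port: int, ranges: list[str]) -> bool:
    """True if `port` is covered by any range such as "*", "9042" or "7000-7001"."""
    for item in ranges:
        item = item.strip()
        if item == "*":
            return True
        low, _, high = item.partition("-")
        if int(low) <= port <= int(high or low):
            return True
    return False


def missing_ports(required: list[int], ranges: list[str]) -> list[int]:
    """Return the required ports that no allow rule covers."""
    return [p for p in required if not port_allowed(p, ranges)]


REQUIRED = [7000, 7001, 7199, 9042, 9142]

# Example rule set that forgot JMX and TLS CQL.
print(missing_ports(REQUIRED, ["7000-7001", "9042"]))  # [7199, 9142]
```

This only checks port coverage. It does not evaluate rule priority, deny rules, or source address filters, so treat an empty result as "worth testing", not "proven open".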

If everything looks correct here, you should be able to proceed to cluster creation and see the provisioning state reach Succeeded within about 15–25 minutes depending on node count. If it fails after this validation, pull the Activity Log immediately, because the window for detailed error messages in ARM is limited before they roll off.

Step 2: Create Your Cluster with the Correct Cassandra Version Specified

One of the most common gotchas with Azure Managed Instance for Apache Cassandra is not explicitly specifying the Cassandra version during deployment, especially via the Azure CLI. The service supports Cassandra 3.11 and 4.0, both of which are generally available. If you don't specify a version, you may not get the one your application drivers expect, and Cassandra driver version mismatches produce some genuinely confusing errors at the application layer.

To create a cluster via the Azure CLI with the version explicitly set, use this command structure:

az managed-cassandra cluster create \
  --resource-group <your-rg> \
  --cluster-name <cluster-name> \
  --location <azure-region> \
  --delegated-management-subnet-id <subnet-resource-id> \
  --cassandra-version "4.0" \
  --initial-cassandra-admin-password <your-password>

The --cassandra-version flag accepts either "3.11" or "4.0". Versions follow an X.Y.Z scheme. Major (X) and minor (Y) upgrades are never applied automatically; you control them manually via service tooling, so treat the version you pick at creation as a deliberate choice. Patch-level (Z) updates within a given major/minor are applied automatically when security vulnerabilities are identified.
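
To make the X.Y.Z distinction concrete, here's a tiny Python sketch that classifies a version change according to the rule above: major and minor changes are yours to drive, patch changes arrive automatically. The function names and labels are illustrative, not part of any Azure API.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Split "X.Y" or "X.Y.Z" into integers, treating a missing Z as 0."""
    parts = [int(p) for p in version.split(".")]
    parts += [0] * (3 - len(parts))
    return parts[0], parts[1], parts[2]


def upgrade_kind(current: str, target: str) -> str:
    """Label a version change as major, minor, patch, or none."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt[0] != cur[0]:
        return "major (manual)"
    if tgt[1] != cur[1]:
        return "minor (manual)"
    if tgt[2] != cur[2]:
        return "patch (automatic)"
    return "none"


print(upgrade_kind("3.11", "4.0"))     # major (manual)
print(upgrade_kind("4.0.1", "4.0.2"))  # patch (automatic)
```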

After cluster creation, add a datacenter using:

az managed-cassandra datacenter create \
  --resource-group <your-rg> \
  --cluster-name <cluster-name> \
  --data-center-name <dc-name> \
  --data-center-location <azure-region> \
  --node-count 3 \
  --delegated-subnet-id <subnet-resource-id> \
  --sku Standard_DS14_v2

When the datacenter reaches Provisioning State: Succeeded, you'll see the node count reflected in the cluster's datacenter list. Three nodes with a replication factor of 3 is your minimum viable starting point, and per the official documentation, it's the minimum required to survive a patching cycle without availability impact.

Step 3: Configure Hybrid Connectivity via ExpressRoute

If you're running an existing on-premises Apache Cassandra ring and want to extend it into Azure, which is exactly the hybrid scenario this service is designed for, the connectivity setup has to be right before you add any Azure-managed datacenters to the ring. Trying to add a managed datacenter to a ring that can't reach it will result in nodes that never transition from joining status in nodetool status, which is one of the more frustrating states to debug.

First, your Azure ExpressRoute circuit needs to be provisioned and connected to the VNet where your managed Cassandra nodes will live. In the Azure portal, navigate to ExpressRoute → [your circuit] → Peerings and confirm that private peering is configured and the circuit shows Provider Status: Provisioned.

Once ExpressRoute connectivity is established, the key configuration step is giving the managed service your existing on-premises seed nodes, so it can bootstrap new nodes into the correct ring. In the Azure CLI, --external-seed-nodes is a cluster-level parameter that takes a space-separated list of IP addresses. Set it on the cluster, then add the managed datacenter:

az managed-cassandra cluster update \
  --resource-group <your-rg> \
  --cluster-name <cluster-name> \
  --external-seed-nodes 10.0.0.1 10.0.0.2

az managed-cassandra datacenter create \
  --resource-group <your-rg> \
  --cluster-name <cluster-name> \
  --data-center-name "azure-dc-east" \
  --data-center-location eastus \
  --node-count 3 \
  --delegated-subnet-id <subnet-resource-id>

The --external-seed-nodes parameter accepts the IP addresses of your existing on-premises Cassandra seed nodes. Without this, the managed nodes will bootstrap into an isolated ring rather than joining your existing cluster. After the managed datacenter provisions, verify ring membership from both sides: run nodetool status on an on-premises node and confirm the Azure-managed nodes appear in the UN (Up/Normal) state. If they're stuck in UJ (Up/Joining), check firewall rules on both sides for port 7000 and verify that your on-premises nodes can reach the managed nodes' private IPs within the Azure VNet.
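
When nodes sit in UJ, the quickest sanity check is a plain TCP connect to port 7000 from a machine on each side. Below is a minimal Python sketch using only the standard library; the helper names are mine. A successful connect only proves the port is reachable. It says nothing about TLS, gossip, or seed configuration.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def unreachable(hosts: list[str], port: int = 7000) -> list[str]:
    """Return the hosts that refused or timed out on the given port."""
    return [h for h in hosts if not can_connect(h, port)]
```

Run it from an on-premises host with the managed nodes' private IPs, for example unreachable(["10.1.0.4", "10.1.0.5"]), and again from the Azure side with your seed node IPs. Those addresses are placeholders.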

Step 4: Fix Compaction Strategy Errors That Cause Disk Exhaustion

This one catches teams off guard because it looks like an infrastructure problem (disk usage climbing, read latencies spiking), but the root cause is almost always at the Cassandra table configuration level. The managed service's weekly Reaper repair compares replicas using Merkle trees, but that's separate from the table-level compaction strategy that drives how SSTables are merged during normal write operations.

If you're seeing disk usage grow faster than your data volume would explain, or if read latencies are elevated and climbing, the first thing to check is whether your compaction strategy matches your workload's write pattern. Connect to your cluster via CQL and check the current strategy on your busiest tables:

SELECT table_name, compaction
FROM system_schema.tables
WHERE keyspace_name = '<your_keyspace>';

The three most common strategy mismatches I see:

  • SizeTieredCompactionStrategy (STCS) on a time-series write-heavy workload: STCS accumulates SSTables until size thresholds are hit. For write-heavy workloads, this means SSTables pile up, read performance degrades, and disk usage climbs before compaction kicks in. Switch to TimeWindowCompactionStrategy (TWCS) for time-series data.
  • LeveledCompactionStrategy (LCS) on a bulk-ingest workload: LCS is great for read-heavy workloads but creates significant write amplification on bulk ingestion. If you're ingesting heavy data streams, LCS will saturate your disk I/O.
  • No compaction tuning at all: Default settings rarely match production workloads. Always review compaction configuration before going live.

To change the compaction strategy on an existing table without downtime:

ALTER TABLE <keyspace>.<table>
WITH compaction = {
  'class': 'TimeWindowCompactionStrategy',
  'compaction_window_unit': 'DAYS',
  'compaction_window_size': 1
};

The official documentation explicitly warns against running manual compactions outside your configured strategy. The managed service's repair mechanism is designed to work with, not around, your table-level compaction settings. After changing the strategy, monitor disk usage and read latencies over the next 24–48 hours to confirm improvement.
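
If you need to apply the same TWCS change across many tables, generating the statements beats hand-editing them. Here's a small Python sketch that builds the ALTER TABLE text and rejects obviously bad window settings. The keyspace and table names are hypothetical, and it only produces a string; you still run it through cqlsh or your driver yourself.

```python
VALID_UNITS = {"MINUTES", "HOURS", "DAYS"}


def twcs_alter_statement(keyspace: str, table: str,
                         unit: str = "DAYS", size: int = 1) -> str:
    """Build an ALTER TABLE statement that switches a table to TWCS."""
    unit = unit.upper()
    if unit not in VALID_UNITS:
        raise ValueError(f"unsupported window unit: {unit}")
    if size < 1:
        raise ValueError("window size must be a positive integer")
    return (
        f"ALTER TABLE {keyspace}.{table} WITH compaction = {{"
        f"'class': 'TimeWindowCompactionStrategy', "
        f"'compaction_window_unit': '{unit}', "
        f"'compaction_window_size': {size}}};"
    )


# Hypothetical keyspace and table names.
print(twcs_alter_statement("metrics", "readings", "HOURS", 6))
```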

Step 5: Set Up Metrics Monitoring with Prometheus and Grafana

Flying blind on a managed Cassandra cluster is a bad idea. The service emits detailed metrics from each datacenter node using Metric Collector for Apache Cassandra, and those metrics are accessible via Prometheus. If you're not collecting them, you won't know about resource pressure (CPU saturation, disk usage approaching limits, quorum loss risk) until it's already affecting your application.

The service is also integrated directly with Azure Monitor for metrics and diagnostic logging, which gives you two paths to observability. For most teams, I recommend setting up Azure Monitor first (it requires no additional infrastructure) and then adding Prometheus/Grafana if you need deeper Cassandra-specific metrics.

To enable Azure Monitor diagnostic logging for your cluster, go to Azure Portal → Managed Instance for Apache Cassandra → [your cluster] → Diagnostic Settings → Add Diagnostic Setting. Select the following log categories:

  • CassandraLogs: captures Cassandra system logs including compaction events, repair operations, and node state changes
  • CassandraAudit: captures authentication events and CQL query audit trails (important for compliance)

Send these to a Log Analytics Workspace. Once flowing, you can query them in Azure Monitor Logs using KQL. For example, to find recent node state changes:

CassandraLogs
| where TimeGenerated > ago(24h)
| where Message contains "STATE"
| project TimeGenerated, Message, Computer
| order by TimeGenerated desc

For Prometheus integration, the Metric Collector agent running on each node exposes metrics on port 9443 by default. Point your Prometheus scrape config at each node's private IP on that port. The official Grafana dashboards for Cassandra work directly with these metrics once they're flowing into Prometheus. Navigate to Grafana → Dashboards → Import and use dashboard ID 17871 (Cassandra Overview) as a starting point. It surfaces the read/write latency histograms, compaction metrics, and GC pause data you need to spot problems before they escalate.
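
Writing a static target list by hand gets tedious as the node count changes. This Python sketch renders a minimal Prometheus scrape_configs entry for a list of node IPs on port 9443. The job name, the https scheme, and the example IPs are my assumptions; check them against your endpoint and add whatever TLS settings your setup needs.

```python
def scrape_config(node_ips: list[str], port: int = 9443,
                  job: str = "cassandra") -> str:
    """Render a minimal Prometheus scrape_configs entry as YAML text."""
    lines = [
        "scrape_configs:",
        f"  - job_name: {job}",
        "    scheme: https  # assumption: adjust to match your endpoint",
        "    static_configs:",
        "      - targets:",
    ]
    lines += [f"          - '{ip}:{port}'" for ip in node_ips]
    return "\n".join(lines) + "\n"


# Placeholder node IPs.
print(scrape_config(["10.0.1.4", "10.0.1.5", "10.0.1.6"]))
```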

Advanced Troubleshooting

Investigating Node Health Failures

The managed service actively monitors each node's membership in the Cassandra ring, and it autodetects and tries to mitigate infrastructure issues such as VM failures, network problems, storage issues, and OS-level failures. But there are situations where automatic mitigation isn't sufficient and you need to dig in manually.

When a node shows as down in nodetool status, the first place to look is Azure Portal → Managed Instance for Apache Cassandra → [cluster] → Data Centers → [datacenter] → Nodes. The portal shows per-node health status. If a node shows Unhealthy at the VM level, this is a platform issue. Open a support case immediately, because this falls under the service SLA.

If the VM is healthy but the Cassandra process is down, check the node's system logs via Azure Monitor:

CassandraLogs
| where Computer == "<node-hostname>"
| where TimeGenerated > ago(6h)
| order by TimeGenerated desc
| take 100

Look for java.lang.OutOfMemoryError, GCInspector warnings about long GC pauses, or CorruptSSTableException entries. These indicate JVM heap pressure or SSTable corruption, both of which are configuration-level issues (heap sizing, compaction strategy) rather than platform failures.
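
If you export those log lines, a quick tally of the three signatures tells you which problem dominates. Here's a minimal Python sketch; the labels are mine and the sample lines are invented, so don't read them as real Cassandra log output.

```python
SIGNATURES = {
    "java.lang.OutOfMemoryError": "JVM heap exhausted",
    "GCInspector": "long GC pause reported",
    "CorruptSSTableException": "SSTable corruption",
}


def classify(lines: list[str]) -> dict[str, int]:
    """Count how many log lines contain each known failure signature."""
    counts = {label: 0 for label in SIGNATURES.values()}
    for line in lines:
        for needle, label in SIGNATURES.items():
            if needle in line:
                counts[label] += 1
    return counts


# Made-up sample lines for illustration only.
SAMPLE_LINES = [
    "WARN  GCInspector.java - ParNew GC in 1250ms",
    "ERROR java.lang.OutOfMemoryError: Java heap space",
    "INFO  Compacted 4 sstables",
    "WARN  GCInspector.java - ConcurrentMarkSweep GC in 3100ms",
]

print(classify(SAMPLE_LINES))
```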

LDAP Authentication Problems

If you've enabled LDAP authentication on your managed Cassandra cluster and users are getting authentication errors, the most common cause is a mismatch between the LDAP search base DN configuration and your actual directory structure. LDAP settings are applied per datacenter (through the --ldap-* options on az managed-cassandra datacenter update), so inspect the datacenter resource to verify what is actually in effect:

az managed-cassandra datacenter show \
  --resource-group <your-rg> \
  --cluster-name <cluster-name> \
  --data-center-name <dc-name> \
  --query "properties.authenticationMethodLdapProperties"

Confirm that the search base DN, server hostname, and search filter template are all correctly populated. A misconfigured search filter template is a frequent culprit: it needs to contain {0} as the placeholder for the username being authenticated.

Reaper Conflicts with Custom Repair Services

The managed service runs nodetool repair through Reaper once per week. If you're running a hybrid deployment and you have your own repair service running on the on-premises side, you can end up with two repair processes competing for the same token ranges, which causes elevated load and occasionally causes nodes to report as overloaded during the repair window.

The official documentation is clear on this: if you're using your own repair service for a hybrid deployment, disable Reaper on the managed side. Contact Azure support to have Reaper disabled for your cluster; this is not a self-service setting you can change in the portal today. Keep your own repair service running, ensure it covers the Azure-managed nodes in the token range, and run it at a cadence that doesn't overlap with your peak traffic windows.

Scaling Operations That Appear Hung

If you've issued a datacenter scale command and the operation seems to be taking much longer than expected, check whether a patch cycle is running concurrently. Patching reboots machines one rack at a time, and the scaling orchestrator will wait for the ring to return to a stable quorum state before adding or removing nodes. This is intentional behavior, not a bug. The operation will complete; it's just queued behind the in-progress patching operation. You can verify the current patch status by looking at the Activity Log for recent updateCassandraDatacenter operations.

When to Call Microsoft Support

Open a support request in the Azure portal immediately if you see: VM-level node failures that don't self-heal within 30 minutes; disk or network failures reported at the infrastructure level; quorum loss across a datacenter with no obvious application-side cause; or any platform behavior that your Azure Activity Log shows as a failed Microsoft.DocumentDB operation (the resource provider behind cassandraClusters resources). The support team provides 24x7 coverage with autogenerated incidents for severe outages, and you get a single point of contact, so you don't need to open separate tickets with compute, storage, and networking teams.

What's not a platform issue: slow queries, disk exhaustion from uncompacted SSTables, authentication failures from incorrect LDAP config, or throughput limits from undersized VM SKUs. These land in your court per the official documentation. That said, the Azure support team will still provide guidance and recommendations on how to remediate these; they just won't resolve them for you. File your case at Microsoft Support.

Prevention & Best Practices

Most of the painful issues I've walked through in this guide are preventable. Once you've got a working Azure Managed Instance for Apache Cassandra cluster, there's a set of habits that will keep you out of trouble long-term.

Plan your replication factor before you write any data. The managed service makes scaling datacenter node counts easy, but changing your keyspace replication factor after data is already written requires a coordinated operation: you need to alter the keyspace, then run a full repair so the new replicas receive their data. Do this upfront. Start with a replication factor of 3 in every datacenter from day one. The patching documentation is explicit: you should not be running with consistency level ALL, and your replication factor should be 3 or higher to survive a patching cycle without availability impact.
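
The arithmetic behind that advice is worth seeing once. Here's a short Python sketch of the standard quorum formula (floor(RF / 2) + 1) and how many replicas each consistency level can lose before requests fail. The function names are mine.

```python
def quorum(replication_factor: int) -> int:
    """Replicas that must respond for QUORUM: floor(RF / 2) + 1."""
    return replication_factor // 2 + 1


def tolerated_failures(replication_factor: int, level: str = "QUORUM") -> int:
    """How many replicas can be down while the consistency level still succeeds."""
    required = {
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }[level]
    return replication_factor - required


# Columns: RF, replicas you can lose at QUORUM, replicas you can lose at ALL.
for rf in (1, 2, 3, 5):
    print(rf, tolerated_failures(rf, "QUORUM"), tolerated_failures(rf, "ALL"))
```

With RF 3, QUORUM tolerates one replica being down, which is what a rack reboot during patching costs you. ALL tolerates zero at any replication factor, and RF 2 tolerates zero even at QUORUM.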

Choose your VM SKU based on your P30 disk count requirements, not just CPU and memory. The pricing model lets you choose cores, VM SKU, memory size, and number of P30 disks per node independently. Cassandra is a disk-bound workload at scale, and most teams underestimate how many P30 disks they'll need and then hit storage capacity limits at an inconvenient time. Size with headroom: plan for 70% maximum disk utilization to give compaction enough working space.
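
As a rough sizing aid, the 70% rule turns into simple arithmetic. This Python sketch assumes a P30 disk provides 1 TiB (1024 GiB) and that you know your expected data volume per node. It ignores filesystem overhead and snapshots, so treat the result as a floor rather than a recommendation.

```python
import math

P30_GIB = 1024  # an Azure P30 managed disk is 1 TiB


def p30_disks_needed(data_gib_per_node: float,
                     max_utilization: float = 0.70) -> int:
    """Smallest P30 count that keeps a node at or below the target utilization."""
    usable_per_disk = P30_GIB * max_utilization
    return max(1, math.ceil(data_gib_per_node / usable_per_disk))


# Example per-node data volumes in GiB (placeholders).
for data in (500, 1500, 2900):
    print(data, p30_disks_needed(data))
```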

Test your patching experience before it happens in production. OS-level patches run every two weeks automatically, and machines are rebooted one rack at a time. In a properly configured cluster with replication factor 3 and non-ALL consistency, you won't notice this. But if you haven't tested rack-level failover with your application, you might find out during a patch cycle that your application wasn't as resilient as you thought. Simulate a rack going offline in your staging cluster before it happens automatically in production.

Monitor your weekly Reaper repair runs. Reaper runs nodetool repair weekly to keep your data consistent. Check your Azure Monitor logs after each repair run to confirm it completes cleanly. A repair that's failing silently, due to a node being overloaded during the repair window, is a consistency problem waiting to happen.

Quick Wins
  • Set Azure Monitor alerts on disk usage thresholds (warn at 60%, critical at 75%) for every node. Storage exhaustion is the leading cause of unplanned outages in managed Cassandra clusters
  • Explicitly specify your Cassandra version (3.11 or 4.0) in every CLI command and ARM template. Never rely on defaults for version-sensitive configuration
  • Validate all NSG rules and VNet subnet delegation before attempting cluster creation. Fixing these after provisioning fails costs far more time than checking upfront
  • For hybrid clusters, document and test your cross-site seed node list so that if an on-premises node used as a seed goes down, you have backups ready to supply to the managed datacenter configuration
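
If you want the same 60%/75% thresholds in your own scripts or dashboards, the check is a few lines of Python. This is a sketch with hypothetical node names and figures. It is not how Azure Monitor evaluates alerts.

```python
def disk_alert_level(used_gib: float, capacity_gib: float,
                     warn: float = 0.60, critical: float = 0.75) -> str:
    """Classify a node's disk usage against warn and critical thresholds."""
    ratio = used_gib / capacity_gib
    if ratio >= critical:
        return "critical"
    if ratio >= warn:
        return "warn"
    return "ok"


# Hypothetical nodes: (used GiB, capacity GiB).
NODES = {"node-1": (500, 1024), "node-2": (700, 1024), "node-3": (800, 1024)}

for name, (used, capacity) in NODES.items():
    print(name, disk_alert_level(used, capacity))
```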

Frequently Asked Questions

What is Azure Managed Instance for Apache Cassandra and how is it different from self-managed Cassandra?

Azure Managed Instance for Apache Cassandra is a fully managed service that handles the operational side of running Apache Cassandra clusters: VM provisioning, OS patching (on a two-week automatic cadence), Cassandra software patching, certificate rotation, snapshot backups, and weekly repair via Reaper. Self-managed Cassandra means you're responsible for all of that yourself. The key distinction is that the managed service deploys Cassandra nodes as virtual machine scale sets inside your own Azure Virtual Network, which means you keep full network isolation and control while Microsoft handles the operational overhead. You still control configurations at the Cassandra level (keyspace settings, compaction strategies, consistency levels), so it's not a black box, but the infrastructure underneath it is managed for you.

Which Cassandra versions does Azure Managed Instance support right now?

The service currently supports Apache Cassandra versions 3.11 and 4.0, and both are generally available. You specify the version using the --cassandra-version flag in the Azure CLI at cluster creation time; if you don't specify it, you may not get the version you expect. Once provisioned, major and minor version upgrades can be controlled manually using service tools; patch-level updates within your chosen major/minor version are applied automatically by the platform when security vulnerabilities are identified. Because a move between 3.11 and 4.0 is a manual operation rather than something the platform does for you, pick your version carefully based on your application driver compatibility.

How do I connect my existing on-premises Cassandra cluster to Azure Managed Instance?

This is the hybrid deployment scenario the service is specifically built for. You need an Azure ExpressRoute circuit connecting your on-premises network to the Azure Virtual Network where your managed nodes will be deployed. Once ExpressRoute private peering is established and confirmed as provisioned, you supply the IP addresses of your existing on-premises Cassandra seed nodes through the --external-seed-nodes parameter on the managed cluster, then create your managed datacenter. The managed nodes will use those seed addresses to bootstrap into the existing ring. After provisioning, verify ring membership with nodetool status from both on-premises and Azure-side nodes; all nodes should appear in UN (Up/Normal) state. Make sure ports 7000 and 7001 are open bidirectionally across your ExpressRoute circuit.

How do I scale my Azure Managed Instance for Apache Cassandra datacenter up or down?

Scaling is designed to be simple: you specify the target node count and the platform's scaling orchestrator handles everything else, including adding the new nodes to the Cassandra ring and streaming data to them. Using the Azure CLI: az managed-cassandra datacenter update --resource-group <rg> --cluster-name <cluster> --data-center-name <dc> --node-count <new-count>. You can also do this from the portal under your datacenter's settings. One important thing to know: if a patching operation is currently running when you trigger a scaling operation, the scale will queue behind it. This is intentional, because the orchestrator waits for a healthy quorum state before altering the node count. Don't cancel the operation; just wait it out.

What is Reaper and should I disable it for hybrid clusters?

Reaper is the tool the managed service uses to run nodetool repair on your cluster once per week. Repair in Cassandra performs a Merkle tree comparison across replicas to identify and fix any data inconsistencies, so it's a critical maintenance operation for data consistency. For pure Azure clusters, you should leave Reaper running and let it do its job. For hybrid clusters where your on-premises side has its own repair service already running, you may want to disable Reaper on the managed side to avoid two repair processes competing for the same token ranges, which increases load unnecessarily. Contact Azure support to disable Reaper; it's not a portal-accessible setting. Make sure your on-premises repair service covers the Azure-managed nodes after disabling it.

How do I get started creating my first Azure Managed Instance for Apache Cassandra cluster?

The fastest path to a working cluster is through the Azure portal: navigate to "Create a resource," search for "Azure Managed Instance for Apache Cassandra," and the creation wizard will walk you through VNet selection, datacenter configuration, and node sizing. For repeatable, scriptable deployments, the Azure CLI quickstart is the better choice: install the azure-cli, run az extension add --name cosmosdb-preview, then use the az managed-cassandra cluster create and az managed-cassandra datacenter create commands as covered in the step-by-step section of this guide. The minimum viable cluster for testing is 3 nodes with a replication factor of 3. After the cluster provisions, connect using any Cassandra-compatible client on port 9042 using the admin credentials you set during creation.

Sai Kiran Pandrala
Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.