Fix Azure Event Hubs: Connection, Config & Kafka Errors

Microsoft Fix Intermediate 14 min read Official Docs Grounded Updated April 20, 2026

What's in This Guide

Why This Is Happening
The Quick Fix, Try This First
Step-by-Step Solution
Advanced Troubleshooting
Prevention & Best Practices
FAQ

Why This Is Happening

Azure Event Hubs problems have a way of surfacing at the worst possible moments: your IoT pipeline is dropping telemetry, your Kafka producer is throwing TOPIC_AUTHORIZATION_FAILED, or your Event Hubs Capture just stopped writing to Blob Storage with no explanation in sight. I've seen this exact scenario on dozens of enterprise deployments, and the underlying cause is almost never what you'd expect from the error message.

Here's the honest reality: Azure Event Hubs is a fully managed, real-time data streaming platform built to ingest millions of events per second with low latency. That's genuinely impressive, but "fully managed" doesn't mean "zero configuration." It means Microsoft handles the infrastructure; you still own the namespace settings, connection strings, throughput units, network rules, and consumer group management. When any one of those is misconfigured, the error messages Azure gives you are often vague enough to send you chasing the wrong rabbit hole for hours.

The most common root causes I see break down like this:

SAS key or connection string errors: expired or wrong-scope Shared Access Signature policies are responsible for roughly half of all "connection refused" tickets I handle.
Throughput unit exhaustion: Standard tier namespaces ship with 1 throughput unit by default. Hit that cap and producers start getting throttling errors while consumers silently fall behind.
Kafka compatibility misconfiguration: Azure Event Hubs supports the Apache Kafka protocol natively, but your bootstrap server address, port, and SASL mechanism settings must match exactly. One wrong character and nothing connects.
Firewall and private endpoint blocks: if your namespace uses IP filter rules or virtual network service endpoints, any client outside those rules is rejected, and the error does not mention the rule, so it is easy to mistake for a network outage or a bad credential.
Consumer group offset corruption or race conditions: multiple consumer instances fighting over the same partition offset without proper checkpointing causes duplicate processing, missed events, or outright read failures.
TLS version mismatches: Microsoft enforces minimum TLS requirements on Event Hubs namespaces. Older SDK versions or custom Kafka clients configured for TLS 1.0/1.1 are rejected during the handshake, with no mention of the TLS version in the error.

I know this is frustrating, especially when these failures are blocking a production pipeline or a launch deadline. The good news is that 90% of Azure Event Hubs errors fall into one of the buckets above, and every single one of them has a deterministic fix. Let's work through them.

The Quick Fix, Try This First

Before going deep on diagnostics, there's one check that resolves roughly half the Azure Event Hubs connection issues I've seen, and it takes under two minutes.

Open the Azure Portal, navigate to your Event Hubs namespace, and click Shared access policies in the left-hand menu under Settings. Look at the policy your application is using. There are three things to verify right here:

The policy exists and hasn't been deleted. Sounds obvious, but teams rotate SAS keys during security audits and forget to update application config.
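The policy has the correct claim. If your producer only needs to send events, it needs the Send claim. If your consumer needs to read, it needs the Listen claim. In the portal, selecting Manage also grants Send and Listen, so a Manage policy will work for either role, but it hands the application far more rights than it needs. Prefer a Send-only or Listen-only policy per application.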
You are copying the full connection string, not just the key. Click the policy name, then copy the value from Connection string–primary key. It should look like this:

Endpoint=sb://your-namespace.servicebus.windows.net/;SharedAccessKeyName=YourPolicy;SharedAccessKey=base64key==;EntityPath=your-event-hub-name

Note that EntityPath must point to your specific event hub name, not just the namespace. If you grabbed the namespace-level connection string, it has no EntityPath, so the client has to be given the event hub name separately. When neither is supplied, or the name is wrong, producers fail with a generic error that looks nothing like a missing entity path.
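As a quick sanity check before you paste a connection string into application config, you can parse it and look for the parts described above. This is a minimal sketch using only the Python standard library; the function names and the specific warnings are my own illustration, not part of any Azure SDK.

```python
REQUIRED_KEYS = ("Endpoint", "SharedAccessKeyName", "SharedAccessKey")


def parse_connection_string(conn_str: str) -> dict:
    """Split 'Key=Value;Key=Value' pairs into a dict."""
    parts = {}
    for segment in conn_str.strip().strip(";").split(";"):
        # Split on the first '=' only: base64 keys end in '=' padding.
        key, sep, value = segment.partition("=")
        if not sep or not value:
            raise ValueError(f"Malformed segment: {segment!r}")
        parts[key.strip()] = value.strip()
    return parts


def check_connection_string(conn_str: str) -> list:
    """Return a list of human-readable problems; empty means it looks fine."""
    parts = parse_connection_string(conn_str)
    problems = [f"missing {k}" for k in REQUIRED_KEYS if k not in parts]
    if not parts.get("Endpoint", "").startswith("sb://"):
        problems.append("Endpoint should start with sb://")
    if "EntityPath" not in parts:
        problems.append("no EntityPath: pass the event hub name to the client separately")
    return problems
```

An empty list only means the string is well formed. It cannot tell you whether the key is still valid or whether the policy carries the right claim, so the portal checks above still apply.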

Once you've confirmed the connection string is correct, paste it into your application config and test. If the connection goes through, you're done. If not, keep reading and we'll go deeper.

Pro Tip

When you rotate SAS keys in Azure Event Hubs, the secondary key keeps working while the primary regenerates. For zero-downtime rotation, first switch your apps to the secondary key, regenerate the primary, move the apps to the new primary, and only then regenerate the secondary. Regenerating a key that apps are still using cuts them off immediately.

Step-by-Step Solution

Verify Namespace Throughput Units and Auto-Inflate Settings

Azure Event Hubs Standard tier is throttled by throughput units (TUs). One TU gives you 1 MB/s ingress (or 1,000 events per second, whichever comes first) and 2 MB/s egress. That sounds like plenty until your IoT fleet scales up, your log aggregation pipeline spikes, or your clickstream analytics job suddenly has real traffic. When you exceed your TU limit, producers start receiving ServerBusyException and error code 503. Consumers don't error loudly; they just fall further behind.

To check your current throughput unit allocation:

Open the Azure Portal and navigate to your Event Hubs namespace.
Click Scale in the left-hand menu under Settings.
Look at the Throughput Units slider. Note the current value and the maximum.
Below the slider, check whether Auto-Inflate is enabled. If it's off and you're hitting TU limits, that's your problem.

Enable Auto-Inflate and set a maximum TU ceiling that matches your expected peak load. This feature automatically scales throughput units to meet demand without manual intervention, exactly what high-throughput scenarios need.

You can also set this via Azure CLI if you prefer:

az eventhubs namespace update \
  --name your-namespace \
  --resource-group your-rg \
  --enable-auto-inflate true \
  --maximum-throughput-units 20

To confirm you were being throttled before enabling this, go to Metrics in your namespace and check the Throttled Requests metric. A non-zero value confirms throughput exhaustion was the culprit. After enabling Auto-Inflate, watch that metric drop to zero within minutes. If the Throttled Requests metric stays elevated even after scaling up, you likely have a hot partition problem: all your events are landing on one partition because your producer is using a constant partition key. I'll cover that in the Advanced section.

Fix Apache Kafka Compatibility Connection Settings

Azure Event Hubs has native Apache Kafka protocol support, so you can run existing Kafka workloads without code changes or cluster management. But "without code changes" applies to your application logic, not your Kafka client configuration. The bootstrap server, port, and SASL settings must be updated to point at your Event Hubs namespace.

Here's the exact configuration your Kafka client needs. This applies whether you're using a Java producer, a Python consumer, or a Confluent Schema Registry client:

# Required Kafka client settings for Azure Event Hubs
bootstrap.servers=your-namespace.servicebus.windows.net:9093
security.protocol=SASL_SSL
sasl.mechanism=PLAIN
sasl.jaas.config=org.apache.kafka.common.security.plain.PlainLoginModule required \
  username="$ConnectionString" \
  password="Endpoint=sb://your-namespace.servicebus.windows.net/;SharedAccessKeyName=YourPolicy;SharedAccessKey=base64key==";

There are three places I see this break most often:

Wrong port. Standard Kafka uses port 9092. The Azure Event Hubs Kafka endpoint uses port 9093. That one-digit difference causes the connection to fail with no helpful error.
security.protocol set to PLAINTEXT. Event Hubs requires SASL_SSL. With PLAINTEXT the client never starts the TLS handshake the service expects, so the connection is dropped.
Username not set to the literal string $ConnectionString. This is not a variable; it is the literal username string Azure expects. Your actual credentials go in the password field as a full connection string.

For Spring applications using spring-kafka, set the same properties under spring.kafka.properties.* in your application.yml. After updating the config, restart your application and watch for a successful metadata fetch. Your Kafka client log will show Cluster ID: <your-namespace-id> if the connection succeeded.
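Because every setting above derives from the connection string, you can generate the client configuration instead of typing it by hand. The sketch below is an illustration under assumptions: the function name is hypothetical, and the keys follow the librdkafka style (sasl.username and sasl.password) used by clients such as confluent-kafka, rather than the Java sasl.jaas.config form shown above.

```python
def kafka_config_from_connection_string(conn_str: str) -> dict:
    """Build librdkafka-style client settings for the Event Hubs Kafka endpoint."""
    endpoint = None
    for segment in conn_str.split(";"):
        key, _, value = segment.partition("=")
        if key.strip() == "Endpoint":
            endpoint = value.strip()
    if not endpoint or not endpoint.startswith("sb://"):
        raise ValueError("connection string has no sb:// Endpoint")
    # sb://your-namespace.servicebus.windows.net/ -> your-namespace.servicebus.windows.net
    host = endpoint[len("sb://"):].rstrip("/")
    return {
        "bootstrap.servers": f"{host}:9093",   # 9093, not Kafka's usual 9092
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": "$ConnectionString",  # literal text, not a variable
        "sasl.password": conn_str,             # the full connection string
    }
```

Deriving the host from the Endpoint value removes the most common typo source, a bootstrap server that does not match the namespace in the credentials.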

Resolve Network Firewall and Private Endpoint Blocks

This one trips up a huge number of teams. You get your connection string right, your Kafka config is correct, but everything still times out or gets refused. The cause is almost always a namespace-level firewall rule that's blocking your client's IP or virtual network.

Azure Event Hubs supports three network security mechanisms: IP filter rules, virtual network service endpoints, and private endpoints. Any of these can be active simultaneously, and they combine using a deny-by-default model: if your client's traffic doesn't match an explicit allow rule, it gets dropped.

To check your current network settings:

In the Azure Portal, navigate to your Event Hubs namespace.
Click Networking in the left-hand menu under Settings.
Look at the Public network access setting. If it's set to Disabled, only private endpoint connections are allowed. If it's set to Selected networks, only IP ranges and VNets in the allow list can connect.

To temporarily unblock yourself for testing, set Public network access to All networks. If your connection immediately succeeds after this change, you've confirmed a firewall block.

For production environments, don't leave it wide open. Instead, add your client's specific IP or VNet subnet to the allow list:

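# Add an IP filter rule via Azure CLI
# (command group names differ between CLI versions; newer releases expose this as
#  "network-rule-set ip-rule add", so check "az eventhubs namespace --help" first)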
az eventhubs namespace network-rule add \
  --namespace-name your-namespace \
  --resource-group your-rg \
  --ip-address 203.0.113.45/32 \
  --action Allow

If you're running in an Azure Virtual Machine or App Service and your namespace uses VNet integration, add the VM's subnet to the Virtual networks tab under Networking. After saving, it typically takes 1–2 minutes for the rule to propagate. Test again after waiting.
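Before changing namespace rules, it can help to rule out a lower-level block such as a corporate firewall or a network security group. The sketch below is a plain TCP reachability probe using the standard library; the function name is hypothetical. Treat the result with care: a successful TCP connect may not prove that the namespace firewall allows your client, because the namespace can still reject the request after the connection opens.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port opens within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False
```

Call it as can_connect("your-namespace.servicebus.windows.net", 9093) for Kafka or with port 5671 for AMQP. If it returns False from the client machine but True from a machine outside your network, the block is on your side of the connection, not in the namespace settings.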

Fix Consumer Group Offset and Checkpointing Errors

Azure Event Hubs organizes events in partitions: ordered sequences that enable parallel processing. When your consumer application reads from a partition, it tracks its position using an offset. Checkpointing saves that offset so your consumer can resume from the right place after a restart. When checkpointing breaks down, you get duplicate events, missed events, or consumers that keep re-reading from the beginning of the partition.

The first thing to verify is your consumer group configuration. Each consumer application should have its own dedicated consumer group. Think of a consumer group as a logical view of the event hub that lets multiple applications read the same stream independently. If two different applications share one consumer group, they will fight over partition ownership and neither will process events correctly.

To create a new consumer group in the Azure Portal:

Navigate to your Event Hubs namespace, then click your specific event hub name.
Click Consumer groups in the left-hand menu.
Click + Consumer group and give it a name specific to your application.

For .NET applications using the Azure.Messaging.EventHubs.Processor library, checkpointing uses Azure Blob Storage. The most common problem here is using a Blob Storage container that the processor doesn't have write access to. Verify your storage account connection string and that the container exists:

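using Azure.Messaging.EventHubs;   // EventProcessorClient
using Azure.Storage.Blobs;         // BlobContainerClient

// Checkpoints and partition ownership records are stored in this container
var storageClient = new BlobContainerClient(
    storageConnectionString,
    "your-checkpoint-container");

// The consumer group here should be dedicated to this application
var processor = new EventProcessorClient(
    storageClient,
    "your-consumer-group",
    eventHubsConnectionString,
    "your-event-hub-name");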

If you see BlobNotFound, ContainerNotFound, or AuthorizationPermissionMismatch in your processor logs, the checkpoint store is the problem. Create the container in Blob Storage manually and ensure your storage account's SAS or managed identity has the Storage Blob Data Contributor role. After fixing storage access, restart your processor. It will pick up from the last valid checkpoint and stop reprocessing old events.
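To see why a broken or infrequent checkpoint shows up as duplicate events, it helps to model the mechanism. The following is a toy in-memory simulation, not the Event Hubs SDK: the class and function names are hypothetical, and offsets are simplified to list indices.

```python
class InMemoryCheckpointStore:
    """Stand-in for a durable checkpoint store (Blob Storage in the .NET example)."""

    def __init__(self):
        self._offsets = {}

    def load(self, consumer_group, partition_id):
        # -1 means "no checkpoint yet": start from the first event.
        return self._offsets.get((consumer_group, partition_id), -1)

    def save(self, consumer_group, partition_id, offset):
        self._offsets[(consumer_group, partition_id)] = offset


def process_partition(events, store, consumer_group, partition_id,
                      checkpoint_every=2, crash_after=None):
    """Process events after the last checkpoint; return the offsets handled."""
    start = store.load(consumer_group, partition_id) + 1
    handled = []
    for offset in range(start, len(events)):
        if crash_after is not None and len(handled) == crash_after:
            break  # simulate a crash before the next checkpoint is written
        handled.append(offset)
        if len(handled) % checkpoint_every == 0:
            store.save(consumer_group, partition_id, offset)
    return handled
```

If a run handles offsets 0, 1, and 2 and then crashes, the last saved checkpoint is offset 1, so the next run starts again at offset 2 and processes it a second time. Checkpoints are also keyed by consumer group, which is why a second application with its own group starts from its own position instead of inheriting someone else's.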

Enforce Correct TLS Version and Fix Certificate Errors

Microsoft enforces a minimum TLS version on Azure Event Hubs namespaces. By default, namespaces accept TLS 1.2 and TLS 1.3. If your application or Kafka client is configured to negotiate TLS 1.0 or TLS 1.1, the connection will be rejected at the handshake level. The error you see on the client side is usually a generic SSL handshake failed or Connection reset by peer, not "your TLS version is too old," which would have been helpful.

To check what minimum TLS version your namespace is configured to accept:

In the Azure Portal, navigate to your Event Hubs namespace.
Click Configuration under Settings in the left-hand menu.
Look for the Minimum TLS version setting. It should be set to 1.2 or higher.

On the client side, ensure your JVM, .NET runtime, or Python SSL library is configured to use TLS 1.2+. For Java-based Kafka clients, add this JVM argument if you suspect TLS negotiation is failing:

-Djavax.net.debug=ssl:handshake

This outputs detailed TLS handshake logging so you can see exactly which version is being negotiated and where it fails.

For .NET applications, check your ServicePointManager configuration. Older .NET Framework apps sometimes default to TLS 1.0:

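// Add this at application startup, before any Event Hubs connections are made.
// |= keeps whatever the runtime already enables and adds TLS 1.2.
System.Net.ServicePointManager.SecurityProtocol |=
    System.Net.SecurityProtocolType.Tls12;
// SecurityProtocolType.Tls13 only exists on .NET Framework 4.8 and later;
// add it with another |= if your target framework has it.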

After making this change and restarting your application, the TLS handshake should succeed immediately. You'll know it worked when the connection string test in step one goes through without a handshake error in your application logs.
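For Python clients, or simply as a neutral probe from the affected machine, the standard ssl module can report which TLS version is actually negotiated. This is a minimal sketch with hypothetical function names; the default port of 9093 assumes you are testing the Kafka endpoint.

```python
import socket
import ssl


def make_tls12_context() -> ssl.SSLContext:
    """Client context that refuses anything older than TLS 1.2."""
    context = ssl.create_default_context()
    context.minimum_version = ssl.TLSVersion.TLSv1_2
    return context


def negotiated_tls_version(host: str, port: int = 9093, timeout: float = 10.0) -> str:
    """Open a TLS connection and return the negotiated version string."""
    context = make_tls12_context()
    with socket.create_connection((host, port), timeout=timeout) as raw:
        with context.wrap_socket(raw, server_hostname=host) as tls:
            return tls.version()
```

Calling negotiated_tls_version("your-namespace.servicebus.windows.net") from the client machine either returns the negotiated version or raises an ssl.SSLError. A certificate verification error here, rather than a version error, points at the TLS inspection scenario described next.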

One more thing on certificates: if you're running in a corporate environment with TLS inspection (a proxy that terminates and re-encrypts HTTPS/SSL traffic), Azure Event Hubs connections may fail because the proxy's certificate is not trusted by the Azure SDK. Ask your network team to add an exception for *.servicebus.windows.net on port 9093 (Kafka) and port 5671/5672 (AMQP).

Advanced Troubleshooting

Diagnosing Hot Partitions with Azure Monitor

A hot partition happens when your producer sends all events to the same partition, usually because the partition key is a constant value or a low-cardinality field like a boolean flag. The result is one overloaded partition and several idle ones, which means you're wasting throughput capacity even if your overall TU count is high enough.

To diagnose this, open Azure Monitor for your Event Hubs namespace. Go to Metrics, select the metric Incoming Messages and split it by EntityName then by PartitionId. If one partition is receiving 95% of your messages while the others sit near zero, you have a hot partition.

The fix is to remove or randomize your partition key. If you're using the Azure Event Hubs SDK:

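// Don't do this: a constant key sends every event to the same partition.
// (SendEventOptions belongs to the IEnumerable<EventData> overload; for an
// EventDataBatch, the key is set through CreateBatchOptions instead.)
await producerClient.SendAsync(events, new SendEventOptions { PartitionKey = "fixed-key" });

// Do this instead: with no key, Event Hubs distributes events via round-robin
await producerClient.SendAsync(events);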
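The effect of key cardinality is easy to demonstrate offline. The sketch below maps keys to partitions with a stable hash and counts the result. It is illustrative only: Event Hubs uses its own internal hashing, which this SHA-256 stand-in does not reproduce, and the function names are hypothetical.

```python
import hashlib
from collections import Counter


def assign_partition(partition_key: str, partition_count: int) -> int:
    """Map a key to a partition with a stable hash (illustrative only)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % partition_count


def partition_load(keys, partition_count: int) -> Counter:
    """Count how many events land on each partition for the given keys."""
    return Counter(assign_partition(k, partition_count) for k in keys)


constant_keys = ["fixed-key"] * 1000
device_keys = [f"device-{i}" for i in range(1000)]

print(partition_load(constant_keys, 4))  # one partition holds all 1000 events
print(partition_load(device_keys, 4))    # events spread across the partitions
```

The point carries over regardless of the hash in use: a key with one distinct value can only ever reach one partition, so if you need keys for ordering, pick a field with many distinct values, such as a device ID.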

Diagnosing Capture Failures

Azure Event Hubs Capture writes events to Azure Blob Storage or Azure Data Lake Storage on a configurable time/size window. When Capture stops working, the most common cause is a permissions break: either the managed identity for Event Hubs lost its Storage Blob Data Contributor role on the target storage account, or the storage account's firewall started blocking Event Hubs' outbound IP range after a network policy change.

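Check the Capture status in the Azure Portal under your event hub's Capture settings. If Capture is enabled but events aren't appearing in storage, first confirm that the namespace's diagnostic settings collect the ArchiveLogs category, which is where Capture failures are recorded. This Azure CLI query lists the matching settings; an empty result means those logs aren't being collected yet: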

az monitor diagnostic-settings list \
  --resource /subscriptions/your-sub-id/resourceGroups/your-rg/providers/Microsoft.EventHub/namespaces/your-namespace \
  --query "[].logs[?category=='ArchiveLogs']"

Enterprise and Domain-Joined Scenarios

In enterprise environments with Azure Active Directory conditional access policies, applications authenticating to Event Hubs via managed identity can be blocked if the conditional access policy requires compliant devices or specific IP ranges. If your Event Hubs SDK is using DefaultAzureCredential and fails with an AuthenticationFailedException, check your Azure AD sign-in logs for blocked token requests. The fix is typically to add an exclusion for your application's service principal in the conditional access policy, or switch to a dedicated SAS token that bypasses AAD authentication entirely for that workload.

For RBAC-based access (rather than SAS), make sure the service principal or managed identity has the Azure Event Hubs Data Owner, Azure Event Hubs Data Sender, or Azure Event Hubs Data Receiver built-in role, not just a subscription-level reader role. Role assignments propagate within 5 minutes but can take up to 30 minutes in some Azure regions.

Geo-Disaster Recovery Failover Issues

If you've configured geo-disaster recovery (geo-DR) pairing between two namespaces and initiated a failover, all SAS policies and consumer groups replicate to the secondary namespace. Applications that connect through the geo-DR alias follow the failover to the new primary. Azure does not redirect traffic sent to a namespace's own endpoint, so any application still pointing directly at the old primary namespace will fail after failover completes and must be updated to use the alias or the new primary.

When to Call Microsoft Support

Escalate to Microsoft Support if you're seeing persistent throttling even though your throughput units are well within capacity, if geo-DR failover didn't replicate your consumer groups correctly, if Event Hubs Capture is skipping events despite no storage permission errors, or if you're hitting undocumented quota limits on a Dedicated cluster. For Premium and Dedicated tier namespaces, Microsoft offers faster support SLAs, so mention your tier when you open the case.

Prevention & Best Practices

The best Azure Event Hubs troubleshooting session is the one you never have to run. Here's what I recommend to every team before they put an Event Hubs namespace in production.

Size your throughput units before launch, not after. Look at your expected peak ingestion rate in MB/s, divide by 1 MB/s per throughput unit, and double that number. Then enable Auto-Inflate with that doubled number as your ceiling. You pay for actual usage, not headroom, and this single step prevents a whole class of throttling incidents when traffic spikes at launch or during a marketing push.
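The sizing rule above is simple arithmetic, sketched here so you can plug in your own numbers. The function name and the returned field names are hypothetical; the 1 MB/s per throughput unit figure is the ingress limit quoted earlier in this guide.

```python
import math

INGRESS_MB_PER_TU = 1.0  # ingress capacity of one throughput unit


def size_throughput_units(peak_ingress_mb_per_s: float, headroom_factor: float = 2.0) -> dict:
    """Apply the 'divide by 1 MB/s, then double' rule of thumb."""
    if peak_ingress_mb_per_s <= 0:
        raise ValueError("peak ingress must be positive")
    baseline = math.ceil(peak_ingress_mb_per_s / INGRESS_MB_PER_TU)
    ceiling = math.ceil(baseline * headroom_factor)
    return {"baseline_tus": baseline, "auto_inflate_ceiling": ceiling}


print(size_throughput_units(3.5))  # {'baseline_tus': 4, 'auto_inflate_ceiling': 8}
```

For a 3.5 MB/s peak, that means provisioning 4 TUs and setting the Auto-Inflate maximum to 8. If your events are small and numerous, check the events-per-second limit as well, since it can bind before the MB/s limit does.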

Always use separate consumer groups per application. The built-in $Default consumer group is fine for testing, but never share it across multiple production applications. When two processors share a consumer group, they partition-steal from each other and you get duplicate processing, missed events, or endless rebalancing loops. This is cheap to prevent and painful to diagnose after the fact.

Set up Azure Monitor alerts on key metrics from day one. The three metrics that matter most are: Throttled Requests (alert at > 0 for 5 minutes), Active Connections (alert if this drops suddenly), and Incoming Bytes (alert if this goes to zero for your busiest producers). These three alerts will catch 80% of production incidents before your users notice.

Use managed identity authentication instead of SAS tokens wherever possible. SAS tokens expire and rotation causes outages. Managed identity authentication using Azure AD eliminates the rotation problem entirely and gives you better audit trails in Azure AD sign-in logs. The Azure Event Hubs SDK supports DefaultAzureCredential out of the box; it's a one-line change from a connection string.

Test your Capture configuration in staging before production. Point Capture at a staging Blob Storage container, generate 100 events, and verify the Avro files appear within your configured time window. Many teams discover Capture permission issues only in production, after losing hours of event history.

For Apache Kafka workloads, validate your client version compatibility. Azure Event Hubs supports Kafka protocol versions 1.0 and later. If you're running a very old Kafka client (pre-1.0), you'll see protocol negotiation failures that look like network errors. Check your client's Kafka protocol version and upgrade if needed.

Quick Wins

Enable Auto-Inflate on all Standard tier namespaces with a ceiling 2x your expected peak TUs
Create a dedicated consumer group for each consuming application, never share $Default in production
Set Azure Monitor alerts on Throttled Requests, Active Connections, and Incoming Bytes metrics
Switch from SAS token auth to managed identity authentication to eliminate key rotation outages

Frequently Asked Questions

What is Azure Event Hubs and when should I use it instead of Azure Service Bus?

Azure Event Hubs is a real-time event streaming platform designed for high-throughput, low-latency ingestion: think IoT telemetry, application logs, clickstream data, and financial transaction feeds where millions of events per second is normal. Azure Service Bus is an enterprise messaging service built for transactional workloads where delivery guarantees, message sessions, and dead-lettering matter more than raw throughput. If you're processing device telemetry or building a data pipeline, use Event Hubs. If you're coordinating order processing between microservices and need transactions and duplicate detection, use Service Bus. There's also Azure Event Grid for reactive serverless architectures where you need push-based event routing with filtering. It's a different animal from both.

Why is my Kafka producer getting TOPIC_AUTHORIZATION_FAILED when connecting to Azure Event Hubs?

This almost always means your SASL credentials are wrong. In Azure Event Hubs' Kafka compatibility mode, the Kafka "topic" maps to your event hub name, and the "username" must be the literal string $ConnectionString: not a variable, that exact text. Your password is the full Event Hubs connection string including Endpoint=sb://.... Also check that your SAS policy has the Send claim if you're a producer, or the Listen claim if you're a consumer. A policy missing the claim your client needs will produce this exact error. Double-check your port is 9093, not the default Kafka port of 9092.

My Event Hubs Capture stopped writing Avro files to Blob Storage, how do I fix it?

Start by checking whether the managed identity assigned to your Event Hubs namespace still has the Storage Blob Data Contributor role on your storage account. This is the most common cause after a permission audit or infrastructure change. Navigate to your storage account in the Azure Portal, click Access control (IAM), and verify the role assignment exists for Event Hubs. The second most common cause is a storage account firewall rule change that's now blocking Event Hubs' outbound IPs. Check the storage account's Networking settings and ensure Allow Azure services on the trusted services list to access this storage account is enabled. Capture resumes automatically once the permission or network block is resolved, so you don't need to recreate the Capture configuration.

How do I increase throughput units without dropping events or causing downtime?

You can scale throughput units live in the Azure Portal under your namespace's Scale settings. There's no restart and no connection drop. Producers and consumers keep running while the scale operation completes, which typically takes under 60 seconds. For automatic scaling without manual intervention, enable Auto-Inflate on the same Scale page and set a maximum TU ceiling. Auto-Inflate scales up when your ingress or egress approaches your current TU limit, so you never have to manually watch metrics and adjust. Note that Auto-Inflate does not scale down automatically, but you can scale TUs down manually during off-peak hours if cost is a concern.

How long does Azure Event Hubs retain my events, and can I increase it?

Retention depends on your pricing tier. Standard tier retains events for up to 7 days. Premium and Dedicated tiers extend retention up to 90 days. You set the retention period per individual event hub, not at the namespace level: navigate to your event hub in the portal, click Properties, and adjust the Message Retention value. If you need events stored longer than 90 days, use the Capture feature to automatically write events to Azure Blob Storage or Azure Data Lake Storage, where you can apply your own retention policies. Capture writes Avro-formatted files and can be enabled without stopping or restarting your producers.

What's the difference between Azure Event Hubs Standard, Premium, and Dedicated tiers?

Standard tier is the entry point: consumption-based pricing, throughput units for capacity, up to 7 days retention, and basic Kafka compatibility. It's the right choice for most teams getting started. Premium tier adds 90-day retention, higher per-partition limits, dedicated processing units (PUs instead of TUs), and better isolation. It is billed per PU regardless of message volume, which makes it cost-effective for high-throughput, predictable workloads. Dedicated tier gives you a single-tenant cluster with reserved capacity, up to 20 MB message size support (currently in preview), and the highest SLA guarantee at 99.99%. It is built for enterprise-scale pipelines where you can't share capacity with other Azure customers. If you're in early development or have variable traffic, start with Standard and enable Auto-Inflate. Move to Premium or Dedicated when your volume stabilizes and cost modeling shows it's cheaper.


Sai Kiran Pandrala

Our team includes certified Microsoft engineers, Azure architects, and system administrators with 10+ years of enterprise IT experience. Every guide is written from hands-on troubleshooting, not guesswork. We test every fix before publishing.