Azure Event Grid: Fix Common Setup & Config Errors
Why This Is Happening
If you've landed here, you're probably staring at a silent event subscription, a dead MQTT connection, or an Azure Event Grid namespace that simply refuses to behave the way Microsoft's quickstart guide implied it would. I've seen this exact situation on dozens of Azure deployments, and the frustrating part is that Event Grid's error messages often tell you almost nothing useful. "Delivery failed." Great. Thanks, Azure.
Here's the reality of Azure Event Grid: it's a powerful, fully managed publish-subscribe service that handles both MQTT messaging for IoT workloads and HTTP-based event distribution for cloud application architectures. It supports CloudEvents 1.0, both MQTT v3.1.1 and MQTT v5.0, and gives you two distinct delivery models: push and pull. That flexibility is genuinely great once everything is working. Getting there is where people run into walls.
The root causes break into a few distinct buckets. First, there's the namespace vs. basic tier confusion. Azure Event Grid now has two main modes of operation: the classic "Event Grid Basic" model using custom topics, system topics, domain topics, and partner topics, and the newer Namespace model, which supports both pull delivery and push delivery and is where MQTT broker capability lives. If you set up an event subscription expecting pull delivery but created a resource in the wrong tier, your events are going nowhere.
Second, authentication mismatches are endemic. Event Grid's MQTT broker supports X.509 certificate authentication, Microsoft Entra ID authentication, OAuth 2.0 JWT, and even custom webhook authentication. Each of those requires a different configuration path, and mixing up the expected auth method for your client type is one of the top reasons connections silently fail with no meaningful error surfaced in the portal.
Third, event subscription misconfiguration. Developers often define the wrong endpoint type, miss required headers for webhook validation, or skip the dead-letter storage account configuration, which means failed deliveries vanish without a trace.
Finally, network and TLS issues. Event Grid enforces TLS 1.2 minimum (TLS 1.3 is also supported), and clients in firewall-restricted environments need MQTT over WebSocket enabled to get traffic through. If your organization's egress rules aren't updated for Event Grid's namespace endpoints, you'll see connection timeouts that look exactly like authentication failures.
I know this is frustrating, especially when it's blocking an IoT pipeline or a production event-driven workflow. Let's fix it systematically.
The Quick Fix: Try This First
Before you go deep on configuration audits, run through this fast checklist. In my experience, about 60% of Azure Event Grid problems are resolved by one of these three things.
1. Verify your resource tier matches your intended delivery model. Open the Azure portal, navigate to your Event Grid resource, and look at the overview blade. If you need pull delivery or MQTT support, you must be in a Namespace resource, not a plain custom topic. Go to All Resources → search "Event Grid" → check the Type column. You want to see "Event Grid Namespace" for namespace features. If you see "Event Grid Topic," that's the classic tier and it only supports push delivery to webhooks and Azure services.
2. Re-validate your webhook endpoint for push delivery subscriptions. When you create an event subscription targeting a webhook, Event Grid sends a validation request to that URL before it starts delivering real events. If your endpoint didn't respond with a 200 OK and the correct validation code within 10 seconds, the subscription sits in a "pending validation" state and delivers nothing. In the portal go to Event Subscriptions → select your subscription → look for "Provisioning State." If it says anything other than "Succeeded," that's your problem. Fix the endpoint, delete the subscription, and recreate it.
3. Check that your MQTT client certificates haven't expired. This is the silent killer for IoT workloads. Event Grid's MQTT broker uses X.509 certificate authentication by default for device clients. A cert that expired last Tuesday causes an immediate, silent connection refusal. Run this against your client cert PEM file:
openssl x509 -in your-client-cert.pem -noout -dates
The notAfter line tells you the expiry date. If it's in the past, that's your problem: rotate the certificate and re-register the client in the Event Grid namespace.
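If you'd rather script that check across a fleet than eyeball each date, here is a minimal Python sketch that parses the text printed by the openssl command above. The function names are my own, and it assumes OpenSSL's usual English date format (for example "Jun  1 12:00:00 2025 GMT") and an English locale for month names.

```python
from datetime import datetime, timezone


def parse_not_after(openssl_output: str) -> datetime:
    """Extract the notAfter timestamp from `openssl x509 -noout -dates` output."""
    for line in openssl_output.splitlines():
        if line.startswith("notAfter="):
            raw = line.split("=", 1)[1].strip()
            # OpenSSL pads single-digit days with an extra space; collapse it.
            normalized = " ".join(raw.split())
            parsed = datetime.strptime(normalized, "%b %d %H:%M:%S %Y GMT")
            return parsed.replace(tzinfo=timezone.utc)
    raise ValueError("no notAfter line found in the openssl output")


def days_until_expiry(openssl_output: str, now: datetime) -> int:
    """Whole days left before expiry; a negative value means already expired."""
    return (parse_not_after(openssl_output) - now).days
```

Call days_until_expiry(output, datetime.now(timezone.utc)) and treat anything below your rotation window as a certificate to replace.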
The first thing to get right is the namespace itself. In the Azure portal, type "Event Grid Namespaces" in the top search bar and select it from the results. Click + Create and work through the creation blade carefully.
Pick your subscription, resource group, and region. The namespace name must be globally unique because Event Grid uses it as the hostname for your MQTT and HTTP endpoints, so choose something that reflects your workload. Under the Networking tab, decide upfront whether you need public access or private endpoints. Changing this later is painful.
Once created, navigate to the namespace resource and look at the Host name field on the overview blade. It will look like your-namespace.westus2-1.eventgrid.azure.net. This is the endpoint your MQTT clients will connect to on port 8883 (TLS), or port 443 if you're routing MQTT over WebSocket for firewall-restricted environments.
To confirm MQTT is enabled on the namespace, go to Settings → Configuration inside the namespace blade. The "MQTT broker" toggle must be set to Enabled. If it's off, your MQTT clients will connect and immediately receive a CONNACK with return code 5 (Connection Refused, Not Authorized), which looks exactly like a certificate problem, so a lot of people chase the wrong thing for an hour.
If everything is configured correctly, you'll see the namespace in "Succeeded" provisioning state. That's your green light to proceed to client registration.
This is where most IoT teams get stuck. Azure Event Grid's MQTT broker needs to know about your clients before they can connect. You don't just point a device at the endpoint and hope for the best.
In the namespace blade, go to Clients and click + Client. Give the client a name and select your authentication method. For X.509 certificate auth (the standard for IoT devices), you'll provide either the thumbprint of the client's leaf certificate or a CA certificate for validation. To get the thumbprint from a PEM file:
openssl x509 -in client-cert.pem -noout -fingerprint -sha256
Copy the output (remove the colons to get a clean hex string) and paste it into the Thumbprint field. Make sure "Authentication type" is set to "Certificate thumbprint."
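Stripping the colons by hand is an easy place to lose a character, so here is a small Python sketch that does the cleanup. It assumes the openssl output looks like "sha256 Fingerprint=AB:CD:..." (the prefix casing varies between OpenSSL versions, so it splits on the equals sign). The function name is hypothetical, and upper-casing the result is my own choice for consistency, not a requirement I can confirm.

```python
def normalize_thumbprint(openssl_output: str) -> str:
    """Turn 'sha256 Fingerprint=AB:CD:...' into a colon-free hex string."""
    # Keep only what follows the first '=' (the whole string if there is none).
    value = openssl_output.strip().split("=", 1)[-1]
    hex_string = value.replace(":", "").upper()
    # A SHA-256 digest is 32 bytes, which is 64 hex characters.
    is_hex = all(char in "0123456789ABCDEF" for char in hex_string)
    if len(hex_string) != 64 or not is_hex:
        raise ValueError("expected a SHA-256 fingerprint (64 hex characters)")
    return hex_string
```

The length check catches the common mistake of pasting a SHA-1 fingerprint, which is only 40 hex characters.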
For applications using Microsoft Entra ID (formerly Azure Active Directory), switch the authentication type to "Microsoft Entra JWT." The client will need to acquire a token scoped to https://eventgrid.azure.net/.default and pass it as the MQTT password field in the CONNECT packet.
For lightweight clients that can't use Entra (common in constrained IoT hardware scenarios), the OAuth 2.0 JWT authentication option lets you use tokens from external identity providers. You configure this under the namespace's CA Certificates and Topic Spaces settings.
After saving the client, go to Client Groups and add the new client to an appropriate group. Client groups are what you reference in permission bindings; individual client registration alone doesn't grant publish or subscribe access to any topic.
I've seen countless setups where the client connects successfully but can't publish or subscribe, and the engineer is baffled because "the auth worked." Auth is just the door. Topic spaces and permission bindings are the room keys.
In the namespace blade, navigate to Topic Spaces and click + Topic Space. A topic space is a named collection of topic templates. For example, if your devices publish telemetry to paths like devices/{clientId}/telemetry, your topic template would be devices/${client.name}/telemetry. The ${client.name} variable substitution is powerful: it ensures each device can only publish to its own topic path, giving you fine-grained access control without writing a custom policy for every device.
Wildcards are also supported. In a template such as fleet/+/status, the + matches exactly one topic level, while the # in fleet/# matches any number of levels below fleet. Use wildcards carefully: too broad a topic space can inadvertently give one client group access to another's data.
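To make the matching rules concrete, here is a Python sketch of standard MQTT topic-filter matching plus the variable substitution described above. It illustrates the rules only and is not Event Grid's actual implementation; both function names are hypothetical.

```python
def expand_template(template: str, client_name: str) -> str:
    """Substitute the ${client.name} variable used in the topic template."""
    return template.replace("${client.name}", client_name)


def topic_matches(topic_filter: str, topic: str) -> bool:
    """Standard MQTT matching: '+' is one level, '#' is the rest. Case-sensitive."""
    filter_levels = topic_filter.split("/")
    topic_levels = topic.split("/")
    for index, level in enumerate(filter_levels):
        if level == "#":
            # '#' is only valid as the last level; it also matches the parent.
            return index == len(filter_levels) - 1
        if index >= len(topic_levels):
            return False
        if level != "+" and level != topic_levels[index]:
            return False
    return len(filter_levels) == len(topic_levels)
```

For a client named sensor-042, expand_template("devices/${client.name}/telemetry", "sensor-042") yields a filter that topic_matches accepts for that device's own path and rejects for devices/sensor-099/telemetry.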
Once your topic space is defined, go to Permission Bindings and click + Permission Binding. Select the client group, the topic space, and the permission level, either "Publisher," "Subscriber," or both. Save it.
To verify the binding took effect, have your MQTT client attempt a publish. A successful publish will return a PUBACK (for QoS 1) or nothing for QoS 0. If you still get a disconnect immediately after the CONNECT+SUBSCRIBE, go back and double-check that the topic your client is using matches the template pattern exactly, including case sensitivity.
If you're building an event-driven application (not MQTT IoT) and want to consume events via HTTP, you need namespace topics and event subscriptions. This covers scenarios like reacting to Azure Blob Storage events, routing custom application events to Azure Functions, or sending events to Azure Event Hubs for stream processing.
In the namespace blade, navigate to Topics and click + Topic. Give the topic a name; this is the channel your publisher will write events to. For Event Grid Namespace topics, publishers send events over HTTP using the Event Grid Data Plane SDK or a simple HTTP POST to the topic's endpoint with a Content-Type: application/cloudevents+json; charset=utf-8 header.
Next, create an event subscription on the topic. Go to the topic you just created, then + Event Subscription. For pull delivery, select "Queue" as the endpoint type. This creates a consumer group that your application polls on demand. For push delivery, select the destination type (Webhook, Azure Function, Event Hub, Service Bus, etc.) and configure the endpoint URL.
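For the plain HTTP POST route, this is roughly what the request body and header look like. The sketch below only builds the CloudEvents 1.0 envelope (the required attributes are specversion, id, source, and type) and does not send anything. The function names and the sample source and type values are placeholders of mine, and the topic URL and authentication are left out because they depend on your namespace.

```python
import json
import uuid
from datetime import datetime, timezone

CONTENT_TYPE = "application/cloudevents+json; charset=utf-8"


def build_cloudevent(source: str, event_type: str, data: dict) -> dict:
    """Build a CloudEvents 1.0 envelope in structured JSON mode."""
    return {
        "specversion": "1.0",            # required
        "id": str(uuid.uuid4()),         # required, unique per source
        "source": source,                # required
        "type": event_type,              # required
        "time": datetime.now(timezone.utc).isoformat(),  # optional
        "datacontenttype": "application/json",           # optional
        "data": data,
    }


def build_request(source: str, event_type: str, data: dict):
    """Return the (headers, body) pair for the HTTP POST."""
    body = json.dumps(build_cloudevent(source, event_type, data)).encode("utf-8")
    return {"Content-Type": CONTENT_TYPE}, body
```

Keeping envelope construction separate from the HTTP call makes it easy to unit-test the payload before pointing it at a real topic.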
For webhook push delivery, Event Grid will immediately attempt to validate the endpoint. Your webhook handler must respond to an HTTP POST containing a SubscriptionValidationEvent by returning the validationCode from the request body in a JSON response:
{
"validationResponse": "512d38b6-c7b8-40c8-89fe-f46f9e9622b6"
}
If validation succeeds, the subscription state changes to "Succeeded" in the portal and events begin flowing immediately.
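Here is a framework-agnostic Python sketch of that handshake logic. It assumes the Event Grid event schema, where the POST body is a JSON array and the validation event carries eventType Microsoft.EventGrid.SubscriptionValidationEvent with the code under data.validationCode. The function name is hypothetical, and you would wire it into whatever web framework your endpoint uses.

```python
import json
from typing import Optional

VALIDATION_EVENT_TYPE = "Microsoft.EventGrid.SubscriptionValidationEvent"


def validation_response(request_body: str) -> Optional[dict]:
    """Return the JSON reply for a validation request, or None for normal events."""
    events = json.loads(request_body)
    if isinstance(events, dict):
        # Tolerate a single event object as well as the usual array.
        events = [events]
    for event in events:
        if event.get("eventType") == VALIDATION_EVENT_TYPE:
            return {"validationResponse": event["data"]["validationCode"]}
    return None
```

When the function returns a dict, send it back as the JSON body with a 200 status; when it returns None, hand the events to your normal processing path.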
One of the most valuable things Azure Event Grid's MQTT broker can do is route device messages to downstream Azure services: Event Hubs for stream analytics, Azure Functions for serverless processing, or custom webhooks for your own applications. If routing isn't configured, MQTT messages stay inside the broker and never reach your data pipeline.
In the namespace blade, go to Routing. You'll see a toggle to enable routing and a field for the routing topic. When you enable routing, Event Grid forwards MQTT messages that arrive on matching topics to an Event Grid namespace topic, which then fans them out to whatever event subscriptions you've configured on that topic.
Click Enable Routing, then select or create the target namespace topic. The routing configuration also lets you enrich events with custom attributes, useful for adding device metadata like fleet ID or region before the data hits downstream services.
To test routing end-to-end, publish a test MQTT message from a registered client and then check your downstream service. For Event Hubs, use the Data Explorer tab in the Event Hub blade to see incoming events in near-real time. For Azure Functions, check the function's invocations in the Monitor tab under the function app.
If routing is enabled but events aren't arriving downstream, check two things. First, confirm the namespace topic you selected for routing has at least one active event subscription. Second, go to Metrics on the namespace and look at the "Routed MQTT Messages" metric. If it shows zero, the topic filter for routing isn't matching your published topic names. Adjust the routing filter to match your actual topic patterns.
Advanced Troubleshooting
If the step-by-step fixes above didn't resolve your issue, it's time to go deeper. Here's how I approach persistent Azure Event Grid problems in enterprise and production environments.
Azure Monitor Diagnostic Logs. This is the single most powerful tool for understanding what Event Grid is actually doing. In the Azure portal, open your Event Grid namespace or topic, go to Monitoring → Diagnostic Settings → + Add diagnostic setting. Enable the "DeliveryFailures," "PublishFailures," "DataPlaneRequests," and "MqttClientSessionConnectedDisconnected" log categories. Send them to a Log Analytics workspace. Once data starts flowing (allow 5–10 minutes), you can run KQL queries:
AegDeliveryFailureLogs
| where TimeGenerated > ago(1h)
| project TimeGenerated, Topic, EventSubscriptionName, DeliveryAttemptFailureReason
| order by TimeGenerated desc
The DeliveryAttemptFailureReason field tells you exactly why delivery failed: certificate validation errors, HTTP 4xx from your webhook, endpoint timeouts, and more. No more guessing.
MQTT Client Session Events. Event Grid emits client lifecycle events when an MQTT client connects or disconnects. These appear as system events on the namespace and include a disconnectionReason field. Common values you'll see include ClientAuthenticationError (cert or token issue), ClientAuthorizationError (topic space / permission binding missing), and IPAllowed or IPNotAllowed for network-level blocks. Subscribe to these events during debugging by creating a short-lived event subscription on the namespace's system topic.
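When you're watching these events during a debugging session, a tiny lookup table saves you from re-reading the docs each time. The Python sketch below maps only the two reasons explained above to their likely causes; the function name and fallback wording are my own, and anything else should send you back to the diagnostic logs.

```python
# Hints taken from the explanations above; extend as you meet other reasons.
DISCONNECT_HINTS = {
    "ClientAuthenticationError": "Check the client certificate or token.",
    "ClientAuthorizationError": "Check the topic space and permission binding.",
}


def triage_disconnect(reason: str) -> str:
    """Return a next-step hint for a disconnectionReason value."""
    return DISCONNECT_HINTS.get(
        reason, "No hint recorded for this reason; check the diagnostic logs."
    )
```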
Dead-Letter Analysis. If you configured a dead-letter storage destination (strongly recommended for any production subscription), failed events land in the blob container with metadata that includes the delivery attempt count, last attempt timestamp, and failure reason. Use Azure Storage Explorer or the portal's Storage Browser to inspect these blobs. The blob metadata contains a DeadLetterReason key; values like MaxDeliveryAttemptsExceeded or DestinationEndpointNotFound tell you exactly what went wrong.
Network-Level Blocks in Firewall-Restricted Environments. If your MQTT clients live behind a corporate firewall that blocks port 8883 outbound, switch to MQTT over WebSocket. This routes MQTT traffic over port 443 (standard HTTPS), which virtually all corporate firewalls allow. Update your client's connection string to use the WebSocket endpoint, which follows the pattern wss://your-namespace.westus2-1.eventgrid.azure.net:443/mqtt. Most open-source MQTT client libraries (Paho, MQTT.js, Eclipse Mosquitto) support WebSocket transport with a single flag change.
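Once you have more than a handful of dead-lettered blobs, counting the reasons tells you where to look first. This Python sketch assumes you have already downloaded each blob's metadata into a dict (the download step is omitted) and that the reason sits under the DeadLetterReason key named above. The function name is hypothetical.

```python
from collections import Counter
from typing import Dict, List, Tuple


def tally_dead_letter_reasons(blob_metadata: List[Dict[str, str]]) -> List[Tuple[str, int]]:
    """Count dead-letter reasons, most frequent first."""
    reasons = Counter(
        metadata.get("DeadLetterReason", "Unknown") for metadata in blob_metadata
    )
    return reasons.most_common()
```

A pile of MaxDeliveryAttemptsExceeded usually points at a flaky or slow endpoint, while DestinationEndpointNotFound points at a subscription aimed at something that no longer exists.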
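The two connection options boil down to a small decision, sketched here in Python. The ports and the /mqtt path come from the pattern above. The class, the function names, and the transport labels are my own and would need mapping onto whatever your MQTT library expects.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MqttEndpoint:
    host: str
    port: int
    transport: str  # "tcp" for MQTT over TLS, "websockets" for MQTT over WebSocket
    path: str       # only used for the WebSocket transport


def choose_endpoint(hostname: str, port_8883_blocked: bool) -> MqttEndpoint:
    """Pick TLS on 8883 by default, WebSocket on 443 behind strict firewalls."""
    if port_8883_blocked:
        return MqttEndpoint(hostname, 443, "websockets", "/mqtt")
    return MqttEndpoint(hostname, 8883, "tcp", "")


def websocket_url(endpoint: MqttEndpoint) -> str:
    """Render the wss:// URL for a WebSocket endpoint."""
    return f"wss://{endpoint.host}:{endpoint.port}{endpoint.path}"
```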
Custom Domain Names. For enterprise deployments that need custom domain names on MQTT endpoints (for certificate pinning, simplified device configuration, or compliance reasons), Event Grid Namespaces support custom domain assignment. This requires a verified custom domain in your Azure DNS zone and a TLS certificate uploaded to Azure Key Vault. Configure this under Namespace → Settings → Custom Domains. After assignment, your devices can connect to mqtt.yourcompany.com instead of the Azure-generated hostname.
Prevention & Best Practices
Getting Azure Event Grid working is one thing. Keeping it working at scale, especially for IoT deployments with thousands of devices or event-driven microservices with strict SLA requirements, takes deliberate setup. Here's what separates the setups that run smoothly for years from the ones that generate 2 AM alerts.
Plan your certificate rotation strategy on day one. X.509 certificate expiry is the most common cause of sudden, unexplained MQTT authentication failures in production. Set up a calendar reminder or an automated Azure Monitor alert to flag certificates expiring within 30 days. For large device fleets, use CA certificate-based authentication rather than per-device thumbprints. That way you control the CA and can issue new leaf certs without touching the Event Grid configuration for every single device.
Design topic space templates with the principle of least privilege. Use the ${client.name} variable substitution in your topic templates. This ensures a device can only publish to its own namespace in the topic hierarchy, not to other devices' topics. A device named sensor-042 with topic template devices/${client.name}/# cannot publish to devices/sensor-099/telemetry, even if it tries. Build this in from day one; retrofitting it onto a running fleet is painful.
Always configure dead-letter destinations on production event subscriptions. Without a dead-letter storage account, failed events disappear silently. A Storage Account blob container costs almost nothing and gives you a complete audit trail of every failed delivery. Set the retention period on the dead-letter container to match your compliance requirements.
Monitor namespace-level metrics proactively. Set up Azure Monitor alerts on the "Unmatched MQTT Messages" and "Dead Lettered Events" metrics. A spike in either is an early warning signal. Set alert thresholds based on your normal baseline traffic. Even a 10% increase in dead-lettered events during off-peak hours can indicate a misconfigured event subscription or a downstream service starting to fail.
- Enable diagnostic logs on Day 1; retroactive debugging without logs is nearly impossible in Event Grid
- Use CloudEvents 1.0 format for all HTTP event publishing; it's the interoperability standard Event Grid supports and makes schema validation trivial
- Test MQTT connectivity with the open-source mosquitto_pub/mosquitto_sub tools before onboarding devices; this isolates client-side config issues from device firmware issues
- Set a retry policy and maximum delivery attempts on every event subscription, and always pair them with a dead-letter destination so nothing gets lost silently
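The baseline comparison itself is simple arithmetic, sketched here in Python. The 10% default mirrors the figure above. The function name and the zero-baseline handling are assumptions of mine, and a real alert would live in Azure Monitor rather than in a script.

```python
def exceeds_baseline(current: float, baseline: float, tolerance: float = 0.10) -> bool:
    """True when `current` is at least `tolerance` above `baseline` (0.10 = 10%)."""
    if baseline <= 0:
        # With no baseline traffic, any dead-lettered event is worth a look.
        return current > 0
    return (current - baseline) / baseline >= tolerance
```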
Frequently Asked Questions
What exactly is Azure Event Grid and how is it different from Azure Service Bus or Event Hubs?
Azure Event Grid is a fully managed publish-subscribe service built around event-driven architectures. It handles two distinct workloads: MQTT messaging for IoT devices (using MQTT v3.1.1 and v5.0 protocols), and HTTP-based event distribution for cloud applications. Service Bus is designed for message queuing with ordering guarantees and competing consumer patterns; it's for application-to-application messaging with strong delivery guarantees. Event Hubs is a high-throughput event streaming platform optimized for capturing and replaying massive streams of telemetry. Event Grid sits in the middle: it excels at reacting to state change events from Azure services (like a blob being uploaded or a VM being deleted) and routing them to the right handler in near-real time. For IoT device connectivity at scale, Event Grid's MQTT broker capability is genuinely the right tool, because the others don't natively speak MQTT.
My MQTT client connects but immediately disconnects, what's happening?
An immediate disconnect after a successful CONNECT packet almost always means one of three things: the client is authenticated (the CONNECT was accepted) but has no permission bindings allowing it to publish or subscribe to any topic space; the client's session settings are invalid (for example, requesting a session expiry interval that exceeds the namespace maximum); or there's a keep-alive timeout mismatch where the client is configured for a very short keep-alive that the broker isn't honoring. Enable the MqttClientSessionConnectedDisconnected diagnostic log category on the namespace and check the disconnectionReason field, which tells you exactly which category you're in. For permission binding issues, go to Namespace → Permission Bindings and confirm your client's group is listed with the correct Publisher or Subscriber permission for the relevant topic space.
What's the difference between pull delivery and push delivery in Azure Event Grid, and which should I use?
Push delivery means Event Grid proactively sends events to a destination you define: a webhook URL, an Azure Function, an Event Hub, or a Service Bus queue. You set it up once and events flow automatically. Pull delivery means your application connects to Event Grid and asks for events on its own schedule, similar to polling a queue. Pull delivery is only available for namespace topics (not classic custom or system topics), and it's ideal for scenarios where your consumer needs to control the pace of processing, handle backpressure gracefully, or process events in batches. Push is simpler to set up and better for real-time reactions where latency matters. If your consumer is a stateless function that needs to react within seconds, use push. If your consumer is a batch processor or you need to pause/resume consumption independently, use pull.
How do I route MQTT messages from devices to Azure Event Hubs for analytics?
This is a two-step configuration. First, in your Event Grid namespace, go to Routing and enable message routing, selecting a namespace topic as the routing destination. Second, create an event subscription on that namespace topic with Event Hubs as the endpoint type, then select your Event Hub namespace and hub name from the dropdown. Once both are configured, MQTT messages matching your routing filter will automatically flow through the namespace topic and into the Event Hub in near-real time. From there you can connect Azure Stream Analytics, Azure Databricks, or any other service that reads from Event Hubs. To verify it's working, open the Event Hub in the portal, go to Data Explorer, and click View events. You should see your device messages appearing in near-real time.
My webhook event subscription is stuck in "Awaiting Manual Validation", how do I fix it?
This happens when Event Grid sent the initial validation request to your webhook URL and never received the correct response. Your webhook endpoint must respond within 10 seconds to an HTTP POST containing a SubscriptionValidationEvent by returning a 200 OK with a JSON body containing the validationCode from the request. Check three things: that your endpoint is publicly reachable from Azure (Event Grid's validation calls come from Azure's outbound IPs, so you may need to allowlist these if you have a firewall in front of your server), that it's returning 200 and not a redirect (3xx), and that the response body is valid JSON with the exact validationResponse field. If you can't modify the endpoint, use the manual validation option in the portal: copy the validation URL from the subscription properties and open it in a browser. This completes the handshake without code changes.
Does Azure Event Grid support custom domain names for MQTT endpoints?
Yes, Event Grid Namespaces support custom domain name assignment for MQTT endpoints. This is genuinely useful for device fleets where you want to connect to iot.yourcompany.com instead of the Azure-generated hostname, whether for branding, simplified provisioning, or certificate pinning requirements. To set it up, you need a verified custom domain in your Azure DNS zone, a TLS certificate covering that domain stored in Azure Key Vault, and a CNAME record pointing your custom domain to the Event Grid namespace hostname. Configure this under Namespace → Settings → Custom Domains in the portal. Note that the TLS certificate must be renewed before expiry and re-uploaded to Key Vault; Event Grid doesn't manage certificate renewal for custom domains automatically.