Azure AI Services

Configure the flow trigger

By Sai Kiran Pandrala · Last verified: 2026-05-31 · Source: official Microsoft Learn docs

At a glance
Product family: Azure AI Services
Document source: Azure AI Speech Service
Guide type: Hands-on Reference
Skill level: Intermediate to advanced
Time: 20-75 minutes depending on tenant scale

When you wire Azure Speech to a Power Automate flow or a Logic App, the trigger is where everything starts. Get the trigger wrong and you either fire too often (cost spike) or miss events (silent failure). Both are bad. The second is worse because you don't know it's happening.

The flow trigger for Speech batch jobs is typically a "when a blob is added" trigger on the input container, or a webhook receiver for the completion notification. Pick based on where in the pipeline you sit.

Reference content and what it actually means

The Microsoft Learn page for Configure the flow trigger is correct and complete. It is also written for "every reader." I want to tell you what an engineer shipping this in a real customer tenant should care about.

Three forces shape the behaviour of Azure Speech Service: your endpoint region, your SDK version, and the audio format and channel layout of your input. Most accuracy and latency problems trace back to one of these three.

Endpoints and regional behaviour

Azure Speech is available in 30+ regions. Central India and South India have base STT and TTS GA, including Indian language voices. Preview features (Voice Live, GPT-Realtime, certain HD voices) typically launch in East US and West Europe first and land in Indian regions 4-8 months later.

# Build the endpoint URL from region + scenario
# Speech-to-text REST endpoint:
https://<region>.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1
# Text-to-speech REST endpoint:
https://<region>.tts.speech.microsoft.com/cognitiveservices/v1
# Voice Live WebSocket:
wss://<region>.api.cognitive.microsoft.com/speech/realtime/v1

The Speech SDK builds these URLs for you when you pass region. You only construct them by hand when you're calling REST directly or wiring a non-SDK client like a SIP gateway.
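For the hand-built case, a minimal sketch of a URL builder follows. It uses only the STT and TTS patterns listed above. The function name, the region validation, and the optional `language` query parameter are illustrative assumptions, not part of any SDK.

```python
from typing import Optional
from urllib.parse import urlencode

# URL templates copied from the endpoint list above; <region> becomes {region}.
ENDPOINT_TEMPLATES = {
    "stt": "https://{region}.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1",
    "tts": "https://{region}.tts.speech.microsoft.com/cognitiveservices/v1",
}


def build_endpoint(region: str, scenario: str, language: Optional[str] = None) -> str:
    """Return the REST endpoint for a region and scenario ('stt' or 'tts')."""
    region = region.strip().lower()
    if not region.isalnum():
        # Region names such as "centralindia" contain only letters and digits.
        raise ValueError(f"unexpected region name: {region!r}")
    if scenario not in ENDPOINT_TEMPLATES:
        raise ValueError(f"unknown scenario: {scenario!r}")
    url = ENDPOINT_TEMPLATES[scenario].format(region=region)
    if language and scenario == "stt":
        # Assumption: the caller wants the recognition language in the query string.
        url += "?" + urlencode({"language": language})
    return url


if __name__ == "__main__":
    print(build_endpoint("centralindia", "stt", language="en-IN"))
    print(build_endpoint("centralindia", "tts"))
```

Keeping the templates in one dictionary means a change in URL shape is a one-line fix rather than a search across the codebase.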

SDK versions and the long-lived support window

Microsoft supports each Speech SDK major version for 24 months from GA. Two majors are typically supported at once. Pin your version in package.json / pom.xml / requirements.txt and bump it deliberately, not automatically. I have a calendar reminder every quarter to check for new minor versions.

# Python
pip install azure-cognitiveservices-speech==1.40.0
# C# / .NET
dotnet add package Microsoft.CognitiveServices.Speech --version 1.40.0
# Node
npm install microsoft-cognitiveservices-speech-sdk@1.40.0
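To keep pins from loosening over time, a CI step can flag any requirement that is not an exact pin. The sketch below is one way to do that for a requirements.txt. The function name and the regular expression are my own assumptions, and it deliberately ignores pip option lines such as `-r other.txt`.

```python
import re
from typing import List

# Matches an exact pin such as "azure-cognitiveservices-speech==1.40.0",
# optionally with extras like "package[extra]==1.0".
_EXACT_PIN = re.compile(r"^[A-Za-z0-9._-]+(\[[^\]]+\])?==[0-9][A-Za-z0-9.]*$")


def unpinned_requirements(text: str) -> List[str]:
    """Return the requirement lines that are not pinned with '=='."""
    loose = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options
            continue
        if not _EXACT_PIN.match(line):
            loose.append(line)
    return loose


if __name__ == "__main__":
    sample = "azure-cognitiveservices-speech==1.40.0\nrequests>=2.0\n"
    print(unpinned_requirements(sample))
```

Failing the build when the returned list is non-empty turns "bump it deliberately" into something the pipeline enforces.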

Audio format, the silent accuracy killer

Speech-to-Text expects 16 kHz, 16-bit, mono PCM by default. Feed it 8 kHz phone-grade audio and accuracy drops 10-15%. Feed it stereo when it expects mono and it gets confused about which channel to weight. The container handles common formats but the cloud REST endpoints are stricter.

For phone audio, switch to the narrow-band model explicitly and tell the SDK the source is 8 kHz. The Telecom Regulatory Authority of India's standard for VoIP is G.711 mu-law at 8 kHz, and that covers most call centre traffic in the country.

How to apply this in practice

Here's the order I run through when I'm setting up a fresh Speech deployment for a customer.

  1. Pick your region based on data residency. For an Indian customer with consumer data: Central India. For a customer in regulated finance: confirm with their DPO. The region cannot be changed without a fresh resource.
  2. Create the Speech resource: az cognitiveservices account create --kind SpeechServices --sku S0 --location centralindia --name my-speech-prod --resource-group my-rg. S0 SKU pricing runs about ₹83 ($1) per audio hour for STT at the time of writing; verify on the pricing calculator.
  3. Enable managed identity on the consuming app: az webapp identity assign --name my-app --resource-group my-rg. Grant it the Cognitive Services Speech User role on the Speech resource.
  4. Write a 30-line health check that calls TTS to produce a known phrase, calls STT to transcribe that phrase, and asserts WER below 5%. Run it post-deploy. This catches credential drift, quota throttling, and regional outages.
  5. Enable diagnostic logs. Send them to Log Analytics. Build one alert: STT 5xx rate over 1% in 5 minutes. Build one alert: TTS latency p95 over 800ms. Both have saved me production incidents.
  6. Document your region, your SDK version, your audio format assumptions, and your phrase list in your team wiki. Future you and the on-call engineer will both need it.

I've seen this fail when teams skip step 4. The customer's first symptom is "the voicebot is hallucinating," which is almost never the model: it's the audio path. The health check catches it in 30 seconds.
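The comparison at the heart of the step 4 health check can be sketched as a word error rate function. This is a minimal version using word-level edit distance. The TTS and STT calls are left out, so `reference` stands for the phrase you synthesised and `hypothesis` for the transcript that came back. The function names and the punctuation handling are assumptions.

```python
import re


def _words(text: str) -> list:
    # Lowercase and keep only word characters and apostrophes, so that
    # "Fox." and "fox" count as the same word.
    return re.findall(r"[\w']+", text.lower())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = _words(reference)
    hyp = _words(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


def health_check_passes(reference: str, hypothesis: str, max_wer: float = 0.05) -> bool:
    """True when the round-trip transcript stays within the WER budget."""
    return word_error_rate(reference, hypothesis) <= max_wer
```

With a short known phrase, a 5% budget effectively means the transcript must match word for word, which is the right strictness for a smoke test.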

Caveats and what to double-check

Troubleshooting the failures I keep seeing

Four issues account for most of the Azure Speech tickets I get pulled into. Walking through them up front saves a debugging evening.

Recognition returns blank strings or hallucinates

Nine times out of ten this is an audio format problem. The Speech SDK silently accepts 8 kHz audio into a 16 kHz endpoint and emits garbage. Confirm format with ffprobe:

ffprobe -v error -show_entries stream=sample_rate,channels,codec_name input.wav
# Expect: sample_rate=16000, channels=1, codec_name=pcm_s16le
# If sample_rate is 8000, declare the source format to the SDK instead of letting it guess,
# for example an AudioStreamFormat with samples_per_second=8000 on a push or pull input stream.

For phone audio (G.711 mu-law at 8 kHz), explicitly target the narrow-band model in the SDK. The SDK will upsample 8 kHz to 16 kHz internally if you don't tell it about the source rate, and the accuracy hit is brutal.
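If ffprobe is not available on the machine running the pipeline, the same check can be approximated with Python's standard `wave` module. This is a sketch under two assumptions: the input is a PCM WAV file (the module raises `wave.Error` on compressed formats such as mu-law), and the function name and messages are mine.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16000) -> list:
    """Return a list of format problems; an empty list means the file matches."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        sample_width = wav.getsampwidth()  # bytes per sample
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        problems.append(f"{channels} channels, expected mono")
    if sample_width != 2:
        problems.append(f"{sample_width * 8}-bit samples, expected 16-bit")
    return problems
```

Running this before every upload and refusing files with a non-empty result keeps phone-grade audio from reaching the wideband path unnoticed.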

Latency spikes past 800ms p95

Real-time speech latency depends on region distance, audio chunk size, and whether you've enabled the Microsoft Audio Stack. Measure first, optimise second. Use the latency figures the Speech SDK reports on its results (the recognition latency and synthesis first-byte latency properties) rather than a stopwatch around your own code. If the number is consistently over 500ms, your client is too far from the region. Move closer (deploy in the same region as the Speech resource) before tuning anything else.
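Once you have latency samples, the triage above can be sketched in a few lines. The 500 ms and 800 ms thresholds come from this guide. Reading "consistently over 500ms" as "the median is over 500 ms" is my assumption, as are the function names and the nearest-rank percentile method.

```python
import math


def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def diagnose(samples_ms, median_limit_ms=500, p95_budget_ms=800):
    """Classify a batch of latency samples using the thresholds from this guide."""
    if percentile(samples_ms, 50) > median_limit_ms:
        # Even typical requests are slow, which points at distance, not tuning.
        return "move the client closer to the Speech region"
    if percentile(samples_ms, 95) > p95_budget_ms:
        return "tail latency over budget: look at chunk size and audio processing"
    return "within budget"
```

Checking the median before the tail keeps the order of operations straight: fix distance first, then chase the outliers.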

Custom Voice training rejected for consent issues

The consent file must be a WAV recording of the voice talent reading Microsoft's specific consent statement, in the same voice as the training data. I've seen training rejected because the consent was recorded on a phone and the training audio was studio quality. Microsoft's verification model flagged the voice mismatch. Record consent on the same hardware as your training samples.

Voice Live drops connection mid-conversation

The WebSocket has a default idle timeout. If your user pauses speaking and the LLM doesn't emit anything for 60+ seconds, the connection may drop. Send a heartbeat ping every 25 seconds from the client side: ws.ping() in Node, or the equivalent in your language. I had to add this after a customer reported "the bot forgets us if we pause."
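For a Python client, the heartbeat can be sketched as an asyncio task that runs alongside the conversation. `send_ping` here is a hypothetical stand-in for whatever ping call your WebSocket library exposes. The 25-second default comes from the text above and everything else is an assumption.

```python
import asyncio
from typing import Awaitable, Callable


async def heartbeat(
    send_ping: Callable[[], Awaitable[None]],
    stop: asyncio.Event,
    interval_s: float = 25.0,
) -> int:
    """Call send_ping every interval_s seconds until stop is set; return the ping count."""
    sent = 0
    while not stop.is_set():
        try:
            # Sleep for one interval, but wake immediately if the session ends.
            await asyncio.wait_for(stop.wait(), timeout=interval_s)
        except asyncio.TimeoutError:
            # The interval elapsed without a stop signal, so ping now.
            await send_ping()
            sent += 1
    return sent
```

Start it with `asyncio.create_task(heartbeat(my_ping, stop_event))` when the socket opens and call `stop_event.set()` when the conversation ends, so the task exits cleanly instead of being cancelled mid-ping.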

Last quarter a Mumbai customer hit all four of these in the same week. Audio format, latency, consent, and idle timeout. We worked through them in that order and saved the deadline.

Cost notes and a rollback plan

Azure Speech pricing has five major levers: STT audio hours, TTS characters synthesised, Custom Voice training and hosting, Voice Live concurrent connections, and storage for batch transcription input/output. Pick the one that dominates your workload and tune the others later.

S0 SKU Speech-to-Text runs about ₹83 ($1) per audio hour at the time of writing. A call centre transcribing 2,000 hours a day runs about ₹50 lakh ($60,000) a month before commitment-tier discount, ₹30-35 lakh after. Real-time and batch STT are metered separately, and batch has generally been listed at a lower hourly rate, so price the mode you actually run. TTS is around ₹1,300 per million characters on standard neural voices, ₹2,500 on HD voices. A high-volume IVR with 50,000 calls a day at 200 characters per call synthesises about 300 million characters a month, which comes to roughly ₹3.9 lakh on standard neural and ₹7.5 lakh on HD.
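The arithmetic behind estimates like these is worth keeping in a script so you can rerun it when rates change. The sketch below uses the per-unit rates quoted in this section. The 30-day month and the function names are assumptions, and the rates should be checked against the pricing calculator before use.

```python
# Rates quoted in this section (INR); verify before relying on them.
STT_INR_PER_HOUR = 83
TTS_STANDARD_INR_PER_MILLION_CHARS = 1300
TTS_HD_INR_PER_MILLION_CHARS = 2500
DAYS_PER_MONTH = 30  # assumption: a flat 30-day billing month
INR_PER_LAKH = 100_000


def stt_monthly_inr(hours_per_day: float) -> float:
    """Monthly STT cost for a steady daily volume of audio hours."""
    return hours_per_day * DAYS_PER_MONTH * STT_INR_PER_HOUR


def tts_monthly_inr(calls_per_day: int, chars_per_call: int, rate_per_million: float) -> float:
    """Monthly TTS cost for a steady daily call volume."""
    chars_per_month = calls_per_day * chars_per_call * DAYS_PER_MONTH
    return chars_per_month / 1_000_000 * rate_per_million


if __name__ == "__main__":
    stt = stt_monthly_inr(2000)
    standard = tts_monthly_inr(50_000, 200, TTS_STANDARD_INR_PER_MILLION_CHARS)
    hd = tts_monthly_inr(50_000, 200, TTS_HD_INR_PER_MILLION_CHARS)
    print(f"STT: {stt / INR_PER_LAKH:.1f} lakh")               # STT: 49.8 lakh
    print(f"TTS standard: {standard / INR_PER_LAKH:.1f} lakh")  # TTS standard: 3.9 lakh
    print(f"TTS HD: {hd / INR_PER_LAKH:.1f} lakh")              # TTS HD: 7.5 lakh
```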

Custom Voice training runs into the lakhs for Professional voice: figure ₹4-6 lakh for a typical fine-tune, plus monthly hosting at around ₹35,000 per hosted voice. Personal Voice (the URL-clone flow) is dramatically cheaper but constrained to a narrower set of features. Pick based on use case, not on cost alone.

Rollback plan. If the Speech feature you've enabled is causing regressions, you have three lines of defence. Revert to a prior SDK version (your requirements.txt / package.json is your friend: this is why we pin versions). Switch to a different model on the same endpoint (Speech-to-Text exposes both latest and specific timestamped model versions). Or fall back to a different region if a regional outage is in play.

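# SDK pin pattern that lets you roll back in one deploy
# Python, pin to the last good version explicitly (shell):
pip install azure-cognitiveservices-speech==1.39.0

# Switching the STT model through configuration (Python).
# SPEECH_KEY and SPEECH_REGION are assumed environment variable names.
import os
import azure.cognitiveservices.speech as speechsdk

speech_config = speechsdk.SpeechConfig(
    subscription=os.environ["SPEECH_KEY"], region=os.environ["SPEECH_REGION"]
)
# Set the deployment ID to a prior trained model BEFORE building the
# recognizer; the recognizer reads the config when it is constructed.
speech_config.endpoint_id = "<prior-good-model-deployment-id>"
audio_config = speechsdk.audio.AudioConfig(use_default_microphone=True)
recognizer = speechsdk.SpeechRecognizer(
    speech_config=speech_config,
    audio_config=audio_config,
    source_language_config=speechsdk.languageconfig.SourceLanguageConfig(language="en-IN"),
)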

I keep the prior model deployment alive for 30 days after promoting a new one. That window has saved me three times this year, including a Friday evening incident where the new Custom Speech model regressed on a key vocabulary class. One config change, four-minute deploy, problem solved.

FAQ

Where does this "Configure the flow trigger" content come from?
I cross-checked it against the official Microsoft Learn page for Azure AI Services, reformatted the structure for engineers who scan rather than read, and added the verify + rollback notes I wish someone had given me when I first shipped this on a customer tenant. The "Last verified" stamp at the top tells you when it was last reconciled with Microsoft's version.
How often is this reference updated?
Quarterly minimum, plus an out-of-band refresh whenever Microsoft pushes a breaking change. The Azure AI Services docs move fast. I once watched the endpoint URL change shape between Friday and Monday. If you see drift between this page and the canonical Microsoft Learn source, the Microsoft page wins. Drop me a note and I'll re-verify.
Can I use this for production planning?
Use it as your first read, not your only read. For production, pair this with your tenant's SKU (S0 vs Standard vs Commitment Tier), the region you've picked, your compliance bracket (GDPR / HIPAA / India MeitY), and Microsoft's pricing calculator on the day you sign the PO. A 30-minute architecture review with these inputs beats a 3-hour search through PDFs.
Why is this reference free?
HowToFixMe runs on display ads. No paywall, no email gate, no "sign up to read more" pattern. I built this because I lost two evenings last month digging through outdated Microsoft PDF exports for a customer migration. That pain shouldn't be a tax on every engineer.
Where can I read the original Microsoft source?
Search "Configure the flow trigger" on learn.microsoft.com. Microsoft restructures URL paths every few quarters but the heading text usually stays stable, so a verbatim search is the most reliable path to the live page.
