What we learned building voice AI at Talio

February 5, 2026

Talio serves as a first point of contact for volume hiring and can interface through both text and audio. In this article we share what we have learned building with speech-to-speech models, the current landscape, and where we see it going.

The two architectures

Chained architecture

The most common architecture for voice agents is to wrap an LLM in two layers:

  1. Speech-to-text: the user's voice is converted to text
  2. Text-to-speech: the LLM's text response is converted to voice

This creates a chained architecture: STT → LLM → TTS

The advantage of this approach is flexibility. You can swap out individual components, use the best STT provider, your preferred LLM, and a TTS provider with high quality voices or custom voice cloning. The downside is latency: each step adds delay, and the cumulative effect can make conversations feel sluggish.
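The chained flow can be sketched as below. The three stage functions are hypothetical stubs standing in for real provider calls, not any specific vendor's API:

```python
# Sketch of the chained STT -> LLM -> TTS pipeline. Each function is a
# hypothetical stub; in production each stage is a network round trip,
# and their latencies add up before the caller hears anything.

def transcribe(audio: bytes) -> str:
    """STT stage (stubbed): the user's voice becomes text."""
    return "what are your opening hours?"

def generate_reply(text: str) -> str:
    """LLM stage (stubbed): text in, text out."""
    return "We are open from 9 to 5."

def synthesize(text: str) -> bytes:
    """TTS stage (stubbed): text becomes audio."""
    return text.encode()

def handle_turn(user_audio: bytes) -> bytes:
    # The stages run strictly in sequence: the caller hears nothing
    # until transcription, generation, and synthesis have all finished.
    return synthesize(generate_reply(transcribe(user_audio)))
```

Swapping any stub for a different provider changes one function, which is exactly the flexibility the chained approach buys.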

Speech-to-speech models

Speech-to-speech models take audio as input and produce audio as output.

Large language models process text using tokens, which are common sequences of characters found in text:

Text is tokenized and processed by the large language model

(You can try this yourself at platform.openai.com/tokenizer)

Speech-to-speech models work similarly, but with audio. Audio is compressed into discrete tokens, the model predicts the response tokens, and those are converted back to audio.
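A toy illustration of that discretization step: real systems use learned neural audio codecs with large codebooks, but a tiny fixed codebook (entirely hypothetical here) shows the idea of mapping continuous audio samples to discrete tokens and back:

```python
# Toy audio tokenizer: map each sample to the index of the nearest
# codebook entry. The 4-entry codebook below is a hypothetical stand-in
# for the learned codebooks that real neural audio codecs use.

CODEBOOK = [-0.75, -0.25, 0.25, 0.75]  # hypothetical amplitude levels

def encode(samples: list[float]) -> list[int]:
    """Quantize continuous samples into discrete token indices."""
    return [min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] - s))
            for s in samples]

def decode(tokens: list[int]) -> list[float]:
    """Map token indices back to (lossy) amplitudes."""
    return [CODEBOOK[t] for t in tokens]

audio = [0.8, 0.1, -0.3, -0.9]
tokens = encode(audio)      # the discrete sequence the model predicts over
approx = decode(tokens)     # approximate reconstruction of the audio
```

The model never sees raw waveforms, only sequences like `tokens`, which is what lets it reuse the same next-token-prediction machinery as text LLMs.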

This direct approach reduces latency and preserves information in the speech signal (emotion, tone, accents) that would be lost if everything was forced through text first.

When speech-to-speech wins

The choice between chained and speech-to-speech depends on what you are optimizing for.

B2C applications tend to prioritize responsiveness and conversational feel. A consumer calling their bank or ordering food expects a snappy, natural interaction. Latency spikes and robotic pauses break the illusion. Here, speech-to-speech models have a clear advantage.

B2B applications often prioritize correctness over feel. When a voice agent is extracting structured information, booking appointments, or handling compliance-sensitive workflows, getting the right answer matters more than shaving off milliseconds.

The main practical advantage of the chained architecture today is cost. Speech-to-speech models are expensive per minute of audio. A chained setup, combining STT, a cheap LLM, and TTS, typically costs significantly less per minute in total. For high-volume use cases, this difference adds up quickly.
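A back-of-the-envelope comparison makes the point. All per-minute rates below are hypothetical placeholders, not actual provider pricing:

```python
# Hypothetical cost model: both rates are illustrative placeholders,
# not real provider prices. The point is how the gap scales with volume.

S2S_PER_MIN = 0.30                       # hypothetical speech-to-speech rate
CHAINED_PER_MIN = 0.02 + 0.01 + 0.04     # hypothetical STT + LLM + TTS rates

def monthly_cost(minutes: float, rate_per_min: float) -> float:
    return minutes * rate_per_min

# At 100k minutes/month the architectural choice dominates the bill:
gap = monthly_cost(100_000, S2S_PER_MIN) - monthly_cost(100_000, CHAINED_PER_MIN)
```

Even a modest per-minute difference becomes a five-figure monthly gap at call-center volumes.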

This does not mean B2B will always use chained and B2C will always use speech-to-speech. But it explains why we see different adoption patterns across industries.

What we learned from testing the models

We have tested all major speech-to-speech models in production. Here is what we found:

OpenAI Realtime

Currently the most reliable option. Latency is consistent and the API is stable. The downsides: developer experience is rough, and it does not support outbound calls without building a Media Streams bridge (Phone → Twilio → WebSocket → Your Server → OpenAI). Voice quality is good but not perfect. One technique we use: configure Realtime to output text instead of speech, then pipe that text directly to a dedicated TTS provider like ElevenLabs. The text streams while the response is still generating, so the added latency is minimal. This gives you the responsiveness of a speech-to-speech model with the voice quality of a premium TTS provider.
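The text-to-TTS technique can be sketched roughly as follows. The event shape matches the Realtime API's `session.update` message at the time of writing (check the current docs), and the TTS stream object is a hypothetical placeholder:

```python
# Sketch of the hybrid setup: ask Realtime for text-only output, then
# forward each text delta to a separate TTS provider as it streams in.
# The session.update shape is hedged against the docs at time of writing;
# tts_stream is a hypothetical stand-in for your TTS provider's client.
import json

def make_session_update() -> str:
    event = {
        "type": "session.update",
        "session": {
            "modalities": ["text"],  # text only: no model-generated audio
        },
    }
    return json.dumps(event)

def on_text_delta(delta: str, tts_stream) -> None:
    # Synthesis overlaps with generation because deltas are forwarded
    # as they arrive, so the added latency stays minimal.
    tts_stream.write(delta)
```

Because the TTS provider starts speaking the first delta while later deltas are still being generated, the user perceives almost no extra delay versus native audio output.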

Gemini Live

Has the most advanced features (more on this below), but reliability is currently an issue. We have seen sessions where it takes over 15 seconds for the model to become responsive after connection. This happens on both Google AI Studio and Vertex AI. The documentation is also unclear in places. Dutch works reasonably well, though Flemish less so. The underlying technology is impressive, but it is not yet production-ready for applications where you cannot afford latency variance.

Ultravox

Has the best developer experience of the bunch. Custom voices, outbound calls, batteries included. Voice quality is not at the frontier, but if you need something that just works at a reasonable price, it is worth considering.

Hume AI EVI

Focuses on empathic voice, reading and responding to emotional cues. Interesting technology, but does not support Dutch.

Moshi (open source, from Kyutai)

A full-duplex speech-to-speech model you can self-host. Interruption handling does not work well yet.

Grok

Voices are lower quality and do not perform well for non-English languages. Dutch in particular sounds unnatural.

Our current recommendation: OpenAI Realtime for production workloads that need reliability and can absorb the cost. Ultravox if you want better DX and lower cost with acceptable quality. Keep an eye on Gemini Live for when the stability issues are resolved.

Where this is going

The most exciting developments are coming from Google's Gemini Live API, which previews capabilities that will likely become standard across all providers.

Affective dialog

Gemini can adapt its response style based on the input expression and tone. If you sound stressed, it responds more calmly. If you are joking, it can match that energy. This goes beyond sentiment analysis of transcribed text; the model is interpreting acoustic cues directly.

Proactive audio

Standard voice AI uses Voice Activity Detection (VAD) to know when to listen and when to speak. VAD is essentially a binary switch: when it detects sound, the agent stops talking and listens; when it detects silence, the agent starts responding. This creates problems.
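A minimal sketch of that binary switch, using a hypothetical energy-threshold VAD, shows why it is so blunt: it has no notion of who is talking, or whether a pause means "done" or "thinking":

```python
# Hypothetical energy-threshold VAD illustrating the binary switch:
# any sound flips the agent to listening, any silence flips it to
# responding, with no understanding of the conversation itself.

def is_speech(frame: list[float], threshold: float = 0.01) -> bool:
    """Treat a frame as speech if its mean energy crosses a threshold."""
    energy = sum(s * s for s in frame) / len(frame)
    return energy > threshold

def agent_action(frame: list[float]) -> str:
    # Sound -> stop and listen; silence -> start responding.
    return "listen" if is_speech(frame) else "respond"
```

Background chatter trips the same switch as the caller's voice, and a thoughtful pause is indistinguishable from a finished turn.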

Proactive audio replaces that switch with something more intelligent. A few examples of what this enables:

  • External chatter rejection: If you are talking to the AI and someone else in the room starts a separate conversation, the model can identify that the second voice is not talking to it. No more "I'm sorry, I didn't catch that" when your colleague asks if you want coffee.
  • Filler word tolerance: If you pause and say "Ummm... let me think," the model recognizes this as a filler, not a finished thought. It will not interrupt you or start answering until it detects you are actually done speaking.
  • Configurable proactivity: You can guide the proactivity style via system instructions. In Silent Mode, the AI acts as a passive observer, listening and processing but only responding when directly addressed. In Outspoken Mode, the AI is encouraged to interject when it has something valuable to add, making it feel like a third participant in the conversation.

These features point to a future where voice AI feels less like talking to a system and more like talking to someone who understands conversational dynamics.

Evaluating voice agents

Voice agent evaluation is harder than text-based AI evaluation because failure modes are often invisible in transcripts. Turn timing, interruptions, and audio artifacts matter. A conversation can be correct on paper but feel terrible to the user.

The evaluation stack typically has three layers:

  1. Text-only behavioral tests: Fast and cheap, good for CI/CD. Test intent recognition, tool call correctness, and conversation policy. These catch logic bugs but miss voice-specific issues.
  2. Audio-native simulation: Full end-to-end tests with synthetic callers that have different accents, speaking speeds, background noise, and behaviors like interrupting or long pauses. This is where you catch the "real world" failures.
  3. Production monitoring: Capture real calls with synchronized audio, transcripts, and traces. Use failures to generate new test cases. This closes the loop between production incidents and regression tests.
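As a concrete example of the first layer, a text-only behavioral test can exercise the agent's dialogue logic without any audio at all. The intent router below is a hypothetical stub, not our production code:

```python
# Layer-1 behavioral test sketch: a hypothetical intent router stands in
# for the voice agent's NLU step, so logic can be checked fast in CI/CD.

def route_intent(utterance: str) -> str:
    """Toy intent router (hypothetical), keyword-based for illustration."""
    text = utterance.lower()
    if "cancel" in text:
        return "cancel_appointment"
    if "book" in text or "appointment" in text:
        return "book_appointment"
    return "fallback"

def test_intent_recognition() -> None:
    # Cheap, deterministic checks suitable for every commit; voice-specific
    # failures (timing, interruptions, noise) need layer 2 instead.
    assert route_intent("I'd like to book a slot") == "book_appointment"
    assert route_intent("Cancel my appointment please") == "cancel_appointment"
    assert route_intent("What's the weather?") == "fallback"
```

Tests like these run in milliseconds, which is why they belong in CI/CD while audio-native simulation runs on a slower cadence.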

Tooling in this space is maturing. LiveKit recently raised $100M at a $1B valuation with explicit plans to make voice agent evaluation easier. Dedicated platforms like Roark, Hamming, Coval, and Bluejay are building simulation and monitoring tools specifically for voice agents. This signals that voice agent evaluation is becoming part of the core infrastructure, not just a nice-to-have.

Closing

Speech-to-speech models are maturing rapidly. The gap between "demo impressive" and "production reliable" is closing. We expect the hybrid approach (speech-to-speech for responsiveness, text output piped to premium TTS for voice quality) to remain the sweet spot for the next year or so, while native voice quality catches up.

The more interesting shift is in capabilities. Features like proactive audio and affective dialog are not incremental improvements; they change what voice AI can do. The systems that figure out how to use these well will feel qualitatively different from today's voice agents.

We will continue sharing what we learn as we build. If you are working on similar problems, we would love to hear from you.

Want to see Talio in action?

Schedule a call or reach out to us at jarne@usetalio.com

Talio is committed to ethical AI development and deployment. Our technology is designed to augment human decision-making, not replace it.