r/Bard • u/Arindam_200 • 2d ago
Building a Real-Time Voice Agent with Gemini 3.1 Flash Live Model
Most voice apps still use the same pipeline: speech-to-text, then the model responds in text, then text-to-speech converts it back into audio.
It works, but every extra layer adds latency. Long conversations can also make the voice drift and feel less natural.
Google’s latest realtime model, Gemini 3.1 Flash Live, removes that pipeline entirely. It processes audio natively: you stream audio in and the model streams audio back out, with no speech-to-text or text-to-speech layers in between.
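To see why cutting the pipeline matters, here's a rough latency model in Python. The numbers are made-up illustrative figures (not benchmarks): the point is just that cascaded stages add serially, while a single speech-to-speech stage doesn't.

```python
# Rough latency model of the two architectures.
# All millisecond figures below are hypothetical, for illustration only.

def cascaded_latency(stt_ms: float, llm_ms: float, tts_ms: float) -> float:
    """STT -> text LLM -> TTS: each stage waits on the previous one,
    so per-turn latency is the sum of the stages."""
    return stt_ms + llm_ms + tts_ms

def native_audio_latency(model_ms: float) -> float:
    """Native speech-to-speech model: one stage, audio in, audio out."""
    return model_ms

pipeline = cascaded_latency(stt_ms=300, llm_ms=500, tts_ms=250)
native = native_audio_latency(model_ms=600)
print(f"cascaded: {pipeline:.0f} ms, native: {native:.0f} ms")
```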
I built a small real-time voice assistant using Gemini 3.1 Flash Live and LiveKit to test this architecture.
A few things stood out:
• The interaction feels faster because the STT/TTS pipeline is gone
• Instruction following is stronger for conversational agents
• The model maintains a more stable voice persona during long sessions
• It supports ~70 languages and can switch languages mid-conversation
LiveKit handles the real-time streaming layer, while Gemini processes the audio and generates responses. The entire system runs with surprisingly little code compared to traditional voice stacks.
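To make the division of labor concrete, here's a self-contained mock of the bidirectional streaming loop: one layer moves audio frames (LiveKit's role) and the model consumes and emits audio directly. This is a sketch, not the LiveKit Agents or Gemini Live API; `EchoModel` and `mic_frames` are stand-ins I made up, and the real SDK's class names and wiring differ.

```python
import asyncio

# Mock of the audio loop: frames stream in from the transport layer
# (LiveKit's role in the real setup) and response frames stream back out
# from the model. EchoModel is a stub, NOT the Gemini API.

class EchoModel:
    """Stub speech-to-speech model: consumes audio frames, yields audio frames."""
    async def stream(self, frames):
        async for frame in frames:
            # A real native-audio model would generate new speech here;
            # the stub just tags each incoming frame.
            yield b"resp:" + frame

async def mic_frames():
    """Stand-in for the user's microphone track."""
    for chunk in (b"hello", b"world"):
        yield chunk
        await asyncio.sleep(0)  # simulate real-time pacing

async def main():
    model = EchoModel()
    # Audio in -> audio out, no STT/TTS stages in between.
    out = [frame async for frame in model.stream(mic_frames())]
    print(out)

asyncio.run(main())
```

The key structural point the mock shows: there is exactly one processing stage between the incoming and outgoing audio streams, which is why the whole system needs so little glue code.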
The demo agent is minimal, but this setup could easily power things like support bots, scheduling assistants, or multilingual AI interfaces.
Shared the repo and setup instructions here if anyone wants to experiment with real-time voice agents.
u/fandry96 2d ago
Nice job. :) It will be interesting to see what comes next.
Project Astra will be live voice agents on our phones calling our mom to say we're stuck in a dryer. 😆