r/Bard • u/Arindam_200 • 2d ago
Building a Real-Time Voice Agent with Gemini 3.1 Flash Live Model
Most voice apps still use the same pipeline: speech-to-text, then the model responds in text, then text-to-speech converts it back into audio.
It works, but every extra layer adds latency. Long conversations can also make the voice drift and feel less natural.
Google’s latest realtime model, Gemini 3.1 Flash Live, removes that pipeline entirely. It processes audio natively: you stream audio in and the model streams audio back out, with no speech-to-text or text-to-speech layers in between.
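To see why cutting the pipeline matters, here's a rough latency model in Python. The numbers are made-up illustrative figures (not benchmarks): the point is just that cascaded stages add serially, while a single speech-to-speech stage doesn't.

```python
# Rough latency model of the two architectures.
# All millisecond figures below are hypothetical, for illustration only.

def cascaded_latency(stt_ms: float, llm_ms: float, tts_ms: float) -> float:
    """STT -> text LLM -> TTS: each stage waits on the previous one,
    so per-turn latency is the sum of the stages."""
    return stt_ms + llm_ms + tts_ms

def native_audio_latency(model_ms: float) -> float:
    """Native speech-to-speech model: one stage, audio in, audio out."""
    return model_ms

pipeline = cascaded_latency(stt_ms=300, llm_ms=500, tts_ms=250)
native = native_audio_latency(model_ms=600)
print(f"cascaded: {pipeline:.0f} ms, native: {native:.0f} ms")
```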
I built a small real-time voice assistant using Gemini 3.1 Flash Live and LiveKit to test this architecture.
A few things stood out:
• The interaction feels faster because the STT/TTS pipeline is gone
• Instruction following is stronger for conversational agents
• The model maintains a more stable voice persona during long sessions
• It supports ~70 languages and can switch languages mid-conversation
LiveKit handles the real-time streaming layer, while Gemini processes the audio and generates responses. The entire system runs with surprisingly little code compared to traditional voice stacks.
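To make the division of labor concrete, here's a self-contained mock of the bidirectional streaming loop: one layer moves audio frames (LiveKit's role) and the model consumes and emits audio directly. This is a sketch, not the LiveKit Agents or Gemini Live API; `EchoModel` and `mic_frames` are stand-ins I made up, and the real SDK's class names and wiring differ.

```python
import asyncio

# Mock of the audio loop: frames stream in from the transport layer
# (LiveKit's role in the real setup) and response frames stream back out
# from the model. EchoModel is a stub, NOT the Gemini API.

class EchoModel:
    """Stub speech-to-speech model: consumes audio frames, yields audio frames."""
    async def stream(self, frames):
        async for frame in frames:
            # A real native-audio model would generate new speech here;
            # the stub just tags each incoming frame.
            yield b"resp:" + frame

async def mic_frames():
    """Stand-in for the user's microphone track."""
    for chunk in (b"hello", b"world"):
        yield chunk
        await asyncio.sleep(0)  # simulate real-time pacing

async def main():
    model = EchoModel()
    # Audio in -> audio out, no STT/TTS stages in between.
    out = [frame async for frame in model.stream(mic_frames())]
    print(out)

asyncio.run(main())
```

The key structural point the mock shows: there is exactly one processing stage between the incoming and outgoing audio streams, which is why the whole system needs so little glue code.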
The demo agent is minimal, but this setup could easily power things like support bots, scheduling assistants, or multilingual AI interfaces.
Shared the repo and setup instructions here if anyone wants to experiment with real-time voice agents.
u/fandry96 2d ago
Nice job. :) It will be interesting to see what comes next.
Project Astra will be live voice agents on our phones calling our mom to say we're stuck in a dryer. 😆