Build a Voice Agent with Gemini 3.1 Flash Live in Python
11 min read
What Is Gemini 3.1 Flash Live and Why It Matters
On March 26, 2026, Google released Gemini 3.1 Flash Live, a real-time audio-to-audio AI model designed for natural, low-latency voice conversations. Unlike previous models that required text intermediaries (speech-to-text → LLM → text-to-speech), Flash Live processes audio directly, reducing latency to under 200ms and enabling interruptions, overlapping speech, and natural conversational flow. The model powers Google's Gemini Live and Search Live products, now expanded globally.
This matters because it's among the first production-ready, developer-accessible models that handle true audio-to-audio processing at this quality level. Previous attempts at real-time voice AI either required expensive enterprise contracts or suffered from noticeable lag that broke conversational flow. Flash Live is available through Google AI Studio's free tier with a generous quota, making real-time voice agents accessible to individual developers for experimentation and prototyping.
The technical breakthrough: Flash Live uses a unified multimodal architecture that processes audio, video, and text in a single forward pass, eliminating the latency of cascaded models. It supports 40+ languages, maintains conversational context across turns, and can handle interruptions mid-sentence, which is critical for natural dialogue. Google reports an 8x speedup over previous Gemini models on voice tasks and claims quality that matches or exceeds GPT-4o's voice mode in internal testing.
Source: Google AI Developer Documentation - Gemini 3.1 Flash Live
Prerequisites
- Python 3.10+ (tested on 3.11.7)
- Google Cloud account with billing enabled (free tier available: console.cloud.google.com)
- API key from Google AI Studio: aistudio.google.com/apikey
- Audio hardware: microphone and speakers (built-in laptop hardware works)
- Required packages:
pip install google-genai==0.8.2 pyaudio==0.2.14 websockets==12.0
- macOS users: Install PortAudio first: brew install portaudio
- Linux users: sudo apt-get install portaudio19-dev python3-pyaudio
- Windows users: PyAudio binary wheels install automatically
Step 1: Set Up Your Google AI API Key and Test Connection
Create a project directory and store your API key securely. Never hardcode API keys in source files.
mkdir gemini-voice-agent
cd gemini-voice-agent
echo "YOUR_API_KEY_HERE" > .api_key
chmod 600 .api_key # Restrict file permissions (Unix/macOS only)
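If you prefer environment variables over a key file, a small loader can support both. This is a sketch, not part of the official SDK: the GEMINI_API_KEY variable name here is just a convention you export yourself.

```python
import os

def load_api_key(path='.api_key', env_var='GEMINI_API_KEY'):
    """Prefer an environment variable; fall back to the key file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    with open(path) as f:
        return f.read().strip()
```

Either way, add .api_key to your .gitignore so the key never lands in version control.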
Create test_connection.py to verify your API key works:
from google import genai

# Load API key from file
with open('.api_key', 'r') as f:
    api_key = f.read().strip()

# Initialize client
client = genai.Client(api_key=api_key)

# Test with a simple text prompt
response = client.models.generate_content(
    model='gemini-3.1-flash-live-preview',
    contents='Say hello in exactly 5 words.'
)

print(f"Response: {response.text}")
print("✓ API connection successful")
Run the test:
python test_connection.py
Expected output:
Response: Hello there, how are you?
✓ API connection successful
If you see 401 Unauthorized, your API key is invalid. Regenerate it at aistudio.google.com/apikey. If you see 429 Too Many Requests, you've hit rate limits—wait 60 seconds and retry.
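Rather than retrying 429s by hand, you can wrap calls in a backoff helper. This is a generic sketch: adjust the exception check to whatever error type your SDK version raises for HTTP 429.

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponential backoff and jitter on rate-limit errors.

    Generic sketch: the '429' substring check stands in for whatever
    exception type your SDK raises on rate limiting.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            if '429' not in str(e) or attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

Usage: `call_with_backoff(lambda: client.models.generate_content(...))`.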
Step 2: Build Audio Input/Output Handlers
Flash Live requires streaming audio in 16kHz, 16-bit PCM format. Create audio_handler.py:
import pyaudio
import queue
import threading

class AudioHandler:
    def __init__(self, sample_rate=16000, chunk_size=1024):
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size
        self.audio = pyaudio.PyAudio()

        # Input stream (microphone)
        self.input_stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )

        # Output stream (speakers)
        self.output_stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            output=True,
            frames_per_buffer=self.chunk_size
        )

        self.input_queue = queue.Queue()
        self.output_queue = queue.Queue()
        self.running = False

    def start_recording(self):
        """Capture microphone audio in background thread"""
        self.running = True

        def record():
            while self.running:
                try:
                    data = self.input_stream.read(
                        self.chunk_size,
                        exception_on_overflow=False
                    )
                    self.input_queue.put(data)
                except Exception as e:
                    print(f"Recording error: {e}")

        self.record_thread = threading.Thread(target=record, daemon=True)
        self.record_thread.start()

    def start_playback(self):
        """Play audio from queue in background thread"""
        def play():
            while self.running:
                try:
                    data = self.output_queue.get(timeout=0.1)
                    self.output_stream.write(data)
                except queue.Empty:
                    continue
                except Exception as e:
                    print(f"Playback error: {e}")

        self.play_thread = threading.Thread(target=play, daemon=True)
        self.play_thread.start()

    def stop(self):
        """Clean up audio streams"""
        self.running = False
        self.input_stream.stop_stream()
        self.input_stream.close()
        self.output_stream.stop_stream()
        self.output_stream.close()
        self.audio.terminate()
Test the audio handler with test_audio.py:
from audio_handler import AudioHandler
import time

handler = AudioHandler()
handler.start_recording()
print("Recording for 3 seconds...")
time.sleep(3)

# Echo test: play back what was recorded
print("Playing back...")
while not handler.input_queue.empty():
    chunk = handler.input_queue.get()
    handler.output_queue.put(chunk)

handler.start_playback()
time.sleep(3)
handler.stop()
print("✓ Audio test complete")
Expected output: You should hear your voice played back after 3 seconds. If you hear nothing, check your system audio settings and ensure your microphone/speakers are set as default devices.
Step 3: Implement the Live API WebSocket Connection
Flash Live uses WebSocket for bidirectional streaming. Create voice_agent.py:
import asyncio
import base64
from google import genai
from audio_handler import AudioHandler

class VoiceAgent:
    def __init__(self, api_key, model='gemini-3.1-flash-live-preview'):
        self.client = genai.Client(api_key=api_key)
        self.model = model
        self.audio_handler = AudioHandler()
        self.session = None

    async def start_session(self, system_instruction=None):
        """Initialize Live API session with optional system prompt"""
        config = {
            'generation_config': {
                'temperature': 0.8,
                'response_modalities': ['AUDIO'],  # Audio output only
            }
        }
        if system_instruction:
            config['system_instruction'] = system_instruction

        # Create bidirectional streaming session
        self.session = self.client.aio.live.connect(
            model=self.model,
            config=config
        )
        await self.session.__aenter__()

    async def send_audio(self):
        """Stream microphone audio to API"""
        while self.audio_handler.running:
            if not self.audio_handler.input_queue.empty():
                chunk = self.audio_handler.input_queue.get()
                # Encode audio as base64 for transmission
                encoded = base64.b64encode(chunk).decode('utf-8')
                await self.session.send({
                    'realtime_input': {
                        'media_chunks': [{
                            'mime_type': 'audio/pcm',
                            'data': encoded
                        }]
                    }
                })
            await asyncio.sleep(0.01)  # 10ms polling interval

    async def receive_audio(self):
        """Receive and play AI audio responses"""
        async for response in self.session.receive():
            # Flash Live returns audio in serverContent messages;
            # guard against turns with no content or no model_turn
            server_content = getattr(response, 'server_content', None)
            model_turn = getattr(server_content, 'model_turn', None) if server_content else None
            if model_turn:
                for part in model_turn.parts:
                    if getattr(part, 'inline_data', None):
                        # Decode base64 audio and queue for playback
                        audio_data = base64.b64decode(part.inline_data.data)
                        self.audio_handler.output_queue.put(audio_data)

    async def run(self, system_instruction=None):
        """Main conversation loop"""
        await self.start_session(system_instruction)
        self.audio_handler.start_recording()
        self.audio_handler.start_playback()
        # Run send and receive concurrently
        await asyncio.gather(
            self.send_audio(),
            self.receive_audio()
        )

    def stop(self):
        """Clean up resources"""
        self.audio_handler.stop()
        if self.session:
            asyncio.create_task(self.session.__aexit__(None, None, None))
Step 4: Create a Working Voice Assistant
Build a simple voice assistant that can answer questions. Create assistant.py:
import asyncio
from voice_agent import VoiceAgent

async def main():
    # Load API key
    with open('.api_key', 'r') as f:
        api_key = f.read().strip()

    # Define assistant personality and capabilities
    system_instruction = """You are a helpful voice assistant. Keep responses
concise (under 30 seconds). Speak naturally with appropriate pauses.
If you don't understand audio input, ask the user to repeat."""

    agent = VoiceAgent(api_key)
    print("🎤 Voice assistant starting...")
    print("Speak naturally. Press Ctrl+C to exit.\n")
    try:
        await agent.run(system_instruction)
    except asyncio.CancelledError:
        # asyncio.run() cancels this task when Ctrl+C interrupts the loop
        pass
    finally:
        print("\n\n👋 Shutting down...")
        agent.stop()

if __name__ == '__main__':
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        pass
Run the assistant:
python assistant.py
Expected behavior:
- You'll see "Voice assistant starting..."
- Start speaking: "What's the weather like today?"
- Within 200-500ms, you'll hear the AI respond
- The assistant maintains context—you can follow up with "What about tomorrow?"
- Press Ctrl+C to exit
Step 5: Add Interruption Handling and Turn Detection
Flash Live supports natural interruptions. Enhance voice_agent.py to detect when the user starts speaking:
import numpy as np

class VoiceAgent:
    # ... (previous code) ...

    def __init__(self, api_key, model='gemini-3.1-flash-live-preview'):
        self.client = genai.Client(api_key=api_key)
        self.model = model
        self.audio_handler = AudioHandler()
        self.session = None
        self.silence_threshold = 500  # Amplitude threshold for speech detection
        self.is_speaking = False

    def detect_speech(self, audio_chunk):
        """Simple voice activity detection"""
        audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
        amplitude = np.abs(audio_array).mean()
        return amplitude > self.silence_threshold

    async def send_audio(self):
        """Stream audio with turn-taking signals"""
        silent_chunks = 0
        while self.audio_handler.running:
            if not self.audio_handler.input_queue.empty():
                chunk = self.audio_handler.input_queue.get()

                # Detect whether this chunk contains speech
                if self.detect_speech(chunk):
                    silent_chunks = 0
                    if not self.is_speaking:
                        self.is_speaking = True
                        # Signal turn start to interrupt AI if it's speaking
                        await self.session.send({
                            'client_content': {
                                'turn_complete': False
                            }
                        })
                else:
                    silent_chunks += 1

                encoded = base64.b64encode(chunk).decode('utf-8')
                await self.session.send({
                    'realtime_input': {
                        'media_chunks': [{
                            'mime_type': 'audio/pcm',
                            'data': encoded
                        }]
                    }
                })

                # ~10 consecutive silent chunks (~640ms at 1024 samples /
                # 16kHz) means the user stopped speaking
                if self.is_speaking and silent_chunks >= 10:
                    self.is_speaking = False
                    await self.session.send({
                        'client_content': {
                            'turn_complete': True  # Signal AI can respond
                        }
                    })
            await asyncio.sleep(0.01)
This implementation allows you to interrupt the AI mid-sentence—just start speaking and it will stop and listen.
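You can sanity-check the voice activity detection without a microphone by feeding it synthetic PCM: silence should fall below the threshold and a loud tone above it. This standalone sketch reimplements the same mean-amplitude check used by VoiceAgent.detect_speech:

```python
import numpy as np

SILENCE_THRESHOLD = 500  # Same default as VoiceAgent

def detect_speech(audio_chunk, threshold=SILENCE_THRESHOLD):
    """Mean-amplitude VAD, mirroring VoiceAgent.detect_speech."""
    audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
    return np.abs(audio_array).mean() > threshold

# Synthetic 16kHz chunks: pure silence vs. a loud 440 Hz tone
silence = np.zeros(1024, dtype=np.int16).tobytes()
t = np.arange(1024) / 16000
tone = (10000 * np.sin(2 * np.pi * 440 * t)).astype(np.int16).tobytes()

print(detect_speech(silence))  # False
print(detect_speech(tone))     # True
```

A mean-amplitude check is deliberately crude; production systems typically use an energy threshold with hysteresis or a trained VAD model, but this is enough to exercise the turn-taking logic.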
Expected Output
When you run assistant.py, you should experience:
Terminal output:
🎤 Voice assistant starting...
Speak naturally. Press Ctrl+C to exit.
Conversational flow:
- You: "What's the capital of France?"
- AI (audio response, ~300ms latency): "The capital of France is Paris."
- You (interrupting mid-sentence): "What about—"
- AI (stops immediately, listens)
- You: "—Germany?"
- AI: "The capital of Germany is Berlin."
Performance metrics you should observe:
- Initial response latency: 200-500ms from end of your speech
- Interruption detection: <100ms to stop AI playback
- Audio quality: Clear, natural-sounding voice (not robotic)
- Context retention: Can reference previous turns ("What about tomorrow?" after asking about today)
If responses take >2 seconds, check your internet connection—Flash Live requires stable bandwidth (minimum 1 Mbps upload/download). If audio sounds choppy, reduce chunk_size in AudioHandler to 512.
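The bandwidth requirement is easy to sanity-check from the input format (16 kHz, 16-bit mono PCM): the upstream microphone audio alone is 256 kbps raw, about 341 kbps after base64 framing, before any response audio or protocol overhead:

```python
SAMPLE_RATE = 16000   # Hz
SAMPLE_BYTES = 2      # 16-bit PCM
CHUNK_SIZE = 1024     # samples per chunk

# Duration of one chunk: how often the mic thread produces data
chunk_ms = CHUNK_SIZE / SAMPLE_RATE * 1000
print(f"Chunk duration: {chunk_ms:.0f} ms")   # 64 ms

# Raw upload bitrate before base64 framing
raw_kbps = SAMPLE_RATE * SAMPLE_BYTES * 8 / 1000
print(f"Raw bitrate: {raw_kbps:.0f} kbps")    # 256 kbps

# Base64 expands binary data by 4/3
b64_kbps = raw_kbps * 4 / 3
print(f"With base64: {b64_kbps:.0f} kbps")    # 341 kbps
```

The 64 ms chunk duration also explains why a smaller chunk_size reduces choppiness: audio reaches the playback buffer in finer-grained pieces.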
Common Pitfalls
API quota exceeded after 10 minutes: Free tier limits Flash Live to 60 requests/minute and 1500 requests/day. Each audio chunk counts as a request. Solution: Increase chunk_size to 2048 (reduces request frequency) or upgrade to paid tier at console.cloud.google.com/billing.
ModuleNotFoundError: No module named '_portaudio': PyAudio installation failed. On macOS: brew install portaudio && pip install --force-reinstall pyaudio. On Linux: sudo apt-get install portaudio19-dev && pip install --force-reinstall pyaudio. On Windows: Download the appropriate .whl from PyAudio unofficial binaries and install with pip install pyaudio-0.2.14-cp311-cp311-win_amd64.whl.
AI responds to background noise: The silence_threshold = 500 is too low for noisy environments. Increase it to 1000-2000. To calibrate, record a few seconds of room noise, print the average amplitude, and set the threshold to 2x the reported value:
from audio_handler import AudioHandler
import numpy as np
import time

handler = AudioHandler()
handler.start_recording()
time.sleep(5)
chunks = [handler.input_queue.get() for _ in range(50)]
handler.stop()
print('Avg amplitude:', np.mean(
    [np.abs(np.frombuffer(c, dtype=np.int16)).mean() for c in chunks]
))
Audio playback is delayed by 2-3 seconds: Output queue is buffering too much. Flash Live sends audio in small chunks—play them immediately. Verify output_queue.get(timeout=0.1) in audio_handler.py has a short timeout. If using a Bluetooth speaker, switch to wired—Bluetooth adds 100-300ms latency.
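Related to buffering: when the user interrupts, any AI audio already sitting in the output queue should be discarded, otherwise playback continues from the buffer even after the model stops sending. A small helper along these lines (a suggested addition, not part of the code above) drains the queue:

```python
import queue

def flush_queue(q):
    """Discard all pending items so stale AI audio stops playing immediately."""
    dropped = 0
    while True:
        try:
            q.get_nowait()
            dropped += 1
        except queue.Empty:
            return dropped

# Example: call flush_queue(self.output_queue) when detect_speech()
# first fires while the AI is mid-response.
q = queue.Queue()
for chunk in (b'a', b'b', b'c'):
    q.put(chunk)
print(flush_queue(q))  # 3
print(q.empty())       # True
```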
websockets.exceptions.ConnectionClosed: received 1008: Your API key lacks Live API access. Verify at aistudio.google.com/apikey that "Enable Live API" is checked. If it's a newly created key, wait 5 minutes for propagation.
Quick Hits
Google Introduces TurboQuant: 6x Memory Reduction for AI Models with Zero Accuracy Loss
Google Research published TurboQuant, a compression algorithm that reduces LLM key-value cache memory by 6x and delivers up to 8x speedup with zero accuracy loss according to the research paper. The algorithm compresses KV cache to 3 bits using online vector quantization, addressing one of the biggest bottlenecks in AI inference. Memory manufacturers' stocks dropped 6% on the announcement as investors assessed potential impact on DRAM demand. Source: Google Research Blog
ARC-AGI-3 Benchmark Released: Interactive Test Reveals AI Systems Score <1% While Humans Score 100%
The ARC Prize Foundation released ARC-AGI-3, the first interactive benchmark for measuring agentic intelligence. Unlike static benchmarks, ARC-AGI-3 requires AI agents to explore, adapt, and reason through novel, abstract, turn-based environments. The results expose a fundamental gap: humans achieve 100% accuracy while frontier AI models including GPT-5, Claude Opus 4.6, and Gemini 3 score less than 1%. The benchmark is being called "the only unsaturated agentic intelligence benchmark in the world." Source: ARC Prize Foundation
OpenAI Launches Safety Bug Bounty Program for AI Abuse and Agentic Vulnerabilities
OpenAI launched a Safety Bug Bounty program (hosted on Bugcrowd) that pays researchers to identify AI-specific vulnerabilities including agentic risks, prompt injection, jailbreaks, data exposure, and platform integrity issues. This complements OpenAI's existing Security Bug Bounty and marks one of the first formal programs focused specifically on AI safety rather than traditional software security. Rewards range from $200 to $20,000 depending on severity. Source: OpenAI Blog
Anthropic "Mythos" Model Leaked: Most Powerful Claude Yet, Reportedly Held Back for Cybersecurity Concerns
Fortune exclusively reported that Anthropic accidentally leaked details of "Claude Mythos," an unreleased AI model described internally as representing a "step change in capabilities" beyond Claude Opus 4.6. According to the leaked draft blog post, Mythos is reportedly being held back specifically due to cybersecurity concerns—the company appears to be concerned about its advanced capabilities in offensive security and hacking. This would mark one of the first times an AI company has explicitly delayed a model release due to cyber risk if confirmed. Source: Fortune
Sources
- Google AI Developer Documentation — Gemini 3.1 Flash Live
- Google DeepMind — Gemini 3.1 Flash Live Model Card
- 9to5Google — Gemini 3.1 Flash Live Announcement
- Google Developers Blog — Build with Gemini 3.1 Flash Live
- Google Research Blog — TurboQuant: Redefining AI Efficiency
- ARC Prize Foundation — ARC-AGI-3 Benchmark
- OpenAI Blog — Safety Bug Bounty Program
- PyAudio Documentation
- Google Cloud Console — API Keys
- Google AI Studio
