Build a Voice Agent with Gemini 3.1 Flash Live in Python
11 min read
What Is Gemini 3.1 Flash Live and Why It Matters
On March 26, 2026, Google released Gemini 3.1 Flash Live, a real-time audio-to-audio AI model designed for natural, low-latency voice conversations. Unlike previous models that required text intermediaries (speech-to-text → LLM → text-to-speech), Flash Live processes audio directly, reducing latency to under 200ms and enabling interruptions, overlapping speech, and natural conversational flow. The model powers Google's Gemini Live and Search Live products, now expanded globally.
This matters because it's among the first production-ready, developer-accessible models that handle true audio-to-audio processing at this quality level. Previous attempts at real-time voice AI either required expensive enterprise contracts or suffered from noticeable lag that broke conversational flow. Flash Live is available through Google AI Studio's free tier with a generous quota, making real-time voice agents accessible to individual developers for experimentation and prototyping.
The technical breakthrough: Flash Live uses a unified multimodal architecture that processes audio, video, and text in a single forward pass, eliminating the latency of cascaded models. It supports 40+ languages, maintains conversational context across turns, and can handle interruptions mid-sentence, which is critical for natural dialogue. Google reports an 8x speedup over previous Gemini models on voice tasks and claims quality that matches or exceeds GPT-4o's voice mode in internal testing.
Source: Google AI Developer Documentation - Gemini 3.1 Flash Live
Prerequisites
- Python 3.10+ (tested on 3.11.7)
- Google Cloud account with billing enabled (free tier available: console.cloud.google.com)
- API key from Google AI Studio: aistudio.google.com/apikey
- Audio hardware: microphone and speakers (built-in laptop hardware works)
- Required packages:
pip install google-genai==0.8.2 pyaudio==0.2.14 websockets==12.0
- macOS users: Install PortAudio first: brew install portaudio
- Linux users: sudo apt-get install portaudio19-dev python3-pyaudio
- Windows users: PyAudio binary wheels install automatically
Step 1: Set Up Your Google AI API Key and Test Connection
Create a project directory and store your API key securely. Never hardcode API keys in source files.
mkdir gemini-voice-agent
cd gemini-voice-agent
echo "YOUR_API_KEY_HERE" > .api_key
chmod 600 .api_key # Restrict file permissions (Unix/macOS only)
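If you prefer environment variables over a key file, a small loader can support both. This is a sketch, not part of the official SDK: the GEMINI_API_KEY variable name here is just a convention you export yourself.

```python
import os

def load_api_key(path='.api_key', env_var='GEMINI_API_KEY'):
    """Prefer an environment variable; fall back to the key file."""
    key = os.environ.get(env_var)
    if key:
        return key.strip()
    with open(path) as f:
        return f.read().strip()
```

Either way, add .api_key to your .gitignore so the key never lands in version control.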
Create test_connection.py to verify your API key works:
from google import genai

# Load API key from file
with open('.api_key', 'r') as f:
    api_key = f.read().strip()

# Initialize client
client = genai.Client(api_key=api_key)

# Test with a simple text prompt
response = client.models.generate_content(
    model='gemini-3.1-flash-live-preview',
    contents='Say hello in exactly 5 words.'
)

print(f"Response: {response.text}")
print("✓ API connection successful")
Run the test:
python test_connection.py
Expected output:
Response: Hello there, how are you?
✓ API connection successful
If you see 401 Unauthorized, your API key is invalid. Regenerate it at aistudio.google.com/apikey. If you see 429 Too Many Requests, you've hit rate limits—wait 60 seconds and retry.
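Rather than retrying 429s by hand, you can wrap calls in a backoff helper. This is a generic sketch: adjust the exception check to whatever error type your SDK version raises for HTTP 429.

```python
import random
import time

def call_with_backoff(fn, max_retries=5, base_delay=1.0):
    """Retry fn() with exponential backoff and jitter on rate-limit errors.

    Generic sketch: the '429' substring check stands in for whatever
    exception type your SDK raises on rate limiting.
    """
    for attempt in range(max_retries):
        try:
            return fn()
        except Exception as e:
            if '429' not in str(e) or attempt == max_retries - 1:
                raise
            # Wait 1s, 2s, 4s, ... plus jitter to avoid synchronized retries
            delay = base_delay * (2 ** attempt) + random.uniform(0, 0.5)
            time.sleep(delay)
```

Usage: `call_with_backoff(lambda: client.models.generate_content(...))`.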
Step 2: Build Audio Input/Output Handlers
Flash Live requires streaming audio in 16kHz, 16-bit PCM format. Create audio_handler.py:
import pyaudio
import queue
import threading

class AudioHandler:
    def __init__(self, sample_rate=16000, chunk_size=1024):
        self.sample_rate = sample_rate
        self.chunk_size = chunk_size
        self.audio = pyaudio.PyAudio()

        # Input stream (microphone)
        self.input_stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            input=True,
            frames_per_buffer=self.chunk_size
        )

        # Output stream (speakers)
        self.output_stream = self.audio.open(
            format=pyaudio.paInt16,
            channels=1,
            rate=self.sample_rate,
            output=True,
            frames_per_buffer=self.chunk_size
        )

        self.input_queue = queue.Queue()
        self.output_queue = queue.Queue()
        self.running = False

    def start_recording(self):
        """Capture microphone audio in background thread"""
        self.running = True

        def record():
            while self.running:
                try:
                    data = self.input_stream.read(
                        self.chunk_size,
                        exception_on_overflow=False
                    )
                    self.input_queue.put(data)
                except Exception as e:
                    print(f"Recording error: {e}")

        self.record_thread = threading.Thread(target=record, daemon=True)
        self.record_thread.start()

    def start_playback(self):
        """Play audio from queue in background thread"""
        def play():
            while self.running:
                try:
                    data = self.output_queue.get(timeout=0.1)
                    self.output_stream.write(data)
                except queue.Empty:
                    continue
                except Exception as e:
                    print(f"Playback error: {e}")

        self.play_thread = threading.Thread(target=play, daemon=True)
        self.play_thread.start()

    def stop(self):
        """Clean up audio streams"""
        self.running = False
        self.input_stream.stop_stream()
        self.input_stream.close()
        self.output_stream.stop_stream()
        self.output_stream.close()
        self.audio.terminate()
Test the audio handler with test_audio.py:
from audio_handler import AudioHandler
import time

handler = AudioHandler()
handler.start_recording()
print("Recording for 3 seconds...")
time.sleep(3)

# Echo test: play back what was recorded
print("Playing back...")
while not handler.input_queue.empty():
    chunk = handler.input_queue.get()
    handler.output_queue.put(chunk)

handler.start_playback()
time.sleep(3)
handler.stop()
print("✓ Audio test complete")
Expected output: You should hear your voice played back after 3 seconds. If you hear nothing, check your system audio settings and ensure your microphone/speakers are set as default devices.
Step 3: Implement the Live API WebSocket Connection
Flash Live uses WebSocket for bidirectional streaming. Create voice_agent.py:
import asyncio
import base64
from google import genai
from audio_handler import AudioHandler

class VoiceAgent:
    def __init__(self, api_key, model='gemini-3.1-flash-live-preview'):
        self.client = genai.Client(api_key=api_key)
        self.model = model
        self.audio_handler = AudioHandler()
        self.session = None

    async def start_session(self, system_instruction=None):
        """Initialize Live API session with optional system prompt"""
        config = {
            'generation_config': {
                'temperature': 0.8,
                'response_modalities': ['AUDIO'],  # Audio output only
            }
        }
        if system_instruction:
            config['system_instruction'] = system_instruction

        # Create bidirectional streaming session
        self.session = self.client.aio.live.connect(
            model=self.model,
            config=config
        )
        await self.session.__aenter__()

    async def send_audio(self):
        """Stream microphone audio to API"""
        while self.audio_handler.running:
            if not self.audio_handler.input_queue.empty():
                chunk = self.audio_handler.input_queue.get()
                # Encode audio as base64 for transmission
                encoded = base64.b64encode(chunk).decode('utf-8')
                await self.session.send({
                    'realtime_input': {
                        'media_chunks': [{
                            'mime_type': 'audio/pcm',
                            'data': encoded
                        }]
                    }
                })
            await asyncio.sleep(0.01)  # 10ms polling interval

    async def receive_audio(self):
        """Receive and play AI audio responses"""
        async for response in self.session.receive():
            # Flash Live returns audio in serverContent messages;
            # guard against turns with no content or no model_turn
            server_content = getattr(response, 'server_content', None)
            model_turn = getattr(server_content, 'model_turn', None) if server_content else None
            if model_turn:
                for part in model_turn.parts:
                    if getattr(part, 'inline_data', None):
                        # Decode base64 audio and queue for playback
                        audio_data = base64.b64decode(part.inline_data.data)
                        self.audio_handler.output_queue.put(audio_data)

    async def run(self, system_instruction=None):
        """Main conversation loop"""
        await self.start_session(system_instruction)
        self.audio_handler.start_recording()
        self.audio_handler.start_playback()
        # Run send and receive concurrently
        await asyncio.gather(
            self.send_audio(),
            self.receive_audio()
        )

    def stop(self):
        """Clean up resources"""
        self.audio_handler.stop()
        if self.session:
            asyncio.create_task(self.session.__aexit__(None, None, None))
Step 4: Create a Working Voice Assistant
Build a simple voice assistant that can answer questions. Create assistant.py:
import asyncio
from voice_agent import VoiceAgent

async def main():
    # Load API key
    with open('.api_key', 'r') as f:
        api_key = f.read().strip()

    # Define assistant personality and capabilities
    system_instruction = """You are a helpful voice assistant. Keep responses
concise (under 30 seconds). Speak naturally with appropriate pauses.
If you don't understand audio input, ask the user to repeat."""

    agent = VoiceAgent(api_key)
    print("🎤 Voice assistant starting...")
    print("Speak naturally. Press Ctrl+C to exit.\n")
    try:
        await agent.run(system_instruction)
    except asyncio.CancelledError:
        # asyncio.run() cancels this task when Ctrl+C interrupts the loop
        pass
    finally:
        print("\n\n👋 Shutting down...")
        agent.stop()

if __name__ == '__main__':
    try:
        asyncio.run(main())
    except KeyboardInterrupt:
        pass
Run the assistant:
python assistant.py
Expected behavior:
- You'll see "Voice assistant starting..."
- Start speaking: "What's the weather like today?"
- Within 200-500ms, you'll hear the AI respond
- The assistant maintains context—you can follow up with "What about tomorrow?"
- Press Ctrl+C to exit
Step 5: Add Interruption Handling and Turn Detection
Flash Live supports natural interruptions. Enhance voice_agent.py to detect when the user starts speaking:
import numpy as np

class VoiceAgent:
    # ... (previous code) ...

    def __init__(self, api_key, model='gemini-3.1-flash-live-preview'):
        self.client = genai.Client(api_key=api_key)
        self.model = model
        self.audio_handler = AudioHandler()
        self.session = None
        self.silence_threshold = 500  # Amplitude threshold for speech detection
        self.is_speaking = False

    def detect_speech(self, audio_chunk):
        """Simple voice activity detection"""
        audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
        amplitude = np.abs(audio_array).mean()
        return amplitude > self.silence_threshold

    async def send_audio(self):
        """Stream audio with turn-taking signals"""
        silent_chunks = 0
        while self.audio_handler.running:
            if not self.audio_handler.input_queue.empty():
                chunk = self.audio_handler.input_queue.get()

                # Detect whether this chunk contains speech
                if self.detect_speech(chunk):
                    silent_chunks = 0
                    if not self.is_speaking:
                        self.is_speaking = True
                        # Signal turn start to interrupt AI if it's speaking
                        await self.session.send({
                            'client_content': {
                                'turn_complete': False
                            }
                        })
                else:
                    silent_chunks += 1

                encoded = base64.b64encode(chunk).decode('utf-8')
                await self.session.send({
                    'realtime_input': {
                        'media_chunks': [{
                            'mime_type': 'audio/pcm',
                            'data': encoded
                        }]
                    }
                })

                # ~10 consecutive silent chunks (~640ms at 1024 samples /
                # 16kHz) means the user stopped speaking
                if self.is_speaking and silent_chunks >= 10:
                    self.is_speaking = False
                    await self.session.send({
                        'client_content': {
                            'turn_complete': True  # Signal AI can respond
                        }
                    })
            await asyncio.sleep(0.01)
This implementation allows you to interrupt the AI mid-sentence—just start speaking and it will stop and listen.
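You can sanity-check the voice activity detection without a microphone by feeding it synthetic PCM: silence should fall below the threshold and a loud tone above it. This standalone sketch reimplements the same mean-amplitude check used by VoiceAgent.detect_speech:

```python
import numpy as np

SILENCE_THRESHOLD = 500  # Same default as VoiceAgent

def detect_speech(audio_chunk, threshold=SILENCE_THRESHOLD):
    """Mean-amplitude VAD, mirroring VoiceAgent.detect_speech."""
    audio_array = np.frombuffer(audio_chunk, dtype=np.int16)
    return np.abs(audio_array).mean() > threshold

# Synthetic 16kHz chunks: pure silence vs. a loud 440 Hz tone
silence = np.zeros(1024, dtype=np.int16).tobytes()
t = np.arange(1024) / 16000
tone = (10000 * np.sin(2 * np.pi * 440 * t)).astype(np.int16).tobytes()

print(detect_speech(silence))  # False
print(detect_speech(tone))     # True
```

A mean-amplitude check is deliberately crude; production systems typically use an energy threshold with hysteresis or a trained VAD model, but this is enough to exercise the turn-taking logic.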
Expected Output
When you run assistant.py, you should experience:
Terminal output:
🎤 Voice assistant starting...
Speak naturally. Press Ctrl+C to exit.
Conversational flow:
- You: "What's the capital of France?"
- AI (audio response, ~300ms latency): "The capital of France is Paris."
- You (interrupting mid-sentence): "What about—"
- AI (stops immediately, listens)
- You: "—Germany?"
- AI: "The capital of Germany is Berlin."
Performance metrics you should observe:
- Initial response latency: 200-500ms from end of your speech
- Interruption detection: <100ms to stop AI playback
- Audio quality: Clear, natural-sounding voice (not robotic)
- Context retention: Can reference previous turns ("What about tomorrow?" after asking about today)
If responses take >2 seconds, check your internet connection—Flash Live requires stable bandwidth (minimum 1 Mbps upload/download). If audio sounds choppy, reduce chunk_size in AudioHandler to 512.
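The bandwidth requirement is easy to sanity-check from the input format (16 kHz, 16-bit mono PCM): the upstream microphone audio alone is 256 kbps raw, about 341 kbps after base64 framing, before any response audio or protocol overhead:

```python
SAMPLE_RATE = 16000   # Hz
SAMPLE_BYTES = 2      # 16-bit PCM
CHUNK_SIZE = 1024     # samples per chunk

# Duration of one chunk: how often the mic thread produces data
chunk_ms = CHUNK_SIZE / SAMPLE_RATE * 1000
print(f"Chunk duration: {chunk_ms:.0f} ms")   # 64 ms

# Raw upload bitrate before base64 framing
raw_kbps = SAMPLE_RATE * SAMPLE_BYTES * 8 / 1000
print(f"Raw bitrate: {raw_kbps:.0f} kbps")    # 256 kbps

# Base64 expands binary data by 4/3
b64_kbps = raw_kbps * 4 / 3
print(f"With base64: {b64_kbps:.0f} kbps")    # 341 kbps
```

The 64 ms chunk duration also explains why a smaller chunk_size reduces choppiness: audio reaches the playback buffer in finer-grained pieces.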
Common Pitfalls
API quota exceeded after 10 minutes: Free tier limits Flash Live to 60 requests/minute and 1500 requests/day. Each audio chunk counts as a request. Solution: Increase chunk_size to 2048 (reduces request frequency) or upgrade to paid tier at console.cloud.google.com/billing.
ModuleNotFoundError: No module named '_portaudio': PyAudio installation failed. On macOS: brew install portaudio && pip install --force-reinstall pyaudio. On Linux: sudo apt-get install portaudio19-dev && pip install --force-reinstall pyaudio. On Windows: Download the appropriate .whl from PyAudio unofficial binaries and install with pip install pyaudio-0.2.14-cp311-cp311-win_amd64.whl.
AI responds to background noise: The silence_threshold = 500 is too low for noisy environments. Increase it to 1000-2000. To calibrate, record a few seconds of room noise, print the average amplitude, and set the threshold to 2x the reported value:
from audio_handler import AudioHandler
import numpy as np
import time

handler = AudioHandler()
handler.start_recording()
time.sleep(5)
chunks = [handler.input_queue.get() for _ in range(50)]
handler.stop()
print('Avg amplitude:', np.mean(
    [np.abs(np.frombuffer(c, dtype=np.int16)).mean() for c in chunks]
))
Audio playback is delayed by 2-3 seconds: Output queue is buffering too much. Flash Live sends audio in small chunks—play them immediately. Verify output_queue.get(timeout=0.1) in audio_handler.py has a short timeout. If using a Bluetooth speaker, switch to wired—Bluetooth adds 100-300ms latency.
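Related to buffering: when the user interrupts, any AI audio already sitting in the output queue should be discarded, otherwise playback continues from the buffer even after the model stops sending. A small helper along these lines (a suggested addition, not part of the code above) drains the queue:

```python
import queue

def flush_queue(q):
    """Discard all pending items so stale AI audio stops playing immediately."""
    dropped = 0
    while True:
        try:
            q.get_nowait()
            dropped += 1
        except queue.Empty:
            return dropped

# Example: call flush_queue(self.output_queue) when detect_speech()
# first fires while the AI is mid-response.
q = queue.Queue()
for chunk in (b'a', b'b', b'c'):
    q.put(chunk)
print(flush_queue(q))  # 3
print(q.empty())       # True
```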
websockets.exceptions.ConnectionClosed: received 1008: Your API key lacks Live API access. Verify at aistudio.google.com/apikey that "Enable Live API" is checked. If it's a newly created key, wait 5 minutes for propagation.
Quick Hits
Google Introduces TurboQuant: 6x Memory Reduction for AI Models with Zero Accuracy Loss
Google Research published TurboQuant, a compression algorithm that reduces LLM key-value cache memory by 6x and delivers up to 8x speedup with zero accuracy loss according to the research paper. The algorithm compresses KV cache to 3 bits using online vector quantization, addressing one of the biggest bottlenecks in AI inference. Memory manufacturers' stocks dropped 6% on the announcement as investors assessed potential impact on DRAM demand. Source: Google Research Blog
ARC-AGI-3 Benchmark Released: Interactive Test Reveals AI Systems Score <1% While Humans Score 100%
The ARC Prize Foundation released ARC-AGI-3, the first interactive benchmark for measuring agentic intelligence. Unlike static benchmarks, ARC-AGI-3 requires AI agents to explore, adapt, and reason through novel, abstract, turn-based environments. The results expose a fundamental gap: humans achieve 100% accuracy while frontier AI models including GPT-5, Claude Opus 4.6, and Gemini 3 score less than 1%. The benchmark is being called "the only unsaturated agentic intelligence benchmark in the world." Source: ARC Prize Foundation
OpenAI Launches Safety Bug Bounty Program for AI Abuse and Agentic Vulnerabilities
OpenAI launched a Safety Bug Bounty program (hosted on Bugcrowd) that pays researchers to identify AI-specific vulnerabilities including agentic risks, prompt injection, jailbreaks, data exposure, and platform integrity issues. This complements OpenAI's existing Security Bug Bounty and marks one of the first formal programs focused specifically on AI safety rather than traditional software security. Rewards range from $200 to $20,000 depending on severity. Source: OpenAI Blog
Anthropic "Mythos" Model Leaked: Most Powerful Claude Yet, Reportedly Held Back for Cybersecurity Concerns
Fortune exclusively reported that Anthropic accidentally leaked details of "Claude Mythos," an unreleased AI model described internally as representing a "step change in capabilities" beyond Claude Opus 4.6. According to the leaked draft blog post, Mythos is reportedly being held back specifically due to cybersecurity concerns—the company appears to be concerned about its advanced capabilities in offensive security and hacking. This would mark one of the first times an AI company has explicitly delayed a model release due to cyber risk if confirmed. Source: Fortune
Sources
- Google AI Developer Documentation — Gemini 3.1 Flash Live
- Google DeepMind — Gemini 3.1 Flash Live Model Card
- 9to5Google — Gemini 3.1 Flash Live Announcement
- Google Developers Blog — Build with Gemini 3.1 Flash Live
- Google Research Blog — TurboQuant: Redefining AI Efficiency
- ARC Prize Foundation — ARC-AGI-3 Benchmark
- OpenAI Blog — Safety Bug Bounty Program
- PyAudio Documentation
- Google Cloud Console — API Keys
- Google AI Studio
