Voice Interaction Integration

Section Goal: Integrate speech recognition (STT) and text-to-speech (TTS) capabilities into the Agent, and understand real-time voice processing architecture and multilingual support strategies.

Voice Interaction Architecture


Voice Interaction Architecture Overview

The core challenge of voice Agents is latency. Users expect voice interaction to feel "as natural as talking to a person," meaning the delay from finishing speaking to hearing a reply cannot exceed 1–2 seconds. The entire pipeline involves three latency sources:

| Stage | Typical Latency | Optimization Direction |
| --- | --- | --- |
| Speech Recognition (STT) | 0.5–2s | Streaming recognition, endpoint detection |
| Agent Reasoning | 1–5s | Streaming output, caching |
| Text-to-Speech (TTS) | 0.3–1s | Streaming synthesis, pre-generation |
| End-to-End Total Latency | 2–8s | Full-pipeline streaming |

Two Architecture Patterns

Pattern 1: Sequential Processing (STT → LLM → TTS)

After audio recording is complete, perform speech recognition, LLM reasoning, and speech synthesis in sequence. Simple to implement, higher latency (3–8 seconds), suitable for scenarios with low real-time requirements. The first half of this section uses this pattern.

Pattern 2: Full-Pipeline Streaming

Speech recognition transcribes while the user is still speaking, the LLM streams output while it reasons, and TTS synthesizes while earlier audio plays — the three stages run as a parallel pipeline. Latency can be reduced to 1–2 seconds, but implementation complexity is high. OpenAI's Realtime API uses this pattern.
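A key building block of full-pipeline streaming is cutting the LLM's token stream into sentence-sized chunks so TTS can start on the first sentence before the full reply exists. A minimal sketch of the chunking logic (the token stream and the TTS dispatch it feeds are assumed, not shown):

```python
from typing import Iterable, Iterator

SENTENCE_ENDINGS = (".", "!", "?", "\n")

def sentence_chunks(token_stream: Iterable[str]) -> Iterator[str]:
    """Accumulate streamed LLM tokens and yield a chunk as soon as a
    sentence boundary appears, so TTS can begin synthesizing the first
    sentence while the LLM is still generating the rest."""
    buffer = ""
    for token in token_stream:
        buffer += token
        if buffer.rstrip().endswith(SENTENCE_ENDINGS):
            yield buffer.strip()
            buffer = ""
    if buffer.strip():  # flush any trailing partial sentence
        yield buffer.strip()

# Each yielded chunk would be handed to TTS immediately:
chunks = list(sentence_chunks(["Hello", " there", ".", " How", " are", " you?"]))
# chunks == ["Hello there.", "How are you?"]
```

Sentence-level chunking trades a little TTS naturalness (prosody across sentences) for a large cut in time-to-first-audio.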


Speech Recognition (Speech-to-Text)

Use OpenAI's Whisper model to convert speech to text:

from openai import OpenAI
from pathlib import Path

class SpeechToText:
    """Speech-to-text tool"""
    
    def __init__(self):
        self.client = OpenAI()
    
    def transcribe(
        self,
        audio_path: str,
        language: str = "en"
    ) -> str:
        """Convert an audio file to text"""
        
        with open(audio_path, "rb") as audio_file:
            transcript = self.client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                language=language,
                response_format="text"
            )
        
        return transcript
    
    def transcribe_with_timestamps(
        self,
        audio_path: str,
        language: str = "en"
    ) -> dict:
        """Transcribe and return timestamps"""
        
        with open(audio_path, "rb") as audio_file:
            transcript = self.client.audio.transcriptions.create(
                model="whisper-1",
                file=audio_file,
                language=language,
                response_format="verbose_json",
                timestamp_granularities=["segment"]
            )
        
        return {
            "text": transcript.text,
            "segments": [
                {
                    "text": seg.text,
                    "start": seg.start,
                    "end": seg.end
                }
                for seg in (transcript.segments or [])
            ]
        }


# Usage example
stt = SpeechToText()
text = stt.transcribe("user_voice.mp3")
print(f"User said: {text}")

Text-to-Speech

Convert the Agent's text replies to audio:

class TextToSpeech:
    """Text-to-speech tool"""
    
    VOICES = ["alloy", "echo", "fable", "onyx", "nova", "shimmer"]
    
    def __init__(self, voice: str = "nova"):
        self.client = OpenAI()
        self.voice = voice
    
    def speak(
        self,
        text: str,
        output_path: str = "response.mp3",
        speed: float = 1.0
    ) -> str:
        """Convert text to an audio file"""
        
        response = self.client.audio.speech.create(
            model="tts-1",       # or "tts-1-hd" for high quality
            voice=self.voice,
            input=text,
            speed=speed          # 0.25 - 4.0
        )
        
        response.stream_to_file(output_path)  # write_to_file() in newer SDK versions
        return output_path
    
    def stream_speak(self, text: str):
        """Streaming speech synthesis (yield audio chunks as they arrive)"""
        
        # with_streaming_response yields audio bytes incrementally,
        # so playback can start before synthesis finishes
        with self.client.audio.speech.with_streaming_response.create(
            model="tts-1",
            voice=self.voice,
            input=text
        ) as response:
            for chunk in response.iter_bytes():
                yield chunk


# Usage example
tts = TextToSpeech(voice="nova")
tts.speak("Hello! I'm your AI assistant. How can I help you today?")

Voice Conversation Loop

import asyncio

class VoiceConversation:
    """Voice conversation system"""
    
    def __init__(self, agent_func):
        self.stt = SpeechToText()
        self.tts = TextToSpeech(voice="nova")
        self.agent = agent_func  # Agent processing function
    
    async def process_voice(self, audio_path: str) -> str:
        """Process one round of voice conversation"""
        
        # 1. Audio → Text
        print("🎤 Recognizing speech...")
        user_text = self.stt.transcribe(audio_path)
        print(f"📝 Recognition result: {user_text}")
        
        # 2. Agent processing
        print("🤔 Agent thinking...")
        response_text = await self.agent(user_text)
        print(f"💬 Agent reply: {response_text}")
        
        # 3. Text → Audio
        print("🔊 Generating audio...")
        audio_output = self.tts.speak(response_text)
        print(f"✅ Audio saved: {audio_output}")
        
        return audio_output
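To exercise the loop, any async function mapping text to text can serve as the Agent. A stand-in `echo_agent` for illustration (a real one would call an LLM; the transcription/synthesis calls require an OpenAI API key, so they are left commented):

```python
import asyncio

# A stand-in Agent function for illustration; a real one would call an LLM.
async def echo_agent(user_text: str) -> str:
    return f"You said: {user_text}"

# convo = VoiceConversation(echo_agent)
# asyncio.run(convo.process_voice("user_voice.mp3"))
```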

OpenAI Realtime API: Low-Latency Voice Interaction

In late 2024, OpenAI launched the Realtime API [1], supporting direct audio input/output (Audio-in, Audio-out), bypassing the traditional STT → LLM → TTS three-stage process and reducing end-to-end latency to sub-second levels.

Core Features

  • Speech-to-Speech: The model directly receives audio input and generates audio output, without intermediate text conversion steps
  • WebSocket persistent connection: Establishes a persistent connection via WebSocket, supporting bidirectional real-time communication
  • Function Calling: Even in voice mode, the Agent can call external tools
  • Voice Activity Detection (VAD): Built-in endpoint detection, automatically determines when the user has finished speaking

Basic Connection Example

import asyncio
import websockets
import json
import base64

class RealtimeVoiceAgent:
    """Voice Agent based on OpenAI Realtime API"""
    
    REALTIME_URL = "wss://api.openai.com/v1/realtime?model=gpt-4o-realtime-preview"
    
    def __init__(self, api_key: str, instructions: str = "You are a friendly voice assistant"):
        self.api_key = api_key
        self.instructions = instructions
    
    async def connect(self):
        """Establish WebSocket connection"""
        headers = {
            "Authorization": f"Bearer {self.api_key}",
            "OpenAI-Beta": "realtime=v1"
        }
        
        async with websockets.connect(
            self.REALTIME_URL, 
            additional_headers=headers
        ) as ws:
            # Configure session
            await ws.send(json.dumps({
                "type": "session.update",
                "session": {
                    "instructions": self.instructions,
                    "voice": "nova",
                    "input_audio_format": "pcm16",
                    "output_audio_format": "pcm16",
                    "turn_detection": {
                        "type": "server_vad",  # Server-side voice activity detection
                        "threshold": 0.5,
                        "silence_duration_ms": 500
                    }
                }
            }))
            
            # Event handling loop
            async for message in ws:
                event = json.loads(message)
                await self._handle_event(event)
    
    async def _handle_event(self, event: dict):
        """Handle server-side events"""
        event_type = event.get("type", "")
        
        if event_type == "response.audio.delta":
            # Received audio chunk, can play immediately
            audio_chunk = base64.b64decode(event["delta"])
            await self._play_audio(audio_chunk)
        
        elif event_type == "response.text.delta":
            # Received text chunk (for debugging)
            print(event["delta"], end="", flush=True)
        
        elif event_type == "response.function_call_arguments.done":
            # Agent requests tool call
            await self._handle_tool_call(event)
    
    async def _play_audio(self, chunk: bytes):
        """Play audio chunk (requires audio playback library integration)"""
        # In actual implementation, use pyaudio or sounddevice
        pass
    
    async def _handle_tool_call(self, event: dict):
        """Handle tool calls"""
        # Same logic as Function Calling in text mode
        pass
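The example above only consumes server events; to send microphone audio, the client emits `input_audio_buffer.append` events carrying base64-encoded PCM16 chunks. A minimal encoding helper (the event shape follows the Realtime API documentation; audio capture and chunk sizing are left to the caller):

```python
import base64
import json

def audio_append_event(pcm16_chunk: bytes) -> str:
    """Wrap a raw PCM16 audio chunk in a Realtime API
    input_audio_buffer.append event, ready for ws.send()."""
    return json.dumps({
        "type": "input_audio_buffer.append",
        "audio": base64.b64encode(pcm16_chunk).decode("ascii"),
    })

# Inside RealtimeVoiceAgent.connect(), each captured chunk would be sent with:
#   await ws.send(audio_append_event(chunk))
```

With server-side VAD enabled (as in the session config above), the server decides when the buffered audio constitutes a finished turn, so the client does not need to commit the buffer explicitly.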

💡 Selection advice: If your Agent needs "phone call-like" real-time voice interaction (e.g., customer service bots, voice assistants), the Realtime API currently provides the best experience. If you only occasionally need voice input/output (e.g., voice memos), the traditional STT + TTS approach is simpler and more controllable.


Multilingual and Dialect Support

Whisper's Language Capabilities

OpenAI Whisper supports speech recognition in 99 languages, but accuracy varies significantly across languages:

| Language Category | Representative Languages | WER (Word Error Rate) | Recommendation |
| --- | --- | --- | --- |
| High-resource languages | English, Mandarin Chinese | < 5% | Use directly, excellent results |
| Medium-resource languages | Japanese, Korean, German | 5–15% | Usable; recommend specifying the language parameter |
| Low-resource languages | Cantonese, Tibetan, dialects | 15–30% | Requires post-processing correction |

Multilingual Agent Design Strategy

class MultilingualVoiceAgent:
    """Multilingual voice Agent"""
    
    # Language to TTS voice mapping
    VOICE_MAP = {
        "en": "alloy",     # English uses alloy
        "zh": "nova",      # Chinese uses nova (more natural)
        "ja": "shimmer",   # Japanese uses shimmer
    }
    
    def __init__(self):
        self.stt = SpeechToText()
        self.client = OpenAI()
    
    async def detect_and_respond(self, audio_path: str) -> str:
        """Auto-detect language and reply in the same language"""
        
        # Whisper auto-detects language
        with open(audio_path, "rb") as f:
            result = self.client.audio.transcriptions.create(
                model="whisper-1",
                file=f,
                response_format="verbose_json"
            )
        
        detected_lang = result.language  # e.g., "english", "chinese"
        lang_code = {
            "chinese": "zh", "english": "en", "japanese": "ja"
        }.get(detected_lang, "en")
        
        # Reply in the detected language
        voice = self.VOICE_MAP.get(lang_code, "nova")
        
        # Agent processing (specify reply language in System Prompt)
        response = await self._process_with_language(result.text, lang_code)
        
        # Synthesize speech with the corresponding language voice
        tts = TextToSpeech(voice=voice)
        return tts.speak(response)
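The class above leaves `_process_with_language` undefined. One way to implement it is to pin the reply language in the system prompt; a minimal sketch, where the helper, prompt wording, and model choice are illustrative assumptions:

```python
LANGUAGE_NAMES = {"en": "English", "zh": "Chinese", "ja": "Japanese"}

def language_pinned_messages(user_text: str, lang_code: str) -> list[dict]:
    """Build a chat message list whose system prompt pins the reply
    language, so the Agent answers in the language the user spoke."""
    language = LANGUAGE_NAMES.get(lang_code, "English")
    return [
        {"role": "system",
         "content": f"You are a helpful voice assistant. Always reply in {language}."},
        {"role": "user", "content": user_text},
    ]

# _process_with_language would then be roughly:
#   async def _process_with_language(self, text, lang_code):
#       resp = self.client.chat.completions.create(
#           model="gpt-4o",
#           messages=language_pinned_messages(text, lang_code))
#       return resp.choices[0].message.content
```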

Voice Sentiment Analysis

Voice conveys not only text information but also emotional cues (tone, speed, pitch). In customer service, mental health, and other scenarios, recognizing user emotions has important value:

class VoiceSentimentAnalyzer:
    """Voice-based sentiment analysis"""
    
    def __init__(self):
        self.client = OpenAI()
    
    async def analyze(self, audio_path: str) -> dict:
        """Analyze the text content and sentiment of speech"""
        
        # Step 1: Transcribe speech
        with open(audio_path, "rb") as f:
            transcript = self.client.audio.transcriptions.create(
                model="whisper-1", file=f, 
                response_format="verbose_json",
                timestamp_granularities=["segment"]
            )
        
        # Step 2: Let LLM analyze sentiment
        # Note: This only analyzes the sentiment of the text semantics.
        # True acoustic sentiment analysis requires specialized models (e.g., emotion2vec)
        analysis_prompt = f"""Analyze the sentiment of the following speech transcription, return JSON:
{{
  "text": "original text",
  "sentiment": "positive/neutral/negative",
  "emotion": "primary emotion (e.g., happy, anxious, angry, calm)",
  "urgency": "high/medium/low",
  "confidence": 0.0-1.0
}}

Transcription: {transcript.text}"""
        
        response = self.client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": analysis_prompt}],
            response_format={"type": "json_object"}
        )
        
        return json.loads(response.choices[0].message.content)

Voice Agent Design Patterns

In practice, voice Agents have several common design patterns:

Pattern 1: Voice Frontend + Text Backend

The simplest pattern. Voice is just the "skin" for input/output; the Agent core still processes text. Suitable for the vast majority of scenarios.

🎤 User voice → [STT] → Text → [Agent] → Text → [TTS] → 🔊 Voice reply

Pattern 2: Voice-Aware Agent

The Agent can perceive voice characteristics (speed, pauses, emotion) and adjust reply strategy accordingly. For example, when detecting urgency in the user's tone, reply more concisely and directly.

Pattern 3: Continuous Listening Agent

The Agent continuously monitors ambient audio and activates when it detects a wake word or specific event. Suitable for smart home and in-car scenarios. Requires a lightweight local VAD (Voice Activity Detection) model.
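To make the local VAD idea concrete, here is a toy energy-threshold detector over raw PCM16 frames. It is a sketch only: the threshold is arbitrary, and production systems use trained VAD models (e.g., Silero VAD) rather than raw energy:

```python
import struct

def is_speech(pcm16_chunk: bytes, threshold: int = 500) -> bool:
    """Crude energy-based voice activity detection: treat a PCM16 frame
    as speech when its mean absolute amplitude exceeds a threshold."""
    if not pcm16_chunk:
        return False
    samples = struct.unpack(f"<{len(pcm16_chunk) // 2}h", pcm16_chunk)
    energy = sum(abs(s) for s in samples) / len(samples)
    return energy > threshold

# A silent frame (all zeros) vs. a loud frame:
is_speech(struct.pack("<4h", 0, 0, 0, 0))            # → False
is_speech(struct.pack("<4h", 2000, -2000, 2000, -2000))  # → True
```

In a continuous-listening loop, frames failing this check are dropped locally, so audio is only streamed to the (costly) cloud pipeline once speech is detected.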

Pattern Comparison

| Pattern | Complexity | Latency | Applicable Scenarios |
| --- | --- | --- | --- |
| Voice frontend + text backend | ⭐ | 3–8s | General assistants, customer service |
| Voice-aware Agent | ⭐⭐⭐ | 3–8s | Emotional customer service, counseling |
| Continuous listening Agent | ⭐⭐⭐⭐ | 1–3s | Smart home, in-car |
| Realtime API Agent | ⭐⭐ | 0.5–2s | Real-time conversation, phone Agent |

References

[1] OpenAI. "Realtime API." OpenAI Platform Documentation, 2024. https://platform.openai.com/docs/guides/realtime


Summary

| Feature | Model | Description |
| --- | --- | --- |
| Speech recognition | Whisper | Supports 99 languages, high accuracy |
| Text-to-speech | TTS-1 | 6 voice options |
| Voice conversation | Combined use | Complete STT → Agent → TTS pipeline |

Next: 21.4 Practice: Multimodal Personal Assistant →