Chapter 21 Project Practice: Multimodal Agent

🎨 "The real world speaks more than words — images, audio, and video are all languages an Agent needs to understand."


Chapter Overview

In previous chapters, all the Agents we built only processed text. But in the real world, information exists in many forms — users might send a screenshot asking "how do I fix this error," or say "help me analyze this chart." Multimodal Agents can understand and generate multiple types of content, greatly expanding application scenarios. This chapter will guide you to build a multimodal personal assistant from scratch that can handle text, images, and audio simultaneously.

Chapter Goals

After completing this chapter, you will be able to:

  • ✅ Understand the core capabilities and application scenarios of multimodal models
  • ✅ Implement image understanding and analysis with GPT-4o
  • ✅ Implement image generation with DALL-E
  • ✅ Integrate speech recognition (STT) and text-to-speech (TTS)
  • ✅ Build a complete multimodal personal assistant Agent
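
To preview what "image understanding with GPT-4o" looks like in practice, here is a minimal sketch of the multimodal message payload the OpenAI Chat Completions API accepts: an image is base64-encoded into a data URL and sent alongside the text question in a single user message. The helper name `build_vision_message` is illustrative, not part of the SDK; no API call is made here.

```python
# Pack a text question and an image into one GPT-4o-style user message.
# (build_vision_message is a hypothetical helper for illustration.)
import base64

def build_vision_message(question: str, image_bytes: bytes, mime: str = "image/png") -> dict:
    """Combine text and an image into a single multimodal chat message."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            # Images are passed as data URLs (or plain https URLs)
            {"type": "image_url",
             "image_url": {"url": f"data:{mime};base64,{b64}"}},
        ],
    }

msg = build_vision_message("How do I fix this error?", b"\x89PNG...")
print(msg["content"][0]["text"])  # → How do I fix this error?
```

Chapter 21.2 builds on this format to send real screenshots to the model.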

Chapter Structure

| Section | Content | Difficulty |
|---------|---------|------------|
| 21.1 Multimodal Capabilities Overview | Capabilities and application scenarios of multimodal models | ⭐⭐ |
| 21.2 Image Understanding and Generation | GPT-4o analyzes images, DALL-E generates images | ⭐⭐⭐ |
| 21.3 Voice Interaction Integration | Speech recognition and text-to-speech | ⭐⭐⭐ |
| 21.4 Practice: Multimodal Personal Assistant | Build a complete multimodal Agent | ⭐⭐⭐⭐ |

⏱️ Estimated Learning Time

Approximately 90–120 minutes (including hands-on exercises)

💡 Prerequisites

  • Completed Agent development basics from previous chapters
  • Familiar with basic OpenAI API usage (including Vision and Audio APIs)
  • Python async programming basics (async/await)
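
If the last prerequisite needs a refresher, the pattern used throughout this chapter boils down to this: two simulated I/O-bound tasks (standing in for, say, a vision call and an audio call) run concurrently with `asyncio.gather`. The `fetch` coroutine and its delays are made up for illustration.

```python
# Minimal async/await refresher: run two simulated API calls concurrently.
import asyncio

async def fetch(label: str, delay: float) -> str:
    await asyncio.sleep(delay)  # stands in for a network request
    return f"{label} done"

async def main() -> list[str]:
    # gather() runs both coroutines concurrently instead of one after the other
    return await asyncio.gather(fetch("vision", 0.01), fetch("audio", 0.01))

print(asyncio.run(main()))  # → ['vision done', 'audio done']
```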

🔗 Learning Path

Core Prerequisites: Chapter 4: Tool Calling, Chapter 12: LangGraph

Recommended but not required: Chapters 16–18: Production Series

Related Projects:


Next: 21.1 Multimodal Capabilities Overview