
Foundation Model Landscape and Selection Guide

🌍 "Models iterate rapidly — today's SOTA may be tomorrow's baseline. But understanding the evolution trends lets you make better choices amid change."

In the previous sections, we learned about the basic principles of LLMs, prompt engineering, API calls, and model parameters. That knowledge represents the "unchanging" underlying capabilities. This section discusses the "changing" part — the technical frontier and industry landscape of foundation models.

As an Agent developer, you don't need to train your own foundation model, but you must understand the capability boundaries and development trends of models — because the choice of model directly determines the ceiling of your Agent.

Foundation Model Landscape and Four Major Trends

Trend 1: The Leap in Reasoning Capability

In September 2024, OpenAI's o1 first proved the feasibility of "trading more reasoning time for better results." In January 2025, the open-source release of DeepSeek-R1 ignited the democratization of reasoning models — it was the first to demonstrate how pure RL training (GRPO) could cause Chain-of-Thought capability to emerge spontaneously in a model.

In April 2025, OpenAI released o3 and o4-mini, achieving multimodal reasoning ("thinking while looking at images") and autonomous tool chain calls for the first time. In August 2025, GPT-5 was officially released, with reasoning capability built in as a native feature, eliminating the need for a separate o-series model.

By early 2026, reasoning had become standard in all mainstream models:

| Model | Release | Reasoning Mode | Key Breakthrough |
| --- | --- | --- | --- |
| Claude Opus 4.7 | 2026.04 | Adaptive thinking depth | SWE-bench Verified #1, visual capability tops charts, new tokenizer |
| GPT-5.4 | 2026.03 | Built-in Thinking mode | Reasoning + coding + Computer Use + search unified, 1M context |
| Claude Opus 4.6 | 2026.02 | Adaptive thinking depth | 1M context (Beta) + SWE-bench 80.8% |
| GPT-5 | 2025.08 | Built-in intelligent routing | SWE-bench 75%, unified system architecture, multimodal |
| Claude Opus 4 | 2025.05 | Deep reasoning | SWE-bench 72.5%, continuous 7-hour operation |
| Gemini 2.5 Pro | 2025.03 | Native multimodal reasoning | 1M context + dynamic reasoning budget control |
| DeepSeek-R1 | 2025.01 | Pure RL reasoning | Open-source reasoning model ignites the world, GRPO training |
| Kimi K2.6 | 2026.04 | Agent reasoning | 1T params open-source, 13-hour coding, 300 sub-agents parallel |
| Kimi K2 | 2025.07 | Agent reasoning | 1T total / 32B active, MuonClip optimizer, open-source Agent SOTA |
| Qwen3-235B-A22B | 2025.04 | Hybrid reasoning (fast/slow) | Open-source flagship, surpasses DeepSeek-R1 and o1 |

💡 Impact on Agents: Reasoning models give Agents a qualitative leap in "planning" and "complex decision-making." In real engineering, more and more Agents adopt a "fast-slow dual system" — fast models for simple routing, reasoning models for complex planning. The arrival of GPT-5 and Claude 4.6 makes this switching more seamless — reasoning capability is now built into general-purpose models.
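
The fast-slow dual system described above can be sketched as a thin routing layer in front of the Agent. The model names and the complexity heuristic below are illustrative assumptions, not a vendor API:

```python
# Sketch of a fast/slow dual-system router. Model names and the
# length-based complexity heuristic are illustrative assumptions.
FAST_MODEL = "gpt-4o-mini"  # cheap, low-latency: routing, formatting, simple Q&A
SLOW_MODEL = "gpt-5"        # reasoning model: planning, multi-step decisions

def route_model(task: str, requires_planning: bool = False) -> str:
    """Pick a model tier for one Agent step."""
    if requires_planning or len(task) > 2000:
        return SLOW_MODEL
    return FAST_MODEL

print(route_model("Classify this support ticket"))                     # gpt-4o-mini
print(route_model("Design a migration plan", requires_planning=True))  # gpt-5
```

In production, the `requires_planning` flag is usually set by a cheap classifier or by the Agent framework itself rather than hard-coded.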

Trend 2: MoE and the Efficiency Revolution

Large models keep getting larger, but inference costs are falling — driven by the comprehensive victory of Mixture of Experts (MoE).

The core idea of MoE: the total parameter count can be very large (hundreds of billions), but only a small fraction of it is activated for each inference step. It works like a large company with hundreds of employees, where only the dozen best suited are assigned to any given project.

```python
import numpy as np

# Intuitive understanding of MoE models (conceptual illustration with toy weights)
class MixtureOfExperts:
    """
    Using Qwen3.5-Plus as an example:
    Total parameters: 397B
    Active per inference: 17B (only ~4.3%)
    Effect: approaches or exceeds trillion-parameter dense models,
    at a fraction of the inference cost
    """
    def __init__(self, dim=16, num_experts=128, active_experts=8):
        self.active_experts = active_experts
        rng = np.random.default_rng(0)
        self.router_w = rng.standard_normal((dim, num_experts))  # gating network
        self.experts = [rng.standard_normal((dim, dim)) for _ in range(num_experts)]

    def forward(self, x):
        # Router scores every expert for this input
        scores = x @ self.router_w
        # Keep only the top-k experts (sparse activation)
        top_k = np.argsort(scores)[-self.active_experts:]
        weights = np.exp(scores[top_k])
        weights /= weights.sum()  # normalize gate weights over the selected experts
        # Only the selected experts participate in the computation
        return sum(w * (x @ self.experts[i]) for i, w in zip(top_k, weights))
```
| Model | Total Params | Active Params | Architecture Highlights |
| --- | --- | --- | --- |
| Kimi K2.6 | 1T | 32B | K2 upgrade, 13-hour coding, 300 sub-agents parallel, SWE-bench Pro 58.6% |
| Kimi K2 | 1T | 32B | MuonClip optimizer, trillion-parameter open-source MoE |
| Qwen3.6-35B-A3B | 35B | 3B | Released 2026.04, lightweight MoE, extreme efficiency |
| Llama 4 Maverick | ~400B | 17B | 128 experts, native multimodal, text generation surpasses GPT-4.1 |
| Qwen3-235B-A22B | 235B | 22B | Hybrid reasoning, Apache 2.0, tops open-source leaderboard |
| Qwen3-30B-A3B | 30B | 3B | Lightweight MoE, runs on a single GPU |
| DeepSeek-V3 | 671B | 37B | MoE architecture, $5.57M training cost, best price-performance |
| DeepSeek-V3-0324 | 685B | 37B | Minor update, major coding improvement |
| Gemma 4-26B | 26B | 4B | Apache 2.0, native video/image, 256K context |
| Llama 4 Scout | 109B | 17B | 16 experts, 10M token ultra-long context |

💡 Impact on Agents: MoE makes "large model capability + small model cost" a reality. The biggest change in early 2026 is Kimi K2.6 open-sourcing at trillion-parameter scale with 300 sub-agents running in parallel, pushing MoE-based Agent capability to new heights. Qwen3.6-35B-A3B achieves extreme efficiency with only 3B active parameters. DeepSeek-V3-0324 significantly enhances coding and tool-calling capability. These advances mean Agent operating costs are falling rapidly.
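
A rough back-of-the-envelope check shows why this works: per-token compute scales with active rather than total parameters. The helper below is an illustrative approximation (it ignores routing and memory overhead) using figures from the table above:

```python
def moe_compute_fraction(total_params_b: float, active_params_b: float) -> float:
    """Rough fraction of an equally sized dense model's per-token FLOPs
    that an MoE spends, under the approximation FLOPs/token ~ 2 x active params."""
    return active_params_b / total_params_b

# Figures from the table above
print(f"Kimi K2:     {moe_compute_fraction(1000, 32):.1%}")  # 3.2%
print(f"DeepSeek-V3: {moe_compute_fraction(671, 37):.1%}")   # 5.5%
```

In other words, a 1T-parameter MoE with 32B active parameters costs roughly as much to serve as a 32B dense model, not a 1T one.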

Trend 3: The Full Rise of the Open-Source Ecosystem

In 2025–2026, open-source models are no longer merely "catching up" with closed-source ones: they now compete on even terms, and in specific areas they have even surpassed closed-source models:

Tier 1 (Competing with GPT-5.4 / Claude Opus 4.7):

  • Kimi K2.6 (Moonshot AI, 2026.04): 1T params open-source MoE, 13-hour continuous coding, 300 sub-agents parallel, SWE-bench Pro 58.6%, API price only 1/8 of Opus 4.6
  • Kimi K2 (Moonshot AI, 2025.07): 1T total/32B active MoE, MuonClip optimizer doubles training efficiency, open-source Agent SOTA, compatible with OpenAI/Anthropic API
  • Qwen3-235B-A22B (Alibaba, 2025.04): 235B MoE hybrid reasoning, surpasses DeepSeek-R1 and o1, Apache 2.0
  • DeepSeek-V3-0324 (DeepSeek, 2025.03): 685B MoE, coding surpasses Claude 3.7, more permissive open-source license
  • Llama 4 Maverick (Meta, 2025.04): ~400B MoE multimodal, text generation surpasses GPT-4.1

Tier 2 (Lightweight and Efficient, single-GPU capable):

  • Qwen3.6-35B-A3B (Alibaba, 2026.04): 35B total/3B active, lightweight MoE, extreme efficiency
  • Qwen3.6-Plus / Flash / Max (Alibaba, 2026.04): Qwen3 rapid iteration, covering different performance tiers
  • Gemma 4-31B (Google, 2026.04): Dense model, top-3 open-source on Arena Elo, Apache 2.0, native video/image multimodal
  • Llama 4 Scout (Meta, 17B active/109B total): 10M context window, runs on a single H100
  • Phi-4 (Microsoft, 14B): The ceiling for small-size models, reasoning surpasses many 70B models
  • Phi-4-multimodal (Microsoft, 5.6B): Unified architecture for speech + vision + text
  • Gemma 4-E2B/E4B (Google, 2026.04): 2.3B/4.5B, phone/edge devices, native audio/video, Apache 2.0
  • Qwen3 series (Alibaba, 0.6B~235B): Full coverage from phones to servers, Apache 2.0

Open-source vs. Closed-source Decision Matrix:

| Dimension | Closed-source | Open-source |
| --- | --- | --- |
| Peak Capability | Still has an edge (GPT-5.4, Claude Opus 4.7) | Catching up rapidly; Kimi K2.6/Qwen3.6 surpass it in specific areas such as coding |
| Cost | Pay-per-use API | Near-zero marginal cost after self-deployment |
| Privacy | Data sent to a third party | Data stays completely private |
| Customization | Limited (fine-tuning API) | Fully controllable (LoRA/full fine-tuning) |
| Latency | Affected by network | Controllable with local deployment |
| Agent Capability | Mature, stable tool calling | Kimi K2.6 and Qwen3.6 natively support Agents; K2.6 runs 300 sub-agents in parallel |
| Best For | Rapid prototyping, general tasks | Production deployment, data-sensitive scenarios |

Trend 4: The Rise of Agent-Native Models

The most notable new trend in 2025–2026 is that models are now being optimized specifically for Agent scenarios.

  • Claude Opus 4.7 (2026.04): SWE-bench Verified #1, visual capability tops charts, Claude Code fully upgraded, production-ready foundation for RPA and automated testing
  • Kimi K2.6 (2026.04): 1T params open-source, 300 sub-agents parallel, continuous 5-day operation for complex DevOps, SWE-bench Pro 58.6%, API price only 1/8 of Opus 4.6
  • GPT-5.4 (2026.03): First to unify reasoning+coding+Computer Use+deep search in a single model, natively controls browsers and OS, Agent tool-call token cost cut in half
  • Kimi K2: Trillion-parameter open-source MoE, Agent capability reaches open-source SOTA on multiple benchmarks, focused on Agent-specific pre-training and post-training, compatible with Claude Code and other mainstream Agent frameworks
  • DeepSeek-V3-0324: Significantly enhanced coding and tool-calling capability, more permissive open-source license, suitable for Agent production deployment
  • GPT-5: Unified system architecture, built-in reasoning routing, stable Agent tool calling, supports Computer Use
  • Claude Opus 4.6: 1M context (Beta), handles massive codebases, autonomously discovers zero-day vulnerabilities, enterprise-grade Agent workflows
  • Claude Opus 4: Continuous autonomous operation for 7 hours, SWE-bench 72.5%, new Agent coding benchmark
  • Qwen3-235B-A22B: Deeply adapted to Agent frameworks, dramatically improved tool call accuracy, hybrid reasoning auto-switches fast/slow thinking
  • Llama 4 Scout: 10M token ultra-long context, suitable for Agent tasks requiring very long documents

This means Agent developers no longer need to "force a fit" — the models themselves are designed for Agents.
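
Concretely, "designed for Agents" mostly means reliable structured tool calls. The dispatch loop below is a minimal sketch: the message payload is a hard-coded stand-in shaped like an OpenAI chat-completions tool call, and `get_weather` is a hypothetical tool:

```python
import json

def dispatch_tool_call(message: dict, registry: dict) -> str:
    """Execute the first tool call found in an assistant message."""
    call = message["tool_calls"][0]["function"]
    fn = registry[call["name"]]           # look up the registered tool
    args = json.loads(call["arguments"])  # models emit arguments as JSON text
    return fn(**args)

registry = {"get_weather": lambda city: f"Sunny in {city}"}
fake_message = {  # stand-in for an actual API response
    "tool_calls": [{
        "function": {"name": "get_weather",
                     "arguments": json.dumps({"city": "Tokyo"})}
    }]
}
print(dispatch_tool_call(fake_message, registry))  # Sunny in Tokyo
```

What Agent-native models improve is the reliability of the upstream step: emitting well-formed tool names and JSON arguments so that loops like this one rarely need error recovery.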

Multimodal Foundation Models: More Than Just Text

In 2026, foundation models are almost all natively multimodal — supporting mixed input and output of text, images, audio, and video at the architecture level.

```python
# Typical multimodal Agent call
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-5",  # GPT-5 natively supports multimodal input
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's wrong with this architecture diagram? Please provide improvement suggestions."},
            {"type": "image_url", "image_url": {"url": "data:image/png;base64,..."}}
        ]
    }]
)

# GPT-5 can not only "understand" images, but also generate images and hold real-time voice conversations
```
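
The base64 data URL elided in the example is typically built from a local file. A small vendor-neutral helper (the function name is my own, not part of any SDK):

```python
import base64

def image_to_data_url(path: str, mime: str = "image/png") -> str:
    """Read an image file and wrap it as a base64 data URL."""
    with open(path, "rb") as f:
        encoded = base64.b64encode(f.read()).decode("ascii")
    return f"data:{mime};base64,{encoded}"
```

The returned string can be passed directly as the `image_url.url` field in the message content above.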

Mainstream Multimodal Model Comparison:

| Model | Release | Input Modalities | Output Modalities | Special Capabilities |
| --- | --- | --- | --- | --- |
| Claude Opus 4.7 | 2026.04 | Text+Image+PDF | Text | SWE-bench Verified #1, image input up to 3.75M pixels, visual capability tops charts |
| GPT-5.4 | 2026.03 | Text+Image+Audio | Text+Image | Computer Use surpasses humans, reasoning+coding+search unified, 1M context |
| GPT-5 | 2025.08 | Text+Image+Audio | Text+Image+Audio | Real-time voice conversation, native image generation, Computer Use |
| Claude Opus 4.6 | 2026.02 | Text+Image+PDF | Text | 1M context (Beta), enterprise-grade Agent workflows |
| Gemini 2.5 Pro | 2025.03 | Text+Image+Video+Audio | Text+Image | Native video understanding, 1M context |
| Kimi K2.6 | 2026.04 | Text | Text | Trillion-param open-source, 300 sub-agents parallel, Agent coding SOTA |
| Kimi K2 | 2025.07 | Text | Text | Trillion-param Agent SOTA, strongest tool calling |
| Llama 4 Maverick | 2025.04 | Text+Image | Text | Open-source multimodal MoE, ~400B total params |
| Phi-4-multimodal | 2025.02 | Text+Image+Speech | Text | Only 5.6B params, unified multimodal architecture |

The Rise of Small Models: SLM and Edge Deployment

The progress of Small Language Models (SLMs) is remarkable: 14B-parameter models from 2025 match 2023's GPT-4 on knowledge benchmarks and clearly surpass it on coding and math.

```python
# Impressive small-model performance (2025–2026 benchmark data)
slm_benchmarks = {
    "Phi-4 (14B)":             {"MMLU": 84.8, "HumanEval": 82.6, "GSM8K": 94.5},
    "Phi-4-reasoning (14B)":   {"MMLU": 86.2, "HumanEval": 85.1, "GSM8K": 95.8},
    "Qwen3 (8B)":              {"MMLU": 81.2, "HumanEval": 79.8, "GSM8K": 91.3},
    "Llama 4 Scout (17B act)": {"MMLU": 83.5, "HumanEval": 80.1, "GSM8K": 92.1},
    "Phi-4-mini (3.8B)":       {"MMLU": 72.1, "HumanEval": 68.5, "GSM8K": 84.2},
    # Comparison: GPT-4 from 2023 (~1.7T params, estimated)
    "GPT-4 (2023)":            {"MMLU": 86.4, "HumanEval": 67.0, "GSM8K": 92.0},
}

# Phi-4-reasoning (14B) has comprehensively surpassed GPT-4 (2023) in coding and math
# Phi-4-mini (3.8B) can even run on a phone and still do function calling
# This means: Agents don't necessarily need the largest model
```

💡 Impact on Agents: SLMs allow Agents to run locally on phones, laptops, and edge devices, enabling zero-latency, fully private interactions. Apple Intelligence, Google's Gemini Nano, and Microsoft's Phi-4-mini are all products of this trend. Phi-4-multimodal handles speech, vision, and text simultaneously with just 5.6B parameters, opening the door for edge-side multimodal Agents.

Model Selection Guide for Agent Developers

With so many model choices, how do you pick the right foundation model for your Agent?

```python
def select_model(requirements: dict) -> str:
    """Agent model selection decision function (April 2026 edition)"""

    budget = requirements.get("monthly_budget_usd", 100)
    task_type = requirements.get("task_type", "general")
    privacy = requirements.get("privacy_required", False)
    latency_ms = requirements.get("max_latency_ms", 5000)
    reasoning = requirements.get("complex_reasoning", False)
    agent_native = requirements.get("agent_native", False)

    # Decision tree
    if privacy:
        if reasoning:
            return "Kimi K2.6 (self-hosted)"  # Open-source + Agent + best price-performance
        elif latency_ms < 500:
            return "Phi-4 / Qwen3-8B (local deployment)"  # Edge SLM
        else:
            return "Qwen3-235B / Llama 4 Maverick (self-hosted)"  # Open-source general

    if agent_native:
        if budget > 500:
            return "Claude Opus 4.7 / GPT-5.4"  # Top-tier Agent experience
        else:
            return "Kimi K2.6 API / DeepSeek-V3 API"  # Value Agent (K2.6 is 1/8 the price of Opus 4.6)

    if reasoning:
        if budget > 500:
            return "Claude Opus 4.7 / GPT-5.4"  # Top-tier reasoning
        else:
            return "DeepSeek-V3 API / o4-mini"  # Value reasoning

    if budget < 50:
        return "DeepSeek-V3 API / GPT-4o-mini"  # Extreme value

    return "GPT-5.4 / Claude Sonnet 4.6"  # Balanced general choice
```

Recommended models by Agent scenario:

| Agent Scenario | Recommended Model | Reason |
| --- | --- | --- |
| Coding assistant | Claude Opus 4.7 / Kimi K2.6 | SWE-bench dual #1; K2.6 extreme price-performance (1/8 of Opus 4.6) |
| Data analysis | GPT-5.4 / Gemini 2.5 Pro | Multimodal understanding + stable function calling |
| Customer service | GPT-4.1-mini / Qwen3-8B | Cost-sensitive, high response-speed requirement |
| Deep research | Claude Opus 4.6 / GPT-5.4 | 1M context + deep reasoning |
| Document processing | Gemini 2.5 Pro / Claude Opus 4.6 | 1M ultra-long document input, PDF layout understanding |
| Local privacy | Kimi K2.6 / Qwen3-235B (self-hosted) | Data stays local, complete Agent capability, K2.6 is open-source |
| Edge deployment | Phi-4-mini (3.8B) / Qwen3-4B | Runs on a phone/laptop |
| Multimodal Agent | GPT-5.4 / Gemini 2.5 Pro | Computer Use surpasses humans, native multimodal + visual understanding |
| RPA / automated testing | Claude Opus 4.7 / GPT-5.4 | Visual capability tops charts, ScreenSpot-Pro/OSWorld all #1 |

2025–2026 Key Model Release Timeline

```text
2024.09  OpenAI o1 ──── The year of reasoning models
2024.12  Phi-4 (14B) ── Microsoft releases strongest small model
2025.01  DeepSeek-R1 ── Open-source reasoning model ignites the world
2025.02  Phi-4-multimodal / Phi-4-mini ── Edge multimodal
2025.03  Gemini 2.5 Pro ── 1M context + reasoning, tops leaderboards
2025.03  DeepSeek-V3-0324 ── Major coding improvement, more permissive license
2025.04  Llama 4 Scout/Maverick ── Meta's first MoE open-source multimodal
2025.04  o3 / o4-mini ── OpenAI multimodal reasoning
2025.04  Qwen3 ── Alibaba hybrid reasoning full series (0.6B~235B)
2025.05  Claude Opus 4 / Sonnet 4 ── 7-hour continuous coding, new Agent benchmark
2025.07  Kimi K2 ── Moonshot AI trillion-parameter open-source MoE, MuonClip optimizer
2025.08  GPT-5 ── OpenAI unified architecture, built-in reasoning routing, SWE-bench 75%
━━━━━━━━━━━━━━━━━━━━━━━━ 2026 ━━━━━━━━━━━━━━━━━━━━━━━━
2026.02  Claude Opus 4.6 ── 1M context (Beta), SWE-bench 80.8%, enterprise Agent
2026.03  GPT-5.4 ── Reasoning+coding+Computer Use+search unified, 1M context, 3 variants
2026.04  Gemma 4 (E2B/E4B/26B/31B) ── Google open-source, native video/audio, Apache 2.0
2026.04  Claude Opus 4.7 ── SWE-bench Verified #1, visual capability tops charts, Claude Code fully upgraded
2026.04  Kimi K2.6 ── Moonshot AI open-source, 13-hour coding, 300 sub-agents parallel, SWE-bench Pro 58.6%
2026.04  Qwen3.6 series ── Alibaba rapid iteration (35B-A3B/Flash/Plus/Max), full tier coverage
```

Outlook: What's Next for Foundation Models

Several development directions worth watching:

  1. Reasoning Built-in: Reasoning capability moves from standalone o-series models into general-purpose models (GPT-5.4 Thinking mode, Qwen3 hybrid reasoning) — developers no longer need to manually choose between "reasoning model" and "general model"
  2. Computer Use Maturity: GPT-5.4 and Claude Opus 4.7 surpass human-level performance on ScreenSpot-Pro and OSWorld — Agents can now natively control browsers and operating systems, bringing RPA into production-ready territory
  3. Agent Clustering: Models evolve from "single-task execution" to "large-scale autonomous collaboration" — Kimi K2.6's 300 sub-agents running in parallel for 5 continuous days is a landmark milestone
  4. MoE Efficiency Revolution: Kimi K2.6/Qwen3.6 open-source at trillion-parameter scale with only 3B~32B active parameters — Agent operating costs fall dramatically; K2.6 API is only 1/8 the price of Opus 4.6
  5. Open-Source Full Rise: Kimi K2.6/Qwen3.6/Gemma 4 form a complete ecosystem — private Agent deployment matures; data security is no longer a bottleneck
  6. World Models: From language models to world models — understanding physical laws and causal relationships, not just text patterns
  7. Continual Learning and Personalization: Models continuously learn from post-deployment interactions; each Agent has unique "experience"
  8. Native Multimodal: Text → vision + speech + video full modality — Agents can "see," "hear," and "draw"

Section Summary

| Trend | Core Change | Impact on Agent Development |
| --- | --- | --- |
| Reasoning built-in | GPT-5.4 Thinking mode, Qwen3 hybrid fast/slow thinking | Qualitative leap in Agent complex planning; no need to manually choose a reasoning model |
| Computer Use maturity | GPT-5.4/Claude Opus 4.7 surpass human level | Agents directly control browsers and OS; RPA enters the production-ready stage |
| Agent clustering | Kimi K2.6's 300 sub-agents in parallel, continuous 5-day operation | Agents evolve from single-task execution to large-scale autonomous collaboration |
| MoE efficiency revolution | Kimi K2.6/Qwen3.6 open-source at trillion-param scale, only 3B~32B active | Agent operating costs fall dramatically; K2.6 API only 1/8 the price of Opus 4.6 |
| Open-source full rise | Kimi K2.6/Qwen3.6/Gemma 4 form a complete ecosystem | Private Agent deployment matures; data security no longer a bottleneck |
| Agent-Native | Models optimized specifically for Agent scenarios (tool calling/long tasks) | Developers no longer need to "force a fit"; models are Agent-ready |
| Native multimodal | Text → vision + speech + video, full modality | Agents can "see," "hear," and "draw"; more natural interaction |
| Small model progress | 3.8B params run on phones; 14B surpasses GPT-4 | Agents can run on edge devices; zero latency, complete privacy |

Note: Model technology evolves extremely fast. The data in this section is current as of April 23, 2026; Claude Opus 4.7 and Kimi K2.6 were released only that month, and the industry landscape is still shifting rapidly. It is recommended to regularly follow vendor release announcements and authoritative benchmark evaluations (such as LMArena/Chatbot Arena and the Open LLM Leaderboard) for the latest information.


Next section: 3.7 Foundation Model Architecture Explained