Chapter 11: Agentic-RL: Agent Reinforcement Learning Training

📖 "If Prompt Engineering is writing an 'instruction manual' for an Agent, then Agentic-RL is letting the Agent figure out the optimal solution through repeated practice."

Chapter Overview

In previous chapters, we built Agents using prompts + tool calling — all of the Agent's capabilities came from the base model's pre-training knowledge plus carefully designed prompts. This approach is simple and flexible, but has a fundamental bottleneck:

The Agent's capability ceiling = the base model's general capability ceiling.

Agentic-RL (Agentic Reinforcement Learning) provides an alternative path: training models through reinforcement learning to autonomously learn optimal strategies for completing Agent tasks. Works such as DeepSeek-R1 [1] and DeepSWE [2] have demonstrated that RL-trained models can develop reasoning strategies that never appeared in the training data, significantly surpassing pure prompt approaches in reasoning and tool use capabilities.

What You'll Learn

| Section | Content | Key Takeaway |
|---|---|---|
| 11.1 | What Is Agentic-RL | Understand the essential difference between Agentic-RL and traditional post-training; model Agent tasks in the MDP framework |
| 11.2 | SFT + LoRA Basic Training | Master the formal principles of supervised fine-tuning and LoRA parameter-efficient training |
| 11.3 | PPO: Proximal Policy Optimization | Starting from policy gradients, systematically understand importance sampling, advantage functions, GAE, and the clip mechanism |
| 11.4 | DPO: Direct Preference Optimization | Master the complete mathematical derivation from RLHF to DPO; understand the implicit reward concept |
| 11.5 | GRPO/GSPO + Reward Function Design | Understand how intra-group comparison replaces the Critic; master GSPO's sequence-level optimization improvements; design multi-dimensional reward functions and defend against reward hacking |
| 11.6 | Practice: Complete Training Pipeline | Complete a full Agentic-RL training run on GSM8K, from data preparation to model deployment |
| 11.7 | Latest Research Progress (2025–2026) | Survey frontier work including DeepSeek-R1, DAPO, VAPO, and SAR; stay current with the field |
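As a preview of the intra-group comparison idea from Section 11.5: GRPO drops the learned value-function Critic and instead normalizes each sampled response's reward against the statistics of its own group. A minimal sketch of that normalization (the function name and the example rewards are illustrative, not from any library):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: score each sampled response
    against the group's mean and standard deviation, replacing
    a learned Critic with simple intra-group comparison."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Four responses sampled for one prompt, scored 1.0 (correct) or 0.0 (wrong)
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Correct responses get a positive advantage and incorrect ones a negative advantage, purely from relative standing within the group; the advantages of any group sum to zero by construction.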

Prerequisites

  • Understanding of basic LLM working principles (Chapter 3)
  • Familiarity with Python and PyTorch basics
  • Basic concepts in machine learning / deep learning

🔗 Learning Path

  • Prerequisites: Chapter 3: LLM Fundamentals
  • Recommended but not required: Chapter 6: Planning and Reasoning; Appendix E: KL Divergence Explained

Recommended Next:


References

[1] DeepSeek AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning[R]. arXiv preprint arXiv:2501.12948, 2025.

[2] DeepSWE: An open agentic SWE model that matches the performance of closed-source models[R]. 2025.