Chapter 11: Agentic-RL: Agent Reinforcement Learning Training

📖 "If Prompt Engineering is writing an 'instruction manual' for an Agent, then Agentic-RL is letting the Agent figure out the optimal solution through repeated practice."

Chapter Overview

In previous chapters, we built Agents using prompts + tool calling — all of the Agent's capabilities came from the base model's pre-training knowledge plus carefully designed prompts. This approach is simple and flexible, but has a fundamental bottleneck:

The Agent's capability ceiling = the base model's general capability ceiling.

Agentic-RL (Agentic Reinforcement Learning) provides an alternative path: training models through reinforcement learning to autonomously learn optimal strategies for completing Agent tasks. Works such as DeepSeek-R1 [1] and DeepSWE [2] have demonstrated that RL-trained models can develop reasoning strategies that never appeared in the training data, significantly surpassing pure prompt approaches in reasoning and tool use capabilities.

What You'll Learn

| Section | Content | Key Takeaway |
|---|---|---|
| 11.1 | What Is Agentic-RL | Understand the essential difference between Agentic-RL and traditional post-training; model Agent tasks in the MDP framework |
| 11.2 | SFT + LoRA Basic Training | Master the formal principles of supervised fine-tuning and LoRA parameter-efficient training |
| 11.3 | PPO: Proximal Policy Optimization | Starting from policy gradients, systematically understand importance sampling, advantage functions, GAE, and the clip mechanism |
| 11.4 | DPO: Direct Preference Optimization | Master the complete mathematical derivation from RLHF to DPO; understand the implicit reward concept |
| 11.5 | GRPO/GSPO + Reward Function Design | Understand how intra-group comparison replaces the Critic; master GSPO's sequence-level optimization improvements; design multi-dimensional reward functions and defend against reward hacking |
| 11.6 | Practice: Complete Training Pipeline | Complete a full Agentic-RL training run on GSM8K, from data preparation to model deployment |
| 11.7 | Latest Research Progress (2025–2026) | Survey frontier work including DeepSeek-R1, DAPO, VAPO, and SAR; stay current with the field |
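As a preview of the intra-group comparison idea from Section 11.5: GRPO drops the learned value-function Critic and instead normalizes each sampled response's reward against the statistics of its own group. A minimal sketch of that normalization (the function name and the example rewards are illustrative, not from any library):

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: score each sampled response
    against the group's mean and standard deviation, replacing
    a learned Critic with simple intra-group comparison."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero std
    return [(r - mean) / std for r in rewards]

# Four responses sampled for one prompt, scored 1.0 (correct) or 0.0 (wrong)
print(grpo_advantages([1.0, 0.0, 1.0, 0.0]))  # [1.0, -1.0, 1.0, -1.0]
```

Correct responses get a positive advantage and incorrect ones a negative advantage, purely from relative standing within the group; the advantages of any group sum to zero by construction.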

Prerequisites

  • Understanding of basic LLM working principles (Chapter 3)
  • Familiarity with Python and PyTorch basics
  • Basic concepts in machine learning / deep learning

🔗 Learning Path

  • Prerequisites: Chapter 3: LLM Fundamentals
  • Recommended but not required: Chapter 6: Planning and Reasoning; Appendix E: KL Divergence Explained

Recommended Next:


References

[1] DeepSeek AI. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning[R]. arXiv preprint arXiv:2501.12948, 2025.

[2] DeepSWE: An open agentic SWE model that matches the performance of closed-source models[R]. 2025.