11.7 Latest Research Advances (2025–2026)

📖 "From DeepSeek-R1 making the cover of Nature to DAPO/VAPO setting new reasoning benchmarks, Agentic-RL is moving from the laboratory to engineering practice at an astonishing pace. This section will give you a panoramic view of the cutting-edge research in this field."

⏰ Timeliness Note: The content of this section is current as of March 20, 2026. Because this field moves extremely fast, readers are advised to follow open-source collections such as Awesome-RL-Reasoning-Recipes for the latest developments.

Agentic-RL Frontier Research Panorama

7.1 Overview: The Paradigm Shift from RLHF to Reasoning RL

The past two years (2025–2026) have seen explosive progress in reinforcement learning for large models. Marked by DeepSeek-R1 making the cover of Nature, RL training of LLMs has moved from an auxiliary role in aligning models with human preferences (RLHF) to the core technology for stimulating model reasoning capabilities. The following timeline gives an overview of the key milestones:

2024.09  OpenAI o1 released, first demonstrates potential of "test-time compute scaling"
2025.01  DeepSeek-R1 released, pure RL training stimulates autonomous reasoning, uses GRPO algorithm
2025.01  Kimi k1.5 released, 128K long-context RL training, Long2Short distillation technique
2025.03  QwQ-32B released, demonstrates reasoning RL training effects at medium scale
2025.03  DAPO open-sourced, proposes reproducible large-scale RL training solution
2025.04  VAPO released, value-augmented PPO framework, AIME 2024 reaches 60.4
2025.04  OpenAI o3 released, reasoning capability further leaps
2025.07  GSPO proposed (Qwen team), sequence-level policy optimization stabilizes MoE training, trains Qwen3
2025.08  Self-Aligned Reward (SAR) proposed, uses perplexity signals to address overthinking
2025.10  PURE framework released, min-form credit assignment solves reward hacking
2025.12  Co-rewarding (ICLR 2026) proposes self-supervised RL learning scheme
2026.01  RLVR new paradigm: efficient RL method based on problem decomposition
2026.02  DRQA dynamic reasoning quota allocation, token cost reduced by 31%
2026.03  CoRLHF proposes cooperative policy-reward joint optimization

These works can be summarized into the following core research directions:

Direction	Representative Works	Core Problem
Reasoning model training	DeepSeek-R1, Kimi k1.5, QwQ	How to stimulate LLM reasoning capability through RL?
RL algorithm improvements	DAPO, VAPO, GSPO, GRPO variants	How to make large model RL training more stable and efficient?
Reward design and feedback	SAR, Co-rewarding, CoRLHF	How to design better reward signals?
Overthinking and efficiency	PURE, DRQA, DEER	How to make models reason "just right"?
Agentic task RL	AgentPRM, R³L, DeepSWE	How to extend RL to Agent tasks like tool calling?

Let's dive into the important papers in each direction one by one.

7.2 Reasoning Models: Pure RL Training Stimulates Autonomous Reasoning

7.2.1 DeepSeek-R1: The Nature Cover Breakthrough

Paper: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning (Nature, 2025) [1]

DeepSeek-R1 is the landmark work in this field. Its core finding is:

Through RL training alone (without manually annotated reasoning chains), models can autonomously develop advanced cognitive capabilities such as multi-step reasoning, self-reflection, and dynamic strategy adjustment.

Core Technical Points

GRPO Algorithm: Uses Group Relative Policy Optimization (see Section 11.5), which optimizes the policy through competition among responses sampled for the same question and avoids an expensive Critic network. The reported total training cost is approximately $294,000.
Multi-phase Training Framework:
- R1-Zero Phase: Uses only result correctness as reward (Verifiable Reward RL, RLVR), with no SFT data. The model spontaneously exhibits "Aha moments": it learns to self-reflect and correct errors during the reasoning process.
- R1 Phase: Building on R1-Zero, incorporates a small amount of high-quality SFT data and human preference alignment to improve comprehensive capabilities.
Verifiable Rewards (RLVR): Reward signals come from automatically verifiable tasks (such as final answers to math problems), no manual annotation needed.
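The within-group competition behind GRPO can be sketched in a few lines. This is a minimal illustration, not DeepSeek's implementation: the function name `group_relative_advantages`, the `eps` constant, and the 0/1 reward values are assumptions made for the example. It shows how each response's reward is compared against the other responses sampled for the same question, so no Critic network is needed.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against its own sampling group (GRPO-style).

    `eps` is an assumed small constant that avoids division by zero
    when every response in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four sampled responses to one question. The reward is verifiable:
# 1.0 if the final answer matched the reference, 0.0 otherwise.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group where every response is correct carries no learning signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Correct responses end up with positive advantages and incorrect ones with negative advantages, while a group with identical rewards yields all zeros.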

Key Experimental Results

Achieves SOTA on 21 benchmarks including MMLU, AIME 2024, LiveCodeBench
R1-Zero demonstrates the possibility of "learning to reason from scratch" — reasoning chain length spontaneously grows during RL training
Maintains strong reasoning capability after distillation to 7B/14B small models

Why Is It Important?

DeepSeek-R1 proves two key arguments:

RL can stimulate latent reasoning capabilities from pre-training — these capabilities are difficult to fully release through SFT or prompt engineering
Reasoning capabilities can "emerge" in a pure RL environment — without relying on manually annotated reasoning chains as demonstrations

7.2.2 Kimi k1.5: Breakthrough in Long-Context RL

Paper: Kimi k1.5: Scaling Reinforcement Learning with LLMs (2025) [2]

Kimi k1.5, developed by the Moonshot AI team, makes unique contributions in several areas:

Core Innovations

128K Long-Context RL Training: Extends the RL training context window from the traditional 4K-8K to 128K tokens, improving training efficiency through Partial Rollout Reuse.
Simplified RL Framework: Abandons Monte Carlo Tree Search (MCTS) and value functions, directly optimizes the model through improved Online Mirror Descent, greatly reducing computational burden.
Long2Short Distillation Technique: "Compresses" long-context reasoning capability into short-context models. Specifically:
- First train strong reasoning capability in long-context settings
- Then through knowledge distillation, teach short-context models to "refine" reasoning

Key Results

Its short-CoT variant outperforms GPT-4o by a large margin (up to 550%) on short-reasoning benchmarks such as LiveCodeBench
Long2Short technique proves that long-chain reasoning capability can be compressed without significant loss
First demonstrates feasibility of RL training with 128K context window

7.2.3 QwQ-32B: Reasoning RL at Medium Scale

Paper: QwQ: Reflect and Question to Understand the World (Alibaba, 2025) [3]

QwQ-32B is a medium-scale reasoning model released by Alibaba's Tongyi team. Its significance lies in proving that models at the 32B parameter scale can also obtain strong reasoning capabilities through RL training.

Technical Characteristics

Based on Qwen2.5-32B for RL training
Approaches DeepSeek-R1's performance on mathematical reasoning tasks
Training cost far lower than 671B-scale models

Why Is It Important?

QwQ proves that reasoning RL is not "exclusive to large models" — medium-scale models can also achieve significant reasoning capability improvements through appropriate RL training. This has great practical value for resource-constrained teams and edge deployment scenarios.

7.2.4 OpenAI o1/o3: Test-Time Compute Scaling

Models: OpenAI o1 (2024.09) / OpenAI o3 (2025.04) [4]

Although OpenAI has not published complete technical reports, the o1 and o3 series models have had a profound impact on the industry:

Core Concept: Test-Time Compute Scaling

Traditional Scaling Laws focus on training-time compute scaling (larger models + more data). The o1/o3 series proposes another dimension:

Investing more computation at inference time (longer thinking chains, more search/verification) can also continuously improve model capabilities.

This means there are two complementary scaling paths:

Training-time scaling: increase model size, increase data
Inference-time scaling: increase reasoning steps, verification loops
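One simple, publicly known form of inference-time scaling is to sample several reasoning paths and take a majority vote over their final answers. The sketch below only illustrates the idea that more inference compute can change the result; it is not a description of how o1/o3 work internally, which OpenAI has not published. The sample answers are made up for the example.

```python
from collections import Counter

def majority_vote(answers):
    """Return the most common final answer among sampled reasoning paths."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical final answers from eight independent samples of one question.
samples = ["15", "12", "12", "9", "12", "15", "12", "12"]

# Spending more inference compute means voting over more samples.
results = {n: majority_vote(samples[:n]) for n in (1, 4, 8)}
```

With a single sample the answer is "15"; with four or eight samples the vote settles on "12".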

Impact on the Field

Spawned the new category of "reasoning models"
Drove the development of RL algorithms for reasoning tasks like GRPO, DAPO, VAPO
Triggered attention to "reasoning efficiency" — the Overthinking problem emerged

7.3 RL Algorithm Improvements: Making Large Model RL Training More Stable and Efficient

7.3.1 DAPO: Large-Scale Reproducible RL Training

Paper: DAPO: An Open-Source LLM Reinforcement Learning System at Scale (2025) [5]

DAPO (Decoupled Clip and Dynamic Sampling Policy Optimization), proposed by ByteDance's Seed team, aims to solve the reproducibility problem of large-scale RL training.

Core Techniques

Decoupled Clipping: Traditional PPO uses a symmetric clipping range $\epsilon$; DAPO separates the upper and lower clipping boundaries:
- $\epsilon_{\text{high}}$ (larger): encourages exploration of good responses
- $\epsilon_{\text{low}}$ (smaller): strictly suppresses bad responses

This asymmetric design lets the model "boldly explore good behavior" while "conservatively suppressing bad behavior."
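Dynamic Sampling: Filters the training batch by within-group outcomes instead of training on every sampled prompt:
- Prompts whose sampled responses are all correct or all wrong produce zero group-relative advantage and therefore no gradient, so they are discarded
- Sampling continues until the batch is filled with prompts that have mixed outcomes
Token-Level Policy Gradient Loss: Averages the loss over all tokens in the batch rather than first averaging within each response, so that long responses are not under-weighted. DAPO also removes the KL penalty term rather than tightening it.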
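The decoupled clipping rule described in this subsection can be sketched for a single token as follows. This is a minimal illustration, not the DAPO codebase: the function name and the default values 0.2 and 0.28 are assumptions chosen for the example.

```python
def decoupled_clip_objective(ratio, advantage, eps_low=0.2, eps_high=0.28):
    """Per-token clipped surrogate with separate lower and upper bounds.

    `ratio` is pi_new(token) / pi_old(token); the default epsilons are
    illustrative values, not settings taken from the DAPO code.
    """
    clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    # PPO-style pessimistic bound: take the smaller of the two surrogates.
    return min(ratio * advantage, clipped * advantage)

# Good token (positive advantage): the ratio may grow up to 1 + eps_high.
up_inside = decoupled_clip_objective(1.25, 1.0)
up_capped = decoupled_clip_objective(1.50, 1.0)

# Bad token (negative advantage): the ratio is floored at 1 - eps_low.
down_capped = decoupled_clip_objective(0.50, -1.0)
```

With a symmetric range of 0.2, the first call would already be clipped at 1.2; the wider upper bound leaves it untouched.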

Open-Source Contribution

DAPO fully open-sources training code and datasets (based on Qwen2.5-32B), making it one of the most reproducible large-scale RL training solutions currently available.

7.3.2 VAPO: Value-Augmented PPO

Paper: VAPO: Efficient and Reliable RL Framework for Advanced Reasoning Tasks (ByteDance Seed, 2025) [6]

VAPO (Value-based Augmented PPO) is a follow-up to DAPO, specifically targeting challenges in long-chain reasoning tasks.

Core Problems

In long-chain reasoning (such as mathematical proofs, complex programming), RL training faces three major challenges:

Value model bias: Critic network's value estimation for long sequences is inaccurate
Heterogeneous sequence lengths: Response lengths within the same batch vary greatly
Sparse rewards: Only the final answer has a reward signal

Core Techniques

Value Pretraining: Uses Monte Carlo returns to pretrain the Critic network, reducing initialization bias.
Decoupled GAE:
- Uses $\lambda_V = 1.0$ for the value network (low bias, high variance)
- Uses $\lambda_P = 0.95$ for the policy network (balanced bias and variance)
Length-Adaptive GAE: Dynamically adjusts $\lambda$ based on sequence length:

$$\lambda = 1 - \frac{1}{0.05 \cdot l}$$

where $l$ is the sequence length. Long sequences use a larger $\lambda$ (reducing bias); short sequences use a smaller $\lambda$ (reducing variance).
Clip-Higher Exploration: Uses asymmetric clipping with $\epsilon_{\text{high}} = 0.28$ and $\epsilon_{\text{low}} = 0.2$, encouraging diverse sampling.
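To make the length-adaptive rule $\lambda = 1 - 1/(0.05 \cdot l)$ concrete, the sketch below computes $\lambda$ for a given response length and plugs it into a plain GAE recursion. It is an illustration under stated assumptions, not VAPO's code: the function names, the clamp to zero for very short sequences (where the formula would go negative), and the toy rewards and values are all assumptions.

```python
def length_adaptive_lambda(seq_len, alpha=0.05):
    """lambda = 1 - 1 / (alpha * l).

    Clamping at 0 is an assumption of this sketch: for l < 1 / alpha
    the raw formula would produce a negative lambda.
    """
    return max(0.0, 1.0 - 1.0 / (alpha * seq_len))

def gae(rewards, values, lam, gamma=1.0):
    """Generalized Advantage Estimation over one response.

    `values[t]` is the value estimate before token t; the value after
    the final token is taken to be 0.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = values[t + 1] if t + 1 < len(values) else 0.0
        delta = rewards[t] + gamma * next_value - values[t]
        running = delta + gamma * lam * running
        advantages[t] = running
    return advantages

lam_short = length_adaptive_lambda(100)    # shorter response, smaller lambda
lam_long = length_adaptive_lambda(1000)    # longer response, larger lambda

# Sparse reward: only the last token is rewarded.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.5, 0.5]
adv_mc = gae(rewards, values, lam=1.0)     # Monte Carlo-like credit
adv_td = gae(rewards, values, lam=0.0)     # one-step TD credit
```

With `lam=1.0` the final reward is propagated to every token, while `lam=0.0` credits only the last token, which is why longer responses benefit from a larger $\lambda$.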

Key Results

Model	AIME 2024	Training Steps	Stability
DeepSeek-R1-Zero-Qwen-32B	47	Many	Occasional collapse
DAPO (32B)	50	Medium	Relatively stable
VAPO (32B)	60.4	~5,000	No collapse

VAPO surpasses both DAPO and the DeepSeek-R1-Zero recipe applied to the same Qwen-32B base model by about 10 points, using roughly 5,000 training steps and with no training collapse reported.

7.3.3 GRPO Variants and Improvements

Since DeepSeek-R1 popularized GRPO (the algorithm itself was introduced in the earlier DeepSeekMath work), multiple papers have improved upon it:

Improvement Direction	Representative Work	Problem Solved
Sequence-level optimization	GSPO [15]	Token-level importance weights introduce high-variance noise, causing MoE model training collapse. GSPO elevates importance sampling to sequence level, trains Qwen3
Remove normalization biases	Dr. GRPO	Original GRPO's per-response length normalization and within-group standard-deviation normalization introduce bias
Adaptive group size	Adaptive GRPO	Fixed group size doesn't suit all problem difficulties
Token-level advantage	Token-level GRPO	Sequence-level advantage is not fine-grained enough for long sequences
Online/offline hybrid	Hybrid GRPO	Pure online sampling is inefficient

Among these, GSPO is the most practically impactful improvement — it has already been used by Alibaba's Qwen team to train the Qwen3 series models. For detailed principles and implementation of GSPO, see Section 11.5's GSPO chapter.
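The difference between token-level and sequence-level importance weights can be sketched as follows. This is a minimal illustration that assumes the sequence-level weight is the length-normalized likelihood ratio, i.e. the geometric mean of the token ratios; the function names and log-probability values are made up for the example.

```python
import math

def token_level_ratios(logp_new, logp_old):
    """One importance weight per token: pi_new(token) / pi_old(token)."""
    return [math.exp(n - o) for n, o in zip(logp_new, logp_old)]

def sequence_level_ratio(logp_new, logp_old):
    """One weight per response: exp(mean(log pi_new - log pi_old))."""
    diffs = [n - o for n, o in zip(logp_new, logp_old)]
    return math.exp(sum(diffs) / len(diffs))

# Hypothetical per-token log-probabilities of one sampled response.
logp_new = [-1.0, -2.0, -0.5]
logp_old = [-1.2, -1.0, -0.6]

per_token = token_level_ratios(logp_new, logp_old)
per_sequence = sequence_level_ratio(logp_new, logp_old)
```

The per-token weights swing between roughly 0.37 and 1.22 within one response, while the single sequence-level weight sits between them and is shared by all tokens, which is the noise reduction the table entry refers to.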

7.4 Reward Design: How to Tell the Model What Good Reasoning Is?

The reward function is the "soul" of RL training. In 2025–2026, several important directions emerged in reward design.

7.4.1 Self-Aligned Reward (SAR): Leveraging Model Internal Signals

Paper: Self-Aligned Reward: Towards Effective and Efficient Reasoners (UIUC & Amazon AWS, 2025) [7]

Core Idea

SAR's core insight is: differences in the model's internal perplexity (PPL) can serve as high-quality reward signals.

Specifically, SAR computes the perplexity difference under two conditions:

$$r_{SAR}(y \mid x) = \frac{\text{PPL}(y) - \text{PPL}(y \mid x)}{\text{PPL}(y)}$$

Where:
- $\text{PPL}(y|x)$: perplexity of generating response $y$ given question $x$
- $\text{PPL}(y)$: perplexity of treating response $y$ as independent text

Intuitive explanation:

High SAR: response highly depends on the question (targeted, concise response)
Low SAR: response weakly associated with the question (possibly verbose, generic content)
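The perplexity-difference reward can be computed directly from per-token log-probabilities. The sketch below follows the formula in this subsection; the function names and the log-probability values are assumptions for illustration, and in practice both sets of log-probabilities would come from scoring the same response with and without the question in the context.

```python
import math

def perplexity(token_logprobs):
    """exp of the mean negative log-likelihood over the response tokens."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def self_aligned_reward(logp_given_question, logp_standalone):
    ppl_cond = perplexity(logp_given_question)  # PPL(y | x)
    ppl_free = perplexity(logp_standalone)      # PPL(y)
    return (ppl_free - ppl_cond) / ppl_free

# Hypothetical scores for a response that depends strongly on the question:
# it is much more predictable once the question is in the context.
targeted = self_aligned_reward([-0.2, -0.1, -0.3], [-1.0, -0.8, -1.2])

# A generic response is equally predictable with or without the question.
generic = self_aligned_reward([-0.9, -1.1, -1.0], [-0.9, -1.1, -1.0])
```

The targeted response scores about 0.55, while the generic one scores 0, matching the high-SAR and low-SAR cases above.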

Why Is It Effective?

No external reward model needed: leverages the model's own language modeling capability
Fine-grained scoring: can distinguish "correct and concise" vs "correct but verbose"
Cross-task generalization: trained on math data, also effective on non-math tasks like logical reasoning

Experimental Results

Across 4 base models and 7 datasets:

Average accuracy improvement of 4%
Output length reduced by 30%

7.4.2 Co-rewarding: Self-Supervised RL Learning

Paper: Co-rewarding: Self-Supervised RL for LLM Reasoning (ICLR 2026) [8]

Core Problem

Self-rewarding RL (letting the model score itself) is prone to training collapse — the model learns to generate responses that are "easy to give itself high scores" rather than "truly good."

Solution

Co-rewarding introduces complementary supervision signals:

Generate paraphrased versions of the same question
Use responses to paraphrased questions as auxiliary evaluation for original question responses
Evaluations in both directions mutually constrain each other, preventing collapse
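One way to picture the two-directional constraint is the sketch below, where each view's responses are scored against the majority answer of the other view. Using majority voting as the agreement signal is an assumption of this example, as are the function names and sample answers; it is meant to show the cross-supervision idea rather than reproduce the paper's exact procedure.

```python
from collections import Counter

def pseudo_label(answers):
    """Majority answer of a group, used as a label for the other view."""
    return Counter(answers).most_common(1)[0][0]

def co_rewards(answers_original, answers_paraphrase):
    """Score each view's answers against the other view's majority answer."""
    label_from_paraphrase = pseudo_label(answers_paraphrase)
    label_from_original = pseudo_label(answers_original)
    rewards_original = [float(a == label_from_paraphrase) for a in answers_original]
    rewards_paraphrase = [float(a == label_from_original) for a in answers_paraphrase]
    return rewards_original, rewards_paraphrase

# Hypothetical final answers sampled for a question and its paraphrase.
r_orig, r_para = co_rewards(["7", "7", "5", "7"], ["7", "5", "7", "7"])
```

Because a response is never graded by its own group's vote, the model cannot raise its reward simply by making one view's outputs agree with themselves.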

Key Results

Performance improvement of 12.9% on reasoning tasks (without ground truth labels)
Training process significantly more stable

7.4.3 CoRLHF: Cooperative Policy-Reward Joint Optimization

Paper: CoRLHF: Reinforcement Learning from Human Feedback with Cooperative Policy-Reward Optimization (Expert Systems with Applications, 2026) [9]

Core Innovation

Traditional RLHF has two steps: first train the reward model, then use the reward model to train the policy. This causes a distribution mismatch problem — the data distribution seen during reward model training is inconsistent with the data distribution generated during policy optimization.

CoRLHF merges policy optimization and reward model optimization into one iterative process:

Policy generates new data
Reward model updates on new data
Policy optimizes on updated rewards
Iterative loop

This approach bridges RLHF and RLAIF, maintaining alignment quality while reducing dependence on human feedback.

7.4.4 Endogenous Reward: LLM as a Built-in Reward Model

Paper: Related work by Zhi-Hua Zhou's team (Nanjing University, 2025) [10]

Disruptive Finding

This research finds: LLM's next-token prediction capability itself contains a general reward function (Endogenous Reward).

That is, the language model distribution learned during pre-training has already implicitly encoded the judgment capability of "what is good output," without needing to additionally train a reward model.

Practical Significance

Reduces one component (reward model) in the RLHF pipeline
Reduces the risk of error accumulation
Surpasses traditional reward models on multiple alignment benchmarks

7.5 Overthinking and Reasoning Efficiency

With the popularization of reasoning models, a new problem has emerged: Overthinking — models generate verbose reasoning chains even for simple problems, wasting computational resources and potentially reducing accuracy.

7.5.1 Problem Analysis: Why Do Reasoning Models "Think Too Much"?

The root of overthinking lies in the reward structure of RLVR (RL with Verifiable Rewards):

As long as the final answer is correct, regardless of how long or redundant the reasoning process is, the model receives the same reward.

This leads to two problems:

Reward inflation: standard RL's summation-form credit assignment makes models prefer generating more steps
Undifferentiated incentives: cannot distinguish "concisely correct" from "verbosely correct"

7.5.2 PURE: Min-Form Credit Assignment

Paper: Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning (2025) [11]

Core Insight

Traditional RL defines trajectory value as the sum of future rewards:

$$V_{sum}(s_t) = \sum_{k=t}^{T} \gamma^{k-t} r_k$$

PURE proposes replacing the sum with the minimum:

$$V_{min}(s_t) = \min(r_t, r_{t+1}, \dots, r_T)$$

Intuition: the strength of a reasoning chain depends on its weakest link.

Method	Training Signal	Consequence
Sum form	"Generate more 'okay' steps to accumulate score"	Verbose, circular reasoning
Min form	"Every step must be correct; one wrong step loses everything"	Concise, precise

Implementation

PURE transforms the process rewards using a temperature parameter $T$ (unrelated to the final step $T$ in the formulas above) so that the summation used by standard RL algorithms (PPO/GRPO) behaves like taking the minimum. The underlying algorithm does not need to be modified; only reward preprocessing is required.
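The contrast between the two credit-assignment forms, and how reward preprocessing can make a plain sum act like a minimum, can be sketched as follows. The softmin-style weighting below is an assumption about what such a temperature-based transform can look like; the function names, the temperature value, and the step rewards are illustrative.

```python
import math

def sum_form_return(step_rewards):
    return sum(step_rewards)

def min_form_return(step_rewards):
    return min(step_rewards)

def softmin_weighted_rewards(step_rewards, temperature=0.1):
    """Reweight step rewards so that their plain sum approximates the minimum.

    Each reward is scaled by a softmax weight over -reward / temperature,
    so the lowest-scoring step dominates as the temperature shrinks.
    """
    weights = [math.exp(-r / temperature) for r in step_rewards]
    total = sum(weights)
    return [w / total * r for w, r in zip(weights, step_rewards)]

# Hypothetical process rewards: three decent steps and one clearly wrong step.
step_rewards = [0.9, 0.8, -0.5, 0.7]

raw_sum = sum_form_return(step_rewards)        # stays positive despite the bad step
weakest = min_form_return(step_rewards)        # determined by the bad step
transformed = softmin_weighted_rewards(step_rewards)
approx_min = sum(transformed)                  # an ordinary sum, close to the minimum
```

Under the sum form the chain still earns a positive return, so adding more "okay" steps pays off; after the transform an unmodified summing algorithm sees a return close to the weakest step.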

Experimental Results

Sum-form training collapses almost immediately
Min-form training improves stably
Sample efficiency improved 2-3×

7.5.3 DRQA: Dynamic Reasoning Quota Allocation

Paper: DRQA: Dynamic Reasoning Quota Allocation for Controlling Overthinking in Reasoning Large Language Models (2026) [12]

Core Observation

An interesting finding: when models batch process multiple questions (rather than processing one by one), total output length significantly shortens — models seem to implicitly distinguish problem difficulty and "compress" reasoning for simple problems.

Method

Build preference data:
- Reasoning chains generated individually (verbose version)
- Reasoning chains generated in batches (refined version)
- Annotate preferences by correctness and conciseness
Use GRPO to train models to simultaneously optimize logical correctness and reasoning conciseness

Results

Reasoning token cost reduced by 31%
Accuracy actually improves
Shortens most for simple problems, maintains sufficient reasoning for complex problems

7.5.4 DEER: Dynamic Early Exit in Reasoning

Paper: Dynamic Early Exit in Reasoning Models (DEER) (2026) [13]

DEER is a training-free inference-time optimization method:

Monitors model confidence in real-time during reasoning
Triggers early exit when model is highly confident about the current answer
Simple problems end quickly, complex problems continue thinking
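The control flow of confidence-triggered early exit can be sketched as a simple loop. This is not the DEER implementation: the function name, the 0.95 threshold, and the pre-computed confidence values are assumptions, and in a real system the confidence would be derived from the model's own token probabilities for a trial answer.

```python
def reason_with_early_exit(steps, confidence_threshold=0.95):
    """Consume reasoning steps until a trial answer is confident enough.

    `steps` yields (thought, trial_answer, confidence) tuples.
    Returns the last trial answer and the number of steps actually used.
    """
    used = 0
    answer = None
    for _thought, trial_answer, confidence in steps:
        used += 1
        answer = trial_answer
        if confidence >= confidence_threshold:
            break  # confident enough: stop thinking and answer now
    return answer, used

# Hypothetical traces: an easy question and a harder one.
easy = [("2 + 2 is 4", "4", 0.99), ("double-check", "4", 0.99)]
hard = [("first attempt", "10", 0.40), ("fix an error", "12", 0.70),
        ("verify", "12", 0.97)]

easy_result = reason_with_early_exit(easy)
hard_result = reason_with_early_exit(hard)
```

The easy trace stops after one step, while the hard trace uses all three, which mirrors the behavior described in the list above.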

Results

Reasoning chain length shortened by 19.1%–80.1%
Accuracy improved by 0.3%–5.0%
No additional training needed, plug-and-play

7.5.5 Method Comparison

Method	Core Idea	Training Required	Efficiency Gain	Accuracy Impact
SAR	Perplexity difference as reward	Yes (RL training)	Length -30%	+4%
PURE	Min-form credit assignment	Yes (reward preprocessing)	2-3× sample efficiency	Significant improvement
DRQA	Quota allocation simulating batch reasoning	Yes (GRPO training)	Token -31%	Improvement
DEER	Confidence-triggered early exit	No (inference time)	Length -19%~80%	+0.3%~5%
Concise RL	Two-phase refinement training	Yes (two-phase RL)	Length significantly shortened	Improves rather than decreases

7.6 RLVR: Reinforcement Learning with Verifiable Rewards

RLVR (Reinforcement Learning with Verifiable Rewards) is one of the hottest research directions in 2025–2026, and is also the key to DeepSeek-R1's success.

7.6.1 What Is RLVR?

Unlike traditional RLHF which relies on manually annotated preference data, RLVR uses automatically verifiable signals as rewards:

Comparison Dimension	RLHF	RLVR
Reward source	Manually annotated preferences	Automatic verification (e.g., answer correctness)
Annotation cost	High	Extremely low
Applicable tasks	Open-ended (dialogue, writing)	Tasks with clear correct answers (math, code)
Scalability	Limited by annotation speed	Almost unlimited scaling
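A verifiable reward can be as small as a rule that extracts the final answer and compares it with a reference. The sketch below assumes that responses mark their final answer with `\boxed{...}` and that exact string match is sufficient; both are assumptions made for this example, and real verifiers usually normalize answers more carefully.

```python
import re

def extract_final_answer(response):
    """Return the content of the last boxed answer in the response, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def verifiable_reward(response, reference):
    """1.0 if the extracted final answer equals the reference, else 0.0."""
    answer = extract_final_answer(response)
    if answer is not None and answer == reference.strip():
        return 1.0
    return 0.0

correct = verifiable_reward(r"3 * 14 = 42, so the answer is \boxed{42}.", "42")
wrong = verifiable_reward(r"I think it is \boxed{41}.", "42")
missing = verifiable_reward("The answer is probably 42.", "42")
```

No human annotation is involved at any point, which is what makes this kind of reward cheap to scale.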

7.6.2 RLVR Problems and Improvements

Problem Decomposition Framework (Renmin University & ByteDance, 2026) [14]:

Traditional RLVR only gives rewards at the final answer (sparse rewards), causing credit assignment difficulties in long-chain reasoning. This work proposes the Decomposer-Reasoner Framework:

Decomposer: decomposes complex problems into sub-problems
Reasoner: solves sub-problems step by step
Dense rewards: each sub-problem solution has a verifiable reward

This approach converts sparse rewards to dense rewards, significantly improving the exploration efficiency of RL training.
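The effect of decomposition on the reward signal can be illustrated with a toy comparison. The sketch assumes that every sub-problem has its own checkable reference answer; the function names and the example answers are hypothetical and are not taken from the paper.

```python
def sparse_rewards(sub_answers, sub_references):
    """Reward only at the end: 1.0 if the final sub-answer is right."""
    rewards = [0.0] * len(sub_answers)
    rewards[-1] = float(sub_answers[-1] == sub_references[-1])
    return rewards

def dense_rewards(sub_answers, sub_references):
    """One verifiable reward per sub-problem."""
    return [float(a == ref) for a, ref in zip(sub_answers, sub_references)]

# Hypothetical three-step solution that goes wrong at the second step.
sub_answers = ["6", "18", "20"]
sub_references = ["6", "16", "18"]

sparse = sparse_rewards(sub_answers, sub_references)
dense = dense_rewards(sub_answers, sub_references)
```

The sparse signal only says that the attempt failed, while the dense signal shows that the first sub-problem was solved and the error started at the second one.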

7.7 RL Training for Agentic Tasks

Most of the above discussion is about RL training for reasoning tasks (math, code). A more cutting-edge direction is applying RL to truly Agentic tasks — scenarios requiring tool calling, environment interaction, and multi-step decision making.

7.7.1 AgentPRM: Process Reward Models for Agent Evaluation

In multi-turn Agent tasks (such as web navigation, API calls), evaluating only the final result is insufficient — the quality of each decision step needs to be evaluated. AgentPRM introduces Process Reward Models to evaluate the Agent's intermediate decisions.

7.7.2 R³L: Reflect-then-Retry RL

R³L (Reflect-then-Retry RL) targets failure recovery in Agent tasks:

When the Agent fails, generate language feedback diagnosing the error cause
Restart from the failure point, using feedback to avoid repeating mistakes
Greatly reduces rollout cost

7.7.3 DeepSWE: RL Training for Software Engineering Agents

DeepSWE, from the Agentica team and Together AI, is a software engineering Agent trained with RL on top of Qwen3-32B. It reports leading SWE-bench Verified results among open-weight models, demonstrating RL's potential in complex Agentic tasks.

7.8 Open Challenges and Future Directions

Despite rapid progress, the field still faces many open challenges:

7.8.1 Reward Hacking

Models may find loopholes in reward functions to "cheat" rather than truly improving capabilities. For example:

Generating long text that "looks like reasoning" but is actually nonsense
Using formatting tricks (such as specific keywords) to get high rewards
Learning to "self-deceive" in self-evaluation

7.8.2 Training Stability

Large model RL training is still not stable enough:

KL divergence management: excessive policy drift causes catastrophic forgetting
Reward scale: inconsistent scales across different reward dimensions
Data diversity: diversity of training data directly affects exploration quality

7.8.3 Generalization Capability

Current RL-trained reasoning capabilities are mainly validated in math and code domains; generalization to the following areas still needs exploration:

Open-domain reasoning (scientific reasoning, commonsense reasoning)
Multimodal reasoning (vision-language, video understanding)
Cross-lingual reasoning

7.8.4 Efficiency and Cost

RL training computational costs are still high:

Large amounts of rollout sampling
Multiple models (Policy, Reference, possibly Critic) simultaneously in GPU memory
Memory and time overhead of long-sequence reasoning

7.8.5 Future Outlook

Based on current research trends, we expect the following directions to become hot topics:

Direction	Expected Progress
Internal signal mining	More use of model's own signals (like SAR, endogenous reward) to replace external reward models
Self-evolving training	Closed-loop systems where models autonomously generate training data and reward signals
Multimodal RL	Extending reasoning RL to multimodal scenarios like vision and speech
Agentic RL expansion	Extending RL from reasoning tasks to Agent scenarios like tool calling and environment interaction
Efficient training	New algorithms reducing rollout cost and improving sample efficiency
Theoretical foundations	Deeper theoretical analysis of how RL stimulates LLM reasoning capabilities

7.9 Paper List

The following are the main papers covered in this section, organized by topic:

Reasoning Models

#	Paper	Author/Institution	Year	Core Contribution
[1]	DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via RL	DeepSeek AI	2025	Pure RL training stimulates autonomous reasoning, GRPO algorithm
[2]	Kimi k1.5: Scaling Reinforcement Learning with LLMs	Moonshot AI	2025	128K long-context RL, Long2Short distillation
[3]	QwQ: Reflect and Question to Understand the World	Alibaba	2025	Medium-scale reasoning RL
[4]	OpenAI o1/o3 System Card	OpenAI	2024/2025	Test-time compute scaling

RL Algorithms

#	Paper	Author/Institution	Year	Core Contribution
[5]	DAPO: An Open-Source LLM RL System at Scale	ByteDance Seed	2025	Decoupled clipping + dynamic sampling, open-source reproducible
[6]	VAPO: Efficient and Reliable RL for Advanced Reasoning	ByteDance Seed	2025	Value pretraining + length-adaptive GAE, AIME 60.4
[15]	GSPO: Group Sequence Policy Optimization	Alibaba (Qwen Team)	2025	Sequence-level importance sampling, stabilizes MoE training, trains Qwen3

Reward Design

#	Paper	Author/Institution	Year	Core Contribution
[7]	Self-Aligned Reward (SAR)	UIUC & AWS	2025	Perplexity difference as intrinsic reward
[8]	Co-rewarding	ICLR 2026	2025	Self-supervised RL, complementary evaluation signals
[9]	CoRLHF	Expert Systems with Applications	2026	Policy-reward joint iterative optimization
[10]	Endogenous Reward	Nanjing University (Zhi-Hua Zhou's team)	2025	LLM contains general reward function

Reasoning Efficiency

#	Paper	Author/Institution	Year	Core Contribution
[11]	PURE: Min-Form Credit Assignment	—	2025	Min-form replaces sum-form credit assignment
[12]	DRQA: Dynamic Reasoning Quota Allocation	—	2026	Dynamic reasoning quota allocation, token -31%
[13]	DEER: Dynamic Early Exit in Reasoning Models	—	2026	Training-free dynamic early exit
[14]	RLVR with Adaptive Problem Decomposition	Renmin University & ByteDance	2026	Problem decomposition dense rewards

7.10 Recommended Reading Path

If you are new to this field, it is recommended to read in the following order:

Beginner path:
1. DeepSeek-R1 paper (understand core ideas of RLVR + GRPO)
   ↓
2. GSPO paper (understand advantages of sequence-level optimization over token-level)
   ↓
3. DAPO paper + code (hands-on reproduction of large model RL training)
   ↓
4. VAPO paper (understand role of value function in long-chain reasoning)
   ↓
5. SAR / PURE papers (understand reward design and overthinking problems)
   ↓
6. Kimi k1.5 / QwQ (understand different teams' technical approaches)

If you are interested in specific topics:

Want to train reasoning models → Focus on DeepSeek-R1 + GSPO + DAPO + VAPO
Want to design reward functions → Focus on SAR + PURE + Co-rewarding
Want to optimize reasoning efficiency → Focus on DRQA + DEER + PURE
Want to do Agent RL → Focus on DeepSWE + AgentPRM + R³L
Want to train MoE models → Focus on GSPO + DAPO

Summary

In 2025–2026, the Agentic-RL field underwent a fundamental shift: RL went from an auxiliary alignment tool to a core engine for stimulating model capabilities. Several key trends are worth noting:

RL from auxiliary to core: RL is no longer just used for "alignment," but for stimulating latent reasoning capabilities from pre-training
Algorithms from complex to practical: from PPO's four-model architecture to GRPO's two-model architecture, then to GSPO's sequence-level optimization and VAPO's value-augmented solution, training is becoming increasingly efficient and stable
Rewards from external to internal: from manual annotation to verifiable rewards to model internal signals, reward design is becoming increasingly self-consistent
Focus from "stronger" to "more efficient": the overthinking problem has spawned a series of reasoning efficiency optimization solutions

These advances are gradually making the vision of "letting models learn autonomously through practice" a reality.
