Appendix E: KL Divergence (Kullback-Leibler Divergence) Explained
This appendix provides a complete introduction to KL divergence for readers with no prior background. If you are already familiar with information theory basics, you can skip directly to the Application in Agentic-RL section.
Intuitive Understanding: What Does KL Divergence Measure?
Imagine you are a weather forecaster. You have built a weather prediction model $Q$, while the true weather distribution is $P$. KL divergence measures: when you use model $Q$ to approximate the true distribution $P$, how much information is lost on average.
More plainly:
KL divergence measures the "distance" between two probability distributions — but it is an asymmetric distance.
A few key intuitions:
- $D_{KL}(P \,\|\, Q) = 0$: Holds if and only if $P$ and $Q$ are identical. The more "similar" the two distributions, the smaller the KL divergence.
- $D_{KL}(P \,\|\, Q) \ge 0$: KL divergence is always non-negative (guaranteed by Gibbs' inequality).
- $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$: Asymmetry! The "distance" from $P$ to $Q$ and from $Q$ to $P$ are generally different. This is why KL divergence is not a strict "metric" but a "divergence."
Mathematical Definition
Discrete Case
For two discrete probability distributions $P$ and $Q$ (defined on the same event space $\mathcal{X}$):

$$D_{KL}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

(With logarithm base 2, the result is measured in bits.)
Continuous Case
For two continuous probability distributions (with probability density functions $p(x)$ and $q(x)$):

$$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$
Term-by-Term Interpretation
Using the discrete case as an example:
- $P(x)$: The probability (weight) of event $x$ under the true distribution
- $\log \frac{P(x)}{Q(x)}$: The "information gap" between the true and approximate distributions at event $x$
- The whole expression is a weighted average: using the true distribution $P$ as weights, taking the expectation of the information gap over all events
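This weighted-average reading maps directly onto code. A minimal sketch for discrete distributions (the function name and zero-handling conventions are ours, following the standard ones):

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) for discrete distributions given as probability lists.

    A term with p_i == 0 contributes 0 (by the convention 0 * log(0/q) = 0);
    a term with p_i > 0 and q_i == 0 makes the divergence infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # 0 * log(0/q) = 0 by convention
        if q_i == 0:
            return math.inf   # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i, base)
    return total
```

Note the infinite case: if the model $Q$ assigns zero probability to an event the true distribution can produce, the information loss is unbounded.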
A Concrete Example
Suppose we have a 6-sided die, with true distribution $P$ and two model distributions $Q_1$, $Q_2$ as follows:
| Face | $P$ (True) | $Q_1$ (Uniform Model) | $Q_2$ (Skewed Model) |
|---|---|---|---|
| 1 | 1/6 | 1/6 | 1/2 |
| 2 | 1/6 | 1/6 | 1/10 |
| 3 | 1/6 | 1/6 | 1/10 |
| 4 | 1/6 | 1/6 | 1/10 |
| 5 | 1/6 | 1/6 | 1/10 |
| 6 | 1/6 | 1/6 | 1/10 |
Results:
- $D_{KL}(P \,\|\, Q_1) = 0$ ($Q_1$ is identical to $P$, no information loss)
- $D_{KL}(P \,\|\, Q_2) \approx 0.35$ bits ($Q_2$ deviates from the true distribution, causing information loss)
This tells us: the skewed model $Q_2$ is "worse" than the uniform model $Q_1$ — using $Q_2$ to approximate the true distribution $P$ loses more information.
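The numbers above can be reproduced directly from the table (variable names are ours):

```python
import math

# Distributions from the dice table: fair die P, uniform model Q1, skewed model Q2
P  = [1/6] * 6
Q1 = [1/6] * 6
Q2 = [1/2] + [1/10] * 5

def kl_bits(p, q):
    # D_KL(P || Q) in bits (log base 2); all probabilities here are positive
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q))

print(kl_bits(P, Q1))             # 0.0
print(round(kl_bits(P, Q2), 2))   # 0.35
```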
Relationship with Information Theory
KL divergence can be understood through two fundamental concepts in information theory:
Entropy
Entropy measures the uncertainty of distribution $P$, and is also the minimum average number of bits needed to optimally encode events from $P$:

$$H(P) = -\sum_{x} P(x) \log P(x)$$
Cross-Entropy
Cross-entropy measures: if the true distribution is $P$, but we use an encoding scheme designed based on $Q$, how many bits on average are needed to encode one event:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$
The Relationship Between the Three

$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$$

That is: KL divergence = Cross-entropy − Entropy = the extra cost of encoding with the wrong distribution.
This is why KL divergence is also called Relative Entropy.
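The identity is easy to verify numerically. A small sketch (the example distributions are our own choice):

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x), in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x), in bits."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl(p, q):
    """D_KL(P || Q) computed directly from the definition."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

P = [0.5, 0.25, 0.25]
Q = [0.25, 0.25, 0.5]

# The identity: D_KL(P || Q) = H(P, Q) - H(P)
print(round(cross_entropy(P, Q) - entropy(P), 6))  # 0.25
print(round(kl(P, Q), 6))                          # 0.25
```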
Intuition Behind Asymmetry
The asymmetry of KL divergence has important practical implications:
- $D_{KL}(P \,\|\, Q)$ (Forward KL): Penalizes $Q$ for assigning low probability where $P$ has probability density. The effect is that $Q$ tends to cover all modes of $P$ (mode-covering), potentially making $Q$ too spread out.
- $D_{KL}(Q \,\|\, P)$ (Reverse KL): Penalizes $Q$ for assigning high probability where $P$ has no probability density. The effect is that $Q$ tends to concentrate on one mode of $P$ (mode-seeking), potentially making $Q$ too concentrated.
A vivid analogy:
- Forward KL is like a "cautious person": would rather over-cover than miss any possibility
- Reverse KL is like a "focused person": would rather focus only on the most important part than spread attention
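Both behaviors show up even in a toy discrete example. A sketch with a "bimodal" target (the distributions below are illustrative, chosen by us):

```python
import math

def kl(p, q):
    # D_KL(P || Q) in bits; terms with p_i == 0 contribute nothing
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# A "bimodal" target: most of the mass sits on the two outer outcomes
P      = [0.49, 0.02, 0.49]
Q_wide = [1/3, 1/3, 1/3]      # spread out, covers both modes
Q_peak = [0.90, 0.05, 0.05]   # concentrated on a single mode

# Forward KL (the "cautious person") prefers the mode-covering candidate ...
assert kl(P, Q_wide) < kl(P, Q_peak)
# ... while reverse KL (the "focused person") prefers the mode-seeking one
assert kl(Q_peak, P) < kl(Q_wide, P)
```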
Application in Agentic-RL
In 18.1 What is Agentic-RL, the RL stage loss function includes a KL divergence penalty term of the form:

$$\beta \, D_{KL}\big(\pi_\theta \,\|\, \pi_{\text{SFT}}\big)$$

The specific meaning of $D_{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ here is: the average divergence between the output distribution of the current policy $\pi_\theta$ and that of the frozen SFT reference policy $\pi_{\text{SFT}}$, taken over the contexts encountered during training.
Why Is KL Constraint Needed?
During RL training, the model continuously updates parameters to maximize rewards. Without constraints, the model might go to two extremes:
- Reward Hacking: The model finds ways to exploit loopholes in the reward function to get high scores, but actual output quality is poor. For example, the model might learn to generate a specific format to fool the reward model, rather than truly solving the problem.
- Language Degeneration: The model's output no longer resembles natural language, producing repetitive, meaningless token sequences.
The KL divergence penalty term acts as a "safety rope":
- If the current policy $\pi_\theta$ has the same output distribution as the SFT policy $\pi_{\text{SFT}}$, then $D_{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}}) = 0$ and no additional penalty is applied
- If the current policy deviates too far from the SFT policy, $D_{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ grows, the penalty term in the loss function grows with it, and the policy is "pulled" back toward a safe range
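As a rough sketch of how such a penalty is typically estimated in practice (names like `kl_penalty` and `beta` are ours, not from any specific library): for each generated token we assume we have the log-probability assigned to it by the current policy and by the frozen SFT reference; their difference, averaged over the sampled tokens, is a standard Monte-Carlo estimate of the KL divergence along the response.

```python
def kl_penalty(logp_current, logp_ref, beta):
    """Per-sequence KL penalty estimate (a sketch, not a full RL loss).

    logp_current / logp_ref: per-token log-probabilities of the sampled
    tokens under the current policy and the SFT reference, respectively.
    The sample mean of their difference estimates D_KL(pi_theta || pi_SFT).
    """
    per_token = [lc - lr for lc, lr in zip(logp_current, logp_ref)]
    return beta * sum(per_token) / len(per_token)

# Identical policies: the estimate is exactly zero, so no penalty is added
assert kl_penalty([-1.2, -0.7], [-1.2, -0.7], beta=0.1) == 0.0
```

Note that an individual sample of this estimator can be negative; only its expectation is guaranteed non-negative.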
The Role of $\beta$
The hyperparameter $\beta$ controls the strength of the KL constraint:
| $\beta$ Value | Effect | Applicable Scenario |
|---|---|---|
| Larger (e.g., 0.1–0.5) | Conservative policy, closely follows SFT model | Early training, high task safety requirements |
| Smaller (e.g., 0.001–0.01) | Free policy, allows large exploration | Late training, task has clear objective evaluation criteria |
| Adaptive | Dynamically adjusted, keeps KL within target range | Commonly used in PPO |
In GRPO (Group Relative Policy Optimization), the specific implementation of KL penalty differs; see 18.5 GRPO: Group Relative Policy Optimization and Reward Function Design.
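The adaptive variant in the table above can be sketched following the rule described in the PPO paper (the function name is ours):

```python
def adapt_beta(beta, measured_kl, target_kl):
    """Adaptive KL coefficient, after Schulman et al. (2017):
    tighten the constraint when the policy drifts past the target band,
    relax it when the policy stays well inside."""
    if measured_kl > 1.5 * target_kl:
        return beta * 2.0
    if measured_kl < target_kl / 1.5:
        return beta / 2.0
    return beta
```

Called once per training iteration with the measured KL of the latest batch, this keeps the divergence oscillating around `target_kl` instead of requiring a hand-tuned fixed $\beta$.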
Summary
| Concept | One-Line Description |
|---|---|
| KL Divergence | Average information loss when approximating distribution $P$ with distribution $Q$ |
| Non-negativity | $D_{KL}(P \,\|\, Q) \ge 0$, equality holds if and only if $P = Q$ |
| Asymmetry | $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$ |
| Relationship with Cross-Entropy | $D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$ (Cross-Entropy − Entropy) |
| Role in RL | Prevents policy from deviating too far from the reference model, avoiding reward hacking and language degeneration |
Further Reading
- Kullback S, Leibler R A. On Information and Sufficiency[J]. The Annals of Mathematical Statistics, 1951, 22(1): 79-86.
- Cover T M, Thomas J A. Elements of Information Theory[M]. 2nd ed. Wiley, 2006. (Chapter 2 covers KL divergence properties in detail)
- Schulman J, et al. Proximal Policy Optimization Algorithms[R]. arXiv:1707.06347, 2017. (Engineering practice of KL constraints in PPO)