Appendix E: KL Divergence (Kullback-Leibler Divergence) Explained

This appendix provides a complete introduction to KL divergence for readers with no prior background. If you are already familiar with information theory basics, you can skip directly to the Application in Agentic-RL section.

Intuitive Understanding: What Does KL Divergence Measure?

Imagine you are a weather forecaster. You have built a weather prediction model $Q$, while the true weather distribution is $P$. KL divergence measures: when you use model $Q$ to approximate the true distribution $P$, how much information is lost on average.

More plainly:

KL divergence measures the "distance" between two probability distributions — but it is an asymmetric distance.

A few key intuitions:

  • $D_{KL}(P \,\|\, Q) = 0$: Holds if and only if $P$ and $Q$ are identical. The more "similar" the two distributions, the smaller the KL divergence.
  • $D_{KL}(P \,\|\, Q) \geq 0$: KL divergence is always non-negative (guaranteed by Gibbs' inequality).
  • $D_{KL}(P \,\|\, Q) \neq D_{KL}(Q \,\|\, P)$: Asymmetry! The "distance" from $P$ to $Q$ and from $Q$ to $P$ are generally different. This is why KL divergence is not a strict "metric" but a "divergence."

Mathematical Definition

Discrete Case

For two discrete probability distributions $P$ and $Q$ (defined on the same event space $\mathcal{X}$):

$$D_{KL}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

Continuous Case

For two continuous probability distributions (with probability density functions $p(x)$ and $q(x)$):

$$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$

Term-by-Term Interpretation

Using the discrete case as an example:

  • $P(x)$: The probability (weight) of event $x$ in the true distribution
  • $\log \frac{P(x)}{Q(x)}$: The "information gap" between the true and approximate distributions at event $x$
  • The whole expression is a weighted average: using the true distribution $P(x)$ as weights, taking the expectation of the information gap at each event

A Concrete Example

Suppose we have a 6-sided die, with true distribution $P$ and two model distributions $Q_1$ and $Q_2$, as follows:

| Face | $P$ (True) | $Q_1$ (Uniform Model) | $Q_2$ (Skewed Model) |
|------|------------|-----------------------|----------------------|
| 1    | 1/6        | 1/6                   | 1/2                  |
| 2    | 1/6        | 1/6                   | 1/10                 |
| 3    | 1/6        | 1/6                   | 1/10                 |
| 4    | 1/6        | 1/6                   | 1/10                 |
| 5    | 1/6        | 1/6                   | 1/10                 |
| 6    | 1/6        | 1/6                   | 1/10                 |

Results:

  • $D_{KL}(P \,\|\, Q_1) = 0$ ($Q_1$ is identical to $P$, no information loss)
  • $D_{KL}(P \,\|\, Q_2) \approx 0.35$ bits ($Q_2$ deviates from the true distribution, causing information loss)

This tells us: the skewed model $Q_2$ is "worse" than the uniform model $Q_1$, because using $Q_2$ to approximate the true distribution loses more information.
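The two values above can be checked with a few lines of Python. The snippet below is a minimal sketch that implements the discrete definition directly; the function name `kl_divergence` is an illustrative choice, not code from this book.

```python
import math

# True distribution and the two models from the die table above.
P  = [1/6] * 6               # fair die
Q1 = [1/6] * 6               # uniform model
Q2 = [1/2] + [1/10] * 5      # skewed model

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) * log2(p(x) / q(x)), in bits; terms with p(x) = 0 are skipped."""
    return sum(px * math.log2(px / qx) for px, qx in zip(p, q) if px > 0)

print(kl_divergence(P, Q1))  # 0.0        -> no information loss
print(kl_divergence(P, Q2))  # ~0.35 bits -> the skewed model loses information
```

If SciPy is available, `scipy.stats.entropy(P, Q2, base=2)` computes the same quantity.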


Relationship with Information Theory

KL divergence can be understood through two fundamental concepts in information theory:

Entropy

Entropy $H(P)$ measures the uncertainty of distribution $P$, and is also the minimum average number of bits needed to optimally encode events from $P$:

$$H(P) = -\sum_{x} P(x) \log P(x)$$

Cross-Entropy

Cross-entropy $H(P, Q)$ measures: if the true distribution is $P$, but we use an encoding scheme designed based on $Q$, how many bits on average are needed to encode one event:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$

The Relationship Between the Three

$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$$

That is: KL divergence = Cross-entropy − Entropy = the extra cost of encoding with the wrong distribution.

This is why KL divergence is also called Relative Entropy.
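A quick numerical check of this identity, using a small made-up three-event example (the values of `P` and `Q` below are arbitrary, chosen only to illustrate the relationship):

```python
import math

P = [0.5, 0.25, 0.25]   # "true" distribution (arbitrary example)
Q = [0.25, 0.25, 0.5]   # approximate distribution (arbitrary example)

entropy       = -sum(p * math.log2(p) for p in P)                 # H(P)       = 1.5 bits
cross_entropy = -sum(p * math.log2(q) for p, q in zip(P, Q))      # H(P, Q)    = 1.75 bits
kl            =  sum(p * math.log2(p / q) for p, q in zip(P, Q))  # D_KL(P||Q) = 0.25 bits

# KL divergence equals cross-entropy minus entropy.
print(kl, cross_entropy - entropy)  # 0.25 0.25
```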


Intuition Behind Asymmetry

The asymmetry of KL divergence has important practical implications:

  • (Forward KL): Penalizes for assigning low probability where has probability density. The effect is that tends to cover all modes of (mode-covering), potentially making too spread out.
  • (Reverse KL): Penalizes for assigning high probability where has no probability density. The effect is that tends to concentrate on one mode of (mode-seeking), potentially making too concentrated.

A vivid analogy:

  • Forward KL is like a "cautious person": would rather over-cover than miss any possibility
  • Reverse KL is like a "focused person": would rather focus only on the most important part than spread attention
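The sketch below makes this concrete with a small grid search: a bimodal target $P$ is approximated by single Gaussians $Q$, minimizing either direction of the KL divergence. The target, the candidate family, and all numeric settings are illustrative assumptions chosen only to show the effect.

```python
import numpy as np

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]

def gaussian(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Bimodal "true" distribution P with modes at -4 and +4.
p = 0.5 * gaussian(x, -4.0, 1.0) + 0.5 * gaussian(x, 4.0, 1.0)
p /= p.sum() * dx

def kl(a, b):
    """Discretized D_KL(a || b) for two densities sampled on the same grid."""
    return float(np.sum(a * np.log(a / b)) * dx)

# Search over single Gaussians Q(mu, sigma) for the best fit under each direction.
best_fwd = best_rev = None
for mu in np.linspace(-6.0, 6.0, 241):
    for sigma in (1.0, 2.0, 4.0):
        q = gaussian(x, mu, sigma)
        q /= q.sum() * dx
        fwd, rev = kl(p, q), kl(q, p)      # forward: D(P || Q), reverse: D(Q || P)
        if best_fwd is None or fwd < best_fwd[0]:
            best_fwd = (fwd, mu, sigma)
        if best_rev is None or rev < best_rev[0]:
            best_rev = (rev, mu, sigma)

print("forward KL picks:", best_fwd)  # a wide Gaussian spanning both modes (mode-covering)
print("reverse KL picks:", best_rev)  # a narrow Gaussian sitting on one mode (mode-seeking)
```

Running it, the forward-KL winner is the wide Gaussian centered between the two modes, while the reverse-KL winner sits tightly on a single mode.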

Application in Agentic-RL

In 18.1 What is Agentic-RL, the RL stage loss function includes a KL divergence penalty term:

$$\beta \, D_{KL}\!\left( \pi_\theta(\cdot \mid x) \,\|\, \pi_{\text{SFT}}(\cdot \mid x) \right)$$

The specific meaning of $D_{KL}\!\left( \pi_\theta \,\|\, \pi_{\text{SFT}} \right)$ here is:

$$D_{KL}\!\left( \pi_\theta \,\|\, \pi_{\text{SFT}} \right) = \mathbb{E}_{y \sim \pi_\theta(\cdot \mid x)}\!\left[ \log \frac{\pi_\theta(y \mid x)}{\pi_{\text{SFT}}(y \mid x)} \right]$$

Why Is KL Constraint Needed?

During RL training, the model continuously updates parameters to maximize rewards. Without constraints, the model might go to two extremes:

  1. Reward Hacking: The model finds ways to exploit loopholes in the reward function to get high scores, but actual output quality is poor. For example, the model might learn to generate a specific format to fool the reward model, rather than truly solving the problem.
  2. Language Degeneration: The model's output no longer resembles natural language, producing repetitive, meaningless token sequences.

The KL divergence penalty term acts as a "safety rope":

  • If the current policy $\pi_\theta$ has the same output distribution as the SFT policy $\pi_{\text{SFT}}$, then $D_{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}}) = 0$ and there is no additional penalty
  • If the current policy deviates too far from the SFT policy, $D_{KL}$ increases, the penalty term in the loss function increases, "pulling" the policy back to a safe range
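A minimal sketch of how such a penalty is typically computed from per-token log-probabilities during training. The function and variable names are illustrative, not taken from this book or any particular framework, and the simple estimator $\log \pi_\theta - \log \pi_{\text{SFT}}$ on sampled tokens is just one common choice.

```python
# Per-token KL penalty sketch (illustrative names; beta and the log-probs are made up).
def kl_penalty(logp_current, logp_sft, beta):
    """Estimate beta * D_KL(pi_theta || pi_sft) over a sampled response, using the
    per-token estimator log pi_theta(y_t | context) - log pi_sft(y_t | context)."""
    per_token_kl = [lp - lp_ref for lp, lp_ref in zip(logp_current, logp_sft)]
    return beta * sum(per_token_kl) / len(per_token_kl)

# Log-probabilities of the same sampled tokens under the two policies.
logp_current = [-1.2, -0.8, -2.1, -0.5]   # log pi_theta
logp_sft     = [-1.5, -1.0, -2.0, -0.9]   # log pi_sft
print(kl_penalty(logp_current, logp_sft, beta=0.1))  # ~0.02, a small positive penalty
```

In practice this per-token estimate is usually added to the loss (or subtracted from the reward) at each token position; the averaging above is only for illustration.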

The Role of $\beta$

The hyperparameter $\beta$ controls the strength of the KL constraint:

| $\beta$ Value | Effect | Applicable Scenario |
|---|---|---|
| Larger (e.g., 0.1–0.5) | Conservative policy, closely follows SFT model | Early training, high task safety requirements |
| Smaller (e.g., 0.001–0.01) | Free policy, allows large exploration | Late training, task has clear objective evaluation criteria |
| Adaptive | Dynamically adjusted, keeps KL within target range | Commonly used in PPO |
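For the adaptive case, a common recipe (described for PPO by Schulman et al., 2017) grows or shrinks the coefficient depending on whether the measured KL overshoots or undershoots a target. The target value, thresholds, and scaling factors below are illustrative assumptions, not values from this book.

```python
def update_beta(beta, observed_kl, target_kl=0.01):
    """Adaptive KL coefficient: strengthen the constraint when the policy drifts
    too far from the reference, loosen it when the policy is overly constrained."""
    if observed_kl > 1.5 * target_kl:
        beta *= 2.0        # KL too large -> penalize deviation more strongly
    elif observed_kl < target_kl / 1.5:
        beta *= 0.5        # KL too small -> allow more exploration
    return beta

beta = 0.1
for observed_kl in [0.002, 0.004, 0.02, 0.05]:   # made-up per-step KL measurements
    beta = update_beta(beta, observed_kl)
    print(f"observed KL = {observed_kl:.3f} -> beta = {beta:.4f}")
```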

In GRPO (Group Relative Policy Optimization), the specific implementation of KL penalty differs; see 18.5 GRPO: Group Relative Policy Optimization and Reward Function Design.


Summary

| Concept | One-Line Description |
|---|---|
| KL Divergence | Average information loss when approximating distribution $P$ with distribution $Q$ |
| Non-negativity | $D_{KL}(P \Vert Q) \geq 0$, equality holds if and only if $P = Q$ |
| Asymmetry | $D_{KL}(P \Vert Q) \neq D_{KL}(Q \Vert P)$ |
| Relationship with Cross-Entropy | $D_{KL}(P \Vert Q) = H(P, Q) - H(P)$ |
| Role in RL | Prevents the policy from deviating too far from the reference model, avoiding reward hacking and language degeneration |

Further Reading

  • Kullback, S., and Leibler, R. A. "On Information and Sufficiency." The Annals of Mathematical Statistics, 1951, 22(1): 79–86.
  • Cover, T. M., and Thomas, J. A. Elements of Information Theory. 2nd ed. Wiley, 2006. (Chapter 2 covers KL divergence properties in detail)
  • Schulman, J., et al. "Proximal Policy Optimization Algorithms." arXiv:1707.06347, 2017. (Engineering practice of KL constraints in PPO)