Appendix E: KL Divergence (Kullback-Leibler Divergence) Explained
This appendix provides a complete introduction to KL divergence for readers with no prior background. If you are already familiar with information theory basics, you can skip directly to the Application in Agentic-RL section.
Intuitive Understanding: What Does KL Divergence Measure?
Imagine you are a weather forecaster. You have built a weather prediction model $Q$, while the true weather distribution is $P$. KL divergence measures: when you use model $Q$ to approximate the true distribution $P$, how much information is lost on average.
More plainly:
KL divergence measures the "distance" between two probability distributions — but it is an asymmetric distance.
A few key intuitions:
- $D_{KL}(P \,\|\, Q) = 0$: Holds if and only if $P$ and $Q$ are identical. The more "similar" the two distributions, the smaller the KL divergence.
- $D_{KL}(P \,\|\, Q) \ge 0$: KL divergence is always non-negative (guaranteed by Gibbs' inequality).
- $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$: Asymmetry! The "distance" from $P$ to $Q$ and from $Q$ to $P$ are generally different. This is why KL divergence is not a strict "metric" but a "divergence."
Mathematical Definition
Discrete Case
For two discrete probability distributions $P$ and $Q$ (defined on the same event space $\mathcal{X}$):

$$D_{KL}(P \,\|\, Q) = \sum_{x \in \mathcal{X}} P(x) \log \frac{P(x)}{Q(x)}$$

(With logarithm base 2, the result is measured in bits.)
Continuous Case
For two continuous probability distributions (with probability density functions $p(x)$ and $q(x)$):

$$D_{KL}(P \,\|\, Q) = \int_{-\infty}^{\infty} p(x) \log \frac{p(x)}{q(x)} \, dx$$
Term-by-Term Interpretation
Using the discrete case as an example:
- $P(x)$: The probability (weight) of event $x$ under the true distribution
- $\log \frac{P(x)}{Q(x)}$: The "information gap" between the true and approximate distributions at event $x$
- The whole expression is a weighted average: using the true distribution $P$ as weights, taking the expectation of the information gap over all events
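This weighted-average reading maps directly onto code. A minimal sketch for discrete distributions (the function name and zero-handling conventions are ours, following the standard ones):

```python
import math

def kl_divergence(p, q, base=2):
    """D_KL(P || Q) for discrete distributions given as probability lists.

    A term with p_i == 0 contributes 0 (by the convention 0 * log(0/q) = 0);
    a term with p_i > 0 and q_i == 0 makes the divergence infinite.
    """
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i == 0:
            continue          # 0 * log(0/q) = 0 by convention
        if q_i == 0:
            return math.inf   # P puts mass where Q has none
        total += p_i * math.log(p_i / q_i, base)
    return total
```

Note the infinite case: if the model $Q$ assigns zero probability to an event the true distribution can produce, the information loss is unbounded.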
A Concrete Example
Suppose we have a 6-sided die, with true distribution $P$ and two model distributions $Q_1$, $Q_2$ as follows:
| Face | $P$ (True) | $Q_1$ (Uniform Model) | $Q_2$ (Skewed Model) |
|---|---|---|---|
| 1 | 1/6 | 1/6 | 1/2 |
| 2 | 1/6 | 1/6 | 1/10 |
| 3 | 1/6 | 1/6 | 1/10 |
| 4 | 1/6 | 1/6 | 1/10 |
| 5 | 1/6 | 1/6 | 1/10 |
| 6 | 1/6 | 1/6 | 1/10 |
Results:
- $D_{KL}(P \,\|\, Q_1) = 0$ ($Q_1$ is identical to $P$, no information loss)
- $D_{KL}(P \,\|\, Q_2) \approx 0.35$ bits ($Q_2$ deviates from the true distribution, causing information loss)
This tells us: the skewed model $Q_2$ is "worse" than the uniform model $Q_1$ — using $Q_2$ to approximate the true distribution $P$ loses more information.
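The numbers above can be reproduced directly from the table (variable names are ours):

```python
import math

# Distributions from the dice table: fair die P, uniform model Q1, skewed model Q2
P  = [1/6] * 6
Q1 = [1/6] * 6
Q2 = [1/2] + [1/10] * 5

def kl_bits(p, q):
    # D_KL(P || Q) in bits (log base 2); all probabilities here are positive
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q))

print(kl_bits(P, Q1))             # 0.0
print(round(kl_bits(P, Q2), 2))   # 0.35
```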
Relationship with Information Theory
KL divergence can be understood through two fundamental concepts in information theory:
Entropy
Entropy measures the uncertainty of distribution $P$, and is also the minimum average number of bits needed to optimally encode events from $P$:

$$H(P) = -\sum_{x} P(x) \log P(x)$$
Cross-Entropy
Cross-entropy measures: if the true distribution is $P$, but we use an encoding scheme designed based on $Q$, how many bits on average are needed to encode one event:

$$H(P, Q) = -\sum_{x} P(x) \log Q(x)$$
The Relationship Between the Three

$$D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$$

That is: KL divergence = Cross-entropy − Entropy = the extra cost of encoding with the wrong distribution.
This is why KL divergence is also called Relative Entropy.
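The identity is easy to verify numerically. A small sketch (the example distributions are our own choice):

```python
import math

def entropy(p):
    """H(P) = -sum_x P(x) log2 P(x), in bits."""
    return -sum(x * math.log2(x) for x in p if x > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x), in bits."""
    return -sum(p_i * math.log2(q_i) for p_i, q_i in zip(p, q) if p_i > 0)

def kl(p, q):
    """D_KL(P || Q) computed directly from the definition."""
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

P = [0.5, 0.25, 0.25]
Q = [0.25, 0.25, 0.5]

# The identity: D_KL(P || Q) = H(P, Q) - H(P)
print(round(cross_entropy(P, Q) - entropy(P), 6))  # 0.25
print(round(kl(P, Q), 6))                          # 0.25
```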
Intuition Behind Asymmetry
The asymmetry of KL divergence has important practical implications:
- $D_{KL}(P \,\|\, Q)$ (Forward KL): Penalizes $Q$ for assigning low probability where $P$ has probability density. The effect is that $Q$ tends to cover all modes of $P$ (mode-covering), potentially making $Q$ too spread out.
- $D_{KL}(Q \,\|\, P)$ (Reverse KL): Penalizes $Q$ for assigning high probability where $P$ has no probability density. The effect is that $Q$ tends to concentrate on one mode of $P$ (mode-seeking), potentially making $Q$ too concentrated.
A vivid analogy:
- Forward KL is like a "cautious person": would rather over-cover than miss any possibility
- Reverse KL is like a "focused person": would rather focus only on the most important part than spread attention
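Both behaviors show up even in a toy discrete example. A sketch with a "bimodal" target (the distributions below are illustrative, chosen by us):

```python
import math

def kl(p, q):
    # D_KL(P || Q) in bits; terms with p_i == 0 contribute nothing
    return sum(p_i * math.log2(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0)

# A "bimodal" target: most of the mass sits on the two outer outcomes
P      = [0.49, 0.02, 0.49]
Q_wide = [1/3, 1/3, 1/3]      # spread out, covers both modes
Q_peak = [0.90, 0.05, 0.05]   # concentrated on a single mode

# Forward KL (the "cautious person") prefers the mode-covering candidate ...
assert kl(P, Q_wide) < kl(P, Q_peak)
# ... while reverse KL (the "focused person") prefers the mode-seeking one
assert kl(Q_peak, P) < kl(Q_wide, P)
```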
Application in Agentic-RL
In 18.1 What is Agentic-RL, the RL stage loss function includes a KL divergence penalty term of the form:

$$\beta \, D_{KL}\big(\pi_\theta \,\|\, \pi_{\text{SFT}}\big)$$

The specific meaning of $D_{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ here is: the average divergence between the output distribution of the current policy $\pi_\theta$ and that of the frozen SFT reference policy $\pi_{\text{SFT}}$, taken over the contexts encountered during training.
Why Is KL Constraint Needed?
During RL training, the model continuously updates parameters to maximize rewards. Without constraints, the model might go to two extremes:
- Reward Hacking: The model finds ways to exploit loopholes in the reward function to get high scores, but actual output quality is poor. For example, the model might learn to generate a specific format to fool the reward model, rather than truly solving the problem.
- Language Degeneration: The model's output no longer resembles natural language, producing repetitive, meaningless token sequences.
The KL divergence penalty term acts as a "safety rope":
- If the current policy $\pi_\theta$ has the same output distribution as the SFT policy $\pi_{\text{SFT}}$, then $D_{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}}) = 0$ and no additional penalty is applied
- If the current policy deviates too far from the SFT policy, $D_{KL}(\pi_\theta \,\|\, \pi_{\text{SFT}})$ grows, the penalty term in the loss function grows with it, and the policy is "pulled" back toward a safe range
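As a rough sketch of how such a penalty is typically estimated in practice (names like `kl_penalty` and `beta` are ours, not from any specific library): for each generated token we assume we have the log-probability assigned to it by the current policy and by the frozen SFT reference; their difference, averaged over the sampled tokens, is a standard Monte-Carlo estimate of the KL divergence along the response.

```python
def kl_penalty(logp_current, logp_ref, beta):
    """Per-sequence KL penalty estimate (a sketch, not a full RL loss).

    logp_current / logp_ref: per-token log-probabilities of the sampled
    tokens under the current policy and the SFT reference, respectively.
    The sample mean of their difference estimates D_KL(pi_theta || pi_SFT).
    """
    per_token = [lc - lr for lc, lr in zip(logp_current, logp_ref)]
    return beta * sum(per_token) / len(per_token)

# Identical policies: the estimate is exactly zero, so no penalty is added
assert kl_penalty([-1.2, -0.7], [-1.2, -0.7], beta=0.1) == 0.0
```

Note that an individual sample of this estimator can be negative; only its expectation is guaranteed non-negative.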
The Role of $\beta$
The hyperparameter $\beta$ controls the strength of the KL constraint:
| $\beta$ Value | Effect | Applicable Scenario |
|---|---|---|
| Larger (e.g., 0.1–0.5) | Conservative policy, closely follows SFT model | Early training, high task safety requirements |
| Smaller (e.g., 0.001–0.01) | Free policy, allows large exploration | Late training, task has clear objective evaluation criteria |
| Adaptive | Dynamically adjusted, keeps KL within target range | Commonly used in PPO |
In GRPO (Group Relative Policy Optimization), the specific implementation of KL penalty differs; see 18.5 GRPO: Group Relative Policy Optimization and Reward Function Design.
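The adaptive variant in the table above can be sketched following the rule described in the PPO paper (the function name is ours):

```python
def adapt_beta(beta, measured_kl, target_kl):
    """Adaptive KL coefficient, after Schulman et al. (2017):
    tighten the constraint when the policy drifts past the target band,
    relax it when the policy stays well inside."""
    if measured_kl > 1.5 * target_kl:
        return beta * 2.0
    if measured_kl < target_kl / 1.5:
        return beta / 2.0
    return beta
```

Called once per training iteration with the measured KL of the latest batch, this keeps the divergence oscillating around `target_kl` instead of requiring a hand-tuned fixed $\beta$.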
Summary
| Concept | One-Line Description |
|---|---|
| KL Divergence | Average information loss when approximating distribution $P$ with distribution $Q$ |
| Non-negativity | $D_{KL}(P \,\|\, Q) \ge 0$, equality holds if and only if $P = Q$ |
| Asymmetry | $D_{KL}(P \,\|\, Q) \ne D_{KL}(Q \,\|\, P)$ |
| Relationship with Cross-Entropy | $D_{KL}(P \,\|\, Q) = H(P, Q) - H(P)$ (Cross-Entropy − Entropy) |
| Role in RL | Prevents policy from deviating too far from the reference model, avoiding reward hacking and language degeneration |
Further Reading
- Kullback S, Leibler R A. On Information and Sufficiency[J]. The Annals of Mathematical Statistics, 1951, 22(1): 79-86.
- Cover T M, Thomas J A. Elements of Information Theory[M]. 2nd ed. Wiley, 2006. (Chapter 2 covers KL divergence properties in detail)
- Schulman J, et al. Proximal Policy Optimization Algorithms[R]. arXiv:1707.06347, 2017. (Engineering practice of KL constraints in PPO)