
docs: clarify PPO entropy metrics in PPO trainer docs#5289

Open
biefan wants to merge 1 commit into huggingface:main from biefan:docs/clarify-ppo-entropy-metrics-2023

Conversation


@biefan biefan commented Mar 14, 2026

Summary

Clarify the difference between `objective/entropy` and `policy/entropy_avg` in the PPO trainer docs.

What changed

  • Updated the `objective/entropy` description to match the rollout-time computation, `(-logprobs).sum(1).mean()`.
  • Updated the `policy/entropy_avg` description to match the optimization-time entropy computed from the logits.
  • Added a short note explaining why these two metrics are expected to differ.
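To make the distinction concrete, here is a minimal PyTorch sketch of the two quantities. This is illustrative only, not TRL's actual implementation; the tensor shapes and variable names are assumptions for the example.

```python
import torch

torch.manual_seed(0)

# Toy shapes: batch of 2 sequences, 5 response tokens, vocab of 10.
logits = torch.randn(2, 5, 10)

# Optimization-time entropy (policy/entropy_avg-style): the full categorical
# entropy of the token distribution, computed directly from the logits and
# averaged over all positions.
logprobs_full = torch.log_softmax(logits, dim=-1)
probs = logprobs_full.exp()
entropy_from_logits = (-probs * logprobs_full).sum(-1).mean()

# Rollout-time proxy (objective/entropy-style): the negative log-prob of the
# tokens actually sampled, summed over the response length, then averaged
# over the batch.
sampled = torch.distributions.Categorical(logits=logits).sample()
logprobs = logprobs_full.gather(-1, sampled.unsqueeze(-1)).squeeze(-1)
entropy_proxy = (-logprobs).sum(1).mean()

print(float(entropy_from_logits), float(entropy_proxy))
```

The proxy is a single-sample Monte Carlo estimate of the entropy *summed* over response tokens, while the logits-based metric is the exact per-token entropy *averaged* over positions, so the two numbers are expected to differ in both scale and variance even on the same batch.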

Why

Issue #2023 points out that the two entropy metrics have very similar wording, which makes them hard to interpret when debugging PPO runs.

Fixes #2023


Note

Low Risk
Low-risk, documentation-only change that updates metric wording and adds a brief clarification note; no runtime or API behavior is modified.

Overview
Clarifies the PPO trainer metric docs by rewriting the descriptions of `objective/entropy` (a rollout-time proxy computed from `-logprobs`) and `policy/entropy_avg` (the optimization-time categorical entropy computed from the logits).

Adds an explicit note explaining that these metrics are measured at different phases (rollouts vs. PPO optimization) and therefore are expected to differ.

Written by Cursor Bugbot for commit 8367f1e.


Development

Successfully merging this pull request may close these issues.

Clarification of 2 Entropies in PPOv2Trainer Documentation
