[Question] Details about DAPO training (entropy behavior, clip_high, dynamic sampling, loss design) #11
Hi, thanks again for your great work!
I have a question regarding the DAPO training described in the paper.
In the paper, it is mentioned that using DAPO leads to a noticeable entropy decrease during training. However, from my understanding, one of the key motivations behind DAPO is to prevent entropy from collapsing too quickly and maintain better exploration.
While going through the released training scripts, I did not find implementations corresponding to some components that seem important for DAPO, such as:
- `clip_high` in the policy update
- dynamic filtering of samples
- specific mechanisms to stabilize entropy
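For concreteness, by "dynamic filtering" I mean something along the lines of the dynamic sampling described in the DAPO paper, which drops prompts whose sampled completions all receive the same reward (so the group advantage is zero everywhere). A minimal sketch of what I was looking for in the scripts (the helper name and batch layout are hypothetical, not from this repo):

```python
def keep_prompt(group_rewards):
    """DAPO-style dynamic sampling: drop a prompt when every sampled
    completion got the same reward, since it carries no advantage signal."""
    return len(set(group_rewards)) > 1

# hypothetical batch: prompt id -> rewards of its sampled completions
batch = {"p1": [1.0, 0.0, 1.0], "p2": [1.0, 1.0, 1.0]}
kept = {p: r for p, r in batch.items() if keep_prompt(r)}  # only "p1" survives
```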
So I would like to ask for more detailed clarification on the DAPO setup used in the paper:
1. Was `clip_high` used in your implementation? If so, what value/range was used?
2. Was any form of dynamic sampling applied during training?
3. What loss formulation was used?
   - Is it token-level mean loss (`token_mean`) or sequence-level?
   - Any modification compared to standard PPO-style objectives?
4. Was length normalization or length penalty applied? If yes, how was it incorporated into the reward or loss?
5. How do you interpret the entropy drop observed in your experiments? Is it expected behavior under your configuration, or controlled in some way?
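To make questions 1 and 3 unambiguous, here is a sketch of the decoupled clipping plus token-level mean I have in mind (assumptions: the `clip_low=0.2`, `clip_high=0.28` defaults are the values reported in the DAPO paper, not necessarily what this repo used, and the function signature is mine):

```python
import torch

def dapo_token_loss(log_probs, old_log_probs, advantages, loss_mask,
                    clip_low=0.2, clip_high=0.28):
    """PPO-style surrogate with a decoupled ("Clip-Higher") clip range and a
    token-level mean: average over all unmasked tokens in the batch, rather
    than averaging per sequence first."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_low, 1.0 + clip_high)
    surrogate = torch.minimum(ratio * advantages, clipped * advantages)
    return -(surrogate * loss_mask).sum() / loss_mask.sum()
```

With `clip_high > clip_low`, positive-advantage tokens can be up-weighted further before clipping kicks in, which is the mechanism the paper credits with slowing entropy collapse, hence my surprise at the reported entropy decrease.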
Understanding these details would be very helpful for reproducing the results and better understanding the role of DAPO in your framework.
Thanks a lot for your time and help!