
[Question] Details about DAPO training (entropy behavior, clip_high, dynamic sampling, loss design) #11

@Hasuer

Description

Hi, thanks again for your great work!

I have a question regarding the DAPO training described in the paper.

In the paper, it is mentioned that DAPO training leads to a noticeable entropy decrease. However, from my understanding, one of the key motivations behind DAPO is to prevent entropy from collapsing too quickly and to maintain better exploration.

While going through the released training scripts, I could not find implementations of some components that seem central to DAPO, such as:

  • clip_high in the policy update (see the sketch after this list)
  • dynamic filtering of samples
  • specific mechanisms to stabilize entropy
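
For reference, by clip_high I mean the Clip-Higher trick from the DAPO paper, i.e. a decoupled (asymmetric) clipping range in the PPO-style objective. A minimal PyTorch-style sketch of my understanding (the 0.2/0.28 values are the defaults reported in the paper; the function name is my own, not taken from this repo):

```python
import torch

def clip_higher_policy_loss(log_probs, old_log_probs, advantages,
                            eps_low=0.2, eps_high=0.28):
    """PPO-style clipped objective with a decoupled clip range.

    eps_low / eps_high follow the defaults reported in the DAPO paper;
    the larger upper bound leaves more headroom for increasing the
    probability of low-probability (exploratory) tokens, which is the
    mechanism meant to slow down entropy collapse.
    """
    ratio = torch.exp(log_probs - old_log_probs)  # pi_theta / pi_old, per token
    clipped_ratio = torch.clamp(ratio, 1.0 - eps_low, 1.0 + eps_high)
    # Per-token loss; how these are aggregated is exactly question 3 below.
    return -torch.min(ratio * advantages, clipped_ratio * advantages)
```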

So I would like to ask for more detailed clarification on the DAPO setup used in the paper:

  1. Was clip_high used in your implementation?
    If so, what value/range was used?

  2. Was any form of dynamic sampling applied during training? (A sketch of what I mean appears after this list.)

  3. What loss formulation was used?

    • Is it a token-level mean loss (token_mean) or a sequence-level mean? (See the aggregation sketch after this list.)
    • Were any modifications made relative to the standard PPO-style objective?
  4. Was length normalization or a length penalty applied?
    If yes, how was it incorporated into the reward or the loss? (A reward-shaping sketch of what I have in mind follows the list.)

  5. How do you interpret the entropy drop observed in your experiments?
    Is it expected behavior under your configuration, or is it controlled in some way?
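
To make question 2 concrete: by dynamic sampling I mean the filtering step described in the DAPO paper, where prompts whose rollout group is entirely correct or entirely wrong are dropped (their group-normalized advantage is zero everywhere) and the batch is refilled with fresh prompts. A minimal sketch of my understanding, with hypothetical names:

```python
def dynamic_sampling_filter(prompt_groups):
    """Keep only prompts whose rollout group has non-uniform rewards.

    prompt_groups: list of (prompt, rewards) pairs, where rewards holds
    the scalar reward of each rollout for that prompt. If every rollout
    in a group gets the same reward (all correct or all wrong), the
    group contributes no gradient signal, so it is filtered out and
    sampling continues until the batch is full of informative groups.
    """
    return [(prompt, rewards) for prompt, rewards in prompt_groups
            if max(rewards) != min(rewards)]
```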
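
For question 3, this is the distinction I have in mind between the two aggregations (a sketch over torch tensors of shape (batch, seq_len), not this repo's code):

```python
def token_mean_loss(per_token_loss, response_mask):
    # token_mean: every response token in the batch weighs equally, so
    # longer responses contribute proportionally more gradient signal.
    return (per_token_loss * response_mask).sum() / response_mask.sum()

def seq_mean_loss(per_token_loss, response_mask):
    # GRPO-style sequence mean: average within each sequence first, then
    # across sequences, so each sample weighs equally regardless of length.
    per_seq = (per_token_loss * response_mask).sum(-1) / response_mask.sum(-1)
    return per_seq.mean()
```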
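
And for question 4, the length handling I am aware of from the DAPO paper is Soft Overlong Punishment, a reward-side penalty rather than a loss-side normalization; a sketch using the paper's L_max / L_cache notation:

```python
def overlong_penalty(length, max_len, cache_len):
    """Soft overlong punishment as described in the DAPO paper.

    Zero penalty up to (max_len - cache_len); a linearly growing penalty
    inside the final cache_len-token buffer; responses that reach max_len
    get the full -1 penalty (in addition to being truncated).
    """
    if length <= max_len - cache_len:
        return 0.0
    if length < max_len:
        return (max_len - cache_len - length) / cache_len
    return -1.0
```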

Understanding these details would be very helpful for reproducing the results and better understanding the role of DAPO in your framework.

Thanks a lot for your time and help!
