oddqueue/e-dpo

🎚️ $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO)

KL Penalty Control via Perturbation for Direct Preference Optimization,
Sangkyu Lee1,*, Janghoon Han2, Hosung Song2, Stanley Jungkyu Choi2, Honglak Lee2,3, Youngjae Yu4
1Yonsei University, 2LG AI Research, 3University of Michigan, Ann Arbor, 4Seoul National University
*Work done during internship at LG AI Research.

This is the official repository of "KL Penalty Control via Perturbation for Direct Preference Optimization":

  • EpsilonDPOTrainer and EpsilonDPOConfig for $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO)
  • Example training script for Mistral-Instruct and Llama-3-Instruct

Installation

EpsilonDPOTrainer and EpsilonDPOConfig are implemented on top of the DPOTrainer and DPOConfig of trl==0.13.0, so they should work in any environment compatible with that version. To reproduce our environment, set up Python 3.10 and then install the requirements:

pip install -r requirements.txt
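After installation, it can be handy to confirm that the environment matches the versions named above. The following is a minimal sketch using only the standard library; the helper name `check_environment` and its messages are illustrative assumptions and are not part of this repository.

```python
import sys
from importlib.metadata import PackageNotFoundError, version


def check_environment(expected_trl="0.13.0", expected_python=(3, 10)):
    """Return a list of mismatch messages; an empty list means everything matches.

    Hypothetical helper: the expected versions are taken from this README.
    """
    problems = []

    # Compare only major.minor, e.g. (3, 10).
    if sys.version_info[:2] != tuple(expected_python):
        found = ".".join(str(part) for part in sys.version_info[:2])
        wanted = ".".join(str(part) for part in expected_python)
        problems.append(f"Python {found} found, {wanted} expected")

    # importlib.metadata reads the installed distribution's version string.
    try:
        installed = version("trl")
    except PackageNotFoundError:
        problems.append("trl is not installed")
    else:
        if installed != expected_trl:
            problems.append(f"trl {installed} found, {expected_trl} expected")

    return problems


for problem in check_environment():
    print("warning:", problem)
```

A mismatch does not necessarily mean training will fail; it only signals that the environment differs from the one described here.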

To use FlashAttention 2 with the included training script, also install flash-attn:

pip install flash-attn --no-build-isolation

Usage

EpsilonDPOTrainer shares its arguments with DPOTrainer, so it can be used in the same way:

from config import EpsilonDPOConfig
from trainer import EpsilonDPOTrainer

...

args = EpsilonDPOConfig(**args)
trainer = EpsilonDPOTrainer(model=model,
                            ref_model=ref_model,
                            args=args,
                            processing_class=processing_class,
                            train_dataset=train_dataset,
                            eval_dataset=eval_dataset)
trainer.train()

EpsilonDPOConfig accepts one argument in addition to those of DPOConfig:

  • epsilon (float, default 0.01): controls the step size of the KL penalty relaxation.
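As a minimal sketch of how `epsilon` might be set next to the usual DPO arguments, the snippet below collects keyword arguments in a plain dictionary that can then be unpacked into `EpsilonDPOConfig(**args)` as in the usage above. The helper name `build_config_kwargs`, the positivity check, the output path, and the example values are illustrative assumptions and are not part of this repository; `output_dir` and `beta` are standard `DPOConfig` fields.

```python
def build_config_kwargs(output_dir, beta=0.1, epsilon=0.01):
    """Collect keyword arguments for EpsilonDPOConfig (hypothetical helper)."""
    # Assumption: a non-positive step size is not meaningful, so reject it early.
    if epsilon <= 0:
        raise ValueError(f"epsilon must be positive, got {epsilon}")
    return {
        "output_dir": output_dir,  # inherited from TrainingArguments
        "beta": beta,              # KL penalty coefficient, as in DPOConfig
        "epsilon": epsilon,        # extra argument of EpsilonDPOConfig (default 0.01)
    }


# Placeholder output path and values, for illustration only.
args = build_config_kwargs("outputs/epsilon-dpo", beta=0.1, epsilon=0.01)

# With this repository on the import path, the dictionary would be unpacked as:
#   from config import EpsilonDPOConfig
#   args = EpsilonDPOConfig(**args)
```

Keeping the arguments in a dictionary first makes it easy to log or sweep `epsilon` before the config object is created.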

The included example training script can be run as follows:

# Mistral-Instruct
accelerate launch --config_file=configs/accelerate.yaml train.py --config=configs/mistral_instruct.yaml

# Llama-3-Instruct
accelerate launch --config_file=configs/accelerate.yaml train.py --config=configs/llama3_instruct.yaml

To enable FlashAttention 2, uncomment the attn_implementation: "flash_attention_2" line in configs/mistral_instruct.yaml and configs/llama3_instruct.yaml.
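If you build your own training script rather than editing the YAML files, the same choice can be made programmatically. The sketch below checks whether the `flash_attn` package is importable and picks an `attn_implementation` value accordingly. The helper name `pick_attn_implementation` and the `"sdpa"` fallback are assumptions made for illustration; the included configs may rely on a different default when the line is left commented out.

```python
from importlib.util import find_spec


def pick_attn_implementation(prefer_flash=True):
    """Return an attn_implementation string (hypothetical helper).

    Chooses "flash_attention_2" only when the flash_attn package can be found;
    otherwise falls back to "sdpa" (an assumed fallback, not this repo's default).
    """
    if prefer_flash and find_spec("flash_attn") is not None:
        return "flash_attention_2"
    return "sdpa"


attn_implementation = pick_attn_implementation()
print("attn_implementation:", attn_implementation)
```

The returned string can be passed as the `attn_implementation` argument when loading a model with `transformers`.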

Citation

@article{lee2025kl,
  title={KL Penalty Control via Perturbation for Direct Preference Optimization},
  author={Lee, Sangkyu and Han, Janghoon and Song, Hosung and Choi, Stanley Jungkyu and Lee, Honglak and Yu, Youngjae},
  journal={arXiv preprint arXiv:2502.13177},
  year={2025}
}

About

[NeurIPS 2025] The official implementation of "KL Penalty Control via Perturbation for Direct Preference Optimization"
