KL Penalty Control via Perturbation for Direct Preference Optimization
Sangkyu Lee1,*, Janghoon Han2, Hosung Song2, Stanley Jungkyu Choi2, Honglak Lee2,3, Youngjae Yu4
1Yonsei University, 2LG AI Research, 3University of Michigan, Ann Arbor, 4Seoul National University
*Work done during internship at LG AI Research.
This is the official repository of "KL Penalty Control via Perturbation for Direct Preference Optimization":
- EpsilonDPOTrainer and EpsilonDPOConfig for $\varepsilon$-Direct Preference Optimization ($\varepsilon$-DPO)
- Example training script for Mistral-Instruct and Llama-3-Instruct
EpsilonDPOTrainer and EpsilonDPOConfig are implemented on top of the DPOTrainer and DPOConfig of trl==0.13.0, so they should work in any environment compatible with that version. To reproduce our environment, set up Python 3.10 and then install the requirements:
pip install -r requirements.txt
To use FlashAttention 2 with the included training script, you also need to install flash-attn:
pip install flash-attn --no-build-isolation
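Before training, it can help to confirm that the installed packages match the versions named above. The following is a minimal sketch using only the Python standard library; the helper names `installed_version` and `environment_report` are illustrative and not part of this repository.

```python
import sys
from importlib import metadata, util

EXPECTED_TRL = "0.13.0"  # version the trainer is based on, per this README


def installed_version(package):
    """Return the installed version string of `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def environment_report():
    """Collect the environment facts mentioned in this README into a dict."""
    trl_version = installed_version("trl")
    return {
        "python_is_3_10": sys.version_info[:2] == (3, 10),
        "trl_version": trl_version,
        "trl_matches": trl_version == EXPECTED_TRL,
        # flash-attn is optional; it installs the importable module `flash_attn`
        "flash_attn_available": util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```

A mismatch reported here does not necessarily mean training will fail; it only indicates that the environment differs from the one described above.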
EpsilonDPOTrainer shares arguments with DPOTrainer; it is straightforward to use as follows:
from config import EpsilonDPOConfig
from trainer import EpsilonDPOTrainer
...
args = EpsilonDPOConfig(**args)
trainer = EpsilonDPOTrainer(
    model=model,
    ref_model=ref_model,
    args=args,
    processing_class=processing_class,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
)
trainer.train()
Here, EpsilonDPOConfig takes one argument in addition to those of DPOConfig:
epsilon: float=0.01; Parameter controlling the step size of KL penalty relaxation.
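To make the role of the KL penalty concrete, the sketch below computes the standard DPO loss for a single preference pair, where the coefficient `beta` scales the log-ratio margin between the chosen and rejected responses. This is the ordinary DPO objective, not the repository's EpsilonDPOTrainer. The helper `relaxed_beta` and its form `beta / (1 + epsilon)` are a hypothetical illustration of taking one relaxation step of size `epsilon`; the actual update rule used by the trainer may differ.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta):
    """Standard DPO loss, -log(sigmoid(beta * margin)), for one pair."""
    chosen_ratio = policy_chosen_logp - ref_chosen_logp
    rejected_ratio = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_ratio - rejected_ratio)
    # Numerically stable softplus(-logits), which equals -log(sigmoid(logits))
    if logits >= 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))


def relaxed_beta(beta, epsilon):
    """Hypothetical relaxation step: shrink beta by a factor of (1 + epsilon).

    Assumption for illustration only; not taken from the repository's code.
    """
    return beta / (1.0 + epsilon)


if __name__ == "__main__":
    beta, epsilon = 0.1, 0.01  # epsilon matches the default listed above
    logps = dict(policy_chosen_logp=-1.0, policy_rejected_logp=-3.0,
                 ref_chosen_logp=-2.0, ref_rejected_logp=-2.0)
    print("loss at beta:        ", dpo_loss(beta=beta, **logps))
    print("loss at relaxed beta:", dpo_loss(beta=relaxed_beta(beta, epsilon), **logps))
```

With a positive margin, a smaller `beta` yields a smaller logit and therefore a slightly larger loss for the same pair, which is one way to see that `beta` controls how strongly the policy is tied to the reference model.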
The included example training scripts can be run as follows:
# Mistral-Instruct
accelerate launch --config_file=configs/accelerate.yaml train.py --config=configs/mistral_instruct.yaml
# Llama-3-Instruct
accelerate launch --config_file=configs/accelerate.yaml train.py --config=configs/llama3_instruct.yaml
To enable FlashAttention 2, uncomment the attn_implementation: "flash_attention_2" line in configs/mistral_instruct.yaml and configs/llama3_instruct.yaml.
@article{lee2025kl,
title={KL Penalty Control via Perturbation for Direct Preference Optimization},
author={Lee, Sangkyu and Han, Janghoon and Song, Hosung and Choi, Stanley Jungkyu and Lee, Honglak and Yu, Youngjae},
journal={arXiv preprint arXiv:2502.13177},
year={2025}
}
