We propose to reweight preference pairs based on implicit reward margins and response length margins, unifying the two signals through a geometric mixture to generate synthetic weights for optimization. This allows preference pairs with stronger preference signals or more favorable length features to have a more pronounced impact on model parameters. Moreover, our method requires no additional annotators.
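To illustrate the idea, here is a minimal sketch of such a weighting scheme. This is not the exact formulation from the paper: the sigmoid squashing, the temperature `tau`, and the mixing coefficient `alpha` are all assumptions made for the sketch; only the overall shape (a geometric mixture of a reward-margin term and a length-margin term) follows the description above.

```python
import torch

def mixture_weights(reward_margin, length_margin, alpha=0.5, tau=1.0):
    """Hypothetical sketch of a geometric-mixture pair weighting.

    reward_margin: implicit reward margin per pair (chosen minus rejected)
    length_margin: response length margin per pair
    alpha:         assumed mixing coefficient between the two signals
    tau:           assumed temperature for squashing the margins
    """
    # Squash each margin into (0, 1) so the geometric mixture is well-defined
    r = torch.sigmoid(reward_margin / tau)
    l = torch.sigmoid(length_margin / tau)
    # Geometric mixture: pairs with stronger margins receive larger weights,
    # so they contribute more to the gradient of the preference loss
    return r.pow(alpha) * l.pow(1.0 - alpha)
```

In a DPO-style objective, these per-pair weights would multiply the per-pair losses before averaging, so that high-margin pairs dominate the update.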

We use trl as the training framework.
Our main experiments are conducted on 4 × Ascend-910b3 NPUs.
To set up the environment, please use pip to install the dependencies specified in requirements.txt.
We provide training config files for training Mistral-7B-Base models in the paper.
You can start the training process as follows:

bash train.sh

The weights of our method are available as follows:
Mistral-7B-Base: https://huggingface.co/AIR-hl/Mistral-7B-Base-MWPO