Skip to content

ADaM-BJTU/System-2-alignment

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

10 Commits
 
 
 
 
 
 
 
 

Repository files navigation

Don't Command, Cultivate: an Exploratory Study of System-2 Alignment

Yuhang Wang, Yuxiang Zhang Yanxu Zhu, Xinyan Wen, Jitao Sang*
Department of Computer Science
Beijing Jiaotong University

The o1 system card identifies the o1 models as the most robust within OpenAI, with their defining characteristic being the progression from rapid, intuitive thinking (System-1) to more deliberate, reasoned thought (System-2). This observation motivated us to investigate the influence of System-2 thinking patterns on model safety. In our preliminary research, we conducted safety evaluations of the o1 model, including complex jailbreak attack scenarios using adversarial natural language prompts and mathematical encoding prompts. Our findings indicate that the o1 model demonstrates relatively improved safety performance, though vulnerabilities remain, especially against attacks leveraging mathematical encoding. Through detailed analysis, we identified specific response patterns associated with these vulnerabilities. We further explored System-2 Alignment on open-source models using prompt engineering and supervised fine-tuning techniques. Experimental results suggest that methods encouraging models to carefully analyze user inputs improve safety. Additionally, we proposed an implementation framework for reinforcement learning with process supervision to enhance safety alignment. The implementation details and experimental results will be presented in future versions.

Experimental dataset

The dataset utilized in our study was derived through sampling from the wildjailbreak dataset (see Raw Data LINK for more details).

PPO Code

The reward model training and PPO training code can be found in O1-CODER and OpenRFT, respectively.

Acknowledgement

  • Hugging Face for their open-source transformer models.

Citation

@misc{Wang2024DontCC,
      title={Don't Command, Cultivate: An Exploratory Study of System-2 Alignment}, 
      author={Yuhang Wang and Jitao Sang},
      year={2024}
}

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages