Don't Command, Cultivate: an Exploratory Study of System-2 Alignment

Yuhang Wang, Yuxiang Zhang Yanxu Zhu, Xinyan Wen, Jitao Sang*

Department of Computer Science

Beijing Jiaotong University

The o1 system card identifies the o1 models as the most robust within OpenAI, with their defining characteristic being the progression from rapid, intuitive thinking (System-1) to more deliberate, reasoned thought (System-2). This observation motivated us to investigate the influence of System-2 thinking patterns on model safety. In our preliminary research, we conducted safety evaluations of the o1 model, including complex jailbreak attack scenarios using adversarial natural language prompts and mathematical encoding prompts. Our findings indicate that the o1 model demonstrates relatively improved safety performance, though vulnerabilities remain, especially against attacks leveraging mathematical encoding. Through detailed analysis, we identified specific response patterns associated with these vulnerabilities. We further explored System-2 Alignment on open-source models using prompt engineering and supervised fine-tuning techniques. Experimental results suggest that methods encouraging models to carefully analyze user inputs improve safety. Additionally, we proposed an implementation framework for reinforcement learning with process supervision to enhance safety alignment. The implementation details and experimental results will be presented in future versions.

Experimental dataset

The dataset utilized in our study was derived through sampling from the wildjailbreak dataset (see Raw Data LINK for more details).

PPO Code

The reward model training and PPO training code can be found in O1-CODER and OpenRFT, respectively.

Acknowledgement

Hugging Face for their open-source transformer models.

Citation

@misc{Wang2024DontCC,
      title={Don't Command, Cultivate: An Exploratory Study of System-2 Alignment}, 
      author={Yuhang Wang and Jitao Sang},
      year={2024}
}

Name		Name	Last commit message	Last commit date
Latest commit History 10 Commits
data/wildjailbreak		data/wildjailbreak
images		images
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Don't Command, Cultivate: an Exploratory Study of System-2 Alignment

Experimental dataset

PPO Code

Acknowledgement

Citation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

Don't Command, Cultivate: an Exploratory Study of System-2 Alignment

Experimental dataset

PPO Code

Acknowledgement

Citation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages