This repository is the official implementation for the paper, "ATGEN: ADVERSARIAL REINFORCEMENT LEARNING FOR TEST CASE GENERATION."
While Large Language Models (LLMs) excel at code generation, their outputs often contain subtle bugs. A critical bottleneck is the scarcity of high-quality test cases that can effectively identify these errors. Existing LLM-based test generation approaches, whether through prompting or supervised fine-tuning, predominantly rely on static datasets, which imposes a "fixed-difficulty ceiling". This limits the model's ability to discover novel or more complex bugs beyond its training scope. To address this, we introduce ATGEN, a novel framework that trains a test case generator via Reinforcement Learning in an adversarial loop. ATGEN places the test generator in a dynamic loop with an adversarial code generator, which crafts a curriculum of increasingly difficult bugs. The test generator is then optimized with RL to maximize "Output Accuracy" and "Attack Success". Crucially, the adversarial code generator continuously produces "hard" buggy code designed to evade the current test generator's policy. These harder-to-find bugs serve as a dynamic curriculum that pushes the test generator beyond its current capabilities, effectively breaking the fixed-difficulty ceiling inherent in static methods. Extensive experiments show that ATGEN significantly outperforms strong baselines.
The ATGEN framework is composed of two interconnected loops:
- RL-based Test Generator Training: The test generator (policy) receives a problem description Q and a buggy code Cadver as its state. It generates a test case Tgen (an I/O pair) as its action. It then receives a multi-component reward based on format, I/O accuracy, and attack success, updating its policy using the GRPO algorithm.
- Adversarial Code Generation: This loop dynamically creates challenging training data. A separate Code Generator samples a new, harder adversarial code Cadver that passes the current Tgen but fails against a ground-truth test suite Tgold. This new Cadver is fed back into the training loop, creating a dynamic curriculum.
We conducted experiments on subsets of the APPS and Codeforces datasets. Before starting, please prepare your dataset.
An example of the data preprocessing script can be found at examples/data_preprocess/data_test_gen.py. Please use this as a reference to prepare your own data.
The processed data should be in .parquet format and contain the following key fields:
- question: The problem description.
- buggy_code: The buggy code snippet to be tested.
- gold_code: The ground-truth correct code for reference.
Place the processed training and validation files in your designated data directory.
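As a minimal sketch of the expected record layout, the helper below validates raw records against the three required fields and assembles the DataFrame that would then be written to .parquet. The field names follow this README; the helper name and example data are illustrative assumptions, not part of the repository's preprocessing script.

```python
# Sketch of the expected training-record layout (field names from the README).
import pandas as pd

REQUIRED_FIELDS = ("question", "buggy_code", "gold_code")

def to_training_frame(records):
    """Validate raw records and build the DataFrame to be saved as parquet."""
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            raise ValueError(f"record {i} missing fields: {missing}")
    return pd.DataFrame(records, columns=list(REQUIRED_FIELDS))

records = [{
    "question": "Read an integer n and print n squared.",
    "buggy_code": "n = int(input())\nprint(n * 2)",   # bug: doubles instead of squares
    "gold_code": "n = int(input())\nprint(n * n)",
}]
df = to_training_frame(records)
# df.to_parquet("train.parquet")  # requires pyarrow or fastparquet
```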
We use the verl framework for reinforcement learning. All hyperparameters are defined in ppo_trainer.yaml and can be overridden via command-line arguments in the training scripts.
The core training script is verl.trainer.main_ppo.
As described in the paper, you can activate different training modes by setting the following parameters in your .sh script:
- ATGEN (w/o Adver):
  trainer.train_with_adver_code=False
- ATGEN (Unconditional):
  trainer.train_with_adver_code=True
  trainer.cusgrpo_strict=False
- ATGEN (Adaptive):
  trainer.train_with_adver_code=True
  trainer.cusgrpo_strict=True
Below are example commands for training on Qwen2.5-7B and 3B models. Please modify the paths and configurations in the scripts to match your setup.
Train ATGEN (Adaptive, 7B Model):
# Modify scripts/run_test_gen_7B.sh to match your setup.
# Key parameters for this mode:
# trainer.train_with_adver_code=True
# trainer.cusgrpo_strict=True
# trainer.attack_code_sample_max_retries=30
# trainer.start_adver_steps=50
bash scripts/run_test_gen_7B.sh
Train ATGEN (w/o Adver, 3B Model):
# Modify scripts/run_test_gen_3B.sh.
# Key parameter for this mode:
# trainer.train_with_adver_code=False
bash scripts/run_test_gen_3B.sh
After training, models are saved as checkpoints in the default_local_dir specified in ppo_trainer.yaml. If you trained with LoRA, you must merge the adapter weights with the base model before evaluation.
- Merge Model Checkpoints: Use the scripts/merge_lora_adapter.py script to merge the trained LoRA weights with the base model into a standalone, loadable model for inference. Update the model paths and output directory within the script.
- Run Evaluation: Use the TestGenEval/test_gen_eval.py script to evaluate the performance of your merged model.
python TestGenEval/test_gen_eval.py \
--model_name "Qwen2.5-7B-Instruct-rl-grpo-lora-False-verl-2048-sample-strict-30-09180900" \
--task_name "test_io_gen" \
--test_data_path "/path/to/your/test/data_list.json" \
--test_strategy "difficulty_split" \
--ray_parallel True \
  --ray_num_workers 50
This script evaluates metrics such as IO Accuracy and Attack Rate on easy, medium, and hard splits of the test set, providing a comprehensive assessment of the model's capabilities, as described in the paper.
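To make the two metrics concrete, here is a minimal sketch of how they could be computed from per-test results. The record format (one dict per generated test, with boolean `io_correct` and `buggy_fails` flags) and the choice to condition the attack rate on I/O-valid tests are assumptions for illustration, not the evaluation script's actual output format.

```python
# Sketch of the two evaluation metrics; record format is an assumption.

def io_accuracy(results):
    """Fraction of generated tests whose I/O pair agrees with the gold code."""
    return sum(r["io_correct"] for r in results) / len(results)

def attack_rate(results):
    """Fraction of I/O-valid tests that expose the bug (buggy code fails)."""
    valid = [r for r in results if r["io_correct"]]
    if not valid:
        return 0.0
    return sum(r["buggy_fails"] for r in valid) / len(valid)
```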
Training a Code Generator
As demonstrated in our paper, ATGEN can serve as a high-quality reward source for training code generation models via RL. Use the scripts/run_code_gen_7B.sh or scripts/run_code_gen_3B.sh scripts to launch this training process.
# In the script, ensure custom_reward_function.path points to a reward script
# that uses your trained ATGEN model to generate test cases and provide a reward signal.
bash scripts/run_code_gen_7B.sh
Using ATGEN as a Best-of-N Filter
The run.py script is the entry point for reproducing the Best-of-N filtering experiments from Section 5.4 of our paper. By configuring this script, you can use your trained ATGEN model to filter candidate solutions from a code generator, improving the final pass@1 metric.
python run.py \
--dataset apps \
--arch "your_code_generator_model" \
--test_gen_model_name "your_merged_atgen_model_name" \
--reward_mode "gen_test" \
--each_test_gen_num 10 \
... # Other relevant arguments
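The filtering step described above can be sketched as follows: generate N candidate solutions, run each against the ATGEN-generated tests, and keep the candidate that passes the most of them. The scoring rule (pass count, ties broken by candidate order) and the function signature are illustrative assumptions, not the logic of run.py itself.

```python
# Sketch of Best-of-N filtering with generated tests (assumed scoring rule).

def best_of_n(candidates, tests, run):
    """candidates: list of programs; tests: list of (input, expected) pairs;
    run(program, inp) -> program's output on inp.
    Returns the candidate passing the most generated tests."""
    def score(prog):
        return sum(run(prog, inp) == expected for inp, expected in tests)
    return max(candidates, key=score)
```

With callable "programs" as a stand-in, `best_of_n([double, square], [(2, 4), (3, 9)], lambda p, x: p(x))` would select the squaring candidate, since it passes both tests.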