This repository is the official implementation for the paper, "ATGEN: ADVERSARIAL REINFORCEMENT LEARNING FOR TEST CASE GENERATION."
While Large Language Models (LLMs) excel at code generation, their outputs often contain subtle bugs. A critical bottleneck is the scarcity of high-quality test cases that can effectively identify these errors. Existing LLM-based test generation approaches, whether through prompting or supervised fine-tuning, predominantly rely on static datasets, which imposes a "fixed-difficulty ceiling". This limits the model's ability to discover novel or more complex bugs beyond its training scope. To address this, we introduce ATGEN, a novel framework that trains a test case generator via Reinforcement Learning in an adversarial loop. ATGEN places the test generator in a dynamic loop with an adversarial code generator, which crafts a curriculum of increasingly difficult bugs. The test generator is then optimized with RL to maximize "Output Accuracy" and "Attack Success". Crucially, the adversarial code generator continuously produces "hard" buggy code designed to evade the current test generator's policy. These harder-to-find bugs serve as a dynamic curriculum that pushes the test generator beyond its current capabilities, effectively breaking the fixed-difficulty ceiling inherent in static methods. Extensive experiments show that ATGEN significantly outperforms strong baselines.
The ATGEN framework is composed of two interconnected loops:
- RL-based Test Generator Training: The test generator (policy) receives a problem description Q and a buggy code Cadver as its state. It generates a test case Tgen (an I/O pair) as its action. It then receives a multi-component reward based on format, I/O accuracy, and attack success, updating its policy using the GRPO algorithm.
- Adversarial Code Generation: This loop dynamically creates challenging training data. A separate Code Generator samples a new, harder adversarial code Cadver that passes the current Tgen but fails against a ground-truth test suite Tgold. This new Cadver is fed back into the training loop, creating a dynamic curriculum.
We conducted experiments on subsets of the APPS and Codeforces datasets. Before starting, please prepare your dataset.
An example of the data preprocessing script can be found at examples/data_preprocess/data_test_gen.py. Please use this as a reference to prepare your own data.
The processed data should be in .parquet format and contain the following key fields:
- question: The problem description.
- buggy_code: The buggy code snippet to be tested.
- gold_code: The ground-truth correct code for reference.
Place the processed training and validation files in your designated data directory.
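As a minimal sketch of the expected record layout, the helper below validates raw records against the three required fields and assembles the DataFrame that would then be written to .parquet. The field names follow this README; the helper name and example data are illustrative assumptions, not part of the repository's preprocessing script.

```python
# Sketch of the expected training-record layout (field names from the README).
import pandas as pd

REQUIRED_FIELDS = ("question", "buggy_code", "gold_code")

def to_training_frame(records):
    """Validate raw records and build the DataFrame to be saved as parquet."""
    for i, rec in enumerate(records):
        missing = [f for f in REQUIRED_FIELDS if f not in rec]
        if missing:
            raise ValueError(f"record {i} missing fields: {missing}")
    return pd.DataFrame(records, columns=list(REQUIRED_FIELDS))

records = [{
    "question": "Read an integer n and print n squared.",
    "buggy_code": "n = int(input())\nprint(n * 2)",   # bug: doubles instead of squares
    "gold_code": "n = int(input())\nprint(n * n)",
}]
df = to_training_frame(records)
# df.to_parquet("train.parquet")  # requires pyarrow or fastparquet
```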
We use the verl framework for reinforcement learning. All hyperparameters are defined in ppo_trainer.yaml and can be overridden via command-line arguments in the training scripts.
The core training script is verl.trainer.main_ppo.
As described in the paper, you can activate different training modes by setting the following parameters in your .sh script:
- ATGEN (w/o Adver):
  trainer.train_with_adver_code=False
- ATGEN (Unconditional):
  trainer.train_with_adver_code=True
  trainer.cusgrpo_strict=False
- ATGEN (Adaptive):
  trainer.train_with_adver_code=True
  trainer.cusgrpo_strict=True
Below are example commands for training on Qwen2.5-7B and 3B models. Please modify the paths and configurations in the scripts to match your setup.
Train ATGEN (Adaptive, 7B Model):
# Modify scripts/run_test_gen_7B.sh to match your setup.
# Key parameters for this mode:
# trainer.train_with_adver_code=True
# trainer.cusgrpo_strict=True
# trainer.attack_code_sample_max_retries=30
# trainer.start_adver_steps=50
bash scripts/run_test_gen_7B.sh
Train ATGEN (w/o Adver, 3B Model):
# Modify scripts/run_test_gen_3B.sh.
# Key parameter for this mode:
# trainer.train_with_adver_code=False
bash scripts/run_test_gen_3B.sh
After training, models are saved as checkpoints in the default_local_dir specified in ppo_trainer.yaml. If you trained with LoRA, you must merge the adapter weights with the base model before evaluation.
- Merge Model Checkpoints: Use the scripts/merge_lora_adapter.py script to merge the trained LoRA weights with the base model into a standalone, loadable model for inference. Update the model paths and output directory within the script.
- Run Evaluation: Use the TestGenEval/test_gen_eval.py script to evaluate the performance of your merged model.
python TestGenEval/test_gen_eval.py \
--model_name "Qwen2.5-7B-Instruct-rl-grpo-lora-False-verl-2048-sample-strict-30-09180900" \
--task_name "test_io_gen" \
--test_data_path "/path/to/your/test/data_list.json" \
--test_strategy "difficulty_split" \
--ray_parallel True \
  --ray_num_workers 50
This script evaluates metrics such as IO Accuracy and Attack Rate on easy, medium, and hard splits of the test set, providing a comprehensive assessment of the model's capabilities, as described in the paper.
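To make the two metrics concrete, here is a minimal sketch of how they could be computed from per-test results. The record format (one dict per generated test, with boolean `io_correct` and `buggy_fails` flags) and the choice to condition the attack rate on I/O-valid tests are assumptions for illustration, not the evaluation script's actual output format.

```python
# Sketch of the two evaluation metrics; record format is an assumption.

def io_accuracy(results):
    """Fraction of generated tests whose I/O pair agrees with the gold code."""
    return sum(r["io_correct"] for r in results) / len(results)

def attack_rate(results):
    """Fraction of I/O-valid tests that expose the bug (buggy code fails)."""
    valid = [r for r in results if r["io_correct"]]
    if not valid:
        return 0.0
    return sum(r["buggy_fails"] for r in valid) / len(valid)
```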
Training a Code Generator
As demonstrated in our paper, ATGEN can serve as a high-quality reward source for training code generation models via RL. Use the scripts/run_code_gen_7B.sh or scripts/run_code_gen_3B.sh scripts to launch this training process.
# In the script, ensure custom_reward_function.path points to a reward script
# that uses your trained ATGEN model to generate test cases and provide a reward signal.
bash scripts/run_code_gen_7B.sh
Using ATGEN as a Best-of-N Filter
The run.py script is the entry point for reproducing the Best-of-N filtering experiments from Section 5.4 of our paper. By configuring this script, you can use your trained ATGEN model to filter candidate solutions from a code generator, improving the final pass@1 metric.
python run.py \
--dataset apps \
--arch "your_code_generator_model" \
--test_gen_model_name "your_merged_atgen_model_name" \
--reward_mode "gen_test" \
--each_test_gen_num 10 \
... # Other relevant arguments
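The filtering step described above can be sketched as follows: generate N candidate solutions, run each against the ATGEN-generated tests, and keep the candidate that passes the most of them. The scoring rule (pass count, ties broken by candidate order) and the function signature are illustrative assumptions, not the logic of run.py itself.

```python
# Sketch of Best-of-N filtering with generated tests (assumed scoring rule).

def best_of_n(candidates, tests, run):
    """candidates: list of programs; tests: list of (input, expected) pairs;
    run(program, inp) -> program's output on inp.
    Returns the candidate passing the most generated tests."""
    def score(prog):
        return sum(run(prog, inp) == expected for inp, expected in tests)
    return max(candidates, key=score)
```

With callable "programs" as a stand-in, `best_of_n([double, square], [(2, 4), (3, 9)], lambda p, x: p(x))` would select the squaring candidate, since it passes both tests.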