
SIMONLQY/ATGen


ATGEN: Adversarial Reinforcement Learning for Test Case Generation

This repository is the official implementation of the paper "ATGEN: Adversarial Reinforcement Learning for Test Case Generation."

Abstract

While Large Language Models (LLMs) excel at code generation, their outputs often contain subtle bugs. A critical bottleneck is the scarcity of high-quality test cases that can effectively identify these errors. Existing LLM-based test generation approaches, whether through prompting or supervised fine-tuning, predominantly rely on static datasets, which imposes a "fixed-difficulty ceiling". This limits the model's ability to discover novel or more complex bugs beyond its training scope. To address this, we introduce ATGEN, a novel framework that trains a test case generator via Reinforcement Learning in an adversarial loop. ATGEN places the test generator in a dynamic loop with an adversarial code generator, which crafts a curriculum of increasingly difficult bugs. The test generator is then optimized with RL to maximize "Output Accuracy" and "Attack Success". Crucially, the adversarial code generator continuously produces "hard" buggy code designed to evade the current test generator's policy. These harder-to-find bugs serve as a dynamic curriculum that pushes the test generator beyond its current capabilities, effectively breaking the fixed-difficulty ceiling inherent in static methods. Extensive experiments show that ATGEN significantly outperforms strong baselines.

Framework Overview

The ATGEN framework is composed of two interconnected loops:

  1. RL-based Test Generator Training: The test generator (policy) receives a problem description Q and a buggy code C_adver as its state. It generates a test case T_gen (an I/O pair) as its action, receives a multi-component reward based on format, I/O accuracy, and attack success, and updates its policy with the GRPO algorithm.
  2. Adversarial Code Generation: This loop dynamically creates challenging training data. A separate code generator samples a new, harder adversarial code C_adver that passes the current T_gen but fails against a ground-truth test suite T_gold. This new C_adver is fed back into the training loop, creating a dynamic curriculum.
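The multi-component reward in step 1 can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the `run` executor, the equal component weights, and all names are assumptions.

```python
def atgen_reward(test_case, buggy_code, gold_code, run):
    """Toy sketch of ATGEN's multi-component reward.

    `run(code, test_input)` is an assumed sandboxed executor that runs
    `code` on `test_input` and returns its output.
    """
    inp, expected_out = test_case  # the generated I/O pair T_gen

    # Format reward: the generation must parse into a valid I/O pair.
    r_format = 1.0 if inp is not None and expected_out is not None else 0.0

    # Output accuracy: the expected output must match the gold code's output.
    r_accuracy = 1.0 if run(gold_code, inp) == expected_out else 0.0

    # Attack success: a *correct* test case on which the buggy code
    # produces a different output, i.e. the test actually exposes the bug.
    buggy_out = run(buggy_code, inp)
    r_attack = 1.0 if r_accuracy == 1.0 and buggy_out != expected_out else 0.0

    # Equal weights are an illustrative choice, not the paper's values.
    return (r_format + r_accuracy + r_attack) / 3.0
```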

Data Preparation

We conducted experiments on subsets of the APPS and Codeforces datasets. Before starting, please prepare your dataset.

An example of the data preprocessing script can be found at examples/data_preprocess/data_test_gen.py. Please use this as a reference to prepare your own data.

The processed data should be in .parquet format and contain the following key fields:

  • question: The problem description.
  • buggy_code: The buggy code snippet to be tested.
  • gold_code: The ground-truth correct code for reference.

Place the processed training and validation files in your designated data directory.
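Before writing the .parquet files (e.g. via pandas.DataFrame(records).to_parquet(...)), it can help to sanity-check that every record carries the three required fields. A minimal pure-Python check, assuming records are dicts; the helper name is illustrative and not part of the repository's preprocessing script:

```python
REQUIRED_FIELDS = ("question", "buggy_code", "gold_code")

def missing_field_indices(records):
    """Return indices of records missing (or with an empty value for)
    any of the fields ATGEN's training data requires."""
    return [i for i, rec in enumerate(records)
            if any(not rec.get(field) for field in REQUIRED_FIELDS)]
```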

How to Train ATGEN

We use the verl framework for reinforcement learning. All hyperparameters are defined in ppo_trainer.yaml and can be overridden via command-line arguments in the training scripts.

The core training script is verl.trainer.main_ppo.

Training Mode Configuration

As described in the paper, you can activate different training modes by setting the following parameters in your .sh script:

  • ATGEN (w/o Adver):

    • trainer.train_with_adver_code=False
  • ATGEN (Unconditional):

    • trainer.train_with_adver_code=True
    • trainer.cusgrpo_strict=False
  • ATGEN (Adaptive):

    • trainer.train_with_adver_code=True
    • trainer.cusgrpo_strict=True
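The three modes differ only in these two flags. As a quick reference, the mapping can be sketched as a small Python table that renders the `key=value` overrides passed to verl.trainer.main_ppo; the flag names come from this README, but the dict and helper are illustrative, not part of verl:

```python
# Mode names are illustrative labels; the flag names are from the README.
TRAIN_MODES = {
    "atgen_wo_adver":      {"trainer.train_with_adver_code": False},
    "atgen_unconditional": {"trainer.train_with_adver_code": True,
                            "trainer.cusgrpo_strict": False},
    "atgen_adaptive":      {"trainer.train_with_adver_code": True,
                            "trainer.cusgrpo_strict": True},
}

def to_overrides(mode):
    """Render a mode as key=value command-line overrides."""
    return [f"{key}={value}" for key, value in TRAIN_MODES[mode].items()]
```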

Training Commands

Below are example commands for training on Qwen2.5-7B and 3B models. Please modify the paths and configurations in the scripts to match your setup.

Train ATGEN (Adaptive, 7B Model):

# Modify scripts/run_test_gen_7B.sh to match your setup.
# Key parameters for this mode:
# trainer.train_with_adver_code=True
# trainer.cusgrpo_strict=True
# trainer.attack_code_sample_max_retries=30
# trainer.start_adver_steps=50
bash scripts/run_test_gen_7B.sh

Train ATGEN (w/o Adver, 3B Model):

# Modify scripts/run_test_gen_3B.sh.
# Key parameter for this mode:
# trainer.train_with_adver_code=False

bash scripts/run_test_gen_3B.sh

How to Evaluate Models

After training, checkpoints are saved in the default_local_dir specified in ppo_trainer.yaml. If you trained with LoRA, you must first merge the adapter weights into the base model before evaluation.

  1. Merge Model Checkpoints: Use the scripts/merge_lora_adapter.py script to merge the trained LoRA weights with the base model to create a standalone, loadable model for inference. Please update the model paths and output directory within the script.

  2. Run Evaluation: Use the TestGenEval/test_gen_eval.py script to evaluate the performance of your merged model.

python TestGenEval/test_gen_eval.py \
    --model_name "Qwen2.5-7B-Instruct-rl-grpo-lora-False-verl-2048-sample-strict-30-09180900" \
    --task_name "test_io_gen" \
    --test_data_path "/path/to/your/test/data_list.json" \
    --test_strategy "difficulty_split" \
    --ray_parallel True \
    --ray_num_workers 50

This script evaluates metrics such as IO Accuracy and Attack Rate on the easy, medium, and hard splits of the test set, providing a comprehensive assessment of the model's capabilities as described in the paper.
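The two headline metrics reduce to simple per-test-case rates. A minimal sketch of the computation, assuming each result is a dict of booleans; the field names are illustrative, not test_gen_eval.py's actual schema:

```python
def io_accuracy_and_attack_rate(results):
    """Compute IO Accuracy and Attack Rate over a split of the test set.

    Each result is assumed to carry two boolean fields:
      - "io_correct": the generated output matches the gold code's output
      - "kills_bug":  the generated test case makes the buggy code fail
    """
    n = len(results)
    io_accuracy = sum(r["io_correct"] for r in results) / n
    attack_rate = sum(r["kills_bug"] for r in results) / n
    return io_accuracy, attack_rate
```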

Downstream Applications

Training a Code Generator

As demonstrated in our paper, ATGEN can serve as a high-quality reward source for training code generation models via RL. You can use the scripts/run_code_gen_7B.sh or scripts/run_code_gen_3B.sh scripts to launch this training process.

# In the script, ensure custom_reward_function.path points to a reward script
# that uses your trained ATGEN model to generate test cases and provide a reward signal.
bash scripts/run_code_gen_7B.sh

Using ATGEN as a Best-of-N Filter

The run.py script is the entry point for reproducing the Best-of-N filtering experiments from Section 5.4 of our paper. By configuring this script, you can use your trained ATGEN model to filter candidate solutions from a code generator, improving the final pass@1 metric.

python run.py \
    --dataset apps \
    --arch "your_code_generator_model" \
    --test_gen_model_name "your_merged_atgen_model_name" \
    --reward_mode "gen_test" \
    --each_test_gen_num 10 \
    ... # Other relevant arguments
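The core of Best-of-N filtering can be sketched in a few lines: among N candidate solutions, keep the one that passes the most generated test cases. This is an illustrative reduction, not run.py's implementation; `run` and all names are assumptions:

```python
def best_of_n(candidates, tests, run):
    """Select the candidate solution passing the most generated tests.

    `candidates` are candidate programs, `tests` are (input, expected_output)
    pairs from the ATGEN test generator, and `run(code, inp)` is an assumed
    sandboxed executor returning the program's output.
    """
    def score(code):
        # Number of generated test cases this candidate passes.
        return sum(run(code, inp) == out for inp, out in tests)
    return max(candidates, key=score)
```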
