Toward Automated Validation of Language Model Synthesized Test Cases using Semantic Entropy

This repository contains the implementation of VALTEST, a framework for automatically validating test cases generated by Large Language Models (LLMs). VALTEST aims to improve the reliability of LLM-generated test cases by using token probabilities to predict their validity. The framework evaluates the validity, mutation score, and coverage metrics for test cases generated from three popular datasets (HumanEval, MBPP, and LeetCode) across three LLMs: GPT-4o, GPT-3.5-turbo, and Llama3.

License

This project is licensed under the Creative Commons Attribution 4.0 International (CC BY 4.0) License.

You are free to:

Share — Copy and redistribute the material in any medium or format.
Adapt — Remix, transform, and build upon the material for any purpose, even commercially.

Under the following terms:

Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
No additional restrictions — You may not apply legal terms or technological measures that legally restrict others from doing anything the license permits.

View Full License

Project Structure

main_train.py: The primary script for training machine learning models on extracted token probability features to predict test case validity. It handles feature extraction, model training, evaluation, and selection of valid test cases using various machine learning algorithms.
generate_testcases.py: A script to generate test cases using different LLMs. This script interacts with APIs such as OpenAI and Hugging Face to generate test cases and captures token probabilities.
curate_testcases.py: This script refines and validates the generated test cases using a chain-of-thought approach. It also validates assertions within test cases by interacting with LLMs and runs final evaluations on curated test cases.
requirements.txt: Contains all dependencies required to run the project, including libraries for machine learning, deep learning, token probability extraction, and mutation testing.

Key Components

Feature Extraction

The framework extracts statistical features from the token probabilities for both function inputs and expected outputs. These features include mean, max, min, sum, variance, and total token counts for both the top-predicted token and the second-predicted token.
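As a rough illustration of the statistics listed above, the following sketch summarizes a list of token log-probabilities. The function name token_features, the feature-key naming, the use of population variance, and the sample values are assumptions made for this example, not taken from the repository. It also assumes the LLM API returns log-probabilities, which are converted to probabilities first.

```python
import math
from statistics import mean, pvariance

FEATURE_NAMES = ("mean", "max", "min", "sum", "var", "count")


def token_features(logprobs, prefix):
    """Summarize token log-probabilities as statistical features.

    `prefix` labels the feature group, e.g. "output_top1" (hypothetical naming).
    """
    # Convert log-probabilities to plain probabilities in [0, 1].
    probs = [math.exp(lp) for lp in logprobs]
    if not probs:
        # No tokens: return zeros so every test case has the same feature keys.
        return {f"{prefix}_{name}": 0.0 for name in FEATURE_NAMES}
    return {
        f"{prefix}_mean": mean(probs),
        f"{prefix}_max": max(probs),
        f"{prefix}_min": min(probs),
        f"{prefix}_sum": sum(probs),
        f"{prefix}_var": pvariance(probs),  # population variance (assumption)
        f"{prefix}_count": float(len(probs)),
    }


# Hypothetical log-probabilities for the tokens of one expected output:
# top-predicted token and second-predicted token at each position.
top1 = [math.log(0.9), math.log(0.8), math.log(0.7)]
top2 = [math.log(0.05), math.log(0.1), math.log(0.2)]

features = {
    **token_features(top1, "output_top1"),
    **token_features(top2, "output_top2"),
}
```

The same function could be applied to the tokens of the function input to obtain the input feature set.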

Model Training and Evaluation

The machine learning models (e.g., logistic regression, random forest, ensemble models) are trained on the extracted features. The models predict the validity of each test case from its token probability features. K-fold cross-validation is used for training and testing across different LLM-generated test suites.
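To make the K-fold procedure concrete, here is a minimal, dependency-free sketch of how sample indices can be split into K train/test folds. The function name k_fold_indices and the shuffling seed are assumptions for illustration. The repository's own splitting logic may differ, for example by relying on a machine learning library.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) pairs for K-fold cross-validation."""
    indices = list(range(n_samples))
    # Shuffle once with a fixed seed so the split is reproducible.
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        # Every fold except the i-th one is used for training.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Example: 10 test cases split into 5 folds.
splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    # A model would be fit on train_idx and evaluated on test_idx here.
    pass
```

Each test case appears in exactly one test fold, so every prediction is made by a model that did not see that test case during training.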

Test Case Generation

Test cases are generated by prompting LLMs using function signatures and task descriptions from datasets such as HumanEval, MBPP, and LeetCode. Token probabilities are captured during this process to extract features for model training.

Test Case Curation

Invalid test cases are identified and corrected through a chain-of-thought reasoning process. This step ensures that test cases reflect the expected behavior of the functions, even if the LLM initially generated invalid assertions.

Running the Project

Setup

# Required only for the BigCodeBench evaluation
python3.10 -m venv .bigcode_venv
source .bigcode_venv/bin/activate
pip install -r requirements_bigcode.txt
deactivate

# Required to run the project
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Create a file named .env:

Get OpenAI key (for GPT models): https://platform.openai.com/

Get Fireworks key (for Qwen models): https://fireworks.ai/

openai_key=<your OpenAI key>
fireworks_key=<your Fireworks key>
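For readers who want to check that their .env file is well-formed, the sketch below parses simple key=value lines using only the standard library. The function read_env is hypothetical and is not part of this repository, which may load the file differently. The key values in the demo are placeholders.

```python
import tempfile
from pathlib import Path


def read_env(path=".env"):
    """Parse simple key=value lines, skipping blanks and # comments."""
    values = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


# Demo with a temporary file and placeholder keys.
with tempfile.TemporaryDirectory() as tmp:
    env_path = Path(tmp) / ".env"
    env_path.write_text("openai_key=sk-placeholder\nfireworks_key=fw-placeholder\n")
    keys = read_env(env_path)
```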

Optional: Re-Generating Test Cases

Create a .env file in the project root. For the OpenAI experiments, set openai_key in the .env file. For CodeQwen, set fireworks_key; the key is available at https://fireworks.ai/. For Llama3, we use Hugging Face; the model needs about 25 GB of VRAM. The generated tests are stored in the unfiltered_testcases folder in <dataset>-<llm_name>.pkl format. Use the existing generated tests, or remove them and generate your own.

To generate test cases from an LLM, use the generate_testcases.py script:

python generate_testcases.py --dataset HumanEval --llm gpt-4o

Supported datasets: MBPP, HumanEval, LeetCode, BigCodeBench, BigCodeBenchHard

Supported LLMs: gpt-4o, gpt-3.5-turbo, llama3, codeqwen

Training the Model

To train a model and predict the validity of test cases, use the main_train.py script:

python main_train.py --dataset HumanEval --llm gpt-4o --mutation 0 --threshold 0.8 --topN 5 --features all

This writes the results to the output folder in <dataset>-<llm_name>.txt format. The output pickle file is stored in the filtered_testcases folder in <dataset>-<llm_name>.pkl format.

Parameters for `main_train.py`

The main_train.py script accepts several parameters to customize the execution of the test case validation process. Below is a description of each parameter and its use:

`--dataset`

Description: Specifies the dataset to use for generating and evaluating test cases.
Choices: MBPP, HumanEval, LeetCode, BigCodeBench, BigCodeBenchHard
Required: Yes
Example:
```
--dataset HumanEval
```

`--llm`

Description: Specifies the Large Language Model (LLM) to use for generating test cases.
Choices: gpt-4o, gpt-3.5-turbo, llama3, codeqwen
Required: Yes
Example:
```
--llm gpt-4o
```

`--mutation`

Description: Enables mutation testing for the selected dataset and LLM. Mutation testing measures how well the test cases detect faults in the code.
Choices: 0 (disable), 1 (enable)
Default: 0 (disabled)
Example:
```
--mutation 1
```

`--threshold`

Description: Specifies the threshold for selecting valid test cases. The threshold defines the minimum probability score for considering a test case as valid.
Choices: 0.5, 0.65, 0.7, 0.8, 0.85, 0.9
Default: 0.8
Example:
```
--threshold 0.8
```

`--topN`

Description: Specifies the number of top test cases to select per function. If fewer than N test cases meet the threshold, the top N cases are selected based on their probability scores.
Choices: 1, 3, 5, 7
Default: 3
Example:
```
--topN 3
```
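The interaction between --threshold and --topN can be sketched as follows. This is one plausible reading of the descriptions above, not the code in main_train.py: test cases scoring at or above the threshold are kept, and if fewer than N qualify, the N highest-scoring cases are used instead. The function name select_tests and the sample scores are assumptions.

```python
def select_tests(scored_tests, threshold=0.8, top_n=3):
    """Select test cases for one function.

    `scored_tests` is a list of (test_id, predicted_validity_score) pairs.
    """
    # Rank from most to least likely valid.
    ranked = sorted(scored_tests, key=lambda t: t[1], reverse=True)
    above = [t for t in ranked if t[1] >= threshold]
    if len(above) >= top_n:
        return above
    # Too few cases pass the threshold: fall back to the top N by score.
    return ranked[:top_n]


# Hypothetical scores for four test cases of a single function.
scores = [("t1", 0.95), ("t2", 0.85), ("t3", 0.60), ("t4", 0.40)]

fallback = select_tests(scores, threshold=0.8, top_n=3)   # only 2 pass, so top 3
passing = select_tests(scores, threshold=0.8, top_n=2)    # 2 pass, keep those
```

Under this reading, every function keeps at least N test cases (or all of them, if it has fewer than N), so no function is left without tests after filtering.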

`--features`

Description: Specifies which feature sets to use for training the model. The feature sets can focus on the function input, the expected output, or both.
Choices: all, input, output
Default: all
Example:
```
--features all
```

To run the BigCodeBench experiments, create a Python virtual environment in the .bigcode_venv directory under the project root and install the libraries from requirements_bigcode.txt in that environment.

Curating Test Cases

To validate and curate the generated test cases:

python curate_testcases.py --dataset HumanEval --llm gpt-4o

python main_train.py --dataset HumanEval --llm gpt-4o --mutation 1

Results and Evaluation

The results are stored in a text file at output/{dataset}_{llm}_{approach}.txt.

Validity Rate (VR): The proportion of test cases that are valid after running on the source code.
Mutation Score (MS): The percentage of killed mutants during mutation testing, reflecting the fault-detection capability of the test cases.
Line Coverage (LC): Measures how much of the source code is executed by the test cases.
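The three metrics above reduce to simple ratios. The sketch below shows the arithmetic under the assumption that VR is reported as a proportion and that MS and LC are reported as percentages. The function names and the sample counts are illustrative only and are not results from this project.

```python
def validity_rate(n_valid, n_total):
    """Proportion of test cases that pass on the reference source code."""
    return n_valid / n_total if n_total else 0.0


def mutation_score(n_killed, n_mutants):
    """Percentage of mutants detected (killed) by the test cases."""
    return 100.0 * n_killed / n_mutants if n_mutants else 0.0


def line_coverage(n_executed, n_lines):
    """Percentage of source lines executed by the test cases."""
    return 100.0 * n_executed / n_lines if n_lines else 0.0


# Made-up counts, for illustration only.
vr = validity_rate(n_valid=45, n_total=50)
ms = mutation_score(n_killed=30, n_mutants=40)
lc = line_coverage(n_executed=18, n_lines=20)
```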
