Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

This repository contains the implementation and evaluation code for the paper "Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy".

Requirements

Note: We used Ubuntu 22.04 and Python 3.11.9 (as specified in environment.yml).

Setup Instructions:

Ensure conda is installed on your system.
Set up the conda environment using the provided environment.yml file:

$ conda env create --name vision_semantic_entropy --file=environment.yml

Activate the environment:

$ conda activate vision_semantic_entropy

Configure API credentials:
- Open CONFIG.py
- Update OPENAI_KEY and AZURE_ENDPOINT with your OpenAI/Azure credentials

The environment setup is now complete.

Quick Evaluation

To reproduce our results using cached data, run:

$ python3 eval.py GPT4o > results_GPT4o.txt
$ python3 eval.py GPT4.1 > results_GPT4.1.txt

These commands process the cached VLM answers (stored in outputs) and the cached entailment data (stored in cache) to regenerate the findings reported in the paper.

Expected outputs are provided in results_GPT4o.txt and results_GPT4.1.txt.

Used Datasets

The evaluation uses two publicly available, de-identified datasets:

RadDataset [1]: 206 clinical images (60 CT, 60 MR, 60 radiographs, 26 angiograms) with diagnostic context.
- From the paper "Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation" (Electronic Supplementary Material)
- Licensed under Creative Commons 4.0 (see [1] for details) and therefore included in this repository
VQA-Med 2019 [2]: 500 radiological images with clinical questions across four categories (modality, plane, organ, abnormality). Only the test data is used.
- You can download it here
- Copy the VQAMed2019_Test_Questions_w_Ref_Answers.txt file into the questions folder
- Unpack VQAMed2019_Test_Images.zip into the questions folder as well, so that it contains a folder named VQAMed2019_Test_Images with the .jpg images
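
Before running the full pipeline, it can help to confirm that the VQA-Med 2019 files ended up where the instructions above put them. The following is a minimal sketch, not part of the repository: the function name check_vqa_med_layout and its parameters are hypothetical, and the expected image count of 500 is taken from the dataset description above.

```python
from pathlib import Path


def check_vqa_med_layout(questions_dir: str = "questions",
                         expected_images: int = 500) -> list[str]:
    """Return a list of layout problems; an empty list means everything was found.

    Hypothetical helper: it only checks the file and folder names mentioned
    in this README, not the contents of the question file.
    """
    root = Path(questions_dir)
    qa_file = root / "VQAMed2019_Test_Questions_w_Ref_Answers.txt"
    image_dir = root / "VQAMed2019_Test_Images"
    problems = []

    if not qa_file.is_file():
        problems.append(f"missing file: {qa_file}")

    if not image_dir.is_dir():
        problems.append(f"missing folder: {image_dir}")
    else:
        # Count .jpg files directly inside the image folder.
        n_images = sum(1 for p in image_dir.iterdir()
                       if p.suffix.lower() == ".jpg")
        if n_images != expected_images:
            problems.append(
                f"expected {expected_images} .jpg images, found {n_images}")

    return problems


if __name__ == "__main__":
    issues = check_vqa_med_layout()
    print("Layout looks fine." if not issues else "\n".join(issues))
```

Run it from the repository root; any line it prints points at a file or folder that still needs to be copied or unpacked.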

Reproducing the Complete Results

Note: Due to the non-deterministic nature of LLMs, new prompting may yield slightly different results from those reported in the paper.

Prerequisites

Valid OpenAI API credentials with access to GPT-4o and GPT-4.1
The VQA-Med 2019 dataset, downloaded as described in the Used Datasets section (the RadDataset is already part of this repository)

Steps

Configure credentials:
- Open CONFIG.py
- Update OPENAI_KEY and AZURE_ENDPOINT with your credentials
Prepare the environment:
- Empty the cache and outputs folders to remove the data from our evaluation
Generate new VLM responses:
```
$ python3 generateAnswers.py
```
This generates 15 responses per question for entropy calculation plus 1 low-temperature response for accuracy assessment. Results are saved in the outputs directory with filenames prefixed with EVAL.
Run the evaluation:
```
$ python3 eval.py
```
This step performs semantic clustering, entailment checking, entropy calculation, and statistical analysis. Entailment results are cached in the cache folder for efficiency.
Answer validation: The code automatically grades each answer by comparing it with the reference answer given in the dataset. To ensure that this grading is accurate, we manually verified every judgment returned by the isAnswerCorrect function. The raw outputs are saved in cache/raw_entailmentCacheFile_GPT4o.csv and cache/raw_entailmentCacheFile_GPT4_1.csv. After manually verifying these files, rerun:
```
$ python3 eval.py GPT4o > results_GPT4o.txt
$ python3 eval.py GPT4.1 > results_GPT4.1.txt
```
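
As a rough illustration of the response-generation step above (15 sampled responses plus 1 low-temperature response per question), the sampling scheme might be sketched as follows. This is not the code in generateAnswers.py: sample_answers and the ask_vlm callable are hypothetical names, and because this README does not state the low temperature value, it is left as a required argument rather than given a default.

```python
from typing import Callable

# Hypothetical signature: (question, image_path, temperature) -> answer text.
AskVLM = Callable[[str, str, float], str]


def sample_answers(ask_vlm: AskVLM, question: str, image_path: str, *,
                   low_temperature: float,
                   n_samples: int = 15,
                   high_temperature: float = 1.0) -> dict:
    """Collect the answers needed for one question.

    n_samples high-temperature answers feed the entropy calculation;
    one low-temperature answer is the one graded for accuracy.
    """
    sampled = [ask_vlm(question, image_path, high_temperature)
               for _ in range(n_samples)]
    reference = ask_vlm(question, image_path, low_temperature)
    return {"question": question, "sampled": sampled, "reference": reference}


if __name__ == "__main__":
    # Stand-in for a real model call, used only to show the data flow.
    def fake_vlm(question: str, image_path: str, temperature: float) -> str:
        return f"answer at T={temperature}"

    # The value 0.1 is a placeholder, not the setting used in the paper.
    result = sample_answers(fake_vlm, "Which modality is shown?",
                            "example.jpg", low_temperature=0.1)
    print(len(result["sampled"]), result["reference"])
```

Passing the model call in as a function keeps the sampling logic independent of the API client, so the same loop can be exercised with a stub before any credentials are used.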

Methodology

The approach consists of four steps:

Multi-sampling: Generate 15 responses per question using high-temperature sampling (T=1.0) (code in generateAnswers.py)
Semantic Clustering: Group semantically equivalent responses using bidirectional entailment checks (code in EntailmentCheck.py)
Entropy Calculation: Compute discrete semantic entropy from cluster frequency distributions (code in clusterAnswers.py)
Selective Prediction: Filter out high-entropy questions to improve accuracy on remaining questions (code in clusterAnswers.py)
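
The last three steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation in EntailmentCheck.py or clusterAnswers.py: the function names are hypothetical, the entails callable stands in for the real entailment check, the clustering is a simple greedy variant that compares each answer with the first member of each cluster, and the entropy uses the natural logarithm (the README does not specify the base or any normalization).

```python
import math
from typing import Callable, Optional, Sequence


def cluster_by_entailment(answers: Sequence[str],
                          entails: Callable[[str, str], bool]) -> list[list[str]]:
    """Greedy clustering: an answer joins the first cluster whose first
    member it entails and is entailed by; otherwise it starts a new cluster."""
    clusters: list[list[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def discrete_semantic_entropy(clusters: Sequence[Sequence[str]]) -> float:
    """Shannon entropy of the cluster frequency distribution (natural log)."""
    total = sum(len(c) for c in clusters)
    probabilities = [len(c) / total for c in clusters]
    return sum(-p * math.log(p) for p in probabilities)


def accuracy_after_filtering(entropies: Sequence[float],
                             correct: Sequence[bool],
                             threshold: float) -> tuple[Optional[float], int]:
    """Keep questions with entropy <= threshold; return (accuracy, number kept).

    Accuracy is None when no question passes the filter.
    """
    kept = [c for h, c in zip(entropies, correct) if h <= threshold]
    accuracy = sum(kept) / len(kept) if kept else None
    return accuracy, len(kept)


if __name__ == "__main__":
    # Toy entailment check: case-insensitive string equality.
    def toy_entails(a: str, b: str) -> bool:
        return a.strip().lower() == b.strip().lower()

    answers = ["Pneumothorax", "pneumothorax", "Pleural effusion"]
    clusters = cluster_by_entailment(answers, toy_entails)
    print(clusters, discrete_semantic_entropy(clusters))
```

When all sampled answers fall into one cluster the entropy is zero, and it grows as the answers spread over more clusters; the threshold in accuracy_after_filtering then trades the number of questions answered against accuracy on the questions that remain.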

File Structure

├── README.md                    # This file
├── environment.yml              # Conda environment specification
├── CONFIG.py                    # Configuration and API credentials
├── eval.py                      # Main evaluation script
├── generateAnswers.py           # VLM response generation
├── promptLLM.py                 # LLM prompting interface
├── EntailmentCheck.py           # Semantic entailment
├── clusterAnswers.py            # Entropy calculation and analysis
├── RadDataset.py                # Dataset loading utilities (for both datasets)
├── utilFunctions.py             # General utility functions
├── outputs/                     # Generated VLM responses
├── cache/                       # Cached entailment results
├── questions/                   # Dataset files
├── results_GPT4o.txt            # Reference results
└── results_GPT4.1.txt           # Reference results

Citation

If you find this code helpful for your research, please cite our paper:

Wienholt, P., Caselitz, S., Siepmann, R. et al. Hallucination filtering in radiology vision-language models using discrete semantic entropy. Eur Radiol (2026). https://doi.org/10.1007/s00330-026-12384-z

References

[1] Huppertz, M.S., Siepmann, R., Topp, D. et al. Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation. Eur Radiol 35, 1111–1121 (2025). https://doi.org/10.1007/s00330-024-11115-6

[2] Ben Abacha, A., Hasan, S.A., Datla, V.V., Demner-Fushman, D., Müller, H. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In: CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes, 9–12 September 2019.
