This repository contains the implementation and evaluation code for the paper "Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy".
Note: we used Ubuntu 22.04 and Python 3.11.9 (specified in environment.yml)
Setup Instructions:
- Ensure conda is installed on your system.
- Set up the conda environment using the provided environment.yml file:
$ conda env create --name vision_semantic_entropy --file=environment.yml
- Activate the environment:
$ conda activate vision_semantic_entropy
- Configure API credentials:
  - Open CONFIG.py
  - Update OPENAI_KEY and AZURE_ENDPOINT with your OpenAI/Azure credentials
The environment setup is now complete.
To reproduce our results using cached data, run:
$ python3 eval.py GPT4o > results_GPT4o.txt
$ python3 eval.py GPT4.1 > results_GPT4.1.txt
These commands process the cached LLM answers (stored in outputs) and the cached entailment data (stored in cache) to regenerate our findings.
Expected outputs are provided in results_GPT4o.txt and results_GPT4.1.txt.
The evaluation uses two publicly available, de-identified datasets:
- RadDataset [1]: 206 clinical images (60 CT, 60 MR, 60 radiographs, 26 angiograms) with diagnostic context.
  - From the paper "Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation" (Electronic Supplementary Material)
  - Creative Commons 4.0 License (see [1] for details), therefore contained in this repository
- VQA-Med 2019 [2]: 500 radiological images with clinical questions across four categories (modality, plane, organ, abnormality) (we only use the test data)
  - You can download it here
  - Copy the VQAMed2019_Test_Questions_w_Ref_Answers.txt file into the questions folder
  - Unpack the VQAMed2019_Test_Images.zip in the questions folder as well, so that there is a folder called VQAMed2019_Test_Images containing the .jpg images
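Before generating new answers, it can help to confirm that the VQA-Med 2019 files ended up where the steps above place them. The following is a minimal sketch, not part of the repository: the function name check_vqa_med_layout is hypothetical, and only the file and folder names are taken from the instructions above.

```python
from pathlib import Path

# File and folder names as given in the dataset instructions above.
QUESTIONS_FILE = "VQAMed2019_Test_Questions_w_Ref_Answers.txt"
IMAGES_DIR = "VQAMed2019_Test_Images"


def check_vqa_med_layout(questions_dir):
    """Hypothetical helper: return a list of problems found in the layout.

    An empty list means the questions file and at least one .jpg image
    were found in the expected places.
    """
    root = Path(questions_dir)
    problems = []
    if not (root / QUESTIONS_FILE).is_file():
        problems.append(f"missing file: {root / QUESTIONS_FILE}")
    images = root / IMAGES_DIR
    if not images.is_dir():
        problems.append(f"missing folder: {images}")
    elif not any(images.glob("*.jpg")):
        problems.append(f"no .jpg images found in: {images}")
    return problems


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    for problem in check_vqa_med_layout("questions"):
        print(problem)
```

If the script prints nothing, the layout matches the description above; it does not check the number of images or the file contents.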
Note: Due to the non-deterministic nature of LLMs, new prompting may yield slightly different results from those reported in the paper.
- Valid OpenAI API credentials with access to GPT-4o and GPT-4.1
- Download the VQA-Med 2019 dataset (see section Used Datasets for details) (the RadDataset is already part of this repository)
- Configure credentials:
  - Open CONFIG.py
  - Update OPENAI_KEY and AZURE_ENDPOINT with your credentials
- Prepare the environment:
  - Clear cache and outputs folders to delete the data from our evaluation
- Generate new VLM responses:
  $ python3 generateAnswers.py
  This generates 15 responses per question for entropy calculation plus 1 low-temperature response for accuracy assessment. Results are saved in the outputs directory with filenames prefixed with EVAL.
- Run the evaluation:
  $ python3 eval.py
  This step performs semantic clustering, entailment checking, entropy calculation, and statistical analysis. Entailment results are cached in the cache folder for efficiency.
- Answer Validation: Our code contains automatic evaluation to determine if answers are correct by comparing them with the given dataset answers. To ensure that the grading of answers is accurate, we have manually verified every answer generated by the isAnswerCorrect function. The raw outputs are saved in cache/raw_entailmentCacheFile_GPT4o.csv and cache/raw_entailmentCacheFile_GPT4_1.csv. After manual verification, rerun:
  $ python3 eval.py GPT4o > results_GPT4o.txt
  $ python3 eval.py GPT4.1 > results_GPT4.1.txt
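The "Prepare the environment" step above asks you to clear the cache and outputs folders. One possible way to do this is sketched below; the helper clear_folder is hypothetical and not part of the repository. It defaults to a dry run that only lists what would be deleted, since removing these folders' contents discards the cached data shipped with the repository.

```python
import shutil
from pathlib import Path


def clear_folder(folder, dry_run=True):
    """Hypothetical helper: remove everything inside `folder`, keep the folder.

    Returns the paths that were removed (or, with dry_run=True, the paths
    that would be removed). A missing folder yields an empty list.
    """
    root = Path(folder)
    if not root.is_dir():
        return []
    targets = sorted(root.iterdir())
    if not dry_run:
        for path in targets:
            if path.is_dir() and not path.is_symlink():
                shutil.rmtree(path)
            else:
                path.unlink()
    return targets


if __name__ == "__main__":
    # Assumes the script is run from the repository root.
    # Switch dry_run to False only once the listed paths look right.
    for name in ("cache", "outputs"):
        for path in clear_folder(name, dry_run=True):
            print("would remove", path)
```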
The approach:
- Multi-sampling: Generate 15 responses per question using high-temperature sampling (T=1.0) (code in generateAnswers.py)
- Semantic Clustering: Group semantically equivalent responses using bidirectional entailment checks (code in EntailmentCheck.py)
- Entropy Calculation: Compute discrete semantic entropy from cluster frequency distributions (code in clusterAnswers.py)
- Selective Prediction: Filter out high-entropy questions to improve accuracy on remaining questions (code in clusterAnswers.py)
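The last three steps above can be illustrated with a small, self-contained sketch. It is not the code from EntailmentCheck.py or clusterAnswers.py: all function names are hypothetical, the entails callable stands in for the LLM-based entailment check, and the greedy clustering strategy, the natural logarithm, and the threshold value are assumptions made for illustration.

```python
import math


def cluster_by_entailment(answers, entails):
    """Greedy clustering with bidirectional entailment.

    An answer joins the first cluster whose first member it entails and is
    entailed by; otherwise it starts a new cluster. `entails(a, b)` is a
    caller-supplied function returning True if a entails b.
    """
    clusters = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])
    return clusters


def discrete_semantic_entropy(clusters):
    """Shannon entropy (natural log, an assumption) of cluster frequencies."""
    total = sum(len(cluster) for cluster in clusters)
    probabilities = [len(cluster) / total for cluster in clusters]
    return sum(-p * math.log(p) for p in probabilities)


def keep_low_entropy(entropies, threshold):
    """Return the question ids whose entropy is at or below `threshold`."""
    return [qid for qid, entropy in entropies.items() if entropy <= threshold]


if __name__ == "__main__":
    # Toy stand-in for the entailment check: case-insensitive equality.
    def toy_entails(a, b):
        return a.strip().lower() == b.strip().lower()

    sampled = {
        "q1": ["CT", "ct", "CT", "CT"],      # consistent answers
        "q2": ["CT", "MRI", "X-ray", "ct"],  # inconsistent answers
    }
    entropies = {
        qid: discrete_semantic_entropy(cluster_by_entailment(answers, toy_entails))
        for qid, answers in sampled.items()
    }
    # The threshold 0.5 is an arbitrary value chosen for this example.
    print(keep_low_entropy(entropies, threshold=0.5))
```

In this toy run, the consistent question forms a single cluster with entropy 0 and is kept, while the question with three distinct answer clusters has a higher entropy and is filtered out. In the repository, 15 sampled responses per question are used instead of the four shown here.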
├── README.md # This file
├── environment.yml # Conda environment specification
├── CONFIG.py # Configuration and API credentials
├── eval.py # Main evaluation script
├── generateAnswers.py # VLM response generation
├── promptLLM.py # LLM prompting interface
├── EntailmentCheck.py # Semantic entailment
├── clusterAnswers.py # Entropy calculation and analysis
├── RadDataset.py # Dataset loading utilities (for both datasets)
├── utilFunctions.py # General utility functions
├── outputs/ # Generated VLM responses
├── cache/ # Cached entailment results
├── questions/ # Dataset files
├── results_GPT4o.txt # Reference results
└── results_GPT4.1.txt # Reference results
If you find this code helpful for your research, please cite our paper:
Wienholt, P., Caselitz, S., Siepmann, R. et al. Hallucination filtering in radiology vision-language models using discrete semantic entropy. Eur Radiol (2026). https://doi.org/10.1007/s00330-026-12384-z
[1] Huppertz, M.S., Siepmann, R., Topp, D. et al. Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation. Eur Radiol 35, 1111–1121 (2025). https://doi.org/10.1007/s00330-024-11115-6
[2] Ben Abacha, A., Hasan, S. A., Datla, V. V., Demner-Fushman, D., & Müller, H. (2019). VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In Proceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019.