TruhnLab/VisionSemanticEntropy

Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy

This repository contains the implementation and evaluation code for the paper "Hallucination Filtering in Radiology Vision-Language Models Using Discrete Semantic Entropy".

Requirements

Note: we used Ubuntu 22.04 and Python 3.11.9 (specified in environment.yml)

Setup Instructions:

  1. Ensure conda is installed on your system.

  2. Set up the conda environment using the provided environment.yml file:

    $ conda env create --name vision_semantic_entropy --file=environment.yml

  3. Activate the environment:

    $ conda activate vision_semantic_entropy

  4. Configure API credentials:
    • Open CONFIG.py
    • Update OPENAI_KEY and AZURE_ENDPOINT with your OpenAI/Azure credentials

The environment setup is now complete.
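
If you would rather not keep secrets in a tracked file, one option is to let CONFIG.py read them from the environment. The sketch below is an assumption about how that could look, not the repository's actual CONFIG.py: only the names OPENAI_KEY and AZURE_ENDPOINT come from this README, while the environment variable names and the helper missing_credentials are hypothetical.

```python
import os

# Hypothetical: read both settings from environment variables of the same
# name. The shipped CONFIG.py may simply assign literal strings instead.
OPENAI_KEY = os.environ.get("OPENAI_KEY", "")
AZURE_ENDPOINT = os.environ.get("AZURE_ENDPOINT", "")


def missing_credentials(settings: dict) -> list:
    """Return the names of settings whose value is empty or whitespace."""
    return [name for name, value in settings.items() if not value.strip()]


if __name__ == "__main__":
    missing = missing_credentials(
        {"OPENAI_KEY": OPENAI_KEY, "AZURE_ENDPOINT": AZURE_ENDPOINT}
    )
    if missing:
        print("Missing credentials:", ", ".join(missing))
    else:
        print("Both credentials are set.")
```

Running such a check before generateAnswers.py would surface an empty key early instead of partway through a long run.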

Quick Evaluation

To reproduce our results using cached data, run:

$ python3 eval.py GPT4o > results_GPT4o.txt
$ python3 eval.py GPT4.1 > results_GPT4.1.txt

These commands process the cached LLM answers (stored in outputs) and the cached entailment data (stored in cache) to regenerate our findings.

Expected outputs are provided in the reference result files listed under File Structure (results_GPT4o.txt and results_GPT4.1.txt).

Used Datasets

The evaluation uses two publicly available, de-identified datasets:

  • RadDataset [1]: 206 clinical images (60 CT, 60 MR, 60 radiographs, 26 angiograms) with diagnostic context.

    • From the paper "Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation" (Electronic Supplementary Material)
    • Licensed under Creative Commons 4.0 (see [1] for details) and therefore included in this repository
  • VQA-Med 2019 [2]: 500 radiological images with clinical questions across four categories (modality, plane, organ, abnormality); only the test data is used

Reproducing the Complete Results

Note: Due to the non-deterministic nature of LLMs, new prompting may yield slightly different results from those reported in the paper.

Prerequisites

  1. Valid OpenAI API credentials with access to GPT-4o and GPT-4.1
  2. A local copy of the VQA-Med 2019 dataset, which must be downloaded separately (see Used Datasets); the RadDataset is already part of this repository

Steps

  1. Configure credentials:

    • Open CONFIG.py
    • Update OPENAI_KEY and AZURE_ENDPOINT with your credentials
  2. Prepare the environment:

    • Clear cache and outputs folders to delete the data from our evaluation
  3. Generate new VLM responses:

    $ python3 generateAnswers.py
    

    This generates 15 responses per question for entropy calculation plus 1 low-temperature response for accuracy assessment. Results are saved in the outputs directory with filenames prefixed with EVAL.

  4. Run the evaluation:

    $ python3 eval.py
    

    This step performs semantic clustering, entailment checking, entropy calculation, and statistical analysis. Entailment results are cached in the cache folder for efficiency.

  5. Validate the answers: The code automatically grades each answer by comparing it with the reference answer given in the dataset. To ensure that this grading is accurate, we manually verified every answer generated by the isAnswerCorrect function. The raw outputs are saved in cache/raw_entailmentCacheFile_GPT4o.csv and cache/raw_entailmentCacheFile_GPT4_1.csv. After manual verification, rerun:

    $ python3 eval.py GPT4o > results_GPT4o.txt
    $ python3 eval.py GPT4.1 > results_GPT4.1.txt
    
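
Step 2 above (clearing cache and outputs) can be scripted. The following is a minimal sketch, assuming both folders sit in the current working directory as shown under File Structure and that everything inside them, including hidden files, may be deleted. The helper name clear_directory is hypothetical and not part of the repository.

```python
import shutil
from pathlib import Path


def clear_directory(folder: Path) -> int:
    """Delete everything inside `folder` but keep the folder itself.

    Returns the number of top-level entries removed (0 if the folder
    does not exist).
    """
    removed = 0
    if not folder.is_dir():
        return removed
    for entry in folder.iterdir():
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)  # remove a subfolder and its contents
        else:
            entry.unlink()  # remove a regular file or a symlink
        removed += 1
    return removed


if __name__ == "__main__":
    # Folder names taken from this README; run from the repository root.
    for name in ("cache", "outputs"):
        count = clear_directory(Path(name))
        print(f"{name}: removed {count} entries")
```

Because this deletes the cached data shipped with the repository, consider copying both folders elsewhere first if you still want to run the quick evaluation later.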

Methodology

The approach:

  1. Multi-sampling: Generate 15 responses per question using high-temperature sampling (T=1.0) (code in generateAnswers.py)
  2. Semantic Clustering: Group semantically equivalent responses using bidirectional entailment checks (code in EntailmentCheck.py)
  3. Entropy Calculation: Compute discrete semantic entropy from cluster frequency distributions (code in clusterAnswers.py)
  4. Selective Prediction: Filter out high-entropy questions to improve accuracy on remaining questions (code in clusterAnswers.py)
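
The four steps above can be sketched end to end in a few lines. This is a minimal illustration under stated assumptions, not the code in EntailmentCheck.py or clusterAnswers.py: all function names are hypothetical, toy_entails is a string-matching stand-in for a real entailment check, comparing each answer against the first member of every existing cluster is just one common way to do the grouping, and the natural logarithm and the threshold value are arbitrary choices.

```python
import math
from typing import Callable, List, Optional, Sequence


def cluster_by_entailment(
    answers: Sequence[str],
    entails: Callable[[str, str], bool],
) -> List[List[str]]:
    """Greedy clustering: an answer joins the first cluster whose first
    member it entails AND is entailed by (bidirectional check)."""
    clusters: List[List[str]] = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            clusters.append([answer])  # no match: start a new cluster
    return clusters


def discrete_semantic_entropy(clusters: Sequence[Sequence[str]]) -> float:
    """Shannon entropy H = -sum(p_i * ln p_i) over cluster frequencies,
    where p_i = (size of cluster i) / (number of sampled answers)."""
    total = sum(len(cluster) for cluster in clusters)
    entropy = 0.0
    for cluster in clusters:
        p = len(cluster) / total
        entropy -= p * math.log(p)
    return entropy


def selective_accuracy(
    entropies: Sequence[float],
    correct: Sequence[bool],
    threshold: float,
) -> Optional[float]:
    """Accuracy over questions whose entropy is at most `threshold`.
    Returns None when every question is filtered out."""
    kept = [ok for h, ok in zip(entropies, correct) if h <= threshold]
    return sum(kept) / len(kept) if kept else None


def toy_entails(a: str, b: str) -> bool:
    """Stand-in for a real entailment check: case-insensitive equality."""
    return a.strip().lower() == b.strip().lower()


if __name__ == "__main__":
    # Toy data invented for illustration: sampled answers per question and
    # whether the separate low-temperature answer was graded as correct.
    sampled = [
        ["CT", "ct", "CT", "CT "],              # consistent answers
        ["MRI", "CT", "X-ray", "Ultrasound"],   # scattered answers
    ]
    correct = [True, False]
    entropies = [
        discrete_semantic_entropy(cluster_by_entailment(answers, toy_entails))
        for answers in sampled
    ]
    print("entropies:", entropies)
    print("accuracy after filtering:", selective_accuracy(entropies, correct, 0.5))
```

With 15 samples per question, the entropy ranges from 0 (all answers in one cluster) to ln 15 (every answer in its own cluster), so any filtering threshold has to be chosen within that range.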

File Structure

├── README.md                    # This file
├── environment.yml              # Conda environment specification
├── CONFIG.py                    # Configuration and API credentials
├── eval.py                      # Main evaluation script
├── generateAnswers.py           # VLM response generation
├── promptLLM.py                 # LLM prompting interface
├── EntailmentCheck.py           # Semantic entailment
├── clusterAnswers.py            # Entropy calculation and analysis
├── RadDataset.py                # Dataset loading utilities (for both datasets)
├── utilFunctions.py             # General utility functions
├── outputs/                     # Generated VLM responses
├── cache/                       # Cached entailment results
├── questions/                   # Dataset files
├── results_GPT4o.txt            # Reference results
└── results_GPT4.1.txt           # Reference results

Citation

If you find this code helpful for your research, please cite our paper:

Wienholt, P., Caselitz, S., Siepmann, R. et al. Hallucination filtering in radiology vision-language models using discrete semantic entropy. Eur Radiol (2026). https://doi.org/10.1007/s00330-026-12384-z

References

[1] Huppertz, M.S., Siepmann, R., Topp, D. et al. Revolution or risk?—Assessing the potential and challenges of GPT-4V in radiologic image interpretation. Eur Radiol 35, 1111–1121 (2025). https://doi.org/10.1007/s00330-024-11115-6

[2] Ben Abacha, A., Hasan, S.A., Datla, V.V., Demner-Fushman, D., Müller, H. VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In: Proceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes, 9-12 September 2019.
