Identifying Confabulation Hotspots of Large Language Models in Radiology through Semantic Entropy

Introduction

This repository contains the code used for the (not yet published) paper titled "[TODO ADD PAPER]". If you find this code or research valuable for your work, please consider citing our paper: [TODO ADD PAPER]

We created a set of 90 questions spanning four areas:

- Guidelines and Indications
- Image Acquisition
- Imaging Education
- Research

The CSV files containing all the questions can be found in the questions folder.
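As a quick way to inspect the question files, the following minimal sketch reads every CSV in the questions folder using only the standard library. It assumes each file has a header row; the column names are not documented here, so the code does not rely on any particular one. The function name `load_questions` is hypothetical and not part of this repository.

```python
import csv
from pathlib import Path


def load_questions(folder="questions"):
    """Read every CSV file in `folder` into a list of row dictionaries.

    Hypothetical helper for inspection only; it is not part of the repository.
    """
    questions = {}
    for csv_path in sorted(Path(folder).glob("*.csv")):
        with csv_path.open(newline="", encoding="utf-8") as handle:
            # DictReader takes the first row as column names, whatever they are.
            questions[csv_path.name] = list(csv.DictReader(handle))
    return questions
```

Calling `load_questions()` from the repository root returns a dictionary that maps each file name to its rows, so `len(rows)` gives the number of questions per file.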

Requirements

Note: We used Ubuntu 22.04.

Ensure conda is installed on your system.
Set up the conda environment using the provided environment.yml file:

$ conda env create --name semantic_entropy_radiology --file=environment.yml

Activate the environment:

$ conda activate semantic_entropy_radiology

The environment setup is now complete.

Quick Evaluation

To reproduce our results, run:

$ python3 eval.py

This command processes the cached LLM responses (stored in the cache directory) to regenerate our findings. For a complete evaluation including new LLM prompts, see the "Reproducing the Results" section below.

Reproducing the Results

Note: Because LLM output is non-deterministic, newly generated answers may yield slightly different results.

Configure credentials:
- Open CONFIG.py
- Update LOCAL_KEY, LOCAL_ENDPOINT, OPENAI_KEY and AZURE_ENDPOINT with your credentials.
Prepare the environment:
- Clear the cache directory to ensure a clean evaluation.
Generate new answers:

$ python3 generateAnswers.py

Results are saved in the cache directory with filenames prefixed with EVAL.

Run the evaluation:

$ python3 eval.py

This step uses GPT-4o for semantic clustering and answer comparison. Previous entailment results are cached in entailmentCacheFile_GPT4o.csv.

Project Components

Configuration Files

CONFIG.py: Contains the configurations needed to run the code locally.
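For orientation, a CONFIG.py along these lines might look like the sketch below. Only the four variable names come from this README; reading them from environment variables and the placeholder defaults are assumptions, and the actual file may contain further settings.

```python
# Illustrative CONFIG.py layout. Only the four variable names are taken from
# this README; the environment-variable lookup and placeholders are assumptions.
import os

# Credentials for the locally hosted model endpoint.
LOCAL_KEY = os.environ.get("LOCAL_KEY", "<your-local-key>")
LOCAL_ENDPOINT = os.environ.get("LOCAL_ENDPOINT", "<your-local-endpoint>")

# Credentials for the OpenAI models served through Azure.
OPENAI_KEY = os.environ.get("OPENAI_KEY", "<your-openai-key>")
AZURE_ENDPOINT = os.environ.get("AZURE_ENDPOINT", "<your-azure-endpoint>")
```

Keeping secrets in environment variables rather than in the file avoids committing them by accident.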

Dataset Structure

RadDataset.py: Defines the dataset structure used in the project.

Core Functionality

Answer Generation

generateAnswers.py: Handles answer generation from the dataset
- Generates primary answers (ID 0) with temperature 0.1 for accuracy assessment
- Creates additional answers for semantic entropy computation
- Stores results in designated output files
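The generation scheme described above can be sketched as follows. This is not the code in generateAnswers.py: the function name, the `ask_llm` callable, the number of additional samples, and their temperature are all hypothetical. Only the primary answer with ID 0 and temperature 0.1 is taken from this README.

```python
def generate_answers(question, ask_llm, n_samples=10, sample_temperature=1.0):
    """Return {answer_id: text} for one question.

    `ask_llm` is a hypothetical callable (question, temperature) -> str.
    `n_samples` and `sample_temperature` are illustrative defaults, not the
    values used in the paper.
    """
    # ID 0: the primary answer at temperature 0.1, used for accuracy assessment.
    answers = {0: ask_llm(question, 0.1)}
    for answer_id in range(1, n_samples + 1):
        # Additional samples; their disagreement feeds the semantic entropy.
        answers[answer_id] = ask_llm(question, sample_temperature)
    return answers
```

Passing the model as a callable keeps the sketch independent of any particular API client.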

LLM Integration

promptLLM.py: Implements the logic for prompting various large language models (LLMs).

Semantic Entropy Calculation

clusterAnswers.py: Implements semantic entropy calculations and evaluation metrics (AUROC, AURAC)
EntailmentCheck.py: Manages clustering and entailment verification
- Implements prompt caching for efficient repeated executions
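To illustrate the idea, here is a minimal sketch of semantic clustering followed by a discrete semantic entropy estimate. It assumes a hypothetical `entails(a, b)` function that returns True when answer `a` entails answer `b` (in this project that judgment comes from an LLM), and it uses an in-memory dictionary in place of the CSV cache. The actual implementation in clusterAnswers.py and EntailmentCheck.py may differ in its details.

```python
import math


def with_cache(entails):
    """Wrap an entailment function so each (a, b) pair is evaluated only once."""
    cache = {}

    def cached(a, b):
        if (a, b) not in cache:
            cache[(a, b)] = entails(a, b)
        return cache[(a, b)]

    return cached


def cluster_answers(answers, entails):
    """Greedily group answers that entail each other in both directions."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(representative, answer) and entails(answer, representative):
                cluster.append(answer)
                break
        else:
            # No existing cluster shares this meaning, so start a new one.
            clusters.append([answer])
    return clusters


def semantic_entropy(clusters):
    """Entropy (in nats) of the empirical distribution over meaning clusters."""
    total = sum(len(cluster) for cluster in clusters)
    probabilities = [len(cluster) / total for cluster in clusters]
    return -sum(p * math.log(p) for p in probabilities)
```

When all sampled answers fall into one cluster the entropy is zero; the more evenly the answers spread over different meanings, the higher the value.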

Radiologist Correction

All correctness labels were reviewed by a board-certified radiologist, which changed some of the reported values. The corresponding corrections are included in entailmentCacheFile_GPT4o.csv.

Evaluation Tools

eval.py: Handles the evaluation pipeline
- Features bootstrapping analysis for confidence interval calculations across semantic entropy thresholds
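One way such an analysis can work is a percentile bootstrap over questions, sketched below for the accuracy of answers whose semantic entropy does not exceed a given threshold. The choice of statistic, the function names, and the defaults (`n_boot`, `alpha`, `seed`) are assumptions for illustration; eval.py may compute its intervals differently.

```python
import math
import random


def accuracy_below(entropies, correct, threshold):
    """Accuracy of the answers whose semantic entropy is <= threshold."""
    kept = [c for e, c in zip(entropies, correct) if e <= threshold]
    return sum(kept) / len(kept) if kept else float("nan")


def bootstrap_ci(entropies, correct, threshold, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy_below()."""
    rng = random.Random(seed)
    n = len(entropies)
    stats = []
    for _ in range(n_boot):
        # Resample questions with replacement, keeping entropy/label pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        value = accuracy_below(
            [entropies[i] for i in idx], [correct[i] for i in idx], threshold
        )
        if not math.isnan(value):  # skip resamples with no answer below threshold
            stats.append(value)
    if not stats:
        return float("nan"), float("nan")
    stats.sort()
    lower = stats[int((alpha / 2) * len(stats))]
    upper = stats[min(len(stats) - 1, int((1 - alpha / 2) * len(stats)))]
    return lower, upper
```

Repeating the call for a range of thresholds gives one interval per threshold; `correct` is expected to hold 0/1 labels aligned with `entropies`.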
