This repository contains the code used for the (not yet published) paper titled "[TODO ADD PAPER]". If you find this code or research valuable for your work, please consider citing our paper: [TODO ADD PAPER]
We have created a set of 90 questions, which span 4 different areas:
- Guidelines and Indications
- Image Acquisition
- Imaging Education
- Research
The CSV files containing all the questions can be found in the questions folder.
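As a quick sanity check on the question files, the sketch below counts the rows in each CSV of a folder. It assumes each file has one header row followed by one question per row; that layout and the helper name count_questions are assumptions made for illustration, not something this repository defines.

```python
import csv
from pathlib import Path


def count_questions(folder):
    """Return {file_name: number_of_data_rows} for every CSV file in folder."""
    counts = {}
    for path in sorted(Path(folder).glob("*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            # Assumption: the first row is a header, every later row is one question.
            counts[path.name] = sum(1 for _ in csv.DictReader(handle))
    return counts
```

Calling count_questions("questions") from the repository root and summing the values should give the total number of questions if the layout assumption holds.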
Note: We used Ubuntu 22.04.
- Ensure conda is installed on your system.
- Set up the conda environment using the provided environment.yml file:
$ conda env create --name semantic_entropy_radiology --file=environment.yml
- Activate the environment:
$ conda activate semantic_entropy_radiology
The environment setup is now complete.
To reproduce our results, run:
$ python3 eval.py
This command processes the cached LLM responses (stored in cache) to regenerate our findings. For a complete evaluation including new LLM prompts, see the "Reproducing the Results" section below.
Note: Due to the non-deterministic nature of LLMs, new prompting may yield slightly different results.
- Configure credentials:
  - Open CONFIG.py
  - Update LOCAL_KEY, LOCAL_ENDPOINT, OPENAI_KEY and AZURE_ENDPOINT with your credentials.
- Prepare the environment:
  - Clear the cache directory to ensure a clean evaluation.
- Generate new answers:
$ python3 generateAnswers.py
Results are saved in the cache directory with filenames prefixed with EVAL.
- Run the evaluation:
$ python3 eval.py
This step uses GPT-4o for semantic clustering and answer comparison. Previous entailment results are cached in entailmentCacheFile_GPT4o.csv.
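The idea of caching entailment results so that repeated runs do not prompt the model again can be sketched as follows. This is a minimal illustration, not the code in EntailmentCheck.py: the class name EntailmentCache, the three-column layout (premise, hypothesis, verdict) and the check callback are all hypothetical, and the real entailmentCacheFile_GPT4o.csv may be structured differently.

```python
import csv
from pathlib import Path


class EntailmentCache:
    """Memoize entailment verdicts in a CSV file (hypothetical three-column layout)."""

    def __init__(self, path):
        self.path = Path(path)
        self.verdicts = {}
        if self.path.exists():
            with self.path.open(newline="", encoding="utf-8") as handle:
                for row in csv.reader(handle):
                    if len(row) == 3:  # ignore blank or malformed rows
                        premise, hypothesis, verdict = row
                        self.verdicts[(premise, hypothesis)] = verdict == "1"

    def lookup(self, premise, hypothesis, check):
        """Return the cached verdict; call check(premise, hypothesis) only on a miss."""
        key = (premise, hypothesis)
        if key not in self.verdicts:
            verdict = bool(check(premise, hypothesis))
            self.verdicts[key] = verdict
            # Append immediately so an interrupted run keeps what it already paid for.
            with self.path.open("a", newline="", encoding="utf-8") as handle:
                csv.writer(handle).writerow([premise, hypothesis, int(verdict)])
        return self.verdicts[key]
```

In this sketch, check stands in for whatever function actually prompts the LLM. Keying on the ordered pair matters because entailment is directional.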
CONFIG.py: Contains the configurations needed to run the code locally.
RadDataset.py: Defines the dataset structure used in the project.
generateAnswers.py: Handles answer generation from the dataset
- Generates primary answers (ID 0) with temperature 0.1 for accuracy assessment
- Creates additional answers for semantic entropy computation
- Stores results in designated output files
promptLLM.py: Implements the logic for prompting various large language models (LLMs).
clusterAnswers.py: Implements semantic entropy calculations and evaluation metrics (AUROC, AURAC)
EntailmentCheck.py: Manages clustering and entailment verification
- Implements prompt caching for efficient repeated executions
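For readers unfamiliar with semantic entropy, the following is a minimal sketch of the discrete variant: answers are grouped by bidirectional entailment, and the entropy is taken over the cluster frequencies. It is not the implementation in clusterAnswers.py or EntailmentCheck.py. The function names are hypothetical, and entails is a placeholder for any callable that decides whether one answer entails another (an LLM prompt in this project).

```python
import math


def cluster_by_entailment(answers, entails):
    """Greedily group answers; two answers share a cluster if each entails the other."""
    clusters = []
    for answer in answers:
        for cluster in clusters:
            representative = cluster[0]
            if entails(answer, representative) and entails(representative, answer):
                cluster.append(answer)
                break
        else:
            # No existing cluster matched, so this answer starts a new meaning.
            clusters.append([answer])
    return clusters


def semantic_entropy(clusters):
    """Shannon entropy (in nats) of the empirical cluster frequencies."""
    total = sum(len(cluster) for cluster in clusters)
    probabilities = [len(cluster) / total for cluster in clusters]
    return sum(-p * math.log(p) for p in probabilities)
```

When every sampled answer lands in a single cluster the entropy is zero. The more the answers spread across distinct meanings, the higher the value, which is what makes it usable as an uncertainty score.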
To verify answer correctness, all correctness labels were reviewed by a board-certified radiologist. This led to changes in some reported values. The corresponding corrections are included in entailmentCacheFile_GPT4o.csv.
eval.py: Handles the evaluation pipeline
- Features bootstrapping analysis for confidence interval calculations across semantic entropy thresholds
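To illustrate what a bootstrapped confidence interval at a semantic entropy threshold can look like, here is a small percentile-bootstrap sketch. It is an assumption-laden example rather than the procedure in eval.py: the function name, the choice of accuracy as the statistic, the percentile method, and the defaults (1000 resamples, 95% interval, fixed seed) are all illustrative.

```python
import random


def bootstrap_accuracy_ci(entropies, correct, threshold,
                          n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy on answers with entropy <= threshold."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    pairs = list(zip(entropies, correct))
    estimates = []
    for _ in range(n_resamples):
        sample = [rng.choice(pairs) for _ in pairs]  # resample with replacement
        kept = [is_correct for entropy, is_correct in sample if entropy <= threshold]
        if kept:  # skip resamples in which no answer passes the threshold
            estimates.append(sum(kept) / len(kept))
    if not estimates:
        raise ValueError("no resample contained an answer below the threshold")
    estimates.sort()
    lower = estimates[int((alpha / 2) * (len(estimates) - 1))]
    upper = estimates[int((1 - alpha / 2) * (len(estimates) - 1))]
    return lower, upper
```

Repeating the call over a grid of thresholds gives one interval per threshold, which is the general shape of an analysis "across semantic entropy thresholds".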