This project uses Large Language Models to predict appropriate codes from the German fee schedule for physicians (Gebührenordnung für Ärzte, GOÄ) based on radiology findings. The LLMs can either be fine-tuned with the training scripts or used out of the box with zero-shot or few-shot prompting.
RAD-Bill addresses the challenge of medical billing in radiology by automatically predicting the relevant GOÄ codes for German radiology reports. The system uses fine-tuned LLMs (e.g., Mistral) trained on radiology findings to predict comma-separated billing codes.
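For the out-of-the-box setting, a few-shot prompt can be assembled from example findings and their codes. The sketch below is an illustrative assumption: the prompt wording, the example finding, and the code "5370" are placeholders, not the project's actual prompt or verified GOÄ codes.

```python
# Hypothetical few-shot examples: (finding, comma-separated GOÄ codes).
# Both the finding text and the codes are illustrative placeholders.
FEW_SHOT_EXAMPLES = [
    ("CT des Thorax, nativ. Unauffälliger Befund.", "5370"),
]

def build_prompt(finding: str) -> str:
    """Build a few-shot prompt that asks the model to complete the codes."""
    parts = [
        "Predict the comma-separated GOÄ billing codes "
        "for the following radiology finding."
    ]
    for example_finding, codes in FEW_SHOT_EXAMPLES:
        parts.append(f"Befund: {example_finding}\nGOÄ-Ziffern: {codes}")
    # The model is expected to continue after the final "GOÄ-Ziffern:".
    parts.append(f"Befund: {finding}\nGOÄ-Ziffern:")
    return "\n\n".join(parts)

prompt = build_prompt("MRT des Kniegelenks. Ruptur des vorderen Kreuzbandes.")
```

The completion after the trailing "GOÄ-Ziffern:" would then be parsed as a comma-separated code list.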
- Fine-tuning Pipeline: Train LLMs with the TRL library
- Multi-label Evaluation: Comprehensive metrics including precision, recall, F1-score, and Jaccard similarity
- Bootstrap Confidence Intervals: Statistical evaluation with confidence intervals
- OpenAI Integration: Evaluation support for OpenAI models
- Weights & Biases Integration: Experiment tracking and logging
- Python >= 3.13
- CUDA-compatible GPU (recommended)
- Install dependencies using `uv` or `pip`:

# Using uv (recommended)
uv sync

# Or using pip
pip install -e .

- Create a `.env` file with the required environment variables:
# Model Configuration
MODEL=<path-to-base-model>
INPUT_LENGTH=<max-input-tokens>
TARGET_LENGTH=<max-target-tokens>
# Directory Configuration
OUTPUT_DIR=<path-to-output-directory>
DATASET_DIR=<path-to-dataset>
EVAL_DIR=<path-to-evaluation-output>
# Training Hyperparameters
TRAIN_TEST_SPLIT=0.8
TRAIN_BATCH_SIZE=4
EVAL_BATCH_SIZE=4
NUM_TRAIN_EPOCHS=3
GRADIENT_CHECKPOINTING=true
GRADIENT_ACCUMULATION_STEPS=4
LEARNING_RATE=2e-5
# API Keys
WANDB_API_KEY=<your-wandb-api-key>
OPENAI_API_KEY=<your-openai-api-key>  # Optional, for OpenAI evaluation

The training pipeline uses the TRL library to fine-tune large language models.
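To illustrate how these variables might be consumed, here is a minimal config loader. The parsing logic and defaults are assumptions for illustration; the project's scripts presumably read the `.env` file (e.g., via python-dotenv) before accessing `os.environ`.

```python
import os

def load_config(env=os.environ):
    """Parse the training-related variables listed above.
    Illustrative sketch only; the project's actual loader may differ."""
    return {
        "model": env.get("MODEL", "<path-to-base-model>"),
        "train_test_split": float(env.get("TRAIN_TEST_SPLIT", "0.8")),
        "train_batch_size": int(env.get("TRAIN_BATCH_SIZE", "4")),
        "num_train_epochs": int(env.get("NUM_TRAIN_EPOCHS", "3")),
        "learning_rate": float(env.get("LEARNING_RATE", "2e-5")),
        # Booleans arrive as strings in .env files and must be parsed.
        "gradient_checkpointing":
            env.get("GRADIENT_CHECKPOINTING", "true").lower() == "true",
    }

cfg = load_config({"MODEL": "mistralai/Mistral-7B-v0.3", "LEARNING_RATE": "1e-5"})
```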
python training/training.py

The training script supports the following configurations:
- Model: Any Hugging Face causal language model (tested with Mistral)
- Data Splitting: Group-based train/test split by patient ID
- Logging: Weights & Biases integration for experiment tracking
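The group-based split ensures that all findings from one patient land in the same split, preventing leakage between train and test sets. A common way to implement this, sketched here with scikit-learn's `GroupShuffleSplit` (the data is hypothetical and this may not be the project's exact implementation):

```python
from sklearn.model_selection import GroupShuffleSplit

# Hypothetical records: each finding carries a patient ID.
findings = ["CT Thorax ...", "MRT Knie ...", "CT Abdomen ...",
            "Röntgen Hand ...", "CT Schädel ..."]
patient_ids = ["p1", "p1", "p2", "p3", "p4"]

# Split whole patients, not individual findings, 80/20.
splitter = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)
train_idx, test_idx = next(splitter.split(findings, groups=patient_ids))

# No patient appears in both splits.
train_patients = {patient_ids[i] for i in train_idx}
test_patients = {patient_ids[i] for i in test_idx}
assert train_patients.isdisjoint(test_patients)
```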
python evaluation/sample/evaluate_finetuned_models.py
python evaluation/calculate_scores.py

The evaluation framework computes:
- Micro/Macro Precision, Recall, F1-Score
- Subset Accuracy: Exact match accuracy
- Jaccard Score: Set similarity metric
- Hamming Loss: Fraction of incorrect labels
- Bootstrap Confidence Intervals: Resampling-based uncertainty estimates for each metric
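The metrics above map directly onto scikit-learn's multi-label utilities once the comma-separated code strings are binarized. The sketch below uses hypothetical predictions and a standard percentile bootstrap; it shows the general recipe, not necessarily the project's exact procedure.

```python
import numpy as np
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.metrics import accuracy_score, f1_score, hamming_loss, jaccard_score

# Hypothetical reference and predicted code sets (codes are placeholders),
# e.g. after splitting comma-separated model output.
y_true = [{"5700", "5731"}, {"5600"}, {"5700"}]
y_pred = [{"5700"}, {"5600"}, {"5700", "5731"}]

mlb = MultiLabelBinarizer()
Y_true = mlb.fit_transform(y_true)
Y_pred = mlb.transform(y_pred)

micro_f1 = f1_score(Y_true, Y_pred, average="micro")       # pooled over all labels
macro_f1 = f1_score(Y_true, Y_pred, average="macro")       # averaged per label
subset_acc = accuracy_score(Y_true, Y_pred)                # exact-match accuracy
sample_jaccard = jaccard_score(Y_true, Y_pred, average="samples")
h_loss = hamming_loss(Y_true, Y_pred)                      # fraction of wrong cells

def bootstrap_ci(metric, Y_true, Y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap over report-level resamples."""
    rng = np.random.default_rng(seed)
    n = Y_true.shape[0]
    scores = [metric(Y_true[idx], Y_pred[idx])
              for idx in (rng.integers(0, n, n) for _ in range(n_boot))]
    return tuple(np.quantile(scores, [alpha / 2, 1 - alpha / 2]))

lo, hi = bootstrap_ci(
    lambda t, p: f1_score(t, p, average="micro", zero_division=0),
    Y_true, Y_pred,
)
```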
| Package | Version |
|---|---|
| transformers | >=4.57.3 |
| torch | >=2.9.1 |
| peft | >=0.18.0 |
| trl | >=0.26.2 |
| datasets | >=4.4.2 |
| scikit-learn | >=1.8.0 |
| pandas | >=2.3.3 |
| wandb | >=0.23.1 |
| openai | >=2.14.0 |
| python-dotenv | >=1.2.1 |
This project is licensed under the MIT License.
Contributions are welcome! Please feel free to submit a Pull Request.
For questions or collaboration inquiries, please open an issue in this repository.
