A Python package for extracting breast tissue removal data (weight/units) from surgical reports using a multi-stage pipeline.
- Regex & Sectionizer: Splits the report into clinical sections and identifies candidate mentions of weights with their immediate context.
- LLM Disambiguation: Uses an LLM (via Hugging Face) and a Jinja2-rendered prompt to select the correct weights for the left and right breasts from the candidates.
- Auditor Pass: Programmatically verifies that the numerical value the LLM claims to have found actually exists within its provided evidence string.
- Evaluation: Measures accuracy, Mean Absolute Error (MAE), and audit failure rates against ground truth data.
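
The stages above can be illustrated with a minimal sketch. It is not the package's actual implementation: the regex pattern, the function names (`find_candidates`, `audit`, `evaluate`), and the context window size are all assumptions made for illustration, and the LLM disambiguation step is omitted.

```python
import re

# Hypothetical pattern: a number followed by a gram unit (g, gm, gms, gram, grams).
WEIGHT_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(grams?|gms?|g)\b", re.IGNORECASE)


def find_candidates(text, window=40):
    """Return each weight mention with up to `window` characters of context per side."""
    candidates = []
    for m in WEIGHT_RE.finditer(text):
        start = max(0, m.start() - window)
        end = min(len(text), m.end() + window)
        candidates.append(
            {
                "value": float(m.group(1)),
                "unit": m.group(2).lower(),
                "context": text[start:end],
            }
        )
    return candidates


def audit(value, evidence):
    """Pass only if some number in the evidence string equals the claimed value."""
    numbers = [float(n) for n in re.findall(r"\d+(?:\.\d+)?", evidence)]
    return any(abs(n - value) < 1e-9 for n in numbers)


def evaluate(predictions, truths, audits):
    """Compute exact-match accuracy, MAE, and audit failure rate over paired lists."""
    n = len(truths)
    accuracy = sum(p == t for p, t in zip(predictions, truths)) / n
    mae = sum(abs(p - t) for p, t in zip(predictions, truths)) / n
    audit_failure_rate = sum(not ok for ok in audits) / len(audits)
    return {
        "accuracy": accuracy,
        "mae": mae,
        "audit_failure_rate": audit_failure_rate,
    }


report = "Right breast tissue removed: 450 g. Left breast tissue removed: 512.5 grams."
candidates = find_candidates(report)
audit_ok = audit(450.0, "tissue removed: 450 g")
audit_bad = audit(540.0, "tissue removed: 450 g")
metrics = evaluate([450.0, 500.0], [450.0, 512.0], [True, False])
```

The audit here compares parsed numbers instead of searching for a substring, so a claimed value of 45 would not pass against evidence that only contains 450. Whether the package makes the same choice is not stated in this README.
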
python -m venv .venv
source .venv/bin/activate
pip install -e .[dev]

The package can be run directly from the command line:
python -m repo_parse csv=reports.csv column="Surgical Note"
python -m repo_parse csv=reports.csv output=results.csv
python -m repo_parse eval=true samples=data/samples gt=data/ground_truth.csv

- Run tests:
  pytest
- Linting:
  make lint
- Formatting:
  make format