A clinical question-answering system that uses retrieval-augmented generation (RAG) over synthetic patient records and answers with every sentence traced to the FHIR resource it came from. Built to demonstrate the things clinical AI teams actually hire for: citation grounding, abstention, PHI governance, and a CI-gated evaluation harness.
Scope & safety. Synthetic and public data only - never real PHI. This system is not clinically validated, is not fit for real clinical use, and does not provide medical advice or treatment recommendations. It is strictly a retrieval-and-grounding demonstrator.
Built in stages (see the build brief). All five stages are implemented:
- Stage 1 (MVP): ingest one Synthea patient → pgvector; patient-scoped semantic retrieval; grounded Q&A with per-sentence inline citations (structured, validated output)
- Stage 2 (Abstention): refuses correctly when evidence is insufficient - including off-topic / out-of-record questions - via a retrieval relevance gate plus the grounding gate (see below)
- Stage 3 (PHI): Presidio-based detection & redaction of HIPAA Safe Harbor identifiers, before any text is embedded, stored, or sent to the model (see below)
- Stage 4 (Eval harness + CI gate): labeled gold set scored on groundedness, hallucination rate, retrieval precision/recall, and abstention correctness; deterministic metrics gate every PR in CI (see below)
- Stage 5 (Dashboard): Streamlit UI over the API that renders every answer with inline, clickable citations to the FHIR sources (see below)
Hybrid (semantic + keyword/BM25) retrieval - §6 of the brief - is planned behind the
`Retriever` seam as a later increment; semantic-only is the MVP baseline.
Synthea FHIR bundle
        │
        ▼
┌──────────────────────────────────────────────────────────┐
│ Ingestion (app/ingest)                                   │
│ parse FHIR → PHI redact (hook) → embed → store           │
└──────────────────────────────────────────────────────────┘
        │                                  ▲
        ▼                                  │ embeddings
┌─────────────────────┐                ┌──────────────────┐
│ pgvector (app/db)   │◀── retrieve ───│ Provider seam    │
└─────────────────────┘                │ (app/llm)        │
        │                              │ chat: MiniMax /  │
        ▼                              │ OpenAI / Synap   │
┌─────────────────────┐    grounded    │ embed: local /   │
│ Q&A (app/qa)        │──── answer ───▶│ OpenAI-compat    │
│ retrieve→prompt→    │                └──────────────────┘
│ enforce grounding   │
└─────────────────────┘
        │
        ▼
FastAPI /ask → per-sentence citations + resolved sources
Every LLM and embedding call goes through one module, configured entirely by environment variables. No provider is hard-coded anywhere else.
- Chat (`CHAT_PROVIDER`):
  - `anthropic` → Anthropic SDK → MiniMax-M3 at `https://api.minimax.io/anthropic` (today's default)
  - `openai` → OpenAI SDK → OpenAI / SynapticaAI / MiniMax OpenAI-mode
- Judge (`JUDGE_PROVIDER`): the eval LLM-as-judge, on its own seam so it can be a different vendor and model family than the generator - here OpenAI `gpt-5.5` grading MiniMax, so the system is not marking its own work. The OpenAI adapter adapts to newer-model parameter rules (`max_completion_tokens`, fixed temperature) automatically, so the swap stays config-only.
- Embeddings (`EMBEDDINGS_PROVIDER`):
  - `local` → fastembed (`bge-small-en-v1.5`, 384-dim) - offline, no key, deterministic for CI (default)
  - `openai` → any OpenAI-compatible `/embeddings` endpoint (OpenAI, MiniMax, SynapticaAI)
Chat and embeddings are separate seams because the Anthropic SDK has no
embeddings API and embeddings are universally OpenAI-shaped. Pointing the app
at SynapticaAI later is a base-URL + key change in .env - no code change.
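As a rough illustration of "configured entirely by environment variables", the sketch below resolves the three seams from an environment mapping. The variable names `CHAT_PROVIDER`, `JUDGE_PROVIDER`, and `EMBEDDINGS_PROVIDER` and the chat/embeddings defaults come from the list above; the `ProviderConfig` class, the `load_provider_config` function, the judge default, and the validation rules are assumptions made for this example, not the project's actual module.

```python
import os
from dataclasses import dataclass

CHAT_CHOICES = {"anthropic", "openai"}
EMBEDDING_CHOICES = {"local", "openai"}


@dataclass(frozen=True)
class ProviderConfig:
    """Hypothetical container for the three provider seams."""
    chat_provider: str
    judge_provider: str
    embeddings_provider: str


def load_provider_config(env=None) -> ProviderConfig:
    """Resolve provider choices from environment variables only."""
    env = os.environ if env is None else env
    chat = env.get("CHAT_PROVIDER", "anthropic")
    # Assumption: the judge defaults to the OpenAI-style adapter when unset.
    judge = env.get("JUDGE_PROVIDER", "openai")
    embeddings = env.get("EMBEDDINGS_PROVIDER", "local")
    if chat not in CHAT_CHOICES:
        raise ValueError(f"unsupported CHAT_PROVIDER: {chat!r}")
    if embeddings not in EMBEDDING_CHOICES:
        raise ValueError(f"unsupported EMBEDDINGS_PROVIDER: {embeddings!r}")
    return ProviderConfig(chat, judge, embeddings)
```

Keeping the lookup in one function like this is what makes a vendor swap a `.env` edit: callers ask for a config object and never read provider names themselves.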
- mise (manages Python + `uv`)
- Docker (Postgres + pgvector, and Synthea generation - no local Java/JDK needed)
mise install # Python 3.12 + uv, per mise.toml
mise run install # uv sync - also pulls the spaCy en_core_web_lg model (~400MB) for PHI
Copy-Item .env.example .env
# Edit .env: set CHAT_API_KEY to your MiniMax key. Defaults use local embeddings
# (no key needed) so ingestion/retrieval work offline.
docker compose up -d db
mise run synthea # writes data/sample/patient_bundle.json
mise run ingest data/sample/patient_bundle.json
mise run api # http://localhost:8000/docs

Ask a question:

$body = @{ patient_id = "<id printed by ingest>"; question = "What medications is this patient on, and any flagged allergies?" } | ConvertTo-Json
Invoke-RestMethod -Method Post -Uri http://localhost:8000/ask -ContentType application/json -Body $body
mise run dashboard # http://localhost:8501 (keep the API from step 5 running)
docker compose --profile app up

Answers are structured, not free text, and validated with Pydantic
(app/schemas.py):
- `source_id` is `<ResourceType>/<resource_id>` - a citation points at the exact FHIR resource it came from.
- Grounding gate (`app/qa/service.py`): citations that don't resolve to a retrieved source are dropped; a sentence with no surviving citation is dropped; if nothing grounded remains, the system abstains rather than asserting an unsupported clinical claim.
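The grounding gate's three rules can be sketched in a few lines. This is a minimal illustration, not the code in `app/qa/service.py`: it uses standard-library dataclasses instead of Pydantic to stay dependency-free, and the names `Sentence`, `Answer`, and `enforce_grounding` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Sentence:
    text: str
    citations: list[str] = field(default_factory=list)  # "<ResourceType>/<resource_id>"


@dataclass
class Answer:
    abstained: bool
    sentences: list[Sentence]


def enforce_grounding(draft: list[Sentence], retrieved_ids: set[str]) -> Answer:
    """Keep only sentences that cite at least one retrieved source."""
    grounded = []
    for sentence in draft:
        # Rule 1: drop citations that do not resolve to a retrieved source.
        kept = [c for c in sentence.citations if c in retrieved_ids]
        # Rule 2: drop a sentence with no surviving citation.
        if kept:
            grounded.append(Sentence(sentence.text, kept))
    # Rule 3: abstain when nothing grounded remains.
    return Answer(abstained=not grounded, sentences=grounded)
```

For example, a draft sentence citing both `MedicationRequest/med-1` (retrieved) and an invented id keeps only the first citation, while a sentence whose only citation was never retrieved disappears from the answer.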
The system refuses to answer when the record doesn't support an answer. Two independent gates make this robust:
- Relevance gate (`app/retrieval/retriever.py`): semantic search returns each chunk's cosine distance, and chunks beyond `RETRIEVAL_MAX_DISTANCE` are discarded. An off-topic or out-of-record question ("What is the capital of France?") finds nothing close enough, retrieves zero evidence, and the service abstains without ever calling the model. The threshold (default `0.40`) was calibrated on the ingested patient - relevant queries land at cosine distance ~0.24-0.34, off-topic ones at ~0.42+ - and is env-tunable so the Stage 4 eval harness can optimize it.
- Grounding gate (above): even when evidence is retrieved, any answer the model can't tie back to it is dropped, falling through to abstention.
The model is also instructed to abstain on insufficient evidence and to answer only the supported parts of a multi-part question. Net effect: the system says "No matching information was found in this patient's record" instead of guessing - the strongest safety signal the brief asks for.
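The relevance gate described above amounts to a distance filter followed by an early return. The sketch below assumes a hypothetical `Chunk` type and `ask` function, and treats the threshold as inclusive, which the text does not specify; only the `0.40` default comes from this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    source_id: str
    distance: float  # cosine distance to the query; smaller means closer


def apply_relevance_gate(chunks, max_distance=0.40):
    """Discard chunks farther from the query than the threshold."""
    return [c for c in chunks if c.distance <= max_distance]


def ask(chunks, generate, max_distance=0.40):
    """Return None (abstain) without calling the model when no evidence survives."""
    evidence = apply_relevance_gate(chunks, max_distance)
    if not evidence:
        return None
    return generate(evidence)
```

With the calibration figures quoted above, a chunk at distance 0.30 passes and one at 0.45 is discarded, so an off-topic question never reaches `generate`.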
Before any text is embedded, stored, or sent to the model, it passes through the
PHI stage (app/ingest/phi.py), which uses Microsoft Presidio (spaCy NER +
rule-based recognizers) to find HIPAA Safe Harbor identifiers. Two deliberate
design choices:
- Structured hints, not NER guesswork. FHIR is structured, so at ingest time
  we already know the patient's name and address. Synthea's names carry numeric
  suffixes ("Vanna750 Rosenbaum794") that defeat NER outright - so
  `collect_phi_hints()` pulls the known identifier strings from the Patient
  resource and seeds them as a Presidio deny-list. De-identification therefore
  does not depend on the model inferring that a token is a name. Custom
  recognizers cover the gaps in the base library (dashed SSNs, medical record
  numbers).
- Detect-all, redact-curated. Every identifier category found is reported
  (the governance signal - `mise run ingest` prints the tally); direct
  identifiers (names, geography, contacts, SSN/MRN, account/license numbers)
  are redacted to `<PERSON>`, `<LOCATION>`, etc. Dates are flagged but kept by
  default (`PHI_REDACT_DATES=false`): onset and authoring dates are clinical
  content that powers the grounded answers, and the data is synthetic. This is
  the Safe-Harbor trade-off made explicit rather than blindly nuking dates.
After redaction the stored Patient record reads
Patient <PERSON>. Gender: female. Date of birth: 1966-10-28. Address: <LOCATION>, <LOCATION>.
- the name and city are gone everywhere, while gender, dates, and all clinical
codes survive so retrieval and citations are unaffected. Disable the stage with
PHI_REDACTION=false.
Synthetic data only. This stage demonstrates data governance; it is not a license to process real PHI.
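To make the "structured hints" idea concrete, here is a dependency-free sketch that swaps Presidio for plain string replacement: it pulls names and places out of a FHIR Patient resource and replaces every occurrence with a placeholder. The function names `collect_hints_sketch` and `redact_with_hints` are hypothetical, the city and state values are made up for the example, and the real stage additionally runs Presidio's NER and pattern recognizers.

```python
import re


def collect_hints_sketch(patient: dict) -> dict[str, list[str]]:
    """Pull known identifier strings from a FHIR Patient resource."""
    hints: dict[str, list[str]] = {"PERSON": [], "LOCATION": []}
    for name in patient.get("name", []):
        parts = list(name.get("given", []))
        if name.get("family"):
            parts.append(name["family"])
        hints["PERSON"].extend(parts)
        if parts:
            # Also match the full name so it collapses to a single token.
            hints["PERSON"].append(" ".join(parts))
    for address in patient.get("address", []):
        for key in ("city", "state"):
            if address.get(key):
                hints["LOCATION"].append(address[key])
    return hints


def redact_with_hints(text: str, hints: dict[str, list[str]]) -> str:
    """Replace each known identifier with its <LABEL> placeholder."""
    for label, values in hints.items():
        # Longest first, so a full name is replaced before its parts.
        for value in sorted(set(values), key=len, reverse=True):
            text = re.sub(re.escape(value), f"<{label}>", text)
    return text


patient = {
    "name": [{"given": ["Vanna750"], "family": "Rosenbaum794"}],
    "address": [{"city": "Boston", "state": "Massachusetts"}],  # illustrative values
}
redacted = redact_with_hints(
    "Patient Vanna750 Rosenbaum794. Gender: female. Address: Boston, Massachusetts.",
    collect_hints_sketch(patient),
)
```

Because the identifiers come from the record itself, the numeric suffixes that confuse NER make no difference to the match.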
A labeled gold set (evals/gold_set.yaml, ~14 items against the ingested
patient) drives the four metrics the brief asks for. The harness runs in two
layers:
- Deterministic layer (no API key) - retrieval precision/recall against
  labeled `source_id`s, and abstention reachability: off-topic questions must
  retrieve nothing (so the relevance gate guarantees abstention), answerable
  ones must retrieve evidence. This layer is the hard CI merge gate - it runs
  on a Postgres+pgvector service container on every PR
  (`.github/workflows/ci.yml`), so a retrieval or prompt change that regresses
  is caught before merge.
- LLM layer (needs a key) - runs the real Q&A path for true abstention
  correctness and a `must_include` content check, plus an LLM-as-judge
  (`evals/judge.py`) scoring groundedness and hallucination rate sentence by
  sentence against each sentence's cited evidence. Advisory by default;
  `--strict` makes it gate too.
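The deterministic checks above reduce to set arithmetic over `source_id`s. The following is a minimal sketch with hypothetical function names; the conventions for empty inputs (precision 0.0 with nothing retrieved, recall 1.0 with nothing labeled) are assumptions, not necessarily what the harness does.

```python
def precision_recall(retrieved: list[str], relevant: set[str]) -> tuple[float, float]:
    """Precision and recall of retrieved source_ids against the labeled set."""
    unique = set(retrieved)
    hits = len(unique & relevant)
    precision = hits / len(unique) if unique else 0.0
    recall = hits / len(relevant) if relevant else 1.0
    return precision, recall


def abstention_reachable(retrieved: list[str], should_abstain: bool) -> bool:
    """Off-topic items must retrieve nothing; answerable items must retrieve evidence."""
    return (len(retrieved) == 0) if should_abstain else (len(retrieved) > 0)


# Two labeled sources found inside a top-8 result list.
top8 = [f"Observation/o-{i}" for i in range(6)] + [
    "MedicationRequest/med-1",
    "AllergyIntolerance/a-1",
]
labeled = {"MedicationRequest/med-1", "AllergyIntolerance/a-1"}
precision, recall = precision_recall(top8, labeled)  # 0.25, 1.0
```

The example shows why precision stays low when the labeled set is small relative to the retrieval depth: recall is already perfect, yet six of the eight returned chunks are unlabeled.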
mise run eval # full run if CHAT_API_KEY is set
uv run python -m evals.run --no-llm # deterministic gate only (what CI runs)

Latest local run - MiniMax-M3 generator graded by an independent
gpt-5.5 judge (different vendor and model family), 14 items:
| Metric | Score |
|---|---|
| Retrieval recall | 1.00 |
| Retrieval precision | 0.23 (small relevant sets vs. top-8 - recall is the safety-relevant number) |
| Abstention reachability (deterministic) | 1.00 |
| Abstention correctness (LLM path) | 1.00 (abstains on blood type, cancer hx, procedures, vaccinations, redacted contact info, off-topic) |
| Groundedness | 0.92 |
| Hallucination rate | 0.08 |
The non-perfect groundedness is the harness working as intended. The independent judge flagged one sentence on the BMI question: the model correctly read the cited BMI value (28.32 kg/m2) from the Observation, but then asserted it was "below the 30 kg/m2 obesity threshold" - and that 30 cutoff appears nowhere in the cited resource. The model imported a fact from its own training and presented it as cited. A weaker, same-family judge waved this through; the stronger independent judge caught the unstated clinical assumption. That is exactly the failure mode an eval harness exists to surface.
Groundedness is therefore sensitive to judge capability: a smaller judge
(gpt-5.4-mini) scored this same answer a clean 1.00, while gpt-5.5 flagged
it - which is precisely why the judge runs on a deliberately strong model that
is independent of the generator, rather than the generator grading itself.
Honest caveats: the gold set is small (14 items). The grounding gate drops uncited sentences before the judge ever runs, so the judge's job is the subtler one above - catching a cited sentence whose citation doesn't actually support it. The judge runs on its own swappable `JUDGE_*` seam (here OpenAI `gpt-5.5`, independent of the MiniMax generator). Two distinct abstention mechanisms are measured separately: the deterministic relevance gate (off-topic) and the grounded LLM layer (clinically-adjacent but out-of-record).
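For orientation, sentence-level judge verdicts can be aggregated as below. This is a sketch under a stated assumption: that hallucination rate is the complement of groundedness, which is consistent with the 0.92 / 0.08 pair in the table but is not confirmed by the text. The function name and the verdict counts are illustrative, not taken from the actual run.

```python
from typing import Optional


def judge_scores(verdicts: list[bool]) -> Optional[tuple[float, float]]:
    """Return (groundedness, hallucination_rate) from per-sentence verdicts.

    Each verdict is True when the judge finds the sentence supported by its
    cited evidence. Returns None when there were no sentences to judge.
    """
    if not verdicts:
        return None
    supported = sum(verdicts)
    total = len(verdicts)
    return supported / total, (total - supported) / total


# Illustrative counts: nine supported sentences and one flagged by the judge.
scores = judge_scores([True] * 9 + [False])
```

A single flagged sentence moves both numbers, which is why a small gold set makes these scores sensitive to one judge decision.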
A Streamlit UI (streamlit_app.py) talks to the FastAPI service over HTTP - so
it shows exactly what any client would get - and renders the traceability that
is the whole point of the system:
- Each answer sentence carries superscript `[n]` citations that link down to numbered source cards (the resolved FHIR resource: its text, clinical codes, and date). Grounding is visible, not just claimed.
- Abstention is shown as a deliberate notice, not an error - the safety behavior reads as a feature.
- Redacted tokens (`<PERSON>`, `<LOCATION>`) appear in the source text, so the PHI stage is visible too.
[Screenshots: grounded answer (left) and abstention (right)]
Left: a grounded answer - inline [n] citations resolve to FHIR source cards
with codes and dates. Right: the system declining to answer "what is the blood
type?" because no citable evidence exists, rather than guessing.
mise run api # terminal 1
mise run dashboard # terminal 2 -> http://localhost:8501

The pure rendering logic (citation numbering, HTML escaping of redaction tokens)
lives in app/web/render.py, separate from the Streamlit script so it is unit
tested.
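The two rendering concerns named above, citation numbering and HTML escaping of redaction tokens, can be sketched as one pure function. This is an illustration only, not the contents of `app/web/render.py`; the `number_citations` name and the dict-shaped input are assumptions.

```python
import html


def number_citations(sentences: list[dict]) -> tuple[list[str], list[str]]:
    """Assign [n] markers in order of first appearance.

    Returns the rendered lines and the source ids in card order.
    """
    order: dict[str, int] = {}
    lines = []
    for sentence in sentences:
        markers = []
        for source_id in sentence["citations"]:
            # A source cited again reuses the number it was first given.
            n = order.setdefault(source_id, len(order) + 1)
            markers.append(f"[{n}]")
        # Escape so tokens such as <PERSON> display as text, not as HTML tags.
        lines.append(f"{html.escape(sentence['text'])} {''.join(markers)}")
    return lines, list(order)


lines, sources = number_citations([
    {"text": "Patient <PERSON> takes Lisinopril.",
     "citations": ["MedicationRequest/med-1", "Patient/p-1"]},
    {"text": "Dose is 10 MG.", "citations": ["MedicationRequest/med-1"]},
])
```

Keeping this logic free of Streamlit calls is what allows it to be exercised by plain unit tests.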
mise run test

Covers FHIR parsing & code-system mapping, the embeddable-text composition, the grounding gate (including hallucinated-citation rejection and abstention fallback), tolerant JSON parsing, PHI redaction, eval metrics, and the dashboard rendering helpers.
The system is built behind clean seams (provider, retriever, ingestion), so the extensions below are additive rather than rewrites. None are implemented yet - they are recorded here as the natural next increments:
- Hybrid retrieval (semantic + keyword/BM25). The §6 architecture target;
  slots in behind the existing `Retriever` interface next to the semantic
  baseline, with no change to the Q&A or citation layers.
- Summarization endpoint. A patient-level clinical summary that reuses the
  same grounding gate and per-sentence citation contract as `/ask`.
- Larger Synthea cohort. Ingest many patients instead of one to stress retrieval precision and patient-scoping at scale.
- Free-text clinical notes (e.g. MTSamples). Run the PHI stage and chunking over unstructured notes, not just structured FHIR resources.
- Next.js dashboard on Vercel. A production-style frontend over the same FastAPI service, if a hosted live demo is ever wanted (the Python API + pgvector would still need separate hosting).
- Not clinically validated; not fit for real use.
- No treatment recommendations or medical advice - reports only what the record states.
- Strictly retrieval + grounding (+ later summarization) over synthetic data.
- Never touches real PHI.



{
  "abstained": false,
  "sentences": [
    {
      "text": "The patient is on Lisinopril 10 MG Oral Tablet.",
      "citations": ["MedicationRequest/med-1"]
    }
  ],
  "sources": [ /* every cited source, resolved in full for inline rendering */ ]
}