Refactor REDCap extraction to use map-reduce and CoT reasoning by LakshinG · Pull Request #1 · LakshinG/Epic-api

LakshinG · 2026-05-07T02:55:12Z

Refactor REDCap extraction to use map-reduce and CoT reasoning

Replaced brittle regex text chunker with a Two-Pass Map-Reduce prompt for distilling clinical notes
Updated Pydantic schema with missing mappings for medhx_priorepisgy_type (VNS, DBS) based on the REDCap PDF Codebook
Expanded schema to include medhx_etio_focal and medhx_psych as Multi-Select variables
Implemented an internal_clinical_reasoning Chain-of-Thought field to eliminate LLM integer hallucinations
Updated Pandas export logic to expand new checkbox variables and drop reasoning column for REDCap CSV compatibility

PR created automatically by Jules for task 1218902283250309376 started by @LakshinG

- Replaced brittle regex text chunker with a Two-Pass Map-Reduce prompt for distilling clinical notes
- Updated Pydantic schema with missing mappings for medhx_priorepisgy_type (VNS, DBS) based on the REDCap PDF Codebook
- Expanded schema to include medhx_etio_focal and medhx_psych as Multi-Select variables
- Implemented an `internal_clinical_reasoning` Chain-of-Thought field to eliminate LLM integer hallucinations
- Updated Pandas export logic to expand new checkbox variables and drop reasoning column for REDCap CSV compatibility

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

google-labs-jules · 2026-05-07T02:55:13Z

👋 Jules, reporting for duty! I'm here to lend a hand with this pull request.

When you start a review, I'll add a 👀 emoji to each comment to let you know I've read it. I'll focus on feedback directed at me and will do my best to stay out of conversations between you and other bots or reviewers to keep the noise down.

I'll push a commit with your requested changes shortly after. Please note there might be a delay between these steps, but rest assured I'm on the job!

For more direct control, you can switch me to Reactive Mode. When this mode is on, I will only act on comments where you specifically mention me with @jules. You can find this option in the Pull Request section of your global Jules UI settings. You can always switch back!

New to Jules? Learn more at jules.google/docs.

For security, I will only act on instructions from the user who triggered this task.

- Added console print statements to output the Pass 1 distilled summary.
- Added console print statements to output the Pass 2 Chain-of-Thought reasoning.
- These changes will help diagnose why the model is hallucinating specific values despite the schema update.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Replaced `int` types in Pydantic schema with strict `Literal` types to force JSON schema `enum` constraint on local LLM tool output
- Hardcoded exact dictionary mappings into the system prompt to assist the 14B model's contextual understanding
- Updated `internal_clinical_reasoning` prompt to explicitly require evaluating every single field instead of terminating early

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Updated Pydantic field description for `medhx_prior_episgy` to explicitly state that having seizures is not epilepsy surgery.
- Updated Pydantic field description for `medhx_etio_focal` to instruct the model to use 999 (Unknown) if a specific structural cause is not explicitly mentioned, preventing the AI from hallucinating etiology based on psychological triggers.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Reverted Pydantic `Literal` field types back to standard `int` and `List[int]` because the complex JSON schema `enum` constraints were overwhelming the local Qwen 14B model, causing it to return empty lists.
- Relying on the updated `system_instructions` exact code mappings and improved field descriptions to guide the model instead.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Removed complex `Field` descriptions from the Pydantic schema to prevent instruction overload in the local Qwen 14B model, which was causing it to skip generating JSON fields.
- Consolidated all clinical logic, default fallback rules, negative constraints, and exact integer code mappings directly into the `system_instructions` system prompt.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
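
To illustrate the consolidation described in this commit, here is a minimal sketch of rendering code mappings and fallback rules into a single `system_instructions` string. The names `CODE_MAPPINGS`, `FALLBACK_RULES`, and `build_system_instructions` are hypothetical, and only the `5=Headaches/Neuropathy` and `999=Unknown` entries come from this PR; everything else is a placeholder, not a codebook value.

```python
# Hypothetical mapping table. Only 5=Headaches/Neuropathy (medhx_neurohx) and
# 999=Unknown (medhx_etio_focal) appear in this PR; other entries are placeholders.
CODE_MAPPINGS = {
    "medhx_neurohx": {0: "None (placeholder)", 5: "Headaches/Neuropathy"},
    "medhx_etio_focal": {1: "Structural cause (placeholder)", 999: "Unknown"},
}

# Hypothetical default/fallback rules, keyed by field name.
FALLBACK_RULES = {
    "medhx_etio_focal": "Use 999 unless a specific structural cause is explicitly stated.",
}


def build_system_instructions(mappings, fallbacks):
    """Render code mappings and fallback rules into one prompt string."""
    lines = ["Use ONLY the integer codes listed below."]
    for field, codes in mappings.items():
        rendered = ", ".join(f"{code}={label}" for code, label in sorted(codes.items()))
        lines.append(f"{field}: {rendered}")
        if field in fallbacks:
            lines.append(f"  Default: {fallbacks[field]}")
    return "\n".join(lines)


system_instructions = build_system_instructions(CODE_MAPPINGS, FALLBACK_RULES)
print(system_instructions)
```

Keeping the mappings in one dictionary means the prompt text and any later validation can share a single source of truth, which may help when the codebook changes.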

- Removed accidental duplicate definition of `sz_age` in `REDCapEpilepsyData`.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Updated `system_instructions` for `medhx_szsyndrome` to explicitly clarify that generic diagnoses (like 'complex partial seizures') do not count as formally named epilepsy syndromes.
- Updated `system_instructions` for `medhx_psych` to explicitly remind the LLM to output multiple applicable codes (like Depression and PTSD) as a comma-separated list, rather than stopping after finding the first match.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Updated `VariableReasoning.chosen_code` Pydantic description to forbid words and letters, explicitly requiring 'NONE' or comma-separated integers. This prevents the `re.findall` safety net from accidentally scraping conversational numbers (e.g., "patient had no surgery, so I will not use 2").
- Added explicit defaults to `system_instructions` for `medhx_etio_focal` and `medhx_priorepisgy_type` to force strict fallbacks when physical causes or prior surgeries are absent.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
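
The failure mode named in this commit can be reproduced in a few lines. The sketch below contrasts a loose `re.findall` scrape with a strict parser that accepts only 'NONE' or comma-separated integers. The function names `scrape_codes` and `parse_chosen_code` and the exact regex patterns are assumptions for illustration; the PR does not show the actual safety-net code.

```python
import re


def scrape_codes(text):
    """Loose fallback: grab every run of digits, wherever it appears."""
    return [int(n) for n in re.findall(r"\d+", text)]


def parse_chosen_code(text):
    """Strict parse: accept only 'NONE' or comma-separated integers."""
    cleaned = text.strip()
    if cleaned.upper() == "NONE":
        return []
    if not re.fullmatch(r"\d+(\s*,\s*\d+)*", cleaned):
        raise ValueError(f"chosen_code is not a pure code list: {text!r}")
    return [int(part) for part in cleaned.split(",")]


chatty = "patient had no surgery, so I will not use 2"
print(scrape_codes(chatty))        # [2], a code the model explicitly rejected
print(parse_chosen_code("1, 63"))  # [1, 63]
print(parse_chosen_code("NONE"))   # []
```

Raising on a conversational string, rather than silently scraping it, makes the bad output visible so it can be retried or defaulted deliberately.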

- Updated Pydantic description for `step_by_step_logic` to explicitly list all expected variable names. This prevents the local LLM from taking shortcuts and accidentally skipping fields (like `medhx_etio_focal`) during its reasoning phase, ensuring the regex safety net has the required data to process fallbacks.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Updated `system_instructions` for `medhx_neurohx` to explicitly state `5=Headaches/Neuropathy` and added a note instructing the AI to use `5` instead of `0` when encountering neuropathy in the clinical text. This resolves the final evaluation mismatch and pushes the extraction accuracy toward 100%.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Parsed PDF codebook to extract exact integer mappings for the remaining 25 core REDCap variables (which expand into 87 missing checkbox columns).
- Appended variables like `mri_yn`, `emu_asm_type`, and `medhx_si` to the `REDCapEpilepsyData` Pydantic schema and the `step_by_step_logic` prompt constraint.
- Added explicit mapping definitions for the new fields directly into the `system_instructions`.
- Updated the Pandas `expand_checkboxes` post-processing logic to properly expand the new multi-select fields (`emu_asm_type` and `emu_asmdc_type`) into their required `___1`, `___2` REDCap formats.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
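
As a rough illustration of the checkbox expansion described in this commit, the sketch below turns one multi-select list into `field___code` columns with 0/1 values and drops the reasoning text before export. It uses plain dictionaries rather than Pandas so that it stays self-contained; the function body, the sample row, and the code list `[1, 2, 3]` are assumptions and do not reproduce the PR's actual `expand_checkboxes` implementation or the codebook.

```python
def expand_checkboxes(record, field, all_codes):
    """Replace a multi-select list with one 0/1 column per code (field___code)."""
    expanded = {key: value for key, value in record.items() if key != field}
    selected = set(record.get(field) or [])
    for code in all_codes:
        expanded[f"{field}___{code}"] = 1 if code in selected else 0
    return expanded


# Hypothetical extraction result; codes 1-3 are placeholders, not codebook values.
row = {
    "record_id": "demo-1",
    "emu_asm_type": [1, 3],
    "internal_clinical_reasoning": "free-text reasoning from the model",
}

row = expand_checkboxes(row, "emu_asm_type", all_codes=[1, 2, 3])
row.pop("internal_clinical_reasoning", None)  # reasoning is not a REDCap column
print(row)
```

Emitting a column for every code, including unselected ones, keeps the CSV header stable across records regardless of which boxes were ticked.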

- Eliminated the Pass 1 'Distillation' logic. Feeding the raw clinical note directly to the LLM ensures that critical sub-sections (like EMU Admissions and MRI results) are no longer accidentally deleted by the summarizer.
- Hardened the `chosen_code` property in the Nested Chain-of-Thought schema to explicitly forbid sentences or words. This forces the LLM to output only pure integers (e.g., '1' or '63'), guaranteeing that the regex Safety Net does not inadvertently pull conversational numbers out of the AI's reasoning text.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

- Replaced `chosen_code: str` with `chosen_codes_array: List[int]` in the Nested CoT reasoning block so that the schema's type constraint forces the LLM to output integer arrays rather than conversational sentences. This eliminates the chance of the regex safety net failing to find numbers.
- Dropped the 'Distillation' step (Pass 1) to pass the raw clinical note directly into the extraction pipeline. Summarization was improperly deleting critical sub-sections (like EMU admissions and MRI results), causing those fields to evaluate to 0% accuracy.
- Mapped all 87 REDCap columns natively in the extraction framework.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
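
To show what the `List[int]` constraint amounts to, here is a hand-rolled stdlib stand-in that checks a JSON payload for an integer array under `chosen_codes_array`. The function `read_codes_array` is hypothetical and is not the Pydantic validation used in the PR; it only sketches the kind of check that typing implies.

```python
import json


def read_codes_array(raw_json):
    """Return chosen_codes_array if it is a JSON array of integers, else raise."""
    payload = json.loads(raw_json)
    codes = payload.get("chosen_codes_array")
    if not isinstance(codes, list):
        raise TypeError("chosen_codes_array must be a JSON array")
    for item in codes:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(item, bool) or not isinstance(item, int):
            raise TypeError(f"non-integer entry: {item!r}")
    return codes


print(read_codes_array('{"chosen_codes_array": [1, 63]}'))  # [1, 63]
print(read_codes_array('{"chosen_codes_array": []}'))       # []
```

With a typed array there is nothing left for a regex to scrape: a sentence in that field fails the type check outright instead of yielding stray numbers.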

- Resolved the 70% accuracy drop caused by "Instruction Overload" by splitting the massive 87-column monolithic Pydantic schema into three smaller, logical schemas (`HistoryExtraction`, `EmuExtraction`, and `ImagingExtraction`).
- Each schema now has its own dedicated system prompt containing only the relevant mapping dictionaries, allowing the `qwen2.5:14b` model to accurately process variables without skipping fields or hallucinating.
- Upgraded the `VariableReasoning` fallback safety net by converting `chosen_code` from a `str` to a `List[int]` named `chosen_codes_array`. This enforces a strict JSON typing constraint that prevents the AI from outputting conversational sentences (which previously broke the regex fallback parser).
- Reinstated the Distillation pass but explicitly protected all medications, surgeries, and imaging from summarization deletion.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
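
The split described in this commit can be sketched as one extraction pass per schema group, each with a prompt that names only its own fields, followed by a merge. The group names match the PR, but the field assignments in `SCHEMA_GROUPS`, the helper names, and the `fake_llm` stub are assumptions for illustration; a real run would call the model where the stub is.

```python
# Hypothetical grouping; the field lists are a small subset chosen for illustration.
SCHEMA_GROUPS = {
    "HistoryExtraction": ["medhx_neurohx", "medhx_psych", "medhx_etio_focal"],
    "EmuExtraction": ["emu_asm_type", "emu_asmdc_type"],
    "ImagingExtraction": ["mri_yn"],
}


def build_prompt(group, fields):
    """Each group gets a prompt naming only its own fields."""
    return f"[{group}] Extract only these variables: {', '.join(fields)}"


def extract_all(note, call_llm):
    """Run one extraction pass per group and merge the partial results."""
    merged = {}
    for group, fields in SCHEMA_GROUPS.items():
        partial = call_llm(build_prompt(group, fields), note)
        # Keep only fields that belong to this group.
        merged.update({k: v for k, v in partial.items() if k in fields})
    return merged


def fake_llm(prompt, note):
    """Stand-in for the model: returns an empty list for each requested field."""
    requested = prompt.split(": ", 1)[1].split(", ")
    return {name: [] for name in requested}


result = extract_all("example clinical note", fake_llm)
print(sorted(result))
```

Filtering each partial result to its own field list means a stray key from one pass cannot overwrite a value produced by another.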

- Upgraded the `ChatOllama` initialization in `batch_redcap_extractor.py` to point to `qwen2.5:32b` instead of the 14B variant to take advantage of its superior reasoning and strict instruction-following capabilities.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

google-labs-jules Bot and others added 15 commits May 7, 2026 05:03

Clean up duplicate Pydantic field

b3456a8

- Removed accidental duplicate definition of `sz_age` in `REDCapEpilepsyData`.

Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>

LakshinG wants to merge 16 commits into medical-data-agent-5257841151516627047 from fix-redcap-extraction-hallucinations-1218902283250309376
