Refactor REDCap extraction to use map-reduce and CoT reasoning #1
- Replaced brittle regex text chunker with a Two-Pass Map-Reduce prompt for distilling clinical notes
- Updated Pydantic schema with missing mappings for `medhx_priorepisgy_type` (VNS, DBS) based on the REDCap PDF Codebook
- Expanded schema to include `medhx_etio_focal` and `medhx_psych` as Multi-Select variables
- Implemented an `internal_clinical_reasoning` Chain-of-Thought field to eliminate LLM integer hallucinations
- Updated Pandas export logic to expand new checkbox variables and drop reasoning column for REDCap CSV compatibility
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Added console print statements to output the Pass 1 distilled summary.
- Added console print statements to output the Pass 2 Chain-of-Thought reasoning.
- These changes will help diagnose why the model is hallucinating specific values despite the schema update.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Replaced `int` types in Pydantic schema with strict `Literal` types to force JSON schema `enum` constraint on local LLM tool output
- Hardcoded exact dictionary mappings into the system prompt to assist the 14B model's contextual understanding
- Updated `internal_clinical_reasoning` prompt to explicitly require evaluating every single field instead of terminating early
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
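To make the `enum` constraint concrete, here is a minimal standard-library sketch of what a `Literal` of integer codes corresponds to in JSON Schema terms. It is not the PR's Pydantic code: the `PriorSurgery` type, its code values, and both helper functions are hypothetical names used only for illustration.

```python
from typing import Literal, get_args

# Hypothetical code set; the real codes come from the REDCap codebook.
PriorSurgery = Literal[0, 1, 999]


def enum_schema(literal_type) -> dict:
    """Build the JSON Schema fragment that a Literal of ints corresponds to."""
    return {"type": "integer", "enum": list(get_args(literal_type))}


def is_allowed(value, literal_type) -> bool:
    """True only if value is an int listed in the Literal (bools are rejected)."""
    return type(value) is int and value in get_args(literal_type)


schema = enum_schema(PriorSurgery)
```

A model constrained this way can only emit one of the listed codes, which is the property the commit was after; the trade-off is a larger schema for the model to read.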
- Updated Pydantic field description for `medhx_prior_episgy` to explicitly state that having seizures is not epilepsy surgery.
- Updated Pydantic field description for `medhx_etio_focal` to instruct the model to use 999 (Unknown) if a specific structural cause is not explicitly mentioned, preventing the AI from hallucinating etiology based on psychological triggers.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Reverted Pydantic `Literal` field types back to standard `int` and `List[int]` because the complex JSON schema `enum` constraints were overwhelming the local Qwen 14B model, causing it to return empty lists.
- Relying on the updated `system_instructions` exact code mappings and improved field descriptions to guide the model instead.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Removed complex `Field` descriptions from the Pydantic schema to prevent instruction overload in the local Qwen 14B model, which was causing it to skip generating JSON fields.
- Consolidated all clinical logic, default fallback rules, negative constraints, and exact integer code mappings directly into the `system_instructions` system prompt.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Removed accidental duplicate definition of `sz_age` in `REDCapEpilepsyData`.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Updated `system_instructions` for `medhx_szsyndrome` to explicitly clarify that generic diagnoses (like 'complex partial seizures') do not count as formally named epilepsy syndromes.
- Updated `system_instructions` for `medhx_psych` to explicitly remind the LLM to output multiple applicable codes (like Depression and PTSD) as a comma-separated list, rather than stopping after finding the first match.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Updated `VariableReasoning.chosen_code` Pydantic description to forbid words and letters, explicitly requiring 'NONE' or comma-separated integers. This prevents the `re.findall` safety net from accidentally scraping conversational numbers (e.g., "patient had no surgery, so I will not use 2").
- Added explicit defaults to `system_instructions` for `medhx_etio_focal` and `medhx_priorepisgy_type` to force strict fallbacks when physical causes or prior surgeries are absent.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
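The failure mode described here is easy to reproduce. The sketch below assumes the safety net uses a pattern like `\d+` with `re.findall`; the actual pattern in the PR is not shown, and both function names are hypothetical. It contrasts naive scraping with a strict parser that accepts only `NONE` or comma-separated integers.

```python
import re


def scrape_codes(text: str) -> list[int]:
    """Naive safety net: pull every integer out of free text."""
    return [int(n) for n in re.findall(r"\d+", text)]


def parse_chosen_code(text: str) -> list[int]:
    """Strict parse: accept only 'NONE' or comma-separated integers."""
    cleaned = text.strip()
    if cleaned.upper() == "NONE":
        return []
    if not re.fullmatch(r"\d+(\s*,\s*\d+)*", cleaned):
        raise ValueError(f"not a code list: {text!r}")
    # int() tolerates the whitespace left around each part after splitting
    return [int(part) for part in cleaned.split(",")]


sentence = "patient had no surgery, so I will not use 2"
scraped = scrape_codes(sentence)         # picks up the 2 the model rejected
strict = parse_chosen_code("1, 63")      # well-formed output parses cleanly
```

With the strict parser, a conversational answer raises an error instead of silently producing a wrong code, so the bad record can be flagged or retried.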
- Updated Pydantic description for `step_by_step_logic` to explicitly list all expected variable names. This prevents the local LLM from taking shortcuts and accidentally skipping fields (like `medhx_etio_focal`) during its reasoning phase, ensuring the regex safety net has the required data to process fallbacks.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Updated `system_instructions` for `medhx_neurohx` to explicitly state `5=Headaches/Neuropathy` and added a note instructing the AI to use `5` instead of `0` when encountering neuropathy in the clinical text. This resolves the final evaluation mismatch and pushes the extraction accuracy toward 100%.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Parsed PDF codebook to extract exact integer mappings for the remaining 25 core REDCap variables (which expand into 87 missing checkbox columns).
- Appended variables like `mri_yn`, `emu_asm_type`, and `medhx_si` to the `REDCapEpilepsyData` Pydantic schema and the `step_by_step_logic` prompt constraint.
- Added explicit mapping definitions for the new fields directly into the `system_instructions`.
- Updated the Pandas `expand_checkboxes` post-processing logic to properly expand the new multi-select fields (`emu_asm_type` and `emu_asmdc_type`) into their required `___1`, `___2` REDCap formats.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
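For readers unfamiliar with the REDCap checkbox layout, the sketch below shows the shape of the expansion on a single record. It uses plain dictionaries rather than Pandas so it runs without dependencies; the signature of `expand_checkboxes` here, the `record_id` column, and the code list `[1, 2, 3]` are assumptions for illustration, not the PR's implementation.

```python
def expand_checkboxes(row: dict, field: str, all_codes: list[int]) -> dict:
    """Replace a multi-select list with one 0/1 column per code (field___code)."""
    selected = set(row.get(field) or [])
    # Keep every other column as-is and drop the original list-valued field
    expanded = {key: value for key, value in row.items() if key != field}
    for code in all_codes:
        expanded[f"{field}___{code}"] = 1 if code in selected else 0
    return expanded


# Hypothetical record: codes 1 and 3 were selected for emu_asm_type
row = {"record_id": "001", "emu_asm_type": [1, 3]}
out = expand_checkboxes(row, "emu_asm_type", [1, 2, 3])
```

Every code in the codebook gets a column, including unselected ones, because a REDCap import expects the full set of `___N` columns for a checkbox field.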
- Eliminated the Pass 1 'Distillation' logic. Feeding the raw clinical note directly to the LLM ensures that critical sub-sections (like EMU Admissions and MRI results) are no longer accidentally deleted by the summarizer.
- Hardened the `chosen_code` property in the Nested Chain-of-Thought schema to explicitly forbid sentences or words. This forces the LLM to output only pure integers (e.g., '1' or '63'), guaranteeing that the regex Safety Net does not inadvertently pull conversational numbers out of the AI's reasoning text.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
- Replaced `chosen_code: str` with `chosen_codes_array: List[int]` in the Nested CoT reasoning block so that the schema's type constraint forces the LLM to output integer arrays rather than conversational sentences. This eliminates the chance of the regex safety net failing to find numbers.
- Dropped the 'Distillation' step (Pass 1) to pass the raw clinical note directly into the extraction pipeline. Summarization was improperly deleting critical sub-sections (like EMU admissions and MRI results), causing those fields to evaluate to 0% accuracy.
- Mapped all 87 REDCap columns natively in the extraction framework.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
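The effect of typing the field as a list of integers can be sketched with the standard library alone. The function below is a rough stand-in for the check a `List[int]` field performs, not the PR's Pydantic model; the name `parse_codes_array` is hypothetical, and it assumes the model's output is a JSON object.

```python
import json


def parse_codes_array(raw_json: str) -> list[int]:
    """Return chosen_codes_array if it is a JSON array of integers, else raise."""
    payload = json.loads(raw_json)
    codes = payload.get("chosen_codes_array")
    if not isinstance(codes, list):
        raise TypeError("chosen_codes_array must be a list")
    for code in codes:
        # bool is a subclass of int in Python, so exclude it explicitly
        if isinstance(code, bool) or not isinstance(code, int):
            raise TypeError(f"non-integer code: {code!r}")
    return codes


good = parse_codes_array('{"chosen_codes_array": [1, 63]}')
```

Because a sentence can no longer pass validation, there is nothing left for a regex fallback to misread.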
- Resolved the 70% accuracy drop caused by "Instruction Overload" by splitting the massive 87-column monolithic Pydantic schema into three smaller, logical schemas (`HistoryExtraction`, `EmuExtraction`, and `ImagingExtraction`).
- Each schema now has its own dedicated system prompt containing only the relevant mapping dictionaries, allowing the `qwen2.5:14b` model to accurately process variables without skipping fields or hallucinating.
- Upgraded the `VariableReasoning` fallback safety net by converting `chosen_code` from a `str` to a `List[int]` named `chosen_codes_array`. This enforces a strict JSON typing constraint that prevents the AI from outputting conversational sentences (which previously broke the regex fallback parser).
- Reinstated the Distillation pass but explicitly protected all medications, surgeries, and imaging from summarization deletion.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
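The split into three focused passes can be sketched as one model call per schema, each with its own prompt, merged into a single record. Everything beyond the three schema names is an assumption: the prompt text, the `extract_all` and `fake_llm` helpers, and the assignment of `medhx_psych`, `emu_asm_type`, and `mri_yn` to particular schemas are illustrative only.

```python
from typing import Callable

# Hypothetical prompt text; the real prompts hold each group's codebook mappings.
PROMPTS = {
    "HistoryExtraction": "Extract medical-history variables only.",
    "EmuExtraction": "Extract EMU admission variables only.",
    "ImagingExtraction": "Extract imaging variables only.",
}


def extract_all(note: str, call_llm: Callable[[str, str], dict]) -> dict:
    """Run one focused extraction per schema and merge the results."""
    record: dict = {}
    for schema_name, prompt in PROMPTS.items():
        fields = call_llm(prompt, note)
        overlap = record.keys() & fields.keys()
        if overlap:
            raise ValueError(f"{schema_name} repeats fields: {sorted(overlap)}")
        record.update(fields)
    return record


def fake_llm(prompt: str, note: str) -> dict:
    """Stand-in for the model call so the sketch runs without Ollama."""
    if "history" in prompt:
        return {"medhx_psych": [1]}
    if "EMU" in prompt:
        return {"emu_asm_type": [2]}
    return {"mri_yn": 1}


record = extract_all("raw clinical note text", fake_llm)
```

Each call sees only the mappings it needs, which is the point of the split; the overlap check guards against two schemas claiming the same REDCap column.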
- Upgraded the `ChatOllama` initialization in `batch_redcap_extractor.py` to point to `qwen2.5:32b` instead of the 14B variant to take advantage of its superior reasoning and strict instruction-following capabilities.
Co-authored-by: LakshinG <89883000+LakshinG@users.noreply.github.com>
PR created automatically by Jules for task 1218902283250309376 started by @LakshinG