feat(pdf): docling tables/figures extraction — advanced opt-in (R2) by lfnothias · Pull Request #12 · HolobiomicsLab/Perspicacite-AI

lfnothias · 2026-06-02T15:46:32Z

Summary

Adds docling as an opt-in advanced PDF-extraction layer (credibility-audit R2), on top of the always-on fast fitz text path.

Text stays 100% fitz (PyMuPDF → pdfplumber). Default, fast, unchanged.
docling is additive + off by default (knowledge_base.docling_extract_tables_figures: false): when enabled on the local-file ingest path, it extracts structured tables → content_type="table" chunks, on top of the fitz text. Figures are extracted + mapped to the multimodal record shape (not yet fed to the vision pipeline — follow-up).
Guarded: docling_max_pages (40) + docling_timeout_s (600); docling runs in a worker process. If the [docling] extra is absent, the PDF is oversized, or docling times out/errors, text is unaffected (graceful fallback, one structured docling_fallback log).
[docling] optional extra (uv sync --extra docling); converter forces CPU + generate_picture_images/images_scale=2.0.
Docs: docs/pdf-extraction-docling.md + CLAUDE.md "PDF extraction backends" + README line.
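
The guard described in the list above can be sketched as a small pure function. This is only an illustration: `KBConfig` and `docling_skip_reason` are hypothetical names, the defaults are the ones quoted in this PR, and the exact comparison (`>` rather than `>=`) is an assumption. The worker-process timeout is not reproduced here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KBConfig:
    """Hypothetical config holder; field names mirror the knobs named in this PR."""

    docling_extract_tables_figures: bool = False  # opt-in, off by default
    docling_max_pages: int = 40
    docling_timeout_s: int = 600


def docling_skip_reason(
    cfg: KBConfig, page_count: int, docling_installed: bool
) -> Optional[str]:
    """Return why docling should be skipped, or None if it may run.

    The fitz text path is assumed to have run already, so a skip never
    affects the extracted text.
    """
    if not cfg.docling_extract_tables_figures:
        return "disabled"
    if not docling_installed:
        return "extra_missing"
    if page_count > cfg.docling_max_pages:  # assumption: strict inequality
        return "too_many_pages"
    return None
```

A caller would emit the single structured docling_fallback log whenever the returned reason is neither None nor "disabled", and otherwise hand the PDF to the docling worker.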

Why opt-in / CPU

docling is CPU-bound here (~10 min / typical paper; ~47 s/page). The MPS/GPU path is unusable on Apple Silicon — upstream transformers RT-DETRv2 positional embedding hard-codes float64 (huggingface/transformers#28334); PYTORCH_ENABLE_MPS_FALLBACK=1 doesn't help. So docling is a deliberate batch choice, not a default. A CUDA host makes it fast.
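
As a back-of-envelope check, the rough figures quoted in this PR (~47 s/page on CPU, docling_timeout_s of 600, docling_max_pages of 40) can be combined as below. The helper names are hypothetical and the per-page cost is only the approximate number stated above, so treat the result as an estimate rather than a measurement.

```python
import math

SECONDS_PER_PAGE = 47.0  # rough CPU figure quoted in this PR
TIMEOUT_S = 600          # docling_timeout_s default quoted in this PR
MAX_PAGES = 40           # docling_max_pages default quoted in this PR


def estimated_seconds(pages: int, s_per_page: float = SECONDS_PER_PAGE) -> float:
    """Estimated CPU wall time for a PDF with the given number of pages."""
    return pages * s_per_page


def pages_within_timeout(
    timeout_s: float = TIMEOUT_S, s_per_page: float = SECONDS_PER_PAGE
) -> int:
    """Largest page count whose estimated time still fits in the timeout."""
    return math.floor(timeout_s / s_per_page)


if __name__ == "__main__":
    print(pages_within_timeout())        # pages that fit in the timeout
    print(estimated_seconds(MAX_PAGES))  # estimate for a PDF at the page cap
```

If the per-page figure holds, the timeout rather than the page cap is likely to be the binding limit on a CPU-only host, which fits the framing of docling as a deliberate batch choice.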

Scope

Wired for the local-file ingest path only. DOI/BibTeX path is text-only (adding table chunks there needs a Paper.tables field — follow-up).
Branched off main, which already includes R3 ("Anchor extracted quotes to verified source passages (R3)", #10); all 11 commits replayed cleanly, with no conflicts.

Test Plan

tests/unit/test_docling_pdf.py (records, converter config, figure/table mapping, figure→multimodal) — hermetic + docling-installed
tests/unit/test_pdf_backend_guard.py (flag/page/timeout guard + worker fallback)
tests/unit/test_docling_table_chunks.py (tables → content_type="table" chunks)
tests/unit/test_local_docs_docling_wire.py (fitz-text contract preserved)
tests/unit/test_config.py (docling config defaults coexist with R3 anchor config)
tests/unit/test_local_docs_worker.py regression — 44 passed on top of R3
Reviewer: run a live uv sync --extra docling, then ingest a local PDF with docling_extract_tables_figures: true

🤖 Generated with Claude Code


local-file ingest now passes the KB config to PDFParser.parse so the docling backend activates per the guard, and appends content_type=table chunks from any extracted tables. BibTeX/DOI path unchanged (follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 08130497b8307a927f5f7482060db63c95d55110)
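
The table-chunk step in this commit message can be illustrated with a minimal sketch. `TableRecord`, `Chunk`, `table_chunks`, and `build_chunks` are hypothetical stand-ins, not the project's actual types; the only details taken from the PR are that table chunks carry content_type="table" and are appended to the fitz text chunks, which stay untouched.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class TableRecord:
    """Hypothetical shape for one table extracted by docling."""

    page: int
    markdown: str
    caption: str = ""


@dataclass
class Chunk:
    """Hypothetical chunk type; only content_type is taken from the PR."""

    text: str
    content_type: str = "text"
    metadata: dict[str, Any] = field(default_factory=dict)


def table_chunks(tables: list[TableRecord]) -> list[Chunk]:
    """Turn each non-empty table into one content_type='table' chunk."""
    chunks = []
    for index, table in enumerate(tables):
        if not table.markdown.strip():
            continue  # nothing useful to index
        text = f"{table.caption}\n\n{table.markdown}" if table.caption else table.markdown
        chunks.append(
            Chunk(
                text=text,
                content_type="table",
                metadata={"page": table.page, "table_index": index},
            )
        )
    return chunks


def build_chunks(text_chunks: list[Chunk], tables: list[TableRecord]) -> list[Chunk]:
    """Append table chunks after the text chunks, leaving the text chunks as-is."""
    return list(text_chunks) + table_chunks(tables)
```

Keeping the table chunks strictly additive means an empty `tables` list (docling disabled, skipped, or failed) yields exactly the text-only result.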

docling auto-selects MPS on Apple Silicon, which raises "Cannot convert a MPS Tensor to float64" and fails conversion on every page. Pin AcceleratorOptions(device=CPU). Verified: 13-page PDF extracts 6 figures on CPU (~10 min); MPS unusable even with PYTORCH_ENABLE_MPS_FALLBACK=1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit a9893e738a8d6b3b5bd8afc3957a1d6ac849c942)
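
A hedged sketch of what pinning the converter to CPU might look like. The function name `build_cpu_converter` is hypothetical, the docling import paths are as I understand them from docling's documentation and have moved between versions, and `generate_picture_images = True` is an assumption (the PR names the option without stating its value). The function returns None when the [docling] extra is absent, mirroring the graceful fallback described in this PR.

```python
def build_cpu_converter():
    """Return a docling DocumentConverter pinned to CPU, or None if docling is missing."""
    try:
        from docling.datamodel.base_models import InputFormat
        from docling.datamodel.pipeline_options import PdfPipelineOptions
        from docling.document_converter import DocumentConverter, PdfFormatOption

        try:
            # Newer docling releases keep accelerator settings in their own module.
            from docling.datamodel.accelerator_options import (
                AcceleratorDevice,
                AcceleratorOptions,
            )
        except ImportError:
            # Older releases exposed them from pipeline_options.
            from docling.datamodel.pipeline_options import (
                AcceleratorDevice,
                AcceleratorOptions,
            )
    except ImportError:
        return None  # [docling] extra not installed: caller keeps the fitz text only

    options = PdfPipelineOptions()
    # Pin CPU so docling does not auto-select MPS on Apple Silicon.
    options.accelerator_options = AcceleratorOptions(device=AcceleratorDevice.CPU)
    options.generate_picture_images = True  # assumption: enabled so figures can be exported
    options.images_scale = 2.0
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=options)}
    )
```

Deferring the imports into the function keeps docling truly optional: importing the surrounding module never fails on a host that did not run uv sync --extra docling.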

…ve advanced opt-in (R2)

Text extraction stays 100% fitz (fast, default). docling no longer replaces the text path. New advanced flag docling_extract_tables_figures (off by default) runs docling on PDF ingest ONLY to append structured table chunks, guarded by docling_max_pages + docling_timeout_s (now 600s). Text is unaffected if docling is absent/oversized/times out.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 01c4b02c3269d21ad2b8f981839091e4227d2fa2)

…ction (R2)

New docs/pdf-extraction-docling.md (two-layer model, enabling, guard knobs, CPU cost + MPS limitation, scope/limits); CLAUDE.md "PDF extraction backends" pointer; README feature line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 8b794ed8369aee268a1377b5b9aece071fbd3b9c)

lfnothias · 2026-06-19T16:51:09Z

Companion local-LLM fix: #22 forwards num_ctx to Ollama so long RAG synthesis prompts aren't silently truncated (the empty-output issue on local Mistral). Independent of this PR — docling extraction (#12) and Ollama context (#22) address the two halves of the same user report (full-text extraction + local synthesis).
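
For context on the companion fix, Ollama's REST API accepts a per-request context length through the `options.num_ctx` field; when it is omitted, the server falls back to its own default context length. The sketch below only builds the request body and does not reflect how #22 actually implements the forwarding; `build_ollama_generate_payload` is a hypothetical helper.

```python
from typing import Any, Optional


def build_ollama_generate_payload(
    model: str, prompt: str, num_ctx: Optional[int] = None
) -> dict[str, Any]:
    """Build a JSON body for Ollama's /api/generate endpoint.

    When num_ctx is given it is forwarded under "options", so a long RAG
    synthesis prompt is not cut to the server's default context length.
    """
    payload: dict[str, Any] = {"model": model, "prompt": prompt, "stream": False}
    if num_ctx is not None:
        payload["options"] = {"num_ctx": num_ctx}
    return payload
```

For example, `build_ollama_generate_payload("mistral", long_prompt, num_ctx=8192)` would request an 8192-token window; the value itself is illustrative.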

lfnothias and others added 11 commits June 2, 2026 17:42

feat(config): pdf_backend + docling guard knobs (R2)

83e949d

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit f82b9a57e8422913480312d888f4e6002a55ecd8)

build: add [docling] optional extra (R2)

1c781ea

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit d0ea45f4084aca111a52bf9a77323edf3fbf266c)

feat(parsers): docling record types + ParsedContent tables/figures (R2)

329928b

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 2fa29a6961b62bf8d547db940918ffadf34f5a5b)

feat(parsers): DoclingPDFParser converter + figure/table mapping (R2)

2a66d5d

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 8298502a2836103e015af1b1d0dc9d63ad277db9)

feat(parsers): docling backend selector + page/timeout guard (R2)

067c840

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit dcd8351d68916713a0089ae3b5c477b37eccb00e)

feat(chunking): emit content_type=table chunks from docling tables (R2)

440e744

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit fd47e9541c3d6f4fbaed0b36a2218ff69fc95457)

feat(parsers): map docling figures to multimodal record shape (R2)

2a4ee2e

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 16495f54590babcdc547c26e09d5edeb5250ba31)

lfnothias merged commit a9aa251 into main Jun 2, 2026
1 of 2 checks passed

lfnothias mentioned this pull request Jun 19, 2026

fix(llm): forward num_ctx to Ollama (local-model context truncation) #22

Merged

2 tasks

lfnothias mentioned this pull request Jun 19, 2026

Chore/harden perspicacite #11

Closed

8 tasks

lfnothias deleted the feat/docling-r2 branch June 19, 2026 17:48

lfnothias merged 11 commits into main from feat/docling-r2
