feat(pdf): docling tables/figures extraction — advanced opt-in (R2)#12

Merged
lfnothias merged 11 commits into main from feat/docling-r2 on Jun 2, 2026

@lfnothias

Collaborator

Summary

Adds docling as an opt-in advanced PDF-extraction layer (credibility-audit R2), on top of the always-on fast fitz text path.

  • Text stays 100% fitz (PyMuPDF → pdfplumber). Default, fast, unchanged.
  • docling is additive and off by default (knowledge_base.docling_extract_tables_figures: false). When enabled on the local-file ingest path, it extracts structured tables as content_type="table" chunks, on top of the fitz text. Figures are extracted and mapped to the multimodal record shape, but are not yet fed to the vision pipeline (follow-up).
  • Guarded: docling_max_pages (40) + docling_timeout_s (600); docling runs in a worker process. If the [docling] extra is absent, the PDF is oversized, or docling times out/errors, text is unaffected (graceful fallback, one structured docling_fallback log).
  • [docling] optional extra (uv sync --extra docling); converter forces CPU + generate_picture_images/images_scale=2.0.
  • Docs: docs/pdf-extraction-docling.md + CLAUDE.md "PDF extraction backends" + README line.

Why opt-in / CPU

docling is CPU-bound here (~10 min / typical paper; ~47 s/page). The MPS/GPU path is unusable on Apple Silicon — upstream transformers RT-DETRv2 positional embedding hard-codes float64 (huggingface/transformers#28334); PYTORCH_ENABLE_MPS_FALLBACK=1 doesn't help. So docling is a deliberate batch choice, not a default. A CUDA host makes it fast.
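
A minimal sketch of pinning the converter to CPU, assuming a docling 2.x install. The option names match the ones this PR mentions (`AcceleratorOptions(device=CPU)`, `generate_picture_images`, `images_scale=2.0`); the function name and the "return `None` when the extra is missing" behaviour are illustrative assumptions, and the import location of the accelerator classes may differ between docling releases, which is why two paths are tried.

```python
def build_cpu_converter():
    """Return a CPU-pinned docling DocumentConverter, or None if docling is absent."""
    try:
        from docling.datamodel.base_models import InputFormat
        from docling.datamodel.pipeline_options import PdfPipelineOptions
        from docling.document_converter import DocumentConverter, PdfFormatOption
        try:
            # Location in newer releases (assumption: may vary by version).
            from docling.datamodel.accelerator_options import (
                AcceleratorDevice,
                AcceleratorOptions,
            )
        except ImportError:
            # Location in earlier 2.x releases.
            from docling.datamodel.pipeline_options import (
                AcceleratorDevice,
                AcceleratorOptions,
            )
    except ImportError:
        # The [docling] extra is not installed; the caller falls back to text only.
        return None

    opts = PdfPipelineOptions()
    # Pin CPU so docling does not auto-select MPS on Apple Silicon.
    opts.accelerator_options = AcceleratorOptions(device=AcceleratorDevice.CPU)
    # Keep figure crops at 2x scale so they can be mapped to records later.
    opts.generate_picture_images = True
    opts.images_scale = 2.0

    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )
```

Importing docling lazily inside the function keeps the default fitz path free of the heavy dependency when the extra is not installed.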

Scope

Test Plan

  • tests/unit/test_docling_pdf.py (records, converter config, figure/table mapping, figure→multimodal) — hermetic + docling-installed
  • tests/unit/test_pdf_backend_guard.py (flag/page/timeout guard + worker fallback)
  • tests/unit/test_docling_table_chunks.py (tables → content_type="table" chunks)
  • tests/unit/test_local_docs_docling_wire.py (fitz-text contract preserved)
  • tests/unit/test_config.py (docling config defaults coexist with R3 anchor config)
  • tests/unit/test_local_docs_worker.py regression — 44 passed on top of R3
  • Reviewer: live uv sync --extra docling + ingest a local PDF with docling_extract_tables_figures: true
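
The contract the table-chunk and wiring tests exercise (fitz text chunks preserved, table chunks appended with `content_type="table"`) can be sketched roughly as follows. The `Chunk` class, the table dict shape (`markdown`, `page`), and the function name are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Chunk:
    # Hypothetical chunk shape; only content_type="table" comes from the PR.
    text: str
    content_type: str = "text"
    metadata: Dict = field(default_factory=dict)


def append_table_chunks(text_chunks: List[Chunk], tables: List[Dict]) -> List[Chunk]:
    """Return the fitz text chunks unchanged, followed by one chunk per table."""
    out = list(text_chunks)  # copy so the original text chunks are never mutated
    for index, table in enumerate(tables):
        markdown = (table.get("markdown") or "").strip()
        if not markdown:
            continue  # skip tables docling could not serialise
        out.append(
            Chunk(
                text=markdown,
                content_type="table",
                metadata={"table_index": index, "page": table.get("page")},
            )
        )
    return out
```

With an empty `tables` list (docling disabled or fallen back), the result equals the input text chunks, which is the "text is unaffected" property.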

🤖 Generated with Claude Code

lfnothias and others added 11 commits June 2, 2026 17:42
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit f82b9a57e8422913480312d888f4e6002a55ecd8)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit d0ea45f4084aca111a52bf9a77323edf3fbf266c)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit 2fa29a6961b62bf8d547db940918ffadf34f5a5b)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit 8298502a2836103e015af1b1d0dc9d63ad277db9)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit dcd8351d68916713a0089ae3b5c477b37eccb00e)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit fd47e9541c3d6f4fbaed0b36a2218ff69fc95457)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit 16495f54590babcdc547c26e09d5edeb5250ba31)
local-file ingest now passes the KB config to PDFParser.parse so the
docling backend activates per the guard, and appends content_type=table
chunks from any extracted tables. BibTeX/DOI path unchanged (follow-up).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit 08130497b8307a927f5f7482060db63c95d55110)
docling auto-selects MPS on Apple Silicon, which raises "Cannot convert a
MPS Tensor to float64" and fails conversion on every page. Pin
AcceleratorOptions(device=CPU). Verified: 13-page PDF extracts 6 figures
on CPU (~10min); MPS unusable even with PYTORCH_ENABLE_MPS_FALLBACK=1.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit a9893e738a8d6b3b5bd8afc3957a1d6ac849c942)
…ve advanced opt-in (R2)

Text extraction stays 100% fitz (fast, default). docling no longer
replaces the text path. New advanced flag docling_extract_tables_figures
(off by default) runs docling on PDF ingest ONLY to append structured
table chunks, guarded by docling_max_pages + docling_timeout_s (now 600s).
Text is unaffected if docling is absent/oversized/times out.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit 01c4b02c3269d21ad2b8f981839091e4227d2fa2)
…ction (R2)

New docs/pdf-extraction-docling.md (two-layer model, enabling, guard knobs,
CPU cost + MPS limitation, scope/limits); CLAUDE.md "PDF extraction backends"
pointer; README feature line.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
(cherry picked from commit 8b794ed8369aee268a1377b5b9aece071fbd3b9c)
@lfnothias lfnothias merged commit a9aa251 into main Jun 2, 2026
1 of 2 checks passed
@lfnothias

Collaborator Author

Companion local-LLM fix: #22 forwards num_ctx to Ollama so long RAG synthesis prompts aren't silently truncated (the empty-output issue on local Mistral). Independent of this PR — docling extraction (#12) and Ollama context (#22) address the two halves of the same user report (full-text extraction + local synthesis).

@lfnothias lfnothias mentioned this pull request Jun 19, 2026
8 tasks
@lfnothias lfnothias deleted the feat/docling-r2 branch June 19, 2026 17:48