feat(pdf): docling tables/figures extraction — advanced opt-in (R2)#12
Merged
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit f82b9a57e8422913480312d888f4e6002a55ecd8)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit d0ea45f4084aca111a52bf9a77323edf3fbf266c)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 2fa29a6961b62bf8d547db940918ffadf34f5a5b)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 8298502a2836103e015af1b1d0dc9d63ad277db9)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit dcd8351d68916713a0089ae3b5c477b37eccb00e)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit fd47e9541c3d6f4fbaed0b36a2218ff69fc95457)
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 16495f54590babcdc547c26e09d5edeb5250ba31)
local-file ingest now passes the KB config to PDFParser.parse so the docling backend activates per the guard, and appends content_type=table chunks from any extracted tables. BibTeX/DOI path unchanged (follow-up). Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 08130497b8307a927f5f7482060db63c95d55110)
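As a rough illustration of the table-chunk step in this commit, the sketch below appends table chunks after the existing text chunks. Only the `content_type="table"` marker comes from this PR; the helper names (`table_to_chunk`, `append_table_chunks`), the `rows`/`page` record layout, and the Markdown rendering are assumptions made for the example, not the project's actual code.

```python
from typing import Any


def table_to_chunk(rows: list[list[str]], page: int) -> dict[str, Any]:
    """Render one extracted table as a Markdown text chunk (hypothetical shape)."""
    if not rows:
        raise ValueError("table has no rows")
    header, *body = rows
    lines = [
        "| " + " | ".join(header) + " |",
        "| " + " | ".join("---" for _ in header) + " |",
    ]
    lines += ["| " + " | ".join(row) + " |" for row in body]
    # content_type="table" is the marker named in the PR; other keys are assumed.
    return {"content_type": "table", "page": page, "text": "\n".join(lines)}


def append_table_chunks(text_chunks: list[dict], tables: list[dict]) -> list[dict]:
    """Leave the text chunks untouched and append one chunk per extracted table."""
    return list(text_chunks) + [table_to_chunk(t["rows"], t["page"]) for t in tables]
```

Appending rather than interleaving keeps the text chunks byte-for-byte identical whether or not any tables were extracted, which matches the "text is unaffected" contract stated elsewhere in this PR.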
docling auto-selects MPS on Apple Silicon, which raises "Cannot convert a MPS Tensor to float64" and fails conversion on every page. Pin AcceleratorOptions(device=CPU). Verified: 13-page PDF extracts 6 figures on CPU (~10min); MPS unusable even with PYTORCH_ENABLE_MPS_FALLBACK=1. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit a9893e738a8d6b3b5bd8afc3957a1d6ac849c942)
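A minimal sketch of the CPU pin described in this commit, assuming a docling 2.x install where `AcceleratorOptions` and `AcceleratorDevice` are importable from one of the two module paths tried below. The function name `build_cpu_converter` and the `num_threads` default are hypothetical; the picture-image options are the ones named in the PR summary.

```python
def build_cpu_converter(num_threads: int = 4):
    """Build a docling DocumentConverter pinned to CPU.

    docling is imported lazily so this module still loads when the
    optional [docling] extra is not installed.
    """
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    try:  # newer docling releases (assumed module path)
        from docling.datamodel.accelerator_options import (
            AcceleratorDevice,
            AcceleratorOptions,
        )
    except ImportError:  # older releases exposed them here
        from docling.datamodel.pipeline_options import (
            AcceleratorDevice,
            AcceleratorOptions,
        )

    opts = PdfPipelineOptions()
    # Pin to CPU so docling does not auto-select MPS on Apple Silicon.
    opts.accelerator_options = AcceleratorOptions(
        num_threads=num_threads, device=AcceleratorDevice.CPU
    )
    # Figure extraction settings mentioned in the PR summary.
    opts.generate_picture_images = True
    opts.images_scale = 2.0
    return DocumentConverter(
        format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)}
    )
```

Calling `build_cpu_converter().convert(path)` would then run the whole pipeline on CPU, at the cost reported in this commit.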
…ve advanced opt-in (R2) Text extraction stays 100% fitz (fast, default). docling no longer replaces the text path. New advanced flag docling_extract_tables_figures (off by default) runs docling on PDF ingest ONLY to append structured table chunks, guarded by docling_max_pages + docling_timeout_s (now 600s). Text is unaffected if docling is absent/oversized/times out. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 01c4b02c3269d21ad2b8f981839091e4227d2fa2)
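The guard and fallback behaviour described in this commit could look roughly like the sketch below. The config key names and defaults are taken from this PR; the `DoclingGuard` class, the `skip_reason` and `run_worker` helpers, the reason strings, the inclusive page limit, and the use of `subprocess` for the worker are assumptions for illustration only.

```python
import json
import subprocess
import sys
from dataclasses import dataclass


@dataclass
class DoclingGuard:
    # Field names mirror the config keys in this PR; the class is hypothetical.
    docling_extract_tables_figures: bool = False
    docling_max_pages: int = 40
    docling_timeout_s: float = 600.0


def skip_reason(cfg: DoclingGuard, page_count: int, docling_installed: bool):
    """Return why docling should be skipped, or None if it may run."""
    if not cfg.docling_extract_tables_figures:
        return "flag_off"
    if not docling_installed:
        return "extra_missing"
    if page_count > cfg.docling_max_pages:  # limit assumed to be inclusive
        return "oversized"
    return None


def run_worker(code: str, timeout_s: float) -> list:
    """Run `code` in a child Python process; return [] on timeout or error."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
            check=True,
        )
        return json.loads(proc.stdout)
    except (
        subprocess.TimeoutExpired,
        subprocess.CalledProcessError,
        json.JSONDecodeError,
    ):
        # Fallback: no tables are appended and the text path is untouched.
        return []
```

Running the extraction in a separate process means a hang or crash inside docling can be cut off at the timeout without affecting the text extraction that already ran in the parent.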
…ction (R2) New docs/pdf-extraction-docling.md (two-layer model, enabling, guard knobs, CPU cost + MPS limitation, scope/limits); CLAUDE.md "PDF extraction backends" pointer; README feature line. Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com> (cherry picked from commit 8b794ed8369aee268a1377b5b9aece071fbd3b9c)
Companion local-LLM fix: #22 forwards |
Summary
Adds docling as an opt-in advanced PDF-extraction layer (credibility-audit R2), on top of the always-on fast fitz text path.
- Opt-in flag (`knowledge_base.docling_extract_tables_figures: false`): when enabled on the local-file ingest path, docling extracts structured tables → `content_type="table"` chunks, on top of the fitz text. Figures are extracted and mapped to the multimodal record shape (not yet fed to the vision pipeline; follow-up).
- `docling_max_pages` (40) + `docling_timeout_s` (600); docling runs in a worker process. If the `[docling]` extra is absent, the PDF is oversized, or docling times out or errors, text is unaffected (graceful fallback, one structured `docling_fallback` log).
- `[docling]` optional extra (`uv sync --extra docling`); converter forces CPU + `generate_picture_images` / `images_scale=2.0`.
- `docs/pdf-extraction-docling.md` + CLAUDE.md "PDF extraction backends" + README line.

Why opt-in / CPU
docling is CPU-bound here (~10 min / typical paper; ~47 s/page). The MPS/GPU path is unusable on Apple Silicon: the upstream `transformers` RT-DETRv2 positional embedding hard-codes `float64` (huggingface/transformers#28334), and `PYTORCH_ENABLE_MPS_FALLBACK=1` doesn't help. So docling is a deliberate batch choice, not a default. A CUDA host makes it fast.

Scope
- …`Paper.tables` field; follow-up).
- …`main` (includes R3: Anchor extracted quotes to verified source passages (R3) #10); all 11 commits replayed cleanly, no conflicts.

Test Plan
- `tests/unit/test_docling_pdf.py` (records, converter config, figure/table mapping, figure→multimodal): hermetic + docling-installed
- `tests/unit/test_pdf_backend_guard.py` (flag/page/timeout guard + worker fallback)
- `tests/unit/test_docling_table_chunks.py` (tables → `content_type="table"` chunks)
- `tests/unit/test_local_docs_docling_wire.py` (fitz-text contract preserved)
- `tests/unit/test_config.py` (docling config defaults coexist with R3 anchor config)
- `tests/unit/test_local_docs_worker.py` regression: 44 passed on top of R3
- `uv sync --extra docling` + ingest a local PDF with `docling_extract_tables_figures: true`

🤖 Generated with Claude Code