HolobiomicsLab · lfnothias · Jun 2, 2026
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -96,6 +96,10 @@ The `AgenticOrchestrator` ([src/perspicacite/rag/agentic/orchestrator.py](src/pe
 
 Publisher API keys are passed as kwargs; missing keys skip that source gracefully. Check `content_type` in the result: `"structured"` > `"full_text"` > `"abstract"` > `"none"`.
 
+### PDF extraction backends
+
+PDF **text** is always extracted with PyMuPDF (`fitz`), falling back to `pdfplumber` (`pipeline/parsers/pdf.py`): fast, default, on every ingest. **Structured tables + figures** are an opt-in advanced layer via **docling**, off by default (`knowledge_base.docling_extract_tables_figures`, guarded by `docling_max_pages` / `docling_timeout_s`; needs `uv sync --extra docling`). When enabled on the local-file ingest path, docling adds `content_type="table"` chunks on top of the fitz text; if docling is not installed, the PDF exceeds the page cap, or docling errors or times out, the text is unaffected. docling is CPU-bound (roughly 45–50 s/page) and the MPS/GPU path is unusable on Apple Silicon. Full details: [docs/pdf-extraction-docling.md](docs/pdf-extraction-docling.md).
+
 ### Retrieval
 
 `ChromaVectorStore` ([src/perspicacite/retrieval/chroma_store.py](src/perspicacite/retrieval/chroma_store.py)) wraps ChromaDB. KB collections are named via `chroma_collection_name_for_kb()` from `models/kb.py`. The hybrid retriever ([src/perspicacite/retrieval/hybrid.py](src/perspicacite/retrieval/hybrid.py)) combines ChromaDB cosine scores with BM25Okapi scores; weights default to 0.5/0.5 but can optionally be determined by the LLM at query time. `MultiKBRetriever` ([src/perspicacite/retrieval/multi_kb.py](src/perspicacite/retrieval/multi_kb.py)) fans a query across multiple KB collections, merges by score, deduplicates by `paper_id`, and tags results with `kb_name`; use `check_embedding_compat(kb_metas)` to validate that all queried KBs share the same embedding model before retrieval.

diff --git a/README.md b/README.md
@@ -28,6 +28,7 @@
 
 - **Multi-database search** — Semantic Scholar, OpenAlex, PubMed, arXiv, HAL, DBLP via SciLEx
 - **Unified content pipeline** — PMC JATS XML, arXiv HTML, OA PDFs, publisher APIs, and institutional-access via browser-cookie replay; quality-priority routing
+- **PDF extraction** — fast PyMuPDF text on every ingest, plus optional [docling](docs/pdf-extraction-docling.md) layout extraction for structured tables/figures (advanced, off by default)
 - **6 RAG modes** — Basic, Advanced, Profound, Agentic, Literature Survey, Contradiction; per-stage LLM tiering (Haiku routing/screening, Sonnet synthesis)
 - **Knowledge base management** — BibTeX import, DOI bulk-add, local document ingest, Zotero-collection import; async ingestion with SSE progress streaming
 - **Citation-graph expansion** — forward + backward snowball over OpenAlex; automatic Semantic Scholar fallback for arXiv-seeded papers (see [docs/concepts/citation-graph.md](docs/concepts/citation-graph.md))

diff --git a/config.example.yml b/config.example.yml
@@ -67,6 +67,13 @@ knowledge_base:
   embedding_model: "text-embedding-3-small"
   chunk_size: 1000
   chunk_overlap: 200
+  # Advanced: extract structured tables + figures with docling, IN ADDITION to
+  # the always-on fast fitz text extraction. Requires `uv sync --extra docling`.
+  # docling is CPU-only here and slow (roughly 45–50 s/page), so this is off by default;
+  # turn it on for high-value PDFs where tables/figures matter.
+  docling_extract_tables_figures: false
+  docling_max_pages: 40     # skip docling extras for PDFs larger than this
+  docling_timeout_s: 600    # per-document docling wall-clock cap; on timeout, skip extras
   chunking_method: "token"
   default_top_k: 10
   similarity_threshold: 0.7

diff --git a/docs/pdf-extraction-docling.md b/docs/pdf-extraction-docling.md
@@ -0,0 +1,89 @@
+# PDF extraction: fast text + optional docling tables/figures
+
+Perspicacité extracts PDF content in **two independent layers**:
+
+| Layer | Engine | Runs | Output | Speed |
+|-------|--------|------|--------|-------|
+| **Text** (always on) | PyMuPDF (`fitz`) → `pdfplumber` fallback | every PDF ingest | full body text + sections | fast (sub-second) |
+| **Tables + figures** (opt-in, advanced) | docling layout model | only when enabled | structured tables as retrievable chunks | slow (CPU-bound, ~45–50 s/page) |
+
+The layers are decoupled: **text never depends on docling.** If the `[docling]`
+extra is not installed, the PDF exceeds the page cap, or docling errors/times
+out, you still get the full fitz text — you simply don't get the table chunks.
+Enabling docling can only *add* content, never break ingest.
+
+## Why docling is off by default
+
+Docling runs the RT-DETR layout model + TableFormer. On CPU this takes roughly
+**45–50 s per page (~10 min for a typical paper)**. On Apple Silicon the GPU
+(MPS) path is currently **unusable** — the upstream `transformers` RT-DETRv2
+positional embedding hard-codes `float64`, which MPS does not support
+(see [huggingface/transformers#28334](https://github.com/huggingface/transformers/issues/28334));
+`PYTORCH_ENABLE_MPS_FALLBACK=1` does not help. So docling here is a deliberate,
+batch/offline choice, not a hot-path default. A CUDA machine makes it fast
+enough for routine use.
+
+## Enabling docling
+
+1. Install the optional extra (one-time, heavy — pulls torch + layout models):
+
+   ```bash
+   uv sync --extra docling
+   ```
+
+2. In `config.yml` under `knowledge_base:`:
+
+   ```yaml
+   docling_extract_tables_figures: true   # default: false
+   docling_max_pages: 40                  # PDFs larger than this skip docling (text-only)
+   docling_timeout_s: 600                 # per-document wall-clock cap; on timeout, keep text, skip extras
+   ```
+
+3. Ingest **local PDF files** (the local-files / dropzone path). Each PDF gets
+   fitz text **plus** any tables docling extracts, added as searchable chunks
+   tagged `content_type="table"` (caption + page preserved in metadata).
+
+## Guard behaviour (config knobs)
+
+- `docling_extract_tables_figures` (bool, default `false`) — master switch for
+  the advanced layer.
+- `docling_max_pages` (int, default `40`) — documents with more pages skip
+  docling and use text-only fitz (avoids the worst-case multi-minute cost).
+- `docling_timeout_s` (int, default `600`) — per-document wall-clock cap. docling
+  runs in a worker process; on timeout it is abandoned and ingest falls back to
+  the already-extracted fitz text. Every fallback logs one structured
+  `docling_fallback` event (`reason=oversized|timeout|error`).
+
+## Scope and current limits
+
+- **Wired for the local-file ingest path** (`integrations/local_docs.py`). The
+  DOI/BibTeX download path is text-only for now (adding table chunks there needs
+  a `Paper.tables` field — a follow-up).
+- **Tables become chunks today; figures are extracted but not yet consumed.**
+  Docling figure records are produced (caption + image, dimensions populated)
+  and mapped to the existing multimodal record shape, but feeding figure images
+  into the answer/vision pipeline is a follow-up.
+- **CPU-only in practice** on Apple Silicon (see above). Prefer a CUDA host or a
+  remote docling service for large batches.
+
+## Implementation pointers
+
+- Converter + record mapping: `src/perspicacite/pipeline/parsers/docling_pdf.py`
+  (`DoclingPDFParser`, `DoclingTable`, `DoclingFigure`,
+  `figure_to_multimodal_record`). The converter forces
+  `AcceleratorDevice.CPU` and enables `generate_picture_images` + `images_scale=2.0`
+  (without picture-image rendering, `PictureItem.get_image()` returns `None` and
+  every figure is dropped).
+- Backend guard + worker: `src/perspicacite/pipeline/parsers/pdf.py`
+  (`_should_run_docling_extras`, `_run_docling_with_timeout`, `_docling_importable`).
+- Table → chunk: `src/perspicacite/pipeline/chunking_dispatch.py`
+  (`table_records_to_chunks`).
+- Config: `src/perspicacite/config/schema.py` (`KnowledgeBaseConfig`).
+
+## Note on full text vs. abstracts
+
+If a knowledge base shows only abstracts, that is a **source** issue, not a
+docling one: a Zotero `.bib` carries abstracts only. To get full text, ingest
+the actual **PDFs** (local-file path) — the fast fitz layer already returns the
+complete body text, no docling required. Enable docling only when you also want
+the papers' **tables** as retrievable content.
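
Reviewer sketch (not part of the patch): the table-to-chunk mapping described in `docs/pdf-extraction-docling.md` can be illustrated with plain dataclasses. `TableRecord` and `build_table_chunk` are hypothetical stand-ins, not the project's `DoclingTable` or `table_records_to_chunks`. The sketch assumes only that a table carries `caption`, `markdown`, and `page`, and it borrows the `{paper_id}:table:{index}` id shape from the `chunking_dispatch.py` hunk.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TableRecord:
    """Hypothetical stand-in for a docling table record."""
    markdown: str
    caption: Optional[str] = None
    page: Optional[int] = None


def build_table_chunk(table: TableRecord, paper_id: str, index: int) -> dict:
    """Compose one retrievable chunk: caption (if any) above the table markdown."""
    body = f"{table.caption}\n\n{table.markdown}" if table.caption else table.markdown
    return {
        "id": f"{paper_id}:table:{index}",
        "text": body.strip(),
        "metadata": {"content_type": "table", "page": table.page, "chunk_index": index},
    }


tables = [
    TableRecord(markdown="| a | b |\n|---|---|\n| 1 | 2 |", caption="Table 1. Example", page=3),
    TableRecord(markdown="| x |\n|---|\n| 9 |"),
]
# Continue numbering after the text chunks already produced (assumed: 5 here).
chunks = [build_table_chunk(t, "paper-1", 5 + i) for i, t in enumerate(tables)]
```

Numbering table chunks after the existing text chunks keeps `chunk_index` unique within a paper, which is presumably why the patch passes `start_index=len(chunks)`.
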
diff --git a/pyproject.toml b/pyproject.toml
@@ -151,6 +151,18 @@ adapters = [
     "indicium-adapters-metabolomics>=0.1.0",
 ]
 
+# docling — high-fidelity PDF -> structured document conversion (layout,
+# tables, sections) for the content pipeline. Heavier than the other
+# extras (pulls in torch-backed models + pandas). Install only when
+# docling-based parsing is needed.
+#
+# Install with:
+#   uv sync --extra docling
+docling = [
+    "docling>=2.5,<3",
+    "pandas>=2.0,<3",
+]
+
 [project.scripts]
 perspicacite = "perspicacite.cli:main"
 

diff --git a/src/perspicacite/config/schema.py b/src/perspicacite/config/schema.py
@@ -92,6 +92,25 @@ class KnowledgeBaseConfig(BaseModel):
     embedding_model: str = "text-embedding-3-small"
     chunk_size: int = Field(default=1000, ge=100, le=10000)
     chunk_overlap: int = Field(default=200, ge=0, le=1000)
+    docling_extract_tables_figures: bool = Field(
+        default=False,
+        description=(
+            "Advanced: when True, run docling (CPU, slow ~min/page) to extract "
+            "structured tables + figures from PDFs IN ADDITION to the always-on "
+            "fitz text extraction. Off by default."
+        ),
+    )
+    docling_max_pages: int = Field(
+        default=40, ge=1,
+        description="Skip the docling extras pass for PDFs with more pages than this.",
+    )
+    docling_timeout_s: int = Field(
+        default=600, ge=1,
+        description=(
+            "Per-document wall-clock cap for the docling extras pass; "
+            "on timeout, skip extras."
+        ),
+    )
     chunking_method: Literal["token", "semantic", "agentic"] = "token"
     default_top_k: int = Field(default=10, ge=1, le=100)
     similarity_threshold: float = Field(default=0.7, ge=0.0, le=1.0)

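Reviewer sketch (not part of the patch): one way the three config fields above could drive the guard decision. `DoclingGuardConfig` and `docling_skip_reason` are hypothetical names, and the `"disabled"` and `"not_installed"` labels are invented for illustration; only `oversized` appears among the documented `docling_fallback` reasons. The actual `_should_run_docling_extras` is not shown in this diff and may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class DoclingGuardConfig:
    """Hypothetical mirror of the three docling fields on KnowledgeBaseConfig."""
    docling_extract_tables_figures: bool = False
    docling_max_pages: int = 40
    docling_timeout_s: int = 600


def docling_skip_reason(
    pages: int, cfg: DoclingGuardConfig, importable: bool
) -> Optional[str]:
    """Return None when the extras pass should run, else a short reason to skip it."""
    if not cfg.docling_extract_tables_figures:
        return "disabled"        # master switch is off (the default)
    if not importable:
        return "not_installed"   # the optional extra is missing
    if pages > cfg.docling_max_pages:
        return "oversized"       # strictly more pages than the cap
    return None


enabled = DoclingGuardConfig(docling_extract_tables_figures=True)
reasons = [
    docling_skip_reason(10, DoclingGuardConfig(), importable=True),
    docling_skip_reason(10, enabled, importable=False),
    docling_skip_reason(40, enabled, importable=True),
    docling_skip_reason(41, enabled, importable=True),
]
```

Checking the cheap conditions first (switch, import, page count) means the expensive docling pass is never started for documents that would be skipped anyway.
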
diff --git a/src/perspicacite/integrations/local_docs.py b/src/perspicacite/integrations/local_docs.py
@@ -104,10 +104,11 @@ async def _read_text(path: Path, content_type: str, pdf_parser) -> str | None:
         parsed = await pdf_parser.parse(path)
         return parsed.text or None
     try:
-        return path.read_text(encoding="utf-8", errors="replace")
+        raw = path.read_text(encoding="utf-8", errors="replace")
     except Exception as exc:
         logger.warning("local_docs_read_failed", path=str(path), error=str(exc))
         return None
+    return raw or None
 
 
 async def _ingest_files(
@@ -158,6 +159,23 @@ async def _ingest_files(
                 text, paper,
                 content_type=content_type, language=language, config=kb_cfg,
             )
+            # R2 advanced: optionally augment with docling-extracted tables.
+            if content_type == "pdf" and getattr(kb_cfg, "docling_extract_tables_figures", False):
+                parser = app_state.pdf_parser
+                pages = parser._page_count(fp)
+                if parser._should_run_docling_extras(pages, kb_cfg):
+                    pc = parser._run_docling_with_timeout(
+                        fp, int(getattr(kb_cfg, "docling_timeout_s", 600))
+                    )
+                    if pc is not None and pc.tables:
+                        from perspicacite.pipeline.chunking_dispatch import (
+                            table_records_to_chunks,
+                        )
+                        chunks.extend(
+                            table_records_to_chunks(
+                                pc.tables, paper, start_index=len(chunks)
+                            )
+                        )
             # ChunkMetadata is frozen — recreate with source_file_path set,
             # plus optional external_metadata annotations (Cycle C).
             ext_parent = (external_metadata or {}).get("parent_paper_id")

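Reviewer sketch (not part of the patch): in the hunk above, `_run_docling_with_timeout` is called without `await` inside `async def _ingest_files`, which suggests it is synchronous. If that assumption holds, the call could occupy the event loop for up to `docling_timeout_s` seconds. A common way to avoid this is `asyncio.to_thread`; `slow_extract` and `heartbeat` below are hypothetical stand-ins used only to demonstrate the pattern.

```python
import asyncio
import time


def slow_extract(path: str) -> list[str]:
    """Hypothetical stand-in for a blocking, synchronous extraction call."""
    time.sleep(0.2)
    return [f"table from {path}"]


async def heartbeat(ticks: list[int]) -> None:
    """Record ticks to show the event loop keeps running during extraction."""
    for i in range(3):
        await asyncio.sleep(0.05)
        ticks.append(i)


async def main() -> tuple[list[str], list[int]]:
    ticks: list[int] = []
    beat = asyncio.create_task(heartbeat(ticks))
    # Run the blocking call in a worker thread so other coroutines can progress.
    tables = await asyncio.to_thread(slow_extract, "paper.pdf")
    await beat
    return tables, ticks


tables, ticks = asyncio.run(main())
```

Whether this matters in practice depends on how the ingest task is scheduled; if ingest already runs off the main loop, the direct call may be acceptable.
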
diff --git a/src/perspicacite/pipeline/chunking_dispatch.py b/src/perspicacite/pipeline/chunking_dispatch.py
@@ -58,6 +58,30 @@
 }
 
 
+def table_records_to_chunks(tables, paper, start_index: int) -> list[DocumentChunk]:
+    """Turn ``DoclingTable`` records into retrievable chunks tagged ``content_type='table'``.
+
+    Returns an empty list when ``tables`` is empty (the fitz-only path), so existing behaviour is unchanged.
+    """
+    chunks: list[DocumentChunk] = []
+    for i, t in enumerate(tables):
+        body = (f"{t.caption}\n\n{t.markdown}" if t.caption else t.markdown).strip()
+        idx = start_index + i
+        meta = ChunkMetadata(
+            paper_id=getattr(paper, "paper_id", "unknown"),
+            chunk_index=idx,
+            content_type="table",
+            page=getattr(t, "page", None),
+            title=getattr(paper, "title", None),
+            doi=getattr(paper, "doi", None),
+            year=getattr(paper, "year", None),
+        )
+        chunks.append(
+            DocumentChunk(id=f"{meta.paper_id}:table:{idx}", text=body, metadata=meta)
+        )
+    return chunks
+
+
 def infer_content_type(path: Path) -> tuple[str, str | None]:
     """Map file extension to ``(content_type, language)``.