Merged
4 changes: 4 additions & 0 deletions CLAUDE.md
@@ -96,6 +96,10 @@ The `AgenticOrchestrator` ([src/perspicacite/rag/agentic/orchestrator.py](src/pe

Publisher API keys are passed as kwargs; missing keys skip that source gracefully. Check `content_type` in the result: `"structured"` > `"full_text"` > `"abstract"` > `"none"`.

### PDF extraction backends

PDF **text** is always extracted with PyMuPDF (`fitz`) → `pdfplumber` fallback (`pipeline/parsers/pdf.py`) — fast, default, on every ingest. **Structured tables + figures** are an opt-in advanced layer via **docling**, off by default (`knowledge_base.docling_extract_tables_figures`, guarded by `docling_max_pages` / `docling_timeout_s`; needs `uv sync --extra docling`). When enabled on the local-file ingest path, docling adds `content_type="table"` chunks on top of the fitz text; if docling is absent / oversized / times out, the text is unaffected. docling is CPU-bound (~min/page) and the MPS/GPU path is unusable on Apple Silicon. Full details: [docs/pdf-extraction-docling.md](docs/pdf-extraction-docling.md).

### Retrieval

`ChromaVectorStore` ([src/perspicacite/retrieval/chroma_store.py](src/perspicacite/retrieval/chroma_store.py)) wraps ChromaDB. KB collections are named via `chroma_collection_name_for_kb()` from `models/kb.py`. The hybrid retriever ([src/perspicacite/retrieval/hybrid.py](src/perspicacite/retrieval/hybrid.py)) combines ChromaDB cosine scores with BM25Okapi scores; weights default to 0.5/0.5 but can optionally be determined by the LLM at query time. `MultiKBRetriever` ([src/perspicacite/retrieval/multi_kb.py](src/perspicacite/retrieval/multi_kb.py)) fans a query across multiple KB collections, merges by score, deduplicates by `paper_id`, and tags results with `kb_name`; use `check_embedding_compat(kb_metas)` to validate that all queried KBs share the same embedding model before retrieval.
1 change: 1 addition & 0 deletions README.md
@@ -28,6 +28,7 @@

- **Multi-database search** — Semantic Scholar, OpenAlex, PubMed, arXiv, HAL, DBLP via SciLEx
- **Unified content pipeline** — PMC JATS XML, arXiv HTML, OA PDFs, publisher APIs, and institutional-access via browser-cookie replay; quality-priority routing
- **PDF extraction** — fast PyMuPDF text on every ingest, plus optional [docling](docs/pdf-extraction-docling.md) layout extraction for structured tables/figures (advanced, off by default)
- **6 RAG modes** — Basic, Advanced, Profound, Agentic, Literature Survey, Contradiction; per-stage LLM tiering (Haiku routing/screening, Sonnet synthesis)
- **Knowledge base management** — BibTeX import, DOI bulk-add, local document ingest, Zotero-collection import; async ingestion with SSE progress streaming
- **Citation-graph expansion** — forward + backward snowball over OpenAlex; automatic Semantic Scholar fallback for arXiv-seeded papers (see [docs/concepts/citation-graph.md](docs/concepts/citation-graph.md))
7 changes: 7 additions & 0 deletions config.example.yml
@@ -67,6 +67,13 @@ knowledge_base:
  embedding_model: "text-embedding-3-small"
  chunk_size: 1000
  chunk_overlap: 200
  # Advanced: extract structured tables + figures with docling, IN ADDITION to
  # the always-on fast fitz text extraction. Requires `uv sync --extra docling`.
  # docling is CPU-only here and slow (~45-50 s/page), so this is off by default;
  # turn it on for high-value PDFs where tables/figures matter.
  docling_extract_tables_figures: false
  docling_max_pages: 40          # skip docling extras for PDFs larger than this
  docling_timeout_s: 600         # per-document docling wall-clock cap; on timeout, skip extras
  chunking_method: "token"
  default_top_k: 10
  similarity_threshold: 0.7
89 changes: 89 additions & 0 deletions docs/pdf-extraction-docling.md
@@ -0,0 +1,89 @@
# PDF extraction: fast text + optional docling tables/figures

Perspicacité extracts PDF content in **two independent layers**:

| Layer | Engine | Runs | Output | Speed |
|-------|--------|------|--------|-------|
| **Text** (always on) | PyMuPDF (`fitz`) → `pdfplumber` fallback | every PDF ingest | full body text + sections | fast (sub-second) |
| **Tables + figures** (opt-in, advanced) | docling layout model | only when enabled | structured tables as retrievable chunks | slow (CPU-bound, ~45–50 s/page) |

The layers are decoupled: **text never depends on docling.** If the `[docling]`
extra is not installed, the PDF exceeds the page cap, or docling errors/times
out, you still get the full fitz text — you simply don't get the table chunks.
Enabling docling can only *add* content, never break ingest.
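
The decoupling described above can be sketched as follows. This is a minimal illustration, not the project's implementation: `ExtractionResult`, `extract_pdf`, and the two callables are hypothetical stand-ins for the fitz text pass and the docling extras pass.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ExtractionResult:
    """Hypothetical container: text is always present, tables are optional."""
    text: str
    tables: list = field(default_factory=list)
    docling_fallback: Optional[str] = None  # why the extras pass produced nothing


def extract_pdf(
    extract_text: Callable[[], str],
    extract_tables: Optional[Callable[[], list]] = None,
) -> ExtractionResult:
    """Always run the text layer; treat the table layer as best-effort."""
    # Text layer first, so nothing in the extras pass can affect it.
    result = ExtractionResult(text=extract_text())
    if extract_tables is None:  # extras disabled or not installed
        return result
    try:
        result.tables = list(extract_tables())
    except Exception as exc:  # any extras failure leaves the text untouched
        result.docling_fallback = f"error: {exc}"
    return result
```

Because the text result is built before the extras pass starts, a failure in the second callable can only leave `tables` empty; it cannot change `text`.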

## Why docling is off by default

Docling runs the RT-DETR layout model + TableFormer. On CPU this takes roughly
**45–50 s per page (~10 min for a typical paper)**. On Apple Silicon the GPU
(MPS) path is currently **unusable** — the upstream `transformers` RT-DETRv2
positional embedding hard-codes `float64`, which MPS does not support
(see [huggingface/transformers#28334](https://github.com/huggingface/transformers/issues/28334));
`PYTORCH_ENABLE_MPS_FALLBACK=1` does not help. So docling here is a deliberate,
batch/offline choice, not a hot-path default. A CUDA machine makes it fast
enough for routine use.
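
As a back-of-the-envelope check of these figures, the sketch below multiplies the quoted per-page rate by page count and compares it with a wall-clock budget. The helper names are illustrative and not part of the codebase; 600 s and 40 pages are the `docling_timeout_s` and `docling_max_pages` defaults.

```python
LOW_S_PER_PAGE, HIGH_S_PER_PAGE = 45.0, 50.0  # CPU range quoted above


def docling_cpu_seconds(pages: int, s_per_page: float) -> float:
    """Estimated docling wall-clock time for a document on CPU."""
    return pages * s_per_page


def pages_within_budget(timeout_s: float, s_per_page: float) -> int:
    """Largest page count expected to finish inside the timeout."""
    return int(timeout_s // s_per_page)


for pages in (12, 40):
    lo = docling_cpu_seconds(pages, LOW_S_PER_PAGE) / 60
    hi = docling_cpu_seconds(pages, HIGH_S_PER_PAGE) / 60
    print(f"{pages} pages: {lo:.1f}-{hi:.1f} min")
# 12 pages: 9.0-10.0 min
# 40 pages: 30.0-33.3 min

print(pages_within_budget(600, HIGH_S_PER_PAGE))  # 12
print(pages_within_budget(600, LOW_S_PER_PAGE))   # 13
```

If the quoted rate holds, a 600 s cap is reached after roughly 12 to 13 pages, so on CPU a longer PDF may hit the timeout well before the 40-page limit applies. Raising `docling_timeout_s` may be worth considering for long documents.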

## Enabling docling

1. Install the optional extra (one-time, heavy — pulls torch + layout models):

```bash
uv sync --extra docling
```

2. In `config.yml` under `knowledge_base:`:

```yaml
docling_extract_tables_figures: true # default: false
docling_max_pages: 40 # PDFs larger than this skip docling (text-only)
docling_timeout_s: 600 # per-document wall-clock cap; on timeout, keep text, skip extras
```

3. Ingest **local PDF files** (the local-files / dropzone path). Each PDF gets
fitz text **plus** any tables docling extracts, added as searchable chunks
tagged `content_type="table"` (caption + page preserved in metadata).

## Guard behaviour (config knobs)

- `docling_extract_tables_figures` (bool, default `false`) — master switch for
the advanced layer.
- `docling_max_pages` (int, default `40`) — documents with more pages skip
docling and use text-only fitz (avoids the worst-case multi-minute cost).
- `docling_timeout_s` (int, default `600`) — per-document wall-clock cap. docling
runs in a worker process; on timeout it is abandoned and ingest falls back to
the already-extracted fitz text. Every fallback logs one structured
`docling_fallback` event (`reason=oversized|timeout|error`).
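
The pre-flight part of these guards can be sketched as a pure decision function. This is an assumption-laden illustration, not the code in `pdf.py`: `docling_skip_reason` is a hypothetical name, and the `"disabled"` and `"not_installed"` labels are invented here (only `oversized`, `timeout`, and `error` appear in the logged event above). Timeout handling is left out because it needs a worker process.

```python
from typing import Optional


def docling_skip_reason(
    enabled: bool,
    importable: bool,
    pages: int,
    max_pages: int = 40,
) -> Optional[str]:
    """Return None when the extras pass should run, else a short reason."""
    if not enabled:       # master switch off (the default)
        return "disabled"
    if not importable:    # optional extra not installed
        return "not_installed"
    if pages > max_pages: # strictly more pages than the cap
        return "oversized"
    return None


# A document exactly at the cap still runs; one page more is skipped.
print(docling_skip_reason(True, True, pages=40))  # None
print(docling_skip_reason(True, True, pages=41))  # oversized
```

Keeping the decision separate from the conversion makes the page-cap boundary easy to unit-test without loading any models.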

## Scope and current limits

- **Wired for the local-file ingest path** (`integrations/local_docs.py`). The
DOI/BibTeX download path is text-only for now (adding table chunks there needs
a `Paper.tables` field — a follow-up).
- **Tables become chunks today; figures are extracted but not yet consumed.**
Docling figure records are produced (caption + image, dimensions populated)
and mapped to the existing multimodal record shape, but feeding figure images
into the answer/vision pipeline is a follow-up.
- **CPU-only in practice** on Apple Silicon (see above). Prefer a CUDA host or a
remote docling service for large batches.

## Implementation pointers

- Converter + record mapping: `src/perspicacite/pipeline/parsers/docling_pdf.py`
(`DoclingPDFParser`, `DoclingTable`, `DoclingFigure`,
`figure_to_multimodal_record`). The converter forces
`AcceleratorDevice.CPU` and enables `generate_picture_images` + `images_scale=2.0`
(without picture-image rendering, `PictureItem.get_image()` returns `None` and
every figure is dropped).
- Backend guard + worker: `src/perspicacite/pipeline/parsers/pdf.py`
(`_should_run_docling_extras`, `_run_docling_with_timeout`, `_docling_importable`).
- Table → chunk: `src/perspicacite/pipeline/chunking_dispatch.py`
(`table_records_to_chunks`).
- Config: `src/perspicacite/config/schema.py` (`KnowledgeBaseConfig`).
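
The table-to-chunk step can be illustrated with stand-in types. The sketch below mirrors the caption-plus-markdown body and the `paper_id:table:index` id scheme used by `table_records_to_chunks`, but `Table`, `Chunk`, and `tables_to_chunks` are simplified, hypothetical substitutes for `DoclingTable`, `DocumentChunk`/`ChunkMetadata`, and the real function.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Table:  # stand-in for DoclingTable
    markdown: str
    caption: Optional[str] = None
    page: Optional[int] = None


@dataclass
class Chunk:  # stand-in for DocumentChunk + ChunkMetadata
    id: str
    text: str
    content_type: str
    page: Optional[int]


def tables_to_chunks(tables: List[Table], paper_id: str, start_index: int) -> List[Chunk]:
    """One chunk per table; indices continue after the existing text chunks."""
    chunks: List[Chunk] = []
    for i, t in enumerate(tables):
        # Prepend the caption when there is one, so it is searchable with the table.
        body = (f"{t.caption}\n\n{t.markdown}" if t.caption else t.markdown).strip()
        idx = start_index + i
        chunks.append(Chunk(f"{paper_id}:table:{idx}", body, "table", t.page))
    return chunks


chunks = tables_to_chunks([Table("| a |", "Table 1", 3), Table("| b |")], "p1", 5)
print([c.id for c in chunks])  # ['p1:table:5', 'p1:table:6']
```

Passing `start_index=len(chunks)` from the caller keeps table chunk indices from colliding with the text chunks already produced for the same paper.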

## Note on full text vs. abstracts

If a knowledge base shows only abstracts, that is a **source** issue, not a
docling one: a Zotero `.bib` carries abstracts only. To get full text, ingest
the actual **PDFs** (local-file path) — the fast fitz layer already returns the
complete body text, no docling required. Enable docling only when you also want
the papers' **tables** as retrievable content.
12 changes: 12 additions & 0 deletions pyproject.toml
@@ -151,6 +151,18 @@ adapters = [
"indicium-adapters-metabolomics>=0.1.0",
]

# docling — high-fidelity PDF -> structured document conversion (layout,
# tables, sections) for the content pipeline. Heavier than the other
# extras (pulls in torch-backed models + pandas). Install only when
# docling-based parsing is needed.
#
# Install with:
# uv sync --extra docling
docling = [
"docling>=2.5,<3",
"pandas>=2.0,<3",
]

[project.scripts]
perspicacite = "perspicacite.cli:main"

19 changes: 19 additions & 0 deletions src/perspicacite/config/schema.py
@@ -92,6 +92,25 @@ class KnowledgeBaseConfig(BaseModel):
    embedding_model: str = "text-embedding-3-small"
    chunk_size: int = Field(default=1000, ge=100, le=10000)
    chunk_overlap: int = Field(default=200, ge=0, le=1000)
    docling_extract_tables_figures: bool = Field(
        default=False,
        description=(
            "Advanced: when True, run docling (CPU, slow ~min/page) to extract "
            "structured tables + figures from PDFs IN ADDITION to the always-on "
            "fitz text extraction. Off by default."
        ),
    )
    docling_max_pages: int = Field(
        default=40, ge=1,
        description="Skip the docling extras pass for PDFs with more pages than this.",
    )
    docling_timeout_s: int = Field(
        default=600, ge=1,
        description=(
            "Per-document wall-clock cap for the docling extras pass; "
            "on timeout, skip extras."
        ),
    )
    chunking_method: Literal["token", "semantic", "agentic"] = "token"
    default_top_k: int = Field(default=10, ge=1, le=100)
    similarity_threshold: float = Field(default=0.7, ge=0.0, le=1.0)
20 changes: 19 additions & 1 deletion src/perspicacite/integrations/local_docs.py
@@ -104,10 +104,11 @@ async def _read_text(path: Path, content_type: str, pdf_parser) -> str | None:
         parsed = await pdf_parser.parse(path)
         return parsed.text or None
     try:
-        return path.read_text(encoding="utf-8", errors="replace")
+        raw = path.read_text(encoding="utf-8", errors="replace")
     except Exception as exc:
         logger.warning("local_docs_read_failed", path=str(path), error=str(exc))
         return None
+    return raw or None


async def _ingest_files(
@@ -158,6 +159,23 @@ async def _ingest_files(
                text, paper,
                content_type=content_type, language=language, config=kb_cfg,
            )
            # R2 advanced: optionally augment with docling-extracted tables.
            if content_type == "pdf" and getattr(kb_cfg, "docling_extract_tables_figures", False):
                parser = app_state.pdf_parser
                pages = parser._page_count(fp)
                if parser._should_run_docling_extras(pages, kb_cfg):
                    pc = parser._run_docling_with_timeout(
                        fp, int(getattr(kb_cfg, "docling_timeout_s", 600))
                    )
                    if pc is not None and pc.tables:
                        from perspicacite.pipeline.chunking_dispatch import (
                            table_records_to_chunks,
                        )
                        chunks.extend(
                            table_records_to_chunks(
                                pc.tables, paper, start_index=len(chunks)
                            )
                        )
            # ChunkMetadata is frozen; recreate with source_file_path set,
            # plus optional external_metadata annotations (Cycle C).
            ext_parent = (external_metadata or {}).get("parent_paper_id")
24 changes: 24 additions & 0 deletions src/perspicacite/pipeline/chunking_dispatch.py
@@ -58,6 +58,30 @@
}


def table_records_to_chunks(tables, paper, start_index: int) -> list[DocumentChunk]:
    """Turn ``DoclingTable`` records into retrievable chunks tagged ``content_type='table'``.

    No-op when ``tables`` is empty (the fitz path), preserving today's behaviour.
    """
    chunks: list[DocumentChunk] = []
    for i, t in enumerate(tables):
        body = (f"{t.caption}\n\n{t.markdown}" if t.caption else t.markdown).strip()
        idx = start_index + i
        meta = ChunkMetadata(
            paper_id=getattr(paper, "paper_id", "unknown"),
            chunk_index=idx,
            content_type="table",
            page=getattr(t, "page", None),
            title=getattr(paper, "title", None),
            doi=getattr(paper, "doi", None),
            year=getattr(paper, "year", None),
        )
        chunks.append(
            DocumentChunk(id=f"{meta.paper_id}:table:{idx}", text=body, metadata=meta)
        )
    return chunks


def infer_content_type(path: Path) -> tuple[str, str | None]:
    """Map file extension to ``(content_type, language)``.
