diff --git a/CLAUDE.md b/CLAUDE.md index c968061a..70ea98c4 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -96,6 +96,10 @@ The `AgenticOrchestrator` ([src/perspicacite/rag/agentic/orchestrator.py](src/pe Publisher API keys are passed as kwargs; missing keys skip that source gracefully. Check `content_type` in the result: `"structured"` > `"full_text"` > `"abstract"` > `"none"`. +### PDF extraction backends + +PDF **text** is always extracted with PyMuPDF (`fitz`) → `pdfplumber` fallback (`pipeline/parsers/pdf.py`) — fast, default, on every ingest. **Structured tables + figures** are an opt-in advanced layer via **docling**, off by default (`knowledge_base.docling_extract_tables_figures`, guarded by `docling_max_pages` / `docling_timeout_s`; needs `uv sync --extra docling`). When enabled on the local-file ingest path, docling adds `content_type="table"` chunks on top of the fitz text; if docling is absent / oversized / times out, the text is unaffected. docling is CPU-bound (~min/page) and the MPS/GPU path is unusable on Apple Silicon. Full details: [docs/pdf-extraction-docling.md](docs/pdf-extraction-docling.md). + ### Retrieval `ChromaVectorStore` ([src/perspicacite/retrieval/chroma_store.py](src/perspicacite/retrieval/chroma_store.py)) wraps ChromaDB. KB collections are named via `chroma_collection_name_for_kb()` from `models/kb.py`. The hybrid retriever ([src/perspicacite/retrieval/hybrid.py](src/perspicacite/retrieval/hybrid.py)) combines ChromaDB cosine scores with BM25Okapi scores; weights default to 0.5/0.5 but can optionally be determined by the LLM at query time. `MultiKBRetriever` ([src/perspicacite/retrieval/multi_kb.py](src/perspicacite/retrieval/multi_kb.py)) fans a query across multiple KB collections, merges by score, deduplicates by `paper_id`, and tags results with `kb_name`; use `check_embedding_compat(kb_metas)` to validate that all queried KBs share the same embedding model before retrieval. 
diff --git a/README.md b/README.md index efc00cb8..db5b1673 100644 --- a/README.md +++ b/README.md @@ -28,6 +28,7 @@ - **Multi-database search** — Semantic Scholar, OpenAlex, PubMed, arXiv, HAL, DBLP via SciLEx - **Unified content pipeline** — PMC JATS XML, arXiv HTML, OA PDFs, publisher APIs, and institutional-access via browser-cookie replay; quality-priority routing +- **PDF extraction** — fast PyMuPDF text on every ingest, plus optional [docling](docs/pdf-extraction-docling.md) layout extraction for structured tables/figures (advanced, off by default) - **6 RAG modes** — Basic, Advanced, Profound, Agentic, Literature Survey, Contradiction; per-stage LLM tiering (Haiku routing/screening, Sonnet synthesis) - **Knowledge base management** — BibTeX import, DOI bulk-add, local document ingest, Zotero-collection import; async ingestion with SSE progress streaming - **Citation-graph expansion** — forward + backward snowball over OpenAlex; automatic Semantic Scholar fallback for arXiv-seeded papers (see [docs/concepts/citation-graph.md](docs/concepts/citation-graph.md)) diff --git a/config.example.yml b/config.example.yml index ed3760fd..3968d554 100644 --- a/config.example.yml +++ b/config.example.yml @@ -67,6 +67,13 @@ knowledge_base: embedding_model: "text-embedding-3-small" chunk_size: 1000 chunk_overlap: 200 + # Advanced: extract structured tables + figures with docling, IN ADDITION to + # the always-on fast fitz text extraction. Requires `uv sync --extra docling`. + # docling is CPU-only here and slow (~minutes/page), so this is off by default + # — turn it on for high-value PDFs where tables/figures matter. 
+ docling_extract_tables_figures: false + docling_max_pages: 40 # skip docling extras for PDFs larger than this + docling_timeout_s: 600 # per-document docling wall-clock cap; on timeout, skip extras chunking_method: "token" default_top_k: 10 similarity_threshold: 0.7 diff --git a/docs/pdf-extraction-docling.md b/docs/pdf-extraction-docling.md new file mode 100644 index 00000000..9435bcc3 --- /dev/null +++ b/docs/pdf-extraction-docling.md @@ -0,0 +1,89 @@ +# PDF extraction: fast text + optional docling tables/figures + +Perspicacité extracts PDF content in **two independent layers**: + +| Layer | Engine | Runs | Output | Speed | +|-------|--------|------|--------|-------| +| **Text** (always on) | PyMuPDF (`fitz`) → `pdfplumber` fallback | every PDF ingest | full body text + sections | fast (sub-second) | +| **Tables + figures** (opt-in, advanced) | docling layout model | only when enabled | structured tables as retrievable chunks | slow (CPU-bound, ~45–50 s/page) | + +The layers are decoupled: **text never depends on docling.** If the `[docling]` +extra is not installed, the PDF exceeds the page cap, or docling errors/times +out, you still get the full fitz text — you simply don't get the table chunks. +Enabling docling can only *add* content, never break ingest. + +## Why docling is off by default + +Docling runs the RT-DETR layout model + TableFormer. On CPU this takes roughly +**45–50 s per page (~10 min for a typical paper)**. On Apple Silicon the GPU +(MPS) path is currently **unusable** — the upstream `transformers` RT-DETRv2 +positional embedding hard-codes `float64`, which MPS does not support +(see [huggingface/transformers#28334](https://github.com/huggingface/transformers/issues/28334)); +`PYTORCH_ENABLE_MPS_FALLBACK=1` does not help. So docling here is a deliberate, +batch/offline choice, not a hot-path default. A CUDA machine makes it fast +enough for routine use. + +## Enabling docling + +1. 
Install the optional extra (one-time, heavy — pulls torch + layout models): + + ```bash + uv sync --extra docling + ``` + +2. In `config.yml` under `knowledge_base:`: + + ```yaml + docling_extract_tables_figures: true # default: false + docling_max_pages: 40 # PDFs larger than this skip docling (text-only) + docling_timeout_s: 600 # per-document wall-clock cap; on timeout, keep text, skip extras + ``` + +3. Ingest **local PDF files** (the local-files / dropzone path). Each PDF gets + fitz text **plus** any tables docling extracts, added as searchable chunks + tagged `content_type="table"` (caption kept in the chunk text, page in metadata). + +## Guard behaviour (config knobs) + +- `docling_extract_tables_figures` (bool, default `false`) — master switch for + the advanced layer. +- `docling_max_pages` (int, default `40`) — documents with more pages skip + docling and use text-only fitz (avoids the worst-case multi-minute cost). +- `docling_timeout_s` (int, default `600`) — per-document wall-clock cap. docling + runs in a worker process; on timeout it is abandoned and ingest falls back to + the already-extracted fitz text. A timeout or error logs one structured + `docling_fallback` event (`reason=timeout|error`). + +## Scope and current limits + +- **Wired for the local-file ingest path** (`integrations/local_docs.py`). The + DOI/BibTeX download path is text-only for now (adding table chunks there needs + a `Paper.tables` field — a follow-up). +- **Tables become chunks today; figures are extracted but not yet consumed.** + Docling figure records are produced (caption + image, dimensions populated) + and mapped to the existing multimodal record shape, but feeding figure images + into the answer/vision pipeline is a follow-up. +- **CPU-only in practice** on Apple Silicon (see above). Prefer a CUDA host or a + remote docling service for large batches. 
+ +## Implementation pointers + +- Converter + record mapping: `src/perspicacite/pipeline/parsers/docling_pdf.py` + (`DoclingPDFParser`, `DoclingTable`, `DoclingFigure`, + `figure_to_multimodal_record`). The converter forces + `AcceleratorDevice.CPU` and enables `generate_picture_images` + `images_scale=2.0` + (without picture-image rendering, `PictureItem.get_image()` returns `None` and + every figure is dropped). +- Backend guard + worker: `src/perspicacite/pipeline/parsers/pdf.py` + (`_should_run_docling_extras`, `_run_docling_with_timeout`, `_docling_importable`). +- Table → chunk: `src/perspicacite/pipeline/chunking_dispatch.py` + (`table_records_to_chunks`). +- Config: `src/perspicacite/config/schema.py` (`KnowledgeBaseConfig`). + +## Note on full text vs. abstracts + +If a knowledge base shows only abstracts, that is a **source** issue, not a +docling one: a Zotero `.bib` carries abstracts only. To get full text, ingest +the actual **PDFs** (local-file path) — the fast fitz layer already returns the +complete body text, no docling required. Enable docling only when you also want +the papers' **tables** as retrievable content. diff --git a/pyproject.toml b/pyproject.toml index 5bc02e0b..a7a71449 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -151,6 +151,18 @@ adapters = [ "indicium-adapters-metabolomics>=0.1.0", ] +# docling — high-fidelity PDF -> structured document conversion (layout, +# tables, sections) for the content pipeline. Heavier than the other +# extras (pulls in torch-backed models + pandas). Install only when +# docling-based parsing is needed. 
+# +# Install with: +# uv sync --extra docling +docling = [ + "docling>=2.5,<3", + "pandas>=2.0,<3", +] + [project.scripts] perspicacite = "perspicacite.cli:main" diff --git a/src/perspicacite/config/schema.py b/src/perspicacite/config/schema.py index a9e7d3d3..31c574bf 100644 --- a/src/perspicacite/config/schema.py +++ b/src/perspicacite/config/schema.py @@ -92,6 +92,25 @@ class KnowledgeBaseConfig(BaseModel): embedding_model: str = "text-embedding-3-small" chunk_size: int = Field(default=1000, ge=100, le=10000) chunk_overlap: int = Field(default=200, ge=0, le=1000) + docling_extract_tables_figures: bool = Field( + default=False, + description=( + "Advanced: when True, run docling (CPU, slow ~min/page) to extract " + "structured tables + figures from PDFs IN ADDITION to the always-on " + "fitz text extraction. Off by default." + ), + ) + docling_max_pages: int = Field( + default=40, ge=1, + description="Skip the docling extras pass for PDFs with more pages than this.", + ) + docling_timeout_s: int = Field( + default=600, ge=1, + description=( + "Per-document wall-clock cap for the docling extras pass; " + "on timeout, skip extras." 
+ ), + ) chunking_method: Literal["token", "semantic", "agentic"] = "token" default_top_k: int = Field(default=10, ge=1, le=100) similarity_threshold: float = Field(default=0.7, ge=0.0, le=1.0) diff --git a/src/perspicacite/integrations/local_docs.py b/src/perspicacite/integrations/local_docs.py index f72275c1..e654d468 100644 --- a/src/perspicacite/integrations/local_docs.py +++ b/src/perspicacite/integrations/local_docs.py @@ -104,10 +104,11 @@ async def _read_text(path: Path, content_type: str, pdf_parser) -> str | None: parsed = await pdf_parser.parse(path) return parsed.text or None try: - return path.read_text(encoding="utf-8", errors="replace") + raw = path.read_text(encoding="utf-8", errors="replace") except Exception as exc: logger.warning("local_docs_read_failed", path=str(path), error=str(exc)) return None + return raw or None async def _ingest_files( @@ -158,6 +159,23 @@ async def _ingest_files( text, paper, content_type=content_type, language=language, config=kb_cfg, ) + # R2 advanced: optionally augment with docling-extracted tables. + if content_type == "pdf" and getattr(kb_cfg, "docling_extract_tables_figures", False): + parser = app_state.pdf_parser + pages = parser._page_count(fp) + if parser._should_run_docling_extras(pages, kb_cfg): + pc = parser._run_docling_with_timeout( + fp, int(getattr(kb_cfg, "docling_timeout_s", 600)) + ) + if pc is not None and pc.tables: + from perspicacite.pipeline.chunking_dispatch import ( + table_records_to_chunks, + ) + chunks.extend( + table_records_to_chunks( + pc.tables, paper, start_index=len(chunks) + ) + ) # ChunkMetadata is frozen — recreate with source_file_path set, # plus optional external_metadata annotations (Cycle C). 
ext_parent = (external_metadata or {}).get("parent_paper_id") diff --git a/src/perspicacite/pipeline/chunking_dispatch.py b/src/perspicacite/pipeline/chunking_dispatch.py index be188396..156c0807 100644 --- a/src/perspicacite/pipeline/chunking_dispatch.py +++ b/src/perspicacite/pipeline/chunking_dispatch.py @@ -58,6 +58,30 @@ } +def table_records_to_chunks(tables, paper, start_index: int) -> list[DocumentChunk]: + """Turn ``DoclingTable`` records into retrievable chunks tagged ``content_type='table'``. + + No-op when ``tables`` is empty (the fitz path), preserving today's behaviour. + """ + chunks: list[DocumentChunk] = [] + for i, t in enumerate(tables): + body = (f"{t.caption}\n\n{t.markdown}" if t.caption else t.markdown).strip() + idx = start_index + i + meta = ChunkMetadata( + paper_id=getattr(paper, "paper_id", "unknown"), + chunk_index=idx, + content_type="table", + page=getattr(t, "page", None), + title=getattr(paper, "title", None), + doi=getattr(paper, "doi", None), + year=getattr(paper, "year", None), + ) + chunks.append( + DocumentChunk(id=f"{meta.paper_id}:table:{idx}", text=body, metadata=meta) + ) + return chunks + + def infer_content_type(path: Path) -> tuple[str, str | None]: """Map file extension to ``(content_type, language)``. diff --git a/src/perspicacite/pipeline/parsers/docling_pdf.py b/src/perspicacite/pipeline/parsers/docling_pdf.py new file mode 100644 index 00000000..80945c9e --- /dev/null +++ b/src/perspicacite/pipeline/parsers/docling_pdf.py @@ -0,0 +1,175 @@ +"""Docling-backed PDF extraction (R2). + +Ports the converter configuration proven in AgenticScienceBuilder's +figures.py: picture images MUST be rendered (generate_picture_images=True) +or PictureItem.get_image() returns None and every figure is dropped; figure +pixel dimensions MUST be read from the rendered image or the size filter +discards them. No dependency on ASB. 
+""" +from __future__ import annotations + +import importlib.util +import re +from dataclasses import dataclass +from io import BytesIO +from typing import TYPE_CHECKING, Any + +from perspicacite.logging import get_logger +from perspicacite.pipeline.parsers.pdf import ParsedContent + +if TYPE_CHECKING: + from collections.abc import Callable + from pathlib import Path + +logger = get_logger("perspicacite.pipeline.parsers.docling") + +_MIN_AREA_PX = 50_000 # drop logos/icons (mirrors ASB) + + +@dataclass +class DoclingTable: + page: int + caption: str + markdown: str + headers: list[str] + rows: list[list[str]] + + @property + def n_rows(self) -> int: + return len(self.rows) + + @property + def n_cols(self) -> int: + return len(self.headers) + + +@dataclass +class DoclingFigure: + page: int + caption: str + width_px: int + height_px: int + image_bytes: bytes = b"" + + +def docling_importable() -> bool: + return importlib.util.find_spec("docling") is not None + + +def _make_docling_converter(): + # Picture images MUST be enabled or get_image() returns None (zero figures). + from docling.datamodel.base_models import InputFormat + from docling.datamodel.pipeline_options import ( + AcceleratorDevice, + AcceleratorOptions, + PdfPipelineOptions, + ) + from docling.document_converter import DocumentConverter, PdfFormatOption + + opts = PdfPipelineOptions() + opts.generate_picture_images = True + opts.images_scale = 2.0 + # Force CPU. On Apple Silicon docling auto-selects the MPS (Metal) backend, + # which raises "Cannot convert a MPS Tensor to float64 ... MPS doesn't + # support float64" and fails conversion on every page. CPU is portable and + # matches the documented R2 device intent. 
+ opts.accelerator_options = AcceleratorOptions(device=AcceleratorDevice.CPU) + return DocumentConverter( + format_options={InputFormat.PDF: PdfFormatOption(pipeline_options=opts)} + ) + + +def _page_of(item) -> int: + prov = getattr(item, "prov", None) or [] + if prov and getattr(prov[0], "page_no", None) is not None: + return int(prov[0].page_no) + return 1 + + +_FIG_LABEL_RE = re.compile( + r"^\s*((?:supplementary\s+)?(?:fig(?:ure|\.)?|scheme)\s+[A-Za-z]?\d+[A-Za-z]?)", + re.IGNORECASE, +) + + +def figure_to_multimodal_record(fig: DoclingFigure) -> dict: + """Map a DoclingFigure to the existing multimodal record shape + {kind, label, caption, content} used by parsers/multimodal.py. `content` + is left empty: docling supplies the image, not a semantic description.""" + m = _FIG_LABEL_RE.match(fig.caption or "") + label = m.group(1).strip() if m else "" + return {"kind": "figure", "label": label, "caption": fig.caption or "", "content": ""} + + +class DoclingPDFParser: + """Extracts text + structured tables + figures via docling.""" + + def __init__(self, converter_factory: Callable[[], Any] = _make_docling_converter): + self._converter_factory = converter_factory + + def extract(self, source: str | Path) -> ParsedContent: + conv = self._converter_factory() + doc = conv.convert(str(source)).document + figures = self._figures(doc) + tables = self._tables(doc) + text = self._text(doc) + return ParsedContent( + text=text, + sections=None, + metadata={"extractor": "docling"}, + tables=tables, + figures=figures, + ) + + def _text(self, doc) -> str: + try: + return doc.export_to_markdown() + except Exception: + return "" + + def _figures(self, doc) -> list[DoclingFigure]: + out: list[DoclingFigure] = [] + for pic in getattr(doc, "pictures", []) or []: + try: + pil = pic.get_image(doc) + w, h = pil.width, pil.height + buf = BytesIO() + pil.save(buf, "PNG") + image_bytes = buf.getvalue() + except Exception: + continue + if len(image_bytes) < 1024: + continue + try: + 
caption = pic.caption_text(doc) or "" + except Exception: + caption = "" + out.append( + DoclingFigure( + page=_page_of(pic), caption=caption, + width_px=w, height_px=h, image_bytes=image_bytes, + ) + ) + return out + + def _tables(self, doc) -> list[DoclingTable]: + out: list[DoclingTable] = [] + for tbl in getattr(doc, "tables", []) or []: + try: + df = tbl.export_to_dataframe(doc) + headers = [str(c) for c in df.columns.tolist()] + rows = [[str(v) for v in row] for row in df.values.tolist()] + markdown = tbl.export_to_markdown(doc) + except Exception: + continue + try: + caption = tbl.caption_text(doc) or "" + except Exception: + caption = "" + out.append( + DoclingTable( + page=_page_of(tbl), caption=caption, + markdown=markdown, headers=headers, rows=rows, + ) + ) + return out diff --git a/src/perspicacite/pipeline/parsers/pdf.py b/src/perspicacite/pipeline/parsers/pdf.py index cb503d74..e7456a54 100644 --- a/src/perspicacite/pipeline/parsers/pdf.py +++ b/src/perspicacite/pipeline/parsers/pdf.py @@ -6,15 +6,28 @@ """ import re -from dataclasses import dataclass +from dataclasses import dataclass, field from pathlib import Path -from typing import Any +from typing import TYPE_CHECKING, Any from perspicacite.logging import get_logger +if TYPE_CHECKING: + from perspicacite.pipeline.parsers.docling_pdf import DoclingFigure, DoclingTable + logger = get_logger("perspicacite.pipeline.parsers.pdf") +def _docling_importable() -> bool: + from perspicacite.pipeline.parsers.docling_pdf import docling_importable + return docling_importable() + + +def _docling_extract_worker(path: str): + from perspicacite.pipeline.parsers.docling_pdf import DoclingPDFParser + return DoclingPDFParser().extract(path) + + def _clean_text(text: str, threshold: float = 0.05) -> str: """Collapse excess newlines when they dominate the text. 
@@ -48,6 +61,9 @@ class ParsedContent: title: str | None = None sections: dict[str, str] | None = None metadata: dict[str, Any] | None = None + # R2 (docling): empty on the fitz path; populated when docling is used. + tables: list["DoclingTable"] = field(default_factory=list) + figures: list["DoclingFigure"] = field(default_factory=list) class PDFParser: @@ -167,6 +183,52 @@ def _extract_with_pdfplumber(self, source: str | Path | bytes) -> tuple[str, dic return "\n\n".join(all_text), sections, page_count + # ------------------------------------------------------------------ + # docling extras pass: guards + worker runner (R2 docling) + # ------------------------------------------------------------------ + + def _page_count(self, source) -> int: + fitz = self._get_fitz() + if fitz is None: + return 0 + try: + doc = ( + fitz.open(str(source)) + if isinstance(source, (str, Path)) + else fitz.open(stream=source, filetype="pdf") + ) + n = doc.page_count + doc.close() + return n + except Exception: + return 0 + + def _should_run_docling_extras(self, page_count: int, config) -> bool: + """True when docling tables/figures extraction should run: the advanced + flag is on, the [docling] extra is importable, and the PDF is within the + page-count cap. 
The wall-clock timeout is the runtime safety net.""" + if not getattr(config, "docling_extract_tables_figures", False): + return False + if not _docling_importable(): + return False + return page_count <= int(getattr(config, "docling_max_pages", 40)) + + def _run_docling_with_timeout(self, source, timeout_s: int): + """Run docling in a worker process; return ParsedContent or None on + timeout/error (caller falls back to fitz).""" + from concurrent.futures import ProcessPoolExecutor + from concurrent.futures import TimeoutError as FTimeout + try: + with ProcessPoolExecutor(max_workers=1) as ex: + fut = ex.submit(_docling_extract_worker, str(source)) + return fut.result(timeout=timeout_s) + except FTimeout: + logger.warning("docling_fallback", reason="timeout", path=str(source)) + return None + except Exception as exc: + logger.warning("docling_fallback", reason="error", error=str(exc)) + return None + # ------------------------------------------------------------------ # Public API # ------------------------------------------------------------------ diff --git a/tests/unit/test_config.py b/tests/unit/test_config.py index 84b81683..1b1d4e63 100644 --- a/tests/unit/test_config.py +++ b/tests/unit/test_config.py @@ -219,3 +219,13 @@ def test_anchor_config_near_threshold_bounds(): AnchorConfig(near_threshold=1.0) with pytest.raises(ValidationError): AnchorConfig(near_threshold=1.5) + + +def test_docling_extras_config_defaults(): + from perspicacite.config.schema import KnowledgeBaseConfig + kb = KnowledgeBaseConfig() + assert kb.docling_extract_tables_figures is False + assert kb.docling_max_pages == 40 + assert kb.docling_timeout_s == 600 + kb2 = KnowledgeBaseConfig(docling_extract_tables_figures=True) + assert kb2.docling_extract_tables_figures is True diff --git a/tests/unit/test_docling_pdf.py b/tests/unit/test_docling_pdf.py new file mode 100644 index 00000000..cd77f680 --- /dev/null +++ b/tests/unit/test_docling_pdf.py @@ -0,0 +1,106 @@ +import unittest + + +class 
TestRecordsAndParsedContent(unittest.TestCase): + def test_parsed_content_defaults_empty_tables_figures(self): + from perspicacite.pipeline.parsers.pdf import ParsedContent + pc = ParsedContent(text="hi") + assert pc.tables == [] + assert pc.figures == [] + + def test_record_dataclasses_construct(self): + from perspicacite.pipeline.parsers.docling_pdf import DoclingTable, DoclingFigure + t = DoclingTable(page=2, caption="Table 1.", markdown="| a |", headers=["a"], rows=[["1"]]) + assert t.n_rows == 1 and t.n_cols == 1 + f = DoclingFigure(page=1, caption="Figure 1.", width_px=300, height_px=300, image_bytes=b"x") + assert f.width_px == 300 + + +class _FakeProv: + def __init__(self, page_no): self.page_no = page_no + +class _FakeImg: + def __init__(self, png): self._png = png; self.width = 300; self.height = 300 + def save(self, buf, fmt): buf.write(self._png) + +class _FakePicture: + def __init__(self, page, caption, png): + self.prov = [_FakeProv(page)]; self._caption = caption; self._png = png + def caption_text(self, doc): return self._caption + def get_image(self, doc): return _FakeImg(self._png) + +class _FakeTable: + def __init__(self, page, caption, headers, rows): + self.prov = [_FakeProv(page)]; self._caption = caption + self._headers = headers; self._rows = rows + def caption_text(self, doc): return self._caption + def export_to_markdown(self, doc=None): return "| " + " | ".join(self._headers) + " |" + def export_to_dataframe(self, doc=None): + import pandas as pd + return pd.DataFrame(self._rows, columns=self._headers) + +class _FakeDoc: + def __init__(self, pictures, tables): self.pictures = pictures; self.tables = tables + +class _FakeResult: + def __init__(self, doc): self.document = doc + +class _FakeConverter: + def __init__(self, doc): self._doc = doc + def convert(self, source): return _FakeResult(self._doc) + + +class TestDoclingExtraction(unittest.TestCase): + def test_maps_pictures_and_tables_dims_populated(self): + import importlib.util + if 
importlib.util.find_spec("pandas") is None: + self.skipTest("pandas required") + from perspicacite.pipeline.parsers import docling_pdf as d + png = b"\x89PNG\r\n\x1a\n" + b"\x00" * 2048 + doc = _FakeDoc( + pictures=[_FakePicture(1, "Figure 1.", png)], + tables=[_FakeTable(2, "Table 1.", ["k", "v"], [["a", "1"]])], + ) + parser = d.DoclingPDFParser(converter_factory=lambda: _FakeConverter(doc)) + res = parser.extract("/x.pdf") + assert len(res.figures) == 1 + assert res.figures[0].width_px == 300 and res.figures[0].height_px == 300 + assert len(res.tables) == 1 + assert res.tables[0].headers == ["k", "v"] and res.tables[0].rows == [["a", "1"]] + assert "k" in res.tables[0].markdown + + +class TestDoclingConverterConfig(unittest.TestCase): + def test_converter_enables_picture_images(self): + import importlib.util + if importlib.util.find_spec("docling") is None: + self.skipTest("docling extra required") + from perspicacite.pipeline.parsers.docling_pdf import _make_docling_converter + from docling.datamodel.base_models import InputFormat + conv = _make_docling_converter() + opts = conv.format_to_options[InputFormat.PDF].pipeline_options + assert opts.generate_picture_images is True + assert opts.images_scale >= 2.0 + + +class TestFigureToMultimodalShape(unittest.TestCase): + def test_figure_maps_to_kind_caption_content(self): + from perspicacite.pipeline.parsers.docling_pdf import ( + DoclingFigure, figure_to_multimodal_record, + ) + f = DoclingFigure(page=1, caption="Figure 2. Workflow.", + width_px=400, height_px=300, image_bytes=b"x") + rec = figure_to_multimodal_record(f) + assert rec["kind"] == "figure" + assert rec["caption"] == "Figure 2. Workflow." 
+ assert rec["label"] == "Figure 2" + assert "content" in rec + + def test_figure_without_label_caption(self): + from perspicacite.pipeline.parsers.docling_pdf import ( + DoclingFigure, figure_to_multimodal_record, + ) + f = DoclingFigure(page=1, caption="An unlabeled panel", width_px=400, height_px=300) + rec = figure_to_multimodal_record(f) + assert rec["kind"] == "figure" + assert rec["label"] == "" diff --git a/tests/unit/test_docling_table_chunks.py b/tests/unit/test_docling_table_chunks.py new file mode 100644 index 00000000..cd5c038c --- /dev/null +++ b/tests/unit/test_docling_table_chunks.py @@ -0,0 +1,25 @@ +import unittest + + +class TestTableChunks(unittest.TestCase): + def test_table_records_become_table_chunks(self): + from perspicacite.pipeline.parsers.docling_pdf import DoclingTable + from perspicacite.pipeline.chunking_dispatch import table_records_to_chunks + + class _Paper: + paper_id = "local:abc" + title = "T"; doi = None; year = None + tables = [DoclingTable(page=3, caption="Table 1. 
Params.", + markdown="| k | v |\n| a | 1 |", headers=["k", "v"], rows=[["a", "1"]])] + chunks = table_records_to_chunks(tables, _Paper(), start_index=0) + assert len(chunks) == 1 + c = chunks[0] + assert c.metadata.content_type == "table" + assert c.metadata.page == 3 + assert "Table 1" in c.text and "| k | v |" in c.text + + def test_empty_tables_yield_no_chunks(self): + from perspicacite.pipeline.chunking_dispatch import table_records_to_chunks + class _Paper: + paper_id = "p"; title = None; doi = None; year = None + assert table_records_to_chunks([], _Paper(), start_index=5) == [] diff --git a/tests/unit/test_local_docs_docling_wire.py b/tests/unit/test_local_docs_docling_wire.py new file mode 100644 index 00000000..7331d5d1 --- /dev/null +++ b/tests/unit/test_local_docs_docling_wire.py @@ -0,0 +1,40 @@ +import asyncio +import unittest +from pathlib import Path + + +class TestReadTextIsFitzTextOnly(unittest.TestCase): + def test_pdf_returns_text_string(self): + from perspicacite.integrations.local_docs import _read_text + from perspicacite.pipeline.parsers.pdf import ParsedContent + + class _FakeParser: + async def parse(self, source): + return ParsedContent(text="body text") + + out = asyncio.run(_read_text(Path("/x.pdf"), "pdf", _FakeParser())) + assert out == "body text" + + def test_pdf_empty_returns_none(self): + from perspicacite.integrations.local_docs import _read_text + from perspicacite.pipeline.parsers.pdf import ParsedContent + + class _FakeParser: + async def parse(self, source): + return ParsedContent(text="") + + assert asyncio.run(_read_text(Path("/x.pdf"), "pdf", _FakeParser())) is None + + def test_non_pdf_returns_text(self): + import os + import tempfile + + from perspicacite.integrations.local_docs import _read_text + with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as f: + f.write("hello world") + p = Path(f.name) + try: + out = asyncio.run(_read_text(p, "text", None)) + assert "hello world" in out + finally: + os.unlink(p) 
diff --git a/tests/unit/test_pdf_backend_guard.py b/tests/unit/test_pdf_backend_guard.py new file mode 100644 index 00000000..9912a87d --- /dev/null +++ b/tests/unit/test_pdf_backend_guard.py @@ -0,0 +1,85 @@ +import unittest + + +class _Cfg: + def __init__(self, flag=True, max_pages=40, timeout=600): + self.docling_extract_tables_figures = flag + self.docling_max_pages = max_pages + self.docling_timeout_s = timeout + + +class TestShouldRunDoclingExtras(unittest.TestCase): + def test_flag_off_returns_false(self): + from perspicacite.pipeline.parsers.pdf import PDFParser + assert PDFParser()._should_run_docling_extras(5, _Cfg(flag=False)) is False + + def test_flag_on_importable_small_returns_true(self): + from perspicacite.pipeline.parsers import pdf as m + orig = m._docling_importable + m._docling_importable = lambda: True + try: + assert m.PDFParser()._should_run_docling_extras(5, _Cfg()) is True + finally: + m._docling_importable = orig + + def test_oversized_returns_false(self): + from perspicacite.pipeline.parsers import pdf as m + orig = m._docling_importable + m._docling_importable = lambda: True + try: + assert m.PDFParser()._should_run_docling_extras(999, _Cfg(max_pages=40)) is False + finally: + m._docling_importable = orig + + def test_not_importable_returns_false(self): + from perspicacite.pipeline.parsers import pdf as m + orig = m._docling_importable + m._docling_importable = lambda: False + try: + assert m.PDFParser()._should_run_docling_extras(5, _Cfg()) is False + finally: + m._docling_importable = orig + + +class TestTimeoutFallback(unittest.TestCase): + def test_timeout_branch_via_stub(self): + from concurrent.futures import TimeoutError as FTimeout + + from perspicacite.pipeline.parsers.pdf import PDFParser + p = PDFParser() + + class _Fut: + def result(self, timeout): raise FTimeout() + + class _Ex: + def __enter__(self): return self + def __exit__(self, *a): return False + def submit(self, *a, **k): return _Fut() + + import concurrent.futures 
as cf + orig_ex = cf.ProcessPoolExecutor + cf.ProcessPoolExecutor = lambda *a, **k: _Ex() + try: + assert p._run_docling_with_timeout("/x.pdf", timeout_s=1) is None + finally: + cf.ProcessPoolExecutor = orig_ex + + def test_error_branch_returns_none(self): + from perspicacite.pipeline.parsers.pdf import PDFParser + p = PDFParser() + + class _Fut: + def result(self, timeout): raise RuntimeError("boom") + + class _Ex: + def __enter__(self): return self + def __exit__(self, *a): return False + def submit(self, *a, **k): return _Fut() + + import concurrent.futures as cf + orig_ex = cf.ProcessPoolExecutor + cf.ProcessPoolExecutor = lambda *a, **k: _Ex() + try: + assert p._run_docling_with_timeout("/x.pdf", timeout_s=1) is None + finally: + cf.ProcessPoolExecutor = orig_ex
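One behaviour of `_run_docling_with_timeout` may be worth checking against the documented "abandoned on timeout" guarantee. Per the standard-library documentation, leaving a `with ...Executor(...)` block calls `shutdown(wait=True)`, so the caller can still block until the running job finishes even after `fut.result(timeout=...)` has raised. The sketch below illustrates this under stated assumptions. It swaps in `ThreadPoolExecutor` (same `Executor` context-manager semantics, no process spawning), and `slow_job` and `run_with_timeout` are hypothetical names that do not appear in the patch.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FTimeout


def slow_job(seconds: float) -> str:
    """Hypothetical stand-in for a slow docling conversion."""
    time.sleep(seconds)
    return "done"


def run_with_timeout(job_seconds: float, timeout_s: float) -> tuple[bool, float]:
    """Return (timed_out, total_elapsed_seconds) for one guarded call."""
    start = time.monotonic()
    timed_out = False
    # Same shape as the patch: executor as a context manager, result() with a timeout.
    with ThreadPoolExecutor(max_workers=1) as ex:
        fut = ex.submit(slow_job, job_seconds)
        try:
            fut.result(timeout=timeout_s)
        except FTimeout:
            timed_out = True
    # Exiting the `with` block waits for the still-running job (shutdown(wait=True)),
    # so the elapsed time tracks the job duration, not the timeout.
    return timed_out, time.monotonic() - start


if __name__ == "__main__":
    timed_out, elapsed = run_with_timeout(job_seconds=0.5, timeout_s=0.05)
    print(f"timed_out={timed_out}, elapsed={elapsed:.2f}s")
```

If a hard wall-clock cap is required, one option is to run the conversion in a dedicated `multiprocessing.Process`, call `join(timeout)`, and then `terminate()` the process if it is still alive. Whether that trade-off fits this code path is a design decision for the author.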