1 change: 1 addition & 0 deletions .gitignore
@@ -61,6 +61,7 @@ MANIFEST.in

# IDE / Gemini agent
.gemini/
FEATURE_ROADMAP.md

# Logs
*.log
19 changes: 19 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,25 @@ All notable changes to **LongParser** are documented here.
This project follows [Semantic Versioning](https://semver.org/) and
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.1.5] — 2026-05-05

### Added

- **Semantic chunking** — `all-MiniLM-L6-v2` embedding-based boundary detection in `HybridChunker` (optional via `use_semantic_chunking`).
- **Cross-reference resolution** — single-pass O(N) linking of explicit references ("Figure 3") via regex lookup and implicit references ("the table below") via spatial proximity.
- **Summary chunks** — Asynchronous ARQ background worker (`enrich_summaries_job`) to auto-generate LLM section summaries for hierarchical RAG retrieval.
- **Chunk quality scorer** — Zero-ML, heuristic-based chunk scoring using block token confidences, dictionary word coverage (`/usr/share/dict/words`), and fastText language-ID validation.
- **PII redaction** — Hybrid approach combining fast regex + Luhn checks (emails, phone numbers, SSNs, credit cards, IP addresses) with optional spaCy NER (`en_core_web_sm`) for names, organizations, and locations. Original values are preserved in secure block metadata for HITL recovery.

### Changed

- Bumped the supported `marker-pdf` version range in dependencies.
- Added `ner` optional dependency group (`spacy>=3.7.0`) in `pyproject.toml`.
- Expanded `ChunkingConfig` and `ProcessingConfig` with new semantic, summary, and PII toggle options.
- Marked Phase 1 as officially complete in the roadmap.

---

## [0.1.3] — 2026-04-13

### Fixed
150 changes: 0 additions & 150 deletions FEATURE_ROADMAP.md

This file was deleted.

7 changes: 6 additions & 1 deletion README.md
@@ -36,8 +36,13 @@

| Feature | Detail |
|---------|--------|
| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling |
| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling, Marker & PyMuPDF4LLM |
| **Hybrid chunking** | Token-aware, heading-hierarchy-aware, table-aware |
| **Semantic chunking** | Embedding-based boundaries using `all-MiniLM-L6-v2` |
| **Cross-referencing** | Deterministic linking of explicit and implicit references to figures, tables, and sections |
| **Quality scoring** | Zero-ML heuristic scoring with dictionary & fastText validation |
| **PII redaction** | Hybrid Regex + NER (spaCy) redaction with secure HITL preservation |
| **Summary chunks** | Async ARQ worker generating hierarchical LLM section summaries |
| **HITL review** | Human-in-the-Loop block & chunk editing before embedding |
| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer |
| **3-layer memory** | Short-term turns + rolling summary + long-term facts |
21 changes: 20 additions & 1 deletion docs/changelog.md
@@ -5,6 +5,25 @@ All notable changes to **LongParser** are documented here.
This project follows [Semantic Versioning](https://semver.org/) and
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.1.5] — 2026-05-05

### Added

- **Semantic chunking** — `all-MiniLM-L6-v2` embedding-based boundary detection in `HybridChunker` (optional via `use_semantic_chunking`).
- **Cross-reference resolution** — single-pass O(N) linking of explicit references ("Figure 3") via regex lookup and implicit references ("the table below") via spatial proximity.
- **Summary chunks** — Asynchronous ARQ background worker (`enrich_summaries_job`) to auto-generate LLM section summaries for hierarchical RAG retrieval.
- **Chunk quality scorer** — Zero-ML, heuristic-based chunk scoring using block token confidences, dictionary word coverage (`/usr/share/dict/words`), and fastText language-ID validation.
- **PII redaction** — Hybrid approach combining fast regex + Luhn checks (emails, phone numbers, SSNs, credit cards, IP addresses) with optional spaCy NER (`en_core_web_sm`) for names, organizations, and locations. Original values are preserved in secure block metadata for HITL recovery.

### Changed

- Bumped the supported `marker-pdf` version range in dependencies.
- Added `ner` optional dependency group (`spacy>=3.7.0`) in `pyproject.toml`.
- Expanded `ChunkingConfig` and `ProcessingConfig` with new semantic, summary, and PII toggle options.
- Marked Phase 1 as officially complete in the roadmap.

---

## [0.1.3] — 2026-04-13

### Fixed
@@ -74,7 +93,7 @@ for production RAG pipelines.
via LangGraph `interrupt()` before embedding
- **3-layer memory chat** — short-term turns + rolling summary + long-term facts,
powered by LCEL chains
- **Multi-provider LLM support** — OpenAI (`gpt-5.3`), Gemini (`gemini-2.5`),
- **Multi-provider LLM support** — OpenAI (`gpt-4o`), Gemini (`gemini-2.0-flash`),
Groq (`llama-3.3-70b-versatile`), OpenRouter
- **Multi-backend vector stores** — Chroma, FAISS, Qdrant
- **Async-first REST API** — FastAPI + Motor (MongoDB) + ARQ (Redis job queue)
8 changes: 8 additions & 0 deletions docs/getting-started/configuration.md
@@ -58,6 +58,9 @@ config = ProcessingConfig(
formula_ocr=True,
export_images=False,
max_pages=None, # None = all pages
redact_pii=False, # Enable fast Regex/Luhn PII redaction
use_ner_redaction=False, # Enable spaCy NER for contextual PII
ner_model="en_core_web_sm", # Model to use if use_ner_redaction is True
)
```

@@ -74,5 +77,10 @@ config = ChunkingConfig(
generate_schema_chunks=True, # table schema chunks
table_chunk_format="row_record", # pipe | row_record
wide_table_col_threshold=15,
use_semantic_chunking=False, # Split at semantic shifts
semantic_threshold=0.3, # Cosine similarity threshold
semantic_model="all-MiniLM-L6-v2", # Model for semantic chunking
resolve_cross_references=True, # Resolve explicit & implicit refs
generate_summary_chunks=False, # Use LLM to summarize sections
)
```
2 changes: 1 addition & 1 deletion docs/getting-started/installation.md
@@ -104,5 +104,5 @@ The server starts on `http://localhost:8000`.

```python
import longparser
print(longparser.__version__) # 0.1.4
print(longparser.__version__) # 0.1.5
```
56 changes: 56 additions & 0 deletions docs/guide/chunking.md
@@ -25,6 +25,7 @@ config = ChunkingConfig(
detect_equations=True,
table_chunk_format="row_record", # or "pipe"
generate_schema_chunks=True,
use_semantic_chunking=True, # Split on semantic topic shifts
)

chunker = HybridChunker(config)
@@ -68,3 +69,58 @@ class Chunk:
Chunks respect a hard `max_tokens` ceiling. Equations are kept with their surrounding context using a **glue heuristic**:

- If the *next* block is an equation AND the current window overflows, the last paragraph carries over into the new chunk (so the equation is never split from its context).
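
A minimal sketch of that glue behavior, with simplified dict-based blocks standing in for LongParser's real block model:

```python
# Sketch only: real blocks and chunks are richer objects than these dicts.
def chunk_with_glue(blocks: list[dict], max_tokens: int) -> list[list[dict]]:
    chunks: list[list[dict]] = []
    window: list[dict] = []
    used = 0
    for block in blocks:
        if window and used + block["tokens"] > max_tokens:
            carry: list[dict] = []
            # Glue: when the incoming block is an equation, pull the last
            # paragraph forward so the equation keeps its context.
            if block["type"] == "equation" and window[-1]["type"] == "paragraph":
                carry = [window.pop()]
            chunks.append(window)
            window = carry
            used = sum(b["tokens"] for b in carry)
        window.append(block)
        used += block["tokens"]
    if window:
        chunks.append(window)
    return chunks
```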

## Semantic Chunking

When `use_semantic_chunking=True`, the chunker uses embedding similarity (default: `all-MiniLM-L6-v2`) to detect topic shifts within a section. Instead of splitting purely by token count, it finds natural breakpoints where the semantic content changes.

```python
config = ChunkingConfig(
use_semantic_chunking=True,
semantic_threshold=0.3, # Cosine similarity split threshold
semantic_model="all-MiniLM-L6-v2", # or "all-mpnet-base-v2"
)
```

The model is lazily loaded on first use — no memory cost if the feature is disabled.
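
For intuition, a minimal sketch of threshold-based boundary detection (the helper below is illustrative, not part of the public API):

```python
# Assumes sentence-transformers and numpy are installed.
import numpy as np
from sentence_transformers import SentenceTransformer

def find_semantic_boundaries(paragraphs: list[str], threshold: float = 0.3) -> list[int]:
    """Return indices at which a new chunk should start."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(paragraphs, normalize_embeddings=True)
    boundaries = []
    for i in range(1, len(paragraphs)):
        # With normalized vectors, cosine similarity is a plain dot product.
        if float(np.dot(embeddings[i - 1], embeddings[i])) < threshold:
            boundaries.append(i)  # topic shift: start a new chunk here
    return boundaries
```

In this sketch, a split happens wherever the similarity between adjacent paragraphs drops below the threshold.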

## Cross-Reference Resolution

When `resolve_cross_references=True` (default), the pipeline automatically links textual references to their target blocks:

- **Explicit references:** `"see Figure 3"`, `"Table 2"`, `"Section 3.1"`, `"Appendix A"` → linked via regex + dictionary lookup.
- **Implicit references:** `"the figure above"`, `"the table below"` → linked via spatial proximity in reading order.

Resolved links appear in chunk metadata:

```json
{
"cross_references": [
{"label": "Figure 3", "target_block_id": "block-uuid-123"},
{"label": "the table below", "target_block_id": "block-uuid-456", "resolution": "proximity"}
]
}
```
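
As a rough illustration of the explicit pass, a pattern of this shape surfaces candidate labels (a sketch, not the shipped matcher):

```python
# Sketch of explicit-reference candidate matching; the pattern and helper
# are illustrative, not LongParser internals.
import re

EXPLICIT_REF = re.compile(r"\b(Figure|Fig\.|Table|Section|Appendix)\s+(\d+(?:\.\d+)*|[A-Z])\b")

def find_explicit_references(text: str) -> list[str]:
    """Return normalized labels such as 'Figure 3' or 'Section 3.1'."""
    return [f"{kind} {num}" for kind, num in EXPLICIT_REF.findall(text)]

print(find_explicit_references("As shown in Figure 3 and Section 3.1; see Appendix A."))
# ['Figure 3', 'Section 3.1', 'Appendix A']
```

Each candidate label is then looked up against the blocks actually present in the document; implicit phrases fall back to the nearest matching block in reading order.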

## Quality Scoring

Each chunk receives a `quality_score` (0.0–1.0) based on:

- **Block confidence** — OCR confidence from the extraction engine
- **Dictionary word coverage** — percentage of words found in `/usr/share/dict/words` (penalizes garbled OCR)
- **Language ID confidence** — fastText-based language detection score (low confidence = noise)

```python
from longparser.chunkers.quality_scorer import score_chunks
scored = score_chunks(chunks, blocks)
print(scored[0].quality_score) # 0.92
```
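
The dictionary signal amounts to a set lookup over the chunk's words; a sketch of the idea (the exact weighting LongParser applies on top of this ratio is not shown):

```python
# Sketch of the dictionary-coverage heuristic on its own.
import re

def dictionary_coverage(text: str, dict_path: str = "/usr/share/dict/words") -> float:
    """Fraction of alphabetic words present in the system dictionary."""
    with open(dict_path) as f:
        vocab = {line.strip().lower() for line in f}
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)

print(dictionary_coverage("The quarterly report shows revenue growth"))   # high: clean text
print(dictionary_coverage("Tne quar+erlv rcport shcws revenuc grovvth"))  # low: garbled OCR
```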

## PII Redaction

When `redact_pii=True` in `ProcessingConfig`, the pipeline automatically masks sensitive data **before** any HITL review:

- **Pass 1 (always):** Fast regex + Luhn checksum for emails, phone numbers, SSNs, credit cards, and IP addresses.
- **Pass 2 (optional):** spaCy NER (`use_ner_redaction=True`) for names, organizations, and locations.

Original values are preserved in `block.pii_redactions` for authorized recovery.
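
The Luhn checksum in Pass 1 is what separates plausible card numbers from arbitrary digit runs; a sketch of the standard algorithm (not the shipped validator):

```python
# Standard Luhn checksum; used here to validate regex hits before redaction.
import re

def luhn_valid(candidate: str) -> bool:
    digits = [int(d) for d in re.sub(r"\D", "", candidate)]
    if len(digits) < 13:  # shorter than any real card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True:  redact as a credit card
print(luhn_valid("1234 5678 9012 3456"))  # False: leave untouched
```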
2 changes: 1 addition & 1 deletion docs/guide/parsing.md
@@ -1,6 +1,6 @@
# Document Parsing

LongParser uses **Docling** with Tesseract CLI OCR as its extraction engine — supporting PDF, DOCX, PPTX, XLSX, and CSV.
LongParser supports multiple extraction backends — **Docling** (default, with Tesseract OCR), **PyMuPDF4LLM** (10× faster for simple PDFs), and **Marker** (high-fidelity Markdown for academic papers).

## Supported Formats

5 changes: 5 additions & 0 deletions docs/index.md
@@ -37,6 +37,11 @@ Most RAG pipelines fail at the data layer. Hallucinations, missed tables, garble
|---|---|
| Multi-format extraction | PDF, DOCX, PPTX, XLSX, CSV |
| Hybrid chunking (6 strategies) | ✅ |
| Semantic chunking (embedding-based) | ✅ |
| Cross-referencing & linking | ✅ |
| Quality scoring (Zero-ML heuristics) | ✅ |
| PII Redaction (Regex + spaCy NER) | ✅ |
| Summary chunks generation | ✅ |
| HITL review workflow | ✅ |
| 3-layer memory chat | ✅ |
| Built-in citation validation | ✅ |
5 changes: 5 additions & 0 deletions docs/reference/chunkers.md
@@ -54,3 +54,8 @@ chunks: list[Chunk] = chunker.chunk(doc.blocks)
| `table_chunk_format` | `"row_record"` | `pipe` or `row_record` |
| `wide_table_col_threshold` | `15` | Split columns into bands above this |
| `min_chunk_tokens` | `20` | Merge chunks smaller than this |
| `use_semantic_chunking` | `False` | Embedding-based topic boundary detection |
| `semantic_threshold` | `0.3` | Cosine similarity threshold for splits |
| `semantic_model` | `"all-MiniLM-L6-v2"` | Sentence-transformer model |
| `resolve_cross_references` | `True` | Link Figure/Table/Section references |
| `generate_summary_chunks` | `False` | LLM-generated section summaries |
28 changes: 28 additions & 0 deletions docs/reference/extractors.md
@@ -70,3 +70,31 @@ latex = ocr.recognize(pil_image) # Returns LaTeX string

!!! note
Requires `pip install "longparser[latex-ocr]"` (`pix2tex`).

## MarkerExtractor

High-fidelity Markdown extractor for complex academic PDFs using `marker-pdf`.

```python
from longparser.extractors.marker_extractor import MarkerExtractor

extractor = MarkerExtractor()
doc = extractor.extract("academic_paper.pdf", ProcessingConfig())
```

!!! note
Requires `pip install "longparser[marker]"` (`marker-pdf`).

## PyMuPDFExtractor

Lightweight, fast alternative for speed-critical pipelines (10× faster than Docling for simple PDFs).

```python
from longparser.extractors.pymupdf_extractor import PyMuPDFExtractor

extractor = PyMuPDFExtractor()
doc = extractor.extract("simple_report.pdf", ProcessingConfig())
```

!!! warning
PyMuPDF4LLM is licensed under AGPL. It is only loaded when explicitly requested.