1 change: 1 addition & 0 deletions .gitignore
@@ -61,6 +61,7 @@ MANIFEST.in

# IDE / Gemini agent
.gemini/
FEATURE_ROADMAP.md

# Logs
*.log
19 changes: 19 additions & 0 deletions CHANGELOG.md
@@ -5,6 +5,25 @@ All notable changes to **LongParser** are documented here.
This project follows [Semantic Versioning](https://semver.org/) and
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.1.5] — 2026-05-05

### Added

- **Semantic chunking** — `all-MiniLM-L6-v2` embedding-based boundary detection in `HybridChunker` (optional via `use_semantic_chunking`).
- **Cross-reference resolution** — single-pass O(N) linking of explicit references ("Figure 3") via regex lookup and implicit references ("the table below") via spatial proximity.
- **Summary chunks** — Asynchronous ARQ background worker (`enrich_summaries_job`) to auto-generate LLM section summaries for hierarchical RAG retrieval.
- **Chunk quality scorer** — Zero-ML, heuristic-based chunk scoring using block token confidences, dictionary word coverage (`/usr/share/dict/words`), and fastText language-ID validation.
- **PII redaction** — Hybrid approach combining fast regex + Luhn checks (emails, phone numbers, SSNs, credit cards, IP addresses) with optional spaCy NER (`en_core_web_sm`) for names, organizations, and locations. Original values are preserved in secure block metadata for HITL recovery.

### Changed

- Bumped the supported `marker-pdf` version range in dependencies.
- Added `ner` optional dependency group (`spacy>=3.7.0`) in `pyproject.toml`.
- Expanded `ChunkingConfig` and `ProcessingConfig` with new semantic, summary, and PII toggle options.
- Marked Phase 1 as officially complete in the roadmap.

---

## [0.1.3] — 2026-04-13

### Fixed
150 changes: 0 additions & 150 deletions FEATURE_ROADMAP.md

This file was deleted.

7 changes: 6 additions & 1 deletion README.md
@@ -36,8 +36,13 @@

| Feature | Detail |
|---------|--------|
| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling |
| **Multi-format extraction** | PDF, DOCX, PPTX, XLSX, CSV via Docling, Marker & PyMuPDF4LLM |
| **Hybrid chunking** | Token-aware, heading-hierarchy-aware, table-aware |
| **Semantic chunking** | Embedding-based boundaries using `all-MiniLM-L6-v2` |
| **Cross-referencing** | Deterministic linking of explicit and implicit references to figures, tables, and sections |
| **Quality scoring** | Zero-ML heuristic scoring with dictionary & fastText validation |
| **PII redaction** | Hybrid Regex + NER (spaCy) redaction with secure HITL preservation |
| **Summary chunks** | Async ARQ worker generating hierarchical LLM section summaries |
| **HITL review** | Human-in-the-Loop block & chunk editing before embedding |
| **LangGraph HITL** | `approve / edit / reject` workflow with LangGraph `interrupt()` and MongoDB checkpointer |
| **3-layer memory** | Short-term turns + rolling summary + long-term facts |
21 changes: 20 additions & 1 deletion docs/changelog.md
@@ -5,6 +5,25 @@ All notable changes to **LongParser** are documented here.
This project follows [Semantic Versioning](https://semver.org/) and
[Keep a Changelog](https://keepachangelog.com/en/1.1.0/).

## [0.1.5] — 2026-05-05

### Added

- **Semantic chunking** — `all-MiniLM-L6-v2` embedding-based boundary detection in `HybridChunker` (optional via `use_semantic_chunking`).
- **Cross-reference resolution** — single-pass O(N) linking of explicit references ("Figure 3") via regex lookup and implicit references ("the table below") via spatial proximity.
- **Summary chunks** — Asynchronous ARQ background worker (`enrich_summaries_job`) to auto-generate LLM section summaries for hierarchical RAG retrieval.
- **Chunk quality scorer** — Zero-ML, heuristic-based chunk scoring using block token confidences, dictionary word coverage (`/usr/share/dict/words`), and fastText language-ID validation.
- **PII redaction** — Hybrid approach combining fast regex + Luhn checks (emails, phone numbers, SSNs, credit cards, IP addresses) with optional spaCy NER (`en_core_web_sm`) for names, organizations, and locations. Original values are preserved in secure block metadata for HITL recovery.

### Changed

- Bumped the supported `marker-pdf` version range in dependencies.
- Added `ner` optional dependency group (`spacy>=3.7.0`) in `pyproject.toml`.
- Expanded `ChunkingConfig` and `ProcessingConfig` with new semantic, summary, and PII toggle options.
- Marked Phase 1 as officially complete in the roadmap.

---

## [0.1.3] — 2026-04-13

### Fixed
@@ -74,7 +93,7 @@ for production RAG pipelines.
via LangGraph `interrupt()` before embedding
- **3-layer memory chat** — short-term turns + rolling summary + long-term facts,
powered by LCEL chains
- **Multi-provider LLM support** — OpenAI (`gpt-5.3`), Gemini (`gemini-2.5`),
- **Multi-provider LLM support** — OpenAI (`gpt-4o`), Gemini (`gemini-2.0-flash`),
Groq (`llama-3.3-70b-versatile`), OpenRouter
- **Multi-backend vector stores** — Chroma, FAISS, Qdrant
- **Async-first REST API** — FastAPI + Motor (MongoDB) + ARQ (Redis job queue)
8 changes: 8 additions & 0 deletions docs/getting-started/configuration.md
@@ -58,6 +58,9 @@ config = ProcessingConfig(
formula_ocr=True,
export_images=False,
max_pages=None, # None = all pages
redact_pii=False, # Enable fast Regex/Luhn PII redaction
use_ner_redaction=False, # Enable spaCy NER for contextual PII
ner_model="en_core_web_sm", # Model to use if use_ner_redaction is True
)
```

@@ -74,5 +77,10 @@ config = ChunkingConfig(
generate_schema_chunks=True, # table schema chunks
table_chunk_format="row_record", # pipe | row_record
wide_table_col_threshold=15,
use_semantic_chunking=False, # Split at semantic shifts
semantic_threshold=0.3, # Cosine similarity threshold
semantic_model="all-MiniLM-L6-v2", # Model for semantic chunking
resolve_cross_references=True, # Resolve explicit & implicit refs
generate_summary_chunks=False, # Use LLM to summarize sections
)
```
2 changes: 1 addition & 1 deletion docs/getting-started/installation.md
@@ -104,5 +104,5 @@ The server starts on `http://localhost:8000`.

```python
import longparser
print(longparser.__version__) # 0.1.4
print(longparser.__version__) # 0.1.5
```
56 changes: 56 additions & 0 deletions docs/guide/chunking.md
@@ -25,6 +25,7 @@ config = ChunkingConfig(
detect_equations=True,
table_chunk_format="row_record", # or "pipe"
generate_schema_chunks=True,
use_semantic_chunking=True, # Split on semantic topic shifts
)

chunker = HybridChunker(config)
@@ -68,3 +69,58 @@ class Chunk:
Chunks respect a hard `max_tokens` ceiling. Equations are kept with their surrounding context using a **glue heuristic**:

- If the *next* block is an equation AND the current window overflows, the last paragraph carries over into the new chunk (so the equation is never split from its context).
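
A minimal sketch of that glue behavior, with simplified dict-based blocks standing in for LongParser's real block model:

```python
# Sketch only: real blocks and chunks are richer objects than these dicts.
def chunk_with_glue(blocks: list[dict], max_tokens: int) -> list[list[dict]]:
    chunks: list[list[dict]] = []
    window: list[dict] = []
    used = 0
    for block in blocks:
        if window and used + block["tokens"] > max_tokens:
            carry: list[dict] = []
            # Glue: when the incoming block is an equation, pull the last
            # paragraph forward so the equation keeps its context.
            if block["type"] == "equation" and window[-1]["type"] == "paragraph":
                carry = [window.pop()]
            chunks.append(window)
            window = carry
            used = sum(b["tokens"] for b in carry)
        window.append(block)
        used += block["tokens"]
    if window:
        chunks.append(window)
    return chunks
```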

## Semantic Chunking

When `use_semantic_chunking=True`, the chunker uses embedding similarity (default: `all-MiniLM-L6-v2`) to detect topic shifts within a section. Instead of splitting purely by token count, it finds natural breakpoints where the semantic content changes.

```python
config = ChunkingConfig(
use_semantic_chunking=True,
semantic_threshold=0.3, # Cosine similarity split threshold
semantic_model="all-MiniLM-L6-v2", # or "all-mpnet-base-v2"
)
```

The model is lazily loaded on first use — no memory cost if the feature is disabled.
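
For intuition, a minimal sketch of threshold-based boundary detection (the helper below is illustrative, not part of the public API):

```python
# Assumes sentence-transformers and numpy are installed.
import numpy as np
from sentence_transformers import SentenceTransformer

def find_semantic_boundaries(paragraphs: list[str], threshold: float = 0.3) -> list[int]:
    """Return indices at which a new chunk should start."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    embeddings = model.encode(paragraphs, normalize_embeddings=True)
    boundaries = []
    for i in range(1, len(paragraphs)):
        # With normalized vectors, cosine similarity is a plain dot product.
        if float(np.dot(embeddings[i - 1], embeddings[i])) < threshold:
            boundaries.append(i)  # topic shift: start a new chunk here
    return boundaries
```

In this sketch, a split happens wherever the similarity between adjacent paragraphs drops below the threshold.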

## Cross-Reference Resolution

When `resolve_cross_references=True` (default), the pipeline automatically links textual references to their target blocks:

- **Explicit references:** `"see Figure 3"`, `"Table 2"`, `"Section 3.1"`, `"Appendix A"` → linked via regex + dictionary lookup.
- **Implicit references:** `"the figure above"`, `"the table below"` → linked via spatial proximity in reading order.

Resolved links appear in chunk metadata:

```json
{
"cross_references": [
{"label": "Figure 3", "target_block_id": "block-uuid-123"},
{"label": "the table below", "target_block_id": "block-uuid-456", "resolution": "proximity"}
]
}
```
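
As a rough illustration of the explicit pass, a pattern of this shape surfaces candidate labels (a sketch, not the shipped matcher):

```python
# Sketch of explicit-reference candidate matching; the pattern and helper
# are illustrative, not LongParser internals.
import re

EXPLICIT_REF = re.compile(r"\b(Figure|Fig\.|Table|Section|Appendix)\s+(\d+(?:\.\d+)*|[A-Z])\b")

def find_explicit_references(text: str) -> list[str]:
    """Return normalized labels such as 'Figure 3' or 'Section 3.1'."""
    return [f"{kind} {num}" for kind, num in EXPLICIT_REF.findall(text)]

print(find_explicit_references("As shown in Figure 3 and Section 3.1; see Appendix A."))
# ['Figure 3', 'Section 3.1', 'Appendix A']
```

Each candidate label is then looked up against the blocks actually present in the document; implicit phrases fall back to the nearest matching block in reading order.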

## Quality Scoring

Each chunk receives a `quality_score` (0.0–1.0) based on:

- **Block confidence** — OCR confidence from the extraction engine
- **Dictionary word coverage** — percentage of words found in `/usr/share/dict/words` (penalizes garbled OCR)
- **Language ID confidence** — fastText-based language detection score (low confidence = noise)

```python
from longparser.chunkers.quality_scorer import score_chunks
scored = score_chunks(chunks, blocks)
print(scored[0].quality_score) # 0.92
```
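
The dictionary signal amounts to a set lookup over the chunk's words; a sketch of the idea (the exact weighting LongParser applies on top of this ratio is not shown):

```python
# Sketch of the dictionary-coverage heuristic on its own.
import re

def dictionary_coverage(text: str, dict_path: str = "/usr/share/dict/words") -> float:
    """Fraction of alphabetic words present in the system dictionary."""
    with open(dict_path) as f:
        vocab = {line.strip().lower() for line in f}
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        return 0.0
    return sum(w in vocab for w in words) / len(words)

print(dictionary_coverage("The quarterly report shows revenue growth"))   # high: clean text
print(dictionary_coverage("Tne quar+erlv rcport shcws revenuc grovvth"))  # low: garbled OCR
```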

## PII Redaction

When `redact_pii=True` in `ProcessingConfig`, the pipeline automatically masks sensitive data **before** any HITL review:

- **Pass 1 (always):** Fast regex + Luhn checksum for emails, phone numbers, SSNs, credit cards, and IP addresses.
- **Pass 2 (optional):** spaCy NER (`use_ner_redaction=True`) for names, organizations, and locations.

Original values are preserved in `block.pii_redactions` for authorized recovery.
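
The Luhn checksum in Pass 1 is what separates plausible card numbers from arbitrary digit runs; a sketch of the standard algorithm (not the shipped validator):

```python
# Standard Luhn checksum; used here to validate regex hits before redaction.
import re

def luhn_valid(candidate: str) -> bool:
    digits = [int(d) for d in re.sub(r"\D", "", candidate)]
    if len(digits) < 13:  # shorter than any real card number
        return False
    total = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:  # double every second digit from the right
            d = d * 2 - 9 if d * 2 > 9 else d * 2
        total += d
    return total % 10 == 0

print(luhn_valid("4111 1111 1111 1111"))  # True:  redact as a credit card
print(luhn_valid("1234 5678 9012 3456"))  # False: leave untouched
```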
2 changes: 1 addition & 1 deletion docs/guide/parsing.md
@@ -1,6 +1,6 @@
# Document Parsing

LongParser uses **Docling** with Tesseract CLI OCR as its extraction engine — supporting PDF, DOCX, PPTX, XLSX, and CSV.
LongParser supports multiple extraction backends — **Docling** (default, with Tesseract OCR), **PyMuPDF4LLM** (10× faster for simple PDFs), and **Marker** (high-fidelity Markdown for academic papers).

## Supported Formats

5 changes: 5 additions & 0 deletions docs/index.md
@@ -37,6 +37,11 @@ Most RAG pipelines fail at the data layer. Hallucinations, missed tables, garble
|---|---|
| Multi-format extraction | PDF, DOCX, PPTX, XLSX, CSV |
| Hybrid chunking (6 strategies) | ✅ |
| Semantic chunking (embedding-based) | ✅ |
| Cross-referencing & linking | ✅ |
| Quality scoring (Zero-ML heuristics) | ✅ |
| PII Redaction (Regex + spaCy NER) | ✅ |
| Summary chunks generation | ✅ |
| HITL review workflow | ✅ |
| 3-layer memory chat | ✅ |
| Built-in citation validation | ✅ |
5 changes: 5 additions & 0 deletions docs/reference/chunkers.md
@@ -54,3 +54,8 @@ chunks: list[Chunk] = chunker.chunk(doc.blocks)
| `table_chunk_format` | `"row_record"` | `pipe` or `row_record` |
| `wide_table_col_threshold` | `15` | Split columns into bands above this |
| `min_chunk_tokens` | `20` | Merge chunks smaller than this |
| `use_semantic_chunking` | `False` | Embedding-based topic boundary detection |
| `semantic_threshold` | `0.3` | Cosine similarity threshold for splits |
| `semantic_model` | `"all-MiniLM-L6-v2"` | Sentence-transformer model |
| `resolve_cross_references` | `True` | Link Figure/Table/Section references |
| `generate_summary_chunks` | `False` | LLM-generated section summaries |
28 changes: 28 additions & 0 deletions docs/reference/extractors.md
@@ -70,3 +70,31 @@ latex = ocr.recognize(pil_image) # Returns LaTeX string

!!! note
Requires `pip install "longparser[latex-ocr]"` (`pix2tex`).

## MarkerExtractor

High-fidelity Markdown extractor for complex academic PDFs using `marker-pdf`.

```python
from longparser.extractors.marker_extractor import MarkerExtractor

extractor = MarkerExtractor()
doc = extractor.extract("academic_paper.pdf", ProcessingConfig())
```

!!! note
Requires `pip install "longparser[marker]"` (`marker-pdf`).

## PyMuPDFExtractor

Lightweight, fast alternative for speed-critical pipelines (10× faster than Docling for simple PDFs).

```python
from longparser.extractors.pymupdf_extractor import PyMuPDFExtractor

extractor = PyMuPDFExtractor()
doc = extractor.extract("simple_report.pdf", ProcessingConfig())
```

!!! warning
PyMuPDF4LLM is licensed under AGPL. It is only loaded when explicitly requested.