feat: implement recursive chunking with chonkie and persist to db by AtharvaPatange · Pull Request #198 · Extralit/extralit

AtharvaPatange · 2026-03-06T13:05:45Z

[Feature] Implement recursive chunking with chonkie and persist to db

Description

This PR adds a text chunking pipeline to the document processing workflow. It introduces an RQ job that takes the JSON output from the OCR/text-extraction worker, extracts plain text from various JSON shapes (markdown, text, Marker-style pages→blocks), chunks it using chonkie's RecursiveChunker, and persists both the extracted text and chunks into documents.metadata_ in the database.
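The text-extraction step described above could look roughly like the sketch below. It is an illustration only, not the PR's `_extract_text_from_result()`: the function name `extract_text` and the key names `markdown`, `text`, `pages`, and `blocks` are assumptions inferred from the description, and the real OCR payloads may differ.

```python
from typing import Any


def extract_text(result: Any) -> str:
    """Return plain text from a few assumed OCR/text-extraction JSON shapes."""
    if isinstance(result, str):
        return result.strip()
    if not isinstance(result, dict):
        return ""

    # Shapes 1 and 2: a top-level "markdown" or "text" string (assumed keys).
    for key in ("markdown", "text"):
        value = result.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()

    # Shape 3: Marker-style nesting, pages -> blocks -> text (assumed keys).
    pages = result.get("pages")
    if not isinstance(pages, list):
        return ""
    parts = []
    for page in pages:
        blocks = page.get("blocks") if isinstance(page, dict) else None
        if not isinstance(blocks, list):
            continue
        for block in blocks:
            text = block.get("text") if isinstance(block, dict) else None
            if isinstance(text, str) and text.strip():
                parts.append(text.strip())
    # Join block texts with blank lines so paragraph boundaries survive.
    return "\n\n".join(parts)
```

Returning an empty string for unrecognized shapes lets the caller decide whether to skip chunking or mark the document as failed, instead of raising inside the worker.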

Key changes:

chunker.py — Rewrote to use chonkie.RecursiveChunker (default 512-token chunks) instead of a custom naive whitespace splitter.
metadata.py — Added ChunkMetadata model and chunks field to TextExtractionMetadata; added update_text_extraction_results() helper on DocumentProcessingMetadata.
document_jobs.py — Added process_text_extraction_result_job RQ job and _extract_text_from_result() helper that handles multiple OCR JSON formats.
preprocessing.py — Removed stale chunking logic (chunk_pdf() method) and unused imports.
chunk_store.py — Deleted. Chunks are persisted to DB, not stored in a process-local in-memory dict.
documents.py (workflow) — Imported the new job; ready to be wired via RQ Callback when text-extraction completes.
pyproject.toml — Added chonkie>=1.0.2 dependency.

Related Tickets & Documents

Assigned by @JonnyTran via chat

What type of PR is this? (check all applicable)

Steps to QA

Install dependencies: pip install -e ".[dev]" from extralit-server/.

Run the new unit tests:

python -m pytest tests/unit/contexts/document/ -v --noconftest

Added/updated tests?

Yes
No, and this is why: please replace this line with details on why tests have not been included
I need help with writing tests

Added/updated documentations?

Yes
No, and this is why: This is an internal backend pipeline addition. API-facing docs are not affected since chunks are stored in the existing documents.metadata_ JSON column with no new endpoints.
I need help with writing docs

Checklist

I have added relevant notes to the CHANGELOG.md file (See https://keepachangelog.com/)

AtharvaPatange added 3 commits March 6, 2026 17:53

feat: implement recursive chunking with chonkie and persist to db

792f2ff

added relevant notes to the CHANGELOG.md file

d39dd7b

added relevant notes to the CHANGELOG.md file

ee084d1

AtharvaPatange requested a review from a team as a code owner March 6, 2026 13:05

AtharvaPatange wants to merge 3 commits into Extralit:develop from AtharvaPatange:feat/text-chunking
