
feat: implement recursive chunking with chonkie and persist to db #198

Open
AtharvaPatange wants to merge 3 commits into Extralit:develop from AtharvaPatange:feat/text-chunking

Conversation

@AtharvaPatange

[Feature] Implement recursive chunking with chonkie and persist to db

Description

This PR adds a text chunking pipeline to the document processing workflow. It introduces an RQ job that takes the JSON output from the OCR/text-extraction worker, extracts plain text from various JSON shapes (markdown, text, Marker-style pages→blocks), chunks it using chonkie's RecursiveChunker, and persists both the extracted text and chunks into documents.metadata_ in the database.
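For reviewers who want a concrete picture of the extraction step, here is a minimal sketch of how plain text might be pulled from the three JSON shapes named above. It is not the PR's `_extract_text_from_result()`: the function name `extract_text` and the key names `pages`, `blocks`, and `text` for the Marker-style shape are assumptions made for illustration.

```python
from typing import Any


def extract_text(result: Any) -> str:
    """Return plain text from an extraction result, or "" if none is found."""
    if isinstance(result, str):
        return result.strip()
    if not isinstance(result, dict):
        return ""

    # Shapes 1 and 2: a top-level "markdown" or "text" string.
    for key in ("markdown", "text"):
        value = result.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()

    # Shape 3 (assumed key names): pages -> blocks -> text.
    pages = result.get("pages")
    if not isinstance(pages, list):
        return ""
    parts = []
    for page in pages:
        blocks = page.get("blocks") if isinstance(page, dict) else None
        if not isinstance(blocks, list):
            continue
        for block in blocks:
            text = block.get("text") if isinstance(block, dict) else None
            if isinstance(text, str) and text.strip():
                parts.append(text.strip())
    # Join blocks with blank lines so paragraph boundaries survive chunking.
    return "\n\n".join(parts)
```

Returning an empty string for unrecognized shapes, rather than raising, lets the calling job decide whether to skip chunking or mark the document as failed.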

Key changes:

  • chunker.py — Rewrote to use chonkie.RecursiveChunker (default 512-token chunks) instead of a custom naive whitespace splitter.
  • metadata.py — Added ChunkMetadata model and chunks field to TextExtractionMetadata; added update_text_extraction_results() helper on DocumentProcessingMetadata.
  • document_jobs.py — Added process_text_extraction_result_job RQ job and _extract_text_from_result() helper that handles multiple OCR JSON formats.
  • preprocessing.py — Removed stale chunking logic (chunk_pdf() method) and unused imports.
  • chunk_store.py — Deleted. Chunks are persisted to DB, not stored in a process-local in-memory dict.
  • documents.py (workflow) — Imported the new job; ready to be wired via RQ Callback when text-extraction completes.
  • pyproject.toml — Added chonkie>=1.0.2 dependency.
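The chunk-and-persist flow described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: `ChunkRecord` is a hypothetical stand-in for `ChunkMetadata`, and the `extracted_text` and `chunks` keys are assumed names for what ends up in `documents.metadata_`. The only chonkie calls used are `RecursiveChunker(chunk_size=...)` and `.chunk(text)`, and the chunk attributes read are `text`, `start_index`, `end_index`, and `token_count`.

```python
from dataclasses import asdict, dataclass
from typing import Any, Iterable


@dataclass
class ChunkRecord:
    """Hypothetical stand-in for the PR's ChunkMetadata model."""
    index: int
    text: str
    start_index: int
    end_index: int
    token_count: int


def chunk_text(text: str, chunk_size: int = 512) -> list:
    """Split text with chonkie's RecursiveChunker.

    chonkie is imported lazily so the serialization helper below can be
    exercised without the dependency installed.
    """
    from chonkie import RecursiveChunker

    chunker = RecursiveChunker(chunk_size=chunk_size)
    return chunker.chunk(text)


def to_metadata(text: str, chunks: Iterable[Any]) -> dict:
    """Build a JSON-serializable payload (assumed shape) from chunk objects."""
    records = [
        ChunkRecord(
            index=i,  # position in the document, assigned here
            text=c.text,
            start_index=c.start_index,
            end_index=c.end_index,
            token_count=c.token_count,
        )
        for i, c in enumerate(chunks)
    ]
    return {"extracted_text": text, "chunks": [asdict(r) for r in records]}
```

In a job, the two helpers would be combined as `to_metadata(text, chunk_text(text))`. Storing plain dicts rather than chunk objects keeps the payload safe to write into a JSON column and to read back in another worker process.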

Related Tickets & Documents

Assigned by @JonnyTran via chat

What type of PR is this? (check all applicable)

  • Refactor
  • Feature
  • Bug Fix
  • Optimization
  • Documentation Update

Steps to QA

  1. Install dependencies: pip install -e ".[dev]" from extralit-server/.
  2. Run the new unit tests:
    python -m pytest tests/unit/contexts/document/ -v --noconftest
    

Added/updated tests?

  • Yes
  • No, and this is why: please replace this line with details on why tests
    have not been included
  • I need help with writing tests

Added/updated documentations?

  • Yes
  • No, and this is why: This is an internal backend pipeline addition. API-facing docs are not affected since chunks are stored in the existing documents.metadata_ JSON column with no new endpoints.
  • I need help with writing docs

Checklist

@AtharvaPatange AtharvaPatange requested a review from a team as a code owner March 6, 2026 13:05
