feat: implement recursive chunking with chonkie and persist to db #198
Open
AtharvaPatange wants to merge 3 commits into Extralit:develop from
[Feature] Implement recursive chunking with chonkie and persist to db
Description
This PR adds a text chunking pipeline to the document processing workflow. It introduces an RQ job that takes the JSON output from the OCR/text-extraction worker, extracts plain text from various JSON shapes (markdown, text, Marker-style pages→blocks), chunks it using chonkie's RecursiveChunker, and persists both the extracted text and the chunks into documents.metadata_ in the database.

Key changes:
- chunker.py: Rewritten to use chonkie.RecursiveChunker (default 512-token chunks) instead of a custom naive whitespace splitter.
- metadata.py: Added the ChunkMetadata model and a chunks field to TextExtractionMetadata; added the update_text_extraction_results() helper on DocumentProcessingMetadata.
- document_jobs.py: Added the process_text_extraction_result_job RQ job and the _extract_text_from_result() helper, which handles multiple OCR JSON formats.
- preprocessing.py: Removed stale chunking logic (the chunk_pdf() method) and unused imports.
- chunk_store.py: Deleted. Chunks are persisted to the DB, not stored in a process-local in-memory dict.
- documents.py (workflow): Imported the new job; ready to be wired via an RQ Callback when text extraction completes.
- pyproject.toml: Added the chonkie>=1.0.2 dependency.

Related Tickets & Documents
Assigned by @JonnyTran via chat
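For reviewers, the multi-format text extraction listed under Key changes can be pictured with the minimal sketch below. It is not the PR's actual _extract_text_from_result() code: the function name extract_text, the key names (markdown, text, pages, blocks) and the assumption that each block carries a text string are all hypothetical, chosen only to illustrate the fallback order described above.

```python
from typing import Any


def extract_text(result: Any) -> str:
    """Return plain text from a few ASSUMED OCR result shapes (illustrative only)."""
    if isinstance(result, str):
        return result.strip()
    if not isinstance(result, dict):
        return ""

    # Assumed shapes 1 and 2: a top-level "markdown" or "text" string.
    for key in ("markdown", "text"):
        value = result.get(key)
        if isinstance(value, str) and value.strip():
            return value.strip()

    # Assumed shape 3 (Marker-style): pages -> blocks -> text.
    pages = result.get("pages")
    if not isinstance(pages, list):
        return ""
    parts = []
    for page in pages:
        blocks = page.get("blocks") if isinstance(page, dict) else None
        if not isinstance(blocks, list):
            continue
        for block in blocks:
            text = block.get("text") if isinstance(block, dict) else None
            if isinstance(text, str) and text.strip():
                parts.append(text.strip())
    # Join blocks with blank lines so paragraph boundaries survive for chunking.
    return "\n\n".join(parts)
```

Returning an empty string for unrecognised shapes, rather than raising, is one possible design choice: a job using a helper like this could then skip chunking and still record the extraction outcome.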
What type of PR is this? (check all applicable)
Steps to QA
Run pip install -e ".[dev]" from extralit-server/.
Added/updated tests?
Tests have not been included.
Added/updated documentations?
Checklist