Improve Sentence Splitting #1

Open
geraldaddey wants to merge 5 commits into main from feature/improve-sentence-splitting

Conversation

geraldaddey (Owner) commented Nov 22, 2025

Replaced the regex-based sentence splitting in 'caveman_compress.py' with spaCy's more robust sentence segmentation.

This improves accuracy by correctly handling cases like abbreviations (e.g., "Mr. Smith") and other complex sentence structures that the previous implementation missed.

Adds 'spacy' to the requirements.
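
For reviewers, a minimal sketch of the difference described above. Both helper names are hypothetical, and the regex is only a stand-in for the kind of pattern the old code may have used. The spaCy helper assumes a blank English pipeline with the rule-based `sentencizer` so that no model download is needed; the PR itself may load a trained pipeline instead.

```python
import re


def split_sentences_regex(text: str) -> list[str]:
    """Naive splitter: break after '.', '!' or '?' when whitespace follows."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def split_sentences_spacy(text: str) -> list[str]:
    """Rule-based spaCy segmentation (requires `pip install spacy`)."""
    import spacy  # imported lazily so the regex helper works without spaCy

    nlp = spacy.blank("en")      # tokenizer only, no trained model
    nlp.add_pipe("sentencizer")  # sets sentence boundaries on punctuation tokens
    return [sent.text.strip() for sent in nlp(text).sents]


text = "Mr. Smith arrived late. He apologized."
# The naive regex also breaks after the abbreviation "Mr."
print(split_sentences_regex(text))
```

The spaCy variant is expected to keep "Mr. Smith" together, because the English tokenizer treats "Mr." as a single token rather than a word followed by a period.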

Refactored the OpenAI API key loading logic into a single, reusable function in 'utils.py'.

This removes duplicated code from five different scripts and keeps key management in one place. The new 'load_api_key' function locates the '.env' file in the project root.
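
A rough sketch of how such a helper could work, using only the standard library. The name `load_api_key` comes from this PR, but the signature, the `find_env_file` helper, the upward directory search, and the hand-rolled '.env' parsing are all assumptions for illustration; the actual 'utils.py' may rely on a library such as python-dotenv.

```python
import os
from pathlib import Path
from typing import Optional


def find_env_file(start: Path) -> Optional[Path]:
    """Walk from `start` up to the filesystem root; return the first .env found."""
    start = start.resolve()
    for directory in (start, *start.parents):
        candidate = directory / ".env"
        if candidate.is_file():
            return candidate
    return None


def load_api_key(start: Optional[Path] = None, name: str = "OPENAI_API_KEY") -> str:
    """Return the key from the environment, falling back to the nearest .env file."""
    value = os.environ.get(name)
    if value:
        return value
    env_file = find_env_file(start or Path.cwd())
    if env_file is not None:
        for line in env_file.read_text(encoding="utf-8").splitlines():
            line = line.strip()
            # Skip blanks, comments, and lines that are not KEY=VALUE pairs.
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, _, raw = line.partition("=")
            if key.strip() == name:
                return raw.strip().strip("'\"")
    raise RuntimeError(f"{name} not found in the environment or in a .env file")
```

Each script would then call `load_api_key()` once at startup instead of carrying its own copy of the lookup.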
geraldaddey self-assigned this Nov 22, 2025
geraldaddey marked this pull request as ready for review November 22, 2025 11:00
wilpel and others added 3 commits November 22, 2025 16:32
- Add caveman_compress_mlm.py: Predictability-aware compression using masked language models
- Uses RoBERTa to score token predictability and remove top-k most predictable tokens
- Achieves 20-30% token reduction with high accuracy retention
- Free, offline, no API required
- Update README with MLM installation, usage, and comparison
- Add embedding similarity metrics to LLM-based compressor

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
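
The removal step from the commit message above (drop the top-k most predictable tokens) can be sketched independently of the model. This is an assumption-laden illustration, not the code in 'caveman_compress_mlm.py': `drop_most_predictable` is a hypothetical name, and the scores below are hand-picked placeholders standing in for the probabilities a masked language model such as RoBERTa would assign to each token when it is masked.

```python
from typing import Sequence


def drop_most_predictable(
    tokens: Sequence[str],
    predictability: Sequence[float],
    ratio: float = 0.25,
) -> list[str]:
    """Remove the k highest-scoring tokens, where k = floor(len(tokens) * ratio)."""
    if len(tokens) != len(predictability):
        raise ValueError("tokens and predictability must have the same length")
    k = int(len(tokens) * ratio)
    # Rank token positions from most to least predictable.
    ranked = sorted(range(len(tokens)), key=lambda i: predictability[i], reverse=True)
    drop = set(ranked[:k])
    # Keep the surviving tokens in their original order.
    return [tok for i, tok in enumerate(tokens) if i not in drop]


tokens = ["the", "cat", "sat", "on", "the", "mat", "quietly", "today"]
# Placeholder scores for illustration only, not model output.
scores = [0.95, 0.10, 0.30, 0.90, 0.85, 0.20, 0.05, 0.15]
print(drop_most_predictable(tokens, scores, ratio=0.25))
# → ['cat', 'sat', 'the', 'mat', 'quietly', 'today']
```

Ranking by position rather than by token text means a repeated word such as "the" is only dropped where it is most predictable.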

2 participants