Simplify segment API; parallel DeepcupTokenizer#102
Conversation
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
|
Check out this pull request on See visual diffs & provide feedback on Jupyter Notebooks. Powered by ReviewNB |
There was a problem hiding this comment.
Pull request overview
This PR simplifies the core Tokenizer trait to a default segment(text) / segment_to_string(text) API and introduces a shared chunk-based parallelization configuration that’s reused across tokenizers (including Deepcut), with corresponding updates across tests, benches, bindings, and docs.
Changes:
- Simplified
Tokenizertrait signatures to text-only defaults; moved advanced controls to tokenizer-specific option methods. - Added shared parallel chunking configuration (
ParallelOptions) and helper utilities for chunk splitting/token execution. - Extended tokenizers (NewMM/FST/Deepcut) and bindings/docs/tests to use the new API shape.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/test_tokenizer.rs | Updates Rust tests to new segment_to_string(...) and options-based APIs. |
| src/tokenizer/tokenizer_trait.rs | Simplifies the Tokenizer trait to default segmentation methods. |
| src/tokenizer/parallel_options.rs | Adds shared chunk-size/parallel enablement heuristics. |
| src/tokenizer/parallel_helper.rs | Adds chunk splitting + (optional) parallel chunk tokenization helpers. |
| src/tokenizer/newmm.rs | Adds segment_with_options / segment_parallel APIs and wires parallel gating to chunk size. |
| src/tokenizer/deepcut.rs | Adds options-based chunked segmentation and auto-parallel helper for Deepcut. |
| src/tokenizer.rs | Registers new tokenizer submodules (parallel_options, parallel_helper). |
| README.md | Rewrites top-level docs to reflect the simplified API and new options. |
| nlpo3-python/tests/test_tokenize.py | Minor cleanup in Python tests data/imports. |
| nlpo3-python/src/lib.rs | Updates Python binding method signatures to use parallel_chunk_size and options APIs. |
| nlpo3-python/README.md | Refreshes Python README to match the class-based API and new options. |
| nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb | Notebook JSON formatting changes; still contains benchmark code cells. |
| nlpo3-python/nlpo3/_nlpo3_python_backend.pyi | Minor formatting adjustment in stubs. |
| nlpo3-python/CHANGELOG.md | Condenses v2-line changelog wording and pointers. |
| nlpo3-nodejs/README.md | Updates Node.js README copy/structure. |
| nlpo3-cli/README.md | Updates CLI README copy/structure and flags documentation. |
| docs/impl-notes.md | Adds implementation/design notes for the API redesign and parallel chunking. |
| CHANGELOG.md | Reworks top-level changelog for the v2 line and migration notes. |
| benches/tokenizer.rs | Updates benchmarks to the new segmentation APIs. |
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Add nlpo3-nodejs test workflow Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
nlpo3-js -> nlpo3-nodejs Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 48 out of 57 changed files in this pull request and generated 4 comments.
Files not reviewed (1)
- nlpo3-nodejs/package-lock.json: Language not supported
Comments suppressed due to low confidence (1)
.github/workflows/test-nlpo3-python.yml:60
- The CI matrix includes Python
3.14and usesactions/setup-python@v6. GitHub Actions may not have a stable3.14build available (and@v6for the setup actions may not exist / may be unsupported), which would break CI. Consider testing against released CPython versions (e.g., 3.9–3.13) and using the latest known major action versions (e.g., checkout/setup-python v4+) unless you’ve confirmedv6and3.14are available.
os: [ubuntu-latest]
python-version: ["3.14", "3.13", "3.12", "3.11", "3.10", "3.9"]
include:
- os: windows-latest
python-version: "3.14"
- os: macos-latest
python-version: "3.14"
runs-on: ${{ matrix.os }}
steps:
- name: Checkout source code
uses: actions/checkout@v6
- name: Setup Rust toolchain
uses: actions-rust-lang/setup-rust-toolchain@v1
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v6
with:
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Save some resource Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 56 out of 66 changed files in this pull request and generated 1 comment.
Files not reviewed (1)
- nlpo3-nodejs/package-lock.json: Language not supported
Comments suppressed due to low confidence (1)
nlpo3-python/MANIFEST.in:7
nlpo3-python/Cargo.tomldepends onnlpo3viapath = "..", but this MANIFEST no longer includes the parent Rust crate sources (and model assets) in the sdist. Building from an sdist outside the monorepo will fail. Either re-add the neededgraft ../src/graft ../model(and root Cargo files) for sdists, or switch the dependency to a crates.ionlpo3 = "2.0"without a path dependency.
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
|
💯 |
Sync all changes from main (PR #102 simplify-api and subsequent commits): - New parallel API: segment_with_options(), segment_parallel(), ParallelOptions - Simplified Tokenizer trait (segment only) - tests/data/ benchmark dictionaries (500-short, 500-long, 10k, words_th) - docs/ directory (design.md, implementation.md, BENCHMARK_RESULTS.md) - nlpo3-js renamed from nlpo3-nodejs; Node.js/Python binding updates - CLI dict moved to nlpo3-cli/ Branch-specific changes preserved: - TrieChar memory optimization (HashSet removed, word_count counter used) - Updated BENCHMARK_RESULTS with TrieChar optimization numbers CHANGELOG.md conflict resolved: keep main concise summary + add TrieChar optimization bullet. Co-authored-by: bact <128572+bact@users.noreply.github.com>
Uh oh!
There was an error while loading. Please reload this page.