Skip to content

Simplify segment API; parallel DeepcupTokenizer#102

Merged
bact merged 37 commits into
PyThaiNLP:mainfrom
bact:simplify-api
Apr 4, 2026
Merged

Simplify segment API; parallel DeepcupTokenizer#102
bact merged 37 commits into
PyThaiNLP:mainfrom
bact:simplify-api

Conversation

@bact
Copy link
Copy Markdown
Member

@bact bact commented Apr 3, 2026

  • Provide better ergonomic for .segment()
  • Remove .segment_to_string() -- to resolve Clarify method in Tokenizer trait #38
  • All tokenizers now share the same palleral logic
  • And some other restructuring
  • Fix packaging

bact added 2 commits April 3, 2026 02:46
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact bact added the enhancement New feature or request label Apr 3, 2026
@review-notebook-app
Copy link
Copy Markdown

Check out this pull request on  ReviewNB

See visual diffs & provide feedback on Jupyter Notebooks.


Powered by ReviewNB

@bact bact requested a review from Copilot April 3, 2026 01:57
@bact bact marked this pull request as draft April 3, 2026 01:58
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR simplifies the core Tokenizer trait to a default segment(text) / segment_to_string(text) API and introduces a shared chunk-based parallelization configuration that’s reused across tokenizers (including Deepcut), with corresponding updates across tests, benches, bindings, and docs.

Changes:

  • Simplified Tokenizer trait signatures to text-only defaults; moved advanced controls to tokenizer-specific option methods.
  • Added shared parallel chunking configuration (ParallelOptions) and helper utilities for chunk splitting/token execution.
  • Extended tokenizers (NewMM/FST/Deepcut) and bindings/docs/tests to use the new API shape.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 9 comments.

Show a summary per file
File Description
tests/test_tokenizer.rs Updates Rust tests to new segment_to_string(...) and options-based APIs.
src/tokenizer/tokenizer_trait.rs Simplifies the Tokenizer trait to default segmentation methods.
src/tokenizer/parallel_options.rs Adds shared chunk-size/parallel enablement heuristics.
src/tokenizer/parallel_helper.rs Adds chunk splitting + (optional) parallel chunk tokenization helpers.
src/tokenizer/newmm.rs Adds segment_with_options / segment_parallel APIs and wires parallel gating to chunk size.
src/tokenizer/deepcut.rs Adds options-based chunked segmentation and auto-parallel helper for Deepcut.
src/tokenizer.rs Registers new tokenizer submodules (parallel_options, parallel_helper).
README.md Rewrites top-level docs to reflect the simplified API and new options.
nlpo3-python/tests/test_tokenize.py Minor cleanup in Python tests data/imports.
nlpo3-python/src/lib.rs Updates Python binding method signatures to use parallel_chunk_size and options APIs.
nlpo3-python/README.md Refreshes Python README to match the class-based API and new options.
nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb Notebook JSON formatting changes; still contains benchmark code cells.
nlpo3-python/nlpo3/_nlpo3_python_backend.pyi Minor formatting adjustment in stubs.
nlpo3-python/CHANGELOG.md Condenses v2-line changelog wording and pointers.
nlpo3-nodejs/README.md Updates Node.js README copy/structure.
nlpo3-cli/README.md Updates CLI README copy/structure and flags documentation.
docs/impl-notes.md Adds implementation/design notes for the API redesign and parallel chunking.
CHANGELOG.md Reworks top-level changelog for the v2 line and migration notes.
benches/tokenizer.rs Updates benchmarks to the new segmentation APIs.

Comment thread benches/tokenizer.rs
Comment thread benches/tokenizer.rs
Comment thread src/tokenizer/tokenizer_trait.rs Outdated
Comment thread src/tokenizer/tokenizer_trait.rs Outdated
Comment thread src/tokenizer/deepcut.rs
Comment thread src/tokenizer/parallel_helper.rs
Comment thread src/tokenizer/parallel_helper.rs Outdated
Comment thread tests/test_tokenizer.rs Outdated
Comment thread nlpo3-python/notebooks/nlpo3_segment_benchmarks.ipynb
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact bact requested a review from Copilot April 3, 2026 02:36
@bact bact marked this pull request as ready for review April 3, 2026 02:37
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 11 comments.

Comment thread nlpo3-cli/src/main.rs
Comment thread nlpo3-nodejs/src/lib.rs Outdated
Comment thread nlpo3-nodejs/src/lib.rs Outdated
Comment thread nlpo3-nodejs/src/lib.rs Outdated
Comment thread src/tokenizer/parallel_helper.rs Outdated
Comment thread src/tokenizer/parallel_helper.rs
Comment thread src/tokenizer/parallel_helper.rs Outdated
Comment thread src/tokenizer/deepcut.rs
Comment thread CHANGELOG.md
Comment thread nlpo3-python/CHANGELOG.md
bact added 4 commits April 3, 2026 04:04
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 5 comments.

Comment thread benches/tokenizer.rs
Comment thread src/tokenizer/parallel_helper.rs Outdated
Comment thread src/tokenizer/newmm.rs Outdated
Comment thread README.md
Comment thread nlpo3-python/CHANGELOG.md
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 23 out of 24 changed files in this pull request and generated 4 comments.

Comment thread benches/tokenizer.rs
Comment thread nlpo3-nodejs/src/lib.rs
Comment thread src/tokenizer/newmm.rs Outdated
Comment thread nlpo3-python/CHANGELOG.md
bact added 3 commits April 3, 2026 11:44
Add nlpo3-nodejs test workflow

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 31 out of 33 changed files in this pull request and generated 3 comments.

Comment thread src/tokenizer/newmm.rs Outdated
Comment thread src/tokenizer/newmm.rs Outdated
Comment thread src/tokenizer/deepcut.rs Outdated
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 51 out of 62 changed files in this pull request and generated 5 comments.

Comment thread src/tokenizer/parallel_helper.rs Outdated
Comment thread README.md Outdated
Comment thread nlpo3-python/README.md Outdated
Comment thread nlpo3-python/CHANGELOG.md Outdated
Comment thread docs/design.md Outdated
bact and others added 2 commits April 3, 2026 22:35
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 51 out of 62 changed files in this pull request and generated 1 comment.

Comment thread src/tokenizer/newmm.rs Outdated
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 51 out of 62 changed files in this pull request and generated 1 comment.

Comment thread nlpo3-python/CHANGELOG.md
nlpo3-js -> nlpo3-nodejs

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 48 out of 57 changed files in this pull request and generated 4 comments.

Files not reviewed (1)
  • nlpo3-nodejs/package-lock.json: Language not supported
Comments suppressed due to low confidence (1)

.github/workflows/test-nlpo3-python.yml:60

  • The CI matrix includes Python 3.14 and uses actions/setup-python@v6. GitHub Actions may not have a stable 3.14 build available (and @v6 for the setup actions may not exist / may be unsupported), which would break CI. Consider testing against released CPython versions (e.g., 3.9–3.13) and using the latest known major action versions (e.g., checkout/setup-python v4+) unless you’ve confirmed v6 and 3.14 are available.
        os: [ubuntu-latest]
        python-version: ["3.14", "3.13", "3.12", "3.11", "3.10", "3.9"]
        include:
          - os: windows-latest
            python-version: "3.14"
          - os: macos-latest
            python-version: "3.14"

    runs-on: ${{ matrix.os }}

    steps:
      - name: Checkout source code
        uses: actions/checkout@v6

      - name: Setup Rust toolchain
        uses: actions-rust-lang/setup-rust-toolchain@v1

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v6
        with:

Comment thread src/tokenizer/newmm.rs Outdated
Comment thread tests/data/gen-dicts.py
Comment thread nlpo3-nodejs/CHANGELOG.md
Comment thread nlpo3-cli/CHANGELOG.md
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Save some resource
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 50 out of 59 changed files in this pull request and generated no new comments.

Files not reviewed (1)
  • nlpo3-nodejs/package-lock.json: Language not supported

Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 56 out of 66 changed files in this pull request and generated 1 comment.

Files not reviewed (1)
  • nlpo3-nodejs/package-lock.json: Language not supported
Comments suppressed due to low confidence (1)

nlpo3-python/MANIFEST.in:7

  • nlpo3-python/Cargo.toml depends on nlpo3 via path = "..", but this MANIFEST no longer includes the parent Rust crate sources (and model assets) in the sdist. Building from an sdist outside the monorepo will fail. Either re-add the needed graft ../src / graft ../model (and root Cargo files) for sdists, or switch the dependency to a crates.io nlpo3 = "2.0" without a path dependency.

Comment thread nlpo3-nodejs/README.md Outdated
bact added 2 commits April 4, 2026 02:28
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
Signed-off-by: Arthit Suriyawongkul <arthit@gmail.com>
@bact
Copy link
Copy Markdown
Member Author

bact commented Apr 4, 2026

💯

@bact bact merged commit b95ad66 into PyThaiNLP:main Apr 4, 2026
26 checks passed
@bact bact deleted the simplify-api branch April 4, 2026 01:37
Copilot AI added a commit that referenced this pull request Apr 4, 2026
Sync all changes from main (PR #102 simplify-api and subsequent commits):
- New parallel API: segment_with_options(), segment_parallel(), ParallelOptions
- Simplified Tokenizer trait (segment only)
- tests/data/ benchmark dictionaries (500-short, 500-long, 10k, words_th)
- docs/ directory (design.md, implementation.md, BENCHMARK_RESULTS.md)
- nlpo3-js renamed from nlpo3-nodejs; Node.js/Python binding updates
- CLI dict moved to nlpo3-cli/

Branch-specific changes preserved:
- TrieChar memory optimization (HashSet removed, word_count counter used)
- Updated BENCHMARK_RESULTS with TrieChar optimization numbers

CHANGELOG.md conflict resolved: keep main concise summary + add
TrieChar optimization bullet.

Co-authored-by: bact <128572+bact@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Clarify method in Tokenizer trait

2 participants