[EXPERIMENT][WIP] fix(metaspace): correct pipeline order for prepend_scheme=first with split=True by mlukasze · Pull Request #658 · openvinotoolkit/openvino_tokenizers

mlukasze · 2026-05-13T17:43:09Z

⚠️ AUTOMATICALLY GENERATED BY MEAT AGENT — REQUIRES HUMAN REVIEW ⚠️
This PR was created by an AI agent as part of automated model enablement.
A human maintainer must review and approve it before it can be considered for merge.
Do NOT merge without human review and sign-off.

Description

Two bugs prevented correct conversion of tokenizers that use a Metaspace pre-tokenizer with prepend_scheme="first" and split=True, for example SeamlessM4TTokenizerFast (facebook/seamless-m4t-v2-large) and NllbTokenizerFast (facebook/nllb-200-distilled-600M).

Bug 1 — `hf_parser.py`: wrong step inserted at pipeline position 0

Symptom: C++ BPETokenizer raised Incorrect number of inputs (received 17, expected 18).

Root cause: In parse_pre_tokenization_step, parse_metaspace returns [RN_replace_spaces, RN_prepend_first, RegexSplitStep] for the split=True + prepend_scheme=first case. The previous code called steps.pop(), which popped the last element (RegexSplitStep instead of RN_prepend_first) and inserted it at pipeline position 0. This displaced SpecialTokensSplit from position 0, so it was never processed, and the skip tensor was never produced — causing the BPE op to receive the wrong number of inputs.

Fix: Use next(i for ... if isinstance(steps[i], NormalizationStep)) to find the last NormalizationStep (i.e., RN_prepend_first) instead of blindly popping the tail.
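The difference between popping the tail and selecting the last NormalizationStep can be sketched as follows. This is a minimal illustration, not the code from hf_parser.py: the two dataclasses are hypothetical stand-ins for the real step classes, the helper names are invented, and it assumes that RN_replace_spaces and RN_prepend_first are both NormalizationStep instances while RegexSplitStep is not. The elided part of the `next(...)` expression is filled here with a reverse index scan, which is one possible way to reach the last match.

```python
from dataclasses import dataclass


@dataclass
class NormalizationStep:
    """Hypothetical stand-in for the real NormalizationStep class."""
    name: str


@dataclass
class PreTokenizationStep:
    """Hypothetical stand-in for a non-normalization step such as RegexSplitStep."""
    name: str


def metaspace_steps():
    # Shape of the list described for split=True + prepend_scheme=first.
    return [
        NormalizationStep("RN_replace_spaces"),
        NormalizationStep("RN_prepend_first"),
        PreTokenizationStep("RegexSplitStep"),
    ]


def pop_tail(steps):
    # Old behaviour: take whatever happens to be last in the list.
    return steps.pop()


def pop_last_normalization(steps):
    # Scan indices from the end and stop at the first NormalizationStep found,
    # which is the last one in list order. Raises StopIteration if none exists.
    idx = next(
        i for i in reversed(range(len(steps)))
        if isinstance(steps[i], NormalizationStep)
    )
    return steps.pop(idx)


old_steps = metaspace_steps()
old_choice = pop_tail(old_steps)

new_steps = metaspace_steps()
new_choice = pop_last_normalization(new_steps)

print(old_choice.name)               # RegexSplitStep
print(new_choice.name)               # RN_prepend_first
print([s.name for s in new_steps])   # ['RN_replace_spaces', 'RegexSplitStep']
```

With the type-based selection, the split step stays in the pre-tokenization list and only the prepend step is moved.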

Bug 2 — `tokenizer_pipeline.py`: Charsmap normalizer destroys the prepended ▁

Symptom: Token IDs were wrong — the first word was tokenized without its metaspace prefix (e.g. "Hello" instead of "▁Hello"), producing 3 tokens instead of 2 for "Hello world".

Root cause: In get_tokenizer_ov_subgraph, the is_metaspace_prepend_first branch ran the prepend step before normalization. The Precompiled charsmap in SeamlessM4T's normalizer maps U+2581 (▁) back to a regular space — so it silently undid the prepend.

Fix: In the is_metaspace_prepend_first branch, run all normalization_steps on the flat string before the prepend step. The non-prepend path is unchanged.
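Why the order matters can be shown with plain string functions. The sketch below is a toy model, not the OpenVINO subgraph: `charsmap_normalize`, `strip_step`, `prepend_first`, `buggy_order` and `fixed_order` are hypothetical names, and the charsmap is reduced to the single mapping described above (U+2581 to a regular space).

```python
import re

METASPACE = "\u2581"  # the ▁ character


def charsmap_normalize(text: str) -> str:
    # Toy stand-in for the Precompiled charsmap: maps ▁ back to a plain space.
    return text.replace(METASPACE, " ")


def strip_step(text: str) -> str:
    # Toy stand-in for the Strip normalizer.
    return text.strip()


def prepend_first(text: str) -> str:
    # Insert ▁ before the first non-space character only.
    return re.sub(r"\S", lambda m: METASPACE + m.group(0), text, count=1)


def buggy_order(text: str) -> str:
    # Prepend first, then normalize: the charsmap turns the new ▁ into a
    # space and the strip removes it.
    return strip_step(charsmap_normalize(prepend_first(text)))


def fixed_order(text: str) -> str:
    # Normalize first, then prepend: nothing runs afterwards to undo the ▁.
    return prepend_first(strip_step(charsmap_normalize(text)))


print(repr(buggy_order("Hello world")))  # 'Hello world'
print(repr(fixed_order("Hello world")))  # '▁Hello world'
```

In the buggy order the first word reaches the tokenizer without its prefix, which matches the symptom described above.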

Correct execution order (after fix)

StringTensorUnpack
  → normalization_steps (Charsmap, Strip, Replace) on flat string
  → RN_prepend_first (adds ▁ to first non-space char)
  → add_ragged_dimension / SpecialTokensSplit
  → pre_tokenization_steps (RN_replace_spaces, RegexSplit)
  → BPETokenizationStep

Test results

| Model | Token IDs | Detokenizer |
|---|---|---|
| `facebook/seamless-m4t-v2-large` | ✅ All match HuggingFace | ✅ Pass |
| `openai-community/gpt2` (regression) | ✅ Pass | ✅ Pass |
| `google-bert/bert-base-uncased` (regression) | ✅ Pass | ✅ Pass |

Files changed

python/openvino_tokenizers/hf_parser.py — Fix 1: correct step selection for metaspace+split
python/openvino_tokenizers/tokenizer_pipeline.py — Fix 2: run normalizers before prepend
tests/tokenizers_test.py — add facebook/seamless-m4t-v2-large to bpe_models test list

Fixes: SeamlessM4TTokenizerFast, NllbTokenizerFast and any model using HF fast tokenizer with Metaspace prepend_scheme=first + split=True.

…split=True

Two bugs prevented conversion and correct tokenization of models that use a Metaspace pre-tokenizer with prepend_scheme='first' AND split=True (e.g. facebook/seamless-m4t-v2-large, facebook/nllb-200-distilled-600M).

Bug 1: hf_parser.py (parse_pre_tokenization_step)
When parse_metaspace returns [RN_replace_spaces, RN_prepend_first, RegexSplitStep] for the split+prepend_first case, the previous code blindly did steps.pop() which popped RegexSplitStep (the last item) and inserted it at pipeline position 0. This shifted SpecialTokensSplit to position 1 where it was never processed, causing the C++ BPE op to receive the wrong number of inputs (17 instead of 18).
Fix: find the last NormalizationStep (RN_prepend_first) instead of unconditionally popping the tail.

Bug 2: tokenizer_pipeline.py (get_tokenizer_ov_subgraph)
In the is_metaspace_prepend_first branch, the prepend step ran first (inserting U+2581 ▁ before the first non-space character), and then normalization steps ran, including the Precompiled charsmap that maps ▁ back to a regular space. This silently destroyed the prepended prefix, causing the first word to be tokenised without the metaspace prefix and producing wrong token IDs.
Fix: run all normalization steps on the flat string *before* the prepend step, so the charsmap cannot undo the prepend. The else branch for non-prepend-first models is unchanged.

Tested on:
facebook/seamless-m4t-v2-large: all token IDs match HuggingFace ✓
openai-community/gpt2: regression test passes ✓
google-bert/bert-base-uncased: regression test passes ✓

mlukasze added 2 commits May 13, 2026 19:40

tests: add facebook/seamless-m4t-v2-large to bpe_models test list

b3d0175

Covers the Metaspace BPE tokenizer with prepend_scheme=first and split=True combination that was previously broken.

github-actions Bot added tokenizers_py tokenizers_test labels May 13, 2026

mlukasze wants to merge 2 commits into openvinotoolkit:master from mlukasze:fix/seamless-m4t-v2-tokenizer
