[EXPERIMENT][WIP] fix(metaspace): correct pipeline order for prepend_scheme=first with split=True#658

Draft
mlukasze wants to merge 2 commits into
openvinotoolkit:master from
mlukasze:fix/seamless-m4t-v2-tokenizer

⚠️ AUTOMATICALLY GENERATED BY MEAT AGENT — REQUIRES HUMAN REVIEW ⚠️
This PR was created by an AI agent as part of automated model enablement.
A human maintainer must review and approve it before it can be considered for merge.
Do NOT merge without human review and sign-off.

Description

Two bugs prevented correct conversion of tokenizers that use a Metaspace pre-tokenizer with both prepend_scheme="first" and split=True, such as SeamlessM4TTokenizerFast (facebook/seamless-m4t-v2-large) and NllbTokenizerFast (facebook/nllb-200-distilled-600M).

Bug 1 — hf_parser.py: wrong step inserted at pipeline position 0

Symptom: C++ BPETokenizer raised Incorrect number of inputs (received 17, expected 18).

Root cause: In parse_pre_tokenization_step, parse_metaspace returns [RN_replace_spaces, RN_prepend_first, RegexSplitStep] for the split=True + prepend_scheme=first case. The previous code called steps.pop(), which removed the last element (RegexSplitStep, not the intended RN_prepend_first) and inserted it at pipeline position 0. This displaced SpecialTokensSplit from position 0, so it was never processed and the skip tensor was never produced. As a result, the BPE op received 17 inputs instead of the expected 18.

Fix: Use next(i for ... if isinstance(steps[i], NormalizationStep)) to locate the index of the last NormalizationStep (i.e., RN_prepend_first) and move that step, instead of blindly popping the tail.
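The difference between popping the tail and selecting the last NormalizationStep can be sketched in isolation. This is a minimal illustration, not the code from hf_parser.py: the two dataclasses are hypothetical stand-ins for the real step classes, the helper name pop_last_normalization_step is invented for this example, and treating RN_replace_spaces as a NormalizationStep is an assumption based on the wording above.

```python
from dataclasses import dataclass


# Hypothetical stand-ins for the step classes named in the description;
# the real classes in openvino_tokenizers carry more state.
@dataclass
class NormalizationStep:
    name: str


@dataclass
class PreTokenizationStep:
    name: str


def pop_last_normalization_step(steps):
    """Remove and return the last NormalizationStep in `steps`.

    Scans from the tail so that a trailing non-normalization step
    (such as a regex split) stays where it is.
    """
    index = next(
        i for i in reversed(range(len(steps)))
        if isinstance(steps[i], NormalizationStep)
    )
    return steps.pop(index)


# Assumed shape of the list returned for split=True + prepend_scheme=first.
steps = [
    NormalizationStep("RN_replace_spaces"),
    NormalizationStep("RN_prepend_first"),
    PreTokenizationStep("RegexSplitStep"),
]

# A bare steps.pop() would take the tail, which is the split step.
print(steps[-1].name)             # RegexSplitStep

prepend = pop_last_normalization_step(steps)
print(prepend.name)               # RN_prepend_first
print([s.name for s in steps])    # ['RN_replace_spaces', 'RegexSplitStep']
```

Selecting by type rather than by position keeps the selection correct whether or not a split step follows the prepend step.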

Bug 2 — tokenizer_pipeline.py: Charsmap normalizer destroys the prepended ▁

Symptom: Token IDs were wrong — the first word was tokenized without its metaspace prefix (e.g. "Hello" instead of "▁Hello"), producing 3 tokens instead of 2 for "Hello world".

Root cause: In get_tokenizer_ov_subgraph, the is_metaspace_prepend_first branch ran the prepend step before normalization. The Precompiled charsmap in SeamlessM4T's normalizer maps U+2581 (▁) back to a regular space — so it silently undid the prepend.

Fix: In the is_metaspace_prepend_first branch, run all normalization_steps on the flat string before the prepend step. The non-prepend path is unchanged.
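Why the order matters can be shown with plain string functions. The sketch below is a toy model, not the OpenVINO graph: charsmap_normalize is a hypothetical stand-in that reproduces only the one mapping described above (U+2581 to a regular space), and prepend_first is an assumed approximation of the RN_prepend_first step.

```python
import re

METASPACE = "\u2581"  # the "▁" marker


def charsmap_normalize(text: str) -> str:
    """Toy stand-in for the Precompiled charsmap: maps U+2581 to a space."""
    return text.replace(METASPACE, " ")


def prepend_first(text: str) -> str:
    """Insert the metaspace marker before the first non-space character."""
    return re.sub(
        r"^(\s*)(\S)",
        lambda m: m.group(1) + METASPACE + m.group(2),
        text,
        count=1,
    )


text = "Hello world"

# Old order: prepend first, then normalize. The charsmap undoes the prepend.
old_order = charsmap_normalize(prepend_first(text))

# Fixed order: normalize first, then prepend. The marker survives.
new_order = prepend_first(charsmap_normalize(text))

print(repr(old_order))  # ' Hello world'
print(repr(new_order))  # '▁Hello world'
```

In the old order the first word reaches the tokenizer without its ▁ prefix, which matches the symptom of "Hello" being tokenized instead of "▁Hello".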

Correct execution order (after fix)

StringTensorUnpack
  → normalization_steps (Charsmap, Strip, Replace) on flat string
  → RN_prepend_first (adds ▁ to first non-space char)
  → add_ragged_dimension / SpecialTokensSplit
  → pre_tokenization_steps (RN_replace_spaces, RegexSplit)
  → BPETokenizationStep

Test results

| Model | Token IDs | Detokenizer |
|---|---|---|
| facebook/seamless-m4t-v2-large | ✅ All match HuggingFace | ✅ Pass |
| openai-community/gpt2 (regression) | ✅ Pass | ✅ Pass |
| google-bert/bert-base-uncased (regression) | ✅ Pass | ✅ Pass |

Files changed

  • python/openvino_tokenizers/hf_parser.py — Fix 1: correct step selection for metaspace+split
  • python/openvino_tokenizers/tokenizer_pipeline.py — Fix 2: run normalizers before prepend
  • tests/tokenizers_test.py — add facebook/seamless-m4t-v2-large to bpe_models test list

Fixes: SeamlessM4TTokenizerFast, NllbTokenizerFast, and any model using an HF fast tokenizer with Metaspace prepend_scheme=first + split=True.

mlukasze added 2 commits May 13, 2026 19:40
…split=True

Two bugs prevented conversion and correct tokenization of models that use
a Metaspace pre-tokenizer with prepend_scheme='first' AND split=True
(e.g. facebook/seamless-m4t-v2-large, facebook/nllb-200-distilled-600M).

Bug 1 — hf_parser.py (parse_pre_tokenization_step):
  When parse_metaspace returns [RN_replace_spaces, RN_prepend_first,
  RegexSplitStep] for the split+prepend_first case, the previous code
  blindly did steps.pop() which popped RegexSplitStep (the last item)
  and inserted it at pipeline position 0. This shifted SpecialTokensSplit
  to position 1 where it was never processed, causing the C++ BPE op to
  receive the wrong number of inputs (17 instead of 18).
  Fix: find the last NormalizationStep (RN_prepend_first) instead of
  unconditionally popping the tail.

Bug 2 — tokenizer_pipeline.py (get_tokenizer_ov_subgraph):
  In the is_metaspace_prepend_first branch, the prepend step ran first
  (inserting U+2581 ▁ before the first non-space character), and then
  normalization steps ran — including the Precompiled charsmap that maps
  ▁ back to a regular space. This silently destroyed the prepended prefix,
  causing the first word to be tokenised without the metaspace prefix and
  producing wrong token IDs.
  Fix: run all normalization steps on the flat string *before* the prepend
  step, so the charsmap cannot undo the prepend. The else branch for
  non-prepend-first models is unchanged.

Tested on:
  facebook/seamless-m4t-v2-large  — all token IDs match HuggingFace ✓
  openai-community/gpt2           — regression test passes ✓
  google-bert/bert-base-uncased   — regression test passes ✓
Covers the Metaspace BPE tokenizer with prepend_scheme=first and split=True
combination that was previously broken.