[EXPERIMENT][WIP] fix(metaspace): correct pipeline order for prepend_scheme=first with split=True#658
Draft
mlukasze wants to merge 2 commits into
Draft
Conversation
…split=True Two bugs prevented conversion and correct tokenization of models that use a Metaspace pre-tokenizer with prepend_scheme='first' AND split=True (e.g. facebook/seamless-m4t-v2-large, facebook/nllb-200-distilled-600M). Bug 1 — hf_parser.py (parse_pre_tokenization_step): When parse_metaspace returns [RN_replace_spaces, RN_prepend_first, RegexSplitStep] for the split+prepend_first case, the previous code blindly did steps.pop() which popped RegexSplitStep (the last item) and inserted it at pipeline position 0. This shifted SpecialTokensSplit to position 1 where it was never processed, causing the C++ BPE op to receive the wrong number of inputs (17 instead of 18). Fix: find the last NormalizationStep (RN_prepend_first) instead of unconditionally popping the tail. Bug 2 — tokenizer_pipeline.py (get_tokenizer_ov_subgraph): In the is_metaspace_prepend_first branch, the prepend step ran first (inserting U+2581 ▁ before the first non-space character), and then normalization steps ran — including the Precompiled charsmap that maps ▁ back to a regular space. This silently destroyed the prepended prefix, causing the first word to be tokenised without the metaspace prefix and producing wrong token IDs. Fix: run all normalization steps on the flat string *before* the prepend step, so the charsmap cannot undo the prepend. The else branch for non-prepend-first models is unchanged. Tested on: facebook/seamless-m4t-v2-large — all token IDs match HuggingFace ✓ openai-community/gpt2 — regression test passes ✓ google-bert/bert-base-uncased — regression test passes ✓
Covers the Metaspace BPE tokenizer with prepend_scheme=first and split=True combination that was previously broken.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Two bugs prevented correct conversion of tokenizers that use a Metaspace pre-tokenizer with
prepend_scheme="first"ANDsplit=True— specificallySeamlessM4TTokenizerFast(used byfacebook/seamless-m4t-v2-large,facebook/nllb-200-distilled-600M, etc.).Bug 1 —
hf_parser.py: wrong step inserted at pipeline position 0Symptom: C++
BPETokenizerraisedIncorrect number of inputs(received 17, expected 18).Root cause: In
parse_pre_tokenization_step,parse_metaspacereturns[RN_replace_spaces, RN_prepend_first, RegexSplitStep]for thesplit=True + prepend_scheme=firstcase. The previous code calledsteps.pop(), which popped the last element (RegexSplitStepinstead ofRN_prepend_first) and inserted it at pipeline position 0. This displacedSpecialTokensSplitfrom position 0, so it was never processed, and the skip tensor was never produced — causing the BPE op to receive the wrong number of inputs.Fix: Use
next(i for ... if isinstance(steps[i], NormalizationStep))to find the lastNormalizationStep(i.e.,RN_prepend_first) instead of blindly popping the tail.Bug 2 —
tokenizer_pipeline.py: Charsmap normalizer destroys the prepended ▁Symptom: Token IDs were wrong — the first word was tokenized without its metaspace prefix (e.g.
"Hello"instead of"▁Hello"), producing 3 tokens instead of 2 for"Hello world".Root cause: In
get_tokenizer_ov_subgraph, theis_metaspace_prepend_firstbranch ran the prepend step before normalization. ThePrecompiledcharsmap inSeamlessM4T's normalizer maps U+2581 (▁) back to a regular space — so it silently undid the prepend.Fix: In the
is_metaspace_prepend_firstbranch, run allnormalization_stepson the flat string before the prepend step. The non-prepend path is unchanged.Correct execution order (after fix)
Test results
facebook/seamless-m4t-v2-largeopenai-community/gpt2(regression)google-bert/bert-base-uncased(regression)Files changed
python/openvino_tokenizers/hf_parser.py— Fix 1: correct step selection for metaspace+splitpython/openvino_tokenizers/tokenizer_pipeline.py— Fix 2: run normalizers before prependtests/tokenizers_test.py— addfacebook/seamless-m4t-v2-largetobpe_modelstest listFixes:
SeamlessM4TTokenizerFast,NllbTokenizerFastand any model using HF fast tokenizer with Metaspaceprepend_scheme=first+split=True.