Require explicit eos_token in MegatronDocumentTokenizer (and fail clearly on an EOS missing from the vocab)#489
Open
maxsloef-goodfire wants to merge 1 commit into
Conversation
01814a3 to
4b0548f
Compare
MegatronDocumentTokenizer defaulted eos_token to gpt2's "<|endoftext|>" and
applied it to any tokenizer with no validation, producing two failure modes:
- Cryptic crash: tokenizers whose vocab lacks "<|endoftext|>" (Llama 3,
Mistral, Gemma) raise an opaque TypeError from the tokenizers post-processor
("argument 'special_tokens': Expected Union[Tuple[str, int], ...]").
- Silent wrong delimiter: instruct tokenizers that contain "<|endoftext|>" as a
non-EOS token (e.g. Qwen2.5-Instruct, whose declared EOS is "<|im_end|>"=151645)
silently get "<|endoftext|>"=151643 appended as the document separator.
Megatron treats this token as the document boundary (GPTDataset scans the token
stream for tokenizer.eod to reset position ids / attention masks), so a wrong or
mismatched delimiter silently corrupts training data.
Tier 1 (utils/tokenization.py): PipelineStepWithTokenizer raises a clear
ValueError when eos_token is not in the tokenizer vocab, instead of passing
None into TemplateProcessing.
Tier 2 (pipeline/tokens/megatron_tokenizer.py): eos_token is now required -- it
keeps a None default and the constructor raises a ValueError if omitted, so it
can no longer be silently mismatched to the tokenizer. The None-default form is
used (rather than a keyword-only or required-positional arg) to avoid changing
the calling convention.
Adds regression tests; existing tests pass unchanged (they already pass
eos_token explicitly).
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4b0548f to
393485d
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
MegatronDocumentTokenizerdefaultseos_tokento gpt2's"<|endoftext|>"andPipelineStepWithTokenizerthen applies it to any tokenizer with no validation. Becausetokenizer_name_or_pathandeos_tokenare independent defaults, the natural action — set the tokenizer, leave the EOS default — silently mismatches them. The EOS here is not cosmetic: it's the document separator Megatron relies on (GPTDatasetscans the token stream fortokenizer.eodto reset position ids / attention masks at document boundaries), so a wrong delimiter silently corrupts training data.This produces two failure modes, both reproduced below with real modern tokenizers.
Evidence
Encoding
"hello world"and inspecting the appended delimiter, using the defaulteos_token:<|endoftext|>in vocab?eos_tokenresult<|endoftext|><|endoftext|>(151643)<|end_of_text|>TypeError: argument 'special_tokens': Expected Union[Tuple[str, int], ...]</s>TypeError<eos>TypeError<|im_end|>(151645)<|endoftext|>), not the tokenizer's EOS<|im_end|>(2)<|endoftext|>), not the tokenizer's EOSRoot cause, in
utils/tokenization.py:token_to_idreturnsNonefor a missing string (→ the opaqueTypeErrorabove), and returns a valid-but-wrong id when the default string happens to exist as a non-EOS token (→ the silent case).Changes
Tier 1 —
PipelineStepWithTokenizer(utils/tokenization.py): validate thateos_tokenresolves to a real id; raise a clearValueError(eos_token '<...>' is not in the vocabulary of tokenizer '<...>') instead of lettingNoneflow intoTemplateProcessing. Benefits every tokenizer pipeline step.Tier 2 —
MegatronDocumentTokenizer(pipeline/tokens/megatron_tokenizer.py):eos_tokenis now required in practice — it defaults toNoneand the constructor raises a clearValueErrorif it's omitted — so it can no longer be silently mismatched to the tokenizer.Design note: why
eos_token=None+ an explicit check, rather than a "real" required argThis keeps
eos_tokenas a positional-or-keyword parameter with aNonedefault and validates it at runtime ("fake-optional"), instead of making it genuinely required, specifically to avoid changing the calling convention. The cleaner-looking alternatives both change how the class can be called:*, eos_token: str) would also forcebatch_size/upload_block_sizeto be keyword-only (everything after*), breaking any positional callers of those.MegatronDocumentTokenizer(folder, "myfile")call intendingsave_filenamewould pass"myfile"aseos_token. That's a silent breakage, the worst outcome.The
None-default-plus-raise form leaves every existing call signature working while still failing loudly (with a helpful message) wheneos_tokenis missing. Happy to switch to the keyword-only form if maintainers prefer the cleaner signature and are OK tighteningbatch_size/upload_block_sizeto keyword-only too.Behavior change
Calling convention is unchanged. The behavior change is that omitting
eos_tokennow raises instead of silently using gpt2's<|endoftext|>; callers that relied on the implicit default must pass it explicitly. (Tier 1 alone is non-breaking; Tier 2 is the behavioral change — happy to split if maintainers prefer to land Tier 1 first.)Tests
Adds
TestMegatronTokenizerEos(eos required; eos-not-in-vocab raises a clear error; valid eos still appends correctly). Existingtest_tokenization.pypasses unchanged — those tests already passeos_tokenexplicitly.ruff format+ruff checkclean.🤖 Generated with Claude Code
Note
Medium Risk
Breaking for callers that relied on the implicit gpt2 EOS default; changes affect Megatron document delimiters used in training data, though failures are now loud instead of silent corruption.
Overview
Megatron tokenization no longer silently defaults
eos_tokento gpt2’s<|endoftext|>.MegatronDocumentTokenizernow raises ifeos_tokenis omitted, so document-boundary delimiters must be chosen explicitly to match the training stack.Shared tokenizer setup in
PipelineStepWithTokenizerchecks thateos_tokenexists in the loaded vocabulary before building the post-processor, replacing crypticTypeErrors (or wrong-token appends) with a clearValueError.Regression tests cover required EOS, invalid EOS strings, and correct EOS appends.
Reviewed by Cursor Bugbot for commit 393485d. Bugbot is set up for automated code reviews on this repo. Configure here.