Skip to content

Require explicit eos_token in MegatronDocumentTokenizer (and fail clearly on an EOS missing from the vocab)#489

Open
maxsloef-goodfire wants to merge 1 commit into
huggingface:mainfrom
maxsloef-goodfire:fix/megatron-tokenizer-eos-required
Open

Require explicit eos_token in MegatronDocumentTokenizer (and fail clearly on an EOS missing from the vocab)#489
maxsloef-goodfire wants to merge 1 commit into
huggingface:mainfrom
maxsloef-goodfire:fix/megatron-tokenizer-eos-required

Conversation

@maxsloef-goodfire

@maxsloef-goodfire maxsloef-goodfire commented Jun 3, 2026

Copy link
Copy Markdown

Summary

MegatronDocumentTokenizer defaults eos_token to gpt2's "<|endoftext|>" and PipelineStepWithTokenizer then applies it to any tokenizer with no validation. Because tokenizer_name_or_path and eos_token are independent defaults, the natural action — set the tokenizer, leave the EOS default — silently mismatches them. The EOS here is not cosmetic: it's the document separator Megatron relies on (GPTDataset scans the token stream for tokenizer.eod to reset position ids / attention masks at document boundaries), so a wrong delimiter silently corrupts training data.

This produces two failure modes, both reproduced below with real modern tokenizers.

Evidence

Encoding "hello world" and inspecting the appended delimiter, using the default eos_token:

tokenizer real EOS <|endoftext|> in vocab? default eos_token result
gpt2 / GPT-NeoX-20B <|endoftext|> yes ✅ correct (by coincidence)
Qwen2.5-0.5B (base) <|endoftext|> (151643) yes ✅ correct (by coincidence)
Llama-3.1-8B <|end_of_text|> no 💥 TypeError: argument 'special_tokens': Expected Union[Tuple[str, int], ...]
Mistral-7B-v0.1 </s> no 💥 same cryptic TypeError
Gemma-2-2B <eos> no 💥 same cryptic TypeError
Qwen2.5-0.5B-Instruct <|im_end|> (151645) yes (151643) 🔇 silently appends 151643 (<|endoftext|>), not the tokenizer's EOS
SmolLM2-1.7B-Instruct <|im_end|> (2) yes (0) 🔇 silently appends 0 (<|endoftext|>), not the tokenizer's EOS

Root cause, in utils/tokenization.py:

special_tokens=[("<EOS>", tokenizer.token_to_id(self.eos_token))]

token_to_id returns None for a missing string (→ the opaque TypeError above), and returns a valid-but-wrong id when the default string happens to exist as a non-EOS token (→ the silent case).

Changes

Tier 1 — PipelineStepWithTokenizer (utils/tokenization.py): validate that eos_token resolves to a real id; raise a clear ValueError (eos_token '<...>' is not in the vocabulary of tokenizer '<...>') instead of letting None flow into TemplateProcessing. Benefits every tokenizer pipeline step.

Tier 2 — MegatronDocumentTokenizer (pipeline/tokens/megatron_tokenizer.py): eos_token is now required in practice — it defaults to None and the constructor raises a clear ValueError if it's omitted — so it can no longer be silently mismatched to the tokenizer.

Design note: why eos_token=None + an explicit check, rather than a "real" required arg

This keeps eos_token as a positional-or-keyword parameter with a None default and validates it at runtime ("fake-optional"), instead of making it genuinely required, specifically to avoid changing the calling convention. The cleaner-looking alternatives both change how the class can be called:

  • Keyword-only (*, eos_token: str) would also force batch_size/upload_block_size to be keyword-only (everything after *), breaking any positional callers of those.
  • Relocating it to a required positional (e.g. 2nd arg) would silently change the meaning of existing positional args — e.g. a MegatronDocumentTokenizer(folder, "myfile") call intending save_filename would pass "myfile" as eos_token. That's a silent breakage, the worst outcome.

The None-default-plus-raise form leaves every existing call signature working while still failing loudly (with a helpful message) when eos_token is missing. Happy to switch to the keyword-only form if maintainers prefer the cleaner signature and are OK tightening batch_size/upload_block_size to keyword-only too.

Behavior change

Calling convention is unchanged. The behavior change is that omitting eos_token now raises instead of silently using gpt2's <|endoftext|>; callers that relied on the implicit default must pass it explicitly. (Tier 1 alone is non-breaking; Tier 2 is the behavioral change — happy to split if maintainers prefer to land Tier 1 first.)

Tests

Adds TestMegatronTokenizerEos (eos required; eos-not-in-vocab raises a clear error; valid eos still appends correctly). Existing test_tokenization.py passes unchanged — those tests already pass eos_token explicitly. ruff format + ruff check clean.

🤖 Generated with Claude Code


Note

Medium Risk
Breaking for callers that relied on the implicit gpt2 EOS default; changes affect Megatron document delimiters used in training data, though failures are now loud instead of silent corruption.

Overview
Megatron tokenization no longer silently defaults eos_token to gpt2’s <|endoftext|>. MegatronDocumentTokenizer now raises if eos_token is omitted, so document-boundary delimiters must be chosen explicitly to match the training stack.

Shared tokenizer setup in PipelineStepWithTokenizer checks that eos_token exists in the loaded vocabulary before building the post-processor, replacing cryptic TypeErrors (or wrong-token appends) with a clear ValueError.

Regression tests cover required EOS, invalid EOS strings, and correct EOS appends.

Reviewed by Cursor Bugbot for commit 393485d. Bugbot is set up for automated code reviews on this repo. Configure here.

@maxsloef-goodfire maxsloef-goodfire force-pushed the fix/megatron-tokenizer-eos-required branch 2 times, most recently from 01814a3 to 4b0548f Compare June 3, 2026 23:19
MegatronDocumentTokenizer defaulted eos_token to gpt2's "<|endoftext|>" and
applied it to any tokenizer with no validation, producing two failure modes:

- Cryptic crash: tokenizers whose vocab lacks "<|endoftext|>" (Llama 3,
  Mistral, Gemma) raise an opaque TypeError from the tokenizers post-processor
  ("argument 'special_tokens': Expected Union[Tuple[str, int], ...]").
- Silent wrong delimiter: instruct tokenizers that contain "<|endoftext|>" as a
  non-EOS token (e.g. Qwen2.5-Instruct, whose declared EOS is "<|im_end|>"=151645)
  silently get "<|endoftext|>"=151643 appended as the document separator.

Megatron treats this token as the document boundary (GPTDataset scans the token
stream for tokenizer.eod to reset position ids / attention masks), so a wrong or
mismatched delimiter silently corrupts training data.

Tier 1 (utils/tokenization.py): PipelineStepWithTokenizer raises a clear
ValueError when eos_token is not in the tokenizer vocab, instead of passing
None into TemplateProcessing.

Tier 2 (pipeline/tokens/megatron_tokenizer.py): eos_token is now required -- it
keeps a None default and the constructor raises a ValueError if omitted, so it
can no longer be silently mismatched to the tokenizer. The None-default form is
used (rather than a keyword-only or required-positional arg) to avoid changing
the calling convention.

Adds regression tests; existing tests pass unchanged (they already pass
eos_token explicitly).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@maxsloef-goodfire maxsloef-goodfire force-pushed the fix/megatron-tokenizer-eos-required branch from 4b0548f to 393485d Compare June 3, 2026 23:24
@maxsloef-goodfire maxsloef-goodfire marked this pull request as ready for review June 3, 2026 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant