Require explicit eos_token in MegatronDocumentTokenizer (and fail clearly on an EOS missing from the vocab) by maxsloef-goodfire · Pull Request #489 · huggingface/datatrove

maxsloef-goodfire · 2026-06-03T23:11:55Z

Summary

MegatronDocumentTokenizer defaults eos_token to gpt2's "<|endoftext|>" and PipelineStepWithTokenizer then applies it to any tokenizer with no validation. Because tokenizer_name_or_path and eos_token are independent defaults, the natural action — set the tokenizer, leave the EOS default — silently mismatches them. The EOS here is not cosmetic: it's the document separator Megatron relies on (GPTDataset scans the token stream for tokenizer.eod to reset position ids / attention masks at document boundaries), so a wrong delimiter silently corrupts training data.

This produces two failure modes, both reproduced below with real modern tokenizers.

Evidence

Encoding "hello world" and inspecting the appended delimiter, using the default eos_token:

tokenizer	real EOS	`<\|endoftext\|>` in vocab?	default `eos_token` result
gpt2 / GPT-NeoX-20B	`<\|endoftext\|>`	yes	✅ correct (by coincidence)
Qwen2.5-0.5B (base)	`<\|endoftext\|>` (151643)	yes	✅ correct (by coincidence)
Llama-3.1-8B	`<\|end_of_text\|>`	no	💥 `TypeError: argument 'special_tokens': Expected Union[Tuple[str, int], ...]`
Mistral-7B-v0.1	`</s>`	no	💥 same cryptic `TypeError`
Gemma-2-2B	`<eos>`	no	💥 same cryptic `TypeError`
Qwen2.5-0.5B-Instruct	`<\|im_end\|>` (151645)	yes (151643)	🔇 silently appends 151643 (`<\|endoftext\|>`), not the tokenizer's EOS
SmolLM2-1.7B-Instruct	`<\|im_end\|>` (2)	yes (0)	🔇 silently appends 0 (`<\|endoftext\|>`), not the tokenizer's EOS

Root cause, in utils/tokenization.py:

special_tokens=[("<EOS>", tokenizer.token_to_id(self.eos_token))]

token_to_id returns None for a missing string (→ the opaque TypeError above), and returns a valid-but-wrong id when the default string happens to exist as a non-EOS token (→ the silent case).

Changes

Tier 1 — PipelineStepWithTokenizer (utils/tokenization.py): validate that eos_token resolves to a real id; raise a clear ValueError (eos_token '<...>' is not in the vocabulary of tokenizer '<...>') instead of letting None flow into TemplateProcessing. Benefits every tokenizer pipeline step.

Tier 2 — MegatronDocumentTokenizer (pipeline/tokens/megatron_tokenizer.py): eos_token is now required in practice — it defaults to None and the constructor raises a clear ValueError if it's omitted — so it can no longer be silently mismatched to the tokenizer.

Design note: why `eos_token=None` + an explicit check, rather than a "real" required arg

This keeps eos_token as a positional-or-keyword parameter with a None default and validates it at runtime ("fake-optional"), instead of making it genuinely required, specifically to avoid changing the calling convention. The cleaner-looking alternatives both change how the class can be called:

Keyword-only (*, eos_token: str) would also force batch_size/upload_block_size to be keyword-only (everything after *), breaking any positional callers of those.
Relocating it to a required positional (e.g. 2nd arg) would silently change the meaning of existing positional args — e.g. a MegatronDocumentTokenizer(folder, "myfile") call intending save_filename would pass "myfile" as eos_token. That's a silent breakage, the worst outcome.

The None-default-plus-raise form leaves every existing call signature working while still failing loudly (with a helpful message) when eos_token is missing. Happy to switch to the keyword-only form if maintainers prefer the cleaner signature and are OK tightening batch_size/upload_block_size to keyword-only too.

Behavior change

Calling convention is unchanged. The behavior change is that omitting eos_token now raises instead of silently using gpt2's <|endoftext|>; callers that relied on the implicit default must pass it explicitly. (Tier 1 alone is non-breaking; Tier 2 is the behavioral change — happy to split if maintainers prefer to land Tier 1 first.)

Tests

Adds TestMegatronTokenizerEos (eos required; eos-not-in-vocab raises a clear error; valid eos still appends correctly). Existing test_tokenization.py passes unchanged — those tests already pass eos_token explicitly. ruff format + ruff check clean.

🤖 Generated with Claude Code

Note

Medium Risk
Breaking for callers that relied on the implicit gpt2 EOS default; changes affect Megatron document delimiters used in training data, though failures are now loud instead of silent corruption.

Overview
Megatron tokenization no longer silently defaults eos_token to gpt2’s <|endoftext|>. MegatronDocumentTokenizer now raises if eos_token is omitted, so document-boundary delimiters must be chosen explicitly to match the training stack.

Shared tokenizer setup in PipelineStepWithTokenizer checks that eos_token exists in the loaded vocabulary before building the post-processor, replacing cryptic TypeErrors (or wrong-token appends) with a clear ValueError.

Regression tests cover required EOS, invalid EOS strings, and correct EOS appends.

^{Reviewed by Cursor Bugbot for commit 393485d. Bugbot is set up for automated code reviews on this repo. Configure here.}

MegatronDocumentTokenizer defaulted eos_token to gpt2's "<|endoftext|>" and applied it to any tokenizer with no validation, producing two failure modes: - Cryptic crash: tokenizers whose vocab lacks "<|endoftext|>" (Llama 3, Mistral, Gemma) raise an opaque TypeError from the tokenizers post-processor ("argument 'special_tokens': Expected Union[Tuple[str, int], ...]"). - Silent wrong delimiter: instruct tokenizers that contain "<|endoftext|>" as a non-EOS token (e.g. Qwen2.5-Instruct, whose declared EOS is "<|im_end|>"=151645) silently get "<|endoftext|>"=151643 appended as the document separator. Megatron treats this token as the document boundary (GPTDataset scans the token stream for tokenizer.eod to reset position ids / attention masks), so a wrong or mismatched delimiter silently corrupts training data. Tier 1 (utils/tokenization.py): PipelineStepWithTokenizer raises a clear ValueError when eos_token is not in the tokenizer vocab, instead of passing None into TemplateProcessing. Tier 2 (pipeline/tokens/megatron_tokenizer.py): eos_token is now required -- it keeps a None default and the constructor raises a ValueError if omitted, so it can no longer be silently mismatched to the tokenizer. The None-default form is used (rather than a keyword-only or required-positional arg) to avoid changing the calling convention. Adds regression tests; existing tests pass unchanged (they already pass eos_token explicitly). Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

maxsloef-goodfire force-pushed the fix/megatron-tokenizer-eos-required branch 2 times, most recently from 01814a3 to 4b0548f Compare June 3, 2026 23:19

maxsloef-goodfire force-pushed the fix/megatron-tokenizer-eos-required branch from 4b0548f to 393485d Compare June 3, 2026 23:24

maxsloef-goodfire marked this pull request as ready for review June 3, 2026 23:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Require explicit eos_token in MegatronDocumentTokenizer (and fail clearly on an EOS missing from the vocab)#489

Require explicit eos_token in MegatronDocumentTokenizer (and fail clearly on an EOS missing from the vocab)#489
maxsloef-goodfire wants to merge 1 commit into
huggingface:mainfrom
maxsloef-goodfire:fix/megatron-tokenizer-eos-required

maxsloef-goodfire commented Jun 3, 2026 •

edited by cursor Bot

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

maxsloef-goodfire commented Jun 3, 2026 • edited by cursor Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Evidence

Changes

Design note: why eos_token=None + an explicit check, rather than a "real" required arg

Behavior change

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

maxsloef-goodfire commented Jun 3, 2026 •

edited by cursor Bot

Loading

Design note: why `eos_token=None` + an explicit check, rather than a "real" required arg