Everything after the closing `---` of the frontmatter is the document
body. DLM's body parser splits it into typed sections using fence
markers of the form `::<type>::`, each on a line by itself.
Any body text that isn't inside an explicit fence is a prose section. Prose trains via continued pretraining — the model learns the writing style + vocabulary but doesn't get "question → answer" pressure.
```
# Heading

Prose paragraphs, markdown code blocks, whatever you'd normally write.

Another paragraph after a blank line stays in the same prose section.
```

Code fences (```) inside prose are preserved; the parser doesn't
interpret `::type::` lines that appear inside a code block.
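The splitting behavior described above can be sketched in a few lines. This is a hypothetical reconstruction, not DLM's actual implementation; the function name and return shape are assumptions. It opens a new typed section at each full-line fence and ignores fences inside triple-backtick code blocks:

```python
import re

# A fence is a full line: lowercase type, optional attributes, closed by "::".
FENCE = re.compile(r"^::([a-z]+)(?:\s+[^:]*)?::\s*$")

def split_body(body: str):
    sections = [("prose", [])]        # the body starts in an implicit prose section
    in_code_fence = False
    for line in body.splitlines():
        if line.strip().startswith("```"):
            in_code_fence = not in_code_fence   # toggle code-fence context
        match = None if in_code_fence else FENCE.match(line.strip())
        if match:
            sections.append((match.group(1), []))   # fence opens a new section
        else:
            sections[-1][1].append(line)            # everything else accumulates
    # Drop the implicit prose section if the body opened with a fence.
    return [(kind, "\n".join(lines).strip())
            for kind, lines in sections if lines or kind != "prose"]
```

The code-fence toggle is the important part: a `::instruction::` line inside a markdown code block is treated as ordinary content, matching the rule stated above.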
Open with `::instruction::` on its own line. Each Q&A pair uses
`### Q` and `### A` as grammar markers.
```
::instruction::
### Q
What is a decorator?
### A
A function that takes a function and returns a new function.
### Q
When should I use functools.wraps?
### A
Always, inside decorators.
```

Trains via supervised fine-tuning (SFT): the model sees the Q text
as the prompt and the A text as the target. This is the pattern that
produces "helpful assistant" behavior.
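The `### Q` / `### A` grammar maps onto SFT rows mechanically. A minimal sketch, with a hypothetical function name and row shape (prompt = Q text, target = A text):

```python
def qa_rows(section_body: str):
    # Walk an instruction section's body; each ### Q / ### A pair becomes one row.
    rows, prompt, answer, mode = [], None, None, None
    for line in section_body.splitlines():
        stripped = line.strip()
        if stripped == "### Q":
            if prompt is not None and answer is not None:
                rows.append({"prompt": prompt.strip(), "target": answer.strip()})
            prompt, answer, mode = "", None, "q"    # start a new pair
        elif stripped == "### A":
            answer, mode = "", "a"
        elif mode == "q":
            prompt += line + "\n"                   # multi-line questions allowed
        elif mode == "a":
            answer += line + "\n"
    if prompt is not None and answer is not None:
        rows.append({"prompt": prompt.strip(), "target": answer.strip()})
    return rows
```

Run against the example section above, this yields two rows, one per Q&A pair.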
`dlm synth instructions` can also write synthesized instruction
sections back into the document. Those keep the same basic body grammar
but add an HTML provenance marker immediately after the fence. See the
instruction section reference for the full
marker shape and validation rules.
Open with `::preference::`. Each record has three blocks:
```
::preference::
### Prompt
Explain recursion to a beginner.
### Chosen
Recursion is when a function calls itself on a smaller piece of the
problem. Imagine matryoshka dolls.
### Rejected
A recursive function is any function that refers to itself in its own
definition using the stack frame protocol.
```

Trains via DPO (direct preference optimization) or ORPO: the
model learns to prefer the Chosen phrasing. The DPO / ORPO trainer
lands in Sprint 17/18.
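The three-block grammar maps onto preference records the same way the Q/A grammar maps onto SFT rows. A hypothetical sketch (function name and dict keys are assumptions, not DLM's API):

```python
def preference_rows(section_body: str):
    # Group ### Prompt / ### Chosen / ### Rejected blocks into DPO-style records.
    markers = {"### Prompt": "prompt", "### Chosen": "chosen", "### Rejected": "rejected"}
    rows, record, key = [], {}, None
    for line in section_body.splitlines():
        stripped = line.strip()
        if stripped in markers:
            if markers[stripped] == "prompt" and record:
                rows.append({k: v.strip() for k, v in record.items()})
                record = {}                  # a new Prompt starts the next record
            key = markers[stripped]
            record[key] = ""
        elif key is not None:
            record[key] += line + "\n"       # blocks may span multiple lines
    if record:
        rows.append({k: v.strip() for k, v in record.items()})
    return rows
```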
Schema v10 adds image sections for vision-language bases. The initial
launch covered PaliGemma; later follow-ups added Qwen2-VL,
InternVL2, and Mistral Small 3.1 registry rows. The fence uses
attribute syntax instead of the bare `::type::` form:
```
::image path="figures/architecture.png" alt="training pipeline diagram"::
Caption text describing the figure.
```

The caption body becomes the "text" part of the training row; the
placeholder expands to the base's image tokens at collate time.

Required attributes: `path` (the image file, resolved relative to the
`.dlm`'s parent dir). Optional: `alt` (a short description; defaults to
the filename stem on directive-ingested images).
Supported extensions: `.png`, `.jpg`, `.jpeg`, `.webp`, `.gif`,
`.bmp`, `.tiff`. Other binary types (PDF, archives) stay out of the
training corpus by default.
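The attribute rules above (required `path` resolved next to the `.dlm`, `alt` defaulting to the filename stem, extension allow-list) can be sketched as follows. Function name and error handling are assumptions:

```python
import re
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}
ATTR = re.compile(r'(\w+)="([^"]*)"')   # key="value" pairs on the fence line

def parse_image_fence(fence_line: str, dlm_file: str):
    attrs = dict(ATTR.findall(fence_line))
    image = Path(dlm_file).parent / attrs["path"]   # path is required
    if image.suffix.lower() not in IMAGE_EXTS:
        raise ValueError(f"unsupported image type: {image.suffix}")
    # alt falls back to the filename stem when the attribute is absent.
    return {"path": str(image), "alt": attrs.get("alt", image.stem)}
```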
Content hash. Image sections hash on `(type, path, blob_sha)`
rather than the body text. Two identical-bytes images at different
paths produce different `section_id`s; paths carry meaning. Changing
the blob bytes flips the ID even if the path didn't move.
Directive ingest. `training.sources` directives with image
extensions in their include globs ingest automatically:

```yaml
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
```

Each discovered image becomes an `::image::` section with
`alt=<filename-stem>` and flows through the same row-emission path.
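The discovery step can be sketched like this: expand each include glob, keep image extensions, and emit one `::image::` fence per file with `alt` defaulting to the stem. The function name and the decision to return fence strings are assumptions:

```python
from pathlib import Path

IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}

def ingest_images(source_dir: str, include_globs):
    fences = []
    for pattern in include_globs:
        for p in sorted(Path(source_dir).glob(pattern)):   # sorted for stable output
            if p.suffix.lower() in IMAGE_EXTS:
                fences.append(f'::image path="{p}" alt="{p.stem}"::')
    return fences
```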
Current InternVL caveat. InternVL-family rows stay visible in the
registry for planning and future work, but the current runtime still
needs a custom processor/collator path for their `<image>` expansion
and `image_flags` contract. See the multi-modal training
cookbook and VL memory
guide before picking `internvl2-2b`.
Base-model requirements. Only vision-language bases accept image
sections at training time. `dlm init --multimodal` scaffolds a VL
doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi)
refuse image sections at train start with a pointer to `--multimodal`.
- A fence must be the full line: `::instruction::` with no leading/trailing content other than whitespace.
- Fences inside triple-backtick code blocks are not active; the parser is aware of the code-fence context.
- An unfenced heading (`# ...`, `## ...`) inside an open instruction or preference section does not close the section. Close with the next section fence or end-of-file.
- Section type is case-sensitive; `::Instruction::` is rejected.
- Sprint 20 introduces a `::type#adapter-name::` suffix for multi-adapter routing; the v1 parser accepts the suffix but ignores the `#...` tail.
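The rules above pin down the fence grammar tightly enough to express as a single regex. A sketch (names hypothetical) covering full-line matching, case sensitivity, and the Sprint 20 suffix that v1 accepts but ignores:

```python
import re

# Lowercase type only; optional #adapter-name suffix; nothing else on the line.
FENCE = re.compile(r"^::([a-z]+)(#[A-Za-z0-9_-]+)?::$")

def fence_type(line: str):
    match = FENCE.match(line.strip())   # surrounding whitespace is tolerated
    return match.group(1) if match else None
```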
Every section gets a content-addressed ID: the first 16 hex chars of
the SHA-256 of the section's canonical text. The manifest's
`content_hashes` records these IDs and their types so the next `dlm train`
can compute what's new, unchanged, or removed (Sprint 08's delta system).
You don't write these IDs in the document; they're derived and live
only in the manifest. But if you're debugging "why isn't this section
being picked up as new?", the ID in `dlm show --json` is the answer.
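Because the IDs are content-addressed, the delta computation reduces to set arithmetic over ID keys. A hypothetical sketch, assuming `content_hashes` maps `section_id` to section type:

```python
def compute_delta(manifest_hashes: dict, current_hashes: dict) -> dict:
    # Compare the manifest's content_hashes against the freshly parsed document.
    old, new = set(manifest_hashes), set(current_hashes)
    return {
        "new": sorted(new - old),        # sections to train on
        "removed": sorted(old - new),    # sections gone from the doc
        "unchanged": sorted(old & new),  # nothing to do
    }
```

Any edit to a section's text changes its ID, so an edited section shows up as one removal plus one addition rather than a mutation.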
- API keys, personal data, anything you wouldn't want baked into a model you'll share. The adapter learns from everything in the file.
- JSON / YAML config that the model should emit literally — use instruction Q&A pairs instead. Training on raw config produces noisy generation.
- Massive code dumps (>200 KB). The replay corpus retains everything, and `sequence_len` is bounded at 32 KB; a single enormous section trains one step and wastes the remaining token budget.
- Instruction section reference
- Preference section reference
- First train walkthrough
- Cookbook: coding tutor — full example of instruction-heavy authoring