
# Section grammar

Everything after the closing `---` of the frontmatter is the document body. DLM's body parser splits it into typed sections using fence markers of the form `::<type>::`, each on a line by itself.

## Section types

### Prose (default)

Any body text that isn't inside an explicit fence is a prose section. Prose trains via continued pretraining: the model learns the writing style and vocabulary but gets no "question → answer" pressure.

```markdown
# Heading

Prose paragraphs, markdown code blocks, whatever you'd normally write.

Another paragraph after a blank line stays in the same prose section.
```

Code fences (```) inside prose are preserved; the parser doesn't interpret ::type:: lines that appear inside a code block.

### Instruction (`::instruction::`)

Open with `::instruction::` on its own line. Each Q&A pair uses `### Q` and `### A` as grammar markers.

```
::instruction::
### Q
What is a decorator?

### A
A function that takes a function and returns a new function.

### Q
When should I use functools.wraps?

### A
Always, inside decorators.
```

Trains via supervised fine-tuning (SFT): the model sees Q text as the prompt, A text as the target. This is the pattern that produces "helpful assistant" behavior.
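The prompt/target split this implies can be sketched in a few lines. This is illustrative only, not DLM's actual parser; the helper name and the assumption of well-formed alternating markers are both hypothetical:

```python
import re

def split_qa_pairs(body: str) -> list[tuple[str, str]]:
    """Split an instruction-section body into (prompt, target) pairs.

    Sketch only: assumes well-formed, alternating ### Q / ### A blocks.
    """
    # Split on full-line grammar markers, keeping each marker via the group.
    parts = re.split(r"^### (Q|A)[ \t]*$", body, flags=re.MULTILINE)
    # parts = [preamble, "Q", q_text, "A", a_text, "Q", ...]
    pairs, pending_q = [], None
    for marker, text in zip(parts[1::2], parts[2::2]):
        if marker == "Q":
            pending_q = text.strip()
        elif pending_q is not None:
            pairs.append((pending_q, text.strip()))  # one SFT row
            pending_q = None
    return pairs
```

Each returned pair maps directly onto one SFT training row: the Q text becomes the prompt, the A text the target.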

dlm synth instructions can also write synthesized instruction sections back into the document. Those keep the same basic body grammar but add an HTML provenance marker immediately after the fence. See the instruction section reference for the full marker shape and validation rules.

### Preference (`::preference::`)

Open with `::preference::`. Each record has three blocks:

```
::preference::
### Prompt
Explain recursion to a beginner.

### Chosen
Recursion is when a function calls itself on a smaller piece of the
problem. Imagine matryoshka dolls.

### Rejected
A recursive function is any function that refers to itself in its own
definition using the stack frame protocol.
```

Trains via DPO (direct preference optimization) or ORPO — the model learns to prefer the Chosen phrasing. The DPO / ORPO trainer lands in Sprint 17/18.

### Image (`::image path="..." alt="..."::`)

Schema v10 adds image sections for vision-language bases. The initial launch covered PaliGemma; later follow-ups added Qwen2-VL, InternVL2, and Mistral Small 3.1 registry rows. The fence uses attribute syntax instead of the bare ::type:: form:

```
::image path="figures/architecture.png" alt="training pipeline diagram"::
Caption text describing the figure. The caption body becomes the "text"
part of the training row; the placeholder expands to the base's image
tokens at collate time.
```

Required attributes: `path` (the image file, resolved relative to the .dlm's parent dir). Optional: `alt` (short description; defaults to the filename stem on directive-ingested images).

Supported extensions. .png, .jpg, .jpeg, .webp, .gif, .bmp, .tiff. Other binary types (PDF, archives) stay out of the training corpus by default.

Content hash. Image sections hash on (type, path, blob_sha) rather than the body text. Two identical-bytes images at different paths produce different section_ids — paths carry meaning. Changing the blob bytes flips the ID even if the path didn't move.
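The hashing scheme described above can be sketched as follows. The function name and the `\x00`-joined canonical serialization are assumptions for illustration; only the hashed fields and the 16-hex-char truncation come from this page:

```python
import hashlib

def image_section_id(section_type: str, path: str, blob_sha: str) -> str:
    """Content-addressed ID for an image section.

    Image sections hash (type, path, blob_sha) rather than body text.
    The exact canonical serialization here is an assumption.
    """
    canonical = "\x00".join((section_type, path, blob_sha))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

# Identical bytes at different paths -> different IDs: paths carry meaning.
blob = hashlib.sha256(b"identical image bytes").hexdigest()
a = image_section_id("image", "figures/architecture.png", blob)
b = image_section_id("image", "figures/copy.png", blob)
```

Changing either the path or the blob bytes flips the ID, which is exactly the behavior the delta system relies on.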

Directive ingest. training.sources directives with image extensions in their include globs ingest automatically:

```yaml
training:
  sources:
    - path: ./paper-figures
      include: ["**/*.png", "**/*.jpg"]
```

Each discovered image becomes an ::image:: section with alt=<filename-stem> and flows through the same row-emission path.
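The glob expansion step can be sketched like this. The helper is hypothetical and emits only the fence lines; real ingest also records blob hashes and emits training rows:

```python
from pathlib import Path

# Extension allowlist as documented above; other binary types are skipped.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".webp", ".gif", ".bmp", ".tiff"}

def ingest_images(source_dir: Path, include: list[str]) -> list[str]:
    """Expand include globs into ::image:: fence lines (sketch only)."""
    fences = []
    for pattern in include:
        for p in sorted(source_dir.glob(pattern)):
            if p.suffix.lower() not in IMAGE_EXTS:
                continue  # e.g. PDFs stay out of the corpus by default
            rel = p.relative_to(source_dir).as_posix()
            # alt defaults to the filename stem on directive-ingested images
            fences.append(f'::image path="{rel}" alt="{p.stem}"::')
    return fences
```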

Current InternVL caveat. InternVL-family rows stay visible in the registry for planning and future work, but the current runtime still needs a custom processor/collator path for their <image> expansion and image_flags contract. See the multi-modal training cookbook and VL memory guide before picking internvl2-2b.

Base-model requirements. Only vision-language bases accept image sections at training time. dlm init --multimodal scaffolds a VL doc pinned to PaliGemma. Text-only bases (Qwen, Llama, SmolLM, Phi) refuse image sections at train start with a pointer to --multimodal.

## Fence rules

- A fence must be the full line: `::instruction::` with no leading/trailing content other than whitespace.
- Fences inside triple-backtick code blocks are not active; the parser tracks code-fence context.
- An unfenced heading (`# ...`, `## ...`) inside an open instruction or preference section does not close the section. Close with the next section fence or end-of-file.
- Section type is case-sensitive; `::Instruction::` is rejected.
- Sprint 20 introduces a `::type#adapter-name::` suffix for multi-adapter routing; the v1 parser accepts the suffix but ignores the `#...` tail.
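These rules amount to a small line scanner. A minimal sketch, handling only the bare `::instruction::` / `::preference::` forms (the attribute-carrying image fence is omitted for brevity, and the regex and function are illustrative, not DLM's implementation):

```python
import re

# Full-line match, case-sensitive type, optional (ignored) #adapter tail.
FENCE_RE = re.compile(r"^::(instruction|preference)(?:#[\w-]+)?::$")

def scan_fences(lines: list[str]) -> list[tuple[int, str]]:
    """Return (line_no, type) for every active fence line."""
    fences, in_code = [], False
    for i, line in enumerate(lines, 1):
        if line.strip().startswith("```"):
            in_code = not in_code           # toggle code-fence context
            continue
        if in_code:
            continue                        # fences in code blocks are inert
        m = FENCE_RE.match(line.strip())    # whitespace-only padding is fine
        if m:
            fences.append((i, m.group(1)))  # '#...' tail matched but dropped
    return fences
```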

## Section IDs

Every section gets a content-addressed ID: the first 16 hex chars of the SHA-256 of the section's canonical text. The manifest's `content_hashes` records these IDs and their types so the next `dlm train` can compute what's new, unchanged, or removed (Sprint 08's delta system).

You don't write these IDs in the document — they're derived and live only in the manifest. But if you're debugging "why isn't this section being picked up as new?", the ID in `dlm show --json` is the answer.
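The derivation itself is simple enough to show. DLM's canonicalization of section text isn't specified here, so this sketch only demonstrates the hashing shape, with a hypothetical helper name:

```python
import hashlib

def section_id(canonical_text: str) -> str:
    """First 16 hex chars of SHA-256 over a section's canonical text."""
    return hashlib.sha256(canonical_text.encode("utf-8")).hexdigest()[:16]

# Any edit to the canonical text flips the ID, which is how delta
# training tells changed sections from unchanged ones.
old = section_id("What is a decorator?")
new = section_id("What is a Python decorator?")
```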

## What NOT to put in sections

- API keys, personal data, anything you wouldn't want baked into a model you'll share. The adapter learns from everything in the file.
- JSON / YAML config that the model should emit literally — use instruction Q&A pairs instead. Training on raw config produces noisy generation.
- Massive code dumps (>200 KB). The replay corpus retains everything, and `sequence_len` is bounded at 32 KB; a single enormous section trains one step and wastes the remaining token budget.

## See also