DLM treats determinism as a contract: same input → same adapter SHA.
The contract is enforced by src/dlm/lock/ (Sprint 15), backed by a
golden integration test, and surfaced to users via three CLI flags.
Given:
- the same .dlm source text (SHA-256 match),
- the same base model revision,
- the same pinned versions (torch, transformers, peft, trl, bitsandbytes, accelerate, llama.cpp tag),
- the same hardware tier,
- the same seed and determinism flags,
training produces a byte-identical adapter_model.safetensors.
Proved by tests/integration/lock/test_determinism_golden.py, which
runs two fresh training cycles on the tiny model and asserts the
adapter SHAs match. Approved tuple goldens are tracked at the repo
level in .determinism/lock.json.
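The comparison the golden test performs can be sketched as hashing the adapter file from two runs and checking that the digests agree. This is an illustration only, not the contents of the actual test: `file_sha256` and `assert_same_adapter` are hypothetical helper names, and the two paths stand in for wherever each training cycle writes its `adapter_model.safetensors`.

```python
import hashlib
from pathlib import Path


def file_sha256(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, reading it in fixed-size chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def assert_same_adapter(run_a: Path, run_b: Path) -> str:
    """Raise AssertionError unless both adapter files are byte-identical.

    Hypothetical helper: returns the shared digest when the files match.
    """
    sha_a = file_sha256(run_a)
    sha_b = file_sha256(run_b)
    assert sha_a == sha_b, f"adapter SHA mismatch: {sha_a} != {sha_b}"
    return sha_a
```

Comparing digests rather than raw bytes keeps the check cheap to record: the same hex string can be stored in a golden file and compared on later runs.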
Each store has a dlm.lock next to manifest.json:

    {
      "lock_version": 1,
      "created_at": "2026-04-19T17:30:00",
      "dlm_id": "01HRZYQ2X0MB5K4VN7E9DNT5GH",
      "dlm_sha256": "0123…ef",
      "base_model_revision": "12fd25f77366fa6b3b4b768ec3050bf629380bac",
      "base_model_sha256": null,
      "pinned_versions": {
        "torch": "2.5.1",
        "transformers": "4.46.2",
        "peft": "0.14.0",
        "trl": "0.12.2",
        "bitsandbytes": "0.45.0"
      },
      "cuda_version": null,
      "rocm_version": null,
      "hardware_tier": "mps",
      "seed": 42,
      "determinism_flags": {},
      "determinism_class": "best-effort",
      "license_acceptance": null,
      "last_run_id": 3
    }

The lock is validated on every dlm train and written on success.
When the live runtime diverges from the recorded lock, each field is classified:
| Field | Severity | Policy |
|---|---|---|
| dlm_sha256 | ALLOW | Editing the doc is the point of DLM. |
| base_model_revision | ERROR | Breaks reproducibility; requires --update-lock to accept. |
| torch major version | ERROR | |
| torch minor/patch | WARN | |
| transformers / peft / trl / accelerate / llama_cpp | WARN | |
| bitsandbytes (any change) | WARN | QLoRA kernels are version-sensitive. |
| hardware_tier | WARN | Re-plan recommended. |
| determinism_class | WARN | |
| determinism_flags | WARN | |
WARN mismatches print to stderr but don't block the run. ERROR
mismatches raise LockValidationError → exit code 1 with runbook
hints.
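The severity policy can be read as a small classification function over the recorded lock and the live runtime. The sketch below is an assumption-laden illustration, not the code in src/dlm/lock/: `Severity`, `classify_drift`, and `blocks_run` are hypothetical names, the dictionary keys follow the example lock file, and treating a torch pin that is missing on one side as an ERROR is a choice made here for simplicity.

```python
from enum import Enum


class Severity(Enum):
    ALLOW = "ALLOW"
    WARN = "WARN"
    ERROR = "ERROR"


# Field and package names follow the example lock; the grouping is illustrative.
WARN_FIELDS = ("hardware_tier", "determinism_class", "determinism_flags")
WARN_PACKAGES = ("transformers", "peft", "trl", "accelerate", "llama_cpp", "bitsandbytes")


def _major(version):
    """Return the major component of a dotted version string, or None."""
    return version.split(".")[0] if version else None


def classify_drift(recorded: dict, live: dict) -> dict:
    """Map each drifted field to a Severity; matching fields are omitted."""
    drift = {}
    if recorded.get("dlm_sha256") != live.get("dlm_sha256"):
        drift["dlm_sha256"] = Severity.ALLOW
    if recorded.get("base_model_revision") != live.get("base_model_revision"):
        drift["base_model_revision"] = Severity.ERROR
    for field in WARN_FIELDS:
        if recorded.get(field) != live.get(field):
            drift[field] = Severity.WARN

    old = recorded.get("pinned_versions", {})
    new = live.get("pinned_versions", {})
    if old.get("torch") != new.get("torch"):
        # A major-version change is an ERROR; minor/patch drift is a WARN.
        same_major = _major(old.get("torch")) == _major(new.get("torch"))
        drift["torch"] = Severity.WARN if same_major else Severity.ERROR
    for pkg in WARN_PACKAGES:
        if old.get(pkg) != new.get(pkg):
            drift[pkg] = Severity.WARN
    return drift


def blocks_run(drift: dict) -> bool:
    """True when any drifted field is an ERROR, i.e. the run should abort."""
    return Severity.ERROR in drift.values()
```

Returning only the drifted fields keeps the report short: an empty result means the runtime matches the lock.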
| Flag | Behavior |
|---|---|
| (default) | Validate; abort on ERROR, warn on WARN, proceed + write. |
| --strict-lock | Upgrade every WARN to ERROR. |
| --update-lock | Skip validation, always write. For intentional drift acceptance. |
| --ignore-lock | Skip validation, don't write. For experimentation; the lock on disk stays stale. |
The three flags are mutually exclusive. See CLI reference.
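The flag semantics reduce to three decisions: whether to validate, whether to write, and whether WARNs count as ERRORs. The sketch below models that with argparse purely for illustration; the real CLI may be built on a different framework, and `build_parser` and `lock_plan` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser where the three lock flags cannot be combined."""
    parser = argparse.ArgumentParser(prog="dlm train")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--strict-lock", action="store_true")
    group.add_argument("--update-lock", action="store_true")
    group.add_argument("--ignore-lock", action="store_true")
    return parser


def lock_plan(args: argparse.Namespace) -> dict:
    """Translate the chosen flag into validate / write / strict decisions."""
    return {
        # --update-lock and --ignore-lock both skip validation.
        "validate": not (args.update_lock or args.ignore_lock),
        # Only --ignore-lock leaves the lock on disk untouched.
        "write": not args.ignore_lock,
        # --strict-lock upgrades every WARN to ERROR.
        "strict": args.strict_lock,
    }
```

With this setup, passing two of the flags together makes argparse exit with a usage error before any training work starts.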
The determinism_class field records what tier the host supports:
- strong: CUDA with all deterministic kernels available. Bit-exact reproduction expected across runs.
- best-effort: MPS, ROCm, or CUDA without the full deterministic kernel set. Loss curves are close but not bit-identical.
- advisory: CPU-only or a configuration where DLM refuses to promise determinism (some MPS ops fall here).
The golden integration test runs on CPU (tier advisory) and still
passes because SmolLM2-135M doesn't exercise the nondeterministic
kernels. On larger bases the CPU tier stops being bit-exact; that's
honest and documented.
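As a rough sketch, the class can be derived from the hardware tier plus a flag saying whether the full deterministic kernel set is available. This is a simplification under stated assumptions: `pick_determinism_class` is a hypothetical name, the tier strings "cuda", "rocm", and "cpu" are guesses modeled on the "mps" value in the example lock, and the case where specific MPS ops push a run down to advisory is not modeled.

```python
def pick_determinism_class(hardware_tier: str, full_deterministic_kernels: bool = False) -> str:
    """Pick a class label from the hardware tier (illustrative mapping only)."""
    if hardware_tier == "cuda":
        # CUDA is "strong" only when every deterministic kernel is available.
        return "strong" if full_deterministic_kernels else "best-effort"
    if hardware_tier in {"mps", "rocm"}:
        return "best-effort"
    # CPU-only and any unrecognized tier get no determinism promise.
    return "advisory"
```

Defaulting unknown tiers to advisory errs on the side of promising less rather than more.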
When a pinned version changes deliberately (dep bump, llama.cpp tag move), the recorded adapter SHA must be refreshed:

    # Dry run: report the old vs new SHA without writing.
    $ uv run python scripts/regen-determinism-golden.py
    # Review the diff; then approve:
    $ uv run python scripts/regen-determinism-golden.py --approve

The script:
- Samples capture_runtime_versions() to produce the current tuple.
- Runs the tiny-model training twice; confirms the two SHAs match.
- Writes tests/golden/determinism/tuple-<hash>.json, keyed by a SHA-256 of the sorted version tuple + platform.
- Upserts .determinism/lock.json with the tuple path, adapter SHA, platform, and pinned versions.
Each tuple gets its own golden; the tuple file is keyed by content so
running on a new platform simply writes a new golden file. The repo-level
index keeps the checked-in set explicit and avoids overloading the
per-store dlm.lock name with a second meaning. The reviewer checks in
the tuple file and the index update alongside the dep bump.
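Content-keying the tuple file can be sketched as hashing a canonical serialization of the sorted versions plus the platform string. The exact serialization the script uses is not specified here, so the JSON encoding below is an assumption, as are the names `tuple_hash` and `golden_path`.

```python
import hashlib
import json


def tuple_hash(pinned_versions: dict, platform: str) -> str:
    """SHA-256 over the sorted version tuple plus platform.

    The JSON layout is an assumed canonical form, not the script's actual one.
    """
    payload = json.dumps(
        {"versions": sorted(pinned_versions.items()), "platform": platform},
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def golden_path(pinned_versions: dict, platform: str) -> str:
    """Build the content-keyed golden file path for a version tuple."""
    return f"tests/golden/determinism/tuple-{tuple_hash(pinned_versions, platform)}.json"
```

Sorting before hashing means the key does not depend on dictionary order, while any version bump or platform change yields a new file name instead of overwriting an existing golden.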
What the contract does not cover:
- Byte-exact reproducibility from pure source. DLM's replay corpus carries prior-run signal. Reconstructing a specific adapter without its replay history isn't possible; use dlm pack to archive.
- Airgapped reproducibility. The first dlm train against a new base pulls from HuggingFace. Subsequent runs use the local cache. We don't currently ship a fully-offline path; --include-base on dlm pack is the workaround.
- MPS bit-exactness for large bases. Apple's Metal kernels aren't deterministic for every op we use; the best-effort tier is an honest label, not a TODO.