Metadata-based model fit #768

Draft: i386 wants to merge 29 commits into main from codex/model-fit-metadata-validation

i386 commented May 31, 2026

Summary

This PR turns model-fit into a metadata-first fit estimator with a repeatable validation loop against real Skippy single-stage benchmark runs. The branch is broader than the scorer alone: it adds GGUF tensor-group profiling, measured-hardware diagnostics, model download progress, benchmark timing instrumentation, validation binaries, checked-in validation corpora, documentation, and a GitHub workflow that can gate estimator changes on self-hosted GPU runners.

Users and agents can now point model-fit-validate at Hugging Face model refs, local GGUFs, or checked-in manifest files and get a JSON report containing hardware facts, measured GPU bandwidth, GGUF-derived model profiles, workload recommendations, benchmark observations, and per-scenario estimator agreement. model-fit-check-validation can turn that JSON into a Markdown report and fail CI when accuracy, noise, or minimum coverage thresholds are missed.

Branch Scope

| area | branch change |
| --- | --- |
| GGUF metadata | Adds tensor-group byte classification in model-artifact so model-fit can distinguish attention, dense FFN, expert FFN, embeddings, output, normalization, and other tensors. |
| Fit scoring | Reworks decode estimates around active bytes, measured bandwidth, benchmark noise, backend-neutral measured-GPU overhead, MoE active expert bytes, low-active-byte overhead, medium-width overhead, uncertainty ranges, prefill throughput, first-token latency, workload scoring, and split-candidate recommendations. |
| Hardware profiles | Carries benchmark noise, bandwidth efficiency, and optional compute diagnostics from mesh-llm gpus benchmark into model-fit hardware profiles. |
| Validation runner | Adds model-fit-validate for model refs, local --model ref=...,path=... inputs, --models-file manifests, automatic or supplied GPU benchmark JSON, progress reporting, repeated Skippy runs, steady decode, first-token, and warm KV reuse scenarios. |
| Validation checker | Adds model-fit-check-validation to render Markdown summaries and enforce thresholds for median error, individual error, noisy samples, and minimum model count. |
| Skippy benchmarking | Extends skippy-bench local-single with request counts, session reuse, per-request observations, and decode-only throughput. |
| Skippy server timing | Adds tokenize, prefill, and decode elapsed milliseconds to the embedded /v1/text response so validation can compare steady decode without tokenization/prefill noise. |
| Hugging Face downloads | Adds progress callbacks to model-hf so validation can show file download and ready-state progress while preparing GGUFs. |
| Validation corpora | Checks in smoke and deep model manifests stratified by estimator behavior: tiny dense, small dense, common local 7B/8B, quant pairs, MoE active experts, embeddings, and rerankers. |
| CI workflow | Adds .github/workflows/model-fit-validation.yml with normal metadata checks plus self-hosted GPU smoke validation and manual smoke/deep dispatch. |
| Docs | Adds crates/model-fit/README.md and crates/model-fit/validation/README.md covering the estimator inputs, report shape, and validation corpora. |

Validation Corpora

The model set is stratified by estimator behavior rather than just popularity.

Smoke corpus: crates/model-fit/validation/smoke-models.txt

| bucket | models |
| --- | --- |
| tiny dense / overhead-bound | SmolLM2 135M Q8_0, Qwen3 0.6B Q4_K_M |
| small dense transition | EXAONE 1.2B Q4_K_M, Llama 3.2 3B Q4_K_M |
| 3B/4B/7B/8B local serving | Gemma 3 4B Q4_K_M, Qwen2.5 Coder 7B Q4_K_M, Qwen3 8B Q4_K_M |
| quant slope | Qwen3 8B Q8_0 |
| MoE active expert bytes | Qwen3 30B-A3B Q4_K_M |

Deep corpus: crates/model-fit/validation/deep-models.txt

The deep set expands coverage to Q4/Q8 pairs, 12B/14B/32B dense models, more coder models, Qwen3 30B-A3B MoE variants, embedding GGUFs, and reranker GGUFs. It is intended for manual or nightly high-memory self-hosted runners rather than every PR.

Validation Data

Latest local steady-decode validation run:

target/release/model-fit-validate --no-progress \
  --output-json /tmp/model-fit-validation-aggregate-steady-small.json \
  bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M \
  unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
  unsloth/SmolLM2-135M-Instruct-GGUF:Q8_0 \
  unsloth/Qwen3-8B-GGUF:Q4_K_M

| model | est tok/s | observed tok/s | observed/fit | spread | verdict |
| --- | --- | --- | --- | --- | --- |
| EXAONE 1.2B Q4_K_M | 113.6 | 106.8 | 0.94 | 5.2% | match |
| Qwen3 0.6B Q4_K_M | 129.1 | 119.4 | 0.93 | 4.3% | match |
| SmolLM2 135M Q8_0 | 142.3 | 145.1 | 1.02 | 3.9% | match |
| Qwen3 8B Q4_K_M | 53.4 | 53.9 | 1.01 | 2.7% | match |

Summary from the JSON report:

| metric | value |
| --- | --- |
| models benchmarked | 4 |
| matched | 4 |
| noisy | 0 |
| slower-than-fit | 0 |
| faster-than-fit | 0 |
| median observed/fit | 0.974 |
| mean observed/fit | 0.973 |
| median absolute percent error | 3.96% |
| within tolerance | 4 |
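
The summary metrics can be approximated from the per-model rows. The following is a minimal sketch, assuming each row carries an estimated and an observed tok/s value; the function names are hypothetical and are not the checker's API. Because the table values are rounded, results from this sketch land close to the JSON report rather than matching it digit for digit.

```rust
/// Median of a slice; averages the two middle values for even lengths.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// observed/fit ratio for each (estimated, observed) tok/s pair.
fn observed_over_fit(rows: &[(f64, f64)]) -> Vec<f64> {
    rows.iter().map(|(est, obs)| obs / est).collect()
}

/// Median absolute percent error of observed against estimated tok/s.
fn median_abs_percent_error(rows: &[(f64, f64)]) -> Option<f64> {
    let errors: Vec<f64> = rows
        .iter()
        .map(|(est, obs)| ((obs - est) / est).abs() * 100.0)
        .collect();
    median(&errors)
}
```

Feeding the four rows above into these helpers gives a median observed/fit near 0.975 and a median absolute percent error near 4%.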

The small-model result is the important one. SmolLM2 initially looked noisy when the validator sampled only the final request of each repeat. Aggregating generated tokens over aggregate decode time per repeat changed the steady-decode observation to 1.02 observed/fit with 3.9% spread, which better reflects sustained decode and avoids tuning the estimator against request jitter.
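
The difference between the two sampling approaches can be sketched as follows. This is an illustration only: `RequestObservation` and both functions are hypothetical names, and the real validator's types may differ.

```rust
/// One request inside a repeat: generated tokens and decode-only elapsed time.
#[derive(Clone, Copy)]
struct RequestObservation {
    generated_tokens: u64,
    decode_elapsed_ms: f64,
}

/// Tokens per second, or None when no time was recorded.
fn tps(tokens: u64, elapsed_ms: f64) -> Option<f64> {
    if elapsed_ms > 0.0 {
        Some(tokens as f64 * 1000.0 / elapsed_ms)
    } else {
        None
    }
}

/// Throughput from only the final request of a repeat (the earlier approach).
fn final_request_tps(requests: &[RequestObservation]) -> Option<f64> {
    let last = requests.last()?;
    tps(last.generated_tokens, last.decode_elapsed_ms)
}

/// Aggregate generated tokens over aggregate decode time for the whole repeat.
fn aggregate_tps(requests: &[RequestObservation]) -> Option<f64> {
    let tokens: u64 = requests.iter().map(|r| r.generated_tokens).sum();
    let elapsed_ms: f64 = requests.iter().map(|r| r.decode_elapsed_ms).sum();
    tps(tokens, elapsed_ms)
}
```

With one slow final request in a repeat, `final_request_tps` reports only the outlier while `aggregate_tps` averages it into the sustained rate, which is the jitter effect described above.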

How The Fit Algorithm Works

model-fit remains metadata-first and deterministic. It does not use filenames, catalog reputation, or model-specific boosts to predict throughput. It consumes a HardwareProfile, a GGUF-derived ModelProfile, and a SelectionConfig.

For memory fit, the selector estimates runtime memory as resident weights plus KV cache plus scratch/backend overhead. KV cache is estimated from layer count, target context, KV heads, key/value widths, and configured KV cache type. If GGUF metadata is incomplete, it falls back to hidden size as a conservative KV width. A model is rejected locally when estimated runtime memory exceeds the usable memory budget after the safety margin.
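
As a rough illustration of the memory-fit check, the sketch below computes a KV cache size from the listed inputs and compares total runtime memory with a budget. All names are hypothetical, and treating the safety margin as a fraction of total memory is an assumption rather than the crate's documented behavior.

```rust
/// Inputs for a KV cache estimate; field names are illustrative.
struct KvShape {
    layers: u64,
    context_tokens: u64,
    kv_heads: u64,
    key_width: u64,         // per-head key width
    value_width: u64,       // per-head value width
    bytes_per_element: f64, // e.g. 2.0 for an f16 KV cache
}

/// KV cache bytes: one key and one value vector per KV head, layer, and token.
fn kv_cache_bytes(s: &KvShape) -> f64 {
    let per_token_per_layer = s.kv_heads * (s.key_width + s.value_width);
    (s.layers * s.context_tokens * per_token_per_layer) as f64 * s.bytes_per_element
}

/// True when weights + KV cache + overhead fit in the budget left after the
/// safety margin (assumed here to be a fraction of total memory).
fn fits_locally(
    weight_bytes: f64,
    kv_bytes: f64,
    overhead_bytes: f64,
    total_memory_bytes: f64,
    safety_margin: f64,
) -> bool {
    let usable = total_memory_bytes * (1.0 - safety_margin);
    weight_bytes + kv_bytes + overhead_bytes <= usable
}
```

For example, 32 layers, an 8192-token context, 8 KV heads, 128-wide keys and values, and a 2-byte cache type come to exactly 1 GiB of KV cache under this formula.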

For dense transformer decode, the primary estimate is active bytes per token divided by effective memory bandwidth, plus fixed and shape overheads. Active bytes come from GGUF tensor groups when available, so embeddings and output tensors do not have to be treated the same as per-token transformer work. Measured hardware bandwidth comes from mesh-llm gpus benchmark, using p90 bandwidth and benchmark noise. Once bandwidth is measured, the decode-efficiency path is backend-neutral so the same measured profile does not get extra Metal/CUDA/ROCm assumptions layered on top.
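
A minimal sketch of that bandwidth-bound estimate is shown below. The struct and field names are hypothetical, and the overhead terms are passed in as plain numbers because this description does not give their formulas.

```rust
/// Inputs for a bandwidth-bound decode estimate; names are illustrative.
struct DecodeInputs {
    active_bytes_per_token: f64,
    measured_bandwidth_bytes_per_s: f64,
    bandwidth_efficiency: f64, // fraction of measured bandwidth assumed achievable
    fixed_overhead_ms: f64,
    shape_overhead_ms: f64,
}

/// Per-token latency: bytes moved over effective bandwidth, plus overheads.
fn decode_ms_per_token(i: &DecodeInputs) -> f64 {
    let effective = i.measured_bandwidth_bytes_per_s * i.bandwidth_efficiency;
    let transfer_ms = i.active_bytes_per_token / effective * 1000.0;
    transfer_ms + i.fixed_overhead_ms + i.shape_overhead_ms
}

/// Throughput is the reciprocal of per-token latency.
fn decode_tokens_per_s(i: &DecodeInputs) -> f64 {
    1000.0 / decode_ms_per_token(i)
}
```

Under these assumptions, 4 GB of active bytes at 400 GB/s measured bandwidth and 0.8 efficiency takes 12.5 ms per token before overheads, so a 0.5 ms fixed overhead yields about 77 tok/s.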

For sparse MoE models, decode uses base/resident bytes plus only the active routed expert share. This avoids treating every expert as active for every token while still including resident attention, shared FFN, normalization, output, and other tensor groups when available. MoE also carries a per-layer dispatch overhead so routing work is visible in the latency estimate.
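
The active-byte accounting for MoE can be sketched as below. The names are hypothetical, and the sketch assumes routed expert bytes are spread evenly across experts, which the description does not state explicitly.

```rust
/// Byte totals for a sparse MoE model; names are illustrative.
struct MoeBytes {
    resident_bytes: f64,     // attention, shared FFN, norms, output, other
    expert_bytes_total: f64, // all routed expert tensors combined
    experts_total: u32,
    experts_active: u32,     // routed experts used per token
}

/// Bytes touched per token: resident bytes plus the active expert share.
fn moe_active_bytes(m: &MoeBytes) -> f64 {
    if m.experts_total == 0 {
        return m.resident_bytes;
    }
    let active = m.experts_active.min(m.experts_total);
    let share = active as f64 / m.experts_total as f64;
    m.resident_bytes + m.expert_bytes_total * share
}

/// Routing cost that grows with layer count.
fn moe_dispatch_overhead_ms(layers: u32, per_layer_ms: f64) -> f64 {
    layers as f64 * per_layer_ms
}
```

With 8 of 128 experts active, only one sixteenth of the routed expert bytes count toward per-token traffic, while the resident groups are charged in full.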

For tiny and narrow models, a pure bytes/bandwidth estimate overpredicts because fixed runtime, scheduling, and kernel overhead become a large share of token latency. The estimator applies generic low-active-byte and small-width overheads based on active bytes and hidden width. This is not model-name calibration; it applies to future GGUFs with similar geometry.

For prefill and first-token latency, the selector derives prefill throughput from decode throughput times a shape-based parallelism factor. Prefill exposes much more parallelism than one-token decode, especially for smaller models, but arbitrary GGUF metadata is not rich enough for a stable full prefill simulator. First-token latency combines prompt_tokens / estimated_prefill_tps with one decode step. During validation, first-token scenarios rescore with the observed prompt token count so the comparison uses the prompt that was actually benchmarked.
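
The first-token arithmetic described above can be written out as a short sketch. Function names are hypothetical, and the parallelism factor is supplied by the caller because its shape-based derivation is not given here.

```rust
/// Prefill throughput derived from decode throughput and a parallelism factor.
fn prefill_tps(decode_tps: f64, parallelism_factor: f64) -> f64 {
    decode_tps * parallelism_factor
}

/// First-token latency: time to prefill the prompt plus one decode step.
fn first_token_ms(prompt_tokens: u64, prefill_tps: f64, decode_tps: f64) -> f64 {
    let prefill_ms = prompt_tokens as f64 / prefill_tps * 1000.0;
    let one_decode_ms = 1000.0 / decode_tps;
    prefill_ms + one_decode_ms
}
```

For instance, a 512-token prompt at 1000 tok/s prefill and 50 tok/s decode gives 512 ms of prefill plus a 20 ms decode step. Rescoring with the observed prompt token count amounts to calling `first_token_ms` with the benchmarked prompt length.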

For non-chat workloads, workload suitability is separate from throughput fit. Embedding, classifier/reranker, chat, tool-use, coding, long-context, summarization, and related workload profiles can weight memory, context, decode, prefill, and capability evidence differently. Embedding-like models are not rejected globally; they are accepted or penalized according to the requested workload.

For oversized models, the selector can return split-candidate recommendations instead of only reporting that the model does not fit locally. The split estimate is memory-budget based and warns that inter-stage activation transfer depends on hidden width, layer count, and network bandwidth.

Validator Shape

model-fit-validate prepares a self-contained report by:

  1. loading a supplied GPU benchmark JSON or running a local bandwidth benchmark
  2. resolving Hugging Face GGUF refs or using local GGUF paths
  3. downloading missing artifacts with progress output
  4. profiling each GGUF through model-fit
  5. scoring the primary workload plus workload variants
  6. running Skippy benchmark scenarios against the same model
  7. comparing predicted and observed steady decode, first-token latency, and warm KV reuse where applicable
  8. writing a schema-versioned JSON report and printing a compact Markdown table

The steady-decode scenario prefers decode-only timings from /v1/text, excluding tokenization and prefill. It also increases the generated-token window for tiny active-byte models and aggregates generated tokens over aggregate decode time per repeat to reduce jitter.

Workflow Shape

The new workflow is .github/workflows/model-fit-validation.yml.

For pull requests touching model-fit or the benchmark/server pieces, the workflow runs metadata checks and schedules a self-hosted GPU smoke job. The GPU job:

  1. builds release binaries
  2. runs target/release/mesh-llm gpus benchmark --json
  3. runs model-fit-validate --models-file crates/model-fit/validation/smoke-models.txt
  4. gates the JSON with model-fit-check-validation
  5. writes a Markdown summary to $GITHUB_STEP_SUMMARY
  6. uploads the benchmark JSON, validation JSON, and Markdown report

Manual runs can choose smoke or deep via workflow_dispatch, and the runner labels can be supplied explicitly or through MODEL_FIT_GPU_RUNS_ON.

Protocol / Compatibility

  • No mesh wire protocol, gossip schema, or plugin protocol changes are made in this branch.
  • Skippy ABI gains an additive decode benchmark probe for validation; the ABI version constants are bumped with the patch queue.
  • The embedded Skippy /v1/text JSON response gains additive timing fields: tokenize_elapsed_ms, prefill_elapsed_ms, and decode_elapsed_ms. Older consumers can ignore them; the new validator and benchmark path use them when present and fall back where possible.
  • The model-fit JSON-facing structs gain additive fields for tensor groups, estimate ranges, prefill/first-token estimates, and hardware benchmark diagnostics.

Implementation Notes

  • model-artifact now scans GGUF tensor names into semantically useful byte groups.
  • model-fit exposes TensorGroupBytes, decode and first-token estimate ranges, measured decode efficiency, measured-GPU overhead, and benchmark diagnostics.
  • model-fit-validate supports positional model refs, --model-ref, local --model ref=...,path=..., and --models-file with blank-line and # comment support.
  • model-fit-check-validation reads validator JSON, renders a Markdown table, and enforces thresholds such as median absolute error, individual error, noisy sample count, and minimum model count.
  • model-hf exposes progress events for ensuring, starting, progressing, ready, and complete download states.
  • skippy-bench local-single can issue multiple requests, reuse a session, and report per-request decode-only throughput.
  • The fit crate now has in-code comments explaining the heuristics and their intended boundaries for future developers and agents.
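
The --models-file handling mentioned in the notes above can be pictured with a small sketch. It assumes one model ref per line and treats only whole lines starting with # as comments; the function name is hypothetical, and the real parser may accept more than this.

```rust
/// Parse manifest text into model refs, skipping blank lines and `#` comment lines.
fn parse_models_file(text: &str) -> Vec<String> {
    text.lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(String::from)
        .collect()
}
```

Inline trailing comments are deliberately left unhandled in this sketch, since the description only mentions blank-line and # comment support.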

Validation

Passed:

  • cargo check -p model-fit --bins
  • cargo test -p model-fit --lib
  • cargo clippy -p model-fit --all-targets -- -D warnings
  • cargo check -p skippy-server
  • cargo clippy -p skippy-server --all-targets -- -D warnings
  • cargo test -p skippy-server --lib
  • cargo check -p skippy-bench
  • cargo clippy -p skippy-bench --all-targets -- -D warnings
  • cargo test -p skippy-bench
  • cargo fmt --all -- --check
  • cargo run -p model-fit --bin model-fit-check-validation -- --min-models 4 --markdown-out /tmp/model-fit-check.md /tmp/model-fit-validation-aggregate-steady-small.json
  • cargo run -p xtask -- repo-consistency ci-crate-lists

Known existing repo-consistency issue surfaced by release checks:

  • cargo run -p xtask -- repo-consistency release-targets currently fails because model-fit is listed as publishable but has unpublished workspace dependencies missing from scripts/publish-crates.sh (mesh-llm-gpu-bench, then mesh-llm-system, likely more). I did not expand the publish pipeline in this PR.
  • cargo run -p xtask -- repo-consistency publish-crates fails for the same existing publish-chain reason.

Latest ABI Decode Validation

This update adds the denoised Skippy decode ABI probe and replaces the old low-active/width decode penalties with a source-grounded graph-overhead term. The estimator still does not use model names, filenames, catalog reputation, or observed throughput as inputs. It uses GGUF tensor groups, tensor type mix, logical matmul shape counts, layer count, measured decode-shaped bandwidth, and measured fixed backend submission overhead.

Source anchors checked in the pinned llama.cpp tree:

| source area | evidence used by model-fit |
| --- | --- |
| src/llama-arch.cpp | Repeating attention/FFN tensors map to GGML_OP_MUL_MAT; MoE expert tensors map to GGML_OP_MUL_MAT_ID. |
| src/llama-graph.cpp | Decode builds a repeated per-layer graph around matmul, attention, norm, RoPE, activation, copy/view, and elementwise work. |
| ggml/src/ggml-metal/ggml-metal.metal | Q8_0, Q4_K, matmul, matrix-vector, and MoE ID paths have distinct kernels. |
| ggml/src/ggml-cuda | CUDA has dedicated quantized matmul and MUL_MAT_ID paths, with different single-token and multi-token behavior. |
| ggml/src/ggml-cpu/ggml-cpu.c and repack.cpp | CPU type traits and repack paths distinguish Q8_0, Q8_K, K-quants, and related vec-dot formats. |

Latest two-machine five-model rerun:

| machine | backend | scenario | median abs error | result |
| --- | --- | --- | --- | --- |
| Mac Studio M1 Ultra | Metal | steady_decode | 8.9% | Qwen3 8B matched at 0.98 observed/fit; EXAONE matched at 0.91; Qwen3 0.6B remained a 0.79 miss; SmolLM2 and Llama 3.2 3B were noisy. |
| white.local | CUDA | steady_decode | 5.8% | All five steady-decode samples matched within the 10% target band: 1.02, 1.00, 1.08, 0.94, 0.94 observed/fit. |
| Mac Studio M1 Ultra | Metal | first_token | 12.1% | Close but still misses on Qwen3 0.6B, Llama 3.2 3B, and Qwen3 8B. |
| white.local | CUDA | first_token | 59.7% | Still not solved; needs separate prefill/prompt-shape modeling. |
| Mac Studio M1 Ultra | Metal | kv_warm_reuse | 18.8% | Noisy small-model samples plus slower-than-fit misses on 3B/8B. |
| white.local | CUDA | kv_warm_reuse | 14.0% | Stable but slower than fit for SmolLM2, Llama 3.2 3B, and Qwen3 8B. |

Validation artifacts from this run:

  • Studio JSON: /tmp/model-fit-validation-studio-graph2.json
  • Studio Markdown: /tmp/model-fit-validation-studio-graph2.md
  • white.local JSON: /tmp/model-fit-validation-white-graph2.json
  • white.local Markdown: /tmp/model-fit-validation-white-graph2.md

Residual misses are reported as misses. They were not hidden by widening thresholds or adding model-specific exceptions.
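
The verdict labels in these tables can be read as a band check on observed/fit. The sketch below is an interpretation, not the checker's code: the 10% tolerance comes from the target band mentioned above, while the spread cutoff and the rule that noise takes precedence over the ratio are assumptions.

```rust
/// Verdict labels as they appear in the validation tables.
#[derive(Debug, PartialEq)]
enum Verdict {
    Match,
    SlowerThanFit,
    FasterThanFit,
    InconclusiveNoisy,
}

/// Classify one sample. `spread` and `max_spread` are fractions (0.05 = 5%);
/// `tolerance` is the half-width of the accepted observed/fit band.
fn verdict(
    estimated_tps: f64,
    observed_tps: f64,
    spread: f64,
    tolerance: f64,
    max_spread: f64,
) -> Verdict {
    // Assumption: a noisy sample is inconclusive regardless of its ratio.
    if spread > max_spread {
        return Verdict::InconclusiveNoisy;
    }
    let ratio = observed_tps / estimated_tps;
    if ratio < 1.0 - tolerance {
        Verdict::SlowerThanFit
    } else if ratio > 1.0 + tolerance {
        Verdict::FasterThanFit
    } else {
        Verdict::Match
    }
}
```

Keeping the tolerance as an explicit argument makes it visible when a threshold changes, which fits the stated policy of reporting misses rather than widening the band.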

Broader Validation Rerun

This update also adds the prefill validation scenario and fixes execution-budget selection so a model that fits both CPU and GPU chooses the measured execution path with the better metadata-only throughput estimate before using memory headroom as a tie-breaker. The bug was general: a high-headroom CPU budget could beat a measured GPU budget for one model, causing the recommendation to describe the wrong execution path. The fix is backend-neutral and evidence-ranked; it does not special-case Metal, CUDA, model names, filenames, or current-run measurements.
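
The corrected ordering can be sketched as a two-level comparison: estimated throughput first, memory headroom only as a tie-breaker. The struct and function names are hypothetical, and the label field exists only for display; the ranking never inspects it.

```rust
use std::cmp::Ordering;

/// One execution budget a model fits into; names are illustrative.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionCandidate {
    label: &'static str, // display only, never used for ranking
    estimated_decode_tps: f64,
    memory_headroom_bytes: u64,
}

/// Higher estimated throughput wins; headroom only breaks ties.
fn rank(a: &ExecutionCandidate, b: &ExecutionCandidate) -> Ordering {
    a.estimated_decode_tps
        .total_cmp(&b.estimated_decode_tps)
        .then(a.memory_headroom_bytes.cmp(&b.memory_headroom_bytes))
}

/// Pick the best candidate, or None when nothing fits.
fn choose(candidates: &[ExecutionCandidate]) -> Option<&ExecutionCandidate> {
    candidates.iter().max_by(|a, b| rank(a, b))
}
```

Under this ordering a high-headroom budget with a low throughput estimate can no longer outrank a faster budget, which is the failure mode described above.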

Broader corpus shape: 18 GGUF refs covering tiny dense models, small dense transition models, a Qwen2.5-Coder 7B Q4/Q5/Q6/Q8 slope, Qwen3 8B, one active-expert MoE, and embedding/reranker metadata cases.

Final broader validation:

| machine | backend | scenario | samples | median abs error | notable result |
| --- | --- | --- | --- | --- | --- |
| Mac Studio M1 Ultra | Metal | steady_decode | 16 | 9.8% | Tiny models now select measured Metal instead of optimistic CPU fallback. Qwen2.5-Coder Q5/Q6/Q8 and bge reranker remain slower-than-fit. |
| white.local RTX 5080 | CUDA | steady_decode | 16 | 8.6% | Qwen2.5-Coder Q4/Q5/Q6 matched; Q8, OLMoE, and bge reranker remain honest misses. |
| Mac Studio M1 Ultra | Metal | prefill | 15 | 24.6% | Prefill is now measured separately as prompt tokens / prefill_elapsed_ms; still outside the steady-decode target band. |
| white.local RTX 5080 | CUDA | prefill | 15 | 108.1% | CUDA prefill is much faster than the current prediction on several dense models, so this stays visible as source-work rather than a tuned constant. |
| Mac Studio M1 Ultra | Metal | first_token | 15 | 19.2% | First-token includes tokenize, prefill, first decode, and request overhead; larger coder quants remain slower-than-fit. |
| white.local RTX 5080 | CUDA | first_token | 15 | 58.9% | First-token still needs prompt-shape and backend scheduling work. |
| Mac Studio M1 Ultra | Metal | kv_warm_reuse | 16 | 13.7% | Close, but Qwen2.5-Coder Q4/Q5 and bge reranker remain slower-than-fit. |
| white.local RTX 5080 | CUDA | kv_warm_reuse | 16 | 16.4% | Stable samples, but several dense and reranker cases remain slower-than-fit. |

Representative steady-decode rows:

| model | Studio fit/obs | Studio verdict | white fit/obs | white verdict |
| --- | --- | --- | --- | --- |
| unsloth/SmolLM2-135M-Instruct-GGUF:Q8_0 | 144.9 / 135.3 | match | 962.5 / 987.3 | match |
| unsloth/Qwen3-0.6B-GGUF:Q4_K_M | 141.8 / 124.7 | slower-than-fit | 632.8 / 634.7 | match |
| bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M | 118.4 / 116.8 | match | 497.5 / 536.6 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M | 62.6 / 56.6 | match | 186.1 / 169.9 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M | 56.3 / 32.3 | slower-than-fit | 163.2 / 150.8 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K | 50.7 / 40.0 | slower-than-fit | 143.7 / 131.4 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0 | 58.8 / 36.0 | slower-than-fit | 181.5 / 108.1 | slower-than-fit |
| unsloth/Qwen3-8B-GGUF:Q4_K_M | 51.8 / 49.2 | match | 162.8 / 152.1 | match |
| bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M | 151.8 / 136.3 | inconclusive-noisy | 314.6 / 618.5 | faster-than-fit |
| gpustack/bge-reranker-v2-m3-GGUF:Q8_0 | 175.2 / 119.8 | slower-than-fit | 1028.6 / 722.8 | slower-than-fit |

Validation artifacts from the broader run:

  • Studio JSON: /tmp/model-fit-validation-studio-broader-final.json
  • Studio Markdown: /tmp/model-fit-validation-studio-broader-final.md
  • white.local JSON: /tmp/model-fit-validation-white-broader-final.json
  • white.local Markdown: /tmp/model-fit-validation-white-broader-final.md

Focused Q8/MoE Follow-Up

This follow-up addresses the two visible steady-decode misses from the broader run without using model names, filenames, backend-specific constants, or current-run throughput as estimator inputs.

Source-grounded changes:

  • Q8_0, Q5_K, and Q6_K tensor traffic now counts stored GGUF matmul bytes instead of discounting resident bytes. The pinned llama.cpp sources show Metal Q8_0 mat-vec kernels reading block_q8_0 directly, and CUDA MMVQ/MMQ using q8-specific vec-dot helpers; there is no source basis for charging less than the stored quantized bytes.
  • Sparse MoE dispatch overhead now scales from measured fixed submission overhead when measured GPU data is available, capped by the existing conservative fallback prior. The llama.cpp MoE graph contains real router/top-k, MUL_MAT_ID, weighting, and aggregation work, but CUDA can also fuse eligible expert paths and use optimized quantized kernels, so one backend-independent hand-written MoE dispatch constant was too conservative.

Focused corpus: crates/model-fit/validation/q8-moe-models.txt, covering Qwen2.5-Coder 7B Q4/Q5/Q6/Q8 slope, OLMoE active-expert MoE, and a Q8 reranker control.

| machine | backend | scenario | samples | median abs error | notable result |
| --- | --- | --- | --- | --- | --- |
| Mac Studio M1 Ultra | Metal | steady_decode | 6 | 8.8% | Qwen2.5-Coder Q8 moved to 1.06 observed/fit; OLMoE 0.88 and noisy; bge reranker remains slower-than-fit. |
| white.local RTX 5080 | CUDA | steady_decode | 6 | 8.3% | Qwen2.5-Coder Q8 improved from 0.60 to 0.86 observed/fit; OLMoE improved from 1.97 to 0.90 observed/fit. |
| Mac Studio M1 Ultra | Metal | prefill | 5 | 12.2% | Q4/Q5/Q6 and OLMoE are close; Q8 prefill remains noisy/slower. |
| white.local RTX 5080 | CUDA | prefill | 5 | 223.6% | CUDA prefill remains the largest residual miss and is reported rather than hidden. |
| Mac Studio M1 Ultra | Metal | kv_warm_reuse | 6 | 14.3% | Q8 matches warm reuse; Q4/Q5 and bge remain slower/noisy. |
| white.local RTX 5080 | CUDA | kv_warm_reuse | 6 | 17.5% | Qwen2.5-Coder and bge KV reuse remain slower than steady-decode fit. |

MoE Prefill Probe and First-Token Breakdown

This update adds a CUDA prefill_moe_matmul_tflops_fp16 hardware fact from a strided-batched FP16 cuBLAS GEMM shaped like active-expert MoE prefill. The field is carried through mesh-llm gpus benchmark, cached system GPU facts, model-fit HardwareProfile, validator JSON, and the fit-input contract.

The scorer does not treat that raw MoE GEMM probe as a free speedup. It is used as an upper bound because llama.cpp MoE prefill has router/top-k, expert-id mapping, GGML_OP_MUL_MAT_ID, weighting, and aggregation work around the expert GEMMs. When both the measured MoE roofline and the older MoE-aware fallback exist, the scorer uses the lower throughput. That is the conservative source-grounded rule, and the unit test locks it in.
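
The "use the lower throughput" rule can be sketched with optional inputs. The function name is hypothetical, and the behavior when only one estimate is present is an assumption; the text above only specifies the case where both exist.

```rust
/// Combine a measured MoE roofline with the MoE-aware fallback.
/// When both exist, the lower (more conservative) throughput wins.
fn bounded_moe_prefill_tps(
    measured_roofline_tps: Option<f64>,
    fallback_tps: Option<f64>,
) -> Option<f64> {
    match (measured_roofline_tps, fallback_tps) {
        (Some(measured), Some(fallback)) => Some(measured.min(fallback)),
        // Assumption: with a single estimate available, use it as-is.
        (Some(measured), None) => Some(measured),
        (None, Some(fallback)) => Some(fallback),
        (None, None) => None,
    }
}
```

Treating the probe as an upper bound means a very fast raw GEMM measurement can never raise the prediction above what the fallback already allows.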

The validator now also serializes first-token components:

  • predicted prefill ms
  • predicted decode ms
  • predicted overhead ms
  • observed tokenize ms
  • observed prefill ms
  • observed decode ms
  • observed unattributed ms

Latest focused white.local CUDA rerun after the MoE probe and first-token split:

| scenario | samples | median abs error | notable result |
| --- | --- | --- | --- |
| steady_decode | 6 | 8.0% | Qwen2.5-Coder Q4/Q5/Q6/Q8 and OLMoE remain close enough for the focused set; bge reranker remains a control miss/runtime-error in some scenarios. |
| prefill | 5 | 5.4% | Qwen2.5-Coder Q4/Q5/Q8 are in band; Q6 is 14% slower-than-fit; OLMoE is noisy at 0.85 observed/fit, with the MoE probe bounded rather than over-trusted. |
| first_token | 5 | 16.9% | Tokenize is about 7 ms and unattributed request time is near zero; the miss is mostly first decode after prefill. |
| kv_warm_reuse | 6 | 17.1% | Warm reuse remains slower than steady-decode fit for several samples. |

Representative first-token component rows from that run:

| model | predicted prefill ms | predicted decode ms | observed tokenize ms | observed prefill ms | observed decode ms | observed unattributed ms |
| --- | --- | --- | --- | --- | --- | --- |
| Qwen2.5-Coder Q4_K_M | 404.4 | 5.4 | 6.9 | 406.3 | 54.3 | 1.0 |
| Qwen2.5-Coder Q5_K_M | 404.4 | 6.3 | 6.9 | 416.8 | 55.7 | 0.8 |
| Qwen2.5-Coder Q6_K | 404.4 | 7.4 | 6.9 | 468.8 | 59.4 | 0.5 |
| Qwen2.5-Coder Q8_0 | 402.4 | 7.9 | 7.7 | 372.6 | 58.5 | 0.7 |
| OLMoE Q4_K_M | 239.5 | 1.4 | 7.9 | 276.9 | 82.8 | 0.0 |

The next honest target is therefore a first-decode-after-prefill hardware fact or source-derived model for session/prompt-transition decode, not a wider error band and not a model-name exception.

Representative focused steady-decode rows:

| model | Studio fit/obs | Studio verdict | white fit/obs | white verdict |
| --- | --- | --- | --- | --- |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M | 63.4 / 62.1 | match | 182.9 / 169.9 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M | 56.8 / 51.8 | match | 158.9 / 150.8 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K | 49.6 / 57.1 | faster-than-fit | 134.4 / 131.4 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0 | 45.5 / 48.2 | match | 125.6 / 108.2 | slower-than-fit |
| bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M | 152.4 / 133.3 | inconclusive-noisy | 683.9 / 618.7 | match |
| gpustack/bge-reranker-v2-m3-GGUF:Q8_0 | 165.1 / 111.7 | slower-than-fit | 906.8 / 723.2 | slower-than-fit |

Validation artifacts from the focused run:

  • Studio JSON: /tmp/model-fit-validation-studio-q8-moe.json
  • Studio Markdown: /tmp/model-fit-validation-studio-q8-moe.md
  • white.local JSON: /tmp/model-fit-validation-white-q8-moe.json
  • white.local Markdown: /tmp/model-fit-validation-white-q8-moe.md

Dense Prefill Roofline Follow-Up

This update splits prefill from decode without adding backend-specific rules. mesh-llm gpus benchmark now reports an optional generic prefill_matmul_tflops_fp16 hardware fact. model-fit consumes that field for dense transformer prefill roofline estimates, using GGUF matmul FLOPs, prompt tokens, llama.cpp ubatch shape, active weight bytes, measured memory bandwidth, and measured graph overhead.
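
A textbook roofline over these inputs might look like the sketch below: each ubatch costs the larger of its compute time and its weight-read time, plus graph overhead. This is an illustration under stated assumptions, not the estimator's actual formula; every name is hypothetical, and charging one full weight read and one overhead term per ubatch is a simplification.

```rust
/// Inputs for a dense prefill roofline; names are illustrative.
struct PrefillInputs {
    matmul_flops_per_token: f64,
    prompt_tokens: u64,
    ubatch_tokens: u64,
    active_weight_bytes: f64,
    prefill_matmul_tflops_fp16: f64,
    memory_bandwidth_bytes_per_s: f64,
    graph_overhead_ms_per_ubatch: f64,
}

/// Sum per-ubatch cost: max(compute time, weight-read time) + graph overhead.
fn prefill_ms(i: &PrefillInputs) -> f64 {
    let ubatch = i.ubatch_tokens.max(1);
    let peak_flops = i.prefill_matmul_tflops_fp16 * 1e12;
    let memory_ms = i.active_weight_bytes / i.memory_bandwidth_bytes_per_s * 1000.0;
    let mut remaining = i.prompt_tokens;
    let mut total_ms = 0.0;
    while remaining > 0 {
        let tokens = remaining.min(ubatch);
        let compute_ms = i.matmul_flops_per_token * tokens as f64 / peak_flops * 1000.0;
        total_ms += compute_ms.max(memory_ms) + i.graph_overhead_ms_per_ubatch;
        remaining -= tokens;
    }
    total_ms
}
```

In this form a long prompt is compute-bound and tracks the measured matmul throughput, while a very short prompt falls back to the memory-bound cost of reading the weights once.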

Sparse MoE intentionally does not use the dense matmul probe. llama.cpp routes MoE through expert selection and GGML_OP_MUL_MAT_ID plus id mapping, weighting, and aggregation, so dense GEMM throughput is not the right hardware fact for active-expert prefill. MoE stays on the existing fallback until there is a measured MoE-shaped hardware probe.

Focused white.local CUDA rerun after the corrected wiring:

| scenario | samples | median abs error | notable result |
| --- | --- | --- | --- |
| steady_decode | 6 | 7.9% | Decode stayed in the target band; Q8 and bge reranker remain visible slower-than-fit residuals. |
| prefill | 5 | 5.7% | Qwen2.5-Coder Q4/Q5/Q8 matched; Q6 is 14% slower-than-fit; OLMoE uses the MoE fallback and is noisy/slower-than-fit. |
| first_token | 5 | 17.0% | Dense first-token improved but still includes request/setup/tokenize/decode overhead not isolated by the prefill throughput check. |
| kv_warm_reuse | 6 | 17.0% | KV reuse remains slower-than-fit across this focused CUDA set. |

Focused Studio Metal rerun with the dense prefill split but without a Metal matmul probe yet:

| scenario | samples | median abs error | notable result |
| --- | --- | --- | --- |
| steady_decode | 6 | 9.0% | Decode stayed in band overall; Q6/Q8 were faster-than-fit in this run. |
| prefill | 5 | 3.7% | Qwen2.5-Coder Q5/Q8 and OLMoE matched; Q4 was 15% faster-than-fit and Q6/Q8 carried noisy samples. |
| first_token | 5 | 9.5% | First-token reached the target band on the focused Metal set. |
| kv_warm_reuse | 6 | 19.2% | KV reuse remains the main Metal residual in this set. |

Validation artifacts:

  • Studio JSON: /tmp/model-fit-validation-studio-prefill-roofline.json
  • Studio Markdown: /tmp/model-fit-validation-studio-prefill-roofline.md
  • white.local JSON: /tmp/model-fit-validation-white-prefill-matmul4.json
  • white.local Markdown: /tmp/model-fit-validation-white-prefill-matmul4.md

Cross-Backend Prefill Probe Coverage

This update keeps the estimator backend-neutral while broadening the hardware facts that can feed it:

| execution path | implementation | validation status |
| --- | --- | --- |
| CUDA | dense FP16 GEMM and MoE-shaped strided-batched FP16 GEMM through cuBLAS | Re-ran on white.local RTX 5080; JSON reports prefill_matmul_tflops_fp16: 112.31 and prefill_moe_matmul_tflops_fp16: 84.44. |
| Metal | dense FP16 GEMM and MoE-shaped expert GEMMs through Metal Performance Shaders | Re-ran on Mac Studio M1 Ultra; JSON reports prefill_matmul_tflops_fp16: 11.8 and prefill_moe_matmul_tflops_fp16: 9.02. |
| ROCm/HIP | dense FP16 GEMM and MoE-shaped strided-batched FP16 GEMM through hipBLAS | Implemented, but not runtime-validated in this session because no ROCm runner was available. |
| CPU | CpuProfile can carry the same optional facts and the scorer consumes them | Intentionally not auto-filled; many CPUs do not have a comparable FP16 matmul path, so model-fit leaves these facts absent unless a CPU benchmark supplies them. |

The scoring rule is unchanged: model-fit consumes optional measured facts by name and does not branch on backend names. Backend-specific code exists only in benchmark implementations because each native runtime has a different API for measuring the same semantic operation.

Additional validation for this update:

  • mesh-llm gpus benchmark --json on Mac Studio M1 Ultra with Metal emitted both dense and MoE prefill facts.
  • mesh-llm gpus benchmark --json on white.local with CUDA emitted both dense and MoE prefill facts.
  • cargo check -p mesh-llm-gpu-bench --features cuda on white.local passed.
  • cargo test -p model-fit --lib passed.
  • cargo clippy -p mesh-llm --all-targets -- -D warnings passed.
  • cargo fmt --all -- --check and git diff --check passed.

Latest Post-Prefill Validation

This update adds a measured post_prefill_decode_overhead_ms fact to mesh-llm gpus benchmark for CUDA, Metal, and ROCm/HIP. The fit estimator uses it only as a lower-bound first-token transition cost: it measures issuing decode-shaped work after prefill-shaped matmul work without a GGUF model loaded. It does not absorb tokenizer, HTTP, llama.cpp graph/session, sampling, or observed model benchmark residuals.

Current broader validation after this probe:

| machine | backend | scenario | samples | median abs error | note |
| --- | --- | --- | --- | --- | --- |
| Mac Studio M1 Ultra | Metal | steady_decode | 16 | 18.0% | Several small/medium dense and reranker cases remain slower-than-fit. |
| white.local RTX 5080 | CUDA | steady_decode | 16 | 7.7% | Most generation models match; SmolLM2 Q4 is faster-than-fit, Phi/Qwen2.5 Q8/bge remain slower. |
| Mac Studio M1 Ultra | Metal | prefill | 15 | 12.8% | Improved, still outside the 10% target. |
| white.local RTX 5080 | CUDA | prefill | 15 | 29.1% | CUDA prefill still needs source-grounded shape work. |
| Mac Studio M1 Ultra | Metal | first_token | 15 | 25.8% | Post-prefill probe does not explain all first-token residuals. |
| white.local RTX 5080 | CUDA | first_token | 15 | 68.7% | First-token remains the main unsolved scenario. |
| Mac Studio M1 Ultra | Metal | kv_warm_reuse | 16 | 9.5% | Within target median, with some per-model misses still visible. |
| white.local RTX 5080 | CUDA | kv_warm_reuse | 16 | 13.9% | Close but still outside target median. |

Full steady-decode comparison from the broader rerun:

| model | Studio pred/actual | Studio ratio | Studio verdict | white pred/actual | white ratio | white verdict |
| --- | --- | --- | --- | --- | --- | --- |
| unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M | 163.9 / 170.59 | 1.04 | match | 735.3 / 998.92 | 1.36 | faster-than-fit |
| unsloth/SmolLM2-135M-Instruct-GGUF:Q8_0 | 161.4 / 171.38 | 1.06 | match | 921.1 / 990.24 | 1.08 | match |
| unsloth/Qwen3-0.6B-GGUF:Q4_K_M | 158.5 / 119.10 | 0.75 | slower-than-fit | 630.3 / 634.00 | 1.01 | match |
| unsloth/Qwen3-0.6B-GGUF:Q8_0 | 146.3 / 113.82 | 0.78 | slower-than-fit | 658.2 / 598.71 | 0.91 | match |
| bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M | 131.0 / 114.06 | 0.87 | inconclusive-noisy | 494.3 / 536.60 | 1.09 | match |
| bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q8_0 | 112.2 / 112.34 | 1.00 | match | 442.1 / 407.45 | 0.92 | match |
| unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M | 100.5 / 80.00 | 0.80 | slower-than-fit | 316.2 / 301.32 | 0.95 | match |
| unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M | 82.7 / 70.52 | 0.85 | slower-than-fit | 308.3 / 265.63 | 0.86 | slower-than-fit |
| unsloth/Qwen3-4B-GGUF:Q4_K_M | 75.1 / 64.54 | 0.86 | inconclusive-noisy | 245.6 / 232.88 | 0.95 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M | 65.3 / 58.29 | 0.89 | slower-than-fit | 182.2 / 169.88 | 0.93 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M | 58.1 / 47.58 | 0.82 | slower-than-fit | 158.3 / 150.76 | 0.95 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K | 50.3 / 44.37 | 0.88 | inconclusive-noisy | 133.8 / 131.27 | 0.98 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0 | 45.9 / 35.36 | 0.77 | slower-than-fit | 125.1 / 108.17 | 0.86 | slower-than-fit |
| unsloth/Qwen3-8B-GGUF:Q4_K_M | 54.4 / 49.56 | 0.91 | match | 159.5 / 151.67 | 0.95 | match |
| bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M | 159.2 / 115.18 | 0.72 | slower-than-fit | 681.7 / 618.58 | 0.91 | match |
| gpustack/bge-reranker-v2-m3-GGUF:Q8_0 | 185.1 / 110.05 | 0.59 | slower-than-fit | 905.0 / 719.27 | 0.79 | slower-than-fit |

Validation artifacts from the post-prefill broader run:

  • Studio JSON: /tmp/model-fit-validation-studio-postprefill-broader.json
  • Studio Markdown: /tmp/model-fit-validation-studio-postprefill-broader.md
  • white.local JSON: /tmp/model-fit-validation-white-postprefill-broader.json
  • white.local Markdown: /tmp/model-fit-validation-white-postprefill-broader.md

Residual misses are still reported as misses. This branch has not added backend-specific estimator constants or model-specific exceptions to make the table look better.

Expanded FFN Decode Graph Validation

This update fixes two issues found in the Studio Metal follow-up:

  • GGUFs tagged as text-classification, reranker, ranking, or ranker now classify as RerankerOrClassifier before dense transformer shape fallback. This keeps reranker GGUFs out of autoregressive decode validation and lets workload scoring choose the right task shape.
  • Dense decode graph overhead now adds a source-derived expanded-FFN sequential-stage term. The term uses GGUF matmul shape summaries: logical matmul count, layer count, hidden width, and FFN expansion. It is multiplied by the measured decode_fixed_overhead_ms from the hardware profile, so it is not a backend-name branch and it barely moves low-dispatch-latency profiles.
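The classification order in the first bullet can be sketched as follows. This is a minimal Python illustration, not the crate's Rust code: `RerankerOrClassifier` and the four tags come from the text above, while `DenseTransformer`, `classify`, `eligible_for_decode_validation`, and the case-insensitive matching are hypothetical names and choices made for the example.

```python
from enum import Enum


class ModelKind(Enum):
    RERANKER_OR_CLASSIFIER = "RerankerOrClassifier"  # name taken from the PR text
    DENSE_TRANSFORMER = "DenseTransformer"           # hypothetical name for the fallback


# Tags listed in the PR text; case-insensitive matching is an assumption.
RERANKER_TAGS = {"text-classification", "reranker", "ranking", "ranker"}


def classify(tags, has_dense_shape):
    """Check task tags first, then fall back to the dense transformer shape."""
    normalized = {tag.strip().lower() for tag in tags}
    if normalized & RERANKER_TAGS:
        return ModelKind.RERANKER_OR_CLASSIFIER
    if has_dense_shape:
        return ModelKind.DENSE_TRANSFORMER
    return None


def eligible_for_decode_validation(kind):
    """Only autoregressive decoders belong in steady-decode validation."""
    return kind is ModelKind.DENSE_TRANSFORMER
```

Because the tag check runs before the shape fallback, a reranker GGUF whose tensors look like a dense transformer is still kept out of decode validation.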

Studio affected rerun:

| model | predicted tok/s | actual tok/s | actual/fit | verdict |
| --- | --- | --- | --- | --- |
| unsloth/Qwen3-0.6B-GGUF:Q4_K_M | 117.3 | 121.96 | 1.04 | match |
| unsloth/Qwen3-0.6B-GGUF:Q8_0 | 111.1 | 110.07 | 0.99 | match |
| bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M | 132.6 | 113.18 | 0.85 | inconclusive-noisy |
| bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q8_0 | 114.9 | 111.23 | 0.97 | match |
| unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M | 90.1 | 82.81 | 0.92 | match |
| unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M | 76.3 | 73.04 | 0.96 | match |
| unsloth/Qwen3-4B-GGUF:Q4_K_M | 64.9 | 70.82 | 1.09 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M | 59.8 | 59.80 | 1.00 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M | 54.1 | 52.39 | 0.97 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K | 47.8 | 52.45 | 1.10 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0 | 44.2 | 30.23 | 0.68 | slower-than-fit |
| unsloth/Qwen3-8B-GGUF:Q4_K_M | 49.9 | 50.55 | 1.01 | match |
| gpustack/bge-reranker-v2-m3-GGUF:Q8_0 | rejected | skipped | - | skipped |

On the affected rerun, the steady-decode median absolute error is 4.0% across accuracy-gated samples. The report still fails strict validation by design, because Qwen2.5-Coder 7B Q8_0 was outside tolerance and one sample was noisy. A focused rerun contradicted the Q8_0 row: Q8_0 came back faster-than-fit at 1.28 while Q4_K_M came back slower-than-fit at 0.78. Because the Studio measurements contradict each other, I did not add a Q8 correction.
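As a cross-check of the 4.0% figure, the sketch below recomputes a median absolute error from the predicted and actual tok/s in the Studio affected rerun table. Two things are assumptions rather than the checker's confirmed logic: the error is taken as |actual / predicted - 1|, and "accuracy-gated" is read as dropping the `inconclusive-noisy` row and the skipped reranker.

```python
from statistics import median

# (predicted tok/s, actual tok/s, verdict) copied from the Studio affected rerun.
ROWS = [
    (117.3, 121.96, "match"),
    (111.1, 110.07, "match"),
    (132.6, 113.18, "inconclusive-noisy"),
    (114.9, 111.23, "match"),
    (90.1, 82.81, "match"),
    (76.3, 73.04, "match"),
    (64.9, 70.82, "match"),
    (59.8, 59.80, "match"),
    (54.1, 52.39, "match"),
    (47.8, 52.45, "match"),
    (44.2, 30.23, "slower-than-fit"),
    (49.9, 50.55, "match"),
]


def median_abs_error(rows):
    """Median of |actual / predicted - 1| over rows that are not noisy (assumed gating)."""
    errors = [
        abs(actual / predicted - 1.0)
        for predicted, actual, verdict in rows
        if verdict != "inconclusive-noisy"
    ]
    return median(errors)


print(f"{median_abs_error(ROWS):.1%}")
```

Under those assumptions the median lands on the Qwen3 0.6B Q4_K_M row and rounds to 4.0%. The single Q8_0 outlier does not move it, which is why the strict per-row check is reported separately from the median.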

First-token remains separate work. The breakdown shows most misses come from long-prompt prefill and first decode after prefill, not steady decode. That should be addressed with a separate source-grounded model or benchmark fact, not by distorting steady-decode fit.

Validation artifacts:

  • Affected Studio JSON: /tmp/model-fit-validation-studio-expanded-ffn-affected.json
  • Affected Studio Markdown: /tmp/model-fit-validation-studio-expanded-ffn-affected.md
  • Q8 rerun JSON: /tmp/model-fit-validation-studio-expanded-ffn-q8-rerun.json

Latest Ubatch Prefill Validation

This update adds prefill_ubatch_matmul_tflops_fp16 to mesh-llm gpus benchmark for CUDA, Metal, and ROCm/HIP implementations and carries it through system GPU facts, validator JSON, and model-fit hardware profiles.

Source rationale: llama.cpp prompt processing is ubatch-shaped (n_ubatch, commonly 512), so a prompt/prefill roofline should prefer a measured ubatch-shaped GEMM over a generic square GEMM when available. The scorer consumes this as a named hardware fact and still does not branch on Metal/CUDA/ROCm names, model names, filenames, catalog reputation, or observed throughput.
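A minimal sketch of the "prefer the ubatch-shaped fact when available" rule, written in Python for illustration only. `prefill_ubatch_matmul_tflops_fp16` is the field named above; `HardwareProfile`, `prefill_square_matmul_tflops_fp16`, and `prefill_tflops` are hypothetical names, and the real selection logic may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class HardwareProfile:
    # Field name from the PR text.
    prefill_ubatch_matmul_tflops_fp16: Optional[float] = None
    # Hypothetical name for the generic square-GEMM measurement.
    prefill_square_matmul_tflops_fp16: Optional[float] = None


def prefill_tflops(profile: HardwareProfile) -> Optional[float]:
    """Pick the prefill roofline input from measured facts only.

    There is deliberately no backend name, model name, or filename here:
    the choice depends only on which measurements are present.
    """
    ubatch = profile.prefill_ubatch_matmul_tflops_fp16
    if ubatch is not None and ubatch > 0:
        return ubatch
    return profile.prefill_square_matmul_tflops_fp16
```

An older benchmark JSON that lacks the ubatch measurement would still produce a prefill estimate through the square-GEMM fallback.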

Focused Mac Studio M1 Ultra rerun after this change:

| model_ref | fit | est tok/s | abi tok/s | est range | steady median | steady/fit | steady verdict | first-token verdict | kv-reuse verdict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| unsloth/Qwen3-0.6B-GGUF:Q4_K_M | FitsLocal | 104.5 | 142.4 | 67.9-141.0 | 123.89 | 1.19 | faster-than-fit | faster-than-fit | inconclusive-noisy |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M | FitsLocal | 53.8 | 60.9 | 40.4-67.3 | 58.46 | 1.09 | match | match | inconclusive-noisy |
| unsloth/Qwen3-8B-GGUF:Q4_K_M | FitsLocal | 44.9 | 59.7 | 33.7-56.2 | 57.56 | 1.28 | faster-than-fit | match | match |

First-token breakdown from the same run:

| model | predicted total ms | observed total ms | observed/fit | predicted prefill ms | observed prefill ms | predicted decode ms | observed decode ms | verdict |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Qwen3 0.6B Q4 | 717.3 | 601.3 | 0.84 | 707.5 | 551.8 | 9.5 | 37.2 | faster-than-fit |
| Qwen2.5-Coder 7B Q4 | 4217.3 | 4534.2 | 1.08 | 4198.6 | 4342.0 | 18.5 | 182.9 | match |
| Qwen3 8B Q4 | 4531.1 | 4820.0 | 1.06 | 4508.6 | 4603.0 | 22.2 | 198.5 | match |

Rendered first-token checker result for this focused subset:

| metric | value |
| --- | --- |
| first-token median absolute error | 7.5% |
| first-token rows in-band | 2 / 3 |
| remaining miss | Qwen3 0.6B Q4 was faster than fit by 16.2% |
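First-token is a latency, so the verdict direction is the reverse of the throughput tables: an observed/fit ratio below 1 means faster than fit. The sketch below reproduces the three first-token rows under an assumed 10% per-row tolerance. That tolerance is inferred from the verdicts in this PR rather than read from the checker, and the sketch ignores the noise handling that produces `inconclusive-noisy`.

```python
from statistics import median

# (model, predicted total ms, observed total ms) from the first-token breakdown.
FIRST_TOKEN_ROWS = [
    ("Qwen3 0.6B Q4", 717.3, 601.3),
    ("Qwen2.5-Coder 7B Q4", 4217.3, 4534.2),
    ("Qwen3 8B Q4", 4531.1, 4820.0),
]


def latency_verdict(predicted_ms, observed_ms, tolerance=0.10):
    """Return (observed/fit ratio, verdict); lower latency than predicted is 'faster'."""
    ratio = observed_ms / predicted_ms
    if ratio < 1.0 - tolerance:
        return ratio, "faster-than-fit"
    if ratio > 1.0 + tolerance:
        return ratio, "slower-than-fit"
    return ratio, "match"


results = [latency_verdict(pred, obs) for _, pred, obs in FIRST_TOKEN_ROWS]
in_band = sum(1 for _, verdict in results if verdict == "match")
median_error = median(abs(ratio - 1.0) for ratio, _ in results)

print(f"in-band: {in_band} / {len(results)}, median abs error: {median_error:.1%}")
```

With these inputs the sketch reports 2 of 3 rows in band and a 7.5% median absolute error, and the Qwen3 0.6B Q4 row comes out about 16.2% faster than fit.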

Important caveat: the local benchmark for this subset reported high benchmark noise (noise_pct: 26.31). The subset is useful evidence that ubatch-shaped prefill is the right measurement shape, but it is not broad proof that first-token is solved. The first decode after prefill remains visibly under-modeled: total first-token can match because prefill dominates, while the decode component is still much larger than predicted.

Validation commands added for this update:

LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm-gpu-bench
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo test -p model-fit --lib
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo build -p model-fit --release --bins
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo run -p mesh-llm -- gpus benchmark --json
target/release/model-fit-validate --no-progress --output-json /tmp/model-fit-validation-studio-ubatch-firsttoken-subset.json \
  unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
  unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M \
  unsloth/Qwen3-8B-GGUF:Q4_K_M
target/release/model-fit-check-validation --scenario first_token --markdown-out /tmp/model-fit-validation-studio-ubatch-firsttoken-subset-first-token.md \
  /tmp/model-fit-validation-studio-ubatch-firsttoken-subset.json
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo clippy -p mesh-llm --all-targets -- -D warnings
cargo fmt --all -- --check
git diff --check

model-fit-check-validation --scenario first_token intentionally exited non-zero for the focused subset because Qwen3 0.6B Q4 remained outside the strict individual tolerance. The report was still rendered and the miss is documented above.

Next validation target: rerun the broader smoke/deep validation on Studio and white.local using the new ubatch-shaped hardware fact under lower-noise benchmark conditions, then add a dedicated source-grounded first-decode-after-prefill model or probe instead of bending steady-decode fit.

Broader Ubatch Rerun

After adding the ubatch-shaped prefill hardware fact, I reran the broader validation corpus on both Mac Studio M1 Ultra/Metal and white.local RTX 5080/CUDA.

Hardware facts from this run:

| machine | backend | decode GB/s | prefill square TFLOP/s | prefill ubatch TFLOP/s | MoE prefill TFLOP/s | benchmark noise |
| --- | --- | --- | --- | --- | --- | --- |
| Mac Studio M1 Ultra | Metal | 411.16 | 9.76 | 10.82 | 8.99 | 24.39% |
| white.local RTX 5080 | CUDA | 903.31 | 112.23 | 105.81 | 84.24 | 0.35% |
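To show how the decode GB/s fact bounds throughput, here is a simplified memory-bandwidth ceiling in Python. It is only the roofline upper bound, not the estimator in this branch, which also applies overhead and uncertainty terms. The 4.5 GB of active bytes per token is a hypothetical value chosen for the example and does not describe any model in the corpus.

```python
def decode_ceiling_tok_s(bandwidth_gb_s, active_gb_per_token):
    """Upper bound on decode tok/s if every token must stream the active bytes once."""
    if active_gb_per_token <= 0:
        raise ValueError("active_gb_per_token must be positive")
    return bandwidth_gb_s / active_gb_per_token


# Measured decode bandwidth from the hardware facts table.
STUDIO_GB_S = 411.16
WHITE_GB_S = 903.31

# Hypothetical active bytes per token, for illustration only.
ACTIVE_GB = 4.5

studio_ceiling = decode_ceiling_tok_s(STUDIO_GB_S, ACTIVE_GB)
white_ceiling = decode_ceiling_tok_s(WHITE_GB_S, ACTIVE_GB)
print(f"Studio ceiling: {studio_ceiling:.1f} tok/s, white.local ceiling: {white_ceiling:.1f} tok/s")
```

The ratio between the two ceilings equals the bandwidth ratio (about 2.2x) regardless of the model, so per-model differences between machines have to come from the overhead terms rather than from this bound.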

Scenario summary:

| machine | backend | scenario | samples | accuracy-gated | noisy | median abs error | headline |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Mac Studio M1 Ultra | Metal | steady_decode | 15 | 14 | 1 | 5.8% | Median is inside target, but Qwen3 0.6B Q4/Q8 faster-than-fit, Qwen2.5-Coder Q5/Q8 slower-than-fit, and OLMoE faster-than-fit remain visible. |
| white.local RTX 5080 | CUDA | steady_decode | 15 | 15 | 0 | 6.6% | Median is inside target; misses are SmolLM2 Q4 faster-than-fit plus Phi-4-mini and Qwen2.5-Coder Q8 slower-than-fit. |
| Mac Studio M1 Ultra | Metal | prefill | 15 | 14 | 1 | 22.3% | Still outside target. Ubatch shape is the right fact, but Metal prefill remains uneven across tiny, small, coder quant, and MoE shapes. |
| white.local RTX 5080 | CUDA | prefill | 15 | 15 | 0 | 14.1% | Better than earlier CUDA prefill, but still outside target. Small dense is mostly slower-than-fit; coder Q4/Q5/Q8 faster-than-fit. |
| Mac Studio M1 Ultra | Metal | first_token | 15 | 13 | 2 | 30.4% | Still not solved. Larger long-prompt cases are mostly slower-than-fit; Qwen2.5-Coder Q4 is the main good match. |
| white.local RTX 5080 | CUDA | first_token | 15 | 15 | 0 | 26.0% | Better than prior CUDA first-token, but still outside target. Small dense first-token is often slower-than-fit; coder Q4/Q5/Q6 and Qwen3-8B match. |
| Mac Studio M1 Ultra | Metal | kv_warm_reuse | 15 | 12 | 3 | 11.1% | Close but still outside strict target, with tiny-model faster-than-fit and coder Q4/Q5 slower-than-fit. |
| white.local RTX 5080 | CUDA | kv_warm_reuse | 15 | 15 | 0 | 12.7% | Close but still outside strict target; most misses are slower-than-fit on Q8/small/medium and MoE. |

Representative steady-decode rows:

| model | Studio pred/actual | Studio ratio | Studio verdict | white pred/actual | white ratio | white verdict |
| --- | --- | --- | --- | --- | --- | --- |
| unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M | 153.5 / 155.7 | 1.01 | match | 735.2 / 992.3 | 1.35 | faster-than-fit |
| unsloth/Qwen3-0.6B-GGUF:Q4_K_M | 110.8 / 125.7 | 1.13 | faster-than-fit | 630.1 / 628.0 | 1.00 | match |
| bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M | 125.4 / 118.4 | 0.94 | match | 494.1 / 538.6 | 1.09 | match |
| unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M | 72.3 / 76.7 | 1.06 | match | 308.0 / 268.4 | 0.87 | slower-than-fit |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M | 56.8 / 60.0 | 1.06 | match | 182.1 / 171.0 | 0.94 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M | 51.5 / 41.7 | 0.81 | slower-than-fit | 158.1 / 151.6 | 0.96 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K | 45.5 / 47.5 | 1.04 | match | 133.7 / 132.0 | 0.99 | match |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0 | 42.1 / 26.9 | 0.64 | slower-than-fit | 125.0 / 108.4 | 0.87 | slower-than-fit |
| unsloth/Qwen3-8B-GGUF:Q4_K_M | 47.4 / 48.9 | 1.03 | match | 159.3 / 153.0 | 0.96 | match |
| bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M | 129.8 / 152.9 | 1.18 | faster-than-fit | 681.2 / 623.4 | 0.92 | match |

Artifacts:

  • Studio JSON: /tmp/model-fit-validation-studio-ubatch-broader.json
  • Studio Markdown: /tmp/model-fit-validation-studio-ubatch-broader-{steady_decode,prefill,first_token,kv_warm_reuse,all}.md
  • white.local JSON: /tmp/model-fit-validation-white-ubatch-broader.json
  • white.local Markdown: /tmp/model-fit-validation-white-ubatch-broader-{steady_decode,prefill,first_token,kv_warm_reuse,all}.md

Interpretation: steady decode is now within the 10% median target on both machines, but the residual per-model misses are real and should stay visible. Prefill improved compared with earlier CUDA results but is still not broadly in band. First-token remains a separate problem, and KV warm reuse is close but still outside the strict median target. No thresholds were widened and no model-specific or backend-specific estimator constants were added for this rerun.
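The median gate behind this interpretation can be sketched as a plain comparison against the 10% target, using the medians from the scenario summary. This is an illustrative Python check and not the behavior of `model-fit-check-validation`, which also applies per-row tolerance, noise, and coverage thresholds. The `"studio"` and `"white"` keys are shorthand introduced for the example.

```python
MEDIAN_TARGET = 0.10  # the 10% median target named in the interpretation

# Median absolute error per (machine, scenario) from the scenario summary.
MEDIANS = {
    ("studio", "steady_decode"): 0.058,
    ("white", "steady_decode"): 0.066,
    ("studio", "prefill"): 0.223,
    ("white", "prefill"): 0.141,
    ("studio", "first_token"): 0.304,
    ("white", "first_token"): 0.260,
    ("studio", "kv_warm_reuse"): 0.111,
    ("white", "kv_warm_reuse"): 0.127,
}


def split_by_target(medians, target=MEDIAN_TARGET):
    """Return (inside, outside) lists of (machine, scenario) keys, each sorted."""
    inside = sorted(key for key, value in medians.items() if value <= target)
    outside = sorted(key for key, value in medians.items() if value > target)
    return inside, outside


inside, outside = split_by_target(MEDIANS)
print("inside target:", inside)
print("outside target:", outside)
```

Only the two steady_decode entries fall inside the target. KV warm reuse misses by one to three percentage points, while prefill and first-token miss by a wider margin.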

First-Token Residual Validation

This update adds explicit first-token residual diagnostics instead of hiding the miss inside a tuned estimator constant. The validator now records prompt token count, tokenizer vocab size, predicted prefill/decode/transition pieces, observed tokenize/prefill/decode timing, and sampled-decode residual per prompt token. model-artifact also derives vocab_size from tokenizer.ggml.tokens when GGUFs omit an explicit *.vocab_size field.
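The vocab-size fallback described above can be sketched over a plain dictionary standing in for parsed GGUF metadata. The key names `*.vocab_size` and `tokenizer.ggml.tokens` are from the text; the function name, the dict representation, and the "first matching key wins" rule are assumptions for illustration, not the `model-artifact` implementation.

```python
def derive_vocab_size(metadata):
    """Prefer an explicit `<arch>.vocab_size`; otherwise count tokenizer tokens."""
    for key, value in metadata.items():
        if key.endswith(".vocab_size"):
            return int(value)
    tokens = metadata.get("tokenizer.ggml.tokens")
    if tokens is not None:
        return len(tokens)
    return None  # neither source is present, so leave the fact unknown


# Explicit field present: used as-is.
explicit = {"llama.vocab_size": 32000, "tokenizer.ggml.tokens": ["a", "b"]}
# Field omitted: fall back to the length of the token list.
fallback = {"tokenizer.ggml.tokens": ["<s>", "</s>", "hello"]}

print(derive_vocab_size(explicit), derive_vocab_size(fallback), derive_vocab_size({}))
```

Returning `None` instead of a guessed default keeps a missing vocab size visible in the diagnostics rather than silently filling it in.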

Source rationale: Skippy's sampled first decode calls into llama.cpp sampling after prefill, and the chat sampling path syncs sampler history across the prompt before sampling the first generated token. That is a real source-shaped O(prompt tokens) term, separate from steady decode and separate from the lower-bound post_prefill_decode_overhead_ms hardware probe. The current change reports that residual; it does not yet add a guessed sampler-throughput constant to model-fit.

Latest corrected broader rerun:

| machine | backend | scenario | samples | noisy | median abs error | note |
| --- | --- | --- | --- | --- | --- | --- |
| Mac Studio M1 Ultra | Metal | steady_decode | 15 | 3 | 26.7% | This Studio run was much noisier/slower than the immediately preceding Studio pass, despite low hardware benchmark noise; treat it as environment evidence, not a fit-tuning target. |
| Mac Studio M1 Ultra | Metal | prefill | 15 | 3 | 22.6% | Still outside target; Metal prefill remains shape/noise sensitive. |
| Mac Studio M1 Ultra | Metal | first_token | 15 | 2 | 38.7% | First-token residual is visible and large on long-prompt 4B/7B/8B cases. |
| Mac Studio M1 Ultra | Metal | kv_warm_reuse | 15 | 2 | 9.6% | Median is inside target, with residual per-model misses still visible. |
| white.local RTX 5080 | CUDA | steady_decode | 15 | 0 | 6.7% | Decode remains inside target; residual misses are SmolLM2 Q4 faster-than-fit plus Phi and Qwen2.5-Coder Q8 slower-than-fit. |
| white.local RTX 5080 | CUDA | prefill | 15 | 1 | 14.6% | Better than early CUDA prefill work but still outside target. |
| white.local RTX 5080 | CUDA | first_token | 15 | 0 | 26.9% | Coder 7B and Qwen3 8B match, but small/medium non-coder first-token remains slower-than-fit. |
| white.local RTX 5080 | CUDA | kv_warm_reuse | 15 | 0 | 12.8% | Stable but still outside strict target median. |

First-token residual examples from the CUDA rerun:

| model | prompt toks | vocab | predicted ms | observed ms | sampled residual (ms) | residual us/tok | verdict |
| --- | --- | --- | --- | --- | --- | --- | --- |
| unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M | 3455 | 49152 | 67.8 | 99.7 | 16.3 | 4.7 | slower-than-fit |
| unsloth/Qwen3-0.6B-GGUF:Q4_K_M | 3168 | 151936 | 117.5 | 115.8 | 15.3 | 4.8 | match |
| bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M | 3455 | 102400 | 79.8 | 158.8 | 27.7 | 8.0 | slower-than-fit |
| unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M | 3168 | 152064 | 432.4 | 403.6 | 23.8 | 7.5 | match |
| unsloth/Qwen3-8B-GGUF:Q4_K_M | 3168 | 151936 | 465.1 | 447.8 | 26.3 | 8.3 | match |
| bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M | 3455 | 50304 | 241.2 | 282.4 | 40.4 | 11.7 | slower-than-fit |
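The "residual us/tok" column is the sampled residual divided by the prompt token count, which is what makes it comparable across prompts of different lengths. The Python sketch below recomputes it from the other columns. It assumes the sampled residual is in milliseconds, which is consistent with the reported values; the function name is hypothetical.

```python
# (model, prompt tokens, sampled residual in ms) from the CUDA rerun.
RESIDUAL_ROWS = [
    ("SmolLM2-135M Q4_K_M", 3455, 16.3),
    ("Qwen3-0.6B Q4_K_M", 3168, 15.3),
    ("EXAONE-4.0-1.2B Q4_K_M", 3455, 27.7),
    ("Qwen2.5-Coder-7B Q4_K_M", 3168, 23.8),
    ("Qwen3-8B Q4_K_M", 3168, 26.3),
    ("OLMoE-1B-7B Q4_K_M", 3455, 40.4),
]


def residual_us_per_prompt_token(residual_ms, prompt_tokens):
    """Convert a millisecond residual into microseconds per prompt token."""
    if prompt_tokens <= 0:
        raise ValueError("prompt_tokens must be positive")
    return residual_ms * 1000.0 / prompt_tokens


for model, tokens, residual_ms in RESIDUAL_ROWS:
    per_token = residual_us_per_prompt_token(residual_ms, tokens)
    print(f"{model}: {per_token:.1f} us/tok")
```

The per-token residual stays within roughly 5 to 12 us/tok across a 135M to 8B range, which fits a cost that scales with prompt length rather than with model size.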

Validation artifacts:

  • Studio JSON: /tmp/model-fit-validation-studio-first-token-residual-vocab.json
  • Studio Markdown: /tmp/model-fit-validation-studio-first-token-residual-vocab-{first_token,all}.md
  • white.local JSON: /tmp/model-fit-validation-white-first-token-residual-vocab.json
  • white.local Markdown: /tmp/model-fit-validation-white-first-token-residual-vocab-{first_token,all}.md

Interpretation: steady decode remains good on the clean CUDA host, but the latest Studio pass shows enough run-to-run instability that we should improve smoke-test denoising before using Studio as a strict gate. First-token now has a source-shaped residual column. The next honest estimator step is to add a model-independent sampler-history hardware fact or otherwise measure that source-shaped work without using per-model observed throughput.

@i386 i386 changed the title [codex] Improve metadata-based model fit validation Metadata-based model fit validation May 31, 2026
@i386 i386 changed the title Metadata-based model fit validation Metadata-based model fit May 31, 2026
@i386 i386 force-pushed the codex/model-fit-metadata-validation branch from 45e8ef5 to 8a9d5a2 Compare May 31, 2026 06:39