Metadata-based model fit by i386 · Pull Request #768 · Mesh-LLM/mesh-llm

i386 · 2026-05-31T03:53:48Z

Summary

This PR turns model-fit into a metadata-first fit estimator with a repeatable validation loop against real Skippy single-stage benchmark runs. The branch is broader than the scorer alone: it adds GGUF tensor-group profiling, measured-hardware diagnostics, model download progress, benchmark timing instrumentation, validation binaries, checked-in validation corpora, documentation, and a GitHub workflow that can gate estimator changes on self-hosted GPU runners.

Users and agents can now point model-fit-validate at Hugging Face model refs, local GGUFs, or checked-in manifest files and get a JSON report containing hardware facts, measured GPU bandwidth, GGUF-derived model profiles, workload recommendations, benchmark observations, and per-scenario estimator agreement. model-fit-check-validation can turn that JSON into a Markdown report and fail CI when accuracy, noise, or minimum coverage thresholds are missed.

Branch Scope

area	branch change
GGUF metadata	Adds tensor-group byte classification in `model-artifact` so `model-fit` can distinguish attention, dense FFN, expert FFN, embeddings, output, normalization, and other tensors.
Fit scoring	Reworks decode estimates around active bytes, measured bandwidth, benchmark noise, backend-neutral measured-GPU overhead, MoE active expert bytes, low-active-byte overhead, medium-width overhead, uncertainty ranges, prefill throughput, first-token latency, workload scoring, and split-candidate recommendations.
Hardware profiles	Carries benchmark noise, bandwidth efficiency, and optional compute diagnostics from `mesh-llm gpus benchmark` into `model-fit` hardware profiles.
Validation runner	Adds `model-fit-validate` for model refs, local `--model ref=...,path=...` inputs, `--models-file` manifests, automatic or supplied GPU benchmark JSON, progress reporting, repeated Skippy runs, steady decode, first-token, and warm KV reuse scenarios.
Validation checker	Adds `model-fit-check-validation` to render Markdown summaries and enforce thresholds for median error, individual error, noisy samples, and minimum model count.
Skippy benchmarking	Extends `skippy-bench local-single` with request counts, session reuse, per-request observations, and decode-only throughput.
Skippy server timing	Adds tokenize, prefill, and decode elapsed milliseconds to the embedded `/v1/text` response so validation can compare steady decode without tokenization/prefill noise.
Hugging Face downloads	Adds progress callbacks to `model-hf` so validation can show file download and ready-state progress while preparing GGUFs.
Validation corpora	Checks in smoke and deep model manifests stratified by estimator behavior: tiny dense, small dense, common local 7B/8B, quant pairs, MoE active experts, embeddings, and rerankers.
CI workflow	Adds `.github/workflows/model-fit-validation.yml` with normal metadata checks plus self-hosted GPU smoke validation and manual smoke/deep dispatch.
Docs	Adds `crates/model-fit/README.md` and `crates/model-fit/validation/README.md` covering the estimator inputs, report shape, and validation corpora.

Validation Corpora

The model set is stratified by estimator behavior rather than just popularity.

Smoke corpus: crates/model-fit/validation/smoke-models.txt

bucket	models
tiny dense / overhead-bound	SmolLM2 135M Q8_0, Qwen3 0.6B Q4_K_M
small dense transition	EXAONE 1.2B Q4_K_M, Llama 3.2 3B Q4_K_M
3B/4B/7B/8B local serving	Gemma 3 4B Q4_K_M, Qwen2.5 Coder 7B Q4_K_M, Qwen3 8B Q4_K_M
quant slope	Qwen3 8B Q8_0
MoE active expert bytes	Qwen3 30B-A3B Q4_K_M

Deep corpus: crates/model-fit/validation/deep-models.txt

The deep set expands coverage to Q4/Q8 pairs, 12B/14B/32B dense models, more coder models, Qwen3 30B-A3B MoE variants, embedding GGUFs, and reranker GGUFs. It is intended for manual or nightly high-memory self-hosted runners rather than every PR.

Validation Data

Latest local steady-decode validation run:

target/release/model-fit-validate --no-progress \
  --output-json /tmp/model-fit-validation-aggregate-steady-small.json \
  bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M \
  unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
  unsloth/SmolLM2-135M-Instruct-GGUF:Q8_0 \
  unsloth/Qwen3-8B-GGUF:Q4_K_M

model	est tok/s	observed tok/s	observed/fit	spread	verdict
EXAONE 1.2B Q4_K_M	113.6	106.8	0.94	5.2%	match
Qwen3 0.6B Q4_K_M	129.1	119.4	0.93	4.3%	match
SmolLM2 135M Q8_0	142.3	145.1	1.02	3.9%	match
Qwen3 8B Q4_K_M	53.4	53.9	1.01	2.7%	match

Summary from the JSON report:

metric	value
models benchmarked	4
matched	4
noisy	0
slower-than-fit	0
faster-than-fit	0
median observed/fit	0.974
mean observed/fit	0.973
median absolute percent error	3.96%
within tolerance	4
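
The summary ratios can be recomputed from the four table rows. The sketch below is illustrative only: the helper names are hypothetical, and it assumes absolute percent error means |observed/fit - 1| x 100, which the report does not spell out. Because the table values are rounded, the results agree with the reported 0.974 and 3.96% only to within rounding.

```rust
/// Median of a slice; returns None for empty input.
fn median(values: &[f64]) -> Option<f64> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    Some(if sorted.len() % 2 == 0 {
        (sorted[mid - 1] + sorted[mid]) / 2.0
    } else {
        sorted[mid]
    })
}

/// (estimated tok/s, observed tok/s) pairs copied from the steady-decode table.
const ROWS: [(f64, f64); 4] = [
    (113.6, 106.8), // EXAONE 1.2B Q4_K_M
    (129.1, 119.4), // Qwen3 0.6B Q4_K_M
    (142.3, 145.1), // SmolLM2 135M Q8_0
    (53.4, 53.9),   // Qwen3 8B Q4_K_M
];

/// observed/fit ratio per row.
fn observed_over_fit() -> Vec<f64> {
    ROWS.iter().map(|(est, obs)| obs / est).collect()
}

/// Assumed definition: |observed/fit - 1| * 100 per row.
fn abs_percent_errors() -> Vec<f64> {
    observed_over_fit()
        .iter()
        .map(|ratio| (ratio - 1.0).abs() * 100.0)
        .collect()
}
```

Calling median(&observed_over_fit()) and median(&abs_percent_errors()) gives the two median figures; the checker's real implementation may differ.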

The small-model result is the important one. SmolLM2 initially looked noisy when the validator sampled only the final request of each repeat. Dividing the total generated tokens by the total decode time within each repeat changed the steady-decode observation to 1.02 observed/fit with 3.9% spread, which better reflects sustained decode and avoids tuning the estimator against request jitter.

How The Fit Algorithm Works

model-fit remains metadata-first and deterministic. It does not use filenames, catalog reputation, or model-specific boosts to predict throughput. It consumes a HardwareProfile, a GGUF-derived ModelProfile, and a SelectionConfig.

For memory fit, the selector estimates runtime memory as resident weights plus KV cache plus scratch/backend overhead. KV cache is estimated from layer count, target context, KV heads, key/value widths, and configured KV cache type. If GGUF metadata is incomplete, it falls back to hidden size as a conservative KV width. A model is rejected locally when estimated runtime memory exceeds the usable memory budget after the safety margin.
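
A minimal sketch of the memory-fit arithmetic described above. The struct, field names, and the treatment of the safety margin as a fraction of usable memory are assumptions for illustration, not the crate's actual types.

```rust
/// Hypothetical KV-cache shape inputs; names are illustrative only.
struct KvShape {
    layers: u64,
    context_tokens: u64,
    kv_heads: u64,
    key_width: u64,         // per-head key dimension
    value_width: u64,       // per-head value dimension
    bytes_per_element: f64, // e.g. 2.0 for an f16 KV cache
}

/// KV cache bytes = layers * context * kv_heads * (key + value width) * element size.
fn kv_cache_bytes(s: &KvShape) -> f64 {
    let elements_per_token = s.layers * s.kv_heads * (s.key_width + s.value_width);
    (elements_per_token * s.context_tokens) as f64 * s.bytes_per_element
}

/// Runtime memory = resident weights + KV cache + scratch/backend overhead.
/// Assumption: the safety margin is a fraction of usable memory held back.
fn fits_locally(
    weight_bytes: f64,
    kv_bytes: f64,
    overhead_bytes: f64,
    usable_bytes: f64,
    safety_margin: f64,
) -> bool {
    weight_bytes + kv_bytes + overhead_bytes <= usable_bytes * (1.0 - safety_margin)
}
```

For example, 32 layers, 8 KV heads, 128-wide keys and values, a 4096-token context, and an f16 cache come to 512 MiB of KV cache under this formula.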

For dense transformer decode, the primary per-token latency estimate is active bytes per token divided by effective memory bandwidth, plus fixed and shape overheads; estimated tok/s is the reciprocal of that latency. Active bytes come from GGUF tensor groups when available, so embeddings and output tensors do not have to be treated the same as per-token transformer work. Measured hardware bandwidth comes from mesh-llm gpus benchmark, using p90 bandwidth and benchmark noise. Once bandwidth is measured, the decode-efficiency path is backend-neutral so the same measured profile does not get extra Metal/CUDA/ROCm assumptions layered on top.
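
The decode estimate can be sketched as a bandwidth-bound latency plus overhead. Function and parameter names here are hypothetical, and the single efficiency factor and single overhead term stand in for whatever the scorer actually combines.

```rust
/// Seconds per decoded token: active bytes over effective bandwidth, plus overhead.
/// `efficiency` scales measured bandwidth to what decode is assumed to sustain.
fn decode_seconds_per_token(
    active_bytes: f64,
    bandwidth_bytes_per_s: f64,
    efficiency: f64,
    overhead_s: f64,
) -> f64 {
    active_bytes / (bandwidth_bytes_per_s * efficiency) + overhead_s
}

/// Throughput is the reciprocal of per-token latency.
fn decode_tokens_per_second(
    active_bytes: f64,
    bandwidth_bytes_per_s: f64,
    efficiency: f64,
    overhead_s: f64,
) -> f64 {
    1.0 / decode_seconds_per_token(active_bytes, bandwidth_bytes_per_s, efficiency, overhead_s)
}
```

With 4 GB of active bytes and 400 GB/s of effective bandwidth, the bandwidth term alone is 10 ms per token (100 tok/s); adding 10 ms of fixed overhead halves that to 50 tok/s, which is why overhead matters most for small models.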

For sparse MoE models, decode uses base/resident bytes plus only the active routed expert share. This avoids treating every expert as active for every token while still including resident attention, shared FFN, normalization, output, and other tensor groups when available. MoE also carries a per-layer dispatch overhead so routing work is visible in the latency estimate.
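
As a rough illustration of the active-expert share, the sketch below charges resident bytes in full and routed expert bytes in proportion to active experts. The function name, the proportional split, and the fallback for missing expert counts are assumptions, not the crate's code.

```rust
/// Active bytes per token for a sparse MoE model:
/// resident (always-read) bytes plus the active share of routed expert bytes.
fn moe_active_bytes(
    resident_bytes: u64,
    routed_expert_bytes: u64,
    experts_total: u32,
    experts_active: u32,
) -> u64 {
    if experts_total == 0 {
        // Assumption: without expert counts, conservatively treat all experts as active.
        return resident_bytes + routed_expert_bytes;
    }
    let active = experts_active.min(experts_total) as u128;
    // u128 intermediate avoids overflow on very large tensor byte counts.
    let share = routed_expert_bytes as u128 * active / experts_total as u128;
    resident_bytes + share as u64
}
```

The per-layer dispatch overhead mentioned above is a latency term and would be added separately from this byte count.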

For tiny and narrow models, a pure bytes/bandwidth estimate overpredicts throughput because fixed runtime, scheduling, and kernel overhead become a large share of token latency. The estimator applies generic low-active-byte and small-width overheads based on active bytes and hidden width. This is not model-name calibration; it applies to future GGUFs with similar geometry.

For prefill and first-token latency, the selector derives prefill throughput from decode throughput times a shape-based parallelism factor. Prefill exposes much more parallelism than one-token decode, especially for smaller models, but arbitrary GGUF metadata is not rich enough for a stable full prefill simulator. First-token latency combines prompt_tokens / estimated_prefill_tps with one decode step. During validation, first-token scenarios rescore with the observed prompt token count so the comparison uses the prompt that was actually benchmarked.
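
The first-token arithmetic above can be written out directly. This is a sketch with hypothetical function names; the parallelism factor is passed in as a plain number because the shape-based derivation is not described here.

```rust
/// Prefill throughput derived from decode throughput and a parallelism factor.
fn estimated_prefill_tps(decode_tps: f64, parallelism_factor: f64) -> f64 {
    decode_tps * parallelism_factor
}

/// First-token latency in ms: prompt_tokens / estimated_prefill_tps plus one decode step.
fn first_token_ms(prompt_tokens: u32, prefill_tps: f64, decode_tps: f64) -> f64 {
    let prefill_s = prompt_tokens as f64 / prefill_tps;
    let one_decode_step_s = 1.0 / decode_tps;
    (prefill_s + one_decode_step_s) * 1000.0
}
```

For a 512-token prompt, 50 tok/s decode, and an assumed factor of 20, prefill runs at 1000 tok/s and the estimate is 512 ms of prefill plus a 20 ms decode step.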

For non-chat workloads, workload suitability is separate from throughput fit. Embedding, classifier/reranker, chat, tool-use, coding, long-context, summarization, and related workload profiles can weight memory, context, decode, prefill, and capability evidence differently. Embedding-like models are not rejected globally; they are accepted or penalized according to the requested workload.

For oversized models, the selector can return split-candidate recommendations instead of pretending local fit failed completely. The split estimate is memory-budget based and warns that inter-stage activation transfer depends on hidden width, layer count, and network bandwidth.

Validator Shape

model-fit-validate prepares a self-contained report by:

loading a supplied GPU benchmark JSON or running a local bandwidth benchmark
resolving Hugging Face GGUF refs or using local GGUF paths
downloading missing artifacts with progress output
profiling each GGUF through model-fit
scoring the primary workload plus workload variants
running Skippy benchmark scenarios against the same model
comparing predicted and observed steady decode, first-token latency, and warm KV reuse where applicable
writing a schema-versioned JSON report and printing a compact Markdown table

The steady-decode scenario prefers decode-only timings from /v1/text, excluding tokenization and prefill. It also increases the generated-token window for tiny active-byte models and, within each repeat, divides total generated tokens by total decode time to reduce jitter.
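
To show why per-repeat aggregation is less jittery than sampling one request, here is a small sketch. The struct and function names are hypothetical; only the idea of summing generated tokens and decode time across a repeat comes from the description above.

```rust
/// One request's decode-only observation (hypothetical shape).
struct RequestObservation {
    generated_tokens: u32,
    decode_elapsed_ms: f64,
}

/// Aggregate throughput for a repeat: total tokens over total decode time.
fn aggregate_decode_tps(requests: &[RequestObservation]) -> Option<f64> {
    let tokens: u64 = requests.iter().map(|r| r.generated_tokens as u64).sum();
    let elapsed_ms: f64 = requests.iter().map(|r| r.decode_elapsed_ms).sum();
    if tokens == 0 || elapsed_ms <= 0.0 {
        return None;
    }
    Some(tokens as f64 * 1000.0 / elapsed_ms)
}

/// The noisier alternative: throughput of the final request only.
fn last_request_tps(requests: &[RequestObservation]) -> Option<f64> {
    let last = requests.last()?;
    if last.generated_tokens == 0 || last.decode_elapsed_ms <= 0.0 {
        return None;
    }
    Some(last.generated_tokens as f64 * 1000.0 / last.decode_elapsed_ms)
}
```

With three 64-token requests taking 400, 500, and 700 ms, the aggregate is 120 tok/s while the last request alone reports about 91 tok/s, so one slow request no longer defines the sample.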

Workflow Shape

The new workflow is .github/workflows/model-fit-validation.yml.

For pull requests touching model-fit or the benchmark/server pieces, it runs metadata checks and schedules a self-hosted GPU smoke job. The GPU job:

builds release binaries
runs target/release/mesh-llm gpus benchmark --json
runs model-fit-validate --models-file crates/model-fit/validation/smoke-models.txt
gates the JSON with model-fit-check-validation
writes a Markdown summary to $GITHUB_STEP_SUMMARY
uploads the benchmark JSON, validation JSON, and Markdown report

Manual runs can choose smoke or deep via workflow_dispatch, and the runner labels can be supplied explicitly or through MODEL_FIT_GPU_RUNS_ON.

Protocol / Compatibility

No mesh wire protocol, gossip schema, or plugin protocol changes are made in this branch.
Skippy ABI gains an additive decode benchmark probe for validation; the ABI version constants are bumped with the patch queue.
The embedded Skippy /v1/text JSON response gains additive timing fields: tokenize_elapsed_ms, prefill_elapsed_ms, and decode_elapsed_ms. Older consumers can ignore them; the new validator and benchmark path use them when present and fall back where possible.
The model-fit JSON-facing structs gain additive fields for tensor groups, estimate ranges, prefill/first-token estimates, and hardware benchmark diagnostics.

Implementation Notes

model-artifact now scans GGUF tensor names into semantically useful byte groups.
model-fit exposes TensorGroupBytes, decode and first-token estimate ranges, measured decode efficiency, measured-GPU overhead, and benchmark diagnostics.
model-fit-validate supports positional model refs, --model-ref, local --model ref=...,path=..., and --models-file with blank-line and # comment support.
model-fit-check-validation reads validator JSON, renders a Markdown table, and enforces thresholds such as median absolute error, individual error, noisy sample count, and minimum model count.
model-hf exposes progress events for ensuring, starting, progressing, ready, and complete download states.
skippy-bench local-single can issue multiple requests, reuse a session, and report per-request decode-only throughput.
The fit crate now has in-code comments explaining the heuristics and their intended boundaries for future developers and agents.
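
The --models-file handling noted above (blank lines and # comments) could look roughly like the following. This is a hypothetical parser, assuming one model ref per line and whole-line comments only; the real binary may accept more.

```rust
/// Parse a models manifest: one ref per line, skipping blank lines and
/// lines whose first non-space character is '#'.
fn parse_models_file(contents: &str) -> Vec<String> {
    contents
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty() && !line.starts_with('#'))
        .map(str::to_owned)
        .collect()
}
```

Inline trailing comments are deliberately not handled in this sketch, since the description only mentions blank-line and # comment support.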

Validation

Passed:

cargo check -p model-fit --bins
cargo test -p model-fit --lib
cargo clippy -p model-fit --all-targets -- -D warnings
cargo check -p skippy-server
cargo clippy -p skippy-server --all-targets -- -D warnings
cargo test -p skippy-server --lib
cargo check -p skippy-bench
cargo clippy -p skippy-bench --all-targets -- -D warnings
cargo test -p skippy-bench
cargo fmt --all -- --check
cargo run -p model-fit --bin model-fit-check-validation -- --min-models 4 --markdown-out /tmp/model-fit-check.md /tmp/model-fit-validation-aggregate-steady-small.json
cargo run -p xtask -- repo-consistency ci-crate-lists

Known existing repo-consistency issue surfaced by release checks:

cargo run -p xtask -- repo-consistency release-targets currently fails because model-fit is listed as publishable but has unpublished workspace dependencies missing from scripts/publish-crates.sh (mesh-llm-gpu-bench, then mesh-llm-system, likely more). I did not expand the publish pipeline in this PR.
cargo run -p xtask -- repo-consistency publish-crates fails for the same existing publish-chain reason.

Latest ABI Decode Validation

This update adds the denoised Skippy decode ABI probe and replaces the old low-active-byte and small-width decode penalties with a source-grounded graph-overhead term. The estimator still does not use model names, filenames, catalog reputation, or observed throughput as inputs. It uses GGUF tensor groups, tensor type mix, logical matmul shape counts, layer count, measured decode-shaped bandwidth, and measured fixed backend submission overhead.

Source anchors checked in the pinned llama.cpp tree:

source area	evidence used by model-fit
`src/llama-arch.cpp`	Repeating attention/FFN tensors map to `GGML_OP_MUL_MAT`; MoE expert tensors map to `GGML_OP_MUL_MAT_ID`.
`src/llama-graph.cpp`	Decode builds a repeated per-layer graph around matmul, attention, norm, RoPE, activation, copy/view, and elementwise work.
`ggml/src/ggml-metal/ggml-metal.metal`	Q8_0, Q4_K, matmul, matrix-vector, and MoE ID paths have distinct kernels.
`ggml/src/ggml-cuda`	CUDA has dedicated quantized matmul and `MUL_MAT_ID` paths, with different single-token and multi-token behavior.
`ggml/src/ggml-cpu/ggml-cpu.c` and `repack.cpp`	CPU type traits and repack paths distinguish Q8_0, Q8_K, K-quants, and related vec-dot formats.

Latest two-machine five-model rerun:

machine	backend	scenario	median abs error	result
Mac Studio M1 Ultra	Metal	steady_decode	8.9%	Qwen3 8B matched at 0.98 observed/fit; EXAONE matched at 0.91; Qwen3 0.6B remained a 0.79 miss; SmolLM2 and Llama 3.2 3B were noisy.
white.local	CUDA	steady_decode	5.8%	All five steady-decode samples matched within the 10% target band: 1.02, 1.00, 1.08, 0.94, 0.94 observed/fit.
Mac Studio M1 Ultra	Metal	first_token	12.1%	Close but still misses on Qwen3 0.6B, Llama 3.2 3B, and Qwen3 8B.
white.local	CUDA	first_token	59.7%	Still not solved; needs separate prefill/prompt-shape modeling.
Mac Studio M1 Ultra	Metal	kv_warm_reuse	18.8%	Noisy small-model samples plus slower-than-fit misses on 3B/8B.
white.local	CUDA	kv_warm_reuse	14.0%	Stable but slower than fit for SmolLM2, Llama 3.2 3B, and Qwen3 8B.

Validation artifacts from this run:

Studio JSON: /tmp/model-fit-validation-studio-graph2.json
Studio Markdown: /tmp/model-fit-validation-studio-graph2.md
white.local JSON: /tmp/model-fit-validation-white-graph2.json
white.local Markdown: /tmp/model-fit-validation-white-graph2.md

Residual misses are reported as misses. They were not hidden by widening thresholds or adding model-specific exceptions.

Broader Validation Rerun

This update also adds the prefill validation scenario and fixes execution-budget selection so a model that fits both CPU and GPU chooses the measured execution path with the better metadata-only throughput estimate before using memory headroom as a tie-breaker. The bug was general: a high-headroom CPU budget could beat a measured GPU budget for one model, causing the recommendation to describe the wrong execution path. The fix is backend-neutral and evidence-ranked; it does not special-case Metal, CUDA, model names, filenames, or current-run measurements.
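
One way to read the corrected ordering is a two-key comparison: throughput estimate first, memory headroom only on ties. The sketch below uses hypothetical types and ignores any further evidence ranking the selector may apply.

```rust
/// A candidate execution budget (hypothetical shape).
#[derive(Debug)]
struct BudgetCandidate {
    label: &'static str,
    estimated_tps: f64,
    headroom_bytes: u64,
}

/// Pick the candidate with the best throughput estimate;
/// memory headroom only breaks ties.
fn pick_budget(candidates: &[BudgetCandidate]) -> Option<&BudgetCandidate> {
    candidates.iter().max_by(|a, b| {
        a.estimated_tps
            .total_cmp(&b.estimated_tps)
            .then(a.headroom_bytes.cmp(&b.headroom_bytes))
    })
}
```

Under this ordering a CPU budget with large headroom can no longer beat a GPU budget with a higher throughput estimate, which matches the bug described above.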

Broader corpus shape: 18 GGUF refs covering tiny dense models, small dense transition models, a Qwen2.5-Coder 7B Q4/Q5/Q6/Q8 slope, Qwen3 8B, one active-expert MoE, and embedding/reranker metadata cases.

Final broader validation:

machine	backend	scenario	samples	median abs error	notable result
Mac Studio M1 Ultra	Metal	steady_decode	16	9.8%	Tiny models now select measured Metal instead of optimistic CPU fallback. Qwen2.5-Coder Q5/Q6/Q8 and bge reranker remain slower-than-fit.
white.local RTX 5080	CUDA	steady_decode	16	8.6%	Qwen2.5-Coder Q4/Q5/Q6 matched; Q8, OLMoE, and bge reranker remain honest misses.
Mac Studio M1 Ultra	Metal	prefill	15	24.6%	Prefill is now measured separately as prompt tokens / `prefill_elapsed_ms`; still outside the steady-decode target band.
white.local RTX 5080	CUDA	prefill	15	108.1%	CUDA prefill is much faster than the current prediction on several dense models, so this stays visible as pending source-grounded work rather than being absorbed into a tuned constant.
Mac Studio M1 Ultra	Metal	first_token	15	19.2%	First-token includes tokenize, prefill, first decode, and request overhead; larger coder quants remain slower-than-fit.
white.local RTX 5080	CUDA	first_token	15	58.9%	First-token still needs prompt-shape and backend scheduling work.
Mac Studio M1 Ultra	Metal	kv_warm_reuse	16	13.7%	Close, but Qwen2.5-Coder Q4/Q5 and bge reranker remain slower-than-fit.
white.local RTX 5080	CUDA	kv_warm_reuse	16	16.4%	Stable samples, but several dense and reranker cases remain slower-than-fit.

Representative steady-decode rows:

model	Studio fit/obs	Studio verdict	white fit/obs	white verdict
`unsloth/SmolLM2-135M-Instruct-GGUF:Q8_0`	144.9 / 135.3	match	962.5 / 987.3	match
`unsloth/Qwen3-0.6B-GGUF:Q4_K_M`	141.8 / 124.7	slower-than-fit	632.8 / 634.7	match
`bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`	118.4 / 116.8	match	497.5 / 536.6	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`	62.6 / 56.6	match	186.1 / 169.9	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`	56.3 / 32.3	slower-than-fit	163.2 / 150.8	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`	50.7 / 40.0	slower-than-fit	143.7 / 131.4	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`	58.8 / 36.0	slower-than-fit	181.5 / 108.1	slower-than-fit
`unsloth/Qwen3-8B-GGUF:Q4_K_M`	51.8 / 49.2	match	162.8 / 152.1	match
`bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`	151.8 / 136.3	inconclusive-noisy	314.6 / 618.5	faster-than-fit
`gpustack/bge-reranker-v2-m3-GGUF:Q8_0`	175.2 / 119.8	slower-than-fit	1028.6 / 722.8	slower-than-fit

Validation artifacts from the broader run:

Studio JSON: /tmp/model-fit-validation-studio-broader-final.json
Studio Markdown: /tmp/model-fit-validation-studio-broader-final.md
white.local JSON: /tmp/model-fit-validation-white-broader-final.json
white.local Markdown: /tmp/model-fit-validation-white-broader-final.md

Focused Q8/MoE Follow-Up

This follow-up addresses the two visible steady-decode misses from the broader run without using model names, filenames, backend-specific constants, or current-run throughput as estimator inputs.

Source-grounded changes:

Q8_0, Q5_K, and Q6_K tensor traffic now counts stored GGUF matmul bytes instead of discounting resident bytes. The pinned llama.cpp sources show Metal Q8_0 mat-vec kernels reading block_q8_0 directly, and CUDA MMVQ/MMQ using q8-specific vec-dot helpers; there is no source basis for charging less than the stored quantized bytes.
Sparse MoE dispatch overhead now scales from measured fixed submission overhead when measured GPU data is available, capped by the existing conservative fallback prior. The llama.cpp MoE graph contains real router/top-k, MUL_MAT_ID, weighting, and aggregation work, but CUDA can also fuse eligible expert paths and use optimized quantized kernels, so one backend-independent hand-written MoE dispatch constant was too conservative.

Focused corpus: crates/model-fit/validation/q8-moe-models.txt, covering Qwen2.5-Coder 7B Q4/Q5/Q6/Q8 slope, OLMoE active-expert MoE, and a Q8 reranker control.

machine	backend	scenario	samples	median abs error	notable result
Mac Studio M1 Ultra	Metal	steady_decode	6	8.8%	Qwen2.5-Coder Q8 moved to 1.06 observed/fit; OLMoE 0.88 and noisy; bge reranker remains slower-than-fit.
white.local RTX 5080	CUDA	steady_decode	6	8.3%	Qwen2.5-Coder Q8 improved from 0.60 to 0.86 observed/fit; OLMoE improved from 1.97 to 0.90 observed/fit.
Mac Studio M1 Ultra	Metal	prefill	5	12.2%	Q4/Q5/Q6 and OLMoE are close; Q8 prefill remains noisy/slower.
white.local RTX 5080	CUDA	prefill	5	223.6%	CUDA prefill remains the largest residual miss and is reported rather than hidden.
Mac Studio M1 Ultra	Metal	kv_warm_reuse	6	14.3%	Q8 matches warm reuse; Q4/Q5 and bge remain slower/noisy.
white.local RTX 5080	CUDA	kv_warm_reuse	6	17.5%	Qwen2.5-Coder and bge KV reuse remain slower than steady-decode fit.

MoE Prefill Probe and First-Token Breakdown

This update adds a CUDA prefill_moe_matmul_tflops_fp16 hardware fact from a strided-batched FP16 cuBLAS GEMM shaped like active-expert MoE prefill. The field is carried through mesh-llm gpus benchmark, cached system GPU facts, model-fit HardwareProfile, validator JSON, and the fit-input contract.

The scorer does not treat that raw MoE GEMM probe as a free speedup. It is used as an upper bound because llama.cpp MoE prefill has router/top-k, expert-id mapping, GGML_OP_MUL_MAT_ID, weighting, and aggregation work around the expert GEMMs. When both the measured MoE roofline and the older MoE-aware fallback exist, the scorer uses the lower throughput. That is the conservative source-grounded rule, and the unit test locks it in.
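
The lower-throughput rule is easy to state in code. This sketch uses a hypothetical function name; the behavior when only one estimate exists is an assumption, since the text only specifies the case where both are present.

```rust
/// Combine a measured MoE roofline with the older MoE-aware fallback.
/// When both exist, take the lower (more conservative) throughput.
fn moe_prefill_tps(measured_roofline: Option<f64>, fallback: Option<f64>) -> Option<f64> {
    match (measured_roofline, fallback) {
        (Some(measured), Some(older)) => Some(measured.min(older)),
        // Assumption: with a single source, use whichever estimate is available.
        (measured, older) => measured.or(older),
    }
}
```

Taking the minimum keeps the probe an upper bound: a fast raw GEMM measurement can never raise the estimate above what the fallback already allowed.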

The validator now also serializes first-token components:

predicted prefill ms
predicted decode ms
predicted overhead ms
observed tokenize ms
observed prefill ms
observed decode ms
observed unattributed ms

Latest focused white.local CUDA rerun after the MoE probe and first-token split:

scenario	samples	median abs error	notable result
steady_decode	6	8.0%	Qwen2.5-Coder Q4/Q5/Q6/Q8 and OLMoE remain close enough for the focused set; bge reranker remains a control miss/runtime-error in some scenarios.
prefill	5	5.4%	Qwen2.5-Coder Q4/Q5/Q8 are in band; Q6 is 14% slower-than-fit; OLMoE is noisy at 0.85 observed/fit, with the MoE probe bounded rather than over-trusted.
first_token	5	16.9%	Tokenize is about 7 ms and unattributed request time is near zero; the miss is mostly first decode after prefill.
kv_warm_reuse	6	17.1%	Warm reuse remains slower than steady-decode fit for several samples.

Representative first-token component rows from that run:

model	predicted prefill ms	predicted decode ms	observed tokenize ms	observed prefill ms	observed decode ms	observed unattributed ms
Qwen2.5-Coder Q4_K_M	404.4	5.4	6.9	406.3	54.3	1.0
Qwen2.5-Coder Q5_K_M	404.4	6.3	6.9	416.8	55.7	0.8
Qwen2.5-Coder Q6_K	404.4	7.4	6.9	468.8	59.4	0.5
Qwen2.5-Coder Q8_0	402.4	7.9	7.7	372.6	58.5	0.7
OLMoE Q4_K_M	239.5	1.4	7.9	276.9	82.8	0.0

The next honest target is therefore a first-decode-after-prefill hardware fact or source-derived model for session/prompt-transition decode, not a wider error band and not a model-name exception.

Representative focused steady-decode rows:

model	Studio fit/obs	Studio verdict	white fit/obs	white verdict
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`	63.4 / 62.1	match	182.9 / 169.9	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`	56.8 / 51.8	match	158.9 / 150.8	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`	49.6 / 57.1	faster-than-fit	134.4 / 131.4	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`	45.5 / 48.2	match	125.6 / 108.2	slower-than-fit
`bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`	152.4 / 133.3	inconclusive-noisy	683.9 / 618.7	match
`gpustack/bge-reranker-v2-m3-GGUF:Q8_0`	165.1 / 111.7	slower-than-fit	906.8 / 723.2	slower-than-fit

Validation artifacts from the focused run:

Studio JSON: /tmp/model-fit-validation-studio-q8-moe.json
Studio Markdown: /tmp/model-fit-validation-studio-q8-moe.md
white.local JSON: /tmp/model-fit-validation-white-q8-moe.json
white.local Markdown: /tmp/model-fit-validation-white-q8-moe.md

Dense Prefill Roofline Follow-Up

This update splits prefill from decode without adding backend-specific rules. mesh-llm gpus benchmark now reports an optional generic prefill_matmul_tflops_fp16 hardware fact. model-fit consumes that field for dense transformer prefill roofline estimates, using GGUF matmul FLOPs, prompt tokens, llama.cpp ubatch shape, active weight bytes, measured memory bandwidth, and measured graph overhead.
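
A generic roofline over ubatches gives a feel for how those inputs might combine. Everything in this sketch is an assumption for illustration: the struct, the per-ubatch max of compute time and weight-read time, and the one-overhead-per-ubatch term are a textbook roofline, not necessarily what model-fit computes.

```rust
/// Hypothetical inputs for a dense prefill roofline.
struct PrefillInputs {
    matmul_flops_per_token: f64,
    active_weight_bytes: f64,
    prompt_tokens: u32,
    ubatch_tokens: u32,
    matmul_tflops_fp16: f64,
    bandwidth_gb_per_s: f64,
    graph_overhead_ms: f64,
}

/// Estimated prefill time in ms. Each ubatch costs the larger of its
/// compute time and one pass over the active weights, plus graph overhead.
fn prefill_ms(p: &PrefillInputs) -> f64 {
    let ubatch = p.ubatch_tokens.max(1);
    let memory_ms = p.active_weight_bytes / (p.bandwidth_gb_per_s * 1e9) * 1e3;
    let mut remaining = p.prompt_tokens;
    let mut total_ms = 0.0;
    while remaining > 0 {
        let tokens = remaining.min(ubatch);
        let flops = p.matmul_flops_per_token * tokens as f64;
        let compute_ms = flops / (p.matmul_tflops_fp16 * 1e12) * 1e3;
        total_ms += compute_ms.max(memory_ms) + p.graph_overhead_ms;
        remaining -= tokens;
    }
    total_ms
}
```

Large ubatches end up compute-bound and short prompts memory-bound, which is the reason prefill needs a matmul throughput fact that decode does not.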

Sparse MoE intentionally does not use the dense matmul probe. llama.cpp routes MoE through expert selection and GGML_OP_MUL_MAT_ID plus id mapping, weighting, and aggregation, so dense GEMM throughput is not the right hardware fact for active-expert prefill. MoE stays on the existing fallback until there is a measured MoE-shaped hardware probe.

Focused white.local CUDA rerun after the corrected wiring:

scenario	samples	median abs error	notable result
steady_decode	6	7.9%	Decode stayed in the target band; Q8 and bge reranker remain visible slower-than-fit residuals.
prefill	5	5.7%	Qwen2.5-Coder Q4/Q5/Q8 matched; Q6 is 14% slower-than-fit; OLMoE uses the MoE fallback and is noisy/slower-than-fit.
first_token	5	17.0%	Dense first-token improved but still includes request/setup/tokenize/decode overhead not isolated by the prefill throughput check.
kv_warm_reuse	6	17.0%	KV reuse remains slower-than-fit across this focused CUDA set.

Focused Studio Metal rerun with the dense prefill split but without a Metal matmul probe yet:

scenario	samples	median abs error	notable result
steady_decode	6	9.0%	Decode stayed in band overall; Q6/Q8 were faster-than-fit in this run.
prefill	5	3.7%	Qwen2.5-Coder Q5/Q8 and OLMoE matched; Q4 was 15% faster-than-fit and Q6/Q8 carried noisy samples.
first_token	5	9.5%	First-token reached the target band on the focused Metal set.
kv_warm_reuse	6	19.2%	KV reuse remains the main Metal residual in this set.

Validation artifacts:

Studio JSON: /tmp/model-fit-validation-studio-prefill-roofline.json
Studio Markdown: /tmp/model-fit-validation-studio-prefill-roofline.md
white.local JSON: /tmp/model-fit-validation-white-prefill-matmul4.json
white.local Markdown: /tmp/model-fit-validation-white-prefill-matmul4.md

Cross-Backend Prefill Probe Coverage

This update keeps the estimator backend-neutral while broadening the hardware facts that can feed it:

execution path	implementation	validation status
CUDA	dense FP16 GEMM and MoE-shaped strided-batched FP16 GEMM through cuBLAS	Re-ran on white.local RTX 5080; JSON reports `prefill_matmul_tflops_fp16: 112.31` and `prefill_moe_matmul_tflops_fp16: 84.44`.
Metal	dense FP16 GEMM and MoE-shaped expert GEMMs through Metal Performance Shaders	Re-ran on Mac Studio M1 Ultra; JSON reports `prefill_matmul_tflops_fp16: 11.8` and `prefill_moe_matmul_tflops_fp16: 9.02`.
ROCm/HIP	dense FP16 GEMM and MoE-shaped strided-batched FP16 GEMM through hipBLAS	Implemented, but not runtime-validated in this session because no ROCm runner was available.
CPU	`CpuProfile` can carry the same optional facts and the scorer consumes them	Intentionally not auto-filled; many CPUs do not have a comparable FP16 matmul path, so model-fit leaves these facts absent unless a CPU benchmark supplies them.

The scoring rule is unchanged: model-fit consumes optional measured facts by name and does not branch on backend names. Backend-specific code exists only in benchmark implementations because each native runtime has a different API for measuring the same semantic operation.

Additional validation for this update:

mesh-llm gpus benchmark --json on Mac Studio M1 Ultra with Metal emitted both dense and MoE prefill facts.
mesh-llm gpus benchmark --json on white.local with CUDA emitted both dense and MoE prefill facts.
cargo check -p mesh-llm-gpu-bench --features cuda on white.local passed.
cargo test -p model-fit --lib passed.
cargo clippy -p mesh-llm --all-targets -- -D warnings passed.
cargo fmt --all -- --check and git diff --check passed.

Latest Post-Prefill Validation

This update adds a measured post_prefill_decode_overhead_ms fact to mesh-llm gpus benchmark for CUDA, Metal, and ROCm/HIP. The fit estimator uses it only as a lower-bound first-token transition cost: it measures issuing decode-shaped work after prefill-shaped matmul work without a GGUF model loaded. It does not absorb tokenizer, HTTP, llama.cpp graph/session, sampling, or observed model benchmark residuals.

Current broader validation after this probe:

machine	backend	scenario	samples	median abs error	note
Mac Studio M1 Ultra	Metal	steady_decode	16	18.0%	Several small/medium dense and reranker cases remain slower-than-fit.
white.local RTX 5080	CUDA	steady_decode	16	7.7%	Most generation models match; SmolLM2 Q4 is faster-than-fit, Phi/Qwen2.5 Q8/bge remain slower.
Mac Studio M1 Ultra	Metal	prefill	15	12.8%	Improved, still outside the 10% target.
white.local RTX 5080	CUDA	prefill	15	29.1%	CUDA prefill still needs source-grounded shape work.
Mac Studio M1 Ultra	Metal	first_token	15	25.8%	Post-prefill probe does not explain all first-token residuals.
white.local RTX 5080	CUDA	first_token	15	68.7%	First-token remains the main unsolved scenario.
Mac Studio M1 Ultra	Metal	kv_warm_reuse	16	9.5%	Within target median, with some per-model misses still visible.
white.local RTX 5080	CUDA	kv_warm_reuse	16	13.9%	Close but still outside target median.

Full steady-decode comparison from the broader rerun:

model	Studio pred/actual tok/s	Studio ratio (actual/pred)	Studio verdict	white pred/actual tok/s	white ratio (actual/pred)	white verdict
`unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M`	163.9 / 170.59	1.04	match	735.3 / 998.92	1.36	faster-than-fit
`unsloth/SmolLM2-135M-Instruct-GGUF:Q8_0`	161.4 / 171.38	1.06	match	921.1 / 990.24	1.08	match
`unsloth/Qwen3-0.6B-GGUF:Q4_K_M`	158.5 / 119.10	0.75	slower-than-fit	630.3 / 634.00	1.01	match
`unsloth/Qwen3-0.6B-GGUF:Q8_0`	146.3 / 113.82	0.78	slower-than-fit	658.2 / 598.71	0.91	match
`bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`	131.0 / 114.06	0.87	inconclusive-noisy	494.3 / 536.60	1.09	match
`bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q8_0`	112.2 / 112.34	1.00	match	442.1 / 407.45	0.92	match
`unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M`	100.5 / 80.00	0.80	slower-than-fit	316.2 / 301.32	0.95	match
`unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M`	82.7 / 70.52	0.85	slower-than-fit	308.3 / 265.63	0.86	slower-than-fit
`unsloth/Qwen3-4B-GGUF:Q4_K_M`	75.1 / 64.54	0.86	inconclusive-noisy	245.6 / 232.88	0.95	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`	65.3 / 58.29	0.89	slower-than-fit	182.2 / 169.88	0.93	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`	58.1 / 47.58	0.82	slower-than-fit	158.3 / 150.76	0.95	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`	50.3 / 44.37	0.88	inconclusive-noisy	133.8 / 131.27	0.98	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`	45.9 / 35.36	0.77	slower-than-fit	125.1 / 108.17	0.86	slower-than-fit
`unsloth/Qwen3-8B-GGUF:Q4_K_M`	54.4 / 49.56	0.91	match	159.5 / 151.67	0.95	match
`bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`	159.2 / 115.18	0.72	slower-than-fit	681.7 / 618.58	0.91	match
`gpustack/bge-reranker-v2-m3-GGUF:Q8_0`	185.1 / 110.05	0.59	slower-than-fit	905.0 / 719.27	0.79	slower-than-fit

Validation artifacts from the post-prefill broader run:

Studio JSON: /tmp/model-fit-validation-studio-postprefill-broader.json
Studio Markdown: /tmp/model-fit-validation-studio-postprefill-broader.md
white.local JSON: /tmp/model-fit-validation-white-postprefill-broader.json
white.local Markdown: /tmp/model-fit-validation-white-postprefill-broader.md

Residual misses are still reported as misses. This branch has not added backend-specific estimator constants or model-specific exceptions to make the table look better.

Expanded FFN Decode Graph Validation

This update fixes two issues found in the Studio Metal follow-up:

GGUFs tagged as text-classification, reranker, ranking, or ranker now classify as RerankerOrClassifier before dense transformer shape fallback. This keeps reranker GGUFs out of autoregressive decode validation and lets workload scoring choose the right task shape.
Dense decode graph overhead now adds a source-derived expanded-FFN sequential-stage term. The term uses GGUF matmul shape summaries: logical matmul count, layer count, hidden width, and FFN expansion. It is multiplied by the measured decode_fixed_overhead_ms from the hardware profile, so it is not a backend-name branch and it barely moves low-dispatch-latency profiles.
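
The tag-based classification in the first fix might be sketched as below. Only the RerankerOrClassifier name and the four tag values come from the description; the enum's other variant, the function name, and the case-insensitive exact match are assumptions.

```rust
/// Simplified model kind; only the first variant is named in the description.
#[derive(Debug, PartialEq)]
enum ModelKind {
    RerankerOrClassifier,
    Other,
}

/// Classify from GGUF tags before any dense-transformer shape fallback.
/// Assumption: tags are compared case-insensitively as whole strings.
fn classify_tags(tags: &[&str]) -> ModelKind {
    const MARKERS: [&str; 4] = ["text-classification", "reranker", "ranking", "ranker"];
    let is_reranker = tags.iter().any(|tag| {
        let tag = tag.trim().to_ascii_lowercase();
        MARKERS.iter().any(|marker| tag == *marker)
    });
    if is_reranker {
        ModelKind::RerankerOrClassifier
    } else {
        ModelKind::Other
    }
}
```

Running this check first is what lets the validator skip autoregressive decode scenarios for reranker GGUFs instead of scoring them as dense chat models.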

Studio affected rerun:

model	predicted tok/s	actual tok/s	actual/fit	verdict
`unsloth/Qwen3-0.6B-GGUF:Q4_K_M`	117.3	121.96	1.04	match
`unsloth/Qwen3-0.6B-GGUF:Q8_0`	111.1	110.07	0.99	match
`bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`	132.6	113.18	0.85	inconclusive-noisy
`bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q8_0`	114.9	111.23	0.97	match
`unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M`	90.1	82.81	0.92	match
`unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M`	76.3	73.04	0.96	match
`unsloth/Qwen3-4B-GGUF:Q4_K_M`	64.9	70.82	1.09	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`	59.8	59.80	1.00	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`	54.1	52.39	0.97	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`	47.8	52.45	1.10	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`	44.2	30.23	0.68	slower-than-fit
`unsloth/Qwen3-8B-GGUF:Q4_K_M`	49.9	50.55	1.01	match
`gpustack/bge-reranker-v2-m3-GGUF:Q8_0`	rejected	skipped	-	skipped

The affected steady-decode median absolute error is 4.0% across accuracy-gated samples, but the report intentionally still fails strict validation because Qwen2.5-Coder 7B Q8_0 was outside tolerance and one sample was noisy. A focused rerun contradicted that specific row: Q8_0 came back faster-than-fit at an actual/fit ratio of 1.28 while Q4_K_M came back slower-than-fit at 0.78. Because these Studio measurements contradict each other, I did not add a Q8 correction.
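As a cross-check, the 4.0% figure can be reproduced from the table above with a short Python sketch. It assumes the absolute error is |actual / predicted - 1| and that the noisy EXAONE Q4_K_M row and the skipped reranker row are excluded from the accuracy-gated set; both assumptions are inferred from the table, not taken from the checker source.

```python
from statistics import median

# (model, predicted tok/s, actual tok/s) for the accuracy-gated rows above.
rows = [
    ("Qwen3-0.6B Q4_K_M", 117.3, 121.96),
    ("Qwen3-0.6B Q8_0", 111.1, 110.07),
    ("EXAONE-4.0-1.2B Q8_0", 114.9, 111.23),
    ("Llama-3.2-3B Q4_K_M", 90.1, 82.81),
    ("Phi-4-mini Q4_K_M", 76.3, 73.04),
    ("Qwen3-4B Q4_K_M", 64.9, 70.82),
    ("Qwen2.5-Coder-7B Q4_K_M", 59.8, 59.80),
    ("Qwen2.5-Coder-7B Q5_K_M", 54.1, 52.39),
    ("Qwen2.5-Coder-7B Q6_K", 47.8, 52.45),
    ("Qwen2.5-Coder-7B Q8_0", 44.2, 30.23),
    ("Qwen3-8B Q4_K_M", 49.9, 50.55),
]


def abs_error_pct(predicted, actual):
    """Absolute relative error of the observation against the fit, in percent."""
    return abs(actual / predicted - 1.0) * 100.0


errors = {name: abs_error_pct(pred, act) for name, pred, act in rows}
median_abs_error = median(errors.values())
worst = max(errors, key=errors.get)

print(f"median abs error: {median_abs_error:.1f}%")  # -> 4.0%
print(f"worst row: {worst}")
```

The median is robust to the single Q8_0 outlier, which is why the strict gate also checks individual rows instead of relying on the median alone.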

First-token remains separate work. The breakdown shows most misses come from long-prompt prefill and first decode after prefill, not steady decode. That should be addressed with a separate source-grounded model or benchmark fact, not by distorting steady-decode fit.

Validation artifacts:

Affected Studio JSON: /tmp/model-fit-validation-studio-expanded-ffn-affected.json
Affected Studio Markdown: /tmp/model-fit-validation-studio-expanded-ffn-affected.md
Q8 rerun JSON: /tmp/model-fit-validation-studio-expanded-ffn-q8-rerun.json

Latest Ubatch Prefill Validation

This update adds prefill_ubatch_matmul_tflops_fp16 to mesh-llm gpus benchmark for CUDA, Metal, and ROCm/HIP implementations and carries it through system GPU facts, validator JSON, and model-fit hardware profiles.

Source rationale: llama.cpp prompt processing is ubatch-shaped (n_ubatch, commonly 512), so a prompt/prefill roofline should prefer a measured ubatch-shaped GEMM over a generic square GEMM when available. The scorer consumes this as a named hardware fact and still does not branch on Metal/CUDA/ROCm names, model names, filenames, catalog reputation, or observed throughput.
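A minimal Python sketch of the "prefer the ubatch-shaped fact when available" rule follows. The `HardwareProfile` class, the `prefill_square_matmul_tflops_fp16` field name, and the fallback order are assumptions for illustration; only `prefill_ubatch_matmul_tflops_fp16` is a name from this PR. The sample values are the Mac Studio figures from the hardware facts table later in this description.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class HardwareProfile:
    # Field name from the PR: measured ubatch-shaped GEMM throughput.
    prefill_ubatch_matmul_tflops_fp16: Optional[float] = None
    # Hypothetical name for the generic square-GEMM measurement.
    prefill_square_matmul_tflops_fp16: Optional[float] = None


def prefill_tflops(profile: HardwareProfile) -> Tuple[Optional[float], str]:
    """Pick the prefill throughput fact by availability, never by backend name."""
    if profile.prefill_ubatch_matmul_tflops_fp16 is not None:
        return profile.prefill_ubatch_matmul_tflops_fp16, "ubatch"
    if profile.prefill_square_matmul_tflops_fp16 is not None:
        return profile.prefill_square_matmul_tflops_fp16, "square"
    return None, "unmeasured"


studio = HardwareProfile(
    prefill_ubatch_matmul_tflops_fp16=10.82,
    prefill_square_matmul_tflops_fp16=9.76,
)
print(prefill_tflops(studio))  # -> (10.82, 'ubatch')
```

Returning the source label alongside the value keeps the choice visible in a report, so a reader can tell which measurement shaped a given prefill estimate.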

Focused Mac Studio M1 Ultra rerun after this change:

model_ref	fit	est tok/s	abi tok/s	est range (tok/s)	steady median (tok/s)	steady/fit	steady verdict	first-token verdict	kv-reuse verdict
`unsloth/Qwen3-0.6B-GGUF:Q4_K_M`	FitsLocal	104.5	142.4	67.9-141.0	123.89	1.19	faster-than-fit	faster-than-fit	inconclusive-noisy
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`	FitsLocal	53.8	60.9	40.4-67.3	58.46	1.09	match	match	inconclusive-noisy
`unsloth/Qwen3-8B-GGUF:Q4_K_M`	FitsLocal	44.9	59.7	33.7-56.2	57.56	1.28	faster-than-fit	match	match

First-token breakdown from the same run:

model	predicted total ms	observed total ms	observed/fit	predicted prefill ms	observed prefill ms	predicted decode ms	observed decode ms	verdict
Qwen3 0.6B Q4	717.3	601.3	0.84	707.5	551.8	9.5	37.2	faster-than-fit
Qwen2.5-Coder 7B Q4	4217.3	4534.2	1.08	4198.6	4342.0	18.5	182.9	match
Qwen3 8B Q4	4531.1	4820.0	1.06	4508.6	4603.0	22.2	198.5	match

Rendered first-token checker result for this focused subset:

metric	value
first-token median absolute error	7.5%
first-token rows in-band	2 / 3
remaining miss	Qwen3 0.6B Q4 was faster than fit by 16.2%

Important caveat: the local benchmark for this subset reported high benchmark noise (noise_pct: 26.31). The subset is useful evidence that ubatch-shaped prefill is the right measurement shape, but it is not broad proof that first-token is solved. The first decode after prefill remains visibly under-modeled: total first-token can match because prefill dominates, while the decode component is still much larger than predicted.
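The caveat about the decode component can be made concrete by splitting the first-token breakdown rows into per-component observed/predicted ratios. This Python sketch only re-derives ratios from the table above; the variable and function names are illustrative.

```python
# model: (predicted prefill ms, observed prefill ms,
#         predicted decode ms, observed decode ms, observed total ms)
breakdown = {
    "Qwen3 0.6B Q4": (707.5, 551.8, 9.5, 37.2, 601.3),
    "Qwen2.5-Coder 7B Q4": (4198.6, 4342.0, 18.5, 182.9, 4534.2),
    "Qwen3 8B Q4": (4508.6, 4603.0, 22.2, 198.5, 4820.0),
}


def component_ratios(pred_prefill, obs_prefill, pred_decode, obs_decode, obs_total):
    """Return observed/predicted for prefill and decode, plus decode's share of the total."""
    return (
        obs_prefill / pred_prefill,
        obs_decode / pred_decode,
        obs_decode / obs_total,
    )


ratios = {name: component_ratios(*row) for name, row in breakdown.items()}

for name, (prefill_ratio, decode_ratio, decode_share) in ratios.items():
    print(
        f"{name}: prefill x{prefill_ratio:.2f}, "
        f"decode x{decode_ratio:.1f}, decode share {decode_share:.1%}"
    )
```

In all three rows the observed first decode is several times the prediction, yet it is a small fraction of the observed total, so the total can land in band while the decode piece stays under-modeled.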

Validation commands added for this update:

LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm-gpu-bench
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo check -p mesh-llm
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo test -p model-fit --lib
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo build -p model-fit --release --bins
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo run -p mesh-llm -- gpus benchmark --json
target/release/model-fit-validate --no-progress --output-json /tmp/model-fit-validation-studio-ubatch-firsttoken-subset.json \
  unsloth/Qwen3-0.6B-GGUF:Q4_K_M \
  unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M \
  unsloth/Qwen3-8B-GGUF:Q4_K_M
target/release/model-fit-check-validation --scenario first_token --markdown-out /tmp/model-fit-validation-studio-ubatch-firsttoken-subset-first-token.md \
  /tmp/model-fit-validation-studio-ubatch-firsttoken-subset.json
LLAMA_STAGE_BUILD_DIR=$PWD/.deps/llama-build/build-stage-abi-metal cargo clippy -p mesh-llm --all-targets -- -D warnings
cargo fmt --all -- --check
git diff --check

model-fit-check-validation --scenario first_token intentionally exited non-zero for the focused subset because Qwen3 0.6B Q4 remained outside the strict individual tolerance. The report was still rendered and the miss is documented above.

Next validation target: rerun the broader smoke/deep validation on Studio and white.local using the new ubatch-shaped hardware fact under lower-noise benchmark conditions, then add a dedicated source-grounded first-decode-after-prefill model or probe instead of bending steady-decode fit.

Broader Ubatch Rerun

After adding the ubatch-shaped prefill hardware fact, I reran the broader validation corpus on both Mac Studio M1 Ultra/Metal and white.local RTX 5080/CUDA.

Hardware facts from this run:

machine	backend	decode GB/s	prefill square TFLOP/s	prefill ubatch TFLOP/s	MoE prefill TFLOP/s	benchmark noise
Mac Studio M1 Ultra	Metal	411.16	9.76	10.82	8.99	24.39%
white.local RTX 5080	CUDA	903.31	112.23	105.81	84.24	0.35%

Scenario summary:

machine	backend	scenario	samples	accuracy-gated	noisy	median abs error	headline
Mac Studio M1 Ultra	Metal	steady_decode	15	14	1	5.8%	Median is inside target, but Qwen3 0.6B Q4/Q8 faster-than-fit, Qwen2.5-Coder Q5/Q8 slower-than-fit, and OLMoE faster-than-fit remain visible.
white.local RTX 5080	CUDA	steady_decode	15	15	0	6.6%	Median is inside target; misses are SmolLM2 Q4 faster-than-fit plus Phi-4-mini and Qwen2.5-Coder Q8 slower-than-fit.
Mac Studio M1 Ultra	Metal	prefill	15	14	1	22.3%	Still outside target. Ubatch shape is the right fact, but Metal prefill remains uneven across tiny, small, coder quant, and MoE shapes.
white.local RTX 5080	CUDA	prefill	15	15	0	14.1%	Better than earlier CUDA prefill, but still outside target. Small dense is mostly slower-than-fit; coder Q4/Q5/Q8 faster-than-fit.
Mac Studio M1 Ultra	Metal	first_token	15	13	2	30.4%	Still not solved. Larger long-prompt cases are mostly slower-than-fit; Qwen2.5-Coder Q4 is the main good match.
white.local RTX 5080	CUDA	first_token	15	15	0	26.0%	Better than prior CUDA first-token, but still outside target. Small dense first-token is often slower-than-fit; coder Q4/Q5/Q6 and Qwen3-8B match.
Mac Studio M1 Ultra	Metal	kv_warm_reuse	15	12	3	11.1%	Close but still outside strict target, with tiny-model faster-than-fit and coder Q4/Q5 slower-than-fit.
white.local RTX 5080	CUDA	kv_warm_reuse	15	15	0	12.7%	Close but still outside strict target; most misses are slower-than-fit on Q8/small/medium and MoE.

Representative steady-decode rows:

model	Studio pred/actual	Studio ratio	Studio verdict	white pred/actual	white ratio	white verdict
`unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M`	153.5 / 155.7	1.01	match	735.2 / 992.3	1.35	faster-than-fit
`unsloth/Qwen3-0.6B-GGUF:Q4_K_M`	110.8 / 125.7	1.13	faster-than-fit	630.1 / 628.0	1.00	match
`bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`	125.4 / 118.4	0.94	match	494.1 / 538.6	1.09	match
`unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M`	72.3 / 76.7	1.06	match	308.0 / 268.4	0.87	slower-than-fit
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`	56.8 / 60.0	1.06	match	182.1 / 171.0	0.94	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`	51.5 / 41.7	0.81	slower-than-fit	158.1 / 151.6	0.96	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`	45.5 / 47.5	1.04	match	133.7 / 132.0	0.99	match
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`	42.1 / 26.9	0.64	slower-than-fit	125.0 / 108.4	0.87	slower-than-fit
`unsloth/Qwen3-8B-GGUF:Q4_K_M`	47.4 / 48.9	1.03	match	159.3 / 153.0	0.96	match
`bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`	129.8 / 152.9	1.18	faster-than-fit	681.2 / 623.4	0.92	match

Artifacts:

Studio JSON: /tmp/model-fit-validation-studio-ubatch-broader.json
Studio Markdown: /tmp/model-fit-validation-studio-ubatch-broader-{steady_decode,prefill,first_token,kv_warm_reuse,all}.md
white.local JSON: /tmp/model-fit-validation-white-ubatch-broader.json
white.local Markdown: /tmp/model-fit-validation-white-ubatch-broader-{steady_decode,prefill,first_token,kv_warm_reuse,all}.md

Interpretation: steady decode is now within the 10% median target on both machines, but the residual per-model misses are real and should stay visible. Prefill improved compared with earlier CUDA results but is still not broadly in band. First-token remains a separate problem, and KV warm reuse is close but still outside the strict median target. No thresholds were widened and no model-specific or backend-specific estimator constants were added for this rerun.
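The interpretation can be checked mechanically against the scenario summary. The Python sketch below assumes the same 10% median target applies to every scenario and that a median exactly at the target would pass; neither detail is confirmed by this description, and no row sits on the boundary.

```python
MEDIAN_TARGET_PCT = 10.0  # median target quoted in the interpretation above

# (machine, scenario, median abs error %) from the scenario summary table.
summary = [
    ("Mac Studio M1 Ultra", "steady_decode", 5.8),
    ("white.local RTX 5080", "steady_decode", 6.6),
    ("Mac Studio M1 Ultra", "prefill", 22.3),
    ("white.local RTX 5080", "prefill", 14.1),
    ("Mac Studio M1 Ultra", "first_token", 30.4),
    ("white.local RTX 5080", "first_token", 26.0),
    ("Mac Studio M1 Ultra", "kv_warm_reuse", 11.1),
    ("white.local RTX 5080", "kv_warm_reuse", 12.7),
]


def within_target(median_abs_error_pct, target_pct=MEDIAN_TARGET_PCT):
    """Assumed gate: the median passes when it does not exceed the target."""
    return median_abs_error_pct <= target_pct


passing = [(m, s) for m, s, err in summary if within_target(err)]
failing = [(m, s) for m, s, err in summary if not within_target(err)]

print("inside target:", passing)
print("outside target:", len(failing), "scenario rows")
```

Only the two steady_decode rows pass under these assumptions, which matches the prose summary.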

First-Token Residual Validation

This update adds explicit first-token residual diagnostics instead of hiding the miss inside a tuned estimator constant. The validator now records prompt token count, tokenizer vocab size, predicted prefill/decode/transition pieces, observed tokenize/prefill/decode timing, and sampled-decode residual per prompt token. model-artifact also derives vocab_size from tokenizer.ggml.tokens when GGUFs omit an explicit *.vocab_size field.
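The vocab-size fallback can be sketched in Python as below. The dictionary stands in for already-parsed GGUF key/value metadata, and `derive_vocab_size` is a hypothetical helper, not the `model-artifact` API. The `tokenizer.ggml.tokens` key and the `*.vocab_size` pattern are the ones named above.

```python
from typing import Any, Mapping, Optional


def derive_vocab_size(metadata: Mapping[str, Any]) -> Optional[int]:
    """Prefer an explicit `<arch>.vocab_size`; otherwise count tokenizer tokens."""
    for key, value in metadata.items():
        if key.endswith(".vocab_size"):
            return int(value)
    tokens = metadata.get("tokenizer.ggml.tokens")
    if tokens is not None:
        return len(tokens)
    return None  # neither source is present


explicit = {"llama.vocab_size": 32000, "tokenizer.ggml.tokens": ["<s>", "</s>"]}
implicit = {"tokenizer.ggml.tokens": ["<s>", "</s>", "hello"]}

print(derive_vocab_size(explicit))  # -> 32000
print(derive_vocab_size(implicit))  # -> 3
```

Returning `None` instead of a guessed default keeps a missing vocab size visible in the report.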

Source rationale: Skippy's sampled first decode calls into llama.cpp sampling after prefill, and the chat sampling path syncs sampler history across the prompt before sampling the first generated token. That is a real source-shaped O(prompt tokens) term, separate from steady decode and separate from the lower-bound post_prefill_decode_overhead_ms hardware probe. The current change reports that residual; it does not yet add a guessed sampler-throughput constant to model-fit.

Latest corrected broader rerun:

machine	backend	scenario	samples	noisy	median abs error	note
Mac Studio M1 Ultra	Metal	steady_decode	15	3	26.7%	This Studio run was much noisier/slower than the immediately preceding Studio pass, despite low hardware benchmark noise; treat it as environment evidence, not a fit-tuning target.
Mac Studio M1 Ultra	Metal	prefill	15	3	22.6%	Still outside target; Metal prefill remains shape/noise sensitive.
Mac Studio M1 Ultra	Metal	first_token	15	2	38.7%	First-token residual is visible and large on long-prompt 4B/7B/8B cases.
Mac Studio M1 Ultra	Metal	kv_warm_reuse	15	2	9.6%	Median is inside target, with residual per-model misses still visible.
white.local RTX 5080	CUDA	steady_decode	15	0	6.7%	Decode remains inside target; residual misses are SmolLM2 Q4 faster-than-fit plus Phi and Qwen2.5-Coder Q8 slower-than-fit.
white.local RTX 5080	CUDA	prefill	15	1	14.6%	Better than early CUDA prefill work but still outside target.
white.local RTX 5080	CUDA	first_token	15	0	26.9%	Coder 7B and Qwen3 8B match, but small/medium non-coder first-token remains slower-than-fit.
white.local RTX 5080	CUDA	kv_warm_reuse	15	0	12.8%	Stable but still outside strict target median.

First-token residual examples from the CUDA rerun:

model	prompt toks	vocab	predicted ms	observed ms	sampled residual ms	residual us/tok	verdict
`unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M`	3455	49152	67.8	99.7	16.3	4.7	slower-than-fit
`unsloth/Qwen3-0.6B-GGUF:Q4_K_M`	3168	151936	117.5	115.8	15.3	4.8	match
`bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`	3455	102400	79.8	158.8	27.7	8.0	slower-than-fit
`unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`	3168	152064	432.4	403.6	23.8	7.5	match
`unsloth/Qwen3-8B-GGUF:Q4_K_M`	3168	151936	465.1	447.8	26.3	8.3	match
`bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`	3455	50304	241.2	282.4	40.4	11.7	slower-than-fit
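The per-token residual column can be re-derived from the other columns. This Python sketch assumes the sampled residual is in milliseconds, which is consistent with every row above; the function name is illustrative.

```python
# (model, prompt tokens, sampled residual ms, reported residual us/tok)
rows = [
    ("SmolLM2-135M Q4_K_M", 3455, 16.3, 4.7),
    ("Qwen3-0.6B Q4_K_M", 3168, 15.3, 4.8),
    ("EXAONE-4.0-1.2B Q4_K_M", 3455, 27.7, 8.0),
    ("Qwen2.5-Coder-7B Q4_K_M", 3168, 23.8, 7.5),
    ("Qwen3-8B Q4_K_M", 3168, 26.3, 8.3),
    ("OLMoE-1B-7B Q4_K_M", 3455, 40.4, 11.7),
]


def residual_us_per_token(residual_ms, prompt_tokens):
    """Convert a residual in milliseconds to microseconds per prompt token."""
    return residual_ms * 1000.0 / prompt_tokens


for name, tokens, residual_ms, reported in rows:
    derived = residual_us_per_token(residual_ms, tokens)
    print(f"{name}: {derived:.1f} us/tok (reported {reported})")
```

Normalizing by prompt tokens is what makes the residual comparable across the two prompt lengths in the corpus (3168 and 3455 tokens).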

Validation artifacts:

Studio JSON: /tmp/model-fit-validation-studio-first-token-residual-vocab.json
Studio Markdown: /tmp/model-fit-validation-studio-first-token-residual-vocab-{first_token,all}.md
white.local JSON: /tmp/model-fit-validation-white-first-token-residual-vocab.json
white.local Markdown: /tmp/model-fit-validation-white-first-token-residual-vocab-{first_token,all}.md

Interpretation: steady decode remains good on the clean CUDA host, but the latest Studio pass shows enough run-to-run instability that we should improve smoke-test denoising before using Studio as a strict gate. First-token now has a source-shaped residual column. The next honest estimator step is to add a model-independent sampler-history hardware fact or otherwise measure that source-shaped work without using per-model observed throughput.

i386 changed the title ~~[codex] Improve metadata-based model fit validation~~ Metadata-based model fit validation May 31, 2026

i386 changed the title ~~Metadata-based model fit validation~~ Metadata-based model fit May 31, 2026

Add metadata-driven model fit validation

8a9d5a2

i386 force-pushed the codex/model-fit-metadata-validation branch from 45e8ef5 to 8a9d5a2 May 31, 2026 06:39

i386 added 5 commits May 31, 2026 16:49

Tighten model-fit validation accuracy gates

70d354f

Document empirical heuristic rule

1a84a17

Remove unsupported Q8 fit scaling

37dc4de

Add GGUF matmul profile for model fit

1185309

Generalize GGUF matmul fit profile

6d05f67

i386 added the Do not merge label May 31, 2026

i386 added 20 commits June 1, 2026 09:52

Validate model fit against Skippy decode ABI

41f32b3

Refine model-fit validation scenarios

0bee4fa

Ground Q8 and MoE fit estimates in source behavior

1361b7d

Add measured prefill matmul fit input

4ef322c

Add MoE prefill probe and first-token breakdown

f311861

Add cross-backend prefill matmul facts

b7b8fdb

Document ROCm probe verification TODO

70154e7

Add post-prefill decode overhead probe

d6d833d

Account for expanded FFN decode graph stages

f86c290

Add ubatch-shaped prefill benchmark fact

6eeb591

Expose first-token validation residuals

1e23a01

Add sampler facts to model-fit validation

27283d8

Add model-fit validation heartbeats

e4912c2

Abort model-fit validation on runtime startup failure

b730cc9

Clarify model-fit validation backend handling

2a780d3

Avoid CPU fallback for accelerated generation fits

89382eb

Skip ABI probe for non-local model-fit results

5aab732

Document split candidate validation semantics

96b09a7

Make model-fit local-only and add decode probe schema

9b09a1b

Suppress throughput estimates for rejected local fits

7a0d12d

i386 added 3 commits June 2, 2026 13:07

Prefer fit status in validation skip reasons

d4c090e

Add decode kernel probe diagnostics

0206f25

Add GGML decode kernel probes

2b15111

i386 wants to merge 29 commits into main from codex/model-fit-metadata-validation
