Metadata-based model fit #768
Draft
i386 wants to merge 29 commits into
Summary
This PR turns `model-fit` into a metadata-first fit estimator with a repeatable validation loop against real Skippy single-stage benchmark runs. The branch is broader than the scorer alone: it adds GGUF tensor-group profiling, measured-hardware diagnostics, model download progress, benchmark timing instrumentation, validation binaries, checked-in validation corpora, documentation, and a GitHub workflow that can gate estimator changes on self-hosted GPU runners.
Users and agents can now point `model-fit-validate` at Hugging Face model refs, local GGUFs, or checked-in manifest files and get a JSON report containing hardware facts, measured GPU bandwidth, GGUF-derived model profiles, workload recommendations, benchmark observations, and per-scenario estimator agreement. `model-fit-check-validation` can turn that JSON into a Markdown report and fail CI when accuracy, noise, or minimum coverage thresholds are missed.
Branch Scope
- GGUF tensor-group profiling in `model-artifact` so `model-fit` can distinguish attention, dense FFN, expert FFN, embeddings, output, normalization, and other tensors.
- Measured hardware diagnostics from `mesh-llm gpus benchmark` into `model-fit` hardware profiles.
- `model-fit-validate` for model refs, local `--model ref=...,path=...` inputs, `--models-file` manifests, automatic or supplied GPU benchmark JSON, progress reporting, repeated Skippy runs, steady decode, first-token, and warm KV reuse scenarios.
- `model-fit-check-validation` to render Markdown summaries and enforce thresholds for median error, individual error, noisy samples, and minimum model count.
- `skippy-bench local-single` with request counts, session reuse, per-request observations, and decode-only throughput.
- Timing fields in the `/v1/text` response so validation can compare steady decode without tokenization/prefill noise.
- Progress events in `model-hf` so validation can show file download and ready-state progress while preparing GGUFs.
- `.github/workflows/model-fit-validation.yml` with normal metadata checks plus self-hosted GPU smoke validation and manual smoke/deep dispatch.
- `crates/model-fit/README.md` and `crates/model-fit/validation/README.md` covering the estimator inputs, report shape, and validation corpora.
Validation Corpora
The model set is stratified by estimator behavior rather than just popularity.
Smoke corpus: `crates/model-fit/validation/smoke-models.txt`
Deep corpus: `crates/model-fit/validation/deep-models.txt`
The deep set expands coverage to Q4/Q8 pairs, 12B/14B/32B dense models, more coder models, Qwen3 30B-A3B MoE variants, embedding GGUFs, and reranker GGUFs. It is intended for manual or nightly high-memory self-hosted runners rather than every PR.
Validation Data
Latest local steady-decode validation run:
Summary from the JSON report:
The small-model result is the important one. SmolLM2 initially looked noisy when the validator sampled only the final request of each repeat. Aggregating generated tokens over aggregate decode time per repeat changed the steady-decode observation to `1.02` observed/fit with `3.9%` spread, which better reflects sustained decode and avoids tuning the estimator against request jitter.
How The Fit Algorithm Works
`model-fit` remains metadata-first and deterministic. It does not use filenames, catalog reputation, or model-specific boosts to predict throughput. It consumes a `HardwareProfile`, a GGUF-derived `ModelProfile`, and a `SelectionConfig`.
For memory fit, the selector estimates runtime memory as resident weights plus KV cache plus scratch/backend overhead. KV cache is estimated from layer count, target context, KV heads, key/value widths, and configured KV cache type. If GGUF metadata is incomplete, it falls back to hidden size as a conservative KV width. A model is rejected locally when estimated runtime memory exceeds the usable memory budget after the safety margin.
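A minimal sketch of the memory-fit arithmetic described above, not the crate's actual code. The `KvShape` struct, its field names, and both function names are hypothetical. Two details are assumptions made for illustration: the fallback charges hidden size once for keys and once for values, and the safety margin is modeled as a fraction of the usable budget.

```rust
/// Hypothetical KV-cache inputs; field names are illustrative only.
#[derive(Debug, Clone, Copy)]
struct KvShape {
    layers: u64,
    context_tokens: u64,
    kv_heads: Option<u64>,
    key_width: Option<u64>,   // per-head key width, in elements
    value_width: Option<u64>, // per-head value width, in elements
    hidden_size: u64,
    bytes_per_element: f64,   // e.g. 2.0 for an f16 KV cache
}

/// KV cache bytes = layers * context * per-token KV elements * element size.
fn kv_cache_bytes(s: &KvShape) -> f64 {
    let per_token_elems = match (s.kv_heads, s.key_width, s.value_width) {
        // Complete metadata: heads * (key width + value width).
        (Some(heads), Some(k), Some(v)) => heads * (k + v),
        // Incomplete metadata: assumed conservative fallback of hidden size
        // for keys plus hidden size for values.
        _ => 2 * s.hidden_size,
    };
    (s.layers * s.context_tokens * per_token_elems) as f64 * s.bytes_per_element
}

/// Reject locally when weights + KV + overhead exceed the budget that is
/// left after the safety margin (modeled here as a fraction, an assumption).
fn fits_locally(weights: f64, kv: f64, overhead: f64, usable: f64, safety_margin: f64) -> bool {
    weights + kv + overhead <= usable * (1.0 - safety_margin)
}
```

With 32 layers, a 4096-token context, 8 KV heads, and 128-wide keys and values at 2 bytes per element, the sketch yields 512 MiB of KV cache; dropping the head metadata raises that to 2 GiB through the fallback, which is what makes the fallback conservative.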
For dense transformer decode, the primary estimate is active bytes per token divided by effective memory bandwidth, plus fixed and shape overheads. Active bytes come from GGUF tensor groups when available, so embeddings and output tensors do not have to be treated the same as per-token transformer work. Measured hardware bandwidth comes from `mesh-llm gpus benchmark`, using p90 bandwidth and benchmark noise. Once bandwidth is measured, the decode-efficiency path is backend-neutral so the same measured profile does not get extra Metal/CUDA/ROCm assumptions layered on top.
For sparse MoE models, decode uses base/resident bytes plus only the active routed expert share. This avoids treating every expert as active for every token while still including resident attention, shared FFN, normalization, output, and other tensor groups when available. MoE also carries a per-layer dispatch overhead so routing work is visible in the latency estimate.
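The dense and MoE decode estimates above can be sketched roughly as follows. This is an illustration under stated assumptions, not the estimator itself: `ActiveBytes`, its fields, and the function names are hypothetical, and the single additive overhead term stands in for the fixed and shape overheads the PR describes.

```rust
/// Hypothetical byte groups derived from GGUF tensor groups.
struct ActiveBytes {
    resident: f64,     // attention, shared FFN, norms, output, other
    expert_total: f64, // all routed expert FFN bytes (0 for dense models)
    experts: u32,      // routed experts per layer (0 for dense models)
    experts_used: u32, // routed experts active per token
}

/// Dense models read all resident bytes; MoE adds only the active expert share.
fn active_bytes_per_token(b: &ActiveBytes) -> f64 {
    if b.experts == 0 {
        return b.resident;
    }
    b.resident + b.expert_total * (b.experts_used as f64 / b.experts as f64)
}

/// Per-token decode latency in milliseconds:
/// bytes / bandwidth + fixed overhead + per-layer MoE dispatch overhead.
fn decode_ms_per_token(
    b: &ActiveBytes,
    bandwidth_gb_per_s: f64,
    fixed_overhead_ms: f64,
    moe_layers: u32,
    dispatch_ms_per_layer: f64,
) -> f64 {
    let transfer_ms = active_bytes_per_token(b) / (bandwidth_gb_per_s * 1e9) * 1e3;
    transfer_ms + fixed_overhead_ms + moe_layers as f64 * dispatch_ms_per_layer
}
```

For example, 4 GB of active bytes at 400 GB/s with 0.5 ms of fixed overhead gives about 10.5 ms per token. A MoE model with 1 GB resident and 6 GB of experts, using 8 of 64 experts, reads 1.75 GB per token rather than 7 GB.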
For tiny and narrow models, a pure bytes/bandwidth estimate overpredicts because fixed runtime, scheduling, and kernel overhead become a large share of token latency. The estimator applies generic low-active-byte and small-width overheads based on active bytes and hidden width. This is not model-name calibration; it applies to future GGUFs with similar geometry.
For prefill and first-token latency, the selector derives prefill throughput from decode throughput times a shape-based parallelism factor. Prefill exposes much more parallelism than one-token decode, especially for smaller models, but arbitrary GGUF metadata is not rich enough for a stable full prefill simulator. First-token latency combines prompt_tokens / estimated_prefill_tps with one decode step. During validation, first-token scenarios rescore with the observed prompt token count so the comparison uses the prompt that was actually benchmarked.
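As a rough sketch of that first-token arithmetic (function and parameter names are hypothetical, and the parallelism factor is passed in rather than derived from model shape):

```rust
/// Prefill throughput derived from decode throughput and a shape-based factor.
fn prefill_tps(decode_tps: f64, parallelism_factor: f64) -> f64 {
    decode_tps * parallelism_factor
}

/// First-token latency in milliseconds:
/// prompt_tokens / estimated_prefill_tps plus one decode step.
fn first_token_ms(prompt_tokens: u32, estimated_prefill_tps: f64, decode_tps: f64) -> f64 {
    let prefill_ms = prompt_tokens as f64 / estimated_prefill_tps * 1e3;
    let decode_step_ms = 1e3 / decode_tps;
    prefill_ms + decode_step_ms
}
```

With 50 tokens/s decode and an assumed factor of 20, prefill runs at 1000 tokens/s, so a 512-token prompt costs about 512 ms of prefill plus a 20 ms decode step. Rescoring with the observed prompt token count only changes the `prompt_tokens` argument.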
For non-chat workloads, workload suitability is separate from throughput fit. Embedding, classifier/reranker, chat, tool-use, coding, long-context, summarization, and related workload profiles can weight memory, context, decode, prefill, and capability evidence differently. Embedding-like models are not rejected globally; they are accepted or penalized according to the requested workload.
For oversized models, the selector can return split-candidate recommendations instead of pretending local fit failed completely. The split estimate is memory-budget based and warns that inter-stage activation transfer depends on hidden width, layer count, and network bandwidth.
Validator Shape
`model-fit-validate` prepares a self-contained report by:
- `model-fit`
The steady-decode scenario prefers decode-only timings from `/v1/text`, excluding tokenization and prefill. It also increases the generated-token window for tiny active-byte models and aggregates generated tokens over aggregate decode time per repeat to reduce jitter.
Workflow Shape
The new workflow is `.github/workflows/model-fit-validation.yml`.
For pull requests touching model-fit or the benchmark/server pieces it runs metadata checks and schedules a self-hosted GPU smoke job. The GPU job:
- `target/release/mesh-llm gpus benchmark --json`
- `model-fit-validate --models-file crates/model-fit/validation/smoke-models.txt`
- `model-fit-check-validation`
- `$GITHUB_STEP_SUMMARY`
Manual runs can choose `smoke` or `deep` via `workflow_dispatch`, and the runner labels can be supplied explicitly or through `MODEL_FIT_GPU_RUNS_ON`.
Protocol / Compatibility
- `/v1/text` JSON response gains additive timing fields: `tokenize_elapsed_ms`, `prefill_elapsed_ms`, and `decode_elapsed_ms`. Older consumers can ignore them; the new validator and benchmark path use them when present and fall back where possible.
- `model-fit` JSON-facing structs gain additive fields for tensor groups, estimate ranges, prefill/first-token estimates, and hardware benchmark diagnostics.
Implementation Notes
- `model-artifact` now scans GGUF tensor names into semantically useful byte groups.
- `model-fit` exposes `TensorGroupBytes`, decode and first-token estimate ranges, measured decode efficiency, measured-GPU overhead, and benchmark diagnostics.
- `model-fit-validate` supports positional model refs, `--model-ref`, local `--model ref=...,path=...`, and `--models-file` with blank-line and `#` comment support.
- `model-fit-check-validation` reads validator JSON, renders a Markdown table, and enforces thresholds such as median absolute error, individual error, noisy sample count, and minimum model count.
- `model-hf` exposes progress events for ensuring, starting, progressing, ready, and complete download states.
- `skippy-bench local-single` can issue multiple requests, reuse a session, and report per-request decode-only throughput.
Validation
Passed:
- `cargo check -p model-fit --bins`
- `cargo test -p model-fit --lib`
- `cargo clippy -p model-fit --all-targets -- -D warnings`
- `cargo check -p skippy-server`
- `cargo clippy -p skippy-server --all-targets -- -D warnings`
- `cargo test -p skippy-server --lib`
- `cargo check -p skippy-bench`
- `cargo clippy -p skippy-bench --all-targets -- -D warnings`
- `cargo test -p skippy-bench`
- `cargo fmt --all -- --check`
- `cargo run -p model-fit --bin model-fit-check-validation -- --min-models 4 --markdown-out /tmp/model-fit-check.md /tmp/model-fit-validation-aggregate-steady-small.json`
- `cargo run -p xtask -- repo-consistency ci-crate-lists`
Known existing repo-consistency issue surfaced by release checks:
- `cargo run -p xtask -- repo-consistency release-targets` currently fails because `model-fit` is listed as publishable but has unpublished workspace dependencies missing from `scripts/publish-crates.sh` (`mesh-llm-gpu-bench`, then `mesh-llm-system`, likely more). I did not expand the publish pipeline in this PR.
- `cargo run -p xtask -- repo-consistency publish-crates` fails for the same existing publish-chain reason.
Latest ABI Decode Validation
This update adds the denoised Skippy decode ABI probe and replaces the old low-active/width decode penalties with a source-grounded graph-overhead term. The estimator still does not use model names, filenames, catalog reputation, or observed throughput as inputs. It uses GGUF tensor groups, tensor type mix, logical matmul shape counts, layer count, measured decode-shaped bandwidth, and measured fixed backend submission overhead.
Source anchors checked in the pinned llama.cpp tree:
- `src/llama-arch.cpp`: `GGML_OP_MUL_MAT`; MoE expert tensors map to `GGML_OP_MUL_MAT_ID`.
- `src/llama-graph.cpp`
- `ggml/src/ggml-metal/ggml-metal.metal`
- `ggml/src/ggml-cuda`: `MUL_MAT_ID` paths, with different single-token and multi-token behavior.
- `ggml/src/ggml-cpu/ggml-cpu.c` and `repack.cpp`
Latest two-machine five-model rerun:
Validation artifacts from this run:
- `/tmp/model-fit-validation-studio-graph2.json`
- `/tmp/model-fit-validation-studio-graph2.md`
- `/tmp/model-fit-validation-white-graph2.json`
- `/tmp/model-fit-validation-white-graph2.md`
Residual misses are reported as misses. They were not hidden by widening thresholds or adding model-specific exceptions.
Broader Validation Rerun
This update also adds the `prefill` validation scenario and fixes execution-budget selection so a model that fits both CPU and GPU chooses the measured execution path with the better metadata-only throughput estimate before using memory headroom as a tie-breaker. The bug was general: a high-headroom CPU budget could beat a measured GPU budget for one model, causing the recommendation to describe the wrong execution path. The fix is backend-neutral and evidence-ranked; it does not special-case Metal, CUDA, model names, filenames, or current-run measurements.
Broader corpus shape: 18 GGUF refs covering tiny dense models, small dense transition models, a Qwen2.5-Coder 7B Q4/Q5/Q6/Q8 slope, Qwen3 8B, one active-expert MoE, and embedding/reranker metadata cases.
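The execution-budget ranking described in this update (throughput estimate first, memory headroom only as a tie-breaker) could look roughly like the sketch below. `ExecutionBudget` and `pick_budget` are hypothetical names; the `label` field is carried only for display and is never used in the comparison, matching the backend-neutral intent.

```rust
/// Hypothetical candidate budget for a model that already fits in memory.
#[derive(Debug, Clone, PartialEq)]
struct ExecutionBudget {
    label: &'static str, // display only; not used for ranking
    est_decode_tps: f64, // metadata-only throughput estimate
    headroom_bytes: u64, // memory left after the model is placed
}

/// Pick the budget with the best throughput estimate; break ties on headroom.
fn pick_budget(fitting: &[ExecutionBudget]) -> Option<&ExecutionBudget> {
    fitting.iter().max_by(|a, b| {
        a.est_decode_tps
            .total_cmp(&b.est_decode_tps)
            .then(a.headroom_bytes.cmp(&b.headroom_bytes))
    })
}
```

Under this ordering a budget with 40 GB of headroom and an 8 tokens/s estimate loses to one with 4 GB of headroom and a 60 tokens/s estimate, which is the case the original bug got wrong.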
Final broader validation:
`prefill_elapsed_ms`; still outside the steady-decode target band.
Representative steady-decode rows:
- `unsloth/SmolLM2-135M-Instruct-GGUF:Q8_0`
- `unsloth/Qwen3-0.6B-GGUF:Q4_K_M`
- `bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`
- `unsloth/Qwen3-8B-GGUF:Q4_K_M`
- `bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`
- `gpustack/bge-reranker-v2-m3-GGUF:Q8_0`
Validation artifacts from the broader run:
- `/tmp/model-fit-validation-studio-broader-final.json`
- `/tmp/model-fit-validation-studio-broader-final.md`
- `/tmp/model-fit-validation-white-broader-final.json`
- `/tmp/model-fit-validation-white-broader-final.md`
Focused Q8/MoE Follow-Up
This follow-up addresses the two visible steady-decode misses from the broader run without using model names, filenames, backend-specific constants, or current-run throughput as estimator inputs.
Source-grounded changes:
- `block_q8_0` directly, and CUDA MMVQ/MMQ using q8-specific vec-dot helpers; there is no source basis for charging less than the stored quantized bytes.
- `MUL_MAT_ID`, weighting, and aggregation work, but CUDA can also fuse eligible expert paths and use optimized quantized kernels, so one backend-independent hand-written MoE dispatch constant was too conservative.
Focused corpus: `crates/model-fit/validation/q8-moe-models.txt`, covering Qwen2.5-Coder 7B Q4/Q5/Q6/Q8 slope, OLMoE active-expert MoE, and a Q8 reranker control.
MoE Prefill Probe and First-Token Breakdown
This update adds a CUDA `prefill_moe_matmul_tflops_fp16` hardware fact from a strided-batched FP16 cuBLAS GEMM shaped like active-expert MoE prefill. The field is carried through `mesh-llm gpus benchmark`, cached system GPU facts, `model-fit` `HardwareProfile`, validator JSON, and the fit-input contract.
The scorer does not treat that raw MoE GEMM probe as a free speedup. It is used as an upper bound because llama.cpp MoE prefill has router/top-k, expert-id mapping, `GGML_OP_MUL_MAT_ID`, weighting, and aggregation work around the expert GEMMs. When both the measured MoE roofline and the older MoE-aware fallback exist, the scorer uses the lower throughput. That is the conservative source-grounded rule, and the unit test locks it in.
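The conservative rule can be illustrated with a small sketch. The function name is hypothetical, and the behavior when only one estimate exists (use whichever is present) is an assumption; the text above only specifies the case where both exist.

```rust
/// Combine a measured MoE roofline with the older MoE-aware fallback.
/// When both exist, the lower throughput wins, so the measured probe acts
/// as an upper bound rather than a speedup.
fn moe_prefill_tps(measured_roofline_tps: Option<f64>, fallback_tps: Option<f64>) -> Option<f64> {
    match (measured_roofline_tps, fallback_tps) {
        (Some(measured), Some(fallback)) => Some(measured.min(fallback)),
        // Assumption: with only one estimate available, use it as-is.
        (measured, fallback) => measured.or(fallback),
    }
}
```

A measured roofline of 900 tokens/s against a 600 tokens/s fallback therefore yields 600, while a measured 400 against the same fallback yields 400.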
The validator now also serializes first-token components:
Latest focused white.local CUDA rerun after the MoE probe and first-token split:
Representative first-token component rows from that run:
The next honest target is therefore a first-decode-after-prefill hardware fact
or source-derived model for session/prompt-transition decode, not a wider error
band and not a model-name exception.
Representative focused steady-decode rows:
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`
- `bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`
- `gpustack/bge-reranker-v2-m3-GGUF:Q8_0`
Validation artifacts from the focused run:
- `/tmp/model-fit-validation-studio-q8-moe.json`
- `/tmp/model-fit-validation-studio-q8-moe.md`
- `/tmp/model-fit-validation-white-q8-moe.json`
- `/tmp/model-fit-validation-white-q8-moe.md`
Dense Prefill Roofline Follow-Up
This update splits prefill from decode without adding backend-specific rules.
`mesh-llm gpus benchmark` now reports an optional generic `prefill_matmul_tflops_fp16` hardware fact. `model-fit` consumes that field for dense transformer prefill roofline estimates, using GGUF matmul FLOPs, prompt tokens, llama.cpp ubatch shape, active weight bytes, measured memory bandwidth, and measured graph overhead.
Sparse MoE intentionally does not use the dense matmul probe. llama.cpp routes MoE through expert selection and `GGML_OP_MUL_MAT_ID` plus id mapping, weighting, and aggregation, so dense GEMM throughput is not the right hardware fact for active-expert prefill. MoE stays on the existing fallback until there is a measured MoE-shaped hardware probe.
Focused white.local CUDA rerun after the corrected wiring:
Focused Studio Metal rerun with the dense prefill split but without a Metal matmul probe yet:
Validation artifacts:
- `/tmp/model-fit-validation-studio-prefill-roofline.json`
- `/tmp/model-fit-validation-studio-prefill-roofline.md`
- `/tmp/model-fit-validation-white-prefill-matmul4.json`
- `/tmp/model-fit-validation-white-prefill-matmul4.md`
Cross-Backend Prefill Probe Coverage
This update keeps the estimator backend-neutral while broadening the hardware facts that can feed it:
- `prefill_matmul_tflops_fp16: 112.31` and `prefill_moe_matmul_tflops_fp16: 84.44`.
- `prefill_matmul_tflops_fp16: 11.8` and `prefill_moe_matmul_tflops_fp16: 9.02`.
- `CpuProfile` can carry the same optional facts and the scorer consumes them
The scoring rule is unchanged: model-fit consumes optional measured facts by name and does not branch on backend names. Backend-specific code exists only in benchmark implementations because each native runtime has a different API for measuring the same semantic operation.
Additional validation for this update:
- `mesh-llm gpus benchmark --json` on Mac Studio M1 Ultra with Metal emitted both dense and MoE prefill facts.
- `mesh-llm gpus benchmark --json` on white.local with CUDA emitted both dense and MoE prefill facts.
- `cargo check -p mesh-llm-gpu-bench --features cuda` on white.local passed.
- `cargo test -p model-fit --lib` passed.
- `cargo clippy -p mesh-llm --all-targets -- -D warnings` passed.
- `cargo fmt --all -- --check` and `git diff --check` passed.
Latest Post-Prefill Validation
This update adds a measured `post_prefill_decode_overhead_ms` fact to `mesh-llm gpus benchmark` for CUDA, Metal, and ROCm/HIP. The fit estimator uses it only as a lower-bound first-token transition cost: it measures issuing decode-shaped work after prefill-shaped matmul work without a GGUF model loaded. It does not absorb tokenizer, HTTP, llama.cpp graph/session, sampling, or observed model benchmark residuals.
Current broader validation after this probe:
Full steady-decode comparison from the broader rerun:
- `unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M`
- `unsloth/SmolLM2-135M-Instruct-GGUF:Q8_0`
- `unsloth/Qwen3-0.6B-GGUF:Q4_K_M`
- `unsloth/Qwen3-0.6B-GGUF:Q8_0`
- `bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`
- `bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q8_0`
- `unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M`
- `unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M`
- `unsloth/Qwen3-4B-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`
- `unsloth/Qwen3-8B-GGUF:Q4_K_M`
- `bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`
- `gpustack/bge-reranker-v2-m3-GGUF:Q8_0`
Validation artifacts from the post-prefill broader run:
- `/tmp/model-fit-validation-studio-postprefill-broader.json`
- `/tmp/model-fit-validation-studio-postprefill-broader.md`
- `/tmp/model-fit-validation-white-postprefill-broader.json`
- `/tmp/model-fit-validation-white-postprefill-broader.md`
Residual misses are still reported as misses. This branch has not added backend-specific estimator constants or model-specific exceptions to make the table look better.
Expanded FFN Decode Graph Validation
This update fixes two issues found in the Studio Metal follow-up:
- `text-classification`, `reranker`, `ranking`, or `ranker` now classify as `RerankerOrClassifier` before dense transformer shape fallback. This keeps reranker GGUFs out of autoregressive decode validation and lets workload scoring choose the right task shape.
- `decode_fixed_overhead_ms` from the hardware profile, so it is not a backend-name branch and it barely moves low-dispatch-latency profiles.
Studio affected rerun:
- `unsloth/Qwen3-0.6B-GGUF:Q4_K_M`
- `unsloth/Qwen3-0.6B-GGUF:Q8_0`
- `bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`
- `bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q8_0`
- `unsloth/Llama-3.2-3B-Instruct-GGUF:Q4_K_M`
- `unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M`
- `unsloth/Qwen3-4B-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`
- `unsloth/Qwen3-8B-GGUF:Q4_K_M`
- `gpustack/bge-reranker-v2-m3-GGUF:Q8_0`
The affected steady-decode median absolute error is 4.0% across accuracy-gated samples, but the report intentionally still fails strict validation because Qwen2.5-Coder 7B Q8_0 was outside tolerance and one sample was noisy. A focused rerun contradicted that specific row: Q8_0 came back faster-than-fit at 1.28 while Q4_K_M came back slower-than-fit at 0.78. I did not add a Q8 correction from contradictory Studio measurements.
First-token remains separate work. The breakdown shows most misses come from long-prompt prefill and first decode after prefill, not steady decode. That should be addressed with a separate source-grounded model or benchmark fact, not by distorting steady-decode fit.
Validation artifacts:
- `/tmp/model-fit-validation-studio-expanded-ffn-affected.json`
- `/tmp/model-fit-validation-studio-expanded-ffn-affected.md`
- `/tmp/model-fit-validation-studio-expanded-ffn-q8-rerun.json`
Latest Ubatch Prefill Validation
This update adds `prefill_ubatch_matmul_tflops_fp16` to `mesh-llm gpus benchmark` for CUDA, Metal, and ROCm/HIP implementations and carries it through system GPU facts, validator JSON, and `model-fit` hardware profiles.
Source rationale: llama.cpp prompt processing is ubatch-shaped (`n_ubatch`, commonly 512), so a prompt/prefill roofline should prefer a measured ubatch-shaped GEMM over a generic square GEMM when available. The scorer consumes this as a named hardware fact and still does not branch on Metal/CUDA/ROCm names, model names, filenames, catalog reputation, or observed throughput.
Focused Mac Studio M1 Ultra rerun after this change:
- `unsloth/Qwen3-0.6B-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen3-8B-GGUF:Q4_K_M`
First-token breakdown from the same run:
Rendered first-token checker result for this focused subset:
Important caveat: the local benchmark for this subset reported high benchmark noise (`noise_pct: 26.31`). The subset is useful evidence that ubatch-shaped prefill is the right measurement shape, but it is not broad proof that first-token is solved. The first decode after prefill remains visibly under-modeled: total first-token can match because prefill dominates, while the decode component is still much larger than predicted.
Validation commands added for this update:
`model-fit-check-validation --scenario first_token` intentionally exited non-zero for the focused subset because Qwen3 0.6B Q4 remained outside the strict individual tolerance. The report was still rendered and the miss is documented above.
Next validation target: rerun the broader smoke/deep validation on Studio and white.local using the new ubatch-shaped hardware fact under lower-noise benchmark conditions, then add a dedicated source-grounded first-decode-after-prefill model or probe instead of bending steady-decode fit.
Broader Ubatch Rerun
After adding the ubatch-shaped prefill hardware fact, I reran the broader validation corpus on both Mac Studio M1 Ultra/Metal and white.local RTX 5080/CUDA.
Hardware facts from this run:
Scenario summary:
Representative steady-decode rows:
- `unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen3-0.6B-GGUF:Q4_K_M`
- `bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`
- `unsloth/Phi-4-mini-instruct-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q5_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q6_K`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q8_0`
- `unsloth/Qwen3-8B-GGUF:Q4_K_M`
- `bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`
Artifacts:
- `/tmp/model-fit-validation-studio-ubatch-broader.json`
- `/tmp/model-fit-validation-studio-ubatch-broader-{steady_decode,prefill,first_token,kv_warm_reuse,all}.md`
- `/tmp/model-fit-validation-white-ubatch-broader.json`
- `/tmp/model-fit-validation-white-ubatch-broader-{steady_decode,prefill,first_token,kv_warm_reuse,all}.md`
Interpretation: steady decode is now within the 10% median target on both machines, but the residual per-model misses are real and should stay visible. Prefill improved compared with earlier CUDA results but is not in band broadly. First-token remains a separate problem, and KV warm reuse is close but still outside strict median target. No thresholds were widened and no model/backend-specific estimator constants were added for this rerun.
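To make the distinction between a passing median and visible per-model misses concrete, here is a hedged sketch of a threshold gate over observed/fit ratios. It is not the `model-fit-check-validation` implementation: `GateReport`, `check_scenario`, and the parameters are hypothetical, the noisy-sample check is omitted, and only the 10% median target comes from the text above.

```rust
/// Hypothetical summary of one scenario's gate result.
#[derive(Debug, PartialEq)]
struct GateReport {
    median_abs_error: f64,
    individual_misses: usize,
    passed: bool,
}

/// Gate a scenario on median absolute error, per-model error, and coverage.
/// Each input is an observed/fit ratio, so 1.0 means perfect agreement.
fn check_scenario(
    observed_over_fit: &[f64],
    max_median: f64,
    max_individual: f64,
    min_models: usize,
) -> GateReport {
    let mut errors: Vec<f64> = observed_over_fit.iter().map(|r| (r - 1.0).abs()).collect();
    errors.sort_by(|a, b| a.total_cmp(b));
    let n = errors.len();
    let median_abs_error = match n {
        0 => f64::INFINITY, // no samples can never pass the median check
        _ if n % 2 == 1 => errors[n / 2],
        _ => (errors[n / 2 - 1] + errors[n / 2]) / 2.0,
    };
    let individual_misses = errors.iter().filter(|e| **e > max_individual).count();
    let passed = n >= min_models && median_abs_error <= max_median && individual_misses == 0;
    GateReport { median_abs_error, individual_misses, passed }
}
```

With illustrative ratios `[1.02, 0.95, 1.28, 0.78, 1.05]`, a 10% median target, and an assumed 25% individual tolerance, the median absolute error is about 5% but the 1.28 row still counts as a miss, so the gate fails even though the median is in band.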
First-Token Residual Validation
This update adds explicit first-token residual diagnostics instead of hiding the miss inside a tuned estimator constant. The validator now records prompt token count, tokenizer vocab size, predicted prefill/decode/transition pieces, observed tokenize/prefill/decode timing, and sampled-decode residual per prompt token.
`model-artifact` also derives `vocab_size` from `tokenizer.ggml.tokens` when GGUFs omit an explicit `*.vocab_size` field.
Source rationale: Skippy's sampled first decode calls into llama.cpp sampling after prefill, and the chat sampling path syncs sampler history across the prompt before sampling the first generated token. That is a real source-shaped O(prompt tokens) term, separate from steady decode and separate from the lower-bound `post_prefill_decode_overhead_ms` hardware probe. The current change reports that residual; it does not yet add a guessed sampler-throughput constant to model-fit.
Latest corrected broader rerun:
First-token residual examples from the CUDA rerun:
- `unsloth/SmolLM2-135M-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen3-0.6B-GGUF:Q4_K_M`
- `bartowski/LGAI-EXAONE_EXAONE-4.0-1.2B-GGUF:Q4_K_M`
- `unsloth/Qwen2.5-Coder-7B-Instruct-GGUF:Q4_K_M`
- `unsloth/Qwen3-8B-GGUF:Q4_K_M`
- `bartowski/OLMoE-1B-7B-0924-Instruct-GGUF:Q4_K_M`
Validation artifacts:
- `/tmp/model-fit-validation-studio-first-token-residual-vocab.json`
- `/tmp/model-fit-validation-studio-first-token-residual-vocab-{first_token,all}.md`
- `/tmp/model-fit-validation-white-first-token-residual-vocab.json`
- `/tmp/model-fit-validation-white-first-token-residual-vocab-{first_token,all}.md`
Interpretation: steady decode remains good on the clean CUDA host, but the latest Studio pass shows enough run-to-run instability that we should improve smoke-test denoising before using Studio as a strict gate. First-token now has a source-shaped residual column. The next honest estimator step is to add a model-independent sampler-history hardware fact or otherwise measure that source-shaped work without using per-model observed throughput.
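One plausible way to compute the sampled-decode residual per prompt token is sketched below. The exact definition used by the validator is not shown in this description, so the formula (observed first-decode time minus the predicted decode and transition pieces, divided by prompt tokens) is an assumption, as are the struct and function names.

```rust
/// Hypothetical first-token sample; all times are in milliseconds.
struct FirstTokenSample {
    prompt_tokens: u32,
    predicted_decode_ms: f64,
    predicted_transition_ms: f64,
    observed_decode_ms: f64,
}

/// Unexplained first-decode time, normalized by prompt length.
/// Returns None for an empty prompt to avoid dividing by zero.
fn residual_ms_per_prompt_token(s: &FirstTokenSample) -> Option<f64> {
    if s.prompt_tokens == 0 {
        return None;
    }
    let residual_ms = s.observed_decode_ms - (s.predicted_decode_ms + s.predicted_transition_ms);
    Some(residual_ms / s.prompt_tokens as f64)
}
```

If the residual is roughly constant per prompt token across models on the same machine, that would support treating it as an O(prompt tokens) sampler-history cost rather than a per-model effect.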