Make the constrained decoder the only llama decode path by FuJacob · Pull Request #532 · FuJacob/cotabby

FuJacob · 2026-06-02T04:38:50Z

Summary

The constrained decoder is now validated on device, so it becomes the only llama decode path for every user, no flag required. This removes the dogfood gates and the code they guarded.

Removed the cotabbyConstrainedDecoderEnabled and cotabbyFillInMiddleEnabled feature flags. The constrained decoder and fill-in-middle are now always on (FIM still gated by its real preconditions: a genuine mid-line caret and a model that ships the FIM markers).
Removed the useConstrainedDecoder option and the routing guard; generate() routes straight to the greedy or beam constrained decoder.
Removed the now-unreachable runEngineSampledDecode and its extractPiece helper (the old stochastic sampler path). Net change is -92 lines.
Kept cotabbyConstrainedBeamWidth as a tuning knob (greedy by default), not a feature gate: it keeps the beam path reachable and is the basis for the batched-beam work, which is what beam-by-default needs to avoid per-branch latency.

Validation

xcodebuild ... test ... -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests -only-testing:CotabbyTests/FillInMiddlePolicyTests
# ** TEST SUCCEEDED **   swiftlint --strict exit 0

Dogfooded on device against a local base model (greedy) before flipping.

Linked issues

None. Promotes the constrained decoder from dogfood flag to the default decode path.

Risk / rollout notes

Behavior change for all llama users: every local completion now goes through the constrained decoder instead of the stochastic sampler. Intended flip, validated on device first.
Greedy (beam width 1) ships; beam-by-default waits on the batched-beam work to avoid per-branch latency.
Known follow-up (not here): the decoder is deterministic, so the sampling parameters (temperature, top-k, top-p, min-p, seed) are now vestigial for token selection. They are still referenced (they configure the seed via samplingConfig, and feed SamplingFingerprint), so not dead symbols; removing them cleanly cascades into settings and request layers and is a separate change.

Greptile Summary

This PR promotes the constrained decoder from a dogfood-gated opt-in to the sole llama decode path, removing cotabbyConstrainedDecoderEnabled, cotabbyFillInMiddleEnabled, and the useConstrainedDecoder routing guard. The net result is -92 lines and a cleaner, fully deterministic decode path for all users.

Removed runEngineSampledDecode (and its extractPiece helper), the stochastic sampler path — generate() now routes unconditionally to runConstrainedDecode (greedy) or runConstrainedBeamDecode (beam).
Fill-in-middle is now always attempted when the caret is genuinely mid-line (not just when a developer flag was set), with the existing model-capability fallback still in place.
Sampling parameters (temperature, topP, minP, repetitionPenalty, seed) are now vestigial for token selection but remain in LlamaGenerationOptions and SamplingFingerprint; the PR description explicitly calls out cleanup as a follow-up.

Confidence Score: 4/5

The PR cleanly removes the old stochastic sampler path and its gating flags; all users now always go through the constrained decoder, which the author validated on device before this change.

The core decode routing change is straightforward and the constrained decoder already existed and was tested. FIM is now always attempted for mid-line carets, relying on the existing model-capability fallback — this is a subtle behavior change for users with non-FIM models in mid-line positions, but the fallback path appears sound. The vestigial sampling parameters remaining in SamplingFingerprint cause unnecessary KV cache invalidations when those settings change, but this is explicitly acknowledged as a follow-up. No correctness bugs are introduced.

Cotabby/Services/Runtime/LlamaSuggestionEngine.swift — the FIM always-on behavior for mid-line carets is new for all users and depends on the model-capability fallback working correctly for every deployed model configuration.

Important Files Changed

Filename	Overview
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	Removed runEngineSampledDecode, extractPiece, and the useConstrainedDecoder guard; generate() now routes directly to runConstrainedDecode or runConstrainedBeamDecode. samplingConfig is still constructed and passed to the engine for prompt decode seeding, so those parameters are not dead in all paths — but token selection is now entirely deterministic.
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift	Removed the cotabbyConstrainedDecoderEnabled and cotabbyFillInMiddleEnabled UserDefaults keys and their accessor properties. LlamaGenerationOptions is now constructed without useConstrainedDecoder. FIM is now always attempted for mid-line carets; SamplingFingerprint still includes vestigial sampling params — acknowledged as a follow-up cleanup.
Cotabby/Models/LlamaRuntimeModels.swift	Removed useConstrainedDecoder: Bool from LlamaGenerationOptions and updated the beamWidth doc comment. Clean, no logic issues.
Cotabby/Services/Runtime/LlamaRuntimeManager.swift	Comment-only update in generate() to drop the reference to sampleNext calls in the cancellation explanation. No logic changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LlamaSuggestionEngine.generateSuggestion] --> B[LlamaRuntimeManager.generate]
    B --> C[LlamaRuntimeCore.generate]
    C --> D{beamWidth > 1?}
    D -- Yes --> E[runConstrainedBeamDecode]
    D -- No --> F[runConstrainedDecode]
    E --> G[ConstrainedSampler + EngineBeamStepper]
    F --> H[ConstrainedSampler.selectToken greedy argmax]
    G --> I[Return highest-scoring beam]
    H --> J[Return greedy completion]

    style E fill:#d4edda,stroke:#28a745
    style F fill:#d4edda,stroke:#28a745

    subgraph Removed
        R1[runEngineSampledDecode]
        R2[engine.sampleNext]
        R3[extractPiece]
        R4[useConstrainedDecoder guard]
    end

_{Reviews (1): Last reviewed commit: "Make the constrained decoder the only ll..." | Re-trigger Greptile}

Validated on device, so it ships to everyone with no flag. Removes the cotabbyConstrainedDecoderEnabled and cotabbyFillInMiddleEnabled feature gates (both now always on), the useConstrainedDecoder option, and the now-unreachable runEngineSampledDecode plus its extractPiece helper. Generation routes straight to the greedy or beam constrained decoder. Beam width stays as a tuning knob (cotabbyConstrainedBeamWidth, greedy by default), not a feature gate: it keeps the beam path reachable and is the basis for the batched-beam work, which is what beam-by-default needs to avoid per-branch latency. Fill-in-middle is now unconditional, still gated by its real preconditions (a genuine mid-line caret and a model that ships the FIM markers).

…" (#539)

FuJacob merged commit 4ee8edd into main Jun 2, 2026
4 checks passed

FuJacob deleted the chore/enable-constrained-decode branch June 2, 2026 04:42

This was referenced Jun 2, 2026

Fix constrained-decode latency: bounded top-K, skip unused logSumExp #538

Merged

Revert "Make the constrained decoder the only llama decode path (#532)" #539

Merged

FuJacob added a commit that referenced this pull request Jun 2, 2026

Revert "Make the constrained decoder the only llama decode path (#532)…

23ebfbc

…" (#539)

FuJacob mentioned this pull request Jun 2, 2026

gif #540

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Make the constrained decoder the only llama decode path#532

Make the constrained decoder the only llama decode path#532
FuJacob merged 1 commit into
mainfrom
chore/enable-constrained-decode

FuJacob commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

FuJacob commented Jun 2, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Linked issues

Risk / rollout notes

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

FuJacob commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading