Skip to content

Make the constrained decoder the only llama decode path#532

Merged
FuJacob merged 1 commit into
mainfrom
chore/enable-constrained-decode
Jun 2, 2026
Merged

Make the constrained decoder the only llama decode path#532
FuJacob merged 1 commit into
mainfrom
chore/enable-constrained-decode

Conversation

@FuJacob
Copy link
Copy Markdown
Owner

@FuJacob FuJacob commented Jun 2, 2026

Summary

The constrained decoder is now validated on device, so it becomes the only llama decode path for every user, no flag required. This removes the dogfood gates and the code they guarded.

  • Removed the cotabbyConstrainedDecoderEnabled and cotabbyFillInMiddleEnabled feature flags. The constrained decoder and fill-in-middle are now always on (FIM still gated by its real preconditions: a genuine mid-line caret and a model that ships the FIM markers).
  • Removed the useConstrainedDecoder option and the routing guard; generate() routes straight to the greedy or beam constrained decoder.
  • Removed the now-unreachable runEngineSampledDecode and its extractPiece helper (the old stochastic sampler path). Net change is -92 lines.
  • Kept cotabbyConstrainedBeamWidth as a tuning knob (greedy by default), not a feature gate: it keeps the beam path reachable and is the basis for the batched-beam work, which is what beam-by-default needs to avoid per-branch latency.

Validation

xcodebuild ... test ... -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests -only-testing:CotabbyTests/FillInMiddlePolicyTests
# ** TEST SUCCEEDED **   swiftlint --strict exit 0

Dogfooded on device against a local base model (greedy) before flipping.

Linked issues

None. Promotes the constrained decoder from dogfood flag to the default decode path.

Risk / rollout notes

  • Behavior change for all llama users: every local completion now goes through the constrained decoder instead of the stochastic sampler. Intended flip, validated on device first.
  • Greedy (beam width 1) ships; beam-by-default waits on the batched-beam work to avoid per-branch latency.
  • Known follow-up (not here): the decoder is deterministic, so the sampling parameters (temperature, top-k, top-p, min-p, seed) are now vestigial for token selection. They are still referenced (they configure the seed via samplingConfig, and feed SamplingFingerprint), so not dead symbols; removing them cleanly cascades into settings and request layers and is a separate change.

Greptile Summary

This PR promotes the constrained decoder from a dogfood-gated opt-in to the sole llama decode path, removing cotabbyConstrainedDecoderEnabled, cotabbyFillInMiddleEnabled, and the useConstrainedDecoder routing guard. The net result is -92 lines and a cleaner, fully deterministic decode path for all users.

  • Removed runEngineSampledDecode (and its extractPiece helper), the stochastic sampler path — generate() now routes unconditionally to runConstrainedDecode (greedy) or runConstrainedBeamDecode (beam).
  • Fill-in-middle is now always attempted when the caret is genuinely mid-line (not just when a developer flag was set), with the existing model-capability fallback still in place.
  • Sampling parameters (temperature, topP, minP, repetitionPenalty, seed) are now vestigial for token selection but remain in LlamaGenerationOptions and SamplingFingerprint; the PR description explicitly calls out cleanup as a follow-up.

Confidence Score: 4/5

The PR cleanly removes the old stochastic sampler path and its gating flags; all users now always go through the constrained decoder, which the author validated on device before this change.

The core decode routing change is straightforward and the constrained decoder already existed and was tested. FIM is now always attempted for mid-line carets, relying on the existing model-capability fallback — this is a subtle behavior change for users with non-FIM models in mid-line positions, but the fallback path appears sound. The vestigial sampling parameters remaining in SamplingFingerprint cause unnecessary KV cache invalidations when those settings change, but this is explicitly acknowledged as a follow-up. No correctness bugs are introduced.

Cotabby/Services/Runtime/LlamaSuggestionEngine.swift — the FIM always-on behavior for mid-line carets is new for all users and depends on the model-capability fallback working correctly for every deployed model configuration.

Important Files Changed

Filename Overview
Cotabby/Services/Runtime/LlamaRuntimeCore.swift Removed runEngineSampledDecode, extractPiece, and the useConstrainedDecoder guard; generate() now routes directly to runConstrainedDecode or runConstrainedBeamDecode. samplingConfig is still constructed and passed to the engine for prompt decode seeding, so those parameters are not dead in all paths — but token selection is now entirely deterministic.
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift Removed the cotabbyConstrainedDecoderEnabled and cotabbyFillInMiddleEnabled UserDefaults keys and their accessor properties. LlamaGenerationOptions is now constructed without useConstrainedDecoder. FIM is now always attempted for mid-line carets; SamplingFingerprint still includes vestigial sampling params — acknowledged as a follow-up cleanup.
Cotabby/Models/LlamaRuntimeModels.swift Removed useConstrainedDecoder: Bool from LlamaGenerationOptions and updated the beamWidth doc comment. Clean, no logic issues.
Cotabby/Services/Runtime/LlamaRuntimeManager.swift Comment-only update in generate() to drop the reference to sampleNext calls in the cancellation explanation. No logic changes.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LlamaSuggestionEngine.generateSuggestion] --> B[LlamaRuntimeManager.generate]
    B --> C[LlamaRuntimeCore.generate]
    C --> D{beamWidth > 1?}
    D -- Yes --> E[runConstrainedBeamDecode]
    D -- No --> F[runConstrainedDecode]
    E --> G[ConstrainedSampler + EngineBeamStepper]
    F --> H[ConstrainedSampler.selectToken greedy argmax]
    G --> I[Return highest-scoring beam]
    H --> J[Return greedy completion]

    style E fill:#d4edda,stroke:#28a745
    style F fill:#d4edda,stroke:#28a745

    subgraph Removed
        R1[runEngineSampledDecode]
        R2[engine.sampleNext]
        R3[extractPiece]
        R4[useConstrainedDecoder guard]
    end
Loading

Reviews (1): Last reviewed commit: "Make the constrained decoder the only ll..." | Re-trigger Greptile

Validated on device, so it ships to everyone with no flag. Removes the
cotabbyConstrainedDecoderEnabled and cotabbyFillInMiddleEnabled feature gates
(both now always on), the useConstrainedDecoder option, and the now-unreachable
runEngineSampledDecode plus its extractPiece helper. Generation routes straight to
the greedy or beam constrained decoder.

Beam width stays as a tuning knob (cotabbyConstrainedBeamWidth, greedy by default),
not a feature gate: it keeps the beam path reachable and is the basis for the
batched-beam work, which is what beam-by-default needs to avoid per-branch latency.

Fill-in-middle is now unconditional, still gated by its real preconditions (a
genuine mid-line caret and a model that ships the FIM markers).
@FuJacob FuJacob merged commit 4ee8edd into main Jun 2, 2026
4 checks passed
@FuJacob FuJacob deleted the chore/enable-constrained-decode branch June 2, 2026 04:42
@FuJacob FuJacob mentioned this pull request Jun 2, 2026
Merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant