Make the constrained decoder the only llama decode path #532
Merged
Validated on device, so it ships to everyone with no flag. Removes the cotabbyConstrainedDecoderEnabled and cotabbyFillInMiddleEnabled feature gates (both now always on), the useConstrainedDecoder option, and the now-unreachable runEngineSampledDecode plus its extractPiece helper. Generation routes straight to the greedy or beam constrained decoder. Beam width stays as a tuning knob (cotabbyConstrainedBeamWidth, greedy by default), not a feature gate: it keeps the beam path reachable and is the basis for the batched-beam work, which is what beam-by-default needs to avoid per-branch latency. Fill-in-middle is now unconditional, still gated by its real preconditions (a genuine mid-line caret and a model that ships the FIM markers).
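The routing described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `DecodePath`, `GenerationOptions`, and `selectDecodePath` are hypothetical names, and `beamWidth` stands in for the cotabbyConstrainedBeamWidth tuning knob, with a width of 1 assumed to mean greedy.

```swift
// Hypothetical sketch of the post-merge routing. The names below are
// illustrative assumptions and do not come from the Cotabby sources.
enum DecodePath: Equatable {
    case greedy
    case beam(width: Int)
}

struct GenerationOptions {
    // Stand-in for the cotabbyConstrainedBeamWidth knob; 1 is assumed to mean greedy.
    var beamWidth: Int = 1
}

func selectDecodePath(for options: GenerationOptions) -> DecodePath {
    // No feature flag and no sampled fallback: every request lands on
    // one of the two constrained decoders.
    return options.beamWidth > 1 ? .beam(width: options.beamWidth) : .greedy
}
```

With this shape the beam path stays reachable through a setting rather than a gate, which is the distinction the description draws between a tuning knob and a feature flag.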
This was referenced Jun 2, 2026
FuJacob added a commit that referenced this pull request Jun 2, 2026
Summary
The constrained decoder is now validated on device, so it becomes the only llama decode path for every user, no flag required. This removes the dogfood gates and the code they guarded.
- Removes the cotabbyConstrainedDecoderEnabled and cotabbyFillInMiddleEnabled feature flags. The constrained decoder and fill-in-middle are now always on (FIM is still gated by its real preconditions: a genuine mid-line caret and a model that ships the FIM markers).
- Removes the useConstrainedDecoder option and the routing guard; generate() routes straight to the greedy or beam constrained decoder.
- Deletes the now-unreachable runEngineSampledDecode and its extractPiece helper (the old stochastic sampler path). Net change is -92 lines.
- Keeps cotabbyConstrainedBeamWidth as a tuning knob (greedy by default), not a feature gate: it keeps the beam path reachable and is the basis for the batched-beam work, which is what beam-by-default needs to avoid per-branch latency.
Validation
Dogfooded on device against a local base model (greedy) before flipping.
Linked issues
None. Promotes the constrained decoder from dogfood flag to the default decode path.
Risk / rollout notes
samplingConfig, and feed SamplingFingerprint), so not dead symbols; removing them cleanly cascades into settings and request layers and is a separate change.
Greptile Summary
This PR promotes the constrained decoder from a dogfood-gated opt-in to the sole llama decode path, removing cotabbyConstrainedDecoderEnabled, cotabbyFillInMiddleEnabled, and the useConstrainedDecoder routing guard. The net result is -92 lines and a cleaner, fully deterministic decode path for all users.
- Removes runEngineSampledDecode (and its extractPiece helper), the stochastic sampler path; generate() now routes unconditionally to runConstrainedDecode (greedy) or runConstrainedBeamDecode (beam).
- The sampling parameters (temperature, topP, minP, repetitionPenalty, seed) are now vestigial for token selection but remain in LlamaGenerationOptions and SamplingFingerprint; the PR description explicitly calls out cleanup as a follow-up.
Confidence Score: 4/5
The PR cleanly removes the old stochastic sampler path and its gating flags; all users now always go through the constrained decoder, which the author validated on device before this change.
The core decode routing change is straightforward and the constrained decoder already existed and was tested. FIM is now always attempted for mid-line carets, relying on the existing model-capability fallback — this is a subtle behavior change for users with non-FIM models in mid-line positions, but the fallback path appears sound. The vestigial sampling parameters remaining in SamplingFingerprint cause unnecessary KV cache invalidations when those settings change, but this is explicitly acknowledged as a follow-up. No correctness bugs are introduced.
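To make the cache concern concrete, here is a small sketch of how a fingerprint that still includes sampling settings could force a cache reset even though those settings no longer affect token selection. `FingerprintSketch` and `PrefixCacheSketch` are hypothetical stand-ins; the real fields of SamplingFingerprint and the real cache logic are not shown in this PR.

```swift
// Hypothetical stand-in for SamplingFingerprint; the real fields are assumptions.
struct FingerprintSketch: Hashable {
    var temperature: Double
    var topP: Double
    var seed: UInt64
}

// Hypothetical cache keyed on the fingerprint of the last request.
final class PrefixCacheSketch {
    private var fingerprint: FingerprintSketch?
    private(set) var invalidations = 0

    // Returns true when the cached state can be reused for this request.
    func prepare(for next: FingerprintSketch) -> Bool {
        if fingerprint == next { return true }
        // A different fingerprint discards the cached state, even if the
        // changed field has no effect on a greedy or beam decode.
        if fingerprint != nil { invalidations += 1 }
        fingerprint = next
        return false
    }
}

let cache = PrefixCacheSketch()
_ = cache.prepare(for: FingerprintSketch(temperature: 0.2, topP: 0.9, seed: 1))
let reused = cache.prepare(for: FingerprintSketch(temperature: 0.7, topP: 0.9, seed: 1))
```

In this sketch only the temperature changes between the two requests, yet the second request cannot reuse the cache. Dropping the vestigial fields from the fingerprint, as the follow-up proposes, would avoid that kind of reset.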
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift — the FIM always-on behavior for mid-line carets is new for all users and depends on the model-capability fallback working correctly for every deployed model configuration.
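The two FIM preconditions named in the PR description (a genuine mid-line caret and a model that ships the FIM markers) might be checked along these lines. All types and names here are hypothetical, and "mid-line" is assumed to mean that non-whitespace text follows the caret on the same line; the actual engine may define it differently.

```swift
// Hypothetical types; none of these names come from the Cotabby sources.
struct CaretContext {
    var textBeforeCaret: String
    var restOfLineAfterCaret: String
}

struct ModelCapabilities {
    var hasFIMMarkers: Bool
}

enum PromptMode: Equatable {
    case fillInMiddle
    case prefixOnly
}

func promptMode(caret: CaretContext, model: ModelCapabilities) -> PromptMode {
    // Assumption: a "genuine mid-line caret" has non-whitespace text after it.
    let isMidLine = caret.restOfLineAfterCaret.contains { !$0.isWhitespace }
    // Fall back to a plain prefix prompt when either precondition fails.
    return (isMidLine && model.hasFIMMarkers) ? .fillInMiddle : .prefixOnly
}
```

The fallback branch is the part the review flags: with the flag gone, a model without FIM markers depends on this kind of check for every mid-line request.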
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[LlamaSuggestionEngine.generateSuggestion] --> B[LlamaRuntimeManager.generate]
    B --> C[LlamaRuntimeCore.generate]
    C --> D{beamWidth > 1?}
    D -- Yes --> E[runConstrainedBeamDecode]
    D -- No --> F[runConstrainedDecode]
    E --> G[ConstrainedSampler + EngineBeamStepper]
    F --> H[ConstrainedSampler.selectToken greedy argmax]
    G --> I[Return highest-scoring beam]
    H --> J[Return greedy completion]
    style E fill:#d4edda,stroke:#28a745
    style F fill:#d4edda,stroke:#28a745
    subgraph Removed
        R1[runEngineSampledDecode]
        R2[engine.sampleNext]
        R3[extractPiece]
        R4[useConstrainedDecoder guard]
    end
Reviews (1): Last reviewed commit: "Make the constrained decoder the only ll..."
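The greedy branch of the flowchart ends in a constrained argmax. A minimal sketch of that idea, assuming the constraint can be expressed as a per-token predicate, is shown below. `selectTokenGreedy` and `isAllowed` are hypothetical; the actual signature of ConstrainedSampler.selectToken is not shown in this PR.

```swift
// Hypothetical sketch of a constrained greedy pick: the highest-scoring
// token among those the constraint allows. Ties go to the lowest token id.
func selectTokenGreedy(logits: [Float], isAllowed: (Int) -> Bool) -> Int? {
    var best: (token: Int, score: Float)?
    for (token, score) in logits.enumerated() where isAllowed(token) {
        // Keep the current best unless this token scores strictly higher.
        if let current = best, current.score >= score { continue }
        best = (token: token, score: score)
    }
    // nil means the constraint rejected every token.
    return best?.token
}
```

Because the pick depends only on the logits and the constraint, the same input always yields the same token, which is what makes this path deterministic where the removed sampler path was stochastic.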