Revert "Make the constrained decoder the only llama decode path (#532)" by FuJacob · Pull Request #539 · FuJacob/cotabby

FuJacob · 2026-06-02T06:28:15Z

Summary

Restore the native sampler (runEngineSampledDecode) as the default llama decode path by reverting #532, and put the constrained decoder back behind its default-off useConstrainedDecoder flag. #538 removed the worst of the constrained-decode latency (the per-token full-vocab sort and an unused per-token logSumExp), but in real use the constrained path is still much slower than the native sampler users had before #532, so we fall back to native now and keep the constrained decoder behind its flag while the remaining slowness is investigated.

Validation

swiftlint lint --quiet <reverted files>
# exit 0

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' \
  build-for-testing -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **   (restored runEngineSampledDecode compiles against the current engine)

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/ConstrainedSamplerTests \
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests \
  -only-testing:CotabbyTests/RepetitionGuardTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests
# Executed 51 tests, with 0 failures

Real-world latency that motivated the revert (from ~/Library/Logs/Cotabby/llm-io.jsonl on the post-#538 build): constrained decode p50 ~311ms, p99 ~2.1s, max ~6.4s on a 262k-vocab model, versus the snappy native path before #532.

Linked issues

Refs #532 (this reverts it). Refs #538 (its ConstrainedSampler / runConstrainedDecode optimizations are intentionally kept).

Risk / rollout notes

Behavior revert to the known-good native sampler. Default generation no longer uses the constrained decoder; it stays available behind the hidden cotabbyConstrainedDecoderEnabled UserDefaults flag (default off), and behind useConstrainedDecoder in LlamaGenerationOptions (default false).
Fix constrained-decode latency: bounded top-K, skip unused logSumExp #538's candidatePool bounded top-K and the logSumExp skip are kept, so the flagged constrained path is fast when enabled.
Deferred for investigation (not blocking this revert): the constrained path appears to over-generate in multi-line fields (it cannot select an EOS token, since EOS is excluded as a control token, so it runs to the token budget instead of stopping early), plus per-token O(vocab) Swift work over a 262k vocab is costly in debug builds. These are why even the optimized constrained path is still slow.
No settings, schema, or pbxproj changes.

Greptile Summary

This PR reverts #532 by restoring the engine's native stochastic sampler (runEngineSampledDecode / sampleNext) as the default llama decode path, and puts the constrained decoder back behind the useConstrainedDecoder flag (default off) and its UserDefaults key. The #538 performance work on ConstrainedSampler is intentionally kept.

LlamaRuntimeCore.swift: Adds a guard options.useConstrainedDecoder else { return runEngineSampledDecode(...) } guard so the fast native path runs by default, and introduces a new extractPiece helper that reads token bytes from the engine's SampleResult into a String.
LlamaSuggestionEngine.swift: Restores constrainedDecoderDefaultsKey / isFillInMiddleEnabled UserDefaults flags (both default-off) and threads useConstrainedDecoder: Self.isConstrainedDecoderEnabled into LlamaGenerationOptions.
LlamaRuntimeModels.swift: Adds useConstrainedDecoder: Bool = false to LlamaGenerationOptions with updated doc comments.

Confidence Score: 4/5

The revert to the native sampler is a well-scoped behavioral rollback that restores a previously working code path; the main risk is that the new extractPiece helper silently drops partial multi-byte UTF-8 token pieces, which could cause CJK or emoji text to be erased rather than rendered.

The routing change and flag restoration are clean and match the stated intent. The one functional concern is extractPiece's per-token failable UTF-8 decode, which mirrors the same pitfall the constrained-decoder code already documents and avoids — if the engine ever returns a token whose bytes are only part of a multi-byte scalar, that token's contribution is silently dropped. This was likely pre-existing behavior before #532, but the helper is newly written code in this PR.

Cotabby/Services/Runtime/LlamaRuntimeCore.swift — specifically the extractPiece helper and the per-token String decoding in runEngineSampledDecode.

Important Files Changed

Filename	Overview
Cotabby/Services/Runtime/LlamaRuntimeCore.swift	Restores runEngineSampledDecode as the default decode path behind a guard on useConstrainedDecoder; adds extractPiece helper that uses failable UTF-8 decoding per token, which can silently drop partial multi-byte sequences (CJK/emoji).
Cotabby/Models/LlamaRuntimeModels.swift	Adds useConstrainedDecoder: Bool = false field and updates beamWidth doc to note it is only consulted when constrained decode is on; documentation changes look correct.
Cotabby/Services/Runtime/LlamaRuntimeManager.swift	Comment-only update to reflect that the decode loop now polls Task.isCancelled between sampleNext calls; no functional change.
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift	Restores constrainedDecoderDefaultsKey / isFillInMiddleEnabled flags (both default-off) and threads useConstrainedDecoder into LlamaGenerationOptions; logic is correct and matches the revert intent.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[generate called] --> B{useConstrainedDecoder?}
    B -- false default --> C[runEngineSampledDecode]
    B -- true --> D{beamWidth gt 1?}
    D -- false --> E[runConstrainedDecode]
    D -- true --> F[runConstrainedBeamDecode]
    C --> G[sampleNext loop\nextractPiece per token]
    E --> H[getNextTokenLogits\nConstrainedSampler.selectToken\nacceptToken manually]
    F --> I[EngineBeamStepper\nmulti-branch beam search]
    G --> J{shouldSuppress?}
    E --> J
    F --> J
    J -- yes --> K[return empty string]
    J -- no --> L[return generatedText]

_{Reviews (1): Last reviewed commit: "Revert "Make the constrained decoder the..." | Re-trigger Greptile}

Greptile also left 1 inline comment on this PR.

Restore the native sampler (runEngineSampledDecode) as the default llama decode path and put the constrained decoder back behind its default-off `useConstrainedDecoder` flag. #538 cut the worst of the constrained-decode latency (the per-token full-vocab sort and an unused per-token logSumExp), but in real use the constrained path is still much slower than the native sampler: p50 ~311ms with a tail into multiple seconds on a 262k-vocab model, versus the snappy native path users had before #532. Rather than keep that as the only path, fall back to the native sampler now and keep the constrained decoder available (and faster, via #538) behind its flag while the remaining slowness is investigated. This reverts commit 4ee8edd. The #538 ConstrainedSampler / runConstrainedDecode optimizations are intentionally kept so the flagged path stays fast.

greptile-apps · 2026-06-02T06:31:51Z

+    private static func extractPiece(_ result: SampleResult) -> String {
+        guard let piece = result.piece, result.piece_length > 0 else { return "" }
+        let buffer = UnsafeBufferPointer(
+            start: UnsafeRawPointer(piece).assumingMemoryBound(to: UInt8.self),
+            count: Int(result.piece_length)
+        )
+        return String(bytes: buffer, encoding: .utf8) ?? ""
+    }


Silent drop of partial-UTF-8 token pieces

String(bytes: buffer, encoding: .utf8) ?? "" returns "" whenever a token's bytes are not a complete, valid UTF-8 sequence. The llama tokenizer can split a multi-byte scalar (CJK character, emoji) across adjacent tokens, so the first token carries only the leading byte(s) and the second carries the continuation byte(s) — neither fragment decodes independently as valid UTF-8. The consequence is that both tokens silently contribute nothing to generatedText, effectively erasing the character. runConstrainedDecode already documents this exact pitfall and works around it by accumulating raw bytes and decoding once at the end with String(decoding:as:). The native sampler path doesn't accumulate bytes, but the per-token fallback to "" means the same class of corruption applies.

FuJacob merged commit 23ebfbc into main Jun 2, 2026
4 checks passed

FuJacob deleted the fix/revert-constrained-only-decode branch June 2, 2026 06:28

greptile-apps Bot reviewed Jun 2, 2026

View reviewed changes

FuJacob mentioned this pull request Jun 2, 2026

gif #540

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Revert "Make the constrained decoder the only llama decode path (#532)"#539

Revert "Make the constrained decoder the only llama decode path (#532)"#539
FuJacob merged 1 commit into
mainfrom
fix/revert-constrained-only-decode

FuJacob commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading

Uh oh!

Uh oh!

greptile-apps Bot Jun 2, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

FuJacob commented Jun 2, 2026 • edited by greptile-apps Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Validation

Linked issues

Risk / rollout notes

Greptile Summary

Confidence Score: 4/5

Important Files Changed

Flowchart

Uh oh!

Uh oh!

greptile-apps Bot Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

FuJacob commented Jun 2, 2026 •

edited by greptile-apps Bot

Loading