Skip to content

Revert "Make the constrained decoder the only llama decode path (#532)"#539

Merged
FuJacob merged 1 commit into
mainfrom
fix/revert-constrained-only-decode
Jun 2, 2026
Merged

Revert "Make the constrained decoder the only llama decode path (#532)"#539
FuJacob merged 1 commit into
mainfrom
fix/revert-constrained-only-decode

Conversation

@FuJacob
Copy link
Copy Markdown
Owner

@FuJacob FuJacob commented Jun 2, 2026

Summary

Restore the native sampler (runEngineSampledDecode) as the default llama decode path by reverting #532, and put the constrained decoder back behind its default-off useConstrainedDecoder flag. #538 removed the worst of the constrained-decode latency (the per-token full-vocab sort and an unused per-token logSumExp), but in real use the constrained path is still much slower than the native sampler users had before #532, so we fall back to native now and keep the constrained decoder behind its flag while the remaining slowness is investigated.

Validation

swiftlint lint --quiet <reverted files>
# exit 0

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' \
  build-for-testing -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **   (restored runEngineSampledDecode compiles against the current engine)

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/ConstrainedSamplerTests \
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests \
  -only-testing:CotabbyTests/RepetitionGuardTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests
# Executed 51 tests, with 0 failures

Real-world latency that motivated the revert (from ~/Library/Logs/Cotabby/llm-io.jsonl on the post-#538 build): constrained decode p50 ~311ms, p99 ~2.1s, max ~6.4s on a 262k-vocab model, versus the snappy native path before #532.

Linked issues

Refs #532 (this reverts it). Refs #538 (its ConstrainedSampler / runConstrainedDecode optimizations are intentionally kept).

Risk / rollout notes

  • Behavior revert to the known-good native sampler. Default generation no longer uses the constrained decoder; it stays available behind the hidden cotabbyConstrainedDecoderEnabled UserDefaults flag (default off), and behind useConstrainedDecoder in LlamaGenerationOptions (default false).
  • Fix constrained-decode latency: bounded top-K, skip unused logSumExp #538's candidatePool bounded top-K and the logSumExp skip are kept, so the flagged constrained path is fast when enabled.
  • Deferred for investigation (not blocking this revert): the constrained path appears to over-generate in multi-line fields (it cannot select an EOS token, since EOS is excluded as a control token, so it runs to the token budget instead of stopping early), plus per-token O(vocab) Swift work over a 262k vocab is costly in debug builds. These are why even the optimized constrained path is still slow.
  • No settings, schema, or pbxproj changes.

Greptile Summary

This PR reverts #532 by restoring the engine's native stochastic sampler (runEngineSampledDecode / sampleNext) as the default llama decode path, and puts the constrained decoder back behind the useConstrainedDecoder flag (default off) and its UserDefaults key. The #538 performance work on ConstrainedSampler is intentionally kept.

  • LlamaRuntimeCore.swift: Adds a guard options.useConstrainedDecoder else { return runEngineSampledDecode(...) } guard so the fast native path runs by default, and introduces a new extractPiece helper that reads token bytes from the engine's SampleResult into a String.
  • LlamaSuggestionEngine.swift: Restores constrainedDecoderDefaultsKey / isFillInMiddleEnabled UserDefaults flags (both default-off) and threads useConstrainedDecoder: Self.isConstrainedDecoderEnabled into LlamaGenerationOptions.
  • LlamaRuntimeModels.swift: Adds useConstrainedDecoder: Bool = false to LlamaGenerationOptions with updated doc comments.

Confidence Score: 4/5

The revert to the native sampler is a well-scoped behavioral rollback that restores a previously working code path; the main risk is that the new extractPiece helper silently drops partial multi-byte UTF-8 token pieces, which could cause CJK or emoji text to be erased rather than rendered.

The routing change and flag restoration are clean and match the stated intent. The one functional concern is extractPiece's per-token failable UTF-8 decode, which mirrors the same pitfall the constrained-decoder code already documents and avoids — if the engine ever returns a token whose bytes are only part of a multi-byte scalar, that token's contribution is silently dropped. This was likely pre-existing behavior before #532, but the helper is newly written code in this PR.

Cotabby/Services/Runtime/LlamaRuntimeCore.swift — specifically the extractPiece helper and the per-token String decoding in runEngineSampledDecode.

Important Files Changed

Filename Overview
Cotabby/Services/Runtime/LlamaRuntimeCore.swift Restores runEngineSampledDecode as the default decode path behind a guard on useConstrainedDecoder; adds extractPiece helper that uses failable UTF-8 decoding per token, which can silently drop partial multi-byte sequences (CJK/emoji).
Cotabby/Models/LlamaRuntimeModels.swift Adds useConstrainedDecoder: Bool = false field and updates beamWidth doc to note it is only consulted when constrained decode is on; documentation changes look correct.
Cotabby/Services/Runtime/LlamaRuntimeManager.swift Comment-only update to reflect that the decode loop now polls Task.isCancelled between sampleNext calls; no functional change.
Cotabby/Services/Runtime/LlamaSuggestionEngine.swift Restores constrainedDecoderDefaultsKey / isFillInMiddleEnabled flags (both default-off) and threads useConstrainedDecoder into LlamaGenerationOptions; logic is correct and matches the revert intent.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[generate called] --> B{useConstrainedDecoder?}
    B -- false default --> C[runEngineSampledDecode]
    B -- true --> D{beamWidth gt 1?}
    D -- false --> E[runConstrainedDecode]
    D -- true --> F[runConstrainedBeamDecode]
    C --> G[sampleNext loop\nextractPiece per token]
    E --> H[getNextTokenLogits\nConstrainedSampler.selectToken\nacceptToken manually]
    F --> I[EngineBeamStepper\nmulti-branch beam search]
    G --> J{shouldSuppress?}
    E --> J
    F --> J
    J -- yes --> K[return empty string]
    J -- no --> L[return generatedText]
Loading

Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "Revert "Make the constrained decoder the..." | Re-trigger Greptile

Greptile also left 1 inline comment on this PR.

Restore the native sampler (runEngineSampledDecode) as the default llama
decode path and put the constrained decoder back behind its default-off
`useConstrainedDecoder` flag.

#538 cut the worst of the constrained-decode latency (the per-token
full-vocab sort and an unused per-token logSumExp), but in real use the
constrained path is still much slower than the native sampler: p50 ~311ms
with a tail into multiple seconds on a 262k-vocab model, versus the snappy
native path users had before #532. Rather than keep that as the only path,
fall back to the native sampler now and keep the constrained decoder
available (and faster, via #538) behind its flag while the remaining
slowness is investigated.

This reverts commit 4ee8edd. The #538
ConstrainedSampler / runConstrainedDecode optimizations are intentionally
kept so the flagged path stays fast.
@FuJacob FuJacob merged commit 23ebfbc into main Jun 2, 2026
4 checks passed
@FuJacob FuJacob deleted the fix/revert-constrained-only-decode branch June 2, 2026 06:28
Comment on lines +907 to +914
private static func extractPiece(_ result: SampleResult) -> String {
guard let piece = result.piece, result.piece_length > 0 else { return "" }
let buffer = UnsafeBufferPointer(
start: UnsafeRawPointer(piece).assumingMemoryBound(to: UInt8.self),
count: Int(result.piece_length)
)
return String(bytes: buffer, encoding: .utf8) ?? ""
}
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Silent drop of partial-UTF-8 token pieces

String(bytes: buffer, encoding: .utf8) ?? "" returns "" whenever a token's bytes are not a complete, valid UTF-8 sequence. The llama tokenizer can split a multi-byte scalar (CJK character, emoji) across adjacent tokens, so the first token carries only the leading byte(s) and the second carries the continuation byte(s) — neither fragment decodes independently as valid UTF-8. The consequence is that both tokens silently contribute nothing to generatedText, effectively erasing the character. runConstrainedDecode already documents this exact pitfall and works around it by accumulating raw bytes and decoding once at the end with String(decoding:as:). The native sampler path doesn't accumulate bytes, but the per-token fallback to "" means the same class of corruption applies.

Fix in Codex Fix in Claude Code

@FuJacob FuJacob mentioned this pull request Jun 2, 2026
Merged
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant