Revert "Make the constrained decoder the only llama decode path (#532)"#539
Conversation
Restore the native sampler (runEngineSampledDecode) as the default llama decode path and put the constrained decoder back behind its default-off `useConstrainedDecoder` flag. #538 cut the worst of the constrained-decode latency (the per-token full-vocab sort and an unused per-token logSumExp), but in real use the constrained path is still much slower than the native sampler: p50 ~311ms with a tail into multiple seconds on a 262k-vocab model, versus the snappy native path users had before #532. Rather than keep that as the only path, fall back to the native sampler now and keep the constrained decoder available (and faster, via #538) behind its flag while the remaining slowness is investigated. This reverts commit 4ee8edd. The #538 ConstrainedSampler / runConstrainedDecode optimizations are intentionally kept so the flagged path stays fast.
| private static func extractPiece(_ result: SampleResult) -> String { | ||
| guard let piece = result.piece, result.piece_length > 0 else { return "" } | ||
| let buffer = UnsafeBufferPointer( | ||
| start: UnsafeRawPointer(piece).assumingMemoryBound(to: UInt8.self), | ||
| count: Int(result.piece_length) | ||
| ) | ||
| return String(bytes: buffer, encoding: .utf8) ?? "" | ||
| } |
There was a problem hiding this comment.
Silent drop of partial-UTF-8 token pieces
String(bytes: buffer, encoding: .utf8) ?? "" returns "" whenever a token's bytes are not a complete, valid UTF-8 sequence. The llama tokenizer can split a multi-byte scalar (CJK character, emoji) across adjacent tokens, so the first token carries only the leading byte(s) and the second carries the continuation byte(s) — neither fragment decodes independently as valid UTF-8. The consequence is that both tokens silently contribute nothing to generatedText, effectively erasing the character. runConstrainedDecode already documents this exact pitfall and works around it by accumulating raw bytes and decoding once at the end with String(decoding:as:). The native sampler path doesn't accumulate bytes, but the per-token fallback to "" means the same class of corruption applies.
Summary
Restore the native sampler (
runEngineSampledDecode) as the default llama decode path by reverting #532, and put the constrained decoder back behind its default-offuseConstrainedDecoderflag. #538 removed the worst of the constrained-decode latency (the per-token full-vocab sort and an unused per-tokenlogSumExp), but in real use the constrained path is still much slower than the native sampler users had before #532, so we fall back to native now and keep the constrained decoder behind its flag while the remaining slowness is investigated.Validation
Real-world latency that motivated the revert (from
~/Library/Logs/Cotabby/llm-io.jsonlon the post-#538 build): constrained decode p50 ~311ms, p99 ~2.1s, max ~6.4s on a 262k-vocab model, versus the snappy native path before #532.Linked issues
Refs #532 (this reverts it). Refs #538 (its ConstrainedSampler /
runConstrainedDecodeoptimizations are intentionally kept).Risk / rollout notes
cotabbyConstrainedDecoderEnabledUserDefaults flag (default off), and behinduseConstrainedDecoderinLlamaGenerationOptions(default false).candidatePoolbounded top-K and thelogSumExpskip are kept, so the flagged constrained path is fast when enabled.Greptile Summary
This PR reverts #532 by restoring the engine's native stochastic sampler (
runEngineSampledDecode/sampleNext) as the default llama decode path, and puts the constrained decoder back behind theuseConstrainedDecoderflag (default off) and its UserDefaults key. The #538 performance work onConstrainedSampleris intentionally kept.LlamaRuntimeCore.swift: Adds aguard options.useConstrainedDecoder else { return runEngineSampledDecode(...) }guard so the fast native path runs by default, and introduces a newextractPiecehelper that reads token bytes from the engine'sSampleResultinto aString.LlamaSuggestionEngine.swift: RestoresconstrainedDecoderDefaultsKey/isFillInMiddleEnabledUserDefaults flags (both default-off) and threadsuseConstrainedDecoder: Self.isConstrainedDecoderEnabledintoLlamaGenerationOptions.LlamaRuntimeModels.swift: AddsuseConstrainedDecoder: Bool = falsetoLlamaGenerationOptionswith updated doc comments.Confidence Score: 4/5
The revert to the native sampler is a well-scoped behavioral rollback that restores a previously working code path; the main risk is that the new extractPiece helper silently drops partial multi-byte UTF-8 token pieces, which could cause CJK or emoji text to be erased rather than rendered.
The routing change and flag restoration are clean and match the stated intent. The one functional concern is extractPiece's per-token failable UTF-8 decode, which mirrors the same pitfall the constrained-decoder code already documents and avoids — if the engine ever returns a token whose bytes are only part of a multi-byte scalar, that token's contribution is silently dropped. This was likely pre-existing behavior before #532, but the helper is newly written code in this PR.
Cotabby/Services/Runtime/LlamaRuntimeCore.swift — specifically the extractPiece helper and the per-token String decoding in runEngineSampledDecode.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%% flowchart TD A[generate called] --> B{useConstrainedDecoder?} B -- false default --> C[runEngineSampledDecode] B -- true --> D{beamWidth gt 1?} D -- false --> E[runConstrainedDecode] D -- true --> F[runConstrainedBeamDecode] C --> G[sampleNext loop\nextractPiece per token] E --> H[getNextTokenLogits\nConstrainedSampler.selectToken\nacceptToken manually] F --> I[EngineBeamStepper\nmulti-branch beam search] G --> J{shouldSuppress?} E --> J F --> J J -- yes --> K[return empty string] J -- no --> L[return generatedText]Reviews (1): Last reviewed commit: "Revert "Make the constrained decoder the..." | Re-trigger Greptile