Fix constrained-decode latency: bounded top-K, skip unused logSumExp by FuJacob · Pull Request #538 · FuJacob/cotabby

FuJacob · 2026-06-02T06:16:36Z

Summary

Making the constrained decoder the only llama decode path (#532) silently moved every suggestion onto the Swift constrained decoder, which ran two O(vocab) operations per generated token that the deleted native sampler never paid. ConstrainedSampler.candidatePool sorted the entire vocabulary (150k-256k tokens) every step just to take the top topK, and runConstrainedDecode scored every token with a full-vocab logSumExp to feed a confidence floor that defaults to -infinity (suppression off). With a 25-token budget that was dozens of full-vocab sorts per suggestion, so generation took seconds. This replaces the full sort with a single-pass bounded top-K selection and skips the per-token logSumExp at the default floor, with no change to which tokens get selected.

Validation

swiftlint lint --quiet
# exit 0, no violations

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' \
  build-for-testing -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/ConstrainedSamplerTests \
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests \
  -only-testing:CotabbyTests/RepetitionGuardTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests
# ConstrainedSamplerTests: Executed 27 tests, 0 failures
#   (incl. test_select_matchesFullSortReferenceAcrossRandomInputs, a 4000-trial randomized
#    equivalence sweep proving the bounded top-K matches the old full sort bit-for-bit)
# ConstrainedBeamSearchTests / RepetitionGuardTests / LlamaSuggestionEngineCancellationTests: all passed

A throwaway micro-benchmark over a representative 200k vocab and 25-token budget (debug build) measured token selection at ~8.0s before vs ~0.55s after (14.5x), with identical selected-token sums confirming output is unchanged. The removed logSumExp is additional savings on top of that. Not verified end-to-end on device (needs a model + Accessibility + a live field); the function-level equivalence test plus the benchmark cover the change.

Linked issues

Refs #532 (introduced the regression by making the constrained decoder the only decode path).

Risk / rollout notes

Performance only; no behavior change on the default path. The new candidatePool returns the same token set with the same lower-id tie-break as the old full sort, proven by the 4000-trial randomized equivalence test.
The logSumExp skip is gated on confidenceFloor == -.infinity (the shipped default). When a caller raises the floor, per-token scoring runs exactly as before, so confidence suppression is unaffected.
No settings, schema, or pbxproj changes. No public API changes. The beam path (non-default, beamWidth > 1) is untouched.
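The gating described above can be sketched roughly as follows. This is a minimal illustration, not the code in LlamaRuntimeCore.swift: the function names `logSumExp` and `chosenLogProb` and their signatures are assumptions made for the example, and only the `confidenceFloor > -.infinity` condition comes from the PR description.

```swift
import Foundation

/// Numerically stable log(sum(exp(x))) over the whole vocabulary.
/// This is the O(vocab) pass that the guard below avoids.
/// (Hypothetical helper; the real signature may differ.)
func logSumExp(_ logits: [Float]) -> Float {
    guard let maxLogit = logits.max() else { return -.infinity }
    // Subtract the max before exponentiating to avoid overflow.
    let sum = logits.reduce(Float(0)) { $0 + exp($1 - maxLogit) }
    return maxLogit + log(sum)
}

/// Log-probability of the chosen token, or nil when no confidence floor is set
/// and the full-vocabulary pass would be wasted work.
/// (Hypothetical name, used only for this sketch.)
func chosenLogProb(logits: [Float], tokenID: Int, confidenceFloor: Float) -> Float? {
    // At the default floor of -infinity nothing can be suppressed, so skip scoring.
    guard confidenceFloor > -.infinity else { return nil }
    return logits[tokenID] - logSumExp(logits)
}
```

With the default floor the caller gets `nil` back and can leave its running log-probability sum untouched; any finite floor takes the scoring path on every token.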

Greptile Summary

This PR fixes a latency regression introduced in #532, which moved all suggestion generation onto the Swift constrained decoder and exposed two expensive O(vocab) operations per token. The fix replaces the full-vocabulary sort in candidatePool with a bounded top-K scan and skips the per-token logSumExp when the confidence floor is at its default (-.infinity), with no change to which tokens are selected.

Bounded top-K in candidatePool: replaces (0..<count).sorted (O(vocab log vocab)) with a single O(vocab) scan over a limit-sized fixed buffer, evicting the worst candidate on each improvement; tie-breaking (lower id wins) is preserved exactly by evicting the larger id on equal logits.
logSumExp skip in runConstrainedDecode: the per-token softmax computation is now guarded by options.confidenceFloor > -.infinity; when the shipped default is in effect, sumLogprob stays at 0.0 and shouldSuppress still behaves correctly because the policy treats -.infinity as "never suppress".
New tests: a 4000-trial deterministic equivalence sweep against the old full-sort reference (using a seeded SplitMix64 RNG with many tie-heavy cases) and a large-vocab equal-logit cut-line test both accompany the change.
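A minimal sketch of the bounded top-K idea from the first bullet, under stated assumptions: `boundedTopK` is a hypothetical free function standing in for `candidatePool`, whose real signature is not shown in this PR text, and the buffer handling here is one straightforward way to get the lower-id tie-break, not necessarily the merged implementation.

```swift
/// Returns the ids of the `limit` largest logits, preferring the lower id on ties.
/// (Hypothetical stand-in for `ConstrainedSampler.candidatePool`.)
func boundedTopK(logits: [Float], limit: Int) -> [Int] {
    guard limit > 0 else { return [] }
    // When the limit covers the whole vocabulary, return every id in id order.
    guard limit < logits.count else { return Array(logits.indices) }

    // `a` outranks `b` if its logit is higher, or equal with a lower id.
    func outranks(_ a: Int, _ b: Int) -> Bool {
        logits[a] != logits[b] ? logits[a] > logits[b] : a < b
    }

    // Fixed-size buffer seeded with the first `limit` ids.
    var pool = Array(0..<limit)

    // O(limit) scan for the buffer slot holding the worst-ranked candidate.
    func worstIndex() -> Int {
        var worst = 0
        for i in 1..<pool.count where outranks(pool[worst], pool[i]) {
            worst = i
        }
        return worst
    }

    var worst = worstIndex()
    // Single pass over the remaining ids; evict only when a candidate is strictly better.
    for id in limit..<logits.count where outranks(id, pool[worst]) {
        pool[worst] = id
        worst = worstIndex()
    }
    return pool
}
```

The returned ids are in buffer order rather than rank order, which is enough when the caller only needs pool membership. On equal logits a later (higher) id never displaces an earlier one, which is what keeps the tie-break identical to a sort by descending logit then ascending id.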

Confidence Score: 5/5

Safe to merge — pure performance change with no behavioral difference on the default configuration path.

The bounded top-K scan in candidatePool is algorithmically equivalent to the old full sort: tie-breaking (lower id wins) is reproduced exactly by evicting the larger id on equal logits, and a 4000-trial deterministic sweep against the old reference confirms bit-for-bit agreement. The logSumExp skip is correctly gated on confidenceFloor > -.infinity so shouldSuppress still receives the right inputs when a caller raises the floor. No schema, API, or behavioral changes are introduced.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Support/ConstrainedSampler.swift | Replaces O(vocab log vocab) full sort in `candidatePool` with an O(vocab) bounded top-K scan plus an O(limit) `worstCandidateIndex` helper; tie-breaking logic is correct and bit-for-bit equivalent to the old sort. |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Adds a `confidenceFloor > -.infinity` guard before the per-token `logSumExp` call; when the floor is at its default the computation is skipped entirely and `shouldSuppress` still evaluates correctly with the zero-initialized `sumLogprob`. |
| CotabbyTests/ConstrainedSamplerTests.swift | Adds a seeded 4000-trial randomized equivalence sweep against the old full-sort reference and a large-vocab equal-logit cut-line test; both new tests are deterministic and exercise the tie-heavy edge cases most likely to expose divergence. |
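For illustration, a seeded equivalence sweep of the kind described for the test file might look like the sketch below. Everything here is an assumption except the use of a seeded SplitMix64 generator and a full-sort reference: `firstMismatch`, `fullSortReference`, the input sizes, and the set-based comparison are hypothetical and do not reproduce `test_select_matchesFullSortReferenceAcrossRandomInputs`.

```swift
/// SplitMix64: a small seedable generator, so every run sees the same inputs.
struct SplitMix64: RandomNumberGenerator {
    private var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E3779B97F4A7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58476D1CE4E5B9
        z = (z ^ (z >> 27)) &* 0x94D049BB133111EB
        return z ^ (z >> 31)
    }
}

/// Reference behavior: sort by descending logit, then ascending id, and take `limit`.
func fullSortReference(logits: [Float], limit: Int) -> [Int] {
    let ranked = logits.indices.sorted { a, b in
        logits[a] != logits[b] ? logits[a] > logits[b] : a < b
    }
    return Array(ranked.prefix(limit))
}

/// Runs `trials` random cases and returns the index of the first trial where the
/// candidate's pool membership differs from the reference, or nil if all agree.
func firstMismatch(trials: Int, seed: UInt64,
                   candidate: ([Float], Int) -> [Int],
                   reference: ([Float], Int) -> [Int]) -> Int? {
    var rng = SplitMix64(seed: seed)
    for trial in 0..<trials {
        let count = Int.random(in: 1...64, using: &rng)
        let limit = Int.random(in: 1...count, using: &rng)
        // Only four distinct logit values, so ties at the cut line are frequent.
        let logits = (0..<count).map { _ in Float(Int.random(in: 0..<4, using: &rng)) }
        if Set(candidate(logits, limit)) != Set(reference(logits, limit)) {
            return trial
        }
    }
    return nil
}
```

A test would pass the fast selection as `candidate` and `fullSortReference` as `reference`, then assert that the result is `nil`. Because the generator is seeded, a failing trial index is reproducible across runs.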

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[runConstrainedDecode called] --> B[Get logits from engine]
    B --> C[RepetitionGuard: compute blockedTokenIDs]
    C --> D[ConstrainedSampler.selectToken]
    D --> E[candidatePool: limit < count?]
    E -->|No - return all IDs| F[id-ordered full vocab]
    E -->|Yes| G[O-vocab scan: fixed-size buffer, worstCandidateIndex evicts on better logit]
    G --> F
    F --> H[argmax over surviving admissible/unblocked tokens]
    H --> I{Token found?}
    I -->|nil| J[stopReason = no_admissible_token]
    I -->|tokenID| K[preCommitStopReason check]
    K -->|stop| L[break loop]
    K -->|continue| M{confidenceFloor > -.infinity?}
    M -->|No - skip logSumExp| N[Append bytes, tokensGenerated++]
    M -->|Yes| O[logProb: logSumExp over full vocab]
    O --> P[sumLogprob += logProb]
    P --> N
    N --> Q[engine.acceptToken]
    Q --> R{Sentence boundary?}
    R -->|Yes| S[break loop]
    R -->|No| B
    J --> T[shouldSuppress]
    L --> T
    S --> T
    T -->|suppress| U[return empty string]
    T -->|pass| V[return generatedText]


Making the constrained decoder the only llama decode path (#532) moved every suggestion onto two O(vocab) operations per generated token that the deleted native sampler never paid:

- ConstrainedSampler.candidatePool sorted the full vocabulary (150k-256k tokens) every step just to take the top topK. Replace the full sort with a single-pass bounded top-K selection that keeps the same membership and the same lower-id tie-break.
- runConstrainedDecode scored every token with a full-vocab logSumExp to feed the confidence floor, which shouldSuppress treats as a no-op at the default floor of -infinity. Skip it unless a caller raises the floor.

Token selection dropped from ~8.0s to ~0.55s per suggestion in a debug build (200k vocab, 25-token budget) with identical selected tokens. A 4000-trial randomized equivalence test pins the fast path to the old full-sort behavior bit-for-bit.

FuJacob merged commit a3e5fe7 into main Jun 2, 2026
4 checks passed

FuJacob deleted the fix/llm-generation-slow-regression branch June 2, 2026 06:18

This was referenced Jun 2, 2026

Revert "Make the constrained decoder the only llama decode path (#532)" #539

Merged

gif #540

Merged
