Fix constrained-decode latency: bounded top-K, skip unused logSumExp #538

Merged: FuJacob merged 1 commit into main from fix/llm-generation-slow-regression on Jun 2, 2026
@FuJacob FuJacob commented Jun 2, 2026

Summary

Making the constrained decoder the only llama decode path (#532) silently moved every suggestion onto the Swift constrained decoder, which ran two O(vocab) operations per generated token that the deleted native sampler never paid. First, ConstrainedSampler.candidatePool sorted the entire vocabulary (150k-256k tokens) on every step just to take the top topK. Second, runConstrainedDecode scored every generated token with a full-vocab logSumExp to feed a confidence floor that defaults to -infinity (suppression off), so the score was never used. With a 25-token budget, that meant up to 25 full-vocab sorts per suggestion, and generation took seconds. This PR replaces the full sort with a single-pass bounded top-K selection and skips the per-token logSumExp at the default floor, with no change to which tokens get selected.
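For readers who want the shape of the bounded top-K idea, here is a minimal sketch. It is not the code from this PR: `boundedTopK`, `outranks`, and `worstIndex` are hypothetical names, the result order (ascending id) is an assumption, and logits are assumed to contain no NaN values. It keeps a `limit`-sized pool and replaces the weakest entry whenever a later id outranks it, with the lower id winning ties.

```swift
/// Hypothetical stand-in for the pool selection: returns the ids of the `limit`
/// highest logits, preferring the lower id when logits tie.
func boundedTopK(logits: [Float], limit: Int) -> [Int] {
    guard limit > 0 else { return [] }
    // Nothing to cut: every id survives, in id order.
    guard limit < logits.count else { return Array(logits.indices) }

    // `a` outranks `b` when its logit is higher, or equal with a lower id.
    func outranks(_ a: Int, _ b: Int) -> Bool {
        logits[a] != logits[b] ? logits[a] > logits[b] : a < b
    }

    // Position in `pool` of the weakest kept candidate (an O(limit) scan).
    func worstIndex(in pool: [Int]) -> Int {
        var worst = 0
        for i in 1..<pool.count where outranks(pool[worst], pool[i]) {
            worst = i
        }
        return worst
    }

    var pool = Array(0..<limit)          // seed with the first `limit` ids
    var worst = worstIndex(in: pool)
    for id in limit..<logits.count where outranks(id, pool[worst]) {
        pool[worst] = id                 // evict the weakest, keep the newcomer
        worst = worstIndex(in: pool)     // rescan only after a replacement
    }
    return pool.sorted()                 // ascending id (assumed output order)
}
```

The pool is rescanned only after a replacement, so a step costs one comparison per vocabulary entry plus an O(limit) rescan for each id that actually enters the pool.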

Validation

swiftlint lint --quiet
# exit 0, no violations

xcodebuild -project Cotabby.xcodeproj -scheme Cotabby -destination 'platform=macOS' \
  build-for-testing -derivedDataPath build/DerivedData CODE_SIGNING_ALLOWED=NO CODE_SIGNING_REQUIRED=NO
# ** TEST BUILD SUCCEEDED **

xcodebuild ... test-without-building \
  -only-testing:CotabbyTests/ConstrainedSamplerTests \
  -only-testing:CotabbyTests/ConstrainedBeamSearchTests \
  -only-testing:CotabbyTests/RepetitionGuardTests \
  -only-testing:CotabbyTests/LlamaSuggestionEngineCancellationTests
# ConstrainedSamplerTests: Executed 27 tests, 0 failures
#   (incl. test_select_matchesFullSortReferenceAcrossRandomInputs, a 4000-trial randomized
#    equivalence sweep proving the bounded top-K matches the old full sort bit-for-bit)
# ConstrainedBeamSearchTests / RepetitionGuardTests / LlamaSuggestionEngineCancellationTests: all passed

A throwaway micro-benchmark over a representative 200k vocab and 25-token budget (debug build) measured token selection at ~8.0s before vs ~0.55s after (14.5x), with identical selected-token sums confirming output is unchanged. The removed logSumExp is additional savings on top of that. Not verified end-to-end on device (needs a model + Accessibility + a live field); the function-level equivalence test plus the benchmark cover the change.

Linked issues

Refs #532 (introduced the regression by making the constrained decoder the only decode path).

Risk / rollout notes

  • Performance only; no behavior change on the default path. The new candidatePool returns the same token set with the same lower-id tie-break as the old full sort, proven by the 4000-trial randomized equivalence test.
  • The logSumExp skip is gated on confidenceFloor == -.infinity (the shipped default). When a caller raises the floor, per-token scoring runs exactly as before, so confidence suppression is unaffected.
  • No settings, schema, or pbxproj changes. No public API changes. The beam path (non-default, beamWidth > 1) is untouched.
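The gating described in the second bullet can be sketched as follows. This is an illustration under assumptions, not the code in LlamaRuntimeCore.swift: `logProbContribution` is a hypothetical helper, and this `logSumExp` is a plain max-shifted implementation that may differ from the project's own.

```swift
import Foundation

/// Numerically stable log(sum(exp(x))): subtract the max before exponentiating.
func logSumExp(_ logits: [Float]) -> Float {
    guard let maxLogit = logits.max() else { return -.infinity }
    var sum: Float = 0
    for value in logits { sum += exp(value - maxLogit) }
    return maxLogit + log(sum)
}

/// Hypothetical helper: the chosen token's log-probability, or 0 when the
/// confidence floor is at its default and nothing will read the score.
func logProbContribution(logits: [Float], tokenID: Int, confidenceFloor: Float) -> Float {
    // Default floor: skip the O(vocab) pass entirely.
    guard confidenceFloor > -.infinity else { return 0 }
    // Raised floor: score the token exactly as a full softmax would.
    return logits[tokenID] - logSumExp(logits)
}
```

With a raised floor the helper still pays the full-vocabulary pass, which matches the note above that confidence suppression is unaffected when a caller opts in.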

Greptile Summary

This PR fixes a latency regression introduced in #532, which moved all suggestion generation onto the Swift constrained decoder and exposed two expensive O(vocab) operations per token. The fix replaces the full-vocabulary sort in candidatePool with a bounded top-K scan and skips the per-token logSumExp when the confidence floor is at its default (-.infinity), with no change to which tokens are selected.

  • Bounded top-K in candidatePool: replaces (0..<count).sorted (O(vocab log vocab)) with a single O(vocab) scan over a limit-sized fixed buffer, evicting the worst candidate on each improvement; tie-breaking (lower id wins) is preserved exactly by evicting the larger id on equal logits.
  • logSumExp skip in runConstrainedDecode: the per-token softmax computation now runs only when options.confidenceFloor > -.infinity; at the shipped default, sumLogprob stays at 0.0 and shouldSuppress still behaves correctly because the policy treats -.infinity as "never suppress".
  • New tests: a 4000-trial deterministic equivalence sweep against the old full-sort reference (using a seeded SplitMix64 RNG with tie-heavy cases) and a large-vocab equal-logit cut-line test accompany the change.
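A seeded equivalence sweep of the kind mentioned in the last bullet might look like the sketch below. It is not the test from ConstrainedSamplerTests.swift: `fullSortTopK`, `countMismatches`, the seed, and the input sizes are all assumptions chosen for illustration. Only the SplitMix64 step function follows the published algorithm.

```swift
/// SplitMix64: a small deterministic PRNG, so every run sees the same inputs.
struct SplitMix64: RandomNumberGenerator {
    var state: UInt64
    init(seed: UInt64) { state = seed }
    mutating func next() -> UInt64 {
        state &+= 0x9E37_79B9_7F4A_7C15
        var z = state
        z = (z ^ (z >> 30)) &* 0xBF58_476D_1CE4_E5B9
        z = (z ^ (z >> 27)) &* 0x94D0_49BB_1331_11EB
        return z ^ (z >> 31)
    }
}

/// Reference: full sort by (logit descending, id ascending), keep the first `limit` ids.
func fullSortTopK(_ logits: [Float], limit: Int) -> Set<Int> {
    let ranked = logits.indices.sorted { a, b in
        logits[a] != logits[b] ? logits[a] > logits[b] : a < b
    }
    return Set(ranked.prefix(max(0, limit)))
}

/// Runs `trials` seeded comparisons and returns how many disagree with the reference.
func countMismatches(trials: Int, seed: UInt64,
                     candidate: ([Float], Int) -> Set<Int>) -> Int {
    var rng = SplitMix64(seed: seed)
    var mismatches = 0
    for _ in 0..<trials {
        let count = Int.random(in: 1...64, using: &rng)
        let limit = Int.random(in: 0...(count + 1), using: &rng)
        // Only four distinct values, so many logits tie at the cut line.
        var logits: [Float] = []
        for _ in 0..<count {
            logits.append(Float(Int.random(in: 0...3, using: &rng)))
        }
        if candidate(logits, limit) != fullSortTopK(logits, limit: limit) {
            mismatches += 1
        }
    }
    return mismatches
}

// Usage: a candidate with the wrong tie-break (higher id wins) should be caught.
let wrongTieBreak: ([Float], Int) -> Set<Int> = { logits, limit in
    let ranked = logits.indices.sorted { a, b in
        logits[a] != logits[b] ? logits[a] > logits[b] : a > b
    }
    return Set(ranked.prefix(max(0, limit)))
}
print(countMismatches(trials: 4000, seed: 42, candidate: wrongTieBreak))
```

Drawing logits from a handful of values is a deliberate choice: ties at the cut line are where a bounded selection is most likely to diverge from a full sort, so the sweep spends most of its trials there.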

Confidence Score: 5/5

Safe to merge — pure performance change with no behavioral difference on the default configuration path.

The bounded top-K scan in candidatePool is algorithmically equivalent to the old full sort: tie-breaking (lower id wins) is reproduced exactly by evicting the larger id on equal logits, and a 4000-trial deterministic sweep against the old reference confirms bit-for-bit agreement. The logSumExp skip is correctly gated on confidenceFloor > -.infinity so shouldSuppress still receives the right inputs when a caller raises the floor. No schema, API, or behavioral changes are introduced.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| Cotabby/Support/ConstrainedSampler.swift | Replaces O(vocab log vocab) full sort in candidatePool with an O(vocab) bounded top-K scan plus an O(limit) worstCandidateIndex helper; tie-breaking logic is correct and bit-for-bit equivalent to the old sort. |
| Cotabby/Services/Runtime/LlamaRuntimeCore.swift | Adds a confidenceFloor > -.infinity guard before the per-token logSumExp call; when the floor is at its default the computation is skipped entirely and shouldSuppress still evaluates correctly with the zero-initialized sumLogprob. |
| CotabbyTests/ConstrainedSamplerTests.swift | Adds a seeded 4000-trial randomized equivalence sweep against the old full-sort reference and a large-vocab equal-logit cut-line test; both new tests are deterministic and exercise the tie-heavy edge cases most likely to expose divergence. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[runConstrainedDecode called] --> B[Get logits from engine]
    B --> C[RepetitionGuard: compute blockedTokenIDs]
    C --> D[ConstrainedSampler.selectToken]
    D --> E[candidatePool: limit < count?]
    E -->|No - return all IDs| F[id-ordered full vocab]
    E -->|Yes| G[O-vocab scan: fixed-size buffer, worstCandidateIndex evicts on better logit]
    G --> F
    F --> H[argmax over surviving admissible/unblocked tokens]
    H --> I{Token found?}
    I -->|nil| J[stopReason = no_admissible_token]
    I -->|tokenID| K[preCommitStopReason check]
    K -->|stop| L[break loop]
    K -->|continue| M{confidenceFloor > -.infinity?}
    M -->|No - skip logSumExp| N[Append bytes, tokensGenerated++]
    M -->|Yes| O[logProb: logSumExp over full vocab]
    O --> P[sumLogprob += logProb]
    P --> N
    N --> Q[engine.acceptToken]
    Q --> R{Sentence boundary?}
    R -->|Yes| S[break loop]
    R -->|No| B
    J --> T[shouldSuppress]
    L --> T
    S --> T
    T -->|suppress| U[return empty string]
    T -->|pass| V[return generatedText]
Reviews (1): Last reviewed commit: "Fix constrained-decode latency: bounded ..."

Making the constrained decoder the only llama decode path (#532) moved
every suggestion onto two O(vocab) operations per generated token that
the deleted native sampler never paid:

- ConstrainedSampler.candidatePool sorted the full vocabulary
  (150k-256k tokens) every step just to take the top topK. Replace the
  full sort with a single-pass bounded top-K selection that keeps the
  same membership and the same lower-id tie-break.
- runConstrainedDecode scored every token with a full-vocab logSumExp to
  feed the confidence floor, which shouldSuppress treats as a no-op at
  the default floor of -infinity. Skip it unless a caller raises the floor.

Token selection dropped from ~8.0s to ~0.55s per suggestion in a debug
build (200k vocab, 25-token budget) with identical selected tokens. A
4000-trial randomized equivalence test pins the fast path to the old
full-sort behavior bit-for-bit.
@FuJacob FuJacob merged commit a3e5fe7 into main Jun 2, 2026
4 checks passed
@FuJacob FuJacob deleted the fix/llm-generation-slow-regression branch June 2, 2026 06:18