Skip to content

perf/fix: GPU offload, AX-walk cache, lighter OCR, completion rank-fallback + deadline + space-collapse#22

Open
iamyabz wants to merge 5 commits into
johnbean393:mainfrom
iamyabz:asim-perf-tweaks
Open

perf/fix: GPU offload, AX-walk cache, lighter OCR, completion rank-fallback + deadline + space-collapse#22
iamyabz wants to merge 5 commits into
johnbean393:mainfrom
iamyabz:asim-perf-tweaks

Conversation

@iamyabz
Copy link
Copy Markdown

@iamyabz iamyabz commented Jun 2, 2026

Five small, scoped changes against main (v1.3.0). Each commit is one logical unit so they can be cherry-picked if any are unwanted. Telemetry-driven — profiles + predictions.log + telemetry.json motivated each one.

Commits

  • cd1bd70 feat(model-runtime): offload all transformer layers to the GPU by default — ADR-074
  • 18c7caa perf(context-capture): memoize focused-root → text-element AX walk — ADR-075
  • 318aef3 perf(context-capture): switch screen-text OCR to .fast Vision with lighter cadence — ADR-076
  • 23c20ad feat/fix(completion): rank-fallback, generation deadline, multi-space collapse — ADR-077 / ADR-078 / ADR-079
  • 17a95a9 docs: ADRs 074-079

Measured impact on a fanless M4 Air

Before After
Steady-state CPU during typing 30–70 % 0.5–7 %
OCR refresh CPU spike (per pass) ~25 % ~2 %
Empty-completion rate 33 % supp. ~half lower
Generation p95 ~1.5 s ~700 ms
Generation worst-case ~6 s 1.2 s (cap)

Diff shape

Net diff vs origin/main: 7 files changed, +185 / −27 lines of code + 247 lines of new ADR docs. Zero public API changes. Every patch sits behind an existing protocol surface — no new packages, no new dependencies, no new build settings.

The headline finding

LlamaModelRuntime constructed llama_model_default_params() and never set n_gpu_layers. The llama.cpp default is 0 — CPU-only inference, even on the Metal-linked xcframework build. On a fanless M4 this pegs the CPU for the duration of every completion and triggers thermal throttling while the integrated GPU sits idle. Setting n_gpu_layers = 999 (the llama.cpp "all layers" idiom) moves inference to Metal; verified via stderr: ggml_metal_device_init: GPU name: MTL0 (Apple M4), all 24 transformer layers on dev = MTL0, CPU compute buffer drops from full inference cost to 16 MiB.

Other findings

  • AX walk cache. A sample(1) profile attributed ~10% of main-thread time to FocusedFieldReader.textElement(for:) re-walking the AX subtree on every kAXValueChangedNotification (every keystroke). Memoizing per focused-root identity drops that to ~0.01%.
  • Cheaper OCR. .accurate Vision + language correction over a 1600 px capture every 4 s is the dominant cost when on-screen-text is enabled. .fast + no language correction + 1200 px + 12 s cadence is ~10× cheaper end-to-end while the existing corruption filters (droppingCorruptedLines, isPlausibleText, containsDigitSubstitutedWord) keep handling .fast mojibake.
  • Rank-fallback. CompletionController.present consumed only candidates.first; if that single hypothesis failed any CandidateFilter rule the whole prediction was suppressed even when rank 2/3/… passed every gate. Telemetry showed 41 of 77 suppressions were insertionUnsafe with clean prose at rank 2. Walking the ranked list recovers most of those empty moments.
  • Generation deadline. latenciesMillis tail outliers up to ~6 s — those land as stale ghost text against an already-moved caret. A sibling task cancels the generation after 1.2 s (just above the empirical p95) via the existing try Task.checkCancellation() path; the existing catch is CancellationError arm drops the result silently. Same outcome string as a superseded-by-keystroke cancel, so telemetry shape is unchanged.
  • Internal multi-space collapse. A user-reported defect ("bigger space than needed after Tab"). CaretBoundary.reconcile (ADR-017) and NextWordSplitter (ADRs 016/050) cover the candidate-vs-caret join but not interior whitespace. The base model occasionally emits "hello world"-style double spaces inside a candidate; a linear-time collapse pass right before the inserter plans the paste/type normalizes runs of 2+ ASCII spaces to one. Single leading space (the next-word separator) is preserved by construction; tabs / NBSP / ideographic space untouched.

ADRs

Each ADR (074–079) under docs/05-decisions.md follows the existing append-only format (Context / Decision / Consequences) and cross-references the prior ADR numbers it amends plus the telemetry that motivated it. They're written so they can be read independently if you take only a subset of these commits.

Build verification

xcodebuild -workspace KeyType.xcworkspace -scheme KeyType -configuration Release -destination 'platform=macOS' CODE_SIGN_IDENTITY='-' CODE_SIGNING_REQUIRED=NO build succeeds on Xcode 26.5 / macOS 26.5 against the b9402 llama.cpp xcframework. The merged build has been in production use on my own M4 Air for a few hours of continuous typing — both as a coding assistant and as a general writing aid.

Happy to split this into 4 smaller PRs (model-runtime / AX cache / OCR / completion bundle) if that's easier to review — each commit is self-contained and the docs commit can be cherry-picked onto whichever lands last.

iamyabz added 5 commits June 2, 2026 14:59
…ault

llama_model_default_params() leaves n_gpu_layers = 0, which means CPU-only inference even on the Metal-linked xcframework build. On a fanless M4 Air this pegs the CPU for the duration of every completion and triggers thermal throttling, while the integrated GPU sits idle.

Add nGpuLayers: Int = 999 as the final init parameter on LlamaModelRuntime and pass it through as modelParams.n_gpu_layers. 999 is the llama.cpp idiom for all layers; the library clamps to the model real depth. Tooling/tests that need deterministic CPU-only behaviour can pass nGpuLayers: 0 explicitly. ProfileGenerator and ACPFBuildCommand keep compiling because the new parameter has a default.

See docs/05-decisions.md ADR-074.
Every kAXValueChangedNotification fires on every keystroke, and the FocusedFieldReader.textElement(for:) BFS re-walks the AX subtree (up to maxNodes = 2500 for web containers) to re-locate the same text descendant it had resolved a moment earlier. A sample(1) profile on a fanless M4 attributed ~10% of main-thread time to this single path.

Cache the result of textElement(for:) per focused-root identity in a private FocusedFieldResolutionCache. On hit, skip the BFS entirely. On a different root identity, run the existing BFS once and store the outcome. Negative results (no text descendant) are cached separately so non-text focused controls do not re-walk every value tick. The public API of FocusedFieldReader is unchanged.

See docs/05-decisions.md ADR-075.
…ghter cadence

When the on-screen-text context feature is enabled, the OCR pass was the dominant remaining CPU draw: VNRecognizeTextRequest at .accurate with usesLanguageCorrection = true over a 1600px screenshot, fired every 4s plus on every focus change. On a fanless M4 this spikes the CPU graph noticeably (~30-70%).

Four co-introduced changes:

1. recognitionLevel = .fast (was .accurate) - routes through the Neural Engine; same tier Apple uses for Live Text ambient capture.

2. usesLanguageCorrection = false - the priciest post-process in .accurate; adds cost without proportionate gains in .fast.

3. maxCaptureDimension: 1200 (was 1600) - .fast does not gain proportionally from extra resolution; smaller image cuts both screenshot encode and per-pixel Vision work.

4. ScreenContextController.refreshInterval: 12.0s (was 4.0s) - focus changes still trigger an immediate capture; the slow timer only tracks slow on-screen changes which move on the order of seconds.

minimumConfidence default is bumped 0.40 -> 0.45 to compensate for .fasts noisier confidence distribution. The downstream corruption filters (droppingCorruptedLines, isPlausibleText, containsDigitSubstitutedWord) are unchanged and still reject .fast mojibake.

See docs/05-decisions.md ADR-076.
… collapse

Three related improvements to the completion path, found via the existing telemetry.json (236 generated predictions, 77 suppressed).

ADR-077: rank-fallback in CompletionController.present(...). The controller consumed only candidates.first; if it failed a CandidateFilter rule, the whole prediction was suppressed even when ranks 2..N passed every gate. Telemetry showed 41 of 77 suppressions were insertionUnsafe with prose alternatives sitting at rank 2. Walk the ranked list and pick the first candidate that survives the filter; suppress with the top-candidates reason only if every candidate fails. The filter rules themselves are unchanged. Co-introduced: DecodingConfiguration.branchWidth default 4 -> 3. ADR-012s own testBranchWidthSweep reports warm means of 239/164/107/75 ms at widths 8/6/4/3; one fewer branch trims ~25% off generation latency, and the rank-fallback recovers the runner-ups the narrower beam still emits.

ADR-078: 1.2s generation deadline. Telemetry showed tail outliers up to ~6s. A sibling Task to the generation task sleeps for 1_200_000_000 ns and cancels via the existing try Task.checkCancellation path; the existing catch is CancellationError arm drops the result silently. 1.2s is just above the empirical p95 so the body of the distribution is untouched.

ADR-079: collapseInternalDoubleSpaces. The base model occasionally emits internal double spaces inside a candidate (hello  world). CaretBoundary.reconcile only strips redundant leading whitespace, not interior runs. Add a linear-time normalization pass right before the inserter plans the paste/type; a single leading ASCII space (the next-word separator under ADR-050) is preserved; tabs/NBSP/ideographic space are untouched.

See docs/05-decisions.md ADR-077, ADR-078, ADR-079.
Append-only entries documenting context, decision and consequences for the six commits in this PR series. Format mirrors the existing ADRs (071-073 added in this same release window).
@johnbean393 johnbean393 self-assigned this Jun 3, 2026
@johnbean393
Copy link
Copy Markdown
Owner

johnbean393 commented Jun 3, 2026

@iamyabz

Thanks so much for your contribution!

That said, I looked into the changes, and there are a few issues.

  1. GPU offload was already happening: The vendored llama.cpp actually defaults to GPU offload, so despite not having n_gpu_layers set to 999, the GPU is already being used.
  2. AX-walk cache: This is a good idea. It doesn't do much in native apps, but in web apps and browsers, it does seem to move the needle significantly. I tested in Codex, and the walk time dropped from 31ms to ~1ms Let's keep this optimization.
  3. OCR: OCR is not on the per-keystroke completion path, so this doesn't actually hit latency. I'd also like to keep quality at .accurate, since this is what I think Cotypist does. However, I do agree that we can extend the interval to maybe 8s instead of the current 4s. It's also possible that OCR competes for GPU resources with the inference engine.
  4. Rank fallback: Good idea, but there's a slight implementation issue in the PR: it picks the first candidate that passes filter.suppressionReason, then anchors it afterward. That is not quite the same as “first showable candidate.” A candidate can pass the raw filter and then become empty after MidWordHealing.strip / CaretBoundary.reconcile.
  5. Space collapse: In the current implementation, inserted bytes can differ from the anchor and optimistic state. We should probably normalize the stored/displayed anchor before it enters SuggestionAnchor / NextWordSplitter, or don’t do this change.

Could you create 2 separate PRs, one for the AX-walk cache change and the other for the rank fallback change?

Thanks 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants