perf/fix: GPU offload, AX-walk cache, lighter OCR, completion rank-fallback + deadline + space-collapse #22
Open
iamyabz wants to merge 5 commits into
…ault

llama_model_default_params() leaves n_gpu_layers = 0, which means CPU-only inference even on the Metal-linked xcframework build. On a fanless M4 Air this pegs the CPU for the duration of every completion and triggers thermal throttling, while the integrated GPU sits idle.

Add nGpuLayers: Int = 999 as the final init parameter on LlamaModelRuntime and pass it through as modelParams.n_gpu_layers. 999 is the llama.cpp idiom for "all layers"; the library clamps to the model's real depth. Tooling/tests that need deterministic CPU-only behaviour can pass nGpuLayers: 0 explicitly. ProfileGenerator and ACPFBuildCommand keep compiling because the new parameter has a default.

See docs/05-decisions.md ADR-074.
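To illustrate why existing call sites keep compiling, here is a minimal sketch that uses a stand-in struct instead of the real llama.cpp parameter type. `ModelParamsStandIn`, `RuntimeSketch`, and the `modelPath` parameter are hypothetical names for illustration only; the actual `LlamaModelRuntime` initializer is not shown in this PR text.

```swift
/// Stand-in for the struct returned by llama_model_default_params().
/// Only the single field discussed above is modelled (assumption).
struct ModelParamsStandIn {
    var nGpuLayers: Int32 = 0   // default described in the commit message: CPU-only
}

/// Hypothetical wrapper; the real LlamaModelRuntime takes more parameters.
struct RuntimeSketch {
    let params: ModelParamsStandIn

    // The new parameter comes last and has a default,
    // so callers written before the change still compile.
    init(modelPath: String, nGpuLayers: Int = 999) {
        var p = ModelParamsStandIn()
        p.nGpuLayers = Int32(clamping: nGpuLayers)
        self.params = p
    }
}

// Old-style call site: unchanged source, now requests all layers.
let offloaded = RuntimeSketch(modelPath: "model.gguf")
// Tooling or tests that want the CPU path opt out explicitly.
let cpuOnly = RuntimeSketch(modelPath: "model.gguf", nGpuLayers: 0)
```

The default lives in the Swift initializer rather than at each call site, which is what lets callers such as ProfileGenerator pick up GPU offload without edits.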
kAXValueChangedNotification fires on every keystroke, and each time the FocusedFieldReader.textElement(for:) BFS re-walks the AX subtree (up to maxNodes = 2500 for web containers) to re-locate the same text descendant it had resolved a moment earlier. A sample(1) profile on a fanless M4 attributed ~10% of main-thread time to this single path.

Cache the result of textElement(for:) per focused-root identity in a private FocusedFieldResolutionCache. On a hit, skip the BFS entirely. On a different root identity, run the existing BFS once and store the outcome. Negative results (no text descendant) are cached separately so non-text focused controls do not re-walk on every value tick. The public API of FocusedFieldReader is unchanged.

See docs/05-decisions.md ADR-075.
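The memoization described above can be sketched roughly as follows. This is not the PR's FocusedFieldResolutionCache: `ResolutionCacheSketch`, `RootID`, and `Element` are hypothetical placeholders, the sketch remembers only the most recent root, and how the real code derives root identity or invalidates stale entries is not covered here.

```swift
/// Hypothetical single-entry cache keyed by a focused-root identity.
struct ResolutionCacheSketch<RootID: Hashable, Element> {
    private enum Entry {
        case found(Element)
        case noTextDescendant   // negative result, remembered as its own case
    }

    private var rootID: RootID?
    private var entry: Entry?

    /// Returns the cached outcome for `root`, or runs `walk` once and stores it.
    mutating func resolve(root: RootID, walk: () -> Element?) -> Element? {
        if rootID == root, let cached = entry {
            switch cached {
            case .found(let element): return element
            case .noTextDescendant: return nil
            }
        }
        let result = walk()   // stands in for the expensive BFS
        rootID = root
        if let element = result {
            entry = .found(element)
        } else {
            entry = .noTextDescendant
        }
        return result
    }
}

var cache = ResolutionCacheSketch<Int, String>()
var walks = 0
let first = cache.resolve(root: 1) { walks += 1; return "textField" }
let second = cache.resolve(root: 1) { walks += 1; return "textField" }  // hit: walk skipped
let none1 = cache.resolve(root: 2) { walks += 1; return nil }
let none2 = cache.resolve(root: 2) { walks += 1; return nil }           // negative hit
```

Storing the negative outcome as a distinct case is what keeps a focused non-text control from triggering a fresh walk on every value-changed tick.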
…ghter cadence

When the on-screen-text context feature is enabled, the OCR pass was the dominant remaining CPU draw: VNRecognizeTextRequest at .accurate with usesLanguageCorrection = true over a 1600px screenshot, fired every 4s plus on every focus change. On a fanless M4 this spikes the CPU graph noticeably (~30-70%).

Four co-introduced changes:
1. recognitionLevel = .fast (was .accurate): routes through the Neural Engine; same tier Apple uses for Live Text ambient capture.
2. usesLanguageCorrection = false: the priciest post-process in .accurate; adds cost without proportionate gains in .fast.
3. maxCaptureDimension: 1200 (was 1600): .fast does not gain proportionally from extra resolution; a smaller image cuts both screenshot encode and per-pixel Vision work.
4. ScreenContextController.refreshInterval: 12.0s (was 4.0s): focus changes still trigger an immediate capture; the timer only tracks gradual on-screen changes, which move on the order of seconds.

The minimumConfidence default is bumped 0.40 -> 0.45 to compensate for .fast's noisier confidence distribution. The downstream corruption filters (droppingCorruptedLines, isPlausibleText, containsDigitSubstitutedWord) are unchanged and still reject .fast mojibake.

See docs/05-decisions.md ADR-076.
… collapse

Three related improvements to the completion path, found via the existing telemetry.json (236 generated predictions, 77 suppressed).

ADR-077: rank-fallback in CompletionController.present(...). The controller consumed only candidates.first; if it failed a CandidateFilter rule, the whole prediction was suppressed even when ranks 2..N passed every gate. Telemetry showed 41 of 77 suppressions were insertionUnsafe with prose alternatives sitting at rank 2. Walk the ranked list and pick the first candidate that survives the filter; suppress with the top candidate's reason only if every candidate fails. The filter rules themselves are unchanged.

Co-introduced: DecodingConfiguration.branchWidth default 4 -> 3. ADR-012's own testBranchWidthSweep reports warm means of 239/164/107/75 ms at widths 8/6/4/3; one fewer branch trims ~25% off generation latency, and the rank-fallback recovers the runner-ups the narrower beam still emits.

ADR-078: 1.2s generation deadline. Telemetry showed tail outliers up to ~6s. A sibling Task to the generation task sleeps for 1_200_000_000 ns and cancels via the existing try Task.checkCancellation path; the existing catch is CancellationError arm drops the result silently. 1.2s is just above the empirical p95, so the body of the distribution is untouched.

ADR-079: collapseInternalDoubleSpaces. The base model occasionally emits internal double spaces inside a candidate (hello  world). CaretBoundary.reconcile only strips redundant leading whitespace, not interior runs. Add a linear-time normalization pass right before the inserter plans the paste/type; a single leading ASCII space (the next-word separator under ADR-050) is preserved; tabs/NBSP/ideographic space are untouched.

See docs/05-decisions.md ADR-077, ADR-078, ADR-079.
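The rank-fallback and the space-collapse pass can be sketched together as below. Everything here is an assumption made for illustration: `Candidate`, `Selection`, `selectCandidate`, the `"noCandidates"` reason, and the newline-based rejection rule are hypothetical, and the real CandidateFilter rules and inserter are not shown in this PR text.

```swift
/// Hypothetical candidate type; the real one likely carries scores as well.
struct Candidate { let text: String }

enum Selection: Equatable {
    case present(String)
    case suppress(reason: String)
}

/// Walks the ranked list and returns the first candidate that passes.
/// `rejection` returns nil when a candidate passes, else the failing reason.
/// If every candidate fails, the top candidate's reason is reported.
func selectCandidate(_ ranked: [Candidate],
                     rejection: (Candidate) -> String?) -> Selection {
    var topReason: String?
    for candidate in ranked {
        if let reason = rejection(candidate) {
            if topReason == nil { topReason = reason }
        } else {
            return .present(candidate.text)
        }
    }
    return .suppress(reason: topReason ?? "noCandidates")
}

/// Collapses runs of 2+ ASCII spaces to a single space in one pass.
/// Tabs, NBSP and ideographic spaces are different characters and pass through.
func collapseInternalDoubleSpaces(_ text: String) -> String {
    var result = ""
    result.reserveCapacity(text.utf8.count)
    var previousWasSpace = false
    for character in text {
        let isSpace = character == " "
        if !(isSpace && previousWasSpace) { result.append(character) }
        previousWasSpace = isSpace
    }
    return result
}

// Rank 1 fails the stand-in rule, rank 2 passes and is then normalized.
let picked = selectCandidate(
    [Candidate(text: "}\n"), Candidate(text: " hello  world")],
    rejection: { $0.text.contains("\n") ? "insertionUnsafe" : nil }
)
let cleaned = collapseInternalDoubleSpaces(" hello  world")
```

Because the collapse only drops a space that directly follows another space, a single leading space survives without any special casing.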
Append-only entries documenting context, decision and consequences for the six commits in this PR series. Format mirrors the existing ADRs (071-073 added in this same release window).
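For the generation deadline recorded as ADR-078, the sibling-task pattern might look like the sketch below. `withDeadline` and `slowGeneration` are hypothetical names, not code from this PR; the sketch assumes the generation loop checks for cancellation cooperatively, and it does not forward cancellation of the calling task to the work task.

```swift
/// Hypothetical helper: runs `operation` in its own task and cancels it
/// if it has not finished after `nanoseconds`.
func withDeadline<T: Sendable>(
    nanoseconds: UInt64,
    operation: @escaping @Sendable () async throws -> T
) async throws -> T {
    let work = Task { try await operation() }
    // Sibling task: sleeps for the deadline, then cancels the work task.
    let watchdog = Task {
        try await Task.sleep(nanoseconds: nanoseconds)
        work.cancel()
    }
    // If the work finishes first, stop the watchdog; its sleep then throws
    // and the cancel call above is never reached.
    defer { watchdog.cancel() }
    return try await work.value
}

/// Stand-in for a slow generation that checks cancellation between steps.
func slowGeneration() async throws -> String {
    for _ in 0..<200 {
        try Task.checkCancellation()
        try await Task.sleep(nanoseconds: 10_000_000)   // 10 ms per simulated step
    }
    return "late completion"
}
```

A caller would wrap the generation call in `withDeadline(nanoseconds: 1_200_000_000)` and handle `CancellationError` the same way it handles a cancel caused by a newer keystroke.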
Owner
Thanks so much for your contribution! That said, I looked into the changes, and there are a few issues.
Could you create 2 separate PRs, one for the AX-walk cache change and the other for the rank fallback change? Thanks 🙏
Five small, scoped changes against main (v1.3.0). Each commit is one logical unit so they can be cherry-picked if any are unwanted. Telemetry-driven: profiles + predictions.log + telemetry.json motivated each one.

Commits
- cd1bd70 feat(model-runtime): offload all transformer layers to the GPU by default (ADR-074)
- 18c7caa perf(context-capture): memoize focused-root → text-element AX walk (ADR-075)
- 318aef3 perf(context-capture): switch screen-text OCR to .fast Vision with lighter cadence (ADR-076)
- 23c20ad feat/fix(completion): rank-fallback, generation deadline, multi-space collapse (ADR-077 / ADR-078 / ADR-079)
- 17a95a9 docs: ADRs 074-079

Measured impact on a fanless M4 Air
Diff shape
Net diff vs origin/main: 7 files changed, +185 / −27 lines of code, plus 247 lines of new ADR docs. Zero public API changes. Every patch sits behind an existing protocol surface: no new packages, no new dependencies, no new build settings.

The headline finding
LlamaModelRuntime constructed llama_model_default_params() and never set n_gpu_layers. The llama.cpp default is 0: CPU-only inference, even on the Metal-linked xcframework build. On a fanless M4 this pegs the CPU for the duration of every completion and triggers thermal throttling while the integrated GPU sits idle. Setting n_gpu_layers = 999 (the llama.cpp "all layers" idiom) moves inference to Metal. Verified via stderr: ggml_metal_device_init: GPU name: MTL0 (Apple M4), all 24 transformer layers on dev = MTL0, and the CPU compute buffer drops from full inference cost to 16 MiB.

Other findings
- A sample(1) profile attributed ~10% of main-thread time to FocusedFieldReader.textElement(for:) re-walking the AX subtree on every kAXValueChangedNotification (every keystroke). Memoizing per focused-root identity drops that to ~0.01%.
- .accurate Vision + language correction over a 1600 px capture every 4 s is the dominant cost when on-screen-text is enabled. .fast + no language correction + 1200 px + 12 s cadence is ~10× cheaper end-to-end, while the existing corruption filters (droppingCorruptedLines, isPlausibleText, containsDigitSubstitutedWord) keep handling .fast mojibake.
- CompletionController.present consumed only candidates.first; if that single hypothesis failed any CandidateFilter rule, the whole prediction was suppressed even when rank 2/3/… passed every gate. Telemetry showed 41 of 77 suppressions were insertionUnsafe with clean prose at rank 2. Walking the ranked list recovers most of those empty moments.
- latenciesMillis shows tail outliers up to ~6 s; those land as stale ghost text against an already-moved caret. A sibling task cancels the generation after 1.2 s (just above the empirical p95) via the existing try Task.checkCancellation() path; the existing catch is CancellationError arm drops the result silently. The outcome string is the same as for a superseded-by-keystroke cancel, so the telemetry shape is unchanged.
- CaretBoundary.reconcile (ADR-017) and NextWordSplitter (ADRs 016/050) cover the candidate-vs-caret join but not interior whitespace. The base model occasionally emits "hello  world"-style double spaces inside a candidate; a linear-time collapse pass right before the inserter plans the paste/type normalizes runs of 2+ ASCII spaces to one. A single leading space (the next-word separator) is preserved by construction; tabs / NBSP / ideographic space are untouched.

ADRs
Each ADR (074–079) under docs/05-decisions.md follows the existing append-only format (Context / Decision / Consequences) and cross-references the prior ADR numbers it amends plus the telemetry that motivated it. They're written so they can be read independently if you take only a subset of these commits.

Build verification
xcodebuild -workspace KeyType.xcworkspace -scheme KeyType -configuration Release -destination 'platform=macOS' CODE_SIGN_IDENTITY='-' CODE_SIGNING_REQUIRED=NO build

The command above succeeds on Xcode 26.5 / macOS 26.5 against the b9402 llama.cpp xcframework. The merged build has been in production use on my own M4 Air for a few hours of continuous typing, both as a coding assistant and as a general writing aid.

Happy to split this into 4 smaller PRs (model-runtime / AX cache / OCR / completion bundle) if that's easier to review. Each commit is self-contained, and the docs commit can be cherry-picked onto whichever lands last.