Skip to content

fix(hook): break the macOS run loop on a stop flag, not CFRunLoopStop alone#263

Open
AprilNEA wants to merge 1 commit into
masterfrom
fix/macos-hook-stop-race
Open

fix(hook): break the macOS run loop on a stop flag, not CFRunLoopStop alone#263
AprilNEA wants to merge 1 commit into
masterfrom
fix/macos-hook-stop-race

Conversation

@AprilNEA

Copy link
Copy Markdown
Owner

Problem

macos::stop() relies solely on CFRunLoopStop to terminate the hook thread. But CFRunLoopStop is a no-op unless the run loop is currently running: it only sets the stop flag when rl->_currentMode is non-null. The hook thread services the tap in 500 ms run_in_mode slices and, between slices, runs has_accessibility() (a TCC IPC) and tap.enable().

If stop() fires in that between-slices gap, the CF stop is dropped, nothing records the request, the thread re-enters a fresh 500 ms slice, and stop()'s thread.join() blocks forever — with the CGEventTap still live at the HID location. The window is small (a few ms out of every 500 ms) but the failure mode is a permanent hang on hook teardown.

Fix

Add an AtomicBool stop flag, checked at the top of every run-loop slice — mirroring the Linux backend's existing stop-flag pattern. stop() now sets the flag before calling run_loop.stop(), so shutdown always terminates:

  • common case (thread inside run_in_mode): run_loop.stop() wakes it immediately → returns Stopped → breaks, ~0 ms;
  • race case (stop lands between slices, CF stop dropped): the thread observes the flag at the next slice top → breaks within one 500 ms slice.

Either way the loop exits and falls through to the existing disable_tap(), so the tap is always detached on teardown. Relaxed ordering suffices: the flag carries no other data, and thread.join() provides the final synchronisation.

Testing

  • cargo clippy -p openlogi-hook --all-targets -- -D warnings — clean.
  • Full-workspace clippy via the pre-push hook — clean.

Not unit-tested: the path is FFI/timing-bound (a real repro needs Accessibility + a live CGEventTap), and the crate has no macOS unit tests today. A flag-only assertion would be tautological. The change is small and verifiable by inspection; happy to extract the slice-loop's termination decision behind a seam if a regression test is wanted.

… alone

CFRunLoopStop is a no-op when it lands in the gap between the hook
thread's 500ms run_in_mode slices: the shutdown signal is dropped, the
loop re-enters a fresh slice, and stop()'s thread.join() blocks forever
with the event tap still live at the HID location.

Add an AtomicBool checked at the top of every slice. stop() now sets the
flag before calling run_loop.stop(), so shutdown always terminates:
promptly via the CF stop while a slice is running, otherwise within one
500ms slice via the flag. Mirrors the Linux backend's stop-flag pattern.
@greptile-apps

greptile-apps Bot commented Jun 15, 2026

Copy link
Copy Markdown

Greptile Summary

This PR fixes a genuine hang-on-teardown race in the macOS hook backend where CFRunLoopStop is silently dropped when it fires in the gap between 500 ms run_in_mode slices, causing stop()'s thread.join() to block indefinitely with the CGEventTap still live at the HID location.

  • Adds an Arc<AtomicBool> stop flag to HookInner, cloned into the hook thread, and checked at the top of every run-loop slice — mirroring the existing Linux backend's pattern exactly.
  • stop() now stores true to the flag before calling run_loop.stop(), ensuring the thread always observes the shutdown request within at most one 500 ms slice even when the CF stop is a no-op.

Confidence Score: 4/5

Safe to merge — the fix is a small, self-contained addition that eliminates a real permanent-hang path and always falls through to the existing disable_tap() cleanup.

The AtomicBool pattern is correct and already proven in the Linux backend. The only gap is that the flag is not rechecked in the TimedOut/HandledSource arm, leaving a cosmetic enabled→disabled tap toggle and a small IPC overhead in the race case.

Only crates/openlogi-hook/src/macos.rs changed; the race-case flag-recheck gap in the TimedOut/HandledSource arm is worth a follow-up.

Important Files Changed

Filename Overview
crates/openlogi-hook/src/macos.rs Introduces an AtomicBool stop flag checked at each loop iteration top, mirroring the Linux backend pattern, to guarantee the hook thread terminates even when CFRunLoopStop fires between run-loop slices. The fix correctly sets the flag before calling run_loop.stop() in stop(), and the Relaxed ordering is sound. One minor gap: the flag is not rechecked in the TimedOut/HandledSource arm, so the race case still executes has_accessibility() + tap.enable() before the flag is observed.

Sequence Diagram

sequenceDiagram
    participant C as Caller (stop())
    participant T as Hook Thread

    Note over C,T: Common case — stop fires during run_in_mode slice
    C->>C: stop.store(true, Relaxed)
    C->>T: run_loop.stop()
    T-->>T: run_in_mode returns Stopped
    T-->>T: break → disable_tap()
    C->>T: thread.join()
    T-->>C: joined (~0 ms)

    Note over C,T: Race case — stop fires between slices
    T-->>T: flag check passes (false)
    T-->>T: enter run_in_mode (500 ms slice)
    C->>C: stop.store(true, Relaxed)
    C->>T: run_loop.stop() [no-op, loop not running]
    T-->>T: run_in_mode returns TimedOut
    T-->>T: has_accessibility() + tap.enable()
    T-->>T: loop top: flag check → true → break → disable_tap()
    C->>T: thread.join()
    T-->>C: joined (≤500 ms + IPC overhead)
Loading

Comments Outside Diff (1)

  1. crates/openlogi-hook/src/macos.rs, line 330-354 (link)

    P2 Stop flag not rechecked before re-enabling the tap

    In the race case where stop() fires in the gap after the flag check passes and before run_in_mode is entered, the thread spends up to 500 ms in the slice, returns TimedOut, then calls has_accessibility() (a TCC IPC round-trip, ~5–20 ms) and tap.enable() before the flag is finally observed at the next iteration top. The tap is then immediately disabled by disable_tap(), so this is not a correctness bug — but worst-case shutdown tail latency is 500 ms + IPC overhead rather than the stated "one 500 ms slice", and the tap is briefly toggled enabled→disabled unnecessarily. Adding a flag check in the TimedOut | HandledSource arm would short-circuit those two calls.

    Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!

    Fix in Codex Fix in Claude Code

Fix All in Codex Fix All in Claude Code

Reviews (1): Last reviewed commit: "fix(hook): break the macOS run loop on a..." | Re-trigger Greptile

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant