Skip to content

fix(telemetry): stop finding() telemetry from blocking the asyncio event loop#677

Open
SEPURI-SAI-KRISHNA wants to merge 2 commits into
usestrix:mainfrom
SEPURI-SAI-KRISHNA:fix/telemetry-nonblocking-672
Open

fix(telemetry): stop finding() telemetry from blocking the asyncio event loop#677
SEPURI-SAI-KRISHNA wants to merge 2 commits into
usestrix:mainfrom
SEPURI-SAI-KRISHNA:fix/telemetry-nonblocking-672

Conversation

@SEPURI-SAI-KRISHNA

Copy link
Copy Markdown

Problem

Telemetry finding events no longer run their blocking network call on the asyncio event loop. Delivery is dispatched to a single bounded background daemon worker; the emit call now
returns immediately.

Fixes #672.

Why

ReportState.add_vulnerability_report() runs synchronously inside the async reporting tool (strix/tools/reporting/tool.py, right after await check_duplicate). It calls
posthog.finding() then scarf.finding(), and each does a blocking urllib.request.urlopen(req, timeout=10).

That stalls the asyncio event loop — and therefore every other concurrent agent's tool call — for up to ~20s (10s + 10s) every time a finding is reported and the telemetry endpoint
is slow or unreachable. Since Strix's core model is multi-agent parallelism, this silently freezes the whole scan.

How

  • strix/telemetry/_common.py: add a fire-and-forget dispatcher — dispatch() (non-blocking enqueue that drops on a full queue, so a hung endpoint can neither block nor back-pressure
    the caller), a single daemon worker loop, and a bounded flush() helper for tests / clean shutdown.
  • strix/telemetry/scarf.py / strix/telemetry/posthog.py: _send() builds the request cheaply on the caller thread and dispatch()es the actual urlopen to the worker. Behavior is
    unchanged when telemetry is disabled (the _is_enabled() short-circuit still runs first).

No public API changes: finding() / start() / end() / error() keep the same signatures and remain fire-and-forget.

Tests

New tests/test_telemetry.py:

  • finding() returns in <0.2s even when the underlying urlopen sleeps 1s, and the event is still delivered off-thread (both scarf + posthog).
  • disabled telemetry sends nothing.
  • a full queue drops the event instead of blocking or raising.

Checks

  • uv run pytest — 88 passed (+5 new)
  • uv run ruff check — clean on all changed files
  • uv run mypy — clean on all changed files

@greptile-apps

greptile-apps Bot commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR moves telemetry delivery off the asyncio caller path. The main changes are:

  • A bounded background queue and daemon worker for telemetry sends.
  • PostHog and Scarf delivery moved into queued closures.
  • Tests for non-blocking findings, disabled telemetry, and full-queue drops.

Confidence Score: 4/5

The changed telemetry flow needs fixes for shutdown delivery and caller-thread construction errors.

  • Terminal telemetry can be queued and then dropped when the daemon worker is killed at process exit.
  • Some payload construction failures now escape before the best-effort send guard runs.
  • The non-blocking network delivery behavior is otherwise covered by the new tests.

strix/telemetry/_common.py, strix/telemetry/posthog.py, strix/telemetry/scarf.py

Important Files Changed

Filename Overview
strix/telemetry/_common.py Adds the shared telemetry queue, daemon worker, dispatch helper, and bounded flush helper.
strix/telemetry/posthog.py Keeps the PostHog public API unchanged while moving the blocking network call into background delivery.
strix/telemetry/scarf.py Keeps the Scarf public API unchanged while moving the blocking network call into background delivery.
tests/test_telemetry.py Adds regression coverage for fast finding() calls and best-effort queue behavior.
Prompt To Fix All With AI
Fix the following 3 code review issues. Work through them one at a time, proposing concise fixes.

---

### Issue 1 of 3
strix/telemetry/_common.py:100
**Daemon Worker Drops Shutdown Events**

When the CLI shutdown path calls `posthog.end(...)` and `scarf.end(...)`, those calls now only enqueue work on this daemon thread. The process can exit before the worker dequeues the terminal events, so end/error telemetry that was previously sent synchronously can be silently lost.

### Issue 2 of 3
strix/telemetry/posthog.py:40
**Serialization Escapes Telemetry Guard**

`json.dumps(payload)` now runs before the guarded delivery closure, while the old implementation caught exceptions from payload construction and network delivery together. If a future caller adds a non-JSON-safe property to `start`, `end`, `error`, or `finding`, telemetry can raise synchronously into the scan path instead of being swallowed as best-effort telemetry.

### Issue 3 of 3
strix/telemetry/scarf.py:39-41
**URL Encoding Escapes Telemetry Guard**

The Scarf URL is now built on the caller thread before the guarded delivery closure. If a property value's string conversion or URL encoding raises, that exception propagates through the public telemetry call and can interrupt reporting, whereas the previous synchronous implementation swallowed construction failures with the send failure path.

Reviews (1): Last reviewed commit: "fix(telemetry): dispatch finding events ..." | Re-trigger Greptile

with _worker_lock:
if _worker_thread is None or not _worker_thread.is_alive():
_worker_thread = threading.Thread(
target=_worker_loop, name="strix-telemetry", daemon=True

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Daemon Worker Drops Shutdown Events

When the CLI shutdown path calls posthog.end(...) and scarf.end(...), those calls now only enqueue work on this daemon thread. The process can exit before the worker dequeues the terminal events, so end/error telemetry that was previously sent synchronously can be silently lost.

Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/telemetry/_common.py
Line: 100

Comment:
**Daemon Worker Drops Shutdown Events**

When the CLI shutdown path calls `posthog.end(...)` and `scarf.end(...)`, those calls now only enqueue work on this daemon thread. The process can exit before the worker dequeues the terminal events, so end/error telemetry that was previously sent synchronously can be silently lost.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a bounded atexit flush (3s) that drains the queue at interpreter shutdown. end/error now deliver in the normal case; the bound means a hung endpoint still can't stall process exit (strictly better than the old per-call 10s synchronous block). Verified at real shutdown the process waits for an in-flight send to finish before exiting.

Comment thread strix/telemetry/posthog.py Outdated
"distinct_id": SESSION_ID,
"properties": properties,
}
data = json.dumps(payload).encode()

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Serialization Escapes Telemetry Guard

json.dumps(payload) now runs before the guarded delivery closure, while the old implementation caught exceptions from payload construction and network delivery together. If a future caller adds a non-JSON-safe property to start, end, error, or finding, telemetry can raise synchronously into the scan path instead of being swallowed as best-effort telemetry.

Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/telemetry/posthog.py
Line: 40

Comment:
**Serialization Escapes Telemetry Guard**

`json.dumps(payload)` now runs before the guarded delivery closure, while the old implementation caught exceptions from payload construction and network delivery together. If a future caller adds a non-JSON-safe property to `start`, `end`, `error`, or `finding`, telemetry can raise synchronously into the scan path instead of being swallowed as best-effort telemetry.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved json.dumps(...) back inside the guarded _deliver closure, so a non-JSON-safe property is swallowed and logged on the worker thread instead of raising on the caller.

Comment thread strix/telemetry/scarf.py Outdated
Comment on lines +39 to +41
query = urllib.parse.urlencode(
{k: ("" if v is None else str(v)) for k, v in props.items()},
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 URL Encoding Escapes Telemetry Guard

The Scarf URL is now built on the caller thread before the guarded delivery closure. If a property value's string conversion or URL encoding raises, that exception propagates through the public telemetry call and can interrupt reporting, whereas the previous synchronous implementation swallowed construction failures with the send failure path.

Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/telemetry/scarf.py
Line: 39-41

Comment:
**URL Encoding Escapes Telemetry Guard**

The Scarf URL is now built on the caller thread before the guarded delivery closure. If a property value's string conversion or URL encoding raises, that exception propagates through the public telemetry call and can interrupt reporting, whereas the previous synchronous implementation swallowed construction failures with the send failure path.

How can I resolve this? If you propose a fix, please make it concise.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same for Scarf quote/urlencode/str() now run inside the guarded closure. The caller thread only does the _is_enabled() check and a non-blocking enqueue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] scarf.finding() + posthog.finding() block asyncio event loop up to 20s per finding

1 participant