fix(telemetry): stop finding() telemetry from blocking the asyncio event loop#677
fix(telemetry): stop finding() telemetry from blocking the asyncio event loop#677SEPURI-SAI-KRISHNA wants to merge 2 commits into
Conversation
Greptile SummaryThis PR moves telemetry delivery off the asyncio caller path. The main changes are:
Confidence Score: 4/5The changed telemetry flow needs fixes for shutdown delivery and caller-thread construction errors.
strix/telemetry/_common.py, strix/telemetry/posthog.py, strix/telemetry/scarf.py Important Files Changed
Prompt To Fix All With AIFix the following 3 code review issues. Work through them one at a time, proposing concise fixes.
---
### Issue 1 of 3
strix/telemetry/_common.py:100
**Daemon Worker Drops Shutdown Events**
When the CLI shutdown path calls `posthog.end(...)` and `scarf.end(...)`, those calls now only enqueue work on this daemon thread. The process can exit before the worker dequeues the terminal events, so end/error telemetry that was previously sent synchronously can be silently lost.
### Issue 2 of 3
strix/telemetry/posthog.py:40
**Serialization Escapes Telemetry Guard**
`json.dumps(payload)` now runs before the guarded delivery closure, while the old implementation caught exceptions from payload construction and network delivery together. If a future caller adds a non-JSON-safe property to `start`, `end`, `error`, or `finding`, telemetry can raise synchronously into the scan path instead of being swallowed as best-effort telemetry.
### Issue 3 of 3
strix/telemetry/scarf.py:39-41
**URL Encoding Escapes Telemetry Guard**
The Scarf URL is now built on the caller thread before the guarded delivery closure. If a property value's string conversion or URL encoding raises, that exception propagates through the public telemetry call and can interrupt reporting, whereas the previous synchronous implementation swallowed construction failures with the send failure path.
Reviews (1): Last reviewed commit: "fix(telemetry): dispatch finding events ..." | Re-trigger Greptile |
| with _worker_lock: | ||
| if _worker_thread is None or not _worker_thread.is_alive(): | ||
| _worker_thread = threading.Thread( | ||
| target=_worker_loop, name="strix-telemetry", daemon=True |
There was a problem hiding this comment.
Daemon Worker Drops Shutdown Events
When the CLI shutdown path calls posthog.end(...) and scarf.end(...), those calls now only enqueue work on this daemon thread. The process can exit before the worker dequeues the terminal events, so end/error telemetry that was previously sent synchronously can be silently lost.
Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/telemetry/_common.py
Line: 100
Comment:
**Daemon Worker Drops Shutdown Events**
When the CLI shutdown path calls `posthog.end(...)` and `scarf.end(...)`, those calls now only enqueue work on this daemon thread. The process can exit before the worker dequeues the terminal events, so end/error telemetry that was previously sent synchronously can be silently lost.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Added a bounded atexit flush (3s) that drains the queue at interpreter shutdown. end/error now deliver in the normal case; the bound means a hung endpoint still can't stall process exit (strictly better than the old per-call 10s synchronous block). Verified at real shutdown the process waits for an in-flight send to finish before exiting.
| "distinct_id": SESSION_ID, | ||
| "properties": properties, | ||
| } | ||
| data = json.dumps(payload).encode() |
There was a problem hiding this comment.
Serialization Escapes Telemetry Guard
json.dumps(payload) now runs before the guarded delivery closure, while the old implementation caught exceptions from payload construction and network delivery together. If a future caller adds a non-JSON-safe property to start, end, error, or finding, telemetry can raise synchronously into the scan path instead of being swallowed as best-effort telemetry.
Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/telemetry/posthog.py
Line: 40
Comment:
**Serialization Escapes Telemetry Guard**
`json.dumps(payload)` now runs before the guarded delivery closure, while the old implementation caught exceptions from payload construction and network delivery together. If a future caller adds a non-JSON-safe property to `start`, `end`, `error`, or `finding`, telemetry can raise synchronously into the scan path instead of being swallowed as best-effort telemetry.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Moved json.dumps(...) back inside the guarded _deliver closure, so a non-JSON-safe property is swallowed and logged on the worker thread instead of raising on the caller.
| query = urllib.parse.urlencode( | ||
| {k: ("" if v is None else str(v)) for k, v in props.items()}, | ||
| ) |
There was a problem hiding this comment.
URL Encoding Escapes Telemetry Guard
The Scarf URL is now built on the caller thread before the guarded delivery closure. If a property value's string conversion or URL encoding raises, that exception propagates through the public telemetry call and can interrupt reporting, whereas the previous synchronous implementation swallowed construction failures with the send failure path.
Prompt To Fix With AI
This is a comment left during a code review.
Path: strix/telemetry/scarf.py
Line: 39-41
Comment:
**URL Encoding Escapes Telemetry Guard**
The Scarf URL is now built on the caller thread before the guarded delivery closure. If a property value's string conversion or URL encoding raises, that exception propagates through the public telemetry call and can interrupt reporting, whereas the previous synchronous implementation swallowed construction failures with the send failure path.
How can I resolve this? If you propose a fix, please make it concise.There was a problem hiding this comment.
Same for Scarf quote/urlencode/str() now run inside the guarded closure. The caller thread only does the _is_enabled() check and a non-blocking enqueue.
…ide the delivery guard
Problem
Telemetry finding events no longer run their blocking network call on the asyncio event loop. Delivery is dispatched to a single bounded background daemon worker; the emit call now
returns immediately.
Fixes #672.
Why
ReportState.add_vulnerability_report() runs synchronously inside the async reporting tool (strix/tools/reporting/tool.py, right after await check_duplicate). It calls
posthog.finding() then scarf.finding(), and each does a blocking urllib.request.urlopen(req, timeout=10).
That stalls the asyncio event loop — and therefore every other concurrent agent's tool call — for up to ~20s (10s + 10s) every time a finding is reported and the telemetry endpoint
is slow or unreachable. Since Strix's core model is multi-agent parallelism, this silently freezes the whole scan.
How
the caller), a single daemon worker loop, and a bounded flush() helper for tests / clean shutdown.
unchanged when telemetry is disabled (the _is_enabled() short-circuit still runs first).
No public API changes: finding() / start() / end() / error() keep the same signatures and remain fire-and-forget.
Tests
New tests/test_telemetry.py:
Checks