fix(telemetry): measure_swarms guard — swallow call timeouts and shape drift, not just :noproc#72
Conversation
…the guard
The periodic measurement caught only {:noproc, _} exits. Under agent load
the SwarmManager GenServer.call can exit with :timeout (seen live: 'Error
when calling MFA defined by measurement: Genswarms.Telemetry
:measure_swarms []'), and a swarm entry missing :agent_count would raise —
both escaped and logged poller errors every tick. rescue-all + :exit-wide
catch, agent_count via Map.get. Tests pin the no-raise contract and the
guard shape.
📝 WalkthroughWalkthroughThe ChangesTelemetry resilience
Estimated code review effort: 1 (Trivial) | ~5 minutes Sequence Diagram(s)Not applicable. Suggested labels: telemetry, tests Suggested reviewers: genlayerlabs Poem 🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
test/genswarms/telemetry/telemetry_test.exs (1)
14-27: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy liftGuard-shape assertion tests source text, not runtime behavior.
This test reads
telemetry.exas a string and regex-matches forcatch/:exit,\s*_/rescue. It doesn't exercisemeasure_swarms/0against an actual timeout or raise, so it can't confirm the guard actually suppresses those failures at runtime — only that certain tokens appear near each other in source. A harmless rewrite (e.g.catch :exit, _reason ->reformatted across lines, or swapping tois_struct/2guards) could pass or fail this check independent of real behavior.Consider instead exercising real failure paths, e.g. stub
Genswarms.SwarmManager.list/0(viaMox/behaviour indirection) to raise, and a fake GenServer that never replies to force aGenServer.calltimeout, then assertmeasure_swarms/0still returns:ok.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@test/genswarms/telemetry/telemetry_test.exs` around lines 14 - 27, The current test in telemetry_test.exs validates `measure_swarms/0` by scanning the `Genswarms.Telemetry.measure_swarms` source text for `catch`, `:exit`, and `rescue`, which only proves tokens exist and not that failures are swallowed at runtime. Replace this with runtime coverage by exercising `measure_swarms/0` through a stubbed `Genswarms.SwarmManager.list/0` (using behaviour indirection/Mox or similar) that raises, and a timeout path that forces a `GenServer.call` exit; then assert `measure_swarms/0` still completes successfully.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@lib/genswarms/telemetry/telemetry.ex`:
- Around line 116-130: The telemetry collection in
Genswarms.Telemetry.handle_event/3 is swallowing failures for the entire batch,
so isolate each swarm inside the Enum.each loop and let one bad entry not stop
the rest from emitting [:genswarms, :swarm, :agent_count]. Update the unguarded
swarm.name access to be defensive like the existing Map.get/3 usage for
agent_count, and add logging in the rescue/catch paths so failures from
Genswarms.SwarmManager.list() or malformed swarm data are visible instead of
silently returning :ok.
---
Nitpick comments:
In `@test/genswarms/telemetry/telemetry_test.exs`:
- Around line 14-27: The current test in telemetry_test.exs validates
`measure_swarms/0` by scanning the `Genswarms.Telemetry.measure_swarms` source
text for `catch`, `:exit`, and `rescue`, which only proves tokens exist and not
that failures are swallowed at runtime. Replace this with runtime coverage by
exercising `measure_swarms/0` through a stubbed `Genswarms.SwarmManager.list/0`
(using behaviour indirection/Mox or similar) that raises, and a timeout path
that forces a `GenServer.call` exit; then assert `measure_swarms/0` still
completes successfully.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 9226b2e8-b334-40b9-9ba1-0c698c7845ae
📒 Files selected for processing (2)
lib/genswarms/telemetry/telemetry.extest/genswarms/telemetry/telemetry_test.exs
| try do | ||
| swarms = Genswarms.SwarmManager.list() | ||
|
|
||
| Enum.each(swarms, fn swarm -> | ||
| :telemetry.execute( | ||
| [:genswarms, :swarm, :agent_count], | ||
| %{agent_count: swarm.agent_count}, | ||
| %{agent_count: Map.get(swarm, :agent_count, 0)}, | ||
| %{swarm: swarm.name} | ||
| ) | ||
| end) | ||
| rescue | ||
| _ -> :ok | ||
| catch | ||
| :exit, {:noproc, _} -> :ok | ||
| :exit, _ -> :ok | ||
| end |
There was a problem hiding this comment.
🩺 Stability & Availability | 🟠 Major | ⚡ Quick win
Single bad swarm entry silently drops telemetry for every swarm that tick, with zero logging.
The try wraps the whole Enum.each, so a raise inside processing one malformed swarm (e.g. swarm.name missing/shape-drifted) aborts the loop and skips telemetry for all remaining swarms that cycle — not just the offending one. Additionally, rescue _ -> :ok / catch :exit, _ -> :ok swallow failures with no logging, so if SwarmManager starts timing out or crashing repeatedly, there is no signal to notice it; you only stop seeing the "Error when calling MFA" log, replaced by silent, permanent telemetry gaps.
Also note swarm.name is accessed unguarded while agent_count is defensively read with Map.get/3 — inconsistent given the stated shape-drift concern.
🛠️ Proposed fix: per-entry isolation + logging
+ require Logger
+
`@doc` false
def measure_swarms do
try do
swarms = Genswarms.SwarmManager.list()
Enum.each(swarms, fn swarm ->
- :telemetry.execute(
- [:genswarms, :swarm, :agent_count],
- %{agent_count: Map.get(swarm, :agent_count, 0)},
- %{swarm: swarm.name}
- )
+ try do
+ :telemetry.execute(
+ [:genswarms, :swarm, :agent_count],
+ %{agent_count: Map.get(swarm, :agent_count, 0)},
+ %{swarm: Map.get(swarm, :name, :unknown)}
+ )
+ rescue
+ e -> Logger.warning("measure_swarms: skipping malformed swarm entry: #{Exception.message(e)}")
+ end
end)
+
+ :ok
rescue
- _ -> :ok
+ e ->
+ Logger.warning("measure_swarms: failed to list swarms: #{Exception.message(e)}")
+ :ok
catch
- :exit, _ -> :ok
+ :exit, reason ->
+ Logger.warning("measure_swarms: SwarmManager call failed: #{inspect(reason)}")
+ :ok
end
end📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| try do | |
| swarms = Genswarms.SwarmManager.list() | |
| Enum.each(swarms, fn swarm -> | |
| :telemetry.execute( | |
| [:genswarms, :swarm, :agent_count], | |
| %{agent_count: swarm.agent_count}, | |
| %{agent_count: Map.get(swarm, :agent_count, 0)}, | |
| %{swarm: swarm.name} | |
| ) | |
| end) | |
| rescue | |
| _ -> :ok | |
| catch | |
| :exit, {:noproc, _} -> :ok | |
| :exit, _ -> :ok | |
| end | |
| require Logger | |
| `@doc` false | |
| def measure_swarms do | |
| try do | |
| swarms = Genswarms.SwarmManager.list() | |
| Enum.each(swarms, fn swarm -> | |
| try do | |
| :telemetry.execute( | |
| [:genswarms, :swarm, :agent_count], | |
| %{agent_count: Map.get(swarm, :agent_count, 0)}, | |
| %{swarm: Map.get(swarm, :name, :unknown)} | |
| ) | |
| rescue | |
| e -> | |
| Logger.warning( | |
| "measure_swarms: skipping malformed swarm entry: #{Exception.message(e)}" | |
| ) | |
| end | |
| end) | |
| :ok | |
| rescue | |
| e -> | |
| Logger.warning("measure_swarms: failed to list swarms: #{Exception.message(e)}") | |
| :ok | |
| catch | |
| :exit, reason -> | |
| Logger.warning("measure_swarms: SwarmManager call failed: #{inspect(reason)}") | |
| :ok | |
| end | |
| end |
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@lib/genswarms/telemetry/telemetry.ex` around lines 116 - 130, The telemetry
collection in Genswarms.Telemetry.handle_event/3 is swallowing failures for the
entire batch, so isolate each swarm inside the Enum.each loop and let one bad
entry not stop the rest from emitting [:genswarms, :swarm, :agent_count]. Update
the unguarded swarm.name access to be defensive like the existing Map.get/3
usage for agent_count, and add logging in the rescue/catch paths so failures
from Genswarms.SwarmManager.list() or malformed swarm data are visible instead
of silently returning :ok.
The telemetry_poller periodic measurement
measure_swarms/0only caught:exit, {:noproc, _}. Two other failure shapes escaped and loggedError when calling MFA defined by measurement: Genswarms.Telemetry :measure_swarms []on the poll tick:swarm.agent_counton a map without the key)A measurement must never blow up the poller: this widens the guard to a shape-agnostic
:exitcatch +rescue, readsagent_countviaMap.get/3, and adds tests pinning the no-raise contract and the guard shape.Full suite: 612 tests, 1 failure —
RootlessLauncherScriptTest(RLIMIT_AS), which fails identically on main on macOS (ulimit -vunsupported); unrelated to this diff.Summary by CodeRabbit
0instead of causing telemetry emission to fail.