Skip to content

fix(telemetry): measure_swarms guard — swallow call timeouts and shape drift, not just :noproc#72

Merged
acastellana merged 1 commit into
mainfrom
fix/telemetry-measure-guard
Jul 5, 2026
Merged

fix(telemetry): measure_swarms guard — swallow call timeouts and shape drift, not just :noproc#72
acastellana merged 1 commit into
mainfrom
fix/telemetry-measure-guard

Conversation

@acastellana

@acastellana acastellana commented Jul 4, 2026

Copy link
Copy Markdown
Contributor

The telemetry_poller periodic measurement measure_swarms/0 only caught :exit, {:noproc, _}. Two other failure shapes escaped and logged Error when calling MFA defined by measurement: Genswarms.Telemetry :measure_swarms [] on the poll tick:

  • GenServer.call timeout while SwarmManager is busy — observed live on a wingston host mid-agent-turn (2026-07-04 16:24:50)
  • any raise from swarm-entry shape drift (swarm.agent_count on a map without the key)

A measurement must never blow up the poller: this widens the guard to a shape-agnostic :exit catch + rescue, reads agent_count via Map.get/3, and adds tests pinning the no-raise contract and the guard shape.

Full suite: 612 tests, 1 failure — RootlessLauncherScriptTest (RLIMIT_AS), which fails identically on main on macOS (ulimit -v unsupported); unrelated to this diff.

Summary by CodeRabbit

  • Bug Fixes
    • Improved telemetry collection so swarm metrics continue reporting even if one swarm entry is malformed or data is missing.
    • Missing agent counts now default to 0 instead of causing telemetry emission to fail.
    • Telemetry polling is now more resilient to unexpected runtime errors and process exits.

…the guard

The periodic measurement caught only {:noproc, _} exits. Under agent load
the SwarmManager GenServer.call can exit with :timeout (seen live: 'Error
when calling MFA defined by measurement: Genswarms.Telemetry
:measure_swarms []'), and a swarm entry missing :agent_count would raise —
both escaped and logged poller errors every tick. rescue-all + :exit-wide
catch, agent_count via Map.get. Tests pin the no-raise contract and the
guard shape.
@coderabbitai

coderabbitai Bot commented Jul 4, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

The measure_swarms/0 function in telemetry.ex was hardened to default missing agent_count to 0 and to suppress any exception or exit, not just {:noproc, _}. A new test file was added to verify this defensive behavior.

Changes

Telemetry resilience

Layer / File(s) Summary
Defensive telemetry polling and tests
lib/genswarms/telemetry/telemetry.ex, test/genswarms/telemetry/telemetry_test.exs
measure_swarms/0 defaults agent_count to 0, adds a rescue clause, and broadens catch to any :exit value; new tests assert the function always returns :ok and that the source contains both guard patterns.

Estimated code review effort: 1 (Trivial) | ~5 minutes

Sequence Diagram(s)

Not applicable.

Suggested labels: telemetry, tests

Suggested reviewers: genlayerlabs

Poem
A rabbit hops through swarm and code,
catching crashes down the road.
No exit, error, nor stray noproc,
shall break the count or block the clock.
Tests confirm: it's always :ok — hooray! 🐇

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main fix: broader telemetry guard handling for timeouts and shape drift beyond :noproc.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/telemetry-measure-guard

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
test/genswarms/telemetry/telemetry_test.exs (1)

14-27: 📐 Maintainability & Code Quality | 🔵 Trivial | 🏗️ Heavy lift

Guard-shape assertion tests source text, not runtime behavior.

This test reads telemetry.ex as a string and regex-matches for catch/:exit,\s*_/rescue. It doesn't exercise measure_swarms/0 against an actual timeout or raise, so it can't confirm the guard actually suppresses those failures at runtime — only that certain tokens appear near each other in source. A harmless rewrite (e.g. catch :exit, _reason -> reformatted across lines, or swapping to is_struct/2 guards) could pass or fail this check independent of real behavior.

Consider instead exercising real failure paths, e.g. stub Genswarms.SwarmManager.list/0 (via Mox/behaviour indirection) to raise, and a fake GenServer that never replies to force a GenServer.call timeout, then assert measure_swarms/0 still returns :ok.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@test/genswarms/telemetry/telemetry_test.exs` around lines 14 - 27, The
current test in telemetry_test.exs validates `measure_swarms/0` by scanning the
`Genswarms.Telemetry.measure_swarms` source text for `catch`, `:exit`, and
`rescue`, which only proves tokens exist and not that failures are swallowed at
runtime. Replace this with runtime coverage by exercising `measure_swarms/0`
through a stubbed `Genswarms.SwarmManager.list/0` (using behaviour
indirection/Mox or similar) that raises, and a timeout path that forces a
`GenServer.call` exit; then assert `measure_swarms/0` still completes
successfully.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@lib/genswarms/telemetry/telemetry.ex`:
- Around line 116-130: The telemetry collection in
Genswarms.Telemetry.handle_event/3 is swallowing failures for the entire batch,
so isolate each swarm inside the Enum.each loop and let one bad entry not stop
the rest from emitting [:genswarms, :swarm, :agent_count]. Update the unguarded
swarm.name access to be defensive like the existing Map.get/3 usage for
agent_count, and add logging in the rescue/catch paths so failures from
Genswarms.SwarmManager.list() or malformed swarm data are visible instead of
silently returning :ok.

---

Nitpick comments:
In `@test/genswarms/telemetry/telemetry_test.exs`:
- Around line 14-27: The current test in telemetry_test.exs validates
`measure_swarms/0` by scanning the `Genswarms.Telemetry.measure_swarms` source
text for `catch`, `:exit`, and `rescue`, which only proves tokens exist and not
that failures are swallowed at runtime. Replace this with runtime coverage by
exercising `measure_swarms/0` through a stubbed `Genswarms.SwarmManager.list/0`
(using behaviour indirection/Mox or similar) that raises, and a timeout path
that forces a `GenServer.call` exit; then assert `measure_swarms/0` still
completes successfully.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 9226b2e8-b334-40b9-9ba1-0c698c7845ae

📥 Commits

Reviewing files that changed from the base of the PR and between 197bd59 and f44d1cf.

📒 Files selected for processing (2)
  • lib/genswarms/telemetry/telemetry.ex
  • test/genswarms/telemetry/telemetry_test.exs

Comment on lines 116 to 130
try do
swarms = Genswarms.SwarmManager.list()

Enum.each(swarms, fn swarm ->
:telemetry.execute(
[:genswarms, :swarm, :agent_count],
%{agent_count: swarm.agent_count},
%{agent_count: Map.get(swarm, :agent_count, 0)},
%{swarm: swarm.name}
)
end)
rescue
_ -> :ok
catch
:exit, {:noproc, _} -> :ok
:exit, _ -> :ok
end

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Single bad swarm entry silently drops telemetry for every swarm that tick, with zero logging.

The try wraps the whole Enum.each, so a raise inside processing one malformed swarm (e.g. swarm.name missing/shape-drifted) aborts the loop and skips telemetry for all remaining swarms that cycle — not just the offending one. Additionally, rescue _ -> :ok / catch :exit, _ -> :ok swallow failures with no logging, so if SwarmManager starts timing out or crashing repeatedly, there is no signal to notice it; you only stop seeing the "Error when calling MFA" log, replaced by silent, permanent telemetry gaps.

Also note swarm.name is accessed unguarded while agent_count is defensively read with Map.get/3 — inconsistent given the stated shape-drift concern.

🛠️ Proposed fix: per-entry isolation + logging
+  require Logger
+
   `@doc` false
   def measure_swarms do
     try do
       swarms = Genswarms.SwarmManager.list()

       Enum.each(swarms, fn swarm ->
-        :telemetry.execute(
-          [:genswarms, :swarm, :agent_count],
-          %{agent_count: Map.get(swarm, :agent_count, 0)},
-          %{swarm: swarm.name}
-        )
+        try do
+          :telemetry.execute(
+            [:genswarms, :swarm, :agent_count],
+            %{agent_count: Map.get(swarm, :agent_count, 0)},
+            %{swarm: Map.get(swarm, :name, :unknown)}
+          )
+        rescue
+          e -> Logger.warning("measure_swarms: skipping malformed swarm entry: #{Exception.message(e)}")
+        end
       end)
+
+      :ok
     rescue
-      _ -> :ok
+      e ->
+        Logger.warning("measure_swarms: failed to list swarms: #{Exception.message(e)}")
+        :ok
     catch
-      :exit, _ -> :ok
+      :exit, reason ->
+        Logger.warning("measure_swarms: SwarmManager call failed: #{inspect(reason)}")
+        :ok
     end
   end
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
try do
swarms = Genswarms.SwarmManager.list()
Enum.each(swarms, fn swarm ->
:telemetry.execute(
[:genswarms, :swarm, :agent_count],
%{agent_count: swarm.agent_count},
%{agent_count: Map.get(swarm, :agent_count, 0)},
%{swarm: swarm.name}
)
end)
rescue
_ -> :ok
catch
:exit, {:noproc, _} -> :ok
:exit, _ -> :ok
end
require Logger
`@doc` false
def measure_swarms do
try do
swarms = Genswarms.SwarmManager.list()
Enum.each(swarms, fn swarm ->
try do
:telemetry.execute(
[:genswarms, :swarm, :agent_count],
%{agent_count: Map.get(swarm, :agent_count, 0)},
%{swarm: Map.get(swarm, :name, :unknown)}
)
rescue
e ->
Logger.warning(
"measure_swarms: skipping malformed swarm entry: #{Exception.message(e)}"
)
end
end)
:ok
rescue
e ->
Logger.warning("measure_swarms: failed to list swarms: #{Exception.message(e)}")
:ok
catch
:exit, reason ->
Logger.warning("measure_swarms: SwarmManager call failed: #{inspect(reason)}")
:ok
end
end
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@lib/genswarms/telemetry/telemetry.ex` around lines 116 - 130, The telemetry
collection in Genswarms.Telemetry.handle_event/3 is swallowing failures for the
entire batch, so isolate each swarm inside the Enum.each loop and let one bad
entry not stop the rest from emitting [:genswarms, :swarm, :agent_count]. Update
the unguarded swarm.name access to be defensive like the existing Map.get/3
usage for agent_count, and add logging in the rescue/catch paths so failures
from Genswarms.SwarmManager.list() or malformed swarm data are visible instead
of silently returning :ok.

@acastellana acastellana merged commit 4036ec5 into main Jul 5, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant