Skip to content

fix(observability): demote channel supervisor restart noise (Sentry TAURI-RUST-15)#2691

Closed
oxoxDev wants to merge 4 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-15-channel-supervisor-restart
Closed

fix(observability): demote channel supervisor restart noise (Sentry TAURI-RUST-15)#2691
oxoxDev wants to merge 4 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-15-channel-supervisor-restart

Conversation

@oxoxDev
Copy link
Copy Markdown
Contributor

@oxoxDev oxoxDev commented May 26, 2026

Summary

  • Add ExpectedErrorKind::ChannelSupervisorRestart classifier tier in src/core/observability.rs, demoting per-restart messages from channels::runtime::supervision::spawn_supervised_listener to a tracing::info! breadcrumb (no Sentry event).
  • New is_channel_supervisor_restart_message predicate anchors on the Rust supervisor wrapper format ("Channel <name> error: <inner>; restarting") — language-agnostic, so it catches OS-localized inner errors that the English-only is_network_unreachable_message anchors miss.
  • Precedence is checked BEFORE is_loopback_unavailable and is_network_unreachable_message so the supervisor wrap always wins over its inner-error substrings.
  • Targets self-hosted Sentry tauri-rust top-1 by event count (TAURI-RUST-15, ~11.4 k events / 14d) and its Chinese-Windows WSAETIMEDOUT variant (TAURI-RUST-BB, ~815 events).

Problem

The supervised-listener loop at src/openhuman/channels/runtime/supervision.rs:60 reports every transient channel restart through report_error_or_expected:

let message = format!("Channel {} error: {e:#}; restarting", ch.name());
crate::core::observability::report_error_or_expected(
    message.as_str(), "channels", "supervised_listener", &[("channel", ch.name())],
);

The supervisor restarts the listener with its own exponential backoff and sustained outages surface via health.bus / FAIL_ESCALATE_THRESHOLD (a separate path). The per-restart message itself carries no actionable Sentry signal.

Two failure modes today:

  1. English Discord-gateway body flows through expected_error_kind, matches is_network_unreachable_message (because the inner reqwest error contains "error sending request for url"), and demotes to tracing::warn!. The Sentry tracing layer still captures warn! as a Sentry event (just at lower severity). Across multi-user × Discord-channel × intermittent-network this produces 11,408 events / 14d on self-hosted tauri-rust.
  2. Chinese-Windows WSAETIMEDOUT body (IO error: 由于连接方在一段时间后没有正确答复或连接的主机没有反应... (os error 10060)) doesn't match any English-only anchor in is_network_unreachable_message, so it escapes classification entirely and emits as a full Sentry error (TAURI-RUST-BB, ~815 events).

Solution

Add a new ChannelSupervisorRestart classifier tier that:

  1. Is language-agnostic — anchored on the Rust wrapper format the supervisor produces, not the inner error wording. The matcher requires three anchors together ("channel " prefix + " error:" separator + "; restarting" trailer) so it covers OS-localized variants without false-positiving on generic restart logs.
  2. Takes precedence over NetworkUnreachable and LoopbackUnavailable so the supervisor wrap always wins over its inner-error substrings.
  3. Demotes to tracing::info! — breadcrumb only, no Sentry event. Sustained outages still page via the separate health.bus / FAIL_ESCALATE_THRESHOLD path, which is unaffected by this change.

Existing demotion paths (NetworkUnreachablewarn!, LoopbackUnavailabledebug!) are preserved for non-supervisor call sites.

Submission Checklist

If a section does not apply to this change, mark the item as N/A with a one-line reason. Do not delete items.

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — 6 new tests in src/core/observability.rs (English Discord shape, Chinese WSAETIMEDOUT, four additional channel names, precedence over NetworkUnreachable, rejection of generic non-supervisor restart logs, smoke test routing through report_error_or_expected).
  • Diff coverage ≥ 80% — every new line in the classifier + arm is exercised by the new tests; the report_error_or_expected smoke test covers the dispatch path end-to-end.
  • N/A: Coverage matrix updated — observability classifier change is behaviour-only inside the core::observability module; not a tracked feature row in docs/TEST-COVERAGE-MATRIX.md.
  • N/A: All affected feature IDs from the matrix are listed in the PR description under ## Related — no matrix feature IDs affected (observability internals).
  • No new external network dependencies introduced (mock backend used per Testing Strategy) — pure in-process classifier tests.
  • N/A: Manual smoke checklist updated if this touches release-cut surfaces — internal observability classifier; no user-visible UI surface, no release-cut behaviour change.
  • N/A: Linked issue closed via Closes #NNN in the ## Related section — Sentry-only fix; no GitHub issue. Sentry-Issue trailers in ## Related provide the back-reference.

Impact

  • Runtime: Desktop (Tauri shell). Per-restart breadcrumb emits at info! instead of warn!/error!. No behavioural change to the supervisor's restart loop or to health.bus escalation; only the Sentry/log severity tier changes.
  • Performance: Net positive — eliminates ~12 k Sentry events / 14d on self-hosted tauri-rust, reducing dashboard noise and event-ingestion cost.
  • Security: None — no new code paths, no PII, no auth surface.
  • Migration / compatibility: None — additive enum variant + additive classifier. Existing demotion paths preserved.
  • Observability trade-off: Sustained channel outages still surface via the separate health.bus / FAIL_ESCALATE_THRESHOLD path. The demoted breadcrumbs are still captured in the local tracing log and are available for correlation when investigating a sustained outage manually.

Related

  • Sentry-Issue: TAURI-RUST-15
  • Sentry-Issue: TAURI-RUST-BB
  • Follow-up PR(s)/TODOs: Workflow tooling — Phase 7 Step 4.5 grep regex (.claude/rules/phases/07-cleanup.md) currently matches OPENHUMAN-(TAURI|REACT|CORE)-[A-Z0-9]+ / BACKEND-ALPHAHUMAN-[A-Z0-9]+; widen to accept self-hosted TAURI-RUST-* / TAURI-REACT-* / CORE-RUST-* / BACKEND-NODEJS-* prefixes so the cleanup hook can auto-flip these IDs to resolved post-merge. Separate workflow-tooling PR.

AI Authored PR Metadata (required for Codex/Linear PRs)

Keep this section for AI-authored PRs. For human-only PRs, mark each field N/A.

Linear Issue

  • Key: N/A (Sentry-only fix; no Linear issue)
  • URL: N/A

Commit & Branch

  • Branch: fix/sentry-15-channel-supervisor-restart
  • Commit SHA: 262b874

Validation Run

  • N/A: pnpm --filter openhuman-app format:check — no frontend changes.
  • N/A: pnpm typecheck — no TypeScript changes.
  • Focused tests: cargo test --lib core::observability → 94 passed (6 new), 0 failed.
  • Rust fmt/check (if changed): cargo fmt --check clean; cargo check --manifest-path Cargo.toml clean (only pre-existing slack_backfill dead-code warnings unrelated to this change). cargo clippy --lib warning count = 168 (matches main baseline).
  • N/A: Tauri fmt/check (if changed) — no app/src-tauri/ changes.

Validation Blocked

  • command: N/A
  • error: N/A
  • impact: N/A

Behavior Changes

  • Intended behavior change: Per-restart messages from channels::runtime::supervision::spawn_supervised_listener now classify as ExpectedErrorKind::ChannelSupervisorRestart and emit at tracing::info! (breadcrumb only) instead of routing through NetworkUnreachable (warn! — still a Sentry event) or escaping classification (full error event for OS-localized inner errors).
  • User-visible effect: None — supervisor restart loop, channel reconnection behaviour, and health.bus escalation are all unchanged. Effect is observability-only (Sentry dashboard noise reduction).

Parity Contract

  • Legacy behavior preserved: Existing NetworkUnreachable and LoopbackUnavailable demotion paths are unchanged for non-supervisor call sites. The new tier intercepts only the canonical supervisor wrapper format ("Channel <name> error: <inner>; restarting") — bodies lacking the wrapper prefix continue to flow through the existing matchers exactly as before.
  • Guard/fallback/dispatch parity checks: channel_supervisor_restart_does_not_classify_unrelated_restart_notes pins the rejection contract (generic restart logs without the "Channel <name> error:" preamble must NOT classify). channel_supervisor_restart_precedence_over_network_unreachable pins that supervisor-wrap bodies that ALSO match is_network_unreachable_message route to the new tier, not the old one.

Duplicate / Superseded PR Handling

  • Duplicate PR(s): N/A
  • Canonical PR: N/A
  • Resolution (closed/superseded/updated): N/A

Summary by CodeRabbit

  • New Features

    • Recognizes transient channel-supervisor restarts as a distinct expected error classification.
  • Bug Fixes

    • These restart events are now logged as info-level breadcrumbs, preventing error alerts and avoiding misclassification as network failures.
  • Tests

    • Updated tests to verify the new classification, its precedence over other classifiers, and handling of localized/variant restart messages.

Review Change Stack

@oxoxDev oxoxDev requested a review from a team May 26, 2026 10:51
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 26, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds ExpectedErrorKind::ChannelSupervisorRestart, a predicate to detect Rust supervisor-wrapper restart messages, integrates it into expected_error_kind with higher precedence than loopback/network checks, and routes those cases through report_expected_message as info-level breadcrumbs; tests cover detection, precedence, and reporting path.

Changes

Channel Supervisor Restart Classification and Demotion

Layer / File(s) Summary
Channel supervisor restart detection contract
src/core/observability.rs
ExpectedErrorKind::ChannelSupervisorRestart variant and is_channel_supervisor_restart_message() predicate detect supervisor wrapper messages containing "channel ", " error:", and "; restarting" substrings.
Classification and demoted error routing
src/core/observability.rs
expected_error_kind classifier checks the supervisor restart pattern early and returns the new kind before loopback/network-unreachable checks; report_expected_message logs an info! breadcrumb with structured tags for this kind instead of an error-level Sentry capture.
Test coverage for supervisor restart classification
src/core/observability.rs, src/openhuman/channels/runtime/supervision.rs
Unit tests assert English and OS-localized wrapper detection, multiple channel-name coverage, precedence over NetworkUnreachable, non-classification of unrelated restart notes, and that report_error_or_expected sends the new kind through the demoted reporting path.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • tinyhumansai/openhuman#2808: Overlaps in src/core/observability.rs changes introducing new expected classifications and demotion logic.
  • tinyhumansai/openhuman#2782: Adjusts precedence and mapping for channel supervisor–wrapped failures vs network-unreachable detection.
  • tinyhumansai/openhuman#2810: Modifies ExpectedErrorKind/precedence logic in the same classifier range, potentially overlapping network-unreachable expansions.

Suggested reviewers

  • graycyrus

Poem

🐰 I sniff the log, a tiny start,
A channel hiccup, playing its part.
I leave a breadcrumb, soft and light,
Quiet the noise, restore the night,
Hops of code — all set to right.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and concisely describes the main change: adding a new classifier to demote channel supervisor restart messages in observability monitoring, with specific reference to the Sentry issue being fixed.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 26, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 26, 2026
graycyrus
graycyrus previously approved these changes May 26, 2026
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Clean observability fix — well-scoped, well-tested, additive only.

What this does: Adds ChannelSupervisorRestart classifier tier to demote noisy per-restart Sentry events (TAURI-RUST-15 ~11.4k events, TAURI-RUST-BB ~815 events) to info! breadcrumbs. Supervisor wrapper format is language-agnostic, catching both English Discord gateway bodies and OS-localized variants (Chinese WSAETIMEDOUT) that slip through the English-only is_network_unreachable_message anchors.

Area Verdict
Correctness Three-anchor predicate (starts_with("channel ") + contains(" error:") + contains("; restarting")) is tight enough to avoid false positives while catching all supervisor-wrapped bodies regardless of locale. Precedence before NetworkUnreachable/LoopbackUnavailable is correct and well-commented.
Tests 6 new tests: English body, Chinese WSAETIMEDOUT, multi-channel names, precedence pinning, negative cases (4 rejection patterns), smoke test through report_error_or_expected. Solid coverage.
Safety health.bus / FAIL_ESCALATE_THRESHOLD path for sustained outages is untouched — no observability blind spot introduced. Existing demotion paths preserved for non-supervisor call sites.
Performance Net positive — eliminates ~12k Sentry events/14d.

@oxoxDev oxoxDev removed the working A PR that is being worked on by the team. label May 28, 2026
@oxoxDev oxoxDev assigned oxoxDev and unassigned oxoxDev May 28, 2026
@oxoxDev oxoxDev dismissed stale reviews from coderabbitai[bot] and graycyrus via 098856e May 28, 2026 08:23
@coderabbitai coderabbitai Bot added the bug label May 28, 2026
coderabbitai[bot]
coderabbitai Bot previously approved these changes May 28, 2026
@oxoxDev oxoxDev force-pushed the fix/sentry-15-channel-supervisor-restart branch from 098856e to 43b04ae Compare May 28, 2026 15:34
@coderabbitai coderabbitai Bot added rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. sentry-traced-bug Bug identified via Sentry triage labels May 28, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 28, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 28, 2026
oxoxDev added 3 commits May 29, 2026 00:56
…AURI-RUST-15)

Self-hosted Sentry's #1 unresolved tauri-rust issue by event count
(`Channel discord error: error sending request for url ...; restarting`,
~11.4 k events / 14d) and its Chinese-Windows WSAETIMEDOUT variant
(TAURI-RUST-BB, ~815 events) both originate from the channel supervisor
loop in `channels::runtime::supervision::spawn_supervised_listener`. The
supervisor already restarts the listener with its own exponential
backoff; sustained outages still surface through `health.bus` /
`FAIL_ESCALATE_THRESHOLD`. Per-restart messages carry no actionable
Sentry signal.

Previous path: `expected_error_kind` matched the English Discord body
against `is_network_unreachable_message`, which demotes to `tracing::warn!`
— still a Sentry event (just at lower severity). The Chinese-Windows
variant escaped the English-only anchors entirely and emitted as a full
Sentry error.

Fix: add a new `ChannelSupervisorRestart` classifier tier anchored on
the Rust supervisor wrapper format (`"Channel <name> error: <inner>;
restarting"`) — language-agnostic so it covers OS-localized inner
errors. Precedence is checked BEFORE `is_loopback_unavailable` and
`is_network_unreachable_message` so the supervisor wrap always wins.
Demotes to `tracing::info!` (breadcrumb only — no Sentry event).

Tests cover: English Discord gateway shape, Chinese WSAETIMEDOUT
variant, four additional channel names (slack/telegram/whatsapp/
gmessages), precedence over `NetworkUnreachable`, rejection of
generic non-supervisor restart notes (`systemd: docker.service;
restarting`), and a smoke test routing the verbatim Sentry body
through `report_error_or_expected`.

Sentry-Issue: TAURI-RUST-15
Sentry-Issue: TAURI-RUST-BB
CI workflows (test, build, coverage, etc.) trigger on default
pull_request types (opened/synchronize/reopened); they did not fire
when this PR was first opened. Force a synchronize event so the full
matrix runs.
…rvisorRestart

The new `is_channel_supervisor_restart_message` classifier added in this
PR takes precedence over `is_network_unreachable_message` in
`expected_error_kind`. The pre-existing supervision test
`supervision_discord_gateway_reqwest_failure_classifies_as_expected`
asserted `NetworkUnreachable` — update it to assert
`ChannelSupervisorRestart`, matching the new precedence + the broader
language-agnostic anchor introduced for TAURI-RUST-15/-BB.

Sentry-Issue: TAURI-RUST-15
@oxoxDev oxoxDev force-pushed the fix/sentry-15-channel-supervisor-restart branch from 34bd215 to 8fc2116 Compare May 28, 2026 19:26
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

coderabbitai[bot]
coderabbitai Bot previously approved these changes May 28, 2026
…SupervisorRestart

After rebase onto current `upstream/main`, the existing test
`channel_supervisor_operation_timed_out_classifies_as_expected` (added
in a sibling PR before this rebase landed) now hits the new
`ChannelSupervisorRestart` precedence path instead of
`NetworkUnreachable`. The new classifier is the broader anchor — it
covers every ETIMEDOUT / WSAETIMEDOUT / hyper-prose supervisor-wrap
shape the old test pinned, plus OS-localized variants the English-only
`NetworkUnreachable` would have missed.

Update the assertion + comment to reflect the new precedence and tier
difference (`ChannelSupervisorRestart` demotes to `info!`, vs `warn!`
for `NetworkUnreachable`).

Sentry-Issue: TAURI-RUST-15
@M3gA-Mind
Copy link
Copy Markdown
Contributor

The CI failure here is a stale-base issue — classifies_embedding_api_invalid_token_401_as_session_expired was added to main via #2869 after this branch was cut.

Opened #2879 with the 3 functional commits rebased on current main (empty CI-trigger commit dropped). All 129 observability tests pass including the new one. Recommending we close this in favour of #2879.

@M3gA-Mind
Copy link
Copy Markdown
Contributor

Closing in favour of #2879 (rebased on current main, stale-base CI failure resolved).

@M3gA-Mind M3gA-Mind closed this May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. sentry-traced-bug Bug identified via Sentry triage

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants