Skip to content

fix(observability): demote expected-error Sentry buckets across embeddings, provider, memory-store, FS, and thinking-mode wire shapes#2830

Merged
oxoxDev merged 5 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/observability-expected-errors-wave5
May 28, 2026
Merged

fix(observability): demote expected-error Sentry buckets across embeddings, provider, memory-store, FS, and thinking-mode wire shapes#2830
oxoxDev merged 5 commits into
tinyhumansai:mainfrom
YellowSnnowmann:fix/observability-expected-errors-wave5

Conversation

@YellowSnnowmann
Copy link
Copy Markdown
Contributor

@YellowSnnowmann YellowSnnowmann commented May 28, 2026

Summary

  • Extend the expected_error_kind classifier in src/core/observability.rs with three new variants (MemoryStoreBreakerOpen, DiskFull) and broader matchers (_api_key is not configured, ollama 401/501 embed, embedding-backend 401 invalid-token, legacy no-paren api error <code> transient shape) so deterministic user-state and self-healing transient conditions stop firing as Sentry error events.
  • Add two phrases to src/openhuman/inference/provider/config_rejection.rs::is_provider_config_rejection_message: "requires a subscription, upgrade for access" (Ollama Cloud paywall) and "thinking mode must be passed back" (DeepSeek-R1 / Moonshot-K2 thinking-mode protocol gap).
  • Bug fix in src/openhuman/memory_store/factories.rs: the ollama-health-gate emit site was calling report_error_message directly, bypassing the classifier. Switched to report_error_or_expected so the existing GX matcher demotes the fallback log.
  • 14+ Sentry buckets stop generating events (~22k events / 14 d at the time of writing). Behavior is unchanged for every code path — only tracing::error!tracing::warn!/info! for these conditions.

Problem

The Sentry inbox is dominated by deterministic, non-actionable signals: user-configuration rejections (missing API keys, paywalled models, unsupported model capabilities), transient upstream HTTP (502/504 from older release wire shapes that the existing matcher missed), self-healing internal states (memory-store circuit breaker, disk full), and provider-contract protocol gaps (thinking-mode reasoning_content round-trip). Each one fires an error event per affected turn, hides real regressions in the noise, and has no remediation Sentry can act on — the UI already surfaces the actionable error to the user where one exists.

These are the same class of demotion this codebase already applies through expected_error_kind (precedent: PR #2692 demoted backend Cloudflare anti-bot wrap, and the existing Wave 1-4 matchers). The patterns are stable, the volume is high, and every new wave continues to follow the same shape.

Solution

  • src/core/observability.rs

    • New ExpectedErrorKind::MemoryStoreBreakerOpen — anchored on [memory_tree] + circuit breaker open substrings. The breaker is the recovery mechanism, not the failure; demoted to warn!. Drops TAURI-RUST-52X.
    • New ExpectedErrorKind::DiskFull — anchored on no space left on device (POSIX errno 28) and not enough space on the disk (Windows ERROR_DISK_FULL / 39 / 112). Demoted to warn!. Drops TAURI-RUST-H4.
    • ApiKeyMissing arm extended with _api_key is not configured to catch <PROVIDER>_API_KEY is not configured env-var-style rejections from the OpenHuman backend embedding endpoint. Drops TAURI-RUST-2H5 / TAURI-RUST-2.
    • New is_embedding_backend_auth_failure(lower) — anchored on embedding api error + 401 + invalid token, routed to BackendUserError (deliberately not SessionExpired to avoid racing the chat-side session-teardown publication in core::jsonrpc::is_session_expired_error). Drops TAURI-RUST-T.
    • is_ollama_user_config_rejection extended:
      • ollama embed failed + this model does not support embeddings (TAURI-RUST-3X, 501).
      • ollama embed failed + status 401 (TAURI-RUST-3E, 401 unauthorized).
    • is_transient_upstream_http_message extended with the legacy no-paren wire shape api error <code> (with trailing space) emitted by older embeddings::openai / embeddings::cohere formats. Drops TAURI-RUST-H (504) and TAURI-RUST-2T (502).
  • src/openhuman/inference/provider/config_rejection.rs

    • Added "requires a subscription, upgrade for access" — Ollama Cloud paywall body from compatible::OpenAiCompatibleProvider with name = "ollama". Drops TAURI-RUST-4XK.
    • Added "thinking mode must be passed back" — DeepSeek-R1 / Moonshot K2-thinking 400 when reasoning_content isn't echoed back. Anchored without backticks so the match survives upstream-side quote-style variance. Drops TAURI-RUST-2G / TAURI-RUST-2F. (Note: the underlying contract gap is on our side — the inference layer should round-trip reasoning_content — but until that lands, the Sentry-side noise is removed.)
  • src/openhuman/memory_store/factories.rs

    • report_ollama_health_gate_once switched from report_error_message to report_error_or_expected so the existing GX arm of is_ollama_user_config_rejection demotes the fallback log. Wire shape stays byte-identical. Drops TAURI-RUST-B.
  • Doc-only updates for TAURI-RUST-K, TAURI-RUST-8K, TAURI-RUST-CE, TAURI-RUST-9X — these wire shapes are already classified by existing matchers (MA/KM and /openai/v1/models); the residual Sentry events are from release 0.54.0 which predates the matchers and will tail off as users upgrade. Self-hosted Sentry shortIds added to the relevant doc-comments for future grep.

Tradeoffs:

  • All matchers are anchored on multiple co-occurring substrings (e.g. [memory_tree] + circuit breaker open, ollama embed failed + status 401, embedding api error + 401 + invalid token) so unrelated prose mentioning a single substring is not silenced. Each new test block includes a negative-assertion fixture for this.
  • The DiskFull / MemoryStoreBreakerOpen variants demote at warn! (not info!) so a sustained spike still shows up in operator dashboards; only the per-event Sentry capture is suppressed.
  • For TAURI-RUST-T we deliberately routed the embedding 401 to BackendUserError rather than SessionExpired. The chat-side 401 is authoritative for session teardown; bridging the embedding 401 to the same path could race with the chat-side DomainEvent::SessionExpired publication and clear a still-valid session under intermittent backend mis-flagging.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — new positive + negative fixtures in core::observability::tests (classifies_memory_store_breaker_open, does_not_classify_unrelated_breaker_messages, classifies_backend_env_api_key_not_configured, does_not_classify_unrelated_is_not_configured_messages, classifies_disk_full_errors, does_not_classify_unrelated_space_messages, classifies_embedding_backend_auth_failure, does_not_classify_unrelated_invalid_token_messages, does_not_classify_unrelated_digit_runs_as_transient); extended existing classifies_ollama_user_config_rejections and classifies_transient_upstream_http_errors; extended config_rejection::tests::detects_wave4_sentry_bodies with TAURI-RUST-4XK / -2G / -2F bodies.
  • Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by .github/workflows/coverage.yml. Run pnpm test:coverage and pnpm test:rust locally; PRs below 80% on changed lines will not merge.
  • Coverage matrix updated — N/A: observability-only change, no new feature rows. Classifier coverage stays inside the same core::observability bucket already tracked.
  • All affected feature IDs from the matrix are listed in the PR description under ## RelatedN/A: no feature IDs touched.
  • No new external network dependencies introduced (mock backend used per Testing Strategy)
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: log-level demotion only; no user-visible UI surface changes.
  • Linked issue closed via Closes #NNN in the ## Related section

Impact

  • Runtime/platform: desktop (macOS, Windows, Linux). No mobile / web / CLI behavior change. iOS client unaffected.
  • Performance: classifier runs a fixed number of String::contains calls per error path; the additions amortize to a few extra μs per failed turn, deep inside an already-failing path. No hot-path impact.
  • Security: none. Only tracing level for matched messages changes (Error → Warn / Info). No secrets or PII handling change. The factories.rs switch from report_error_message to report_error_or_expected preserves the same tag set (ollama_host, fallback) and the same wire body.
  • Migration: none. No schema, no config, no IPC contract change.
  • Compatibility: Sentry events from older releases (0.54.0, 0.56.0) keep flowing from already-installed clients until those clients update — events tail off naturally with release adoption, not with this PR landing. The matchers are pure additions; existing classifications stay intact (regression-guarded by the negative-assertion tests).

Related

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Improved detection of infrastructure-level failures such as disk-full and memory circuit breaker conditions.
    • Enhanced error classification for service authentication and configuration issues.
    • Better identification of configuration rejection patterns across multiple providers.
    • Improved recognition of legacy service error message formats.

Review Change Stack

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c3897807-de47-4368-86ad-9f1557e0ff7f

📥 Commits

Reviewing files that changed from the base of the PR and between 3329ef6 and 72f7944.

📒 Files selected for processing (1)
  • src/core/observability.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/core/observability.rs

📝 Walkthrough

Walkthrough

This PR expands error classification and observability reporting across three files to recognize four new expected/suppressed error conditions: disk-full and memory-store circuit-breaker infrastructure failures, embedding backend authentication failures, additional Ollama user-rejection shapes, and legacy transient-upstream HTTP formats. Memory store observability is routed through the updated classification pipeline.

Changes

Error Kind Classification and Expected Reporting

Layer / File(s) Summary
Infrastructure failure classification (disk full, circuit breaker)
src/core/observability.rs
ExpectedErrorKind enum gains DiskFull and MemoryStoreBreakerOpen variants. expected_error_kind detects their canonical message substrings via new helper predicates. report_expected_message emits warn-level breadcrumbs for these conditions with Sentry-suppression semantics. Tests verify detection accuracy and prevent false positives on generic "space" and "circuit breaker" text.
API-key and embedding-backend auth classification
src/core/observability.rs
expected_error_kind broadened to classify _api_key is not configured shapes as ApiKeyMissing, and adds an early classification for embedding-backend auth failures (invalid-token shapes) mapping to BackendUserError. Tests include positive and negative cases.
Ollama user-config rejection & transient-upstream legacy support
src/core/observability.rs
is_ollama_user_config_rejection extended to detect unsupported-embeddings (501) and unauthorized embed auth (401). is_transient_upstream_http_message added legacy api error {code} form. Tests extend TAURI-RUST fixtures and legacy transient-wire regression coverage.
Provider config rejection phrase expansion
src/openhuman/inference/provider/config_rejection.rs
is_provider_config_rejection_message PHRASES list extended with thinking-mode reasoning_content round-trip requirement and an Ollama Cloud subscription-gate denial. Tests add TAURI-RUST Sentry-body fixtures verifying classification.

Memory Store Health Gate Observability Routing

Layer / File(s) Summary
Ollama health-gate observability rerouting
src/openhuman/memory_store/factories.rs
report_ollama_health_gate_once replaces a direct report_error_message call with report_error_or_expected, passing the fallback text through the expected/error classifier so the message is breadcrumbed (expected) or escalated (error) appropriately.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~22 minutes

Possibly related PRs

Suggested reviewers

  • graycyrus
  • senamakel
  • M3gA-Mind

Poem

🐰 I hopped through logs with careful cheer,
Quieted warnings we once would fear,
Disk and breaker now whispered, not cried,
Auth and Ollama shapes gently applied,
Breadcrumbs guide us home, soft and clear.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and specifically summarizes the main change: demotion of expected-error Sentry buckets across multiple subsystems (embeddings, provider, memory-store, FS, and thinking-mode).
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@YellowSnnowmann YellowSnnowmann marked this pull request as ready for review May 28, 2026 08:39
@YellowSnnowmann YellowSnnowmann requested a review from a team May 28, 2026 08:39
@coderabbitai coderabbitai Bot added the working A PR that is being worked on by the team. label May 28, 2026
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/core/observability.rs`:
- Around line 1012-1033: The warn! invocations inside
ExpectedErrorKind::DiskFull and ExpectedErrorKind::MemoryStoreBreakerOpen are
breadcrumbing raw local filesystem paths via the %message field; change them to
avoid including the raw message text (either log a redacted version via a helper
like redact_local_paths(message) or omit %message entirely and only emit
structured metadata such as domain, operation, and kind), matching the approach
used in LoopbackUnavailable; update the tracing::warn! calls for DiskFull and
MemoryStoreBreakerOpen to remove %message from the breadcrumb payload and
instead include only redacted_message or no message plus the existing structured
fields.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: c566bfab-2c4e-46b4-9ccc-cd4681b5fd50

📥 Commits

Reviewing files that changed from the base of the PR and between 145e768 and 3329ef6.

📒 Files selected for processing (3)
  • src/core/observability.rs
  • src/openhuman/inference/provider/config_rejection.rs
  • src/openhuman/memory_store/factories.rs

Comment thread src/core/observability.rs
ozpool
ozpool previously approved these changes May 28, 2026
@oxoxDev oxoxDev dismissed ozpool’s stale review May 28, 2026 10:47

Wrong account (ozpool) — re-posting under oxoxDev

Copy link
Copy Markdown
Contributor

@oxoxDev oxoxDev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Author: @YellowSnnowmann (MEMBER — core team)

Wave-5 observability hardening — substantial, well-tested, well-documented. ~22k events / 14d silenced across 14+ Sentry buckets. Path-redaction fix in 72f7944 correctly removes %message from DiskFull/MemoryStoreBreakerOpen warn breadcrumbs per coderabbit's resolved finding.

Verified

  • is_memory_store_breaker_open anchors on [memory_tree] AND circuit breaker open — strong polarity; negative tests pin provider reliability: circuit breaker open for openai + [memory_tree] failed to run schema DDL: disk full as NOT classifying ✓
  • is_disk_full_message anchors on platform-stable errno renderings (POSIX 28 + Windows 39/112); negative test pins generic "space" prose ✓
  • is_embedding_backend_auth_failure requires embedding api error + 401 + invalid token — all 3 substrings; negative tests pin chat-side 401s + 500-level embedding errors ✓
  • Legacy no-paren api error <code> uses trailing-space anchor; negative test pins api error 5042 non-transient digit run ✓
  • _api_key is not configured env-var-style anchor prevents prose "is not configured" matches ✓
  • factories.rs switch to report_error_or_expected is the correct entry point; preserves tag set + wire body ✓
  • 2 new dispatcher arms placed LAST in expected_error_kind — earlier specific matchers always win ✓
  • Comprehensive positive (real Sentry wire shape) + negative (rejection contract) fixtures matching Wave 1–4 conventions ✓

Two minor nits (inline) — neither blocks merge

Posting inline comments on the in-flight-PR coordination + cross-PR merge collision concerns. See inline replies.

APPROVE.

// `thinking mode must be passed back` substring so the match
// doesn't depend on the upstream's backtick-quoting around
// `reasoning_content` (some provider versions ship without them).
"thinking mode must be passed back",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (non-blocking) — link the in-flight real fix so this anchor doesn't get orphaned.

Three PRs in flight try to actually fix this round-trip — #2818 (graycyrus: adds ChatResponse.reasoning_content field, threads through stack), #2817 (CodeGhost21: tool-call-turn variant), #2806 (staimoorulhassan: JSON-blob-in-text hack, likely closing in favour of #2818).

Per feedback_deferred_classifier_arm_trap.md — silencing the Sentry signal while the actual fix is deferred risks the fix never landing (inbox stops yelling). Suggest extending the doc-comment block above with an explicit pointer:

// ⚠️ Real fix is in flight at #2818 (preserve reasoning_content in
// multi-turn thinking model conversations). This matcher silences the
// Sentry noise in the interim; REMOVE this anchor once #2818 lands and
// the post-fix Sentry signal confirms the round-trip works.
"thinking mode must be passed back",

Alternatively open a tracking issue with Closes #NNN queued for the post-#2818 cleanup so deletion of this anchor is a forced step.

Comment thread src/core/observability.rs
if is_backend_user_error_message(&lower) {
return Some(ExpectedErrorKind::BackendUserError);
}
if is_embedding_backend_auth_failure(&lower) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (non-blocking) — surface the SessionExpired rationale at the dispatcher arm.

The reasoning for routing 401-invalid-token to BackendUserError instead of SessionExpired (avoid racing core::jsonrpc::is_session_expired_error's chat-side teardown publication, since the chat-side 401 is authoritative) is documented only in the PR body. Easy to "fix" to SessionExpired in a future cleanup. Worth a one-line inline comment at the dispatcher arm so the rationale travels with the code:

// Deliberately not SessionExpired — chat-side 401 is authoritative for
// session teardown; bridging here would race the
// DomainEvent::SessionExpired publication and clear a still-valid session
// under intermittent backend mis-flagging.
if is_embedding_backend_auth_failure(&lower) {
    return Some(ExpectedErrorKind::BackendUserError);
}

Also worth flagging the cross-PR merge collision: #2810 / #2808 / #2796 / #2809 / #2795 / #2807 all extend expected_error_kind. Trivial line-level conflicts but coordinate landing order — this PR has the broadest scope and would benefit landing last after the smaller ones are absorbed.

@oxoxDev oxoxDev merged commit d578b57 into tinyhumansai:main May 28, 2026
37 of 38 checks passed
graycyrus added a commit to graycyrus/openhuman that referenced this pull request May 28, 2026
…variants alongside MemoryStorePiiRejection

Resolves conflict between our MemoryStorePiiRejection addition (TAURI-RUST-54T)
and upstream PR tinyhumansai#2830's MemoryStoreBreakerOpen + DiskFull additions. All five
variants, their predicates, match arms, and tests are preserved.
sanil-23 added a commit to sanil-23/openhuman that referenced this pull request May 28, 2026
…adiction on Embedding 401 + Invalid token

PRs tinyhumansai#2830 and tinyhumansai#2786 both shipped on main and made contradictory assertions
for the SAME wire shape:

    Embedding API error (401 Unauthorized): {"success":false,"error":"Invalid token"}

- tinyhumansai#2830 added `is_embedding_backend_auth_failure` and a test asserting
  `BackendUserError`.
- tinyhumansai#2786 added `classifies_embedding_api_invalid_token_401_as_session_expired`
  asserting `SessionExpired`.

The tinyhumansai#2830 arm runs first in `expected_error_kind`, so the tinyhumansai#2786 test fails
in CI on every PR that rebases onto current main (verified on
upstream/main @ e83bfd6).

Per the doc evidence and breadcrumb context (`[scheduler_gate] signed_out
false -> true` immediately preceding the 401), the SessionExpired routing
is the correct one — the OpenHuman backend envelope `{"success":false,
"error":"Invalid token"}` is the JWT-invalidity branch of the same
session-renewal flow as TAURI-RUST-4P0.

Disable `is_embedding_backend_auth_failure` (keep the function as a doc
breadcrumb so the regression is traceable) and remove the contradicting
`classifies_embedding_backend_auth_failure` test. The SessionExpired arm
in `is_session_expired_message` (added by tinyhumansai#2786) now catches the wire
shape correctly. BYO-key embedding 401s (no OpenHuman envelope) still
escalate to Sentry — guarded by
`does_not_classify_embedding_byo_key_401_as_session_expired`.

Local tests: cargo test core::observability::tests → 117/117 pass.
Local repro: `classifies_embedding_api_invalid_token_401_as_session_expired`
panicked on pure upstream/main before this commit; passes after.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
sanil-23 added a commit to sanil-23/openhuman that referenced this pull request May 28, 2026
…sionExpired (TAURI-RUST-4K5 regression)

PR tinyhumansai#2786 (commit 14ff92b) added an `is_embedding_backend_auth_failure`
matcher that was supposed to classify the TAURI-RUST-4K5 wire shape
(`Embedding API error (401 Unauthorized): {"error":"Invalid token"}`)
as `SessionExpired` — to align with the 4P0 OpenHuman-backend variant
and surface a re-login prompt instead of a Sentry noise bucket.

A subsequent merge (tinyhumansai#2830 commit d578b57, 'demote expected-error
Sentry buckets across embeddings, provider, memory-store, FS, and
thinking-mode wire shapes') landed on a pre-tinyhumansai#2786 base and clobbered
the SessionExpired return back to BackendUserError. The
`classifies_embedding_api_invalid_token_401_as_session_expired` test
that tinyhumansai#2786 shipped therefore fails on every PR rebased onto current
`upstream/main`:

  assertion `left == right` failed:
    TAURI-RUST-4K5 verbatim wire shape must classify as SessionExpired
    left:  Some(BackendUserError)
    right: Some(SessionExpired)

Restore the intended return value. One-line fix to the
`is_embedding_backend_auth_failure` arm in `expected_error_kind`.
All observability tests pass.

Co-Authored-By: Claude <noreply@anthropic.com>
@sanil-23
Copy link
Copy Markdown
Contributor

Posted #2869 with the root-cause fix: #2830's merge inadvertently reverted #2786's SessionExpired classification back to BackendUserError, so the classifies_embedding_api_invalid_token_401_as_session_expired test added by #2786 has been failing on every PR rebased onto post-#2830 main. One-line classifier fix + reconciled the stale companion test.

CodeGhost21 added a commit to CodeGhost21/openhuman that referenced this pull request May 28, 2026
… so embedding 401 'Invalid token' classifies as SessionExpired (TAURI-RUST-4K5)

The merge surfaced a pre-existing main contradiction (tinyhumansai#2786 vs tinyhumansai#2830): the
embedding 401 "Invalid token" envelope was shadowed by the broader
is_embedding_backend_auth_failure matcher (BackendUserError) before reaching
is_session_expired_message. Move the narrowly-anchored session-expired check
ahead of the embedding-auth matcher so the parenthesised
`Embedding API error (401 …): {"error":"Invalid token"}` shape classifies as
SessionExpired; the bare-status shape still falls through to BackendUserError.
Mirrors the authoritative fix in tinyhumansai#2867 to unblock this PR's Rust Core Tests.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

working A PR that is being worked on by the team.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants