fix(observability): demote expected-error Sentry buckets across embeddings, provider, memory-store, FS, and thinking-mode wire shapes#2830
Conversation
…memory store and disk issues
…d reasoning content requirements
…classification for ollama health checks
…n error handling tests
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR expands error classification and observability reporting across three files to recognize four new expected/suppressed error conditions: disk-full and memory-store circuit-breaker infrastructure failures, embedding backend authentication failures, additional Ollama user-rejection shapes, and legacy transient-upstream HTTP formats. Memory store observability is routed through the updated classification pipeline. ChangesError Kind Classification and Expected Reporting
Memory Store Health Gate Observability Routing
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~22 minutes Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/core/observability.rs`:
- Around line 1012-1033: The warn! invocations inside
ExpectedErrorKind::DiskFull and ExpectedErrorKind::MemoryStoreBreakerOpen are
breadcrumbing raw local filesystem paths via the %message field; change them to
avoid including the raw message text (either log a redacted version via a helper
like redact_local_paths(message) or omit %message entirely and only emit
structured metadata such as domain, operation, and kind), matching the approach
used in LoopbackUnavailable; update the tracing::warn! calls for DiskFull and
MemoryStoreBreakerOpen to remove %message from the breadcrumb payload and
instead include only redacted_message or no message plus the existing structured
fields.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: c566bfab-2c4e-46b4-9ccc-cd4681b5fd50
📒 Files selected for processing (3)
src/core/observability.rssrc/openhuman/inference/provider/config_rejection.rssrc/openhuman/memory_store/factories.rs
…redundant error details
Wrong account (ozpool) — re-posting under oxoxDev
oxoxDev
left a comment
There was a problem hiding this comment.
Author: @YellowSnnowmann (
MEMBER— core team)
Wave-5 observability hardening — substantial, well-tested, well-documented. ~22k events / 14d silenced across 14+ Sentry buckets. Path-redaction fix in 72f7944 correctly removes %message from DiskFull/MemoryStoreBreakerOpen warn breadcrumbs per coderabbit's resolved finding.
Verified
is_memory_store_breaker_openanchors on[memory_tree]ANDcircuit breaker open— strong polarity; negative tests pinprovider reliability: circuit breaker open for openai+[memory_tree] failed to run schema DDL: disk fullas NOT classifying ✓is_disk_full_messageanchors on platform-stable errno renderings (POSIX 28 + Windows 39/112); negative test pins generic "space" prose ✓is_embedding_backend_auth_failurerequiresembedding api error+401+invalid token— all 3 substrings; negative tests pin chat-side 401s + 500-level embedding errors ✓- Legacy no-paren
api error <code>uses trailing-space anchor; negative test pinsapi error 5042non-transient digit run ✓ _api_key is not configuredenv-var-style anchor prevents prose "is not configured" matches ✓factories.rsswitch toreport_error_or_expectedis the correct entry point; preserves tag set + wire body ✓- 2 new dispatcher arms placed LAST in
expected_error_kind— earlier specific matchers always win ✓ - Comprehensive positive (real Sentry wire shape) + negative (rejection contract) fixtures matching Wave 1–4 conventions ✓
Two minor nits (inline) — neither blocks merge
Posting inline comments on the in-flight-PR coordination + cross-PR merge collision concerns. See inline replies.
APPROVE.
| // `thinking mode must be passed back` substring so the match | ||
| // doesn't depend on the upstream's backtick-quoting around | ||
| // `reasoning_content` (some provider versions ship without them). | ||
| "thinking mode must be passed back", |
There was a problem hiding this comment.
Nit (non-blocking) — link the in-flight real fix so this anchor doesn't get orphaned.
Three PRs in flight try to actually fix this round-trip — #2818 (graycyrus: adds ChatResponse.reasoning_content field, threads through stack), #2817 (CodeGhost21: tool-call-turn variant), #2806 (staimoorulhassan: JSON-blob-in-text hack, likely closing in favour of #2818).
Per feedback_deferred_classifier_arm_trap.md — silencing the Sentry signal while the actual fix is deferred risks the fix never landing (inbox stops yelling). Suggest extending the doc-comment block above with an explicit pointer:
// ⚠️ Real fix is in flight at #2818 (preserve reasoning_content in
// multi-turn thinking model conversations). This matcher silences the
// Sentry noise in the interim; REMOVE this anchor once #2818 lands and
// the post-fix Sentry signal confirms the round-trip works.
"thinking mode must be passed back",Alternatively open a tracking issue with Closes #NNN queued for the post-#2818 cleanup so deletion of this anchor is a forced step.
| if is_backend_user_error_message(&lower) { | ||
| return Some(ExpectedErrorKind::BackendUserError); | ||
| } | ||
| if is_embedding_backend_auth_failure(&lower) { |
There was a problem hiding this comment.
Nit (non-blocking) — surface the SessionExpired rationale at the dispatcher arm.
The reasoning for routing 401-invalid-token to BackendUserError instead of SessionExpired (avoid racing core::jsonrpc::is_session_expired_error's chat-side teardown publication, since the chat-side 401 is authoritative) is documented only in the PR body. Easy to "fix" to SessionExpired in a future cleanup. Worth a one-line inline comment at the dispatcher arm so the rationale travels with the code:
// Deliberately not SessionExpired — chat-side 401 is authoritative for
// session teardown; bridging here would race the
// DomainEvent::SessionExpired publication and clear a still-valid session
// under intermittent backend mis-flagging.
if is_embedding_backend_auth_failure(&lower) {
return Some(ExpectedErrorKind::BackendUserError);
}Also worth flagging the cross-PR merge collision: #2810 / #2808 / #2796 / #2809 / #2795 / #2807 all extend expected_error_kind. Trivial line-level conflicts but coordinate landing order — this PR has the broadest scope and would benefit landing last after the smaller ones are absorbed.
…variants alongside MemoryStorePiiRejection Resolves conflict between our MemoryStorePiiRejection addition (TAURI-RUST-54T) and upstream PR tinyhumansai#2830's MemoryStoreBreakerOpen + DiskFull additions. All five variants, their predicates, match arms, and tests are preserved.
…adiction on Embedding 401 + Invalid token PRs tinyhumansai#2830 and tinyhumansai#2786 both shipped on main and made contradictory assertions for the SAME wire shape: Embedding API error (401 Unauthorized): {"success":false,"error":"Invalid token"} - tinyhumansai#2830 added `is_embedding_backend_auth_failure` and a test asserting `BackendUserError`. - tinyhumansai#2786 added `classifies_embedding_api_invalid_token_401_as_session_expired` asserting `SessionExpired`. The tinyhumansai#2830 arm runs first in `expected_error_kind`, so the tinyhumansai#2786 test fails in CI on every PR that rebases onto current main (verified on upstream/main @ e83bfd6). Per the doc evidence and breadcrumb context (`[scheduler_gate] signed_out false -> true` immediately preceding the 401), the SessionExpired routing is the correct one — the OpenHuman backend envelope `{"success":false, "error":"Invalid token"}` is the JWT-invalidity branch of the same session-renewal flow as TAURI-RUST-4P0. Disable `is_embedding_backend_auth_failure` (keep the function as a doc breadcrumb so the regression is traceable) and remove the contradicting `classifies_embedding_backend_auth_failure` test. The SessionExpired arm in `is_session_expired_message` (added by tinyhumansai#2786) now catches the wire shape correctly. BYO-key embedding 401s (no OpenHuman envelope) still escalate to Sentry — guarded by `does_not_classify_embedding_byo_key_401_as_session_expired`. Local tests: cargo test core::observability::tests → 117/117 pass. Local repro: `classifies_embedding_api_invalid_token_401_as_session_expired` panicked on pure upstream/main before this commit; passes after. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…sionExpired (TAURI-RUST-4K5 regression) PR tinyhumansai#2786 (commit 14ff92b) added an `is_embedding_backend_auth_failure` matcher that was supposed to classify the TAURI-RUST-4K5 wire shape (`Embedding API error (401 Unauthorized): {"error":"Invalid token"}`) as `SessionExpired` — to align with the 4P0 OpenHuman-backend variant and surface a re-login prompt instead of a Sentry noise bucket. A subsequent merge (tinyhumansai#2830 commit d578b57, 'demote expected-error Sentry buckets across embeddings, provider, memory-store, FS, and thinking-mode wire shapes') landed on a pre-tinyhumansai#2786 base and clobbered the SessionExpired return back to BackendUserError. The `classifies_embedding_api_invalid_token_401_as_session_expired` test that tinyhumansai#2786 shipped therefore fails on every PR rebased onto current `upstream/main`: assertion `left == right` failed: TAURI-RUST-4K5 verbatim wire shape must classify as SessionExpired left: Some(BackendUserError) right: Some(SessionExpired) Restore the intended return value. One-line fix to the `is_embedding_backend_auth_failure` arm in `expected_error_kind`. All observability tests pass. Co-Authored-By: Claude <noreply@anthropic.com>
|
Posted #2869 with the root-cause fix: #2830's merge inadvertently reverted #2786's SessionExpired classification back to BackendUserError, so the |
… so embedding 401 'Invalid token' classifies as SessionExpired (TAURI-RUST-4K5) The merge surfaced a pre-existing main contradiction (tinyhumansai#2786 vs tinyhumansai#2830): the embedding 401 "Invalid token" envelope was shadowed by the broader is_embedding_backend_auth_failure matcher (BackendUserError) before reaching is_session_expired_message. Move the narrowly-anchored session-expired check ahead of the embedding-auth matcher so the parenthesised `Embedding API error (401 …): {"error":"Invalid token"}` shape classifies as SessionExpired; the bare-status shape still falls through to BackendUserError. Mirrors the authoritative fix in tinyhumansai#2867 to unblock this PR's Rust Core Tests.
Summary
expected_error_kindclassifier insrc/core/observability.rswith three new variants (MemoryStoreBreakerOpen,DiskFull) and broader matchers (_api_key is not configured, ollama 401/501 embed, embedding-backend 401 invalid-token, legacy no-parenapi error <code>transient shape) so deterministic user-state and self-healing transient conditions stop firing as Sentry error events.src/openhuman/inference/provider/config_rejection.rs::is_provider_config_rejection_message:"requires a subscription, upgrade for access"(Ollama Cloud paywall) and"thinking mode must be passed back"(DeepSeek-R1 / Moonshot-K2 thinking-mode protocol gap).src/openhuman/memory_store/factories.rs: the ollama-health-gate emit site was callingreport_error_messagedirectly, bypassing the classifier. Switched toreport_error_or_expectedso the existing GX matcher demotes the fallback log.tracing::error!→tracing::warn!/info!for these conditions.Problem
The Sentry inbox is dominated by deterministic, non-actionable signals: user-configuration rejections (missing API keys, paywalled models, unsupported model capabilities), transient upstream HTTP (502/504 from older release wire shapes that the existing matcher missed), self-healing internal states (memory-store circuit breaker, disk full), and provider-contract protocol gaps (thinking-mode
reasoning_contentround-trip). Each one fires an error event per affected turn, hides real regressions in the noise, and has no remediation Sentry can act on — the UI already surfaces the actionable error to the user where one exists.These are the same class of demotion this codebase already applies through
expected_error_kind(precedent: PR #2692 demoted backend Cloudflare anti-bot wrap, and the existing Wave 1-4 matchers). The patterns are stable, the volume is high, and every new wave continues to follow the same shape.Solution
src/core/observability.rsExpectedErrorKind::MemoryStoreBreakerOpen— anchored on[memory_tree]+circuit breaker opensubstrings. The breaker is the recovery mechanism, not the failure; demoted towarn!. Drops TAURI-RUST-52X.ExpectedErrorKind::DiskFull— anchored onno space left on device(POSIX errno 28) andnot enough space on the disk(Windows ERROR_DISK_FULL / 39 / 112). Demoted towarn!. Drops TAURI-RUST-H4.ApiKeyMissingarm extended with_api_key is not configuredto catch<PROVIDER>_API_KEY is not configuredenv-var-style rejections from the OpenHuman backend embedding endpoint. Drops TAURI-RUST-2H5 / TAURI-RUST-2.is_embedding_backend_auth_failure(lower)— anchored onembedding api error+401+invalid token, routed toBackendUserError(deliberately notSessionExpiredto avoid racing the chat-side session-teardown publication incore::jsonrpc::is_session_expired_error). Drops TAURI-RUST-T.is_ollama_user_config_rejectionextended:ollama embed failed+this model does not support embeddings(TAURI-RUST-3X, 501).ollama embed failed+status 401(TAURI-RUST-3E, 401 unauthorized).is_transient_upstream_http_messageextended with the legacy no-paren wire shapeapi error <code>(with trailing space) emitted by olderembeddings::openai/embeddings::cohereformats. Drops TAURI-RUST-H (504) and TAURI-RUST-2T (502).src/openhuman/inference/provider/config_rejection.rs"requires a subscription, upgrade for access"— Ollama Cloud paywall body fromcompatible::OpenAiCompatibleProviderwithname = "ollama". Drops TAURI-RUST-4XK."thinking mode must be passed back"— DeepSeek-R1 / Moonshot K2-thinking 400 whenreasoning_contentisn't echoed back. Anchored without backticks so the match survives upstream-side quote-style variance. Drops TAURI-RUST-2G / TAURI-RUST-2F. (Note: the underlying contract gap is on our side — the inference layer should round-tripreasoning_content— but until that lands, the Sentry-side noise is removed.)src/openhuman/memory_store/factories.rsreport_ollama_health_gate_onceswitched fromreport_error_messagetoreport_error_or_expectedso the existing GX arm ofis_ollama_user_config_rejectiondemotes the fallback log. Wire shape stays byte-identical. Drops TAURI-RUST-B.Doc-only updates for TAURI-RUST-K, TAURI-RUST-8K, TAURI-RUST-CE, TAURI-RUST-9X — these wire shapes are already classified by existing matchers (MA/KM and
/openai/v1/models); the residual Sentry events are from release0.54.0which predates the matchers and will tail off as users upgrade. Self-hosted Sentry shortIds added to the relevant doc-comments for future grep.Tradeoffs:
[memory_tree]+circuit breaker open,ollama embed failed+status 401,embedding api error+401+invalid token) so unrelated prose mentioning a single substring is not silenced. Each new test block includes a negative-assertion fixture for this.DiskFull/MemoryStoreBreakerOpenvariants demote atwarn!(notinfo!) so a sustained spike still shows up in operator dashboards; only the per-event Sentry capture is suppressed.BackendUserErrorrather thanSessionExpired. The chat-side 401 is authoritative for session teardown; bridging the embedding 401 to the same path could race with the chat-sideDomainEvent::SessionExpiredpublication and clear a still-valid session under intermittent backend mis-flagging.Submission Checklist
core::observability::tests(classifies_memory_store_breaker_open,does_not_classify_unrelated_breaker_messages,classifies_backend_env_api_key_not_configured,does_not_classify_unrelated_is_not_configured_messages,classifies_disk_full_errors,does_not_classify_unrelated_space_messages,classifies_embedding_backend_auth_failure,does_not_classify_unrelated_invalid_token_messages,does_not_classify_unrelated_digit_runs_as_transient); extended existingclassifies_ollama_user_config_rejectionsandclassifies_transient_upstream_http_errors; extendedconfig_rejection::tests::detects_wave4_sentry_bodieswith TAURI-RUST-4XK / -2G / -2F bodies.diff-cover) meet the gate enforced by.github/workflows/coverage.yml. Runpnpm test:coverageandpnpm test:rustlocally; PRs below 80% on changed lines will not merge.N/A: observability-only change, no new feature rows. Classifier coverage stays inside the samecore::observabilitybucket already tracked.## Related—N/A: no feature IDs touched.N/A: log-level demotion only; no user-visible UI surface changes.Closes #NNNin the## RelatedsectionImpact
String::containscalls per error path; the additions amortize to a few extra μs per failed turn, deep inside an already-failing path. No hot-path impact.tracinglevel for matched messages changes (Error → Warn / Info). No secrets or PII handling change. Thefactories.rsswitch fromreport_error_messagetoreport_error_or_expectedpreserves the same tag set (ollama_host,fallback) and the same wire body.0.54.0,0.56.0) keep flowing from already-installed clients until those clients update — events tail off naturally with release adoption, not with this PR landing. The matchers are pure additions; existing classifications stay intact (regression-guarded by the negative-assertion tests).Related
Closes: TAURI-RUST-52X, TAURI-RUST-2H5 + TAURI-RUST-2, TAURI-RUST-H4, TAURI-RUST-T, TAURI-RUST-3X, TAURI-RUST-H + TAURI-RUST-2T, TAURI-RUST-4XK, TAURI-RUST-3E, TAURI-RUST-2G + TAURI-RUST-2F
Pattern precedent: PR fix(observability): demote backend Cloudflare anti-bot wrap (Sentry TAURI-RUST-34H) #2692 (
fix(observability): demote backend Cloudflare anti-bot wrap (Sentry TAURI-RUST-34H)), PR fix: route embedding emitters through expected-error reporting #2216 (Wave 4 ollama matchers MA/KM), and the Wave 1-3 config-rejection commits referenced inline insrc/openhuman/inference/provider/config_rejection.rs.Summary by CodeRabbit
Release Notes