Skip to content

fix(subconscious): preserve anyhow chain + retry DDL with busy_timeout on SQLITE_BUSY#2851

Merged
M3gA-Mind merged 2 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-A-subconscious-ddl-resilience
May 28, 2026
Merged

fix(subconscious): preserve anyhow chain + retry DDL with busy_timeout on SQLITE_BUSY#2851
M3gA-Mind merged 2 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-A-subconscious-ddl-resilience

Conversation

@oxoxDev
Copy link
Copy Markdown
Contributor

@oxoxDev oxoxDev commented May 28, 2026

Summary

  • Surface the rusqlite root cause behind failed to run subconscious schema DDL Sentry events by preserving the anyhow chain across the RPC Result<Value, String> boundary (was: outer wrapper string only — Sentry never saw the underlying database is locked / disk I/O error / etc.).
  • Defend the subconscious DB open + DDL run against the most likely root cause (SQLite SQLITE_BUSY / SQLITE_LOCKED lock-races between concurrent in-process callers) via per-connection busy_timeout(5s) plus a 3-attempt exponential backoff retry (100/300/900 ms).

Problem

Self-hosted Sentry issue TAURI-RUST-A (~1,383 events / 24h, cross-platform — Windows 10.0.26200 + macOS builds 25F71/22H730) fires whenever conn.execute_batch(SCHEMA_DDL) at src/openhuman/subconscious/store.rs:30 returns an error. Two distinct problems compounded:

  1. Diagnostic blackout: every emit site in src/openhuman/subconscious/schemas.rs was flattening anyhow to String via .map_err(|e| e.to_string()) BEFORE the central report_error helper saw it. report_error already uses format!("{err:#}") (lines 718 / 732) to walk the chain — but anyhow's Display only prints the outermost context, so by the time the helper got the string the rusqlite root was already gone. Sentry events showed nothing past failed to run subconscious schema DDL.
  2. Underlying transient: the SCHEMA_DDL itself has not changed in 7+ days, but release 0.56.0+e8968077aeb5 cut 2026-05-27 16:42 UTC saw newGroups: 696 on this signature — strongly suggesting a traffic-pattern change (more concurrent with_connection callers from subconscious-adjacent RPCs) tipping the default SQLite busy_timeout = 0 into immediate SQLITE_BUSY errors.

Solution

Commit 1 — fix(subconscious): preserve anyhow chain across RPC boundary so rusqlite root reaches Sentry (bfd76658)

  • src/openhuman/subconscious/schemas.rs (13 call sites): swap every .map_err(|e| e.to_string()).map_err(|e| format!("{e:#}")). The :# formatter walks the anyhow context chain inline.
  • src/openhuman/subconscious/schemas_tests.rs (+87): 2 guard tests pin the chain-formatter shape — one synthetic 3-level .context chain, one with a real rusqlite::Error::SqliteFailure(DatabaseBusy, "database is locked") wrapped through with_context.

Commit 2 — fix(subconscious): retry DDL with busy_timeout on SQLITE_BUSY (a0b6de7a)

  • src/openhuman/subconscious/store.rs (+130/-3): refactor with_connection into open_and_initialize_with_retry + open_and_initialize. Per-connection busy_timeout(5s) set BEFORE the SCHEMA_DDL run. 3-attempt exponential backoff (100/300/900 ms) gated by a new private is_sqlite_busy classifier that matches DatabaseBusy | DatabaseLocked via rusqlite::Error downcast with a text-fallback for context-wrapped chains.
  • src/openhuman/subconscious/store_tests.rs (+183): 6 classifier guards (DatabaseBusy code / DatabaseLocked code / ConstraintViolation rejection / syntax-error rejection / through-context-layers / text-fallback) + 1 end-to-end lock-race test (side thread holds an EXCLUSIVE write txn for ~250 ms; main thread's with_connection retries through and succeeds once the blocker rolls back).

Deliberately NOT shipped (with rationale)

  • No observability.rs change: the helper already walks the chain. Fixing at the 13 emit sites in schemas.rs is narrower-blast-radius AND avoids conflict with PR fix(observability): demote expected-error Sentry buckets across embeddings, provider, memory-store, FS, and thinking-mode wire shapes #2830 (fix/observability-expected-errors-wave5, merged 2026-05-28).
  • No ExpectedErrorKind arm for "subconscious DDL + database is locked": after this PR's retry+busy_timeout(5s) loop, the only path to a Sentry event is exhausting ~6 s of unrelenting contention — exactly when we WANT the event to reach Sentry (long-held write txn, AV scanner holding the file, disk-full, real corruption). Pre-emptive classification would silence diagnostic signal during the post-deploy observation window.
  • No corruption-quarantine arm: orchestrator deliberately gated this on Commit 1's deployment surfacing the "file is encrypted or is not a database" shape first. Without 24h of post-deploy Sentry observation confirming corruption is dominant, ship-before-confirmation risks data loss via false-positive quarantine.

Submission Checklist

  • Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — 2 chain-formatter guards + 6 classifier guards + 1 end-to-end lock-race test.
  • Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by .github/workflows/coverage.yml. Touched code is Rust-only and is exercised by the new subconscious::store_tests + subconscious::schemas_tests cases.
  • Coverage matrix updated — N/A: behaviour-only resilience + observability change, no new feature rows. Existing subconscious-domain bucket already tracked.
  • All affected feature IDs from the matrix are listed in the PR description under ## RelatedN/A: no feature IDs touched.
  • No new external network dependencies introduced.
  • Manual smoke checklist updated if this touches release-cut surfaces — N/A: subconscious DDL + RPC error-shape change only; no user-visible UI surface.
  • Linked issue closed via Closes #NNN in the ## Related section — N/A: Sentry-only, no GitHub issue created. Sentry-Issue header below.

Impact

  • Runtime/platform: desktop (macOS, Windows, Linux). Tauri-rust + core-rust both affected (same with_connection codepath). iOS unaffected (no subconscious store on-device).
  • Performance: busy_timeout = 5000 is per-connection and only blocks WHEN contention exists — uncontended opens are unchanged. The 3-attempt retry adds at most ~1.3 s of bounded backoff before surfacing a real failure, and only fires for the SQLITE_BUSY/LOCKED arm of the classifier (constraint violations / syntax errors fail-fast unchanged).
  • Security: none. No secrets / PII path. The new chain-formatter exposes the rusqlite error reason in RPC error strings — these were already user-visible per the existing .map_err(|e| e.to_string()) calls; the chain just goes one layer deeper.
  • Migration: none. No schema, no config, no IPC contract change.
  • Compatibility: pure additive. Older clients keep flowing existing events until they upgrade — the matchers and retry are new code only.

Related

Sentry-Issue: TAURI-RUST-A

  • Closes: N/A: Sentry-only triage, no GitHub issue tracked.
  • Follow-up PR(s)/TODOs:
    • Post-deploy (24h): triage new Sentry events for TAURI-RUST-A (now showing rusqlite root chain) — confirm SQLITE_BUSY is dominant or pivot to a different defensive arm (SQLITE_IOERR, file is encrypted or is not a database).
    • If corruption surfaces in dominant share: ship the quarantine arm (rename to .subconscious-corrupt-{ts}, NEVER unlink — preserve user data).

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

  • Key: N/A (Sentry-driven; no Linear ticket)
  • URL: N/A

Commit & Branch

  • Branch: fix/sentry-A-subconscious-ddl-resilience
  • Commit SHA: a0b6de7a1fbefe3ac0179e5d519585c7fcbd9aff

Validation Run

  • N/A: pnpm --filter openhuman-app format:check — no app/ changes in this PR.
  • N/A: pnpm typecheck — no app/ changes in this PR.
  • Focused tests: cargo test --lib openhuman::subconscious → 106 passed / 0 failed / 0 ignored.
  • Rust fmt/check (if changed): cargo fmt --all -- --check clean; cargo check --lib clean (10 pre-existing warnings unchanged).
  • N/A: Tauri fmt/check (if changed) — no app/src-tauri/ changes in this PR.

Validation Blocked

  • command: cargo test --lib (full lib, not scoped) — known to SIGKILL locally on discord_integration per project memory
  • error: SIGKILL
  • impact: Bounded by scoping the run to cargo test --lib openhuman::subconscious (the touched module). CI will run the full matrix.

Behavior Changes

  • Intended behavior change: with_connection becomes resilient to SQLite SQLITE_BUSY / SQLITE_LOCKED transients via per-connection busy_timeout(5s) + 3-attempt retry; schemas.rs error mapping preserves the anyhow chain.
  • User-visible effect: subconscious RPCs (e.g. subconscious_tasks_list, subconscious_status) succeed under transient lock contention that previously failed immediately. When they DO fail (after exhausting the retry budget), the resulting RPC error string carries the rusqlite root cause instead of just failed to run subconscious schema DDL.

Parity Contract

  • Legacy behavior preserved: uncontended opens are unchanged. Non-busy rusqlite errors (constraint violations, syntax errors, file-not-found) still fail-fast on the first attempt — the classifier is anchored on DatabaseBusy | DatabaseLocked only.
  • Guard/fallback/dispatch parity checks: classifier covers both direct rusqlite::Error downcasts AND context-wrapped chains via text-fallback (6 dedicated unit tests).

Duplicate / Superseded PR Handling

  • Duplicate PR(s): none found via gh pr list -R tinyhumansai/openhuman --state all --search "in:body,title TAURI-RUST-A".
  • Canonical PR: this.
  • Resolution: N/A.

Summary by CodeRabbit

  • Bug Fixes

    • Improved error messages in API responses to provide complete error context for easier troubleshooting.
  • Reliability

    • Enhanced database connection handling with automatic retries when database contention occurs, improving stability during concurrent operations.

Review Change Stack

oxoxDev added 2 commits May 28, 2026 17:03
…ite root reaches Sentry

The subconscious RPC handlers funnel anyhow::Result errors into the
JSON-RPC `Result<Value, String>` boundary via `map_err(|e| e.to_string())`.
Plain `to_string()` on an `anyhow::Error` returns ONLY the outermost
context — for `store::with_connection` failures that meant only the
generic wrapper `"failed to run subconscious schema DDL"` reached
jsonrpc::invoke_method_inner, which in turn captured it in Sentry. The
underlying rusqlite root cause (the actual SQLite error code + body)
was permanently invisible.

This is the load-bearing fix for Sentry TAURI-RUST-A (self-hosted,
~1.3k events in 24h, cross-platform Windows + macOS) where every event
carries the same opaque wrapper text. Without a visible root cause we
cannot distinguish a DDL lock-race from a corruption / disk-full /
permission failure — all four fingerprint identically.

Switch every `.map_err(|e| e.to_string())` in schemas.rs to
`format!("{e:#}")` (anyhow's alternate Display walks the cause chain
inline joined by ": "). Mirrors the pattern already used in
`observability::report_error` itself and the whatsapp_data /
memory_queue retry helpers.

Two guard tests pin the formatter shape so future contributors cannot
silently regress to plain Display: one synthetic chain, one with a
real `rusqlite::Error::SqliteFailure(DatabaseBusy, "database is
locked")` wrapped through the same `with_context` shape that
`with_connection` uses in production.
Adds defence-in-depth against transient SQLite lock contention at the
`Connection::open + execute_batch(SCHEMA_DDL)` boundary in
`store::with_connection`. Motivated by Sentry TAURI-RUST-A (self-hosted,
~1.3k events / 24h, cross-platform Windows + macOS) where the rusqlite
root surfaces as SQLITE_BUSY immediately under concurrent subconscious
RPC traffic (status poll every 3 s + tasks_list on Intelligence page +
manual trigger, each opening its own connection).

Three layers:

1. Per-connection `busy_timeout(5s)`. SQLite's default is 0, so the
   first lock contention returns SQLITE_BUSY without waiting. Set
   BEFORE running SCHEMA_DDL since the very first PRAGMA / CREATE
   TABLE can race with another in-process connection.
2. Application-level retry, 3 attempts, exponential backoff
   100 / 300 / 900 ms (≤ 1.3 s total). Catches the residual case
   where the busy handler is exhausted (long-running external txn,
   AV scan holding the file).
3. Retry classifier (`is_sqlite_busy`) gates on
   `DatabaseBusy | DatabaseLocked` only — schema / syntax /
   corruption errors are real bugs or unrecoverable file-state
   failures, so retrying just delays the same report.

Modelled directly on the established precedents
`whatsapp_data::sqlite_retry` and `memory_queue::worker::is_sqlite_busy`,
plus the `memory_store::unified::init` busy_timeout configuration.

Tests:
- Six classifier guards: DatabaseBusy code, DatabaseLocked code,
  ConstraintViolation rejection, syntax-error rejection, downcast
  through `with_context` layers (production failure shape), text
  fallback for stringified errors.
- One end-to-end lock-race test: a side thread holds an EXCLUSIVE
  write transaction for ~250 ms, the main thread calls
  `with_connection`, the retry loop absorbs the contention and the
  query completes successfully once the blocker rolls back.
@oxoxDev oxoxDev requested a review from a team May 28, 2026 12:50
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

📝 Walkthrough

Walkthrough

This PR enhances SQLite connection resilience by improving error visibility, classifying transient database contention, and implementing automatic retry logic. RPC handlers now preserve full error chains; the store layer adds a classifier for SQLITE_BUSY/SQLITE_LOCKED conditions and retries open+DDL+migrations with exponential backoff and per-connection busy timeouts.

Changes

SQLite Resilience & Error Propagation

Layer / File(s) Summary
Improved error chain formatting in RPC handlers
src/openhuman/subconscious/schemas.rs, src/openhuman/subconscious/schemas_tests.rs
Twelve RPC handlers (handle_status, handle_tasks_list, handle_tasks_add, handle_tasks_update, handle_tasks_remove, handle_log_list, handle_escalations_list, handle_escalations_approve, handle_escalations_dismiss, handle_reflections_list, handle_reflections_act, handle_reflections_dismiss) now format errors with format!("{e:#}") instead of e.to_string() to preserve full error chain context. Two new tests validate that anyhow error chains—including nested rusqlite root causes—are preserved through the alternate display format.
SQLite busy error classification and documentation
src/openhuman/subconscious/store.rs, src/openhuman/subconscious/store_tests.rs
Adds contention-handling constants (BUSY_TIMEOUT=5s, OPEN_RETRY_ATTEMPTS, exponential backoff base) and comprehensive documentation. Introduces is_sqlite_busy classifier that detects transient contention by downcasting to rusqlite::Error::SqliteFailure for DatabaseBusy/DatabaseLocked codes, with fallback string matching for "database is locked" in formatted error output. Unit tests verify classification for exact error codes, anyhow-wrapped errors, and rejection of constraint/schema failures.
Connection retry orchestration with exponential backoff
src/openhuman/subconscious/store.rs, src/openhuman/subconscious/store_tests.rs
Refactors with_connection to retry open+DDL+migrations via open_and_initialize_with_retry when is_sqlite_busy matches, using exponential backoff with logging. Introduces open_and_initialize single-shot helper that opens the SQLite file, sets per-connection busy_timeout before DDL, executes schema batch, and applies idempotent migrations. Integration test spawns a blocker thread holding an exclusive write lock and verifies with_connection succeeds despite external contention.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug, rust-core

Suggested reviewers

  • M3gA-Mind
  • graycyrus

Poem

🐰 From locked databases we leap and bound,
With error chains now fully found!
Retry, backoff, give it time—
SQLite contention? No more crime! ✨

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The pull request title accurately summarizes the two main changes: preserving the anyhow error chain format and adding retry logic for SQLite busy errors with a busy_timeout mechanism.
Docstring Coverage ✅ Passed Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot added rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. bug labels May 28, 2026
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Actionable comments posted: 0

Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@oxoxDev hey! the code looks good to me, but CI isn't fully green yet so i'll hold off on approving.

Quick summary of what i saw:

The two-commit approach is clean. Fixing the 13 e.to_string() sites to format!("{e:#}") is the right narrow-blast-radius fix — the observability helper already walks the chain, the problem was the string flattening happening before it got there. The open_and_initialize_with_retry refactor is well-structured: busy_timeout set before DDL, classifier gated on DatabaseBusy | DatabaseLocked only, non-busy errors fail-fast unchanged. Test coverage is thorough — the end-to-end lock-race test with the blocker thread is exactly the right kind of guard for this.

One thing worth checking: std::thread::sleep inside open_and_initialize_with_retry runs on whatever thread calls with_connection. If these RPC handlers run on async executor threads rather than a dedicated blocking thread pool, sleeping up to 1.3s in the retry path blocks a runtime thread. This matches the pattern in whatsapp_data::sqlite_retry and memory_queue::worker, so it's likely fine if those domains handle it the same way — just worth confirming the call sites go through spawn_blocking or equivalent.

On CI:

  • The PR Submission Checklist failure is a nit: the three N/A items in the Validation section have [ ] instead of [x] — the checker requires [x] even for N/A-annotated lines. Quick fix.
  • The Rust Core Tests + Quality failure looks like the known discord_integration SIGKILL — the test step has no conclusion, not a test assertion failure. But i need CI green before i can approve, so once you fix the checklist items and CI re-runs clean, i'll approve.

@oxoxDev oxoxDev mentioned this pull request May 28, 2026
12 tasks
Copy link
Copy Markdown
Contributor

@graycyrus graycyrus left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CI is fully green now — approving.

Picked this up in review 1 already but to recap: clean, targeted fix for Sentry TAURI-RUST-A. The two-commit structure is appropriate — the chain-preservation change is independent and separating it makes bisect easy. The is_sqlite_busy classifier is well-designed: anchored on specific ErrorCode variants with a text-fallback for context-wrapped chains, and explicitly excludes constraint/syntax errors from the retry path so real bugs still fail fast. Test coverage is thorough — the end-to-end lock-race test using a real exclusive transaction is the right way to prove this, not a mock.

The one note from last time stands: std::thread::sleep in open_and_initialize_with_retry blocks the calling thread for up to 1.3 s on retry. Worth confirming all with_connection call sites go through spawn_blocking (or equivalent) when called from an async executor — but this matches the existing pattern in whatsapp_data::sqlite_retry and memory_queue::worker, so it's not a blocker here.

Copy link
Copy Markdown
Contributor

@M3gA-Mind M3gA-Mind left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid fix — both commits are correct and well-scoped.

Commit 1 (bfd76658) — chain preservation: All 13 map_err(|e| e.to_string()) sites correctly swapped to format!("{e:#}"). The two guard tests are exemplary — one proves that plain Display drops the root (making the fix load-bearing by construction), the other pins the real rusqlite::Error::SqliteFailure shape that appears in production. Future contributors can't silently regress.

Commit 2 (a0b6de7a) — retry + busy_timeout:

  • busy_timeout(5s) set before SCHEMA_DDL — correct order; the very first PRAGMA in the DDL can already race.
  • is_sqlite_busy covers both direct downcast and context-wrapped chains via text-fallback — matches the sibling implementations in memory_queue and whatsapp_data.
  • Classifier correctly rejects ConstraintViolation, Unknown/syntax-error — only transient lock codes retry.
  • End-to-end lock-race test with mpsc synchronization is well-constructed; the timer gives the blocker ~250 ms of hold time while the retry loop (100→300 ms) lands in the window.

No issues. CI green, coverage gate passes, both previous approvers signed off.

@M3gA-Mind M3gA-Mind merged commit 0bd17ea into tinyhumansai:main May 28, 2026
49 of 57 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants