fix(subconscious): preserve anyhow chain + retry DDL with busy_timeout on SQLITE_BUSY by oxoxDev · Pull Request #2851 · tinyhumansai/openhuman

oxoxDev · 2026-05-28T12:50:28Z

Summary

Surface the rusqlite root cause behind failed to run subconscious schema DDL Sentry events by preserving the anyhow chain across the RPC Result<Value, String> boundary (was: outer wrapper string only — Sentry never saw the underlying database is locked / disk I/O error / etc.).
Defend the subconscious DB open + DDL run against the most likely root cause (SQLite SQLITE_BUSY / SQLITE_LOCKED lock-races between concurrent in-process callers) via per-connection busy_timeout(5s) plus a 3-attempt exponential backoff retry (100/300/900 ms).

Problem

Self-hosted Sentry issue TAURI-RUST-A (~1,383 events / 24h, cross-platform — Windows 10.0.26200 + macOS builds 25F71/22H730) fires whenever conn.execute_batch(SCHEMA_DDL) at src/openhuman/subconscious/store.rs:30 returns an error. Two distinct problems compounded:

Diagnostic blackout: every emit site in src/openhuman/subconscious/schemas.rs was flattening anyhow to String via .map_err(|e| e.to_string()) BEFORE the central report_error helper saw it. report_error already uses format!("{err:#}") (lines 718 / 732) to walk the chain — but anyhow's Display only prints the outermost context, so by the time the helper got the string the rusqlite root was already gone. Sentry events showed nothing past failed to run subconscious schema DDL.
Underlying transient: the SCHEMA_DDL itself has not changed in 7+ days, but release 0.56.0+e8968077aeb5 cut 2026-05-27 16:42 UTC saw newGroups: 696 on this signature — strongly suggesting a traffic-pattern change (more concurrent with_connection callers from subconscious-adjacent RPCs) tipping the default SQLite busy_timeout = 0 into immediate SQLITE_BUSY errors.

Solution

Commit 1 — fix(subconscious): preserve anyhow chain across RPC boundary so rusqlite root reaches Sentry (bfd76658)

src/openhuman/subconscious/schemas.rs (13 call sites): swap every .map_err(|e| e.to_string()) → .map_err(|e| format!("{e:#}")). The :# formatter walks the anyhow context chain inline.
src/openhuman/subconscious/schemas_tests.rs (+87): 2 guard tests pin the chain-formatter shape — one synthetic 3-level .context chain, one with a real rusqlite::Error::SqliteFailure(DatabaseBusy, "database is locked") wrapped through with_context.

Commit 2 — fix(subconscious): retry DDL with busy_timeout on SQLITE_BUSY (a0b6de7a)

src/openhuman/subconscious/store.rs (+130/-3): refactor with_connection into open_and_initialize_with_retry + open_and_initialize. Per-connection busy_timeout(5s) set BEFORE the SCHEMA_DDL run. 3-attempt exponential backoff (100/300/900 ms) gated by a new private is_sqlite_busy classifier that matches DatabaseBusy | DatabaseLocked via rusqlite::Error downcast with a text-fallback for context-wrapped chains.
src/openhuman/subconscious/store_tests.rs (+183): 6 classifier guards (DatabaseBusy code / DatabaseLocked code / ConstraintViolation rejection / syntax-error rejection / through-context-layers / text-fallback) + 1 end-to-end lock-race test (side thread holds an EXCLUSIVE write txn for ~250 ms; main thread's with_connection retries through and succeeds once the blocker rolls back).

Deliberately NOT shipped (with rationale)

No observability.rs change: the helper already walks the chain. Fixing at the 13 emit sites in schemas.rs is narrower-blast-radius AND avoids conflict with PR fix(observability): demote expected-error Sentry buckets across embeddings, provider, memory-store, FS, and thinking-mode wire shapes #2830 (fix/observability-expected-errors-wave5, merged 2026-05-28).
No ExpectedErrorKind arm for "subconscious DDL + database is locked": after this PR's retry+busy_timeout(5s) loop, the only path to a Sentry event is exhausting ~6 s of unrelenting contention — exactly when we WANT the event to reach Sentry (long-held write txn, AV scanner holding the file, disk-full, real corruption). Pre-emptive classification would silence diagnostic signal during the post-deploy observation window.
No corruption-quarantine arm: orchestrator deliberately gated this on Commit 1's deployment surfacing the "file is encrypted or is not a database" shape first. Without 24h of post-deploy Sentry observation confirming corruption is dominant, ship-before-confirmation risks data loss via false-positive quarantine.

Submission Checklist

Tests added or updated (happy path + at least one failure / edge case) per Testing Strategy — 2 chain-formatter guards + 6 classifier guards + 1 end-to-end lock-race test.
Diff coverage ≥ 80% — changed lines (Vitest + cargo-llvm-cov merged via diff-cover) meet the gate enforced by .github/workflows/coverage.yml. Touched code is Rust-only and is exercised by the new subconscious::store_tests + subconscious::schemas_tests cases.
Coverage matrix updated — N/A: behaviour-only resilience + observability change, no new feature rows. Existing subconscious-domain bucket already tracked.
All affected feature IDs from the matrix are listed in the PR description under ## Related — N/A: no feature IDs touched.
No new external network dependencies introduced.
Manual smoke checklist updated if this touches release-cut surfaces — N/A: subconscious DDL + RPC error-shape change only; no user-visible UI surface.
Linked issue closed via Closes #NNN in the ## Related section — N/A: Sentry-only, no GitHub issue created. Sentry-Issue header below.

Impact

Runtime/platform: desktop (macOS, Windows, Linux). Tauri-rust + core-rust both affected (same with_connection codepath). iOS unaffected (no subconscious store on-device).
Performance: busy_timeout = 5000 is per-connection and only blocks WHEN contention exists — uncontended opens are unchanged. The 3-attempt retry adds at most ~1.3 s of bounded backoff before surfacing a real failure, and only fires for the SQLITE_BUSY/LOCKED arm of the classifier (constraint violations / syntax errors fail-fast unchanged).
Security: none. No secrets / PII path. The new chain-formatter exposes the rusqlite error reason in RPC error strings — these were already user-visible per the existing .map_err(|e| e.to_string()) calls; the chain just goes one layer deeper.
Migration: none. No schema, no config, no IPC contract change.
Compatibility: pure additive. Older clients keep flowing existing events until they upgrade — the matchers and retry are new code only.

Closes: N/A: Sentry-only triage, no GitHub issue tracked.
Follow-up PR(s)/TODOs:
- Post-deploy (24h): triage new Sentry events for TAURI-RUST-A (now showing rusqlite root chain) — confirm SQLITE_BUSY is dominant or pivot to a different defensive arm (SQLITE_IOERR, file is encrypted or is not a database).
- If corruption surfaces in dominant share: ship the quarantine arm (rename to .subconscious-corrupt-{ts}, NEVER unlink — preserve user data).

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Key: N/A (Sentry-driven; no Linear ticket)
URL: N/A

Commit & Branch

Branch: fix/sentry-A-subconscious-ddl-resilience
Commit SHA: a0b6de7a1fbefe3ac0179e5d519585c7fcbd9aff

Validation Run

N/A: pnpm --filter openhuman-app format:check — no app/ changes in this PR.
N/A: pnpm typecheck — no app/ changes in this PR.
Focused tests: cargo test --lib openhuman::subconscious → 106 passed / 0 failed / 0 ignored.
Rust fmt/check (if changed): cargo fmt --all -- --check clean; cargo check --lib clean (10 pre-existing warnings unchanged).
N/A: Tauri fmt/check (if changed) — no app/src-tauri/ changes in this PR.

Validation Blocked

command: cargo test --lib (full lib, not scoped) — known to SIGKILL locally on discord_integration per project memory
error: SIGKILL
impact: Bounded by scoping the run to cargo test --lib openhuman::subconscious (the touched module). CI will run the full matrix.

Behavior Changes

Intended behavior change: with_connection becomes resilient to SQLite SQLITE_BUSY / SQLITE_LOCKED transients via per-connection busy_timeout(5s) + 3-attempt retry; schemas.rs error mapping preserves the anyhow chain.
User-visible effect: subconscious RPCs (e.g. subconscious_tasks_list, subconscious_status) succeed under transient lock contention that previously failed immediately. When they DO fail (after exhausting the retry budget), the resulting RPC error string carries the rusqlite root cause instead of just failed to run subconscious schema DDL.

Parity Contract

Legacy behavior preserved: uncontended opens are unchanged. Non-busy rusqlite errors (constraint violations, syntax errors, file-not-found) still fail-fast on the first attempt — the classifier is anchored on DatabaseBusy | DatabaseLocked only.
Guard/fallback/dispatch parity checks: classifier covers both direct rusqlite::Error downcasts AND context-wrapped chains via text-fallback (6 dedicated unit tests).

Duplicate / Superseded PR Handling

Duplicate PR(s): none found via gh pr list -R tinyhumansai/openhuman --state all --search "in:body,title TAURI-RUST-A".
Canonical PR: this.
Resolution: N/A.

Summary by CodeRabbit

Bug Fixes
- Improved error messages in API responses to provide complete error context for easier troubleshooting.
Reliability
- Enhanced database connection handling with automatic retries when database contention occurs, improving stability during concurrent operations.

…ite root reaches Sentry The subconscious RPC handlers funnel anyhow::Result errors into the JSON-RPC `Result<Value, String>` boundary via `map_err(|e| e.to_string())`. Plain `to_string()` on an `anyhow::Error` returns ONLY the outermost context — for `store::with_connection` failures that meant only the generic wrapper `"failed to run subconscious schema DDL"` reached jsonrpc::invoke_method_inner, which in turn captured it in Sentry. The underlying rusqlite root cause (the actual SQLite error code + body) was permanently invisible. This is the load-bearing fix for Sentry TAURI-RUST-A (self-hosted, ~1.3k events in 24h, cross-platform Windows + macOS) where every event carries the same opaque wrapper text. Without a visible root cause we cannot distinguish a DDL lock-race from a corruption / disk-full / permission failure — all four fingerprint identically. Switch every `.map_err(|e| e.to_string())` in schemas.rs to `format!("{e:#}")` (anyhow's alternate Display walks the cause chain inline joined by ": "). Mirrors the pattern already used in `observability::report_error` itself and the whatsapp_data / memory_queue retry helpers. Two guard tests pin the formatter shape so future contributors cannot silently regress to plain Display: one synthetic chain, one with a real `rusqlite::Error::SqliteFailure(DatabaseBusy, "database is locked")` wrapped through the same `with_context` shape that `with_connection` uses in production.

Adds defence-in-depth against transient SQLite lock contention at the `Connection::open + execute_batch(SCHEMA_DDL)` boundary in `store::with_connection`. Motivated by Sentry TAURI-RUST-A (self-hosted, ~1.3k events / 24h, cross-platform Windows + macOS) where the rusqlite root surfaces as SQLITE_BUSY immediately under concurrent subconscious RPC traffic (status poll every 3 s + tasks_list on Intelligence page + manual trigger, each opening its own connection). Three layers: 1. Per-connection `busy_timeout(5s)`. SQLite's default is 0, so the first lock contention returns SQLITE_BUSY without waiting. Set BEFORE running SCHEMA_DDL since the very first PRAGMA / CREATE TABLE can race with another in-process connection. 2. Application-level retry, 3 attempts, exponential backoff 100 / 300 / 900 ms (≤ 1.3 s total). Catches the residual case where the busy handler is exhausted (long-running external txn, AV scan holding the file). 3. Retry classifier (`is_sqlite_busy`) gates on `DatabaseBusy | DatabaseLocked` only — schema / syntax / corruption errors are real bugs or unrecoverable file-state failures, so retrying just delays the same report. Modelled directly on the established precedents `whatsapp_data::sqlite_retry` and `memory_queue::worker::is_sqlite_busy`, plus the `memory_store::unified::init` busy_timeout configuration. Tests: - Six classifier guards: DatabaseBusy code, DatabaseLocked code, ConstraintViolation rejection, syntax-error rejection, downcast through `with_context` layers (production failure shape), text fallback for stringified errors. - One end-to-end lock-race test: a side thread holds an EXCLUSIVE write transaction for ~250 ms, the main thread calls `with_connection`, the retry loop absorbs the contention and the query completes successfully once the blocker rolls back.

coderabbitai · 2026-05-28T12:50:49Z

📝 Walkthrough

Walkthrough

This PR enhances SQLite connection resilience by improving error visibility, classifying transient database contention, and implementing automatic retry logic. RPC handlers now preserve full error chains; the store layer adds a classifier for SQLITE_BUSY/SQLITE_LOCKED conditions and retries open+DDL+migrations with exponential backoff and per-connection busy timeouts.

Changes

SQLite Resilience & Error Propagation

Layer / File(s)	Summary
Improved error chain formatting in RPC handlers `src/openhuman/subconscious/schemas.rs`, `src/openhuman/subconscious/schemas_tests.rs`	Twelve RPC handlers (`handle_status`, `handle_tasks_list`, `handle_tasks_add`, `handle_tasks_update`, `handle_tasks_remove`, `handle_log_list`, `handle_escalations_list`, `handle_escalations_approve`, `handle_escalations_dismiss`, `handle_reflections_list`, `handle_reflections_act`, `handle_reflections_dismiss`) now format errors with `format!("{e:#}")` instead of `e.to_string()` to preserve full error chain context. Two new tests validate that anyhow error chains—including nested rusqlite root causes—are preserved through the alternate display format.
SQLite busy error classification and documentation `src/openhuman/subconscious/store.rs`, `src/openhuman/subconscious/store_tests.rs`	Adds contention-handling constants (`BUSY_TIMEOUT=5s`, `OPEN_RETRY_ATTEMPTS`, exponential backoff base) and comprehensive documentation. Introduces `is_sqlite_busy` classifier that detects transient contention by downcasting to `rusqlite::Error::SqliteFailure` for `DatabaseBusy`/`DatabaseLocked` codes, with fallback string matching for "database is locked" in formatted error output. Unit tests verify classification for exact error codes, anyhow-wrapped errors, and rejection of constraint/schema failures.
Connection retry orchestration with exponential backoff `src/openhuman/subconscious/store.rs`, `src/openhuman/subconscious/store_tests.rs`	Refactors `with_connection` to retry open+DDL+migrations via `open_and_initialize_with_retry` when `is_sqlite_busy` matches, using exponential backoff with logging. Introduces `open_and_initialize` single-shot helper that opens the SQLite file, sets per-connection `busy_timeout` before DDL, executes schema batch, and applies idempotent migrations. Integration test spawns a blocker thread holding an exclusive write lock and verifies `with_connection` succeeds despite external contention.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

bug, rust-core

Suggested reviewers

M3gA-Mind
graycyrus

Poem

🐰 From locked databases we leap and bound,
With error chains now fully found!
Retry, backoff, give it time—
SQLite contention? No more crime! ✨

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Title check	✅ Passed	The pull request title accurately summarizes the two main changes: preserving the anyhow error chain format and adding retry logic for SQLite busy errors with a busy_timeout mechanism.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai · 2026-05-28T12:53:54Z

Actionable comments posted: 0

graycyrus

@oxoxDev hey! the code looks good to me, but CI isn't fully green yet so i'll hold off on approving.

Quick summary of what i saw:

The two-commit approach is clean. Fixing the 13 e.to_string() sites to format!("{e:#}") is the right narrow-blast-radius fix — the observability helper already walks the chain, the problem was the string flattening happening before it got there. The open_and_initialize_with_retry refactor is well-structured: busy_timeout set before DDL, classifier gated on DatabaseBusy | DatabaseLocked only, non-busy errors fail-fast unchanged. Test coverage is thorough — the end-to-end lock-race test with the blocker thread is exactly the right kind of guard for this.

One thing worth checking: std::thread::sleep inside open_and_initialize_with_retry runs on whatever thread calls with_connection. If these RPC handlers run on async executor threads rather than a dedicated blocking thread pool, sleeping up to 1.3s in the retry path blocks a runtime thread. This matches the pattern in whatsapp_data::sqlite_retry and memory_queue::worker, so it's likely fine if those domains handle it the same way — just worth confirming the call sites go through spawn_blocking or equivalent.

On CI:

The PR Submission Checklist failure is a nit: the three N/A items in the Validation section have [ ] instead of [x] — the checker requires [x] even for N/A-annotated lines. Quick fix.
The Rust Core Tests + Quality failure looks like the known discord_integration SIGKILL — the test step has no conclusion, not a test assertion failure. But i need CI green before i can approve, so once you fix the checklist items and CI re-runs clean, i'll approve.

graycyrus

CI is fully green now — approving.

Picked this up in review 1 already but to recap: clean, targeted fix for Sentry TAURI-RUST-A. The two-commit structure is appropriate — the chain-preservation change is independent and separating it makes bisect easy. The is_sqlite_busy classifier is well-designed: anchored on specific ErrorCode variants with a text-fallback for context-wrapped chains, and explicitly excludes constraint/syntax errors from the retry path so real bugs still fail fast. Test coverage is thorough — the end-to-end lock-race test using a real exclusive transaction is the right way to prove this, not a mock.

The one note from last time stands: std::thread::sleep in open_and_initialize_with_retry blocks the calling thread for up to 1.3 s on retry. Worth confirming all with_connection call sites go through spawn_blocking (or equivalent) when called from an async executor — but this matches the existing pattern in whatsapp_data::sqlite_retry and memory_queue::worker, so it's not a blocker here.

M3gA-Mind

Solid fix — both commits are correct and well-scoped.

Commit 1 (bfd76658) — chain preservation: All 13 map_err(|e| e.to_string()) sites correctly swapped to format!("{e:#}"). The two guard tests are exemplary — one proves that plain Display drops the root (making the fix load-bearing by construction), the other pins the real rusqlite::Error::SqliteFailure shape that appears in production. Future contributors can't silently regress.

Commit 2 (a0b6de7a) — retry + busy_timeout:

busy_timeout(5s) set before SCHEMA_DDL — correct order; the very first PRAGMA in the DDL can already race.
is_sqlite_busy covers both direct downcast and context-wrapped chains via text-fallback — matches the sibling implementations in memory_queue and whatsapp_data.
Classifier correctly rejects ConstraintViolation, Unknown/syntax-error — only transient lock codes retry.
End-to-end lock-race test with mpsc synchronization is well-constructed; the timer gives the blocker ~250 ms of hold time while the retry loop (100→300 ms) lands in the window.

No issues. CI green, coverage gate passes, both previous approvers signed off.

oxoxDev added 2 commits May 28, 2026 17:03

oxoxDev requested a review from a team May 28, 2026 12:50

coderabbitai Bot added rust-core Core Rust runtime in src/: CLI, core_server, shared infrastructure. bug labels May 28, 2026

coderabbitai Bot approved these changes May 28, 2026

View reviewed changes

graycyrus reviewed May 28, 2026

View reviewed changes

oxoxDev mentioned this pull request May 28, 2026

feat(search): add Querit provider #2753

Merged

12 tasks

graycyrus approved these changes May 28, 2026

View reviewed changes

M3gA-Mind approved these changes May 28, 2026

View reviewed changes

M3gA-Mind merged commit 0bd17ea into tinyhumansai:main May 28, 2026
49 of 57 checks passed

coderabbitai Bot mentioned this pull request May 29, 2026

fix(subconscious): suppress repeated Sentry noise when DB init fails (TAURI-RUST-A, closes #2903) #2906

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(subconscious): preserve anyhow chain + retry DDL with busy_timeout on SQLITE_BUSY#2851

fix(subconscious): preserve anyhow chain + retry DDL with busy_timeout on SQLITE_BUSY#2851
M3gA-Mind merged 2 commits into
tinyhumansai:mainfrom
oxoxDev:fix/sentry-A-subconscious-ddl-resilience

oxoxDev commented May 28, 2026 •

edited

Loading

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

graycyrus left a comment

Uh oh!

graycyrus left a comment

Uh oh!

M3gA-Mind left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

oxoxDev commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Problem

Solution

Deliberately NOT shipped (with rationale)

Submission Checklist

Impact

Related

AI Authored PR Metadata (required for Codex/Linear PRs)

Linear Issue

Commit & Branch

Validation Run

Validation Blocked

Behavior Changes

Parity Contract

Duplicate / Superseded PR Handling

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Suggested labels

Suggested reviewers

Poem

Uh oh!

coderabbitai Bot commented May 28, 2026

Uh oh!

graycyrus left a comment

Choose a reason for hiding this comment

Uh oh!

graycyrus left a comment

Choose a reason for hiding this comment

Uh oh!

M3gA-Mind left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

oxoxDev commented May 28, 2026 •

edited

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading