fix(subconscious): preserve anyhow chain + retry DDL with busy_timeout on SQLITE_BUSY#2851
Conversation
…ite root reaches Sentry
The subconscious RPC handlers funnel anyhow::Result errors into the
JSON-RPC `Result<Value, String>` boundary via `map_err(|e| e.to_string())`.
Plain `to_string()` on an `anyhow::Error` returns ONLY the outermost
context — for `store::with_connection` failures that meant only the
generic wrapper `"failed to run subconscious schema DDL"` reached
jsonrpc::invoke_method_inner, which in turn captured it in Sentry. The
underlying rusqlite root cause (the actual SQLite error code + body)
was permanently invisible.
This is the load-bearing fix for Sentry TAURI-RUST-A (self-hosted,
~1.3k events in 24h, cross-platform Windows + macOS) where every event
carries the same opaque wrapper text. Without a visible root cause we
cannot distinguish a DDL lock-race from a corruption / disk-full /
permission failure — all four fingerprint identically.
Switch every `.map_err(|e| e.to_string())` in schemas.rs to
`format!("{e:#}")` (anyhow's alternate Display walks the cause chain
inline joined by ": "). Mirrors the pattern already used in
`observability::report_error` itself and the whatsapp_data /
memory_queue retry helpers.
Two guard tests pin the formatter shape so future contributors cannot
silently regress to plain Display: one synthetic chain, one with a
real `rusqlite::Error::SqliteFailure(DatabaseBusy, "database is
locked")` wrapped through the same `with_context` shape that
`with_connection` uses in production.
Adds defence-in-depth against transient SQLite lock contention at the `Connection::open + execute_batch(SCHEMA_DDL)` boundary in `store::with_connection`. Motivated by Sentry TAURI-RUST-A (self-hosted, ~1.3k events / 24h, cross-platform Windows + macOS) where the rusqlite root surfaces as SQLITE_BUSY immediately under concurrent subconscious RPC traffic (status poll every 3 s + tasks_list on Intelligence page + manual trigger, each opening its own connection). Three layers: 1. Per-connection `busy_timeout(5s)`. SQLite's default is 0, so the first lock contention returns SQLITE_BUSY without waiting. Set BEFORE running SCHEMA_DDL since the very first PRAGMA / CREATE TABLE can race with another in-process connection. 2. Application-level retry, 3 attempts, exponential backoff 100 / 300 / 900 ms (≤ 1.3 s total). Catches the residual case where the busy handler is exhausted (long-running external txn, AV scan holding the file). 3. Retry classifier (`is_sqlite_busy`) gates on `DatabaseBusy | DatabaseLocked` only — schema / syntax / corruption errors are real bugs or unrecoverable file-state failures, so retrying just delays the same report. Modelled directly on the established precedents `whatsapp_data::sqlite_retry` and `memory_queue::worker::is_sqlite_busy`, plus the `memory_store::unified::init` busy_timeout configuration. Tests: - Six classifier guards: DatabaseBusy code, DatabaseLocked code, ConstraintViolation rejection, syntax-error rejection, downcast through `with_context` layers (production failure shape), text fallback for stringified errors. - One end-to-end lock-race test: a side thread holds an EXCLUSIVE write transaction for ~250 ms, the main thread calls `with_connection`, the retry loop absorbs the contention and the query completes successfully once the blocker rolls back.
📝 WalkthroughWalkthroughThis PR enhances SQLite connection resilience by improving error visibility, classifying transient database contention, and implementing automatic retry logic. RPC handlers now preserve full error chains; the store layer adds a classifier for SQLITE_BUSY/SQLITE_LOCKED conditions and retries open+DDL+migrations with exponential backoff and per-connection busy timeouts. ChangesSQLite Resilience & Error Propagation
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
|
Actionable comments posted: 0 |
graycyrus
left a comment
There was a problem hiding this comment.
@oxoxDev hey! the code looks good to me, but CI isn't fully green yet so i'll hold off on approving.
Quick summary of what i saw:
The two-commit approach is clean. Fixing the 13 e.to_string() sites to format!("{e:#}") is the right narrow-blast-radius fix — the observability helper already walks the chain, the problem was the string flattening happening before it got there. The open_and_initialize_with_retry refactor is well-structured: busy_timeout set before DDL, classifier gated on DatabaseBusy | DatabaseLocked only, non-busy errors fail-fast unchanged. Test coverage is thorough — the end-to-end lock-race test with the blocker thread is exactly the right kind of guard for this.
One thing worth checking: std::thread::sleep inside open_and_initialize_with_retry runs on whatever thread calls with_connection. If these RPC handlers run on async executor threads rather than a dedicated blocking thread pool, sleeping up to 1.3s in the retry path blocks a runtime thread. This matches the pattern in whatsapp_data::sqlite_retry and memory_queue::worker, so it's likely fine if those domains handle it the same way — just worth confirming the call sites go through spawn_blocking or equivalent.
On CI:
- The
PR Submission Checklistfailure is a nit: the three N/A items in the Validation section have[ ]instead of[x]— the checker requires[x]even for N/A-annotated lines. Quick fix. - The
Rust Core Tests + Qualityfailure looks like the knowndiscord_integrationSIGKILL — the test step has no conclusion, not a test assertion failure. But i need CI green before i can approve, so once you fix the checklist items and CI re-runs clean, i'll approve.
graycyrus
left a comment
There was a problem hiding this comment.
CI is fully green now — approving.
Picked this up in review 1 already but to recap: clean, targeted fix for Sentry TAURI-RUST-A. The two-commit structure is appropriate — the chain-preservation change is independent and separating it makes bisect easy. The is_sqlite_busy classifier is well-designed: anchored on specific ErrorCode variants with a text-fallback for context-wrapped chains, and explicitly excludes constraint/syntax errors from the retry path so real bugs still fail fast. Test coverage is thorough — the end-to-end lock-race test using a real exclusive transaction is the right way to prove this, not a mock.
The one note from last time stands: std::thread::sleep in open_and_initialize_with_retry blocks the calling thread for up to 1.3 s on retry. Worth confirming all with_connection call sites go through spawn_blocking (or equivalent) when called from an async executor — but this matches the existing pattern in whatsapp_data::sqlite_retry and memory_queue::worker, so it's not a blocker here.
M3gA-Mind
left a comment
There was a problem hiding this comment.
Solid fix — both commits are correct and well-scoped.
Commit 1 (bfd76658) — chain preservation: All 13 map_err(|e| e.to_string()) sites correctly swapped to format!("{e:#}"). The two guard tests are exemplary — one proves that plain Display drops the root (making the fix load-bearing by construction), the other pins the real rusqlite::Error::SqliteFailure shape that appears in production. Future contributors can't silently regress.
Commit 2 (a0b6de7a) — retry + busy_timeout:
busy_timeout(5s)set before SCHEMA_DDL — correct order; the very first PRAGMA in the DDL can already race.is_sqlite_busycovers both direct downcast and context-wrapped chains via text-fallback — matches the sibling implementations inmemory_queueandwhatsapp_data.- Classifier correctly rejects
ConstraintViolation,Unknown/syntax-error — only transient lock codes retry. - End-to-end lock-race test with
mpscsynchronization is well-constructed; the timer gives the blocker ~250 ms of hold time while the retry loop (100→300 ms) lands in the window.
No issues. CI green, coverage gate passes, both previous approvers signed off.
Summary
failed to run subconscious schema DDLSentry events by preserving the anyhow chain across the RPCResult<Value, String>boundary (was: outer wrapper string only — Sentry never saw the underlyingdatabase is locked/disk I/O error/ etc.).SQLITE_BUSY/SQLITE_LOCKEDlock-races between concurrent in-process callers) via per-connectionbusy_timeout(5s)plus a 3-attempt exponential backoff retry (100/300/900 ms).Problem
Self-hosted Sentry issue TAURI-RUST-A (~1,383 events / 24h, cross-platform — Windows 10.0.26200 + macOS builds 25F71/22H730) fires whenever
conn.execute_batch(SCHEMA_DDL)atsrc/openhuman/subconscious/store.rs:30returns an error. Two distinct problems compounded:src/openhuman/subconscious/schemas.rswas flattening anyhow toStringvia.map_err(|e| e.to_string())BEFORE the centralreport_errorhelper saw it.report_erroralready usesformat!("{err:#}")(lines 718 / 732) to walk the chain — but anyhow'sDisplayonly prints the outermost context, so by the time the helper got the string the rusqlite root was already gone. Sentry events showed nothing pastfailed to run subconscious schema DDL.0.56.0+e8968077aeb5cut 2026-05-27 16:42 UTC sawnewGroups: 696on this signature — strongly suggesting a traffic-pattern change (more concurrentwith_connectioncallers from subconscious-adjacent RPCs) tipping the default SQLitebusy_timeout = 0into immediateSQLITE_BUSYerrors.Solution
Commit 1 —
fix(subconscious): preserve anyhow chain across RPC boundary so rusqlite root reaches Sentry(bfd76658)src/openhuman/subconscious/schemas.rs(13 call sites): swap every.map_err(|e| e.to_string())→.map_err(|e| format!("{e:#}")). The:#formatter walks the anyhow context chain inline.src/openhuman/subconscious/schemas_tests.rs(+87): 2 guard tests pin the chain-formatter shape — one synthetic 3-level.contextchain, one with a realrusqlite::Error::SqliteFailure(DatabaseBusy, "database is locked")wrapped throughwith_context.Commit 2 —
fix(subconscious): retry DDL with busy_timeout on SQLITE_BUSY(a0b6de7a)src/openhuman/subconscious/store.rs(+130/-3): refactorwith_connectionintoopen_and_initialize_with_retry+open_and_initialize. Per-connectionbusy_timeout(5s)set BEFORE the SCHEMA_DDL run. 3-attempt exponential backoff (100/300/900 ms) gated by a new privateis_sqlite_busyclassifier that matchesDatabaseBusy | DatabaseLockedviarusqlite::Errordowncast with a text-fallback for context-wrapped chains.src/openhuman/subconscious/store_tests.rs(+183): 6 classifier guards (DatabaseBusy code / DatabaseLocked code / ConstraintViolation rejection / syntax-error rejection / through-context-layers / text-fallback) + 1 end-to-end lock-race test (side thread holds an EXCLUSIVE write txn for ~250 ms; main thread'swith_connectionretries through and succeeds once the blocker rolls back).Deliberately NOT shipped (with rationale)
observability.rschange: the helper already walks the chain. Fixing at the 13 emit sites inschemas.rsis narrower-blast-radius AND avoids conflict with PR fix(observability): demote expected-error Sentry buckets across embeddings, provider, memory-store, FS, and thinking-mode wire shapes #2830 (fix/observability-expected-errors-wave5, merged 2026-05-28).ExpectedErrorKindarm for "subconscious DDL + database is locked": after this PR's retry+busy_timeout(5s) loop, the only path to a Sentry event is exhausting ~6 s of unrelenting contention — exactly when we WANT the event to reach Sentry (long-held write txn, AV scanner holding the file, disk-full, real corruption). Pre-emptive classification would silence diagnostic signal during the post-deploy observation window."file is encrypted or is not a database"shape first. Without 24h of post-deploy Sentry observation confirming corruption is dominant, ship-before-confirmation risks data loss via false-positive quarantine.Submission Checklist
diff-cover) meet the gate enforced by.github/workflows/coverage.yml. Touched code is Rust-only and is exercised by the newsubconscious::store_tests+subconscious::schemas_testscases.N/A: behaviour-only resilience + observability change, no new feature rows. Existing subconscious-domain bucket already tracked.## Related—N/A: no feature IDs touched.N/A: subconscious DDL + RPC error-shape change only; no user-visible UI surface.Closes #NNNin the## Relatedsection —N/A: Sentry-only, no GitHub issue created. Sentry-Issue header below.Impact
with_connectioncodepath). iOS unaffected (no subconscious store on-device).busy_timeout = 5000is per-connection and only blocks WHEN contention exists — uncontended opens are unchanged. The 3-attempt retry adds at most ~1.3 s of bounded backoff before surfacing a real failure, and only fires for the SQLITE_BUSY/LOCKED arm of the classifier (constraint violations / syntax errors fail-fast unchanged)..map_err(|e| e.to_string())calls; the chain just goes one layer deeper.Related
Sentry-Issue: TAURI-RUST-A
N/A: Sentry-only triage, no GitHub issue tracked.TAURI-RUST-A(now showing rusqlite root chain) — confirm SQLITE_BUSY is dominant or pivot to a different defensive arm (SQLITE_IOERR,file is encrypted or is not a database)..subconscious-corrupt-{ts}, NEVERunlink— preserve user data).AI Authored PR Metadata (required for Codex/Linear PRs)
Linear Issue
Commit & Branch
fix/sentry-A-subconscious-ddl-resiliencea0b6de7a1fbefe3ac0179e5d519585c7fcbd9affValidation Run
pnpm --filter openhuman-app format:check— noapp/changes in this PR.pnpm typecheck— noapp/changes in this PR.cargo test --lib openhuman::subconscious→ 106 passed / 0 failed / 0 ignored.cargo fmt --all -- --checkclean;cargo check --libclean (10 pre-existing warnings unchanged).app/src-tauri/changes in this PR.Validation Blocked
command:cargo test --lib(full lib, not scoped) — known to SIGKILL locally ondiscord_integrationper project memoryerror:SIGKILLimpact:Bounded by scoping the run tocargo test --lib openhuman::subconscious(the touched module). CI will run the full matrix.Behavior Changes
with_connectionbecomes resilient to SQLiteSQLITE_BUSY/SQLITE_LOCKEDtransients via per-connectionbusy_timeout(5s)+ 3-attempt retry;schemas.rserror mapping preserves the anyhow chain.subconscious_tasks_list,subconscious_status) succeed under transient lock contention that previously failed immediately. When they DO fail (after exhausting the retry budget), the resulting RPC error string carries the rusqlite root cause instead of justfailed to run subconscious schema DDL.Parity Contract
DatabaseBusy | DatabaseLockedonly.rusqlite::Errordowncasts AND context-wrapped chains via text-fallback (6 dedicated unit tests).Duplicate / Superseded PR Handling
gh pr list -R tinyhumansai/openhuman --state all --search "in:body,title TAURI-RUST-A".Summary by CodeRabbit
Bug Fixes
Reliability