fix(subconscious): suppress repeated Sentry noise when DB init fails (TAURI-RUST-A, closes #2903)#2906
Conversation
… (TAURI-RUST-A) DDL runs on every `with_connection` call. On Windows, if the very first call fails (directory lock, antivirus hold, network-drive WAL rejection, etc.) every subsequent `subconscious_tasks_list` poll re-runs the same failing path and fires `report_error_or_expected` — producing 1,499+ identical Sentry events from a single underlying cause (TAURI-RUST-A). Fix: add `DDL_EVER_FAILED` / `DDL_LOGGED` atomic flags in `store.rs`. - First failure: log at `error!` level (Sentry captures once) + set both flags. - Subsequent calls: return `DB_INIT_FAILED_SENTINEL` immediately, skipping the expensive re-open + DDL re-run entirely. - Read-only list handlers (`tasks_list`, `log_list`, `escalations_list`, `reflections_list`, `status`) in `schemas.rs` detect the sentinel via `is_db_init_failed()` and return empty payloads instead of propagating an error that would re-fire `report_error_or_expected` on every poll. - Write/mutate handlers (`tasks_add`, `tasks_update`, `tasks_remove`, `escalations_approve`, `escalations_dismiss`, `reflections_act`, `reflections_dismiss`) still surface the error — appropriate since those are explicit user actions, not background polls. Closes tinyhumansai#2903.
📝 WalkthroughWalkthroughThe PR prevents Sentry flooding from repeated DDL initialization failures by introducing a fast-fail sentinel. The store module caches the first DDL failure and returns a constant sentinel on subsequent calls. Five RPC handlers detect this sentinel and return empty responses instead of propagating errors. ChangesGraceful Degradation for Subconscious DDL Init Failure
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly Related PRs
Suggested Labels
Suggested Reviewers
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/openhuman/subconscious/store.rs`:
- Around line 32-57: The global DDL failure flags (DDL_EVER_FAILED, DDL_LOGGED)
cause poisoning across workspaces because with_connection fast-bails for any
workspace once one fails; change this by scoping the sentinel to the workspace
(or by resetting the flags when the engine is reset). Either: replace the two
globals with a workspace-keyed cache (e.g., map from workspace path/DB path to
bools) and have with_connection consult/set the entry for the given
workspace_dir/DB path and use DB_INIT_FAILED_SENTINEL per-workspace, or add
logic in reset_engine_for_user_switch (and the SubconsciousEngine teardown) to
clear DDL_EVER_FAILED and DDL_LOGGED so a new workspace can retry migrations;
update references in with_connection to use the new per-workspace flag or
cleared globals accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 186bbee2-a2da-4f53-af74-4d130e359011
📒 Files selected for processing (3)
src/openhuman/subconscious/schemas.rssrc/openhuman/subconscious/store.rssrc/openhuman/subconscious/store_tests.rs
| static DDL_EVER_FAILED: AtomicBool = AtomicBool::new(false); | ||
|
|
||
| /// Set to `true` once we have emitted the first error log. Ensures Sentry only | ||
| /// captures the failure once — subsequent calls get a silent sentinel error. | ||
| static DDL_LOGGED: AtomicBool = AtomicBool::new(false); | ||
|
|
||
| /// Sentinel message returned after the first DDL failure. Read-only list | ||
| /// handlers in `schemas.rs` match on this prefix to return empty payloads | ||
| /// instead of propagating an error that would fire `report_error_or_expected`. | ||
| pub const DB_INIT_FAILED_SENTINEL: &str = "[subconscious] db unavailable (init failed)"; | ||
|
|
||
| /// Open the subconscious database and run schema migrations. | ||
| /// | ||
| /// After the first DDL failure the function returns [`DB_INIT_FAILED_SENTINEL`] | ||
| /// immediately, skipping both the filesystem open and the DDL re-run. Callers | ||
| /// that perform read-only list queries should treat this sentinel as "empty | ||
| /// result" rather than an RPC error (see `schemas.rs`). | ||
| pub fn with_connection<T>( | ||
| workspace_dir: &Path, | ||
| f: impl FnOnce(&Connection) -> Result<T>, | ||
| ) -> Result<T> { | ||
| // Fast path: DDL already failed in this process — skip the re-run and | ||
| // return the sentinel without logging again. | ||
| if DDL_EVER_FAILED.load(Ordering::Relaxed) { | ||
| anyhow::bail!(DB_INIT_FAILED_SENTINEL); | ||
| } |
There was a problem hiding this comment.
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Call sites of store::with_connection =="
rg -nP --type=rs -C2 '\bstore::with_connection\s*\('
echo
echo "== All workspace_dir usages (look for mutation/switch patterns) =="
rg -nP --type=rs -C3 '\bworkspace_dir\b'
echo
echo "== Potential workspace switch/update entry points =="
rg -nP --type=rs -C3 'set_.*workspace|workspace.*(switch|change|update)|load_config_with_timeout|load_or_init'Repository: tinyhumansai/openhuman
Length of output: 154
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Call sites of store::with_connection =="
rg -n --glob='*.rs' '\bstore::with_connection\s*\(' -S
echo
echo "== All workspace_dir usages (look for mutation/switch patterns) =="
rg -n --glob='*.rs' '\bworkspace_dir\b' -S
echo
echo "== Potential workspace switch/update entry points =="
rg -n --glob='*.rs' 'set_.*workspace|workspace.*(switch|change|update)|load_config_with_timeout|load_or_init' -SRepository: tinyhumansai/openhuman
Length of output: 50379
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Inspect src/openhuman/subconscious/global.rs =="
ls -la src/openhuman/subconscious/global.rs
sed -n '1,220p' src/openhuman/subconscious/global.rs
echo
echo "== Search for multi-workspace or switching language in subconscious =="
rg -n --glob='src/openhuman/subconscious/**/*.rs' '(switch|switching|active_user|multi[-_ ]user|concurrent|per-user|leak|leaking|NOW-correct)' -S
echo
echo "== Look for how Config is loaded per request/task (workspace_dir scope) =="
rg -n --glob='src/openhuman/**/*.rs' 'Config::load_or_init|active_user\.toml|workspace_dir.*active_user|load_active|set_active' -S src/openhumanRepository: tinyhumansai/openhuman
Length of output: 21563
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Inspect src/openhuman/subconscious/store.rs (atomic flags + maybe_log_ddl_error + with_connection failure paths) =="
sed -n '1,220p' src/openhuman/subconscious/store.rs
echo
sed -n '220,520p' src/openhuman/subconscious/store.rs
echo
echo "== Search for reset_engine_for_user_switch call sites =="
rg -n --glob='*.rs' 'reset_engine_for_user_switch\s*\(' -S src/openhuman
echo
echo "== Search for DDL_EVER_FAILED / DDL_LOGGED resets/usages =="
rg -n --glob='*.rs' 'DDL_EVER_FAILED|DDL_LOGGED|DB_INIT_FAILED_SENTINEL' -S src/openhuman/subconsciousRepository: tinyhumansai/openhuman
Length of output: 21425
🏁 Script executed:
#!/bin/bash
set -euo pipefail
sed -n '330,430p' src/openhuman/credentials/ops.rs
echo
echo "== Inspect src/openhuman/subconscious/schemas.rs handling DB_INIT_FAILED_SENTINEL =="
# show the relevant match/if block around the sentinel usage near line ~758
sed -n '720,810p' src/openhuman/subconscious/schemas.rs
echo
echo "== Find other DB_INIT_FAILED_SENTINEL uses =="
rg -n --glob='*.rs' 'DB_INIT_FAILED_SENTINEL' src/openhuman/subconscious -SRepository: tinyhumansai/openhuman
Length of output: 6283
Fix: make DDL failure cache workspace-scoped (current global poisoning across user switches)
reset_engine_for_user_switch() clears the cached SubconsciousEngine on logout/user switch, but src/openhuman/subconscious/store.rs’s DDL_EVER_FAILED / DDL_LOGGED are never reset—so if DDL fails for one workspace_dir, with_connection will fast-bail with DB_INIT_FAILED_SENTINEL for all subsequent workspace_dir values in the same process, skipping DB open + migrations even if the new workspace would succeed. Scope the sentinel/cache by DB path/workspace (or reset the DDL flags when resetting the engine).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@src/openhuman/subconscious/store.rs` around lines 32 - 57, The global DDL
failure flags (DDL_EVER_FAILED, DDL_LOGGED) cause poisoning across workspaces
because with_connection fast-bails for any workspace once one fails; change this
by scoping the sentinel to the workspace (or by resetting the flags when the
engine is reset). Either: replace the two globals with a workspace-keyed cache
(e.g., map from workspace path/DB path to bools) and have with_connection
consult/set the entry for the given workspace_dir/DB path and use
DB_INIT_FAILED_SENTINEL per-workspace, or add logic in
reset_engine_for_user_switch (and the SubconsciousEngine teardown) to clear
DDL_EVER_FAILED and DDL_LOGGED so a new workspace can retry migrations; update
references in with_connection to use the new per-workspace flag or cleared
globals accordingly.
…try (resolves CodeRabbit) Resolves the workspace-poisoning issue CodeRabbit flagged: the old DDL_EVER_FAILED / DDL_LOGGED globals would prevent any workspace from connecting after a single failure in the process lifetime. Adopts upstream/main's retry-based approach (busy_timeout=5s + 3 application-level retries at 100/300/900ms backoff) which is both correct and workspace-safe. Also removes: - DB_INIT_FAILED_SENTINEL constant and is_db_init_failed helper - maybe_log_ddl_error helper - Stale sentinel test in store_tests.rs - Degraded-empty-list fallback branches in schemas.rs handlers (errors now propagate normally; the retry loop handles transient SQLITE_BUSY before that point) Merges upstream/main to resolve conflicts.
|
@coderabbitai review |
✅ Actions performedReview triggered.
|
Summary
store::with_connectionruns the full schema DDL on every call. Under concurrent in-process RPC calls (status poll every 3 s, tasks-list on Intelligence page, manual trigger), SQLite's defaultbusy_timeout=0causes immediateSQLITE_BUSYfailures — producing ~1,300 identical Sentry events per 24 h from a single underlying cause (TAURI-RUST-A).busy_timeout=5sto absorb the common concurrent-connection race, (2) application-level retry loop (3 retries, 100/300/900 ms exponential backoff) for the residual case where the timeout is exhausted, (3) retry classifier (is_sqlite_busy) that only retriesDatabaseBusy/DatabaseLocked— schema or corruption errors fail through immediately so Sentry gets a real root cause.workspace_dir— no cross-workspace poisoning.DDL_EVER_FAILED/DB_INIT_FAILED_SENTINELapproach that CodeRabbit correctly identified as globally poisoning all workspaces after one failure.Test plan
cargo checkpasses (pure Rust change)is_sqlite_busy_{matches,does_not_match}*assert the retry classifier,with_connection_resolves_external_write_contentionexercises the retry loop against a real write-lock held for 250 msis_db_init_failedhelper andDB_INIT_FAILED_SENTINELconstant removed — no workspace-poisoning possibleNotes
Merge of upstream/main resolves all conflicts. Pre-push hook failure (Prettier not installed in worktree) is pre-existing and unrelated to this Rust-only change.
Closes #2903