[Onyx] Add IDB healing mechanism for "Internal error opening backing store" #90636

@fabioh8010

Coming from #87862 (comment).

Issue

For the UnknownError: Internal error opening backing store for indexedDB.open error class analyzed in #87862, neither retries (action item §8.2) nor degrading to MemoryOnlyProvider actually addresses the underlying problem. The session continues working from the in-memory cache so users don't see immediate degradation, but the storage layer is silently broken — and that broken state has two real costs:

  1. Operational cost: log/Sentry volume from the silent retry storm (mitigated separately by §8.2).
  2. Data loss risk on offline refresh: if the user is offline and accumulates queued writes (e.g. SequentialQueue items stored as an Onyx key), and the cache rebuilds empty from broken storage on refresh, those queued writes are gone. The user has no indication this happened.

The right strategy is to attempt to heal the IDB connection so writes get back onto disk. There is no need to swap providers, show user-visible UI, or call deleteDatabase() (which Chapter 2 of #87862 proved also fails when corrupt LevelDB files persist on disk). Just reopen the connection and let normal operation resume if it heals; if it doesn't, fall through to the cache-only behavior the session already exhibits, without further log noise.

Precedent: Dexie's workaround

Dexie ships a clean precedent for this approach — catch UnknownError from indexedDB.open() and retry up to 3 times. No backoff, no fallback, no provider swap:

.catch((err) => {
    switch (err?.name) {
        case 'UnknownError':
            if (state.PR1398_maxLoop > 0) {
                state.PR1398_maxLoop--;
                console.warn('Dexie: Workaround for Chrome UnknownError on open()');
                return tryOpenDB();
            }
            break;
        // ...
    }
    return Promise.reject(err);
});

There's also a sibling workaround in Dexie's temp-transaction.ts: when a transaction throws InvalidStateError while the DB reports as open, close and reopen the DB and retry once. Both are bounded by the same PR1398_maxLoop = 3 budget.

Solution

Scope: web (IDB) only. The error class this issue addresses is Chromium-IDB-specific. The native SQLite provider hits a different set of errors (disk I/O error, database is locked, database or disk is full) with categorically different root causes — filesystem-level issues, lock contention, or genuine capacity exhaustion — none of which benefit from a close+reopen heal pattern. If a SQLite-side mitigation is needed, it should be designed and tracked separately.

Implement two related healing mechanisms on IDBKeyValProvider, designed to work together as a single healing strategy with a shared retry budget — directly mirroring Dexie's PR1398_maxLoop pattern.

Shared retry counter

Maintain a single counter inside IDBKeyValProvider — call it healAttemptsRemaining — initialized to 3. The counter is:

  • Decremented on every heal attempt (both init retries and mid-session reopens).
  • Reset to 3 on every successful IDB operation (IDBKeyValProvider.setItem, multiSet, mergeItem, multiMerge, removeItem, etc.).
  • Checked before any heal attempt — if it's already at 0, fall through to cache-only behavior without further attempts.

The counter, the heal logic, and all the log messages live entirely inside IDBKeyValProvider. None of this code runs on native; the SQLite provider is untouched.

This naturally creates a circuit breaker for the permanent-corruption case (no successes → counter drains to 0 → no further heal attempts), while allowing a healthy session to recover from multiple separate transient incidents (each success replenishes the budget). It mirrors Dexie's PR1398_maxLoop directly.
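The counter semantics described above can be sketched as a small helper. This is only an illustration: the issue names the counter (healAttemptsRemaining) but not a class, so HealBudget, tryConsume, reset, and MAX_HEAL_ATTEMPTS are hypothetical names.

```typescript
// Hypothetical helper; the issue only specifies a counter initialized to 3.
const MAX_HEAL_ATTEMPTS = 3;

class HealBudget {
    private remaining = MAX_HEAL_ATTEMPTS;

    /** Consumes one heal attempt. Returns false, consuming nothing, once the budget is drained. */
    tryConsume(): boolean {
        if (this.remaining <= 0) {
            return false;
        }
        this.remaining--;
        return true;
    }

    /** Replenishes the budget; meant to be called after any successful IDB operation. */
    reset(): void {
        this.remaining = MAX_HEAL_ATTEMPTS;
    }

    get attemptsRemaining(): number {
        return this.remaining;
    }
}
```

With this shape, the permanent-corruption case drains the budget after three calls to tryConsume() and every later call returns false, while a single reset() after a successful write restores the full budget.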

1. On provider init: retry indexedDB.open() on UnknownError

When IDBKeyValProvider init throws UnknownError, retry the indexedDB.open() call up to the remaining budget. This is the direct Dexie-style workaround for the transient error class that follows "Clear cookies and site data" (Walexander's repro in Dexie #543).

Sketch:

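async function openWithHealing(dbName: string, version: number): Promise<IDBDatabase> {
    // Fallback so an already-drained budget never ends in `throw undefined`.
    let lastError: unknown = new Error('IDB heal: no heal attempts remaining, indexedDB.open() skipped');
    while (healAttemptsRemaining > 0) {
        try {
            return await openIDB(dbName, version);
        } catch (error) {
            lastError = error;
            // Only UnknownError is treated as healable; anything else is rethrown untouched.
            if (!(error instanceof DOMException) || error.name !== 'UnknownError') {
                throw error;
            }
            healAttemptsRemaining--;
            Logger.logInfo(`IDB heal: UnknownError on open, retrying. attemptsRemaining=${healAttemptsRemaining}`);
        }
    }
    // Budget exhausted: surface the last open() error (or the fallback above).
    throw lastError;
}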
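The sketch above calls openIDB without defining it. One way it might look is a thin promise wrapper around the request returned by open(). Everything here is an assumption for illustration: the factory parameter (added so the wrapper can be exercised without a browser) and the OpenRequest / OpenFactory interfaces, which are minimal structural stand-ins for the DOM types rather than the real IDBFactory typings.

```typescript
// Minimal stand-ins for the DOM request/factory types (hypothetical, for illustration only).
interface OpenRequest<DB> {
    result: DB;
    error: unknown;
    onsuccess: ((event: unknown) => void) | null;
    onerror: ((event: unknown) => void) | null;
}

interface OpenFactory<DB> {
    open(name: string, version?: number): OpenRequest<DB>;
}

function openIDB<DB>(factory: OpenFactory<DB>, dbName: string, version: number): Promise<DB> {
    return new Promise<DB>((resolve, reject) => {
        // A synchronous throw from open() also rejects the promise, because it
        // happens inside the executor.
        const request = factory.open(dbName, version);
        request.onsuccess = () => resolve(request.result);
        // request.error carries the failure (for this issue, an UnknownError).
        request.onerror = () => reject(request.error);
    });
}
```

Rejecting with request.error rather than the event keeps the error.name check in the retry loop straightforward.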

2. Mid-session: close + reopen on the backing-store error

When the error fires during a write operation in an active session, attempt a close + reopen of the IDB connection — up to the remaining budget — before considering the operation failed:

async function healAndRetry<T>(operation: () => Promise<T>): Promise<T> {
    try {
        const result = await operation();
        healAttemptsRemaining = 3;  // reset budget on success — mirrors Dexie's pattern
        return result;
    } catch (error) {
        if (!isBackingStoreError(error) || healAttemptsRemaining <= 0) {
            throw error;
        }

        healAttemptsRemaining--;
        Logger.logInfo(`IDB heal: backing store error during operation — attempting close + reopen. attemptsRemaining=${healAttemptsRemaining}`);
        await closeConnection();
        await openIDB(DB_NAME, DB_VERSION);

        return healAndRetry(operation);  // recursive retry, bounded by shared counter
    }
}

If a heal succeeds and the subsequent operation completes, the counter resets to 3, so a fresh transient incident later in the session gets a full budget again. If 3 heal attempts fail in succession with no intervening success, the counter hits 0 and subsequent operations fall through to cache-only behavior without further heal attempts or log noise (the cache already absorbed the write per parent comment §5).
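healAndRetry relies on an isBackingStoreError predicate that the issue does not spell out. A minimal sketch, assuming the error surfaces as a DOMException-like object named UnknownError whose message contains the text from the issue title; the real matching rule may need to be looser or stricter once production error shapes are confirmed.

```typescript
// Assumption: message text taken from the error quoted in the issue title.
const BACKING_STORE_MESSAGE = 'Internal error opening backing store';

function isBackingStoreError(error: unknown): boolean {
    if (typeof error !== 'object' || error === null) {
        return false;
    }
    // Structural check instead of `instanceof DOMException`, so the predicate
    // also works on plain objects used as test doubles.
    const {name, message} = error as {name?: unknown; message?: unknown};
    return name === 'UnknownError' && typeof message === 'string' && message.includes(BACKING_STORE_MESSAGE);
}
```

Keeping the predicate narrow means unrelated failures (quota errors, aborted transactions) never spend the heal budget.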

Important constraints

  • No provider swap. The cache already serves reads and absorbs writes (parent comment §7). Swapping to MemoryOnlyProvider changes nothing observable during the session and adds complexity without benefit.
  • No user-visible UI / notification. The session is already serving correctly from cache; there's nothing to surface.
  • No deleteDatabase() calls. Chapter 2 of #87862 demonstrated that deleteDatabase() also fails when corrupt LevelDB files persist, so it is not a viable healing primitive in this scenario.
  • Bounded heal attempts via shared counter. At most 3 consecutive heal attempts without an intervening success; the budget resets to 3 on success. Healing should be cheap and silent; if it doesn't work within the budget, further attempts are pure noise until something succeeds and refreshes the budget.

Coordination with action items §8.1 and §8.2

This issue can be worked on independently of the others — the heal logic lives inside the provider (e.g. IDBKeyValProvider.setItem wrapping the raw IDB call), so errors are caught and retried before they ever reach tryOrDegradePerformance or retryOperation. None of the other action items are strict prerequisites.

That said, they interact:

Test plan

  • Unit test (init heal): mock indexedDB.open to reject with UnknownError twice, then resolve. Confirm init succeeds after 3 attempts total (1 initial + 2 retries), heal-attempt logs fire, and the counter ends at 1 (3 - 2 decrements).
  • Unit test (init heal exhaustion): mock indexedDB.open to always reject with UnknownError. Confirm init fails after 3 attempts, counter is 0, and subsequent operations skip the heal path.
  • Unit test (mid-session heal): mock Storage.setItem to reject once with the backing-store error, then resolve. Confirm a close+reopen happens, the second setItem succeeds, the heal log fires, and the counter is reset to 3 after success.
  • Unit test (mid-session heal exhaustion): mock Storage.setItem to reject 3 times consecutively. Confirm 3 heal attempts happen, the counter drains to 0, and a 4th failing setItem does NOT trigger another heal attempt.
  • Unit test (counter reset after success): drain the counter to 1 with mid-session heals, perform a successful setItem, then trigger a new error. Confirm the counter was reset to 3 and a fresh heal attempt fires.
  • Verify in VictoriaLogs post-deploy: IDB heal log lines appear, and a fraction of users emit them and then continue without further Failed to save to storage errors (indicating successful heal). The ratio of post-heal successes to heal attempts gives a direct readout of how often this mechanism helps.
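The attempt and counter arithmetic in the init-heal bullets can be checked with a framework-free sketch. Both helpers are hypothetical: failThenSucceed stands in for a mocked indexedDB.open, and openWithBudget is a simplified copy of the init retry loop that takes the budget as a parameter and returns the remaining count so it can be inspected.

```typescript
// Hypothetical test double: rejects with `error` for the first `failures` calls, then resolves with `value`.
function failThenSucceed<T>(failures: number, error: unknown, value: T) {
    let calls = 0;
    const fn = (): Promise<T> => {
        calls++;
        return calls <= failures ? Promise.reject(error) : Promise.resolve(value);
    };
    return {fn, callCount: () => calls};
}

// Simplified stand-in for the init retry loop, with the budget made explicit.
async function openWithBudget<T>(open: () => Promise<T>, budget: number): Promise<{db: T; remaining: number}> {
    let remaining = budget;
    let lastError: unknown = new Error('heal budget exhausted');
    while (remaining > 0) {
        try {
            return {db: await open(), remaining};
        } catch (error) {
            lastError = error;
            // Only UnknownError consumes budget; anything else is rethrown as-is.
            if ((error as {name?: string} | null)?.name !== 'UnknownError') {
                throw error;
            }
            remaining--;
        }
    }
    throw lastError;
}
```

Two rejections followed by a success give three calls in total and leave one attempt in the budget; a mock that always rejects is called three times before the loop gives up.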
