Skip retries in retryOperation for non-retriable storage errors #786

Draft
elirangoshen wants to merge 2 commits into Expensify:main from callstack-internal:elirangoshen/fix/90633-skip-non-retriable-storage-errors

Conversation

elirangoshen commented May 18, 2026

Details

Introduces a NON_RETRIABLE_ERRORS classification alongside the existing STORAGE_ERRORS list and short-circuits retryOperation for errors where retrying is provably futile — specifically Internal error opening backing store for indexedDB.open, where the LevelDB backing store is broken at the filesystem level and cannot recover from immediate retries.

Today, every non-capacity storage error in retryOperation retries 5 times with no delay, then silently resolves via Promise.resolve(). For the UnknownError: Internal error opening backing store… class analyzed in Expensify/App#87862, this produces ~6× the log/Sentry volume of the underlying event rate — 884,955 occurrences (26.3% of all storage failures) inflated to ~5M log lines, all noise because no retry can succeed.
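The amplification quoted above can be checked with a few lines of arithmetic. This is only a worked calculation from the figures in the paragraph (884,955 occurrences, 5 retries); the variable names are illustrative, and the "after" figure counts only the Failed to save to storage lines, not the new skip alert.

```typescript
// Figures quoted in the description above.
const occurrences = 884_955;
const retries = 5;

// Each failing operation logs once for the initial attempt and once per retry.
const attemptsPerFailure = 1 + retries;
const logLinesToday = occurrences * attemptsPerFailure; // 5,309,730, i.e. the "~5M" figure

// With the short-circuit, each failing operation is attempted, and logged, once.
const logLinesAfter = occurrences * 1;

// Ratio between the two, matching the "~6x" claim.
const reductionFactor = logLinesToday / logLinesAfter;
```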

The healing path (separate action item Expensify/App#90636) is responsible for recovering the IDB connection for this error class — retryOperation should defer to it, not compete with it.
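A minimal sketch of the short-circuit, for orientation only. It is not the react-native-onyx implementation: `runWithRetry`, `RetryOutcome`, and the `isNonRetriable` predicate are hypothetical names, and the loop form is an assumption about how the retry behaviour could be expressed.

```typescript
// Hypothetical sketch; names and structure are illustrative, not the library's code.
const MAX_RETRIES = 5;

type RetryOutcome = {attempts: number; outcome: 'succeeded' | 'skipped' | 'exhausted'};

async function runWithRetry(
    operation: () => Promise<void>,
    isNonRetriable: (error: unknown) => boolean,
): Promise<RetryOutcome> {
    // Attempt 0 is the initial call; attempts 1..MAX_RETRIES are retries with no delay.
    for (let attempt = 0; attempt <= MAX_RETRIES; attempt++) {
        try {
            await operation();
            return {attempts: attempt + 1, outcome: 'succeeded'};
        } catch (error) {
            if (isNonRetriable(error)) {
                // A retry cannot succeed, so stop here and leave recovery to the healing path.
                return {attempts: attempt + 1, outcome: 'skipped'};
            }
        }
    }
    // Retries exhausted: resolve rather than reject, as the description says happens today.
    return {attempts: MAX_RETRIES + 1, outcome: 'exhausted'};
}
```

Under these assumptions, an operation that always rejects with a non-retriable error is attempted once, while a generic always-failing operation still runs the initial attempt plus five retries.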

Why this is safe

  • The cache layer absorbs the write before retryOperation is reached, so callers cannot observe retry success vs. failure.
  • Exhausted retries today already silently resolve via Promise.resolve(), so the only observable change is log volume.
  • The capacity-error eviction path and the generic-error retry path are untouched (regression tests pass).
  • Matching uses the same case-insensitive substring style as STORAGE_ERRORS (includes against lowercased name and message).
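The matching style in the last bullet could look roughly like the sketch below. The constant names come from this description, but their contents and the helper `isNonRetriableError` are assumptions made for illustration; the actual lists live in the library source.

```typescript
// Assumed contents: the description names these constants but does not list their values.
const IDB_NON_RETRIABLE_ERRORS: readonly string[] = ['internal error opening backing store'];
const NON_RETRIABLE_ERRORS: readonly string[] = [...IDB_NON_RETRIABLE_ERRORS];

// Hypothetical helper: case-insensitive substring match against the error's name and message.
function isNonRetriableError(error: unknown): boolean {
    if (!(error instanceof Error)) {
        return false;
    }
    const name = error.name.toLowerCase();
    const message = error.message.toLowerCase();
    return NON_RETRIABLE_ERRORS.some((token) => {
        const needle = token.toLowerCase();
        return name.includes(needle) || message.includes(needle);
    });
}
```

Because the check is a substring match, the trailing "for indexedDB.open." in the real message does not need to be part of the token.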

Scope

Scoped to Internal error opening backing store. The new classification structure (IDB_NON_RETRIABLE_ERRORS → NON_RETRIABLE_ERRORS, mirroring IDB_STORAGE_ERRORS → STORAGE_ERRORS) makes it straightforward to add other connection-state errors later.

Related Issues

Expensify/App#90633

Automated Tests

Added two unit tests to tests/unit/onyxUtilsTest.ts in the existing describe('retryOperation', ...) block:

  1. should not retry for non-retriable IndexedDB backing-store errors — mocks Storage.setItem to reject with new Error('Internal error opening backing store for indexedDB.open.') (name: UnknownError), asserts retryOperation is called exactly once (not 6×).
  2. should log a single skip alert for non-retriable errors — asserts the new Storage operation skipped retry for non-retriable error... alert fires exactly once, and the existing "5 retries exhausted" alert does not fire.

All existing retryOperation regression tests continue to pass:

  • Generic-error continuous failure still runs all 6 attempts (the initial call plus 5 retries), because the non-retriable token does not match.
  • Capacity-error eviction path unchanged.
  • IDBObjectStore invalid-data still throws.

Local results: npx jest — 451/451 pass across 16 suites. npx tsc --noEmit clean. ESLint clean on changed files.

Manual Tests

End-to-end manual verification against Expensify/App (link to the App-side companion PR will be added once opened):

  1. Pin Expensify/App's package.json react-native-onyx dependency to this PR's head SHA, run npm install.
  2. In Chrome DevTools, intercept Storage.setItem and reject it with Object.assign(new Error('Internal error opening backing store for indexedDB.open.'), {name: 'UnknownError'}).
  3. Trigger an Onyx.set() call (e.g., navigate to a screen that writes a transient key).
  4. Verify in console:
    • Exactly one Failed to save to storage. Error: ... retryAttempt: 0/5 log line (not six).
    • Exactly one Storage operation skipped retry for non-retriable error. Error: ... onyxMethod: setWithRetry. alert.
    • No Storage operation failed after 5 retries... alert.
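For step 2, one way to force the rejection is a small stub like the sketch below. The error construction is taken from the step itself; `StorageLike`, `makeBackingStoreError`, and `stubSetItemToReject` are hypothetical helpers, and how the Storage object is reached from DevTools is left open here because it depends on the build.

```typescript
// Builds the error exactly as described in step 2.
function makeBackingStoreError(): Error {
    return Object.assign(new Error('Internal error opening backing store for indexedDB.open.'), {name: 'UnknownError'});
}

// Hypothetical minimal shape of the storage object being intercepted.
type StorageLike = {setItem: (key: string, value: unknown) => Promise<void>};

// Replaces setItem with a version that always rejects, and counts how often it is called.
function stubSetItemToReject(storage: StorageLike): {calls: () => number; restore: () => void} {
    const original = storage.setItem;
    let callCount = 0;
    storage.setItem = () => {
        callCount++;
        return Promise.reject(makeBackingStoreError());
    };
    return {
        calls: () => callCount,
        // Puts the original setItem back once the check is done.
        restore: () => {
            storage.setItem = original;
        },
    };
}
```

With the stub installed, the call counter gives a second signal alongside the console output in step 4: it should read 1 after a single Onyx.set() if the short-circuit is working.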

Post-deploy production signal (per Expensify/App#90633 test plan):

  • VictoriaLogs: volume of Failed to save to storage. Error: ...Internal error opening backing store... drops ~5–6×.
  • New low-volume line Storage operation skipped retry for non-retriable error... appears at roughly 1× the underlying event rate.

Author Checklist

  • I linked the correct issue in the ### Related Issues section above
  • I wrote clear testing steps that cover the changes made in this PR
    • I added steps for local testing in the Tests section
    • I tested this PR with a High Traffic account against the staging or production API to ensure there are no regressions (e.g. long loading states that impact usability).
  • I included screenshots or videos for tests on all platforms
  • I ran the tests on all platforms & verified they passed on:
    • Android / native
    • Android / Chrome
    • iOS / native
    • iOS / Safari
    • MacOS / Chrome / Safari
  • I verified there are no console errors (if there's a console error not related to the PR, report it or open an issue for it to be fixed)
  • I followed proper code patterns (see Reviewing the code)
    • I verified that any callback methods that were added or modified are named for what the method does and never what callback they handle (i.e. toggleReport and not onIconClick)
    • I verified that the left part of a conditional rendering a React component is a boolean and NOT a string, e.g. myBool && <MyComponent />.
    • I verified that comments were added to code that is not self explanatory
    • I verified that any new or modified comments were clear, correct English, and explained "why" the code was doing something instead of only explaining "what" the code was doing.
    • I verified proper file naming conventions were followed for any new files or renamed files. All non-platform specific files are named after what they export and are not named "index.js". All platform-specific files are named for the platform the code supports as outlined in the README.
    • I verified the JSDocs style guidelines (in STYLE.md) were followed
  • If a new code pattern is added I verified it was agreed to be used by multiple Expensify engineers
  • I followed the guidelines as stated in the Review Guidelines
  • I tested other components that can be impacted by my changes (i.e. if the PR modifies a shared library or component like Avatar, I verified the components using Avatar are working as expected)
  • I verified all code is DRY (the PR doesn't include any logic written more than once, with the exception of tests)
  • I verified any variables that can be defined as constants (ie. in CONST.js or at the top of the file that uses the constant) are defined as such
  • I verified that if a function's arguments changed that all usages have also been updated correctly
  • If a new component is created I verified that:
    • A similar component doesn't exist in the codebase
    • All props are defined accurately and each prop has a /** comment above it */
    • The file is named correctly
    • The component has a clear name that is non-ambiguous and the purpose of the component can be inferred from the name alone
    • The only data being stored in the state is data necessary for rendering and nothing else
    • If we are not using the full Onyx data that we loaded, I've added the proper selector in order to ensure the component only re-renders when the data it is using changes
    • For Class Components, any internal methods passed to components event handlers are bound to this properly so there are no scoping issues (i.e. for onClick={this.submit} the method this.submit should be bound to this in the constructor)
    • Any internal methods bound to this are necessary to be bound (i.e. avoid this.submit = this.submit.bind(this); if this.submit is never passed to a component event handler like onClick)
    • All JSX used for rendering exists in the render method
    • The component has the minimum amount of code necessary for its purpose, and it is broken down into smaller components in order to separate concerns and functions
  • If any new file was added I verified that:
    • The file has a description of what it does and/or why is needed at the top of the file if the code is not self explanatory
  • If the PR modifies a generic component, I tested and verified that those changes do not break usages of that component in the rest of the App (i.e. if a shared library or component like Avatar is modified, I verified that Avatar is working as expected in all cases)
  • If the main branch was merged into this PR after a review, I tested again and verified the outcome was still expected according to the Test steps.
  • I have checked off every checkbox in the PR author checklist, including those that don't apply to this PR.

Screenshots/Videos

Android: Native
Android: mWeb Chrome
iOS: Native
iOS: mWeb Safari
MacOS: Chrome / Safari

Introduce a NON_RETRIABLE_ERRORS classification alongside the existing
STORAGE_ERRORS list, and short-circuit retryOperation for errors where
retrying is provably futile (the underlying IDB connection/store is
broken at the filesystem level). The healing path is responsible for
recovery; retryOperation should defer to it, not compete with it.

Scoped to "Internal error opening backing store" — the LevelDB
backing-store class analyzed in Expensify/App#87862, which accounts
for ~26% of storage failures and produces ~6× log volume due to the
5-attempt retry loop.

Fixes Expensify/App#90633

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

CI's "Verify API Docs Are Up To Date" step caught that the auto-generated
docs missed the new "Non-retriable errors" branch added to retryOperation's
JSDoc. Regenerated via `npm run build:docs`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>