Skip to content

Fall back across blob sources when topic downloads fail#3466

Open
tonidero wants to merge 2 commits into
mainfrom
toniricodiez/fall-back-across-blob-sources
Open

Fall back across blob sources when topic downloads fail#3466
tonidero wants to merge 2 commits into
mainfrom
toniricodiez/fall-back-across-blob-sources

Conversation

@tonidero
Copy link
Copy Markdown
Contributor

@tonidero tonidero commented May 12, 2026

Motivation

RemoteConfigManager previously picked a single blob source per refresh and gave up if any topic failed to download from it. With weighted multi-source distribution in place (#previous stack PR), we want to gracefully fall back to other sources when one is degraded so a refresh can still succeed.

Description

  • Classify topic-fetch outcomes (TopicFetcher.kt): fetchTopicIfNeeded now returns a TopicFetchResult sealed type — Success, TransientFailure (timeouts / no network — worth retrying against a different source), or InvalidatingFailure (malformed blob ref, checksum mismatch, other IO). Transient/invalidating is decided via ConnectionErrorReason.fromIOException.
  • Fallback loop in RemoteConfigManager.kt: introduces downloadTopicsWithFallback, which repeatedly picks the next weighted source, attempts the still-failing tasks against it, and stops as soon as everything succeeds or sources are exhausted. The most recent source's error is preserved for the completion callback so callers see the freshest failure reason rather than the first one encountered.
  • Sticky source selection: tracks the source that produced the last successful refresh in cachedSourceId. On the next refresh, that source is tried first (bypassing weighted selection) to keep clients on a known-good source. If the sticky source returns an invalidating failure, the cached id is cleared so a fresh weighted draw can take over next time.
  • Pin reset on response change: when the backend returns a RemoteConfigResponse different from the one currently cached (e.g. an updated source list, new priorities, or new weights), cachedSourceId is cleared before the fallback loop runs. This guarantees a new top-priority source can take over instead of the client staying pinned to a now-lower-priority cached source.
  • selectWeightedExcluding helper (WeightedSource.kt): weighted selection filtered to sources not already tried in the current refresh. Negative weights are clamped to zero so a single misconfigured entry can't poison the picker.
  • Refactored refresh result into a RefreshOutcome sealed class (Success / VacuousSuccess / Failure) so caching, cleanup, and cached-source clearing are driven by a single decision rather than nested null checks.

Tests

  • RemoteConfigManagerTest: new coverage for cross-source fallback on transient failures, exhaustion behavior, invalidating-failure short-circuit per task, sticky-source preference on subsequent refreshes, cachedSourceId invalidation when the sticky source fails with an invalidating error (rewritten with a staged QueuedRandom so the test actually distinguishes cleared/not-cleared paths), and cachedSourceId reset when the backend returns a new response.
  • TopicFetcherTest: updated for the new TopicFetchResult return type, including timeout/no-network → TransientFailure and checksum/malformed/other-IO → InvalidatingFailure classification.
  • WeightedSourceSelectionTest: new tests for selectWeightedExcluding and for negative-weight clamping behavior.

Checklist

  • Unit tests
  • If applicable, create follow-up issues for purchases-ios and hybrids

Note

Medium Risk
Changes the remote-config refresh path to retry topic downloads across multiple blob sources with sticky source selection, which can affect config update reliability and caching behavior under network failures.

Overview
Remote config refresh now falls back across blob sources when topic downloads fail, retrying only the still-failing topics until one source succeeds or all sources are exhausted; the completion error now reflects the latest failure.

TopicFetcher.fetchTopicIfNeeded now returns a TopicFetchResult (Success, TransientFailure, InvalidatingFailure) with error classification driven by ConnectionErrorReason, and RemoteConfigManager tracks a sticky cachedSourceId (prefer last-good source, clear it on invalidating failures or when the backend response changes). WeightedSource adds selectWeightedExcluding to avoid re-trying sources within a refresh, and tests are expanded/updated to cover fallback, stickiness/reset, and new result classifications.

Reviewed by Cursor Bugbot for commit 4650c71. Bugbot is set up for automated code reviews on this repo. Configure here.

@tonidero tonidero force-pushed the toniricodiez/fall-back-across-blob-sources branch from 47a4da1 to 91699e8 Compare May 13, 2026 08:49
@tonidero tonidero force-pushed the toniricodiez/select-blob-source-by-priority-and-weight branch from a29029f to 2172324 Compare May 13, 2026 08:49
@tonidero tonidero force-pushed the toniricodiez/fall-back-across-blob-sources branch from 91699e8 to d88dd12 Compare May 13, 2026 09:40
@tonidero tonidero force-pushed the toniricodiez/select-blob-source-by-priority-and-weight branch from 337c191 to ff1124a Compare May 13, 2026 11:09
@tonidero tonidero force-pushed the toniricodiez/fall-back-across-blob-sources branch 2 times, most recently from b32ecf1 to 1b90c96 Compare May 13, 2026 11:52
@tonidero tonidero marked this pull request as ready for review May 13, 2026 12:40
@tonidero tonidero requested a review from a team as a code owner May 13, 2026 12:40
@tonidero tonidero force-pushed the toniricodiez/select-blob-source-by-priority-and-weight branch from ff1124a to 9d28ec1 Compare May 13, 2026 13:23
@tonidero tonidero force-pushed the toniricodiez/fall-back-across-blob-sources branch from 1b90c96 to 3049129 Compare May 13, 2026 13:24
@tonidero tonidero force-pushed the toniricodiez/select-blob-source-by-priority-and-weight branch from 9d28ec1 to 98ba254 Compare May 13, 2026 15:19
@tonidero tonidero force-pushed the toniricodiez/fall-back-across-blob-sources branch from 3049129 to 6a85f55 Compare May 13, 2026 15:19
Base automatically changed from toniricodiez/select-blob-source-by-priority-and-weight to main May 13, 2026 16:02
tonidero added 2 commits May 13, 2026 18:12
When a chosen BlobSource fails, attempt the remaining topics on other
sources (same-priority alternates first via weighted random, then lower
priority tiers) before giving up. Cache the last successful source in
memory and prefer it on subsequent refreshes, invalidating only when the
cached source itself returned a non-network failure (HTTP error,
checksum mismatch, malformed blob ref).
@tonidero tonidero force-pushed the toniricodiez/fall-back-across-blob-sources branch from 6a85f55 to 4650c71 Compare May 13, 2026 16:12
Copy link
Copy Markdown
Member

@ajpallares ajpallares left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, there's some mindbending stuff in here (or maybe I'm not having my best day 😅). I think I finally understood the logic here and I think it looks mostly ok. But I believe there are a couple of things to fix 🙏
Great job on this! 🙌


internal fun <T : WeightedSource> List<T>.selectWeightedExcluding(
excludedIds: Set<String>,
idOf: (T) -> String,
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be cleaner to add id: String to the WeightedSource interface instead?

Comment on lines 70 to +77
val tasks = response.manifest.topics.mapNotNull { (topic, entries) ->
val entry = entries[DEFAULT_ENTRY_ID] ?: return@mapNotNull null
TopicTask(topic, DEFAULT_ENTRY_ID, entry)
}
val firstError: PurchasesError? = if (source == null || tasks.isEmpty()) {
null
val outcome = if (response.blobSources.isEmpty() || tasks.isEmpty()) {
RefreshOutcome.VacuousSuccess
} else {
coroutineScope {
tasks.map { task ->
async {
topicFetcher.fetchTopicIfNeeded(
topic = task.topic,
entryId = task.entryId,
topicEntry = task.entry,
source = source,
)
}
}.awaitAll().firstNotNullOfOrNull { it }
}
downloadTopicsWithFallback(response.blobSources, tasks)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a HUGE nit, but we could defer the calculation of tasks to only when response.blobSources is not empty (that would avoid some unnecessary work). Total nit as I said, so FFTI.

Comment on lines +44 to +46
return TopicFetchResult.InvalidatingFailure(
PurchasesError(PurchasesErrorCode.UnexpectedBackendResponseError, message),
)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is incorrect. Returning InvalidatingFailure here makes it so that the source is invalidated, whereas it's the blob in the topic what's invalid. In addition, since this blobRef would be the same for all sources, this would effectively mean that all sources would be invalidated for all topics only because one blobRef is invalid.

First, to avoid these confusions, I would rename InvalidatingFailure to SourceInvalidatingFailure, meaning the source is invalid. Then, for this kind of error (where the blob is invalid regardless of the source) I would introduce a new TopicFetchResult meaning "this task is bad regardless of the source", maybe calling it TaskInvalid.

val previouslyCachedSourceId = cachedSourceId

while (remaining.isNotEmpty()) {
val source = pickNextSource(blobSources, triedIds, previouslyCachedSourceId) ?: break
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add any logs when using a fallback? when we switch sources, when we fall back from the cached source, or when we exhaust all sources. TopicFetcher logs individual fetch failures but the manager's decisions are invisible.

@Volatile
private var cachedSourceId: String? = null

fun updateRemoteConfigIfNeeded(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I raised this in a previous PR as well, so I'm sorry if I'm repeating myself. But I worry about race conditions here.

If caller A and caller B both invoke updateRemoteConfigIfNeeded:

Thread Main:

A: updateRemoteConfigIfNeeded()
   → cache is stale ✓
   → scope.launch { refresh() }   ← returns immediately, coroutine A queued on IO pool
B: updateRemoteConfigIfNeeded()
   → cache is stale ✓ (A hasn't written cache yet)
   → scope.launch { refresh() }   ← returns immediately, coroutine B queued on IO pool

Then in parallel:

IO Thread 1 (coroutine A) IO Thread 2 (coroutine B)
refresh() refresh()
getRemoteConfig() getRemoteConfig()
read cache?.response read cache?.response
write cachedSourceId write cachedSourceId ← race
downloadTopicsWithFallback() downloadTopicsWithFallback()
write cache write cache ← race
diskCache.write(response) diskCache.write(response) ← race


private sealed class RefreshOutcome {
data class Success(val source: BlobSource) : RefreshOutcome()
object VacuousSuccess : RefreshOutcome()
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think many people know what this means without looking it up (unless it's just me) 😅 Maybe just EmptySuccess?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

first time I've seen Vacuous as well 😄

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha so it's not just me 😛

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That’s… true 🤣 will fix. Thanks!

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello Toni from the plane! 👋 ✈️

Copy link
Copy Markdown
Member

@ajpallares ajpallares May 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It was the first time for me as well. And then, I just saw this
https://github.com/RevenueCat/purchases-ios/blob/fed4baea51cb961f6238f068aa39eb0404b33316/RulesEngine/Operators/LogicOperators.swift#L33
It seems like AI likes this term 😄

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Haha it's the new "delve"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants