Fall back across blob sources when topic downloads fail#3466
Conversation
47a4da1 to
91699e8
Compare
a29029f to
2172324
Compare
91699e8 to
d88dd12
Compare
337c191 to
ff1124a
Compare
b32ecf1 to
1b90c96
Compare
ff1124a to
9d28ec1
Compare
1b90c96 to
3049129
Compare
9d28ec1 to
98ba254
Compare
3049129 to
6a85f55
Compare
When a chosen BlobSource fails, attempt the remaining topics on other sources (same-priority alternates first via weighted random, then lower priority tiers) before giving up. Cache the last successful source in memory and prefer it on subsequent refreshes, invalidating only when the cached source itself returned a non-network failure (HTTP error, checksum mismatch, malformed blob ref).
6a85f55 to
4650c71
Compare
ajpallares
left a comment
There was a problem hiding this comment.
Wow, there's some mindbending stuff in here (or maybe I'm not having my best day 😅). I think I finally understood the logic here and I think it looks mostly ok. But I believe there are a couple of things to fix 🙏
Great job on this! 🙌
|
|
||
| internal fun <T : WeightedSource> List<T>.selectWeightedExcluding( | ||
| excludedIds: Set<String>, | ||
| idOf: (T) -> String, |
There was a problem hiding this comment.
Would it be cleaner to add id: String to the WeightedSource interface instead?
| val tasks = response.manifest.topics.mapNotNull { (topic, entries) -> | ||
| val entry = entries[DEFAULT_ENTRY_ID] ?: return@mapNotNull null | ||
| TopicTask(topic, DEFAULT_ENTRY_ID, entry) | ||
| } | ||
| val firstError: PurchasesError? = if (source == null || tasks.isEmpty()) { | ||
| null | ||
| val outcome = if (response.blobSources.isEmpty() || tasks.isEmpty()) { | ||
| RefreshOutcome.VacuousSuccess | ||
| } else { | ||
| coroutineScope { | ||
| tasks.map { task -> | ||
| async { | ||
| topicFetcher.fetchTopicIfNeeded( | ||
| topic = task.topic, | ||
| entryId = task.entryId, | ||
| topicEntry = task.entry, | ||
| source = source, | ||
| ) | ||
| } | ||
| }.awaitAll().firstNotNullOfOrNull { it } | ||
| } | ||
| downloadTopicsWithFallback(response.blobSources, tasks) |
There was a problem hiding this comment.
This is a HUGE nit, but we could defer the calculation of tasks to only when response.blobSources is not empty (that would avoid some unnecessary work). Total nit as I said, so FFTI.
| return TopicFetchResult.InvalidatingFailure( | ||
| PurchasesError(PurchasesErrorCode.UnexpectedBackendResponseError, message), | ||
| ) |
There was a problem hiding this comment.
I think this is incorrect. Returning InvalidatingFailure here makes it so that the source is invalidated, whereas it's the blob in the topic what's invalid. In addition, since this blobRef would be the same for all sources, this would effectively mean that all sources would be invalidated for all topics only because one blobRef is invalid.
First, to avoid these confusions, I would rename InvalidatingFailure to SourceInvalidatingFailure, meaning the source is invalid. Then, for this kind of error (where the blob is invalid regardless of the source) I would introduce a new TopicFetchResult meaning "this task is bad regardless of the source", maybe calling it TaskInvalid.
| val previouslyCachedSourceId = cachedSourceId | ||
|
|
||
| while (remaining.isNotEmpty()) { | ||
| val source = pickNextSource(blobSources, triedIds, previouslyCachedSourceId) ?: break |
There was a problem hiding this comment.
Should we add any logs when using a fallback? when we switch sources, when we fall back from the cached source, or when we exhaust all sources. TopicFetcher logs individual fetch failures but the manager's decisions are invisible.
| @Volatile | ||
| private var cachedSourceId: String? = null | ||
|
|
||
| fun updateRemoteConfigIfNeeded( |
There was a problem hiding this comment.
I think I raised this in a previous PR as well, so I'm sorry if I'm repeating myself. But I worry about race conditions here.
If caller A and caller B both invoke updateRemoteConfigIfNeeded:
Thread Main:
A: updateRemoteConfigIfNeeded()
→ cache is stale ✓
→ scope.launch { refresh() } ← returns immediately, coroutine A queued on IO pool
B: updateRemoteConfigIfNeeded()
→ cache is stale ✓ (A hasn't written cache yet)
→ scope.launch { refresh() } ← returns immediately, coroutine B queued on IO pool
Then in parallel:
| IO Thread 1 (coroutine A) | IO Thread 2 (coroutine B) |
|---|---|
refresh() |
refresh() |
getRemoteConfig() … |
getRemoteConfig() … |
read cache?.response |
read cache?.response |
write cachedSourceId |
write cachedSourceId ← race |
downloadTopicsWithFallback() |
downloadTopicsWithFallback() |
write cache |
write cache ← race |
diskCache.write(response) |
diskCache.write(response) ← race |
|
|
||
| private sealed class RefreshOutcome { | ||
| data class Success(val source: BlobSource) : RefreshOutcome() | ||
| object VacuousSuccess : RefreshOutcome() |
There was a problem hiding this comment.
I don't think many people know what this means without looking it up (unless it's just me) 😅 Maybe just EmptySuccess?
There was a problem hiding this comment.
first time I've seen Vacuous as well 😄
There was a problem hiding this comment.
That’s… true 🤣 will fix. Thanks!
There was a problem hiding this comment.
Hello Toni from the plane! 👋
There was a problem hiding this comment.
It was the first time for me as well. And then, I just saw this
https://github.com/RevenueCat/purchases-ios/blob/fed4baea51cb961f6238f068aa39eb0404b33316/RulesEngine/Operators/LogicOperators.swift#L33
It seems like AI likes this term 😄

Motivation
RemoteConfigManagerpreviously picked a single blob source per refresh and gave up if any topic failed to download from it. With weighted multi-source distribution in place (#previous stack PR), we want to gracefully fall back to other sources when one is degraded so a refresh can still succeed.Description
TopicFetcher.kt):fetchTopicIfNeedednow returns aTopicFetchResultsealed type —Success,TransientFailure(timeouts / no network — worth retrying against a different source), orInvalidatingFailure(malformed blob ref, checksum mismatch, other IO). Transient/invalidating is decided viaConnectionErrorReason.fromIOException.RemoteConfigManager.kt: introducesdownloadTopicsWithFallback, which repeatedly picks the next weighted source, attempts the still-failing tasks against it, and stops as soon as everything succeeds or sources are exhausted. The most recent source's error is preserved for the completion callback so callers see the freshest failure reason rather than the first one encountered.cachedSourceId. On the next refresh, that source is tried first (bypassing weighted selection) to keep clients on a known-good source. If the sticky source returns an invalidating failure, the cached id is cleared so a fresh weighted draw can take over next time.RemoteConfigResponsedifferent from the one currently cached (e.g. an updated source list, new priorities, or new weights),cachedSourceIdis cleared before the fallback loop runs. This guarantees a new top-priority source can take over instead of the client staying pinned to a now-lower-priority cached source.selectWeightedExcludinghelper (WeightedSource.kt): weighted selection filtered to sources not already tried in the current refresh. Negative weights are clamped to zero so a single misconfigured entry can't poison the picker.RefreshOutcomesealed class (Success/VacuousSuccess/Failure) so caching, cleanup, and cached-source clearing are driven by a single decision rather than nested null checks.Tests
RemoteConfigManagerTest: new coverage for cross-source fallback on transient failures, exhaustion behavior, invalidating-failure short-circuit per task, sticky-source preference on subsequent refreshes,cachedSourceIdinvalidation when the sticky source fails with an invalidating error (rewritten with a stagedQueuedRandomso the test actually distinguishes cleared/not-cleared paths), andcachedSourceIdreset when the backend returns a new response.TopicFetcherTest: updated for the newTopicFetchResultreturn type, including timeout/no-network →TransientFailureand checksum/malformed/other-IO →InvalidatingFailureclassification.WeightedSourceSelectionTest: new tests forselectWeightedExcludingand for negative-weight clamping behavior.Checklist
purchases-iosand hybridsNote
Medium Risk
Changes the remote-config refresh path to retry topic downloads across multiple blob sources with sticky source selection, which can affect config update reliability and caching behavior under network failures.
Overview
Remote config refresh now falls back across blob sources when topic downloads fail, retrying only the still-failing topics until one source succeeds or all sources are exhausted; the completion error now reflects the latest failure.
TopicFetcher.fetchTopicIfNeedednow returns aTopicFetchResult(Success,TransientFailure,InvalidatingFailure) with error classification driven byConnectionErrorReason, andRemoteConfigManagertracks a stickycachedSourceId(prefer last-good source, clear it on invalidating failures or when the backend response changes).WeightedSourceaddsselectWeightedExcludingto avoid re-trying sources within a refresh, and tests are expanded/updated to cover fallback, stickiness/reset, and new result classifications.Reviewed by Cursor Bugbot for commit 4650c71. Bugbot is set up for automated code reviews on this repo. Configure here.