Skip to content

fix(github): implement resumable pagination and graceful degradation for search#35

Open
jadeproheshan wants to merge 2 commits into
NianJiuZst:mainfrom
jadeproheshan:feat/issue
Open

fix(github): implement resumable pagination and graceful degradation for search#35
jadeproheshan wants to merge 2 commits into
NianJiuZst:mainfrom
jadeproheshan:feat/issue

Conversation

@jadeproheshan
Copy link
Copy Markdown
Contributor

@jadeproheshan jadeproheshan commented May 18, 2026

fix #33

Problem

Currently, executing openmeta scout triggers a GitHub Search API 403 secondary rate limit quickly. The previous retry implementation would discard already fetched items and restart from page 1, resulting in a fatal loop. Additionally, the fallback retry delay of 1s was insufficient for GitHub's secondary limits, causing the script to exhaust its MAX_PAGINATION_RETRIES almost instantly.

Changes Made

  1. Resumable Pagination (currentPage): The loop now tracks currentPage and passes it to the Octokit iterator so subsequent retries append to the already collected dataset instead of starting over.
  2. Smarter Rate-Limit Delays: The backoff logic now explicitly intercepts the retry-after header (often returned in secondary limits). If no headers are provided, it assumes a conservative 60-second base delay.
  3. Graceful Degradation: If MAX_PAGINATION_RETRIES (3 times) is fully exhausted but we have successfully scraped a partial list of issues, the script will gracefully return the scraped items rather than throwing an error and dumping the data.

Before / After

openmeta only fetch 56 results beforechange

屏幕截图 2026-05-18 235443

and i increase to 289 after the change
屏幕截图 2026-05-18 235600

Backoff Strategy Changes (Detailed)

  • retry-after header parsing: GitHub's secondary rate limit returns retry-after (in seconds), which was previously ignored. Now it's checked as the highest-priority source for wait time.
  • Default fallback raised from 1s → 60s: When no rate-limit headers are present, the delay was previously 1s → 2s → 4s (exponential from 1s base). This is useless against secondary limits. Now it defaults to 61s → 62s → 64s.
  • Removed x-ratelimit-reset 60s cap: The old code had Math.min(waitMs + 500, 60_000), which truncated GitHub's recommended wait time to 60s max. Now we trust the server's reset time without an artificial cap.

For Discussion / Review

  • Incomplete Pagination: With the current retry limits, it's highly likely we won't hit the MAX_SEARCH_PAGES limit before exhausting our attempts. Should we consider:
    1. Making the maximum page limit (MAX_SEARCH_PAGES) a user-configurable option?
    2. Further optimizing the backoff strategy to handle severe throttling?
  • Asynchronous Execution: Given that the fetch and waiting process can now take a significantly long time, is it necessary/worthwhile to make this operation run asynchronously?

Fei-Good and others added 2 commits May 18, 2026 13:21
…for search

- Add `currentPage` tracking to resume Search API pagination after rate limits
- Parse `retry-after` header and fallback to a 60s base delay for secondary limits
- Return partially collected items instead of crashing when retries are exhausted
Copy link
Copy Markdown
Owner

@NianJiuZst NianJiuZst left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里先只说一个我认为需要先处理的点:部分结果被当成完整成功结果缓存了

paginateSearchWithRetry() 里,如果分页过程中触发 rate limit,重试耗尽后只要 items.length > 0 就会直接 return items(大概在 src/services/github.ts 第 404-407 行)。

但上层 fetchTrendingIssues() 并不知道这是一份“降级拿到的部分结果”,它会继续按正常成功路径处理:

  • 打 success log(src/services/github.ts 第 163 行)
  • 调用 saveCachedIssues(issues) 写入缓存(src/services/github.ts 第 164 行)

这样会带来一个比较隐蔽的问题:

  • 一次被 secondary rate limit 打断的 openmeta scout
  • 可能只拿到了前几页 issue
  • 但这份不完整结果会被写进 10 分钟缓存
  • 后续用户再次运行时,如果没有 --refresh,就会直接复用这份截断数据

也就是说,现在的实现把“partial result”伪装成了“full success”,缓存语义是不对的。

建议的修复方式

  1. 不要把这种降级返回的结果直接走正常 success + cache 路径。
  2. 至少要把 paginateSearchWithRetry() 的返回结果改成能表达状态的结构,例如:
    • { items, partial: false }
    • { items, partial: true }
  3. 上层 fetchTrendingIssues() 根据 partial 决定:
    • 是否打印成功日志
    • 是否写入缓存
  4. 如果暂时不想改返回结构,更保守的做法是:重试耗尽时即使拿到部分数据,也不要写缓存

另外建议补一组单测覆盖这个分支,因为现在已有测试没有打到这里:

  • paginate.iterator 先 yield 1-2 页数据
  • 然后抛出 403/429
  • 连续重试后仍失败
  • 断言“拿到了部分结果”时,不会被当成正常完整结果写入缓存

这个点我建议在合并前处理掉。

@NianJiuZst
Copy link
Copy Markdown
Owner

NianJiuZst commented May 18, 2026

@jadeproheshan @Fei-Good 补充单测

比较建议覆盖的场景是:

  • paginate.iterator 先成功返回前几页数据
  • 中途抛出 403/429(模拟 secondary rate limit)
  • 重试后从 currentPage 继续,而不是从第 1 页重新开始
  • 最终返回结果里既保留重试前已经拿到的 items,也能继续追加后续页的数据

如果要再细一点,可以顺手覆盖两种 header 分支:

  • retry-after
  • 只带 x-ratelimit-reset

这样能比较直接地证明这次 PR 修掉的核心问题:分页被限流打断后,不会丢掉前面已经抓到的数据,也不会每次都从 page 1 重头开始。

@jadeproheshan
Copy link
Copy Markdown
Contributor Author

我把我仓库的写权限给你,你直接推到我仓库吧,这样你就不用新开一个pr,咱合作把这bug修了
@Fei-Good

@jadeproheshan
Copy link
Copy Markdown
Contributor Author

@NianJiuZst 你觉得最大获取页数定为多少合适呢?

目前的最大页数我看另一个协作者定为13,这个最大页数加上目前的退避算法,一定没法获取到完整结果。

我当时想的是两种解决方法
1.获取到了最大页数才算成功
2.既然最大页数一定获取不到,获取部分结果是常态,那么就把获取到一定数量的结果算成功。当然这样就出现了你上面说的语义不适合

@NianJiuZst
Copy link
Copy Markdown
Owner

NianJiuZst commented May 18, 2026

有一个新思路,提供一个用户可配置的值,控制“每次搜索最多收集多少个 issue”。
这个值代表用户真正关心的产物规模,而不是 GitHub API 的分页细节。
把这个值定义为:searchCandidateLimit

含义不是“最多翻多少页”,也不是“最多抓多少条原始 search item”,而是:“最多收集多少个经过过滤、去重后可用的候选 issue”,这样比页数更稳定,也更符合 scout 的实际目标。
整体行为

按页请求 GitHub Search API
每拿到一页,立刻做:

  • 过滤
  • 去重
  • 候选收集
    当“可用候选 issue 数”达到 searchCandidateLimit 时就停止
    这次搜索如果是正常完成的,就把结果整体缓存下来,用户可以自己配置这个 limit,默认给一个合理值

虽然对用户只暴露一个 limit,但内部还是要保留两个保护机制:

一个隐藏的页数安全阀,防止极端情况下翻页失控,对 rate limit 中断和正常完成做语义区分,也就是说,要区分两种“停止”:

按配置主动停止

已经收集到足够多的候选 issue:这是正常完成可以正常缓存
被 rate limit 中途打断:只拿到了一部分,这是降级结果,不能直接当成完整成功结果缓存

@jadeproheshan
Copy link
Copy Markdown
Contributor Author

get

这个区分做的真好

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: paginateSearchWithRetry triggers 403 rate limit and discards already-fetched data on retry

3 participants