fix(github): implement resumable pagination and graceful degradation for search by jadeproheshan · Pull Request #35 · NianJiuZst/openmeta-cli

jadeproheshan · 2026-05-18T16:09:04Z

Problem

Currently, executing openmeta scout triggers a GitHub Search API 403 secondary rate limit quickly. The previous retry implementation would discard already fetched items and restart from page 1, resulting in a fatal loop. Additionally, the fallback retry delay of 1s was insufficient for GitHub's secondary limits, causing the script to exhaust its MAX_PAGINATION_RETRIES almost instantly.

Changes Made

Resumable Pagination (currentPage): The loop now tracks currentPage and passes it to the Octokit iterator so subsequent retries append to the already collected dataset instead of starting over.
Smarter Rate-Limit Delays: The backoff logic now explicitly intercepts the retry-after header (often returned in secondary limits). If no headers are provided, it assumes a conservative 60-second base delay.
Graceful Degradation: If MAX_PAGINATION_RETRIES (3 times) is fully exhausted but we have successfully scraped a partial list of issues, the script will gracefully return the scraped items rather than throwing an error and dumping the data.

Before / After

openmeta only fetch 56 results beforechange

and i increase to 289 after the change

Backoff Strategy Changes (Detailed)

retry-after header parsing: GitHub's secondary rate limit returns retry-after (in seconds), which was previously ignored. Now it's checked as the highest-priority source for wait time.
Default fallback raised from 1s → 60s: When no rate-limit headers are present, the delay was previously 1s → 2s → 4s (exponential from 1s base). This is useless against secondary limits. Now it defaults to 61s → 62s → 64s.
Removed x-ratelimit-reset 60s cap: The old code had Math.min(waitMs + 500, 60_000), which truncated GitHub's recommended wait time to 60s max. Now we trust the server's reset time without an artificial cap.

For Discussion / Review

Incomplete Pagination: With the current retry limits, it's highly likely we won't hit the MAX_SEARCH_PAGES limit before exhausting our attempts. Should we consider:
1. Making the maximum page limit (MAX_SEARCH_PAGES) a user-configurable option?
2. Further optimizing the backoff strategy to handle severe throttling?
Asynchronous Execution: Given that the fetch and waiting process can now take a significantly long time, is it necessary/worthwhile to make this operation run asynchronously?

…ards already-fetched data on retry

…for search - Add `currentPage` tracking to resume Search API pagination after rate limits - Parse `retry-after` header and fallback to a 60s base delay for secondary limits - Return partially collected items instead of crashing when retries are exhausted

NianJiuZst

这里先只说一个我认为需要先处理的点：部分结果被当成完整成功结果缓存了。

在 paginateSearchWithRetry() 里，如果分页过程中触发 rate limit，重试耗尽后只要 items.length > 0 就会直接 return items（大概在 src/services/github.ts 第 404-407 行）。

但上层 fetchTrendingIssues() 并不知道这是一份“降级拿到的部分结果”，它会继续按正常成功路径处理：

打 success log（src/services/github.ts 第 163 行）
调用 saveCachedIssues(issues) 写入缓存（src/services/github.ts 第 164 行）

这样会带来一个比较隐蔽的问题：

一次被 secondary rate limit 打断的 openmeta scout
可能只拿到了前几页 issue
但这份不完整结果会被写进 10 分钟缓存
后续用户再次运行时，如果没有 --refresh，就会直接复用这份截断数据

也就是说，现在的实现把“partial result”伪装成了“full success”，缓存语义是不对的。

建议的修复方式：

不要把这种降级返回的结果直接走正常 success + cache 路径。
至少要把 paginateSearchWithRetry() 的返回结果改成能表达状态的结构，例如：
- { items, partial: false }
- { items, partial: true }
上层 fetchTrendingIssues() 根据 partial 决定：
- 是否打印成功日志
- 是否写入缓存
如果暂时不想改返回结构，更保守的做法是：重试耗尽时即使拿到部分数据，也不要写缓存。

另外建议补一组单测覆盖这个分支，因为现在已有测试没有打到这里：

paginate.iterator 先 yield 1-2 页数据
然后抛出 403/429
连续重试后仍失败
断言“拿到了部分结果”时，不会被当成正常完整结果写入缓存

这个点我建议在合并前处理掉。

NianJiuZst · 2026-05-18T16:27:13Z

@jadeproheshan @Fei-Good 补充单测

比较建议覆盖的场景是：

paginate.iterator 先成功返回前几页数据
中途抛出 403/429（模拟 secondary rate limit）
重试后从 currentPage 继续，而不是从第 1 页重新开始
最终返回结果里既保留重试前已经拿到的 items，也能继续追加后续页的数据

如果要再细一点，可以顺手覆盖两种 header 分支：

带 retry-after
只带 x-ratelimit-reset

这样能比较直接地证明这次 PR 修掉的核心问题：分页被限流打断后，不会丢掉前面已经抓到的数据，也不会每次都从 page 1 重头开始。

jadeproheshan · 2026-05-18T16:47:10Z

我把我仓库的写权限给你，你直接推到我仓库吧，这样你就不用新开一个pr，咱合作把这bug修了
@Fei-Good

jadeproheshan · 2026-05-18T17:08:48Z

@NianJiuZst 你觉得最大获取页数定为多少合适呢?

目前的最大页数我看另一个协作者定为13，这个最大页数加上目前的退避算法，一定没法获取到完整结果。

我当时想的是两种解决方法
1.获取到了最大页数才算成功
2.既然最大页数一定获取不到，获取部分结果是常态，那么就把获取到一定数量的结果算成功。当然这样就出现了你上面说的语义不适合

NianJiuZst · 2026-05-18T17:27:07Z

有一个新思路，提供一个用户可配置的值，控制“每次搜索最多收集多少个 issue”。
这个值代表用户真正关心的产物规模，而不是 GitHub API 的分页细节。
把这个值定义为：searchCandidateLimit

含义不是“最多翻多少页”，也不是“最多抓多少条原始 search item”，而是：“最多收集多少个经过过滤、去重后可用的候选 issue”，这样比页数更稳定，也更符合 scout 的实际目标。
整体行为

按页请求 GitHub Search API
每拿到一页，立刻做：

过滤
去重
候选收集
当“可用候选 issue 数”达到 searchCandidateLimit 时就停止
这次搜索如果是正常完成的，就把结果整体缓存下来，用户可以自己配置这个 limit，默认给一个合理值

虽然对用户只暴露一个 limit，但内部还是要保留两个保护机制：

一个隐藏的页数安全阀，防止极端情况下翻页失控，对 rate limit 中断和正常完成做语义区分，也就是说，要区分两种“停止”：

按配置主动停止

已经收集到足够多的候选 issue：这是正常完成可以正常缓存
被 rate limit 中途打断：只拿到了一部分，这是降级结果，不能直接当成完整成功结果缓存

jadeproheshan · 2026-05-18T17:37:21Z

get

这个区分做的真好

Fei-Good and others added 2 commits May 18, 2026 13:21

fix(github): paginateSearchWithRetry triggers 403 rate limit and disc…

c7d2b6b

…ards already-fetched data on retry

NianJiuZst self-requested a review May 18, 2026 16:13

This was referenced May 18, 2026

fix(github): paginateSearchWithRetry triggers 403 rate limit and disc… #34

Closed

Bug: paginateSearchWithRetry triggers 403 rate limit and discards already-fetched data on retry #33

Open

NianJiuZst reviewed May 18, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(github): implement resumable pagination and graceful degradation for search#35

fix(github): implement resumable pagination and graceful degradation for search#35
jadeproheshan wants to merge 2 commits into
NianJiuZst:mainfrom
jadeproheshan:feat/issue

jadeproheshan commented May 18, 2026 •

edited

Loading

Uh oh!

NianJiuZst left a comment

Uh oh!

NianJiuZst commented May 18, 2026 •

edited

Loading

Uh oh!

jadeproheshan commented May 18, 2026

Uh oh!

jadeproheshan commented May 18, 2026

Uh oh!

NianJiuZst commented May 18, 2026 •

edited

Loading

Uh oh!

jadeproheshan commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

jadeproheshan commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Problem

Changes Made

Before / After

Backoff Strategy Changes (Detailed)

For Discussion / Review

Uh oh!

NianJiuZst left a comment

Choose a reason for hiding this comment

Uh oh!

NianJiuZst commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jadeproheshan commented May 18, 2026

Uh oh!

jadeproheshan commented May 18, 2026

Uh oh!

NianJiuZst commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

jadeproheshan commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

jadeproheshan commented May 18, 2026 •

edited

Loading

NianJiuZst commented May 18, 2026 •

edited

Loading

NianJiuZst commented May 18, 2026 •

edited

Loading