fix(github): implement resumable pagination and graceful degradation for search#35
fix(github): implement resumable pagination and graceful degradation for search#35jadeproheshan wants to merge 2 commits into
Conversation
…ards already-fetched data on retry
…for search - Add `currentPage` tracking to resume Search API pagination after rate limits - Parse `retry-after` header and fallback to a 60s base delay for secondary limits - Return partially collected items instead of crashing when retries are exhausted
NianJiuZst
left a comment
There was a problem hiding this comment.
这里先只说一个我认为需要先处理的点:部分结果被当成完整成功结果缓存了。
在 paginateSearchWithRetry() 里,如果分页过程中触发 rate limit,重试耗尽后只要 items.length > 0 就会直接 return items(大概在 src/services/github.ts 第 404-407 行)。
但上层 fetchTrendingIssues() 并不知道这是一份“降级拿到的部分结果”,它会继续按正常成功路径处理:
- 打 success log(
src/services/github.ts第 163 行) - 调用
saveCachedIssues(issues)写入缓存(src/services/github.ts第 164 行)
这样会带来一个比较隐蔽的问题:
- 一次被 secondary rate limit 打断的
openmeta scout - 可能只拿到了前几页 issue
- 但这份不完整结果会被写进 10 分钟缓存
- 后续用户再次运行时,如果没有
--refresh,就会直接复用这份截断数据
也就是说,现在的实现把“partial result”伪装成了“full success”,缓存语义是不对的。
建议的修复方式:
- 不要把这种降级返回的结果直接走正常 success + cache 路径。
- 至少要把
paginateSearchWithRetry()的返回结果改成能表达状态的结构,例如:{ items, partial: false }{ items, partial: true }
- 上层
fetchTrendingIssues()根据partial决定:- 是否打印成功日志
- 是否写入缓存
- 如果暂时不想改返回结构,更保守的做法是:重试耗尽时即使拿到部分数据,也不要写缓存。
另外建议补一组单测覆盖这个分支,因为现在已有测试没有打到这里:
paginate.iterator先 yield 1-2 页数据- 然后抛出 403/429
- 连续重试后仍失败
- 断言“拿到了部分结果”时,不会被当成正常完整结果写入缓存
这个点我建议在合并前处理掉。
|
@jadeproheshan @Fei-Good 补充单测 比较建议覆盖的场景是:
如果要再细一点,可以顺手覆盖两种 header 分支:
这样能比较直接地证明这次 PR 修掉的核心问题:分页被限流打断后,不会丢掉前面已经抓到的数据,也不会每次都从 page 1 重头开始。 |
|
我把我仓库的写权限给你,你直接推到我仓库吧,这样你就不用新开一个pr,咱合作把这bug修了 |
|
@NianJiuZst 你觉得最大获取页数定为多少合适呢? 目前的最大页数我看另一个协作者定为13,这个最大页数加上目前的退避算法,一定没法获取到完整结果。 我当时想的是两种解决方法 |
|
有一个新思路,提供一个用户可配置的值,控制“每次搜索最多收集多少个 issue”。 含义不是“最多翻多少页”,也不是“最多抓多少条原始 search item”,而是:“最多收集多少个经过过滤、去重后可用的候选 issue”,这样比页数更稳定,也更符合 scout 的实际目标。 按页请求 GitHub Search API
虽然对用户只暴露一个 limit,但内部还是要保留两个保护机制: 一个隐藏的页数安全阀,防止极端情况下翻页失控,对 rate limit 中断和正常完成做语义区分,也就是说,要区分两种“停止”: 按配置主动停止 已经收集到足够多的候选 issue:这是正常完成可以正常缓存 |
|
get 这个区分做的真好 |
fix #33
Problem
Currently, executing
openmeta scouttriggers a GitHub Search API 403 secondary rate limit quickly. The previous retry implementation would discard already fetched items and restart from page 1, resulting in a fatal loop. Additionally, the fallback retry delay of1swas insufficient for GitHub's secondary limits, causing the script to exhaust itsMAX_PAGINATION_RETRIESalmost instantly.Changes Made
currentPage): The loop now trackscurrentPageand passes it to the Octokit iterator so subsequent retries append to the already collected dataset instead of starting over.retry-afterheader (often returned in secondary limits). If no headers are provided, it assumes a conservative 60-second base delay.MAX_PAGINATION_RETRIES(3 times) is fully exhausted but we have successfully scraped a partial list of issues, the script will gracefully return the scraped items rather than throwing an error and dumping the data.Before / After
openmeta only fetch 56 results beforechange
and i increase to 289 after the change

Backoff Strategy Changes (Detailed)
retry-afterheader parsing: GitHub's secondary rate limit returnsretry-after(in seconds), which was previously ignored. Now it's checked as the highest-priority source for wait time.1s → 2s → 4s(exponential from 1s base). This is useless against secondary limits. Now it defaults to61s → 62s → 64s.x-ratelimit-reset60s cap: The old code hadMath.min(waitMs + 500, 60_000), which truncated GitHub's recommended wait time to 60s max. Now we trust the server's reset time without an artificial cap.For Discussion / Review
MAX_SEARCH_PAGESlimit before exhausting our attempts. Should we consider:MAX_SEARCH_PAGES) a user-configurable option?