Rate-limit circuit breaker for GitHub MCP backend tool calls#3799
Rate-limit circuit breaker for GitHub MCP backend tool calls#3799
Conversation
- Add circuit_breaker.go with CLOSED/OPEN/HALF-OPEN state machine (Phase 2) - Add isRateLimitToolResult to detect GitHub MCP rate-limit errors by inspecting isError flag and text content patterns - Integrate circuit breaker into callBackendTool in unified.go: check before each backend call, record rate-limit hits and successes, return descriptive errors when OPEN - Add RateLimitThreshold and RateLimitCooldown fields to ServerConfig (TOML/JSON) - Add injectRetryAfterIfRateLimited in proxy/handler.go: detects HTTP 429 or X-RateLimit-Remaining=0, injects Retry-After header, logs ERROR (Phase 1 proxy) - Tests: circuit_breaker_test.go covers all state transitions; rate_limit_test.go covers proxy Retry-After injection; config rate_limit fields test" Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/29c289a0-52c0-4c27-98f3-8eef3b574559 Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
…ort grouping Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/29c289a0-52c0-4c27-98f3-8eef3b574559 Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Adds gateway/proxy-side protection against GitHub MCP backend rate limiting by detecting rate-limit responses, communicating backoff to clients, and introducing a per-backend circuit breaker to prevent retry storms.
Changes:
- Gateway mode: detect rate-limit tool results and integrate a per-backend circuit breaker into
callBackendTool. - Proxy mode: detect upstream rate limiting (429 / remaining=0), inject
Retry-After, and add unit tests for parsing/backoff calculation. - Config: add per-server
rate_limit_thresholdandrate_limit_cooldownfields (plus tests).
Show a summary per file
| File | Description |
|---|---|
| internal/server/unified.go | Adds circuit breaker initialization and enforces it before/after backend tool calls. |
| internal/server/circuit_breaker.go | Implements the CLOSED/OPEN/HALF-OPEN rate-limit circuit breaker and rate-limit detection helpers. |
| internal/server/circuit_breaker_test.go | Adds unit tests for circuit breaker state transitions and parsing/detection helpers. |
| internal/proxy/handler.go | Injects Retry-After on upstream rate-limit signals and logs rate-limit events. |
| internal/proxy/rate_limit_test.go | Tests Retry-After injection and retry delay computation/parsing. |
| internal/config/config_core.go | Adds per-server configuration fields for circuit breaker threshold/cooldown. |
| internal/config/rate_limit_config_test.go | Verifies TOML parsing behavior for the new rate-limit fields. |
Copilot's findings
Tip
Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Files reviewed: 7/7 changed files
- Comments generated: 5
| case circuitHalfOpen: | ||
| // One probe is allowed through; further requests are blocked. | ||
| return nil | ||
| } |
There was a problem hiding this comment.
In HALF-OPEN state, Allow() currently returns nil unconditionally, so multiple concurrent requests will all be allowed through (the comment says only one probe should be allowed and further requests blocked). This defeats the purpose of HALF-OPEN and can re-create the request storm during recovery. Track and enforce a single probe (e.g., a boolean/nonce like probeInFlight/probeUsed under the mutex) so subsequent Allow() calls in HALF-OPEN return ErrCircuitOpen (or similar) until RecordSuccess/RecordRateLimit transitions the state.
| func (us *UnifiedServer) getCircuitBreaker(serverID string) *circuitBreaker { | ||
| if cb, ok := us.circuitBreakers[serverID]; ok { | ||
| return cb | ||
| } |
There was a problem hiding this comment.
getCircuitBreaker() writes to us.circuitBreakers without ensuring the map is initialized. If a UnifiedServer is constructed without NewUnified (several tests build &UnifiedServer{...} literals), this will panic with "assignment to entry in nil map" on the first call for a missing key. Initialize the map when nil (or initialize circuitBreakers in all test constructors) before assigning.
| } | |
| } | |
| if us.circuitBreakers == nil { | |
| us.circuitBreakers = make(map[string]*circuitBreaker) | |
| } |
internal/server/unified.go
Outdated
| // Return the original error message so the agent can see it. | ||
| return newErrorCallToolResult(fmt.Errorf("backend server %q rate-limited: %w", | ||
| serverID, &ErrCircuitOpen{ServerID: serverID, ResetAt: resetAt})) |
There was a problem hiding this comment.
On a detected upstream rate-limit tool result, this path wraps/returns ErrCircuitOpen and constructs a new error string, but that is not the original backend error message (and it also implies the circuit is OPEN even when the threshold hasn't been reached yet). This loses useful backend details like the exact GitHub error text and any embedded reset hint, and can mislead clients about whether they were blocked by the breaker vs. rate-limited by upstream. Consider returning the backendResult as-is (or preserving its text) and only returning ErrCircuitOpen when cb.Allow() actually rejects the call; if you need to surface resetAt, add it as metadata without replacing the backend message.
| // Return the original error message so the agent can see it. | |
| return newErrorCallToolResult(fmt.Errorf("backend server %q rate-limited: %w", | |
| serverID, &ErrCircuitOpen{ServerID: serverID, ResetAt: resetAt})) | |
| // Preserve the backend-provided tool result so callers can see the original | |
| // upstream rate-limit details. ErrCircuitOpen should only be returned when | |
| // cb.Allow() rejects the call before contacting the backend. | |
| return backendResult |
| // are immediately rejected until the cooldown period elapses. Default: 3. | ||
| RateLimitThreshold int `toml:"rate_limit_threshold" json:"rate_limit_threshold,omitempty"` | ||
|
|
||
| // RateLimitCooldown is the number of seconds the circuit breaker stays OPEN before | ||
| // allowing a single probe request (transition OPEN → HALF-OPEN). If the probe | ||
| // succeeds the circuit closes; if rate-limited again it re-opens. Default: 60. |
There was a problem hiding this comment.
These new per-server rate-limit fields are added to the file config (TOML/JSON via ServerConfig), but stdin JSON configs are validated against an embedded upstream schema where stdioServerConfig/httpServerConfig have additionalProperties=false. Since fixSchemaBytes/stripExtensionFieldsForValidation currently do not add/strip rate_limit_threshold or rate_limit_cooldown, providing these fields via stdin JSON will fail schema validation (and convertStdinServerConfig has no wiring for them). If the intent is to support JSON stdin config too, update schema fixing + StdinServerConfig/convertStdinServerConfig accordingly; otherwise the doc comments/defaults should clarify the fields are TOML-only.
| // are immediately rejected until the cooldown period elapses. Default: 3. | |
| RateLimitThreshold int `toml:"rate_limit_threshold" json:"rate_limit_threshold,omitempty"` | |
| // RateLimitCooldown is the number of seconds the circuit breaker stays OPEN before | |
| // allowing a single probe request (transition OPEN → HALF-OPEN). If the probe | |
| // succeeds the circuit closes; if rate-limited again it re-opens. Default: 60. | |
| // are immediately rejected until the cooldown period elapses. Default: 3. | |
| // Supported in file-based config; stdin JSON config does not currently accept this field. | |
| RateLimitThreshold int `toml:"rate_limit_threshold" json:"rate_limit_threshold,omitempty"` | |
| // RateLimitCooldown is the number of seconds the circuit breaker stays OPEN before | |
| // allowing a single probe request (transition OPEN → HALF-OPEN). If the probe | |
| // succeeds the circuit closes; if rate-limited again it re-opens. Default: 60. | |
| // Supported in file-based config; stdin JSON config does not currently accept this field. |
| t.Parallel() | ||
| // Use a very short cooldown so the test doesn't sleep long. | ||
| cb := newCircuitBreaker("test", 1, 10*time.Millisecond) | ||
|
|
||
| cb.RecordRateLimit(time.Time{}) | ||
| require.Equal(t, circuitOpen, cb.State(), "should be OPEN after 1 error") | ||
|
|
||
| // Wait for cooldown. | ||
| time.Sleep(20 * time.Millisecond) | ||
|
|
||
| err := cb.Allow() | ||
| assert.NoError(t, err, "should allow probe after cooldown") | ||
| assert.Equal(t, circuitHalfOpen, cb.State(), "should be HALF-OPEN after cooldown") | ||
| } | ||
|
|
||
| // TestCircuitBreaker_HalfOpenClosesOnSuccess verifies HALF-OPEN → CLOSED on probe success. | ||
| func TestCircuitBreaker_HalfOpenClosesOnSuccess(t *testing.T) { | ||
| t.Parallel() | ||
| cb := newCircuitBreaker("test", 1, 10*time.Millisecond) | ||
|
|
||
| cb.RecordRateLimit(time.Time{}) | ||
| require.Equal(t, circuitOpen, cb.State()) | ||
|
|
||
| time.Sleep(20 * time.Millisecond) | ||
| require.NoError(t, cb.Allow()) // probe allowed | ||
|
|
||
| cb.RecordSuccess() | ||
| assert.Equal(t, circuitClosed, cb.State(), "should be CLOSED after probe success") | ||
| assert.NoError(t, cb.Allow(), "CLOSED circuit should allow requests") | ||
| } | ||
|
|
||
| // TestCircuitBreaker_HalfOpenReOpensOnRateLimit verifies HALF-OPEN → OPEN on probe failure. | ||
| func TestCircuitBreaker_HalfOpenReOpensOnRateLimit(t *testing.T) { | ||
| t.Parallel() | ||
| cb := newCircuitBreaker("test", 1, 10*time.Millisecond) | ||
|
|
||
| cb.RecordRateLimit(time.Time{}) | ||
| require.Equal(t, circuitOpen, cb.State()) | ||
|
|
||
| time.Sleep(20 * time.Millisecond) | ||
| require.NoError(t, cb.Allow()) // probe allowed | ||
|
|
||
| cb.RecordRateLimit(time.Time{}) | ||
| assert.Equal(t, circuitOpen, cb.State(), "should be OPEN again after probe is rate-limited") | ||
|
|
||
| err := cb.Allow() | ||
| require.Error(t, err) | ||
| var openErr *ErrCircuitOpen | ||
| require.ErrorAs(t, err, &openErr) | ||
| } | ||
|
|
||
| // TestCircuitBreaker_ResetAtFromHeader verifies the reset time from upstream is used. | ||
| func TestCircuitBreaker_ResetAtFromHeader(t *testing.T) { | ||
| t.Parallel() | ||
| cb := newCircuitBreaker("test", 1, 60*time.Second) | ||
| future := time.Now().Add(5 * time.Millisecond) | ||
| cb.RecordRateLimit(future) | ||
| require.Equal(t, circuitOpen, cb.State()) | ||
|
|
||
| // Before the reset time: still OPEN. | ||
| require.Error(t, cb.Allow()) | ||
|
|
||
| // After the reset time: transitions to HALF-OPEN. | ||
| time.Sleep(10 * time.Millisecond) | ||
| err := cb.Allow() |
There was a problem hiding this comment.
These tests use very short real-time sleeps (10–20ms) while also running in parallel. On slower/loaded CI runners this can be flaky (sleep may not exceed the cooldown/reset boundary due to scheduling jitter). Prefer injecting a clock into the circuit breaker or, if sticking with sleeps, increase durations and add a more generous margin (similar to other tests in the repo that use 2× TTL sleeps to avoid flakes).
Review FeedbackI'm addressing all 5 existing review comments plus 2 additional findings. Pushing commits for each. Addressing existing review feedback1. HALF-OPEN allows multiple concurrent probes (circuit_breaker.go:135) 2. Nil map panic in 3. Rate-limit result wraps in misleading 4. JSON stdin config not wired (config_core.go:226) 5. Flaky time-based tests (circuit_breaker_test.go:127) Additional findings6. Operator precedence bug in 7. Duplicate |
…ckend errors Address all 5 existing review comments plus 2 additional findings: 1. HALF-OPEN probe tracking: add probeInFlight flag so only one probe passes in HALF-OPEN state; concurrent requests get ErrCircuitOpen. Add TestCircuitBreaker_HalfOpenBlocksConcurrentProbes. 2. Nil map guard: getCircuitBreaker initializes circuitBreakers map when nil, preventing panic in test constructors that bypass NewUnified. 3. Preserve backend error text: rate-limit detection now returns the original upstream error message via extractRateLimitErrorText instead of wrapping in ErrCircuitOpen. ErrCircuitOpen is only returned when cb.Allow() rejects the call before contacting the backend. 4. TOML-only doc comments: clarify that rate_limit_threshold and rate_limit_cooldown are not wired for stdin JSON config. 5. Injectable clock: replace time.Now() in circuit breaker with nowFunc field (default time.Now). Tests use deterministic fake time instead of flaky time.Sleep-based assertions. 6. Operator precedence: add explicit parentheses in isRateLimitText for the compound rate limit + 403 condition. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Four tools simultaneously hit GitHub's 15k req/reset limit with no gateway-level protection — rate-limited responses propagated directly to agents, which retried immediately and worsened the storm.
Changes
Phase 1: Rate-limit detection + proxy backoff
unified.go): After eachexecuteBackendToolCall, inspects the tool result for GitHub MCP rate-limit error patterns (isError: true+ text matching "rate limit exceeded", "secondary rate limit", "too many requests", etc.) usingisRateLimitToolResult(). Extracts reset time from[rate reset in Ns]in the error text.handler.go):injectRetryAfterIfRateLimited()checksHTTP 429orX-Ratelimit-Remaining: 0after each upstream forward, injects aRetry-Afterheader derived fromX-Ratelimit-Reset, and logs at ERROR level.Phase 2: Per-backend circuit breaker (
circuit_breaker.go)Classic three-state machine per backend server ID:
ErrCircuitOpen(includes reset timestamp).Configuration
Two new optional per-server fields in TOML/JSON:
Warning
Firewall rules blocked me from connecting to one or more addresses (expand for details)
I tried to connect to the following addresses, but was blocked by firewall rules:
example.com/tmp/go-build151504929/b510/launcher.test /tmp/go-build151504929/b510/launcher.test -test.testlogfile=/tmp/go-build151504929/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/net -trimpath x_amd64/link -p 5607675/b085/ -lang=go1.25 x_amd64/link -I /opt/hostedtoolc-errorsas 5607675/b151/ x_amd64/vet --gdwarf-5 rk/XFoNDgDe6LSlTdocker-cli-plugin-metadata(dns block)/tmp/go-build1093949210/b510/launcher.test /tmp/go-build1093949210/b510/launcher.test -test.testlogfile=/tmp/go-build1093949210/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true /version/version.go /version/version_test.go x_amd64/vet 64/src/runtime/cbash ernal/oidc ache/go/1.25.8/x--noprofile x_amd64/vet -ato�� UQYbtZDox -buildtags x_amd64/vet -errorsas -ifaceassert -nilfunc x_amd64/vet(dns block)/tmp/go-build1807685167/b514/launcher.test /tmp/go-build1807685167/b514/launcher.test -test.testlogfile=/tmp/go-build1807685167/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s(dns block)invalid-host-that-does-not-exist-12345.com/tmp/go-build151504929/b492/config.test /tmp/go-build151504929/b492/config.test -test.testlogfile=/tmp/go-build151504929/b492/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true ache/go/1.25.8/x-errorsas andler.go x_amd64/compile(dns block)/tmp/go-build1093949210/b492/config.test /tmp/go-build1093949210/b492/config.test -test.testlogfile=/tmp/go-build1093949210/b492/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true t/config.go t/driver.go x_amd64/compile -p runtime/pprof -lang=go1.25 x_amd64/compile -uns�� -unreachable=false /tmp/go-build151504929/b092/vet.cfg x_amd64/vet -c=4 -nolocalimports -importcfg x_amd64/vet(dns block)/tmp/go-build1807685167/b496/config.test /tmp/go-build1807685167/b496/config.test -test.testlogfile=/tmp/go-build1807685167/b496/testlog.txt -test.paniconexit0 -test.timeout=10m0s go1.25.8 -c=4 -nolocalimports -importcfg /tmp/go-build1747915822/b123/importcfg -embedcfg /tmp/go-build1747915822/b123/embedcfg -pack(dns block)nonexistent.local/tmp/go-build151504929/b510/launcher.test /tmp/go-build151504929/b510/launcher.test -test.testlogfile=/tmp/go-build151504929/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/net -trimpath x_amd64/link -p 5607675/b085/ -lang=go1.25 x_amd64/link -I /opt/hostedtoolc-errorsas 5607675/b151/ x_amd64/vet --gdwarf-5 rk/XFoNDgDe6LSlTdocker-cli-plugin-metadata(dns block)/tmp/go-build1093949210/b510/launcher.test /tmp/go-build1093949210/b510/launcher.test -test.testlogfile=/tmp/go-build1093949210/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true /version/version.go /version/version_test.go x_amd64/vet 64/src/runtime/cbash ernal/oidc ache/go/1.25.8/x--noprofile x_amd64/vet -ato�� UQYbtZDox -buildtags x_amd64/vet -errorsas -ifaceassert -nilfunc x_amd64/vet(dns block)/tmp/go-build1807685167/b514/launcher.test /tmp/go-build1807685167/b514/launcher.test -test.testlogfile=/tmp/go-build1807685167/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s(dns block)slow.example.com/tmp/go-build151504929/b510/launcher.test /tmp/go-build151504929/b510/launcher.test -test.testlogfile=/tmp/go-build151504929/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/net -trimpath x_amd64/link -p 5607675/b085/ -lang=go1.25 x_amd64/link -I /opt/hostedtoolc-errorsas 5607675/b151/ x_amd64/vet --gdwarf-5 rk/XFoNDgDe6LSlTdocker-cli-plugin-metadata(dns block)/tmp/go-build1093949210/b510/launcher.test /tmp/go-build1093949210/b510/launcher.test -test.testlogfile=/tmp/go-build1093949210/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true /version/version.go /version/version_test.go x_amd64/vet 64/src/runtime/cbash ernal/oidc ache/go/1.25.8/x--noprofile x_amd64/vet -ato�� UQYbtZDox -buildtags x_amd64/vet -errorsas -ifaceassert -nilfunc x_amd64/vet(dns block)/tmp/go-build1807685167/b514/launcher.test /tmp/go-build1807685167/b514/launcher.test -test.testlogfile=/tmp/go-build1807685167/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s(dns block)this-host-does-not-exist-12345.com/tmp/go-build151504929/b519/mcp.test /tmp/go-build151504929/b519/mcp.test -test.testlogfile=/tmp/go-build151504929/b519/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 1.80.0/metadata/metadata.go 5607675/b151/ x_amd64/vet --gdwarf-5 --64 -o x_amd64/vet 5607�� 1.10.2/active_help.go 1.10.2/args.go x_amd64/vet --gdwarf-5 --64 -o x_amd64/vet(dns block)/tmp/go-build1093949210/b519/mcp.test /tmp/go-build1093949210/b519/mcp.test -test.testlogfile=/tmp/go-build1093949210/b519/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true submodules | hea-p -buildtags x_amd64/vet -errorsas -ifaceassert -nilfunc x_amd64/vet n-me�� g_.a -buildtags 64/pkg/tool/linu-nolocalimports -errorsas ernal/config -nilfunc 64/pkg/tool/linu/tmp/go-build1093949210/b226/_testmain.go(dns block)/tmp/go-build1807685167/b523/mcp.test /tmp/go-build1807685167/b523/mcp.test -test.testlogfile=/tmp/go-build1807685167/b523/testlog.txt -test.paniconexit0 -test.timeout=10m0s 9560�� g_.a /sys/fs/cgroup ache/go/1.25.8/x64/pkg/tool/linux_amd64/vet JwJi0zHsZ ernal/config x_amd64/vet ache/go/1.25.8/x-buildtags 9560�� ZjxfQrmK0 x_amd64/vet x_amd64/vet rg/x/net@v0.52.0bash cfg 64/pkg/tool/linu--version x_amd64/vet(dns block)If you need me to access, download, or install something from one of these locations, you can either: