Skip to content

Rate-limit circuit breaker for GitHub MCP backend tool calls#3799

Merged
lpcox merged 4 commits intomainfrom
copilot/add-rate-limit-circuit-breaker
Apr 14, 2026
Merged

Rate-limit circuit breaker for GitHub MCP backend tool calls#3799
lpcox merged 4 commits intomainfrom
copilot/add-rate-limit-circuit-breaker

Conversation

Copy link
Copy Markdown
Contributor

Copilot AI commented Apr 14, 2026

Four tools simultaneously hit GitHub's 15k req/reset limit with no gateway-level protection — rate-limited responses propagated directly to agents, which retried immediately and worsened the storm.

Changes

Phase 1: Rate-limit detection + proxy backoff

  • Gateway mode (unified.go): After each executeBackendToolCall, inspects the tool result for GitHub MCP rate-limit error patterns (isError: true + text matching "rate limit exceeded", "secondary rate limit", "too many requests", etc.) using isRateLimitToolResult(). Extracts reset time from [rate reset in Ns] in the error text.
  • Proxy mode (handler.go): injectRetryAfterIfRateLimited() checks HTTP 429 or X-Ratelimit-Remaining: 0 after each upstream forward, injects a Retry-After header derived from X-Ratelimit-Reset, and logs at ERROR level.

Phase 2: Per-backend circuit breaker (circuit_breaker.go)

Classic three-state machine per backend server ID:

CLOSED ──(N consecutive rate-limit errors)──▶ OPEN
  ▲                                              │
  │ (probe succeeds)                  (cooldown/reset elapsed)
  │                                              ▼
  └────────────────────────────────── HALF-OPEN
  • OPEN rejects requests immediately with ErrCircuitOpen (includes reset timestamp).
  • HALF-OPEN allows one probe; re-opens on another rate-limit, closes on success.
  • Transport errors (connection failures) leave the counter unchanged — only actual rate-limit tool results affect it.

Configuration

Two new optional per-server fields in TOML/JSON:

[servers.github]
rate_limit_threshold = 3   # consecutive 429s before opening (default: 3)
rate_limit_cooldown   = 60 # seconds OPEN before probing   (default: 60)

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

  • example.com
    • Triggering command: /tmp/go-build151504929/b510/launcher.test /tmp/go-build151504929/b510/launcher.test -test.testlogfile=/tmp/go-build151504929/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/net -trimpath x_amd64/link -p 5607675/b085/ -lang=go1.25 x_amd64/link -I /opt/hostedtoolc-errorsas 5607675/b151/ x_amd64/vet --gdwarf-5 rk/XFoNDgDe6LSlTdocker-cli-plugin-metadata (dns block)
    • Triggering command: /tmp/go-build1093949210/b510/launcher.test /tmp/go-build1093949210/b510/launcher.test -test.testlogfile=/tmp/go-build1093949210/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true /version/version.go /version/version_test.go x_amd64/vet 64/src/runtime/cbash ernal/oidc ache/go/1.25.8/x--noprofile x_amd64/vet -ato�� UQYbtZDox -buildtags x_amd64/vet -errorsas -ifaceassert -nilfunc x_amd64/vet (dns block)
    • Triggering command: /tmp/go-build1807685167/b514/launcher.test /tmp/go-build1807685167/b514/launcher.test -test.testlogfile=/tmp/go-build1807685167/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s (dns block)
  • invalid-host-that-does-not-exist-12345.com
    • Triggering command: /tmp/go-build151504929/b492/config.test /tmp/go-build151504929/b492/config.test -test.testlogfile=/tmp/go-build151504929/b492/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true ache/go/1.25.8/x-errorsas andler.go x_amd64/compile (dns block)
    • Triggering command: /tmp/go-build1093949210/b492/config.test /tmp/go-build1093949210/b492/config.test -test.testlogfile=/tmp/go-build1093949210/b492/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true t/config.go t/driver.go x_amd64/compile -p runtime/pprof -lang=go1.25 x_amd64/compile -uns�� -unreachable=false /tmp/go-build151504929/b092/vet.cfg x_amd64/vet -c=4 -nolocalimports -importcfg x_amd64/vet (dns block)
    • Triggering command: /tmp/go-build1807685167/b496/config.test /tmp/go-build1807685167/b496/config.test -test.testlogfile=/tmp/go-build1807685167/b496/testlog.txt -test.paniconexit0 -test.timeout=10m0s go1.25.8 -c=4 -nolocalimports -importcfg /tmp/go-build1747915822/b123/importcfg -embedcfg /tmp/go-build1747915822/b123/embedcfg -pack (dns block)
  • nonexistent.local
    • Triggering command: /tmp/go-build151504929/b510/launcher.test /tmp/go-build151504929/b510/launcher.test -test.testlogfile=/tmp/go-build151504929/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/net -trimpath x_amd64/link -p 5607675/b085/ -lang=go1.25 x_amd64/link -I /opt/hostedtoolc-errorsas 5607675/b151/ x_amd64/vet --gdwarf-5 rk/XFoNDgDe6LSlTdocker-cli-plugin-metadata (dns block)
    • Triggering command: /tmp/go-build1093949210/b510/launcher.test /tmp/go-build1093949210/b510/launcher.test -test.testlogfile=/tmp/go-build1093949210/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true /version/version.go /version/version_test.go x_amd64/vet 64/src/runtime/cbash ernal/oidc ache/go/1.25.8/x--noprofile x_amd64/vet -ato�� UQYbtZDox -buildtags x_amd64/vet -errorsas -ifaceassert -nilfunc x_amd64/vet (dns block)
    • Triggering command: /tmp/go-build1807685167/b514/launcher.test /tmp/go-build1807685167/b514/launcher.test -test.testlogfile=/tmp/go-build1807685167/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s (dns block)
  • slow.example.com
    • Triggering command: /tmp/go-build151504929/b510/launcher.test /tmp/go-build151504929/b510/launcher.test -test.testlogfile=/tmp/go-build151504929/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 64/src/net -trimpath x_amd64/link -p 5607675/b085/ -lang=go1.25 x_amd64/link -I /opt/hostedtoolc-errorsas 5607675/b151/ x_amd64/vet --gdwarf-5 rk/XFoNDgDe6LSlTdocker-cli-plugin-metadata (dns block)
    • Triggering command: /tmp/go-build1093949210/b510/launcher.test /tmp/go-build1093949210/b510/launcher.test -test.testlogfile=/tmp/go-build1093949210/b510/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true /version/version.go /version/version_test.go x_amd64/vet 64/src/runtime/cbash ernal/oidc ache/go/1.25.8/x--noprofile x_amd64/vet -ato�� UQYbtZDox -buildtags x_amd64/vet -errorsas -ifaceassert -nilfunc x_amd64/vet (dns block)
    • Triggering command: /tmp/go-build1807685167/b514/launcher.test /tmp/go-build1807685167/b514/launcher.test -test.testlogfile=/tmp/go-build1807685167/b514/testlog.txt -test.paniconexit0 -test.timeout=10m0s (dns block)
  • this-host-does-not-exist-12345.com
    • Triggering command: /tmp/go-build151504929/b519/mcp.test /tmp/go-build151504929/b519/mcp.test -test.testlogfile=/tmp/go-build151504929/b519/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true 1.80.0/metadata/metadata.go 5607675/b151/ x_amd64/vet --gdwarf-5 --64 -o x_amd64/vet 5607�� 1.10.2/active_help.go 1.10.2/args.go x_amd64/vet --gdwarf-5 --64 -o x_amd64/vet (dns block)
    • Triggering command: /tmp/go-build1093949210/b519/mcp.test /tmp/go-build1093949210/b519/mcp.test -test.testlogfile=/tmp/go-build1093949210/b519/testlog.txt -test.paniconexit0 -test.timeout=10m0s -test.v=true submodules | hea-p -buildtags x_amd64/vet -errorsas -ifaceassert -nilfunc x_amd64/vet n-me�� g_.a -buildtags 64/pkg/tool/linu-nolocalimports -errorsas ernal/config -nilfunc 64/pkg/tool/linu/tmp/go-build1093949210/b226/_testmain.go (dns block)
    • Triggering command: /tmp/go-build1807685167/b523/mcp.test /tmp/go-build1807685167/b523/mcp.test -test.testlogfile=/tmp/go-build1807685167/b523/testlog.txt -test.paniconexit0 -test.timeout=10m0s 9560�� g_.a /sys/fs/cgroup ache/go/1.25.8/x64/pkg/tool/linux_amd64/vet JwJi0zHsZ ernal/config x_amd64/vet ache/go/1.25.8/x-buildtags 9560�� ZjxfQrmK0 x_amd64/vet x_amd64/vet rg/x/net@v0.52.0bash cfg 64/pkg/tool/linu--version x_amd64/vet (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Copilot AI and others added 2 commits April 14, 2026 19:07
- Add circuit_breaker.go with CLOSED/OPEN/HALF-OPEN state machine (Phase 2)
- Add isRateLimitToolResult to detect GitHub MCP rate-limit errors by inspecting
  isError flag and text content patterns
- Integrate circuit breaker into callBackendTool in unified.go: check before each
  backend call, record rate-limit hits and successes, return descriptive errors when OPEN
- Add RateLimitThreshold and RateLimitCooldown fields to ServerConfig (TOML/JSON)
- Add injectRetryAfterIfRateLimited in proxy/handler.go: detects HTTP 429 or
  X-RateLimit-Remaining=0, injects Retry-After header, logs ERROR (Phase 1 proxy)
- Tests: circuit_breaker_test.go covers all state transitions; rate_limit_test.go
  covers proxy Retry-After injection; config rate_limit fields test"

Agent-Logs-Url: https://github.com/github/gh-aw-mcpg/sessions/29c289a0-52c0-4c27-98f3-8eef3b574559

Co-authored-by: lpcox <15877973+lpcox@users.noreply.github.com>
Copilot AI changed the title [WIP] Add rate-limit circuit breaker for GitHub MCP backend tool calls Rate-limit circuit breaker for GitHub MCP backend tool calls Apr 14, 2026
Copilot AI requested a review from lpcox April 14, 2026 19:12
@lpcox lpcox marked this pull request as ready for review April 14, 2026 19:38
Copilot AI review requested due to automatic review settings April 14, 2026 19:38
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds gateway/proxy-side protection against GitHub MCP backend rate limiting by detecting rate-limit responses, communicating backoff to clients, and introducing a per-backend circuit breaker to prevent retry storms.

Changes:

  • Gateway mode: detect rate-limit tool results and integrate a per-backend circuit breaker into callBackendTool.
  • Proxy mode: detect upstream rate limiting (429 / remaining=0), inject Retry-After, and add unit tests for parsing/backoff calculation.
  • Config: add per-server rate_limit_threshold and rate_limit_cooldown fields (plus tests).
Show a summary per file
File Description
internal/server/unified.go Adds circuit breaker initialization and enforces it before/after backend tool calls.
internal/server/circuit_breaker.go Implements the CLOSED/OPEN/HALF-OPEN rate-limit circuit breaker and rate-limit detection helpers.
internal/server/circuit_breaker_test.go Adds unit tests for circuit breaker state transitions and parsing/detection helpers.
internal/proxy/handler.go Injects Retry-After on upstream rate-limit signals and logs rate-limit events.
internal/proxy/rate_limit_test.go Tests Retry-After injection and retry delay computation/parsing.
internal/config/config_core.go Adds per-server configuration fields for circuit breaker threshold/cooldown.
internal/config/rate_limit_config_test.go Verifies TOML parsing behavior for the new rate-limit fields.

Copilot's findings

Tip

Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

  • Files reviewed: 7/7 changed files
  • Comments generated: 5

Comment on lines +132 to +135
case circuitHalfOpen:
// One probe is allowed through; further requests are blocked.
return nil
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In HALF-OPEN state, Allow() currently returns nil unconditionally, so multiple concurrent requests will all be allowed through (the comment says only one probe should be allowed and further requests blocked). This defeats the purpose of HALF-OPEN and can re-create the request storm during recovery. Track and enforce a single probe (e.g., a boolean/nonce like probeInFlight/probeUsed under the mutex) so subsequent Allow() calls in HALF-OPEN return ErrCircuitOpen (or similar) until RecordSuccess/RecordRateLimit transitions the state.

Copilot uses AI. Check for mistakes.
func (us *UnifiedServer) getCircuitBreaker(serverID string) *circuitBreaker {
if cb, ok := us.circuitBreakers[serverID]; ok {
return cb
}
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

getCircuitBreaker() writes to us.circuitBreakers without ensuring the map is initialized. If a UnifiedServer is constructed without NewUnified (several tests build &UnifiedServer{...} literals), this will panic with "assignment to entry in nil map" on the first call for a missing key. Initialize the map when nil (or initialize circuitBreakers in all test constructors) before assigning.

Suggested change
}
}
if us.circuitBreakers == nil {
us.circuitBreakers = make(map[string]*circuitBreaker)
}

Copilot uses AI. Check for mistakes.
Comment on lines +591 to +593
// Return the original error message so the agent can see it.
return newErrorCallToolResult(fmt.Errorf("backend server %q rate-limited: %w",
serverID, &ErrCircuitOpen{ServerID: serverID, ResetAt: resetAt}))
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a detected upstream rate-limit tool result, this path wraps/returns ErrCircuitOpen and constructs a new error string, but that is not the original backend error message (and it also implies the circuit is OPEN even when the threshold hasn't been reached yet). This loses useful backend details like the exact GitHub error text and any embedded reset hint, and can mislead clients about whether they were blocked by the breaker vs. rate-limited by upstream. Consider returning the backendResult as-is (or preserving its text) and only returning ErrCircuitOpen when cb.Allow() actually rejects the call; if you need to surface resetAt, add it as metadata without replacing the backend message.

Suggested change
// Return the original error message so the agent can see it.
return newErrorCallToolResult(fmt.Errorf("backend server %q rate-limited: %w",
serverID, &ErrCircuitOpen{ServerID: serverID, ResetAt: resetAt}))
// Preserve the backend-provided tool result so callers can see the original
// upstream rate-limit details. ErrCircuitOpen should only be returned when
// cb.Allow() rejects the call before contacting the backend.
return backendResult

Copilot uses AI. Check for mistakes.
Comment on lines +221 to +226
// are immediately rejected until the cooldown period elapses. Default: 3.
RateLimitThreshold int `toml:"rate_limit_threshold" json:"rate_limit_threshold,omitempty"`

// RateLimitCooldown is the number of seconds the circuit breaker stays OPEN before
// allowing a single probe request (transition OPEN → HALF-OPEN). If the probe
// succeeds the circuit closes; if rate-limited again it re-opens. Default: 60.
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These new per-server rate-limit fields are added to the file config (TOML/JSON via ServerConfig), but stdin JSON configs are validated against an embedded upstream schema where stdioServerConfig/httpServerConfig have additionalProperties=false. Since fixSchemaBytes/stripExtensionFieldsForValidation currently do not add/strip rate_limit_threshold or rate_limit_cooldown, providing these fields via stdin JSON will fail schema validation (and convertStdinServerConfig has no wiring for them). If the intent is to support JSON stdin config too, update schema fixing + StdinServerConfig/convertStdinServerConfig accordingly; otherwise the doc comments/defaults should clarify the fields are TOML-only.

Suggested change
// are immediately rejected until the cooldown period elapses. Default: 3.
RateLimitThreshold int `toml:"rate_limit_threshold" json:"rate_limit_threshold,omitempty"`
// RateLimitCooldown is the number of seconds the circuit breaker stays OPEN before
// allowing a single probe request (transition OPEN → HALF-OPEN). If the probe
// succeeds the circuit closes; if rate-limited again it re-opens. Default: 60.
// are immediately rejected until the cooldown period elapses. Default: 3.
// Supported in file-based config; stdin JSON config does not currently accept this field.
RateLimitThreshold int `toml:"rate_limit_threshold" json:"rate_limit_threshold,omitempty"`
// RateLimitCooldown is the number of seconds the circuit breaker stays OPEN before
// allowing a single probe request (transition OPEN → HALF-OPEN). If the probe
// succeeds the circuit closes; if rate-limited again it re-opens. Default: 60.
// Supported in file-based config; stdin JSON config does not currently accept this field.

Copilot uses AI. Check for mistakes.
Comment on lines +63 to +127
t.Parallel()
// Use a very short cooldown so the test doesn't sleep long.
cb := newCircuitBreaker("test", 1, 10*time.Millisecond)

cb.RecordRateLimit(time.Time{})
require.Equal(t, circuitOpen, cb.State(), "should be OPEN after 1 error")

// Wait for cooldown.
time.Sleep(20 * time.Millisecond)

err := cb.Allow()
assert.NoError(t, err, "should allow probe after cooldown")
assert.Equal(t, circuitHalfOpen, cb.State(), "should be HALF-OPEN after cooldown")
}

// TestCircuitBreaker_HalfOpenClosesOnSuccess verifies HALF-OPEN → CLOSED on probe success.
func TestCircuitBreaker_HalfOpenClosesOnSuccess(t *testing.T) {
t.Parallel()
cb := newCircuitBreaker("test", 1, 10*time.Millisecond)

cb.RecordRateLimit(time.Time{})
require.Equal(t, circuitOpen, cb.State())

time.Sleep(20 * time.Millisecond)
require.NoError(t, cb.Allow()) // probe allowed

cb.RecordSuccess()
assert.Equal(t, circuitClosed, cb.State(), "should be CLOSED after probe success")
assert.NoError(t, cb.Allow(), "CLOSED circuit should allow requests")
}

// TestCircuitBreaker_HalfOpenReOpensOnRateLimit verifies HALF-OPEN → OPEN on probe failure.
func TestCircuitBreaker_HalfOpenReOpensOnRateLimit(t *testing.T) {
t.Parallel()
cb := newCircuitBreaker("test", 1, 10*time.Millisecond)

cb.RecordRateLimit(time.Time{})
require.Equal(t, circuitOpen, cb.State())

time.Sleep(20 * time.Millisecond)
require.NoError(t, cb.Allow()) // probe allowed

cb.RecordRateLimit(time.Time{})
assert.Equal(t, circuitOpen, cb.State(), "should be OPEN again after probe is rate-limited")

err := cb.Allow()
require.Error(t, err)
var openErr *ErrCircuitOpen
require.ErrorAs(t, err, &openErr)
}

// TestCircuitBreaker_ResetAtFromHeader verifies the reset time from upstream is used.
func TestCircuitBreaker_ResetAtFromHeader(t *testing.T) {
t.Parallel()
cb := newCircuitBreaker("test", 1, 60*time.Second)
future := time.Now().Add(5 * time.Millisecond)
cb.RecordRateLimit(future)
require.Equal(t, circuitOpen, cb.State())

// Before the reset time: still OPEN.
require.Error(t, cb.Allow())

// After the reset time: transitions to HALF-OPEN.
time.Sleep(10 * time.Millisecond)
err := cb.Allow()
Copy link

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests use very short real-time sleeps (10–20ms) while also running in parallel. On slower/loaded CI runners this can be flaky (sleep may not exceed the cooldown/reset boundary due to scheduling jitter). Prefer injecting a clock into the circuit breaker or, if sticking with sleeps, increase durations and add a more generous margin (similar to other tests in the repo that use 2× TTL sleeps to avoid flakes).

Copilot uses AI. Check for mistakes.
@lpcox
Copy link
Copy Markdown
Collaborator

lpcox commented Apr 14, 2026

Review Feedback

I'm addressing all 5 existing review comments plus 2 additional findings. Pushing commits for each.

Addressing existing review feedback

1. HALF-OPEN allows multiple concurrent probes (circuit_breaker.go:135)
Adding probeInFlight bool — once a probe is allowed, subsequent Allow() calls in HALF-OPEN return ErrCircuitOpen until RecordSuccess/RecordRateLimit resolves the probe.

2. Nil map panic in getCircuitBreaker (unified.go:391)
Adding nil-map guard before writing.

3. Rate-limit result wraps in misleading ErrCircuitOpen (unified.go:593)
The circuit may not actually be open yet (1st or 2nd error), and the backend's original error text is lost. Fix: extract original text from backendResult, return it as the error content, and only return ErrCircuitOpen when cb.Allow() rejects the call.

4. JSON stdin config not wired (config_core.go:226)
Adding doc comments clarifying fields are file-config (TOML) only.

5. Flaky time-based tests (circuit_breaker_test.go:127)
Injecting a nowFunc into the circuit breaker so tests use deterministic time instead of time.Sleep.

Additional findings

6. Operator precedence bug in isRateLimitText (circuit_breaker.go:254)
strings.Contains(lower, "rate limit") && strings.Contains(lower, "403") binds correctly due to && > ||, but the missing parentheses make this fragile and confusing. Adding explicit parentheses.

7. Duplicate parseRateLimitReset
Both proxy/handler.go and server/circuit_breaker.go have identical Unix-timestamp parsers. Not fixing in this commit to keep scope manageable, but worth consolidating later.

…ckend errors

Address all 5 existing review comments plus 2 additional findings:

1. HALF-OPEN probe tracking: add probeInFlight flag so only one probe
   passes in HALF-OPEN state; concurrent requests get ErrCircuitOpen.
   Add TestCircuitBreaker_HalfOpenBlocksConcurrentProbes.

2. Nil map guard: getCircuitBreaker initializes circuitBreakers map
   when nil, preventing panic in test constructors that bypass NewUnified.

3. Preserve backend error text: rate-limit detection now returns the
   original upstream error message via extractRateLimitErrorText instead
   of wrapping in ErrCircuitOpen. ErrCircuitOpen is only returned when
   cb.Allow() rejects the call before contacting the backend.

4. TOML-only doc comments: clarify that rate_limit_threshold and
   rate_limit_cooldown are not wired for stdin JSON config.

5. Injectable clock: replace time.Now() in circuit breaker with nowFunc
   field (default time.Now). Tests use deterministic fake time instead
   of flaky time.Sleep-based assertions.

6. Operator precedence: add explicit parentheses in isRateLimitText
   for the compound rate limit + 403 condition.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@lpcox lpcox merged commit 53c24f3 into main Apr 14, 2026
16 checks passed
@lpcox lpcox deleted the copilot/add-rate-limit-circuit-breaker branch April 14, 2026 20:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Rate-limit circuit breaker for GitHub MCP backend tool calls

3 participants