Problem
service-runtime centralizes auth, events, audit, observability, idempotency, and database helpers — but contains zero resilience primitives. No circuit breaker, no retry, no backoff, no fallback.
Meanwhile, at least 6 services have independently implemented circuit breakers:
- fermata: apps/api/utils/circuit_breaker.py, apps/api/services/circuit_breaker.py
- maestro: src/server/circuit-breaker.ts, src/safety/circuit-breaker.ts
- ensemble: src/lib/integrations-infra/integration-circuit-breaker.ts
- cadence: packages/cadence-web/convex/integrations/circuitBreaker.ts
- gate: internal/jobs/circuit_breaker.go
- cerebro: internal/providers/http_client.go
Similarly, retry/backoff exists in gate (internal/resilience/retry.go, internal/providers/http_retry.go), cerebro, and others. Each implementation was written independently, with its own thresholds and failure semantics and no shared behavior.
Shared resilience primitives are what would prevent the "identity goes down → gateway authentication fails → every managed-mode agent stops" cascade. A shared circuit breaker in identityclient with a configurable fallback would give the gateway the degradation path it currently lacks.
Proposed packages
resilience/circuitbreaker
- Three-state (closed → open → half-open) circuit breaker
- Configurable failure threshold, success threshold, and open duration
- Prometheus metrics for state transitions
- Compatible with the httpkit middleware chain
resilience/retry
- Configurable retry with exponential backoff and jitter
- Context-aware (respects cancellation)
- Classifiable errors (retryable vs. permanent)
- Compatible with identityclient and natsbus callers
resilience/fallback
- Wraps a primary function with a fallback on circuit-open or timeout
- Designed for the identity introspection use case: try live introspection → fall back to cached result
First consumer: identityclient
The highest-value integration is wrapping identityclient.Introspect() with circuit breaker + cached fallback. This directly addresses the LLM Gateway's identity dependency (see evalops/llm-gateway#48).
Why this matters
The mesh is optimized for correctness at rest (typed contracts, identity tokens, audit trails) but not for correctness under failure. The one pattern that directly addresses cascade failure risk, the single biggest infrastructure concern, was never promoted to the shared layer.
Context
Identified during org-wide architecture review (2026-04-12). Related: evalops/llm-gateway#48 (identity fallback), evalops/deploy#4 (NATS clustering).