Add shared resilience primitives — circuit breaker, retry, backoff, fallback #32

@haasonsaas

Problem

service-runtime centralizes auth, events, audit, observability, idempotency, and database helpers — but contains zero resilience primitives. No circuit breaker, no retry, no backoff, no fallback.

Meanwhile, at least 6 services have independently implemented circuit breakers:

  • fermata: apps/api/utils/circuit_breaker.py, apps/api/services/circuit_breaker.py
  • maestro: src/server/circuit-breaker.ts, src/safety/circuit-breaker.ts
  • ensemble: src/lib/integrations-infra/integration-circuit-breaker.ts
  • cadence: packages/cadence-web/convex/integrations/circuitBreaker.ts
  • gate: internal/jobs/circuit_breaker.go
  • cerebro: internal/providers/http_client.go

Similarly, retry/backoff exists in gate (internal/resilience/retry.go, internal/providers/http_retry.go), cerebro, and others. Each implementation was written independently, with its own thresholds and failure semantics and no shared behavior.

A circuit breaker is the pattern that would prevent the "identity goes down → gateway authentication fails → every managed-mode agent stops" cascade. A shared circuit breaker in identityclient with a configurable fallback would give the gateway the degradation path it currently lacks.

Proposed packages

resilience/circuitbreaker

  • Three-state (closed → open → half-open) circuit breaker
  • Configurable failure threshold, success threshold, and open duration
  • Prometheus metrics for state transitions
  • Compatible with httpkit middleware chain

resilience/retry

  • Configurable retry with exponential backoff and jitter
  • Context-aware (respects cancellation)
  • Classifiable errors (retryable vs. permanent)
  • Compatible with identityclient and natsbus callers

resilience/fallback

  • Wraps a primary function with a fallback on circuit-open or timeout
  • Designed for the identity introspection use case: try live introspection → fall back to cached result

First consumer: identityclient

The highest-value integration is wrapping identityclient.Introspect() with circuit breaker + cached fallback. This directly addresses the LLM Gateway's identity dependency (see evalops/llm-gateway#48).

Why this matters

The mesh optimized for correctness at rest (typed contracts, identity tokens, audit trails) but not for correctness under failure. The one pattern that directly addresses cascade failure risk — the single biggest infrastructure concern — was never promoted to the shared layer.

Context

Identified during org-wide architecture review (2026-04-12). Related: evalops/llm-gateway#48 (identity fallback), evalops/deploy#4 (NATS clustering).
