
feat(llm): five-layer LLM resilience orchestrator for autonomous agent stability#293

Open
alexk-dev wants to merge 18 commits into main from feat/llm-500-resilience

Conversation


@alexk-dev alexk-dev commented Apr 16, 2026

Summary

  • Adds a five-layer resilience orchestrator (LlmResilienceOrchestrator) that prevents the autonomous agent loop from breaking when remote LLM providers return transient errors (HTTP 500, timeouts, rate limits)
  • Each layer operates at a different timescale: L1 hot retry with full jitter (seconds), L2 router-driven provider fallback, L3 per-provider circuit breaker (minutes), L4 graceful degradation — context compaction / model downgrade / tool stripping (seconds), L5 cold retry via DelayedSessionAction (minutes to hours: 2m → 5m → 15m → 1h)
  • L2 now uses the model-router fallback settings from the dashboard and runtime config, with sequential, round_robin, and weighted selection modes; the old random fallback path is not carried forward
  • Adds explicit tracing spans for each resilience layer (llm.resilience.L1 through llm.resilience.L5) plus llm.model.switch spans for fallback model changes
  • Exposes circuit breaker state in dashboard usage metrics and adds a CI Javadoc generation gate for public API documentation regressions
  • The orchestrator is toggled via resilience.json runtime config; when disabled, the existing retry logic is used unchanged
  • Fixes a bug where the retry cap was 3s (less than the 5s base delay); default is now 60s with AWS-style full jitter
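The L1 retry math above (base delay 5s, cap raised from the buggy 3s to the 60s default, AWS-style full jitter) can be sketched as follows. Class and method names here are illustrative, not the PR's actual LlmRetryPolicy API:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of L1 hot-retry delays with AWS-style "full jitter":
// delay = uniform(0, min(cap, base * 2^attempt)).
// The original bug: cap (3s) was below the base delay (5s), so the
// exponential curve was clipped before it could grow; cap must be >= base.
public class FullJitterBackoff {
    private final long baseDelayMs;
    private final long capMs;

    public FullJitterBackoff(long baseDelayMs, long capMs) {
        this.baseDelayMs = baseDelayMs;
        this.capMs = capMs;
    }

    // Exponential ceiling: min(cap, base * 2^attempt), attempt starting at 0.
    // Doubling in a loop avoids long overflow for large attempt values.
    public long ceilingMs(int attempt) {
        long ceiling = baseDelayMs;
        for (int i = 0; i < attempt && ceiling < capMs; i++) {
            ceiling *= 2;
        }
        return Math.min(ceiling, capMs);
    }

    // Full jitter: a uniform random delay in [0, ceiling].
    public long nextDelayMs(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMs(attempt) + 1);
    }
}
```

With the defaults from resilience.json (base 5000ms, cap 60000ms), the ceilings run 5s, 10s, 20s, 40s, 60s across the five hot-retry attempts.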

Architecture

LlmCallPhase.handleLlmError()
  └─ LlmResilienceOrchestrator.handle()
       ├─ L1: LlmRetryPolicy (exponential backoff + full jitter)
       ├─ L2: RuntimeConfigRouterFallbackSelector
       │    └─ Model router fallbacks: sequential / round_robin / weighted
       ├─ L3: ProviderCircuitBreaker (CLOSED → OPEN → HALF_OPEN per provider)
       ├─ L4: RecoveryStrategy chain
       │    ├─ ContextCompactionRecoveryStrategy
       │    ├─ ModelDowngradeRecoveryStrategy
       │    └─ ToolStripRecoveryStrategy
       └─ L5: SuspendedTurnManager (→ DelayedSessionAction RETRY_LLM_TURN)
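The top-to-bottom dispatch in the diagram reads as a chain where each layer either produces a decision or defers to the next, with L5 as the terminal fallback. A minimal sketch of that control flow; the Decision enum and all names are assumptions, not the PR's actual types:

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative layered-dispatch skeleton: each layer returns a Decision
// or null to defer downward. The real orchestrator passes error/breaker
// context into each layer; this sketch shows only the cascade shape.
public class LayerCascade {
    public enum Decision { HOT_RETRY, FALLBACK_PROVIDER, FAST_FAIL_OPEN, DEGRADE, COLD_RETRY }

    public static Decision decide(List<Supplier<Decision>> layers) {
        for (Supplier<Decision> layer : layers) {
            Decision d = layer.get();
            if (d != null) {
                return d;  // first layer with an opinion wins
            }
        }
        return Decision.COLD_RETRY;  // L5 is the terminal fallback
    }
}
```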

Key design decisions

  • Orchestrator is stateless per call — mutable state lives in ProviderCircuitBreaker (L3), per-turn context attributes (L2 fallback attempts), and DelayedSessionAction persistence (L5)
  • L2 reuses router settings — fallback chains are configured next to each model tier in the dashboard and persisted through model-router.json
  • Fallback strategies are deterministic by default: sequential is the compatibility/default mode; round_robin and weighted are opt-in per tier
  • L5 reuses DelayedSessionAction infrastructure (file-based JSON, lease-based polling, dead-letter) instead of adding SQLite — consistent with existing persistence model
  • L4 strategies are ordered: compaction first (cheapest), then model downgrade, then tool stripping (most aggressive)
  • Tracing is explicit — each fallback layer gets its own span, and model changes are represented by a dedicated llm.model.switch span instead of being hidden in LLM retry logs
  • Backward compatible — the orchestrator is null-safe and the old retry path still works when resilience is disabled
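The L3 state machine (CLOSED → OPEN → HALF_OPEN) with the synchronized recordSuccess/recordFailure mentioned in the commits can be sketched roughly as below. This simplified version trips on consecutive failures rather than the PR's rolling failure window, and every name is illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// Minimal per-provider circuit breaker sketch.
// CLOSED: calls pass through, failures are counted.
// OPEN: calls fast-fail until the cooldown elapses.
// HALF_OPEN: one probe is admitted; success closes, failure re-trips.
public class CircuitBreakerSketch {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;
    private Instant openedAt;

    public CircuitBreakerSketch(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    // Called before each LLM call; an OPEN breaker vetoes the call until
    // the cooldown elapses, then transitions to HALF_OPEN for a probe.
    public synchronized boolean allowRequest(Instant now) {
        if (state == State.OPEN
                && Duration.between(openedAt, now).compareTo(openDuration) >= 0) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    // Synchronized alongside recordFailure so a success cannot produce a
    // half-reset snapshot that trips the breaker back open.
    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    public synchronized void recordFailure(Instant now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;  // failed probe or threshold reached: trip
            openedAt = now;
        }
    }

    public synchronized State state() { return state; }
}
```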

Files changed

| Area | Files | What |
| --- | --- | --- |
| New: resilience package | 9 files in toolloop/resilience/ | Orchestrator, retry policy, circuit breaker, router fallback selector, 3 recovery strategies, suspended turn manager, Spring config |
| Model router fallback UI | TierFallbacksPage, FallbackRow, ModelsTab, model router mappers/types | Dashboard editor for per-tier fallback chains and sequential / round_robin / weighted modes |
| Model/config | RuntimeConfig, ContextAttributes, FallbackModes, RuntimeConfigService, RuntimeSettingsMergeService, DelayedActionKind | Resilience config section, router fallback fields, resilience context keys, RETRY_LLM_TURN kind |
| Integration | LlmCallPhase, DefaultToolLoopSystem, ToolLoopAutoConfiguration, ModelSelectionService | Wire the orchestrator into LLM error handling and apply selected router fallback models on retry |
| Tracing | LlmCallPhase, resilience trace DTOs/tests | Emit llm.resilience.* spans, llm.model.switch, model before/after attributes, and terminal error status only on exhausted L5 |
| Dashboard metrics | UsageController, dashboard/src/api/usage.ts | Export llm.circuit_breaker.state metrics with provider/state tags |
| Persistence | StorageRuntimeConfigPersistenceAdapter, RuntimeConfigService | Load/persist resilience and model-router fallback config sections |
| Enum completeness | AutomationCommandHandler, ScheduleSessionActionTool | Handle the new RETRY_LLM_TURN enum value |
| i18n | messages_en.properties, messages_ru.properties | Add command.later.kind.retry-llm |
| Quality gate | .github/workflows/docker-publish.yml | Generate Javadoc in CI code-quality checks |
| Tests | resilience, LLM call phase, runtime config/settings, dashboard usage, tool loop auto config | Unit coverage for each resilience layer, full L1→L5 cascade integration test, router fallback strategy tests, tracing assertions, dashboard circuit breaker metric tests |

Configuration (model-router.json)

{
  "tiers": {
    "balanced": {
      "model": { "provider": "openai", "id": "gpt-5.1" },
      "reasoning": "none",
      "fallbackMode": "weighted",
      "fallbacks": [
        {
          "model": { "provider": "openrouter", "id": "anthropic/claude-sonnet-4" },
          "reasoning": "medium",
          "weight": 2.0
        },
        {
          "model": { "provider": "openai", "id": "gpt-5.1-mini" },
          "reasoning": "none",
          "weight": 1.0
        }
      ]
    }
  }
}
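For the weighted mode configured above, selection proportional to the weight fields could look like this sketch (the record shape and method names are assumptions, not the PR's mapper types; with weights 2.0 and 1.0, the first fallback is chosen roughly twice as often):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of "weighted" fallback selection: each fallback is picked with
// probability proportional to its weight (roulette-wheel selection).
public class WeightedFallbackSelector {
    public record Fallback(String modelId, double weight) {}

    // roll is a uniform sample in [0, 1); split out for deterministic tests.
    public static Fallback select(List<Fallback> fallbacks, double roll) {
        double total = fallbacks.stream().mapToDouble(Fallback::weight).sum();
        double target = roll * total;
        double cumulative = 0;
        for (Fallback f : fallbacks) {
            cumulative += f.weight();
            if (target < cumulative) {
                return f;
            }
        }
        return fallbacks.get(fallbacks.size() - 1);  // guard against rounding
    }

    public static Fallback select(List<Fallback> fallbacks) {
        return select(fallbacks, ThreadLocalRandom.current().nextDouble());
    }
}
```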

Configuration (resilience.json)

{
  "enabled": true,
  "hotRetryMaxAttempts": 5,
  "hotRetryBaseDelayMs": 5000,
  "hotRetryCapMs": 60000,
  "circuitBreakerFailureThreshold": 5,
  "circuitBreakerWindowSeconds": 60,
  "circuitBreakerOpenDurationSeconds": 120,
  "degradationCompactContext": true,
  "degradationCompactMinMessages": 6,
  "degradationDowngradeModel": true,
  "degradationFallbackModelTier": "balanced",
  "degradationStripTools": true,
  "coldRetryEnabled": true,
  "coldRetryMaxAttempts": 4
}
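A sketch of how coldRetryMaxAttempts could map onto the 2m → 5m → 15m → 1h ladder from the summary; the fixed-ladder mapping and class name are assumptions, not the PR's SuspendedTurnManager logic:

```java
import java.time.Duration;
import java.util.List;

// Illustrative L5 cold-retry schedule: DelayedSessionAction delays climb a
// fixed ladder, and attempts beyond it reuse the longest delay.
public class ColdRetrySchedule {
    private static final List<Duration> LADDER = List.of(
            Duration.ofMinutes(2),
            Duration.ofMinutes(5),
            Duration.ofMinutes(15),
            Duration.ofHours(1));

    // attempt is 0-based; with coldRetryMaxAttempts=4 each attempt maps
    // onto one rung, after which the turn goes to the dead-letter path.
    public static Duration delayFor(int attempt) {
        return LADDER.get(Math.min(attempt, LADDER.size() - 1));
    }
}
```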

alexk-dev and others added 18 commits on April 15, 2026 at 21:41
Introduces a multi-layer defense system for LLM API failures (HTTP 500,
timeouts, rate limits) that prevents the autonomous agent loop from
breaking when remote providers are unstable.

Layers:
- L1: Hot retry with full jitter (fixes cap=3s bug → 60s default)
- L2: Provider fallback (stub — requires llm-router config design)
- L3: Per-provider circuit breaker (CLOSED → OPEN → HALF_OPEN)
- L4: Graceful degradation (context compaction, model downgrade, tool strip)
- L5: Cold retry via DelayedSessionAction (2m → 5m → 15m → 1h backoff)

The orchestrator integrates into LlmCallPhase.handleLlmError() and is
enabled via resilience.json runtime config. When disabled, the existing
retry logic is used unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snapshot the breaker once per handle() so L1 and L3 share one decision.
An OPEN breaker now vetoes both L1 hot retry (which would hammer a
provider in cooldown) and L4 degradation (whose recovery would route
back to the same broken provider), routing straight to L5 instead.

Also lock recordSuccess against concurrent recordFailure to prevent
half-reset snapshots tripping the breaker back open, and add regression
tests for the breaker fast-fail paths and large-attempt overflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>