
feat(llm): five-layer LLM resilience orchestrator for autonomous agent stability#293

Open
alexk-dev wants to merge 18 commits into main from feat/llm-500-resilience

Conversation


@alexk-dev alexk-dev commented Apr 16, 2026

Summary

  • Adds a five-layer resilience orchestrator (LlmResilienceOrchestrator) that prevents the autonomous agent loop from breaking when remote LLM providers return transient errors (HTTP 500, timeouts, rate limits)
  • Each layer operates at a different timescale: L1 hot retry with full jitter (seconds), L2 router-driven provider fallback, L3 per-provider circuit breaker (minutes), L4 graceful degradation — context compaction / model downgrade / tool stripping (seconds), L5 cold retry via DelayedSessionAction (minutes to hours: 2m → 5m → 15m → 1h)
  • L2 now uses the model-router fallback settings from the dashboard and runtime config, with sequential, round_robin, and weighted selection modes; the old random fallback path is not carried forward
  • Adds explicit tracing spans for each resilience layer (llm.resilience.L1 through llm.resilience.L5) plus llm.model.switch spans for fallback model changes
  • Exposes circuit breaker state in dashboard usage metrics and adds a CI Javadoc generation gate for public API documentation regressions
  • The orchestrator is toggled via resilience.json runtime config; when disabled, the existing retry logic is used unchanged
  • Fixes a bug where the retry cap was 3s (less than the 5s base delay); default is now 60s with AWS-style full jitter
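The L1 retry math above (base delay 5s, cap raised from the buggy 3s to the 60s default, AWS-style full jitter) can be sketched as follows. Class and method names here are illustrative, not the PR's actual LlmRetryPolicy API:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of L1 hot-retry delays with AWS-style "full jitter":
// delay = uniform(0, min(cap, base * 2^attempt)).
// The original bug: cap (3s) was below the base delay (5s), so the
// exponential curve was clipped before it could grow; cap must be >= base.
public class FullJitterBackoff {
    private final long baseDelayMs;
    private final long capMs;

    public FullJitterBackoff(long baseDelayMs, long capMs) {
        this.baseDelayMs = baseDelayMs;
        this.capMs = capMs;
    }

    // Exponential ceiling: min(cap, base * 2^attempt), attempt starting at 0.
    // Doubling in a loop avoids long overflow for large attempt values.
    public long ceilingMs(int attempt) {
        long ceiling = baseDelayMs;
        for (int i = 0; i < attempt && ceiling < capMs; i++) {
            ceiling *= 2;
        }
        return Math.min(ceiling, capMs);
    }

    // Full jitter: a uniform random delay in [0, ceiling].
    public long nextDelayMs(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMs(attempt) + 1);
    }
}
```

With the defaults from resilience.json (base 5000ms, cap 60000ms), the ceilings run 5s, 10s, 20s, 40s, 60s across the five hot-retry attempts.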

Architecture

LlmCallPhase.handleLlmError()
  └─ LlmResilienceOrchestrator.handle()
       ├─ L1: LlmRetryPolicy (exponential backoff + full jitter)
       ├─ L2: RuntimeConfigRouterFallbackSelector
       │    └─ Model router fallbacks: sequential / round_robin / weighted
       ├─ L3: ProviderCircuitBreaker (CLOSED → OPEN → HALF_OPEN per provider)
       ├─ L4: RecoveryStrategy chain
       │    ├─ ContextCompactionRecoveryStrategy
       │    ├─ ModelDowngradeRecoveryStrategy
       │    └─ ToolStripRecoveryStrategy
       └─ L5: SuspendedTurnManager (→ DelayedSessionAction RETRY_LLM_TURN)
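The top-to-bottom dispatch in the diagram reads as a chain where each layer either produces a decision or defers to the next, with L5 as the terminal fallback. A minimal sketch of that control flow; the Decision enum and all names are assumptions, not the PR's actual types:

```java
import java.util.List;
import java.util.function.Supplier;

// Illustrative layered-dispatch skeleton: each layer returns a Decision
// or null to defer downward. The real orchestrator passes error/breaker
// context into each layer; this sketch shows only the cascade shape.
public class LayerCascade {
    public enum Decision { HOT_RETRY, FALLBACK_PROVIDER, FAST_FAIL_OPEN, DEGRADE, COLD_RETRY }

    public static Decision decide(List<Supplier<Decision>> layers) {
        for (Supplier<Decision> layer : layers) {
            Decision d = layer.get();
            if (d != null) {
                return d;  // first layer with an opinion wins
            }
        }
        return Decision.COLD_RETRY;  // L5 is the terminal fallback
    }
}
```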

Key design decisions

  • Orchestrator is stateless per call — mutable state lives in ProviderCircuitBreaker (L3), per-turn context attributes (L2 fallback attempts), and DelayedSessionAction persistence (L5)
  • L2 reuses router settings — fallback chains are configured next to each model tier in the dashboard and persisted through model-router.json
  • Fallback strategies are deterministic by default: sequential is the compatibility/default mode; round_robin and weighted are opt-in per tier
  • L5 reuses DelayedSessionAction infrastructure (file-based JSON, lease-based polling, dead-letter) instead of adding SQLite — consistent with existing persistence model
  • L4 strategies are ordered: compaction first (cheapest), then model downgrade, then tool stripping (most aggressive)
  • Tracing is explicit — each fallback layer gets its own span, and model changes are represented by a dedicated llm.model.switch span instead of being hidden in LLM retry logs
  • Backward compatible — the orchestrator is null-safe and the old retry path still works when resilience is disabled
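The L3 state machine (CLOSED → OPEN → HALF_OPEN) with the synchronized recordSuccess/recordFailure mentioned in the commits can be sketched roughly as below. This simplified version trips on consecutive failures rather than the PR's rolling failure window, and every name is illustrative:

```java
import java.time.Duration;
import java.time.Instant;

// Minimal per-provider circuit breaker sketch.
// CLOSED: calls pass through, failures are counted.
// OPEN: calls fast-fail until the cooldown elapses.
// HALF_OPEN: one probe is admitted; success closes, failure re-trips.
public class CircuitBreakerSketch {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;
    private Instant openedAt;

    public CircuitBreakerSketch(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    // Called before each LLM call; an OPEN breaker vetoes the call until
    // the cooldown elapses, then transitions to HALF_OPEN for a probe.
    public synchronized boolean allowRequest(Instant now) {
        if (state == State.OPEN
                && Duration.between(openedAt, now).compareTo(openDuration) >= 0) {
            state = State.HALF_OPEN;
        }
        return state != State.OPEN;
    }

    // Synchronized alongside recordFailure so a success cannot produce a
    // half-reset snapshot that trips the breaker back open.
    public synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    public synchronized void recordFailure(Instant now) {
        consecutiveFailures++;
        if (state == State.HALF_OPEN || consecutiveFailures >= failureThreshold) {
            state = State.OPEN;  // failed probe or threshold reached: trip
            openedAt = now;
        }
    }

    public synchronized State state() { return state; }
}
```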

Files changed

| Area | Files | What |
| --- | --- | --- |
| New: resilience package | 9 files in toolloop/resilience/ | Orchestrator, retry policy, circuit breaker, router fallback selector, 3 recovery strategies, suspended turn manager, Spring config |
| Model router fallback UI | TierFallbacksPage, FallbackRow, ModelsTab, model router mappers/types | Dashboard editor for per-tier fallback chains and sequential / round_robin / weighted modes |
| Model/config | RuntimeConfig, ContextAttributes, FallbackModes, RuntimeConfigService, RuntimeSettingsMergeService, DelayedActionKind | Resilience config section, router fallback fields, resilience context keys, RETRY_LLM_TURN kind |
| Integration | LlmCallPhase, DefaultToolLoopSystem, ToolLoopAutoConfiguration, ModelSelectionService | Wire the orchestrator into LLM error handling and apply selected router fallback models on retry |
| Tracing | LlmCallPhase, resilience trace DTOs/tests | Emit llm.resilience.* spans, llm.model.switch, model before/after attributes, and terminal error status only on exhausted L5 |
| Dashboard metrics | UsageController, dashboard/src/api/usage.ts | Export llm.circuit_breaker.state metrics with provider/state tags |
| Persistence | StorageRuntimeConfigPersistenceAdapter, RuntimeConfigService | Load/persist resilience and model-router fallback config sections |
| Enum completeness | AutomationCommandHandler, ScheduleSessionActionTool | Handle the new RETRY_LLM_TURN enum value |
| i18n | messages_en.properties, messages_ru.properties | Add command.later.kind.retry-llm |
| Quality gate | .github/workflows/docker-publish.yml | Generate Javadoc in CI code-quality checks |
| Tests | resilience, LLM call phase, runtime config/settings, dashboard usage, tool loop auto config | Unit coverage for each resilience layer, full L1→L5 cascade integration test, router fallback strategy tests, tracing assertions, dashboard circuit breaker metric tests |

Configuration (model-router.json)

{
  "tiers": {
    "balanced": {
      "model": { "provider": "openai", "id": "gpt-5.1" },
      "reasoning": "none",
      "fallbackMode": "weighted",
      "fallbacks": [
        {
          "model": { "provider": "openrouter", "id": "anthropic/claude-sonnet-4" },
          "reasoning": "medium",
          "weight": 2.0
        },
        {
          "model": { "provider": "openai", "id": "gpt-5.1-mini" },
          "reasoning": "none",
          "weight": 1.0
        }
      ]
    }
  }
}
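For the weighted mode configured above, selection proportional to the weight fields could look like this sketch (the record shape and method names are assumptions, not the PR's mapper types; with weights 2.0 and 1.0, the first fallback is chosen roughly twice as often):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;

// Sketch of "weighted" fallback selection: each fallback is picked with
// probability proportional to its weight (roulette-wheel selection).
public class WeightedFallbackSelector {
    public record Fallback(String modelId, double weight) {}

    // roll is a uniform sample in [0, 1); split out for deterministic tests.
    public static Fallback select(List<Fallback> fallbacks, double roll) {
        double total = fallbacks.stream().mapToDouble(Fallback::weight).sum();
        double target = roll * total;
        double cumulative = 0;
        for (Fallback f : fallbacks) {
            cumulative += f.weight();
            if (target < cumulative) {
                return f;
            }
        }
        return fallbacks.get(fallbacks.size() - 1);  // guard against rounding
    }

    public static Fallback select(List<Fallback> fallbacks) {
        return select(fallbacks, ThreadLocalRandom.current().nextDouble());
    }
}
```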

Configuration (resilience.json)

{
  "enabled": true,
  "hotRetryMaxAttempts": 5,
  "hotRetryBaseDelayMs": 5000,
  "hotRetryCapMs": 60000,
  "circuitBreakerFailureThreshold": 5,
  "circuitBreakerWindowSeconds": 60,
  "circuitBreakerOpenDurationSeconds": 120,
  "degradationCompactContext": true,
  "degradationCompactMinMessages": 6,
  "degradationDowngradeModel": true,
  "degradationFallbackModelTier": "balanced",
  "degradationStripTools": true,
  "coldRetryEnabled": true,
  "coldRetryMaxAttempts": 4
}
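A sketch of how coldRetryMaxAttempts could map onto the 2m → 5m → 15m → 1h ladder from the summary; the fixed-ladder mapping and class name are assumptions, not the PR's SuspendedTurnManager logic:

```java
import java.time.Duration;
import java.util.List;

// Illustrative L5 cold-retry schedule: DelayedSessionAction delays climb a
// fixed ladder, and attempts beyond it reuse the longest delay.
public class ColdRetrySchedule {
    private static final List<Duration> LADDER = List.of(
            Duration.ofMinutes(2),
            Duration.ofMinutes(5),
            Duration.ofMinutes(15),
            Duration.ofHours(1));

    // attempt is 0-based; with coldRetryMaxAttempts=4 each attempt maps
    // onto one rung, after which the turn goes to the dead-letter path.
    public static Duration delayFor(int attempt) {
        return LADDER.get(Math.min(attempt, LADDER.size() - 1));
    }
}
```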

alexk-dev and others added 18 commits on April 15, 2026 at 21:41
Introduces a multi-layer defense system for LLM API failures (HTTP 500,
timeouts, rate limits) that prevents the autonomous agent loop from
breaking when remote providers are unstable.

Layers:
- L1: Hot retry with full jitter (fixes cap=3s bug → 60s default)
- L2: Provider fallback (stub — requires llm-router config design)
- L3: Per-provider circuit breaker (CLOSED → OPEN → HALF_OPEN)
- L4: Graceful degradation (context compaction, model downgrade, tool strip)
- L5: Cold retry via DelayedSessionAction (2m → 5m → 15m → 1h backoff)

The orchestrator integrates into LlmCallPhase.handleLlmError() and is
enabled via resilience.json runtime config. When disabled, the existing
retry logic is used unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Snapshot the breaker once per handle() so L1 and L3 share one decision.
An OPEN breaker now vetoes both L1 hot retry (which would hammer a
provider in cooldown) and L4 degradation (whose recovery would route
back to the same broken provider), routing straight to L5 instead.

Also lock recordSuccess against concurrent recordFailure to prevent
half-reset snapshots tripping the breaker back open, and add regression
tests for the breaker fast-fail paths and large-attempt overflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>