feat(llm): five-layer LLM resilience orchestrator for autonomous agent stability #293
Open
Introduces a multi-layer defense system for LLM API failures (HTTP 500, timeouts, rate limits) that prevents the autonomous agent loop from breaking when remote providers are unstable.

Layers:
- L1: Hot retry with full jitter (fixes cap=3s bug → 60s default)
- L2: Provider fallback (stub; requires llm-router config design)
- L3: Per-provider circuit breaker (CLOSED → OPEN → HALF_OPEN)
- L4: Graceful degradation (context compaction, model downgrade, tool strip)
- L5: Cold retry via DelayedSessionAction (2m → 5m → 15m → 1h backoff)

The orchestrator integrates into LlmCallPhase.handleLlmError() and is enabled via resilience.json runtime config. When disabled, the existing retry logic is used unchanged.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
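The L1 hot retry described above (full jitter with the cap raised from 3s to 60s) could look roughly like this sketch; the class and method names are illustrative, not the PR's actual code, and the defaults mirror the `resilience.json` values below.

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of L1 full-jitter backoff (illustrative, not the PR's actual class).
public final class HotRetryBackoff {
    private final long baseDelayMs;
    private final long capMs;

    public HotRetryBackoff(long baseDelayMs, long capMs) {
        this.baseDelayMs = baseDelayMs; // e.g. 5_000
        this.capMs = capMs;             // e.g. 60_000 after the 3s-cap bug fix
    }

    /** Exponential ceiling for a 0-based attempt, clamped to the cap. */
    public long ceilingMs(int attempt) {
        long ceiling = baseDelayMs;
        // Doubling stops once the cap is reached, so even very large
        // attempt counts cannot overflow a long.
        for (int i = 0; i < attempt && ceiling < capMs; i++) {
            ceiling = Math.min(capMs, ceiling * 2);
        }
        return Math.min(ceiling, capMs);
    }

    /** Full jitter: a uniform random delay in [0, ceiling]. */
    public long nextDelayMs(int attempt) {
        return ThreadLocalRandom.current().nextLong(ceilingMs(attempt) + 1);
    }
}
```

Full jitter (random in the whole `[0, ceiling]` range, rather than around the midpoint) spreads concurrent retries out, which matters when many agent turns hit the same failing provider at once.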
Snapshot the breaker once per handle() so L1 and L3 share one decision. An OPEN breaker now vetoes both L1 hot retry (which would hammer a provider in cooldown) and L4 degradation (whose recovery would route back to the same broken provider), routing straight to L5 instead. Also lock recordSuccess against concurrent recordFailure to prevent half-reset snapshots from tripping the breaker back open, and add regression tests for the breaker fast-fail paths and large-attempt overflow.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
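The breaker behavior this commit describes (one snapshot per handle(), success and failure recording under the same lock) can be sketched as follows. The state names match the PR; the internals are illustrative, and the sliding failure window (`circuitBreakerWindowSeconds`) is omitted for brevity.

```java
import java.time.Duration;
import java.time.Instant;

// Sketch of the L3 per-provider breaker; internals are illustrative.
public final class ProviderCircuitBreaker {
    public enum State { CLOSED, OPEN, HALF_OPEN }

    private final int failureThreshold;
    private final Duration openDuration;
    private State state = State.CLOSED;
    private int failures = 0;
    private Instant openedAt;

    public ProviderCircuitBreaker(int failureThreshold, Duration openDuration) {
        this.failureThreshold = failureThreshold;
        this.openDuration = openDuration;
    }

    /** Taken once per handle() so L1 and L3 share one decision. */
    public synchronized State snapshot(Instant now) {
        if (state == State.OPEN && now.isAfter(openedAt.plus(openDuration))) {
            state = State.HALF_OPEN; // cooldown elapsed: allow a single probe
        }
        return state;
    }

    // recordFailure and recordSuccess share the monitor, so a success cannot
    // half-reset the counters while a concurrent failure is being recorded.
    public synchronized void recordFailure(Instant now) {
        failures++;
        if (state == State.HALF_OPEN || failures >= failureThreshold) {
            state = State.OPEN; // a failed probe, or threshold reached
            openedAt = now;
            failures = 0;
        }
    }

    public synchronized void recordSuccess() {
        failures = 0;
        state = State.CLOSED;
    }
}
```

A probe failure in HALF_OPEN reopens the breaker immediately, while a success fully resets it, which is the standard CLOSED → OPEN → HALF_OPEN cycle.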
Summary
- Adds a resilience orchestrator (`LlmResilienceOrchestrator`) that prevents the autonomous agent loop from breaking when remote LLM providers return transient errors (HTTP 500, timeouts, rate limits)
- Cold retries are scheduled through `DelayedSessionAction` (minutes to hours: 2m → 5m → 15m → 1h)
- Emits `llm.resilience.L1`…`llm.resilience.L5` spans plus `llm.model.switch` spans for fallback model changes
- Controlled by the `resilience.json` runtime config; when disabled, the existing retry logic is used unchanged

Architecture
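The 2m → 5m → 15m → 1h cold-retry schedule above could be expressed as a simple step table; this helper is hypothetical, only the step values come from the PR description, and the behavior past `coldRetryMaxAttempts` (reusing the last step) is an assumption.

```java
import java.time.Duration;
import java.util.List;

// Illustrative L5 cold-retry schedule for DelayedSessionAction delays.
public final class ColdRetrySchedule {
    private static final List<Duration> STEPS = List.of(
            Duration.ofMinutes(2),
            Duration.ofMinutes(5),
            Duration.ofMinutes(15),
            Duration.ofHours(1));

    /** Delay for a 1-based attempt; attempts past the table reuse the last step. */
    public static Duration delayFor(int attempt) {
        int idx = Math.min(Math.max(attempt, 1), STEPS.size()) - 1;
        return STEPS.get(idx);
    }
}
```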
Key design decisions
- State lives in `ProviderCircuitBreaker` (L3), per-turn context attributes (L2 fallback attempts), and `DelayedSessionAction` persistence (L5)
- In `model-router.json`, `sequential` is the compatibility/default mode; `round_robin` and `weighted` are opt-in per tier
- Reuses the existing `DelayedSessionAction` infrastructure (file-based JSON, lease-based polling, dead-letter) instead of adding SQLite, consistent with the existing persistence model
- Model downgrades are surfaced via an `llm.model.switch` span instead of being hidden in LLM retry logs

Files changed

- `toolloop/resilience/`
- `TierFallbacksPage`, `FallbackRow`, `ModelsTab`, model router mappers/types
- `RuntimeConfig`, `ContextAttributes`, `FallbackModes`, `RuntimeConfigService`, `RuntimeSettingsMergeService`, `DelayedActionKind` (`RETRY_LLM_TURN` kind)
- `LlmCallPhase`, `DefaultToolLoopSystem`, `ToolLoopAutoConfiguration`, `ModelSelectionService`
- `LlmCallPhase`, resilience trace DTOs/tests: `llm.resilience.*` spans, `llm.model.switch`, model before/after attributes, and terminal error status only on exhausted L5
- `UsageController`, `dashboard/src/api/usage.ts`: `llm.circuit_breaker.state` metrics with provider/state tags
- Storage: `RuntimeConfigPersistenceAdapter`, `RuntimeConfigService`
- `AutomationCommandHandler`, `ScheduleSessionActionTool`: `RETRY_LLM_TURN` enum value
- `messages_en.properties`, `messages_ru.properties`: `command.later.kind.retry-llm`
- `.github/workflows/docker-publish.yml`

Configuration (`model-router.json`)

```json
{
  "tiers": {
    "balanced": {
      "model": { "provider": "openai", "id": "gpt-5.1" },
      "reasoning": "none",
      "fallbackMode": "weighted",
      "fallbacks": [
        { "model": { "provider": "openrouter", "id": "anthropic/claude-sonnet-4" }, "reasoning": "medium", "weight": 2.0 },
        { "model": { "provider": "openai", "id": "gpt-5.1-mini" }, "reasoning": "none", "weight": 1.0 }
      ]
    }
  }
}
```

Configuration (`resilience.json`)

```json
{
  "enabled": true,
  "hotRetryMaxAttempts": 5,
  "hotRetryBaseDelayMs": 5000,
  "hotRetryCapMs": 60000,
  "circuitBreakerFailureThreshold": 5,
  "circuitBreakerWindowSeconds": 60,
  "circuitBreakerOpenDurationSeconds": 120,
  "degradationCompactContext": true,
  "degradationCompactMinMessages": 6,
  "degradationDowngradeModel": true,
  "degradationFallbackModelTier": "balanced",
  "degradationStripTools": true,
  "coldRetryEnabled": true,
  "coldRetryMaxAttempts": 4
}
```
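For the `weighted` fallback mode in `model-router.json`, a proportional pick over the fallback weights is one plausible implementation; this sketch is hypothetical, and the record and field names are illustrative rather than the PR's mapper types.

```java
import java.util.List;
import java.util.Random;

// Hypothetical weighted fallback selection: each fallback is chosen with
// probability proportional to its weight (2.0 vs 1.0 in the config above).
public final class WeightedFallbackPicker {
    public record Fallback(String provider, String modelId, double weight) {}

    public static Fallback pick(List<Fallback> fallbacks, Random rnd) {
        double total = fallbacks.stream().mapToDouble(Fallback::weight).sum();
        double r = rnd.nextDouble() * total; // point in [0, total)
        for (Fallback f : fallbacks) {
            r -= f.weight();
            if (r <= 0) return f;
        }
        return fallbacks.get(fallbacks.size() - 1); // guard against FP rounding
    }
}
```

With the weights above, the `openrouter` fallback would be tried roughly twice as often as `gpt-5.1-mini`, while `sequential` mode would always try fallbacks in declaration order.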