Error Model
How failures propagate through stud-cli — which errors are recoverable, which are terminal, and who owns the next move.
flowchart TD
Err[failure observed] --> Class{class?}
Class -->|validation| Fatal1[session does not start]
Class -->|provider transient| Retry1[retry with backoff]
Class -->|provider capability| Fail1[fail fast; user picks new choice]
Class -->|tool transient| Retry2[retry per tool policy]
Class -->|tool terminal| Fail2[surface; model may react]
Class -->|session-store| Fatal2[snapshot fails; retry; escalate]
Class -->|sm decision| SM[SM owns next move]
Class -->|cancel| Clean[cancellation path]
Class -->|crash| Crash[session close; resume possible]
| Class | Recoverable? | Handler |
|---|---|---|
| Validation | No within this run | Session refuses to start; operator fixes config. |
| Provider transient (5xx, network blip) | Yes | Retry per the provider's retry policy; escalate after N. |
| Provider capability (required feature unavailable) | No | Capability Negotiation rejects the choice; user picks another. |
| Tool transient (timeout, resource lock) | Sometimes | Per-tool retry policy; otherwise surface to the model. |
| Tool terminal (schema violation, auth failure, logical failure) | No | Surface to the model as a typed error; model may react. |
| Session store | Depends | Retry snapshot; after N, error the session or trigger operator intervention. |
| State Machine decision | N/A | The SM owns the next move on a failed tool call or a user-caused error within a gated turn. |
| Cancellation | N/A | A cooperative exit, not an error. Recorded in the audit trail. |
| Crash | Recoverable via resume | See Persistence and Recovery. |
When a State Machine is attached, the stage pipeline governs failure recovery:
- A failed tool call surfaces into the stage's `Act` loop as a typed error; the stage may `Assert` a failed attempt and let `Next()` route to a repair stage or a retry.
- A failed `COMPOSE_REQUEST` inside a stage execution fails the attempt; the stage's `retryPolicy` decides whether to retry from `Setup` or surface the failure to `Next()`.
- An unrecoverable stage returns a terminal failure from `Exit`; `Next()` resolves per the stage's resolution policy: `skip`, `retry-later`, or `abort-workflow`.
Without an attached SM, core falls back to default class behavior:
- Provider transient → retry.
- Tool terminal → surface to model; the model decides to retry, give up, or ask the user.
- Persistent unrecoverable → end the turn with a user-visible error.
Errors carry a class and a code:
| Class | Example codes |
|---|---|
| `Validation` | `ShapeInvalid`, `ContractVersionMismatch`, `ConfigSchemaViolation` |
| `ProviderTransient` | `NetworkTimeout`, `Provider5xx`, `RateLimited` |
| `ProviderCapability` | `MissingStreaming`, `MissingToolCalling`, `ContextWindowTooSmall` |
| `ToolTransient` | `ExecutionTimeout`, `ResourceBusy` |
| `ToolTerminal` | `InputInvalid`, `OutputMalformed`, `Forbidden`, `NotFound` |
| `Session` | `ManifestDrift`, `StoreUnavailable`, `ResumeMismatch` |
| `Cancellation` | `SessionCancelled`, `TurnCancelled`, `ToolCancelled` |
| `ExtensionHost` | `LifecycleFailure`, `DependencyCycle`, `DependencyMissing` |
Extensions emit typed errors through the Host API; callers match on class/code, not on message substrings.
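A minimal sketch of what matching on class and code could look like. The `TypedError` shape, the `isTypedError` guard, and the `describe` helper are hypothetical names for illustration, not the Host API's actual surface; only the class and code names are taken from the table above.

```typescript
// Hypothetical shapes for illustration; the real Host API may differ.
type ErrorClass =
  | "Validation"
  | "ProviderTransient"
  | "ProviderCapability"
  | "ToolTransient"
  | "ToolTerminal"
  | "Session"
  | "Cancellation"
  | "ExtensionHost";

interface TypedError {
  readonly errorClass: ErrorClass;
  readonly code: string;
  readonly message: string;
}

// Structural check: anything carrying a class, a code, and a message.
function isTypedError(value: unknown): value is TypedError {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v["errorClass"] === "string" &&
    typeof v["code"] === "string" &&
    typeof v["message"] === "string"
  );
}

// Callers branch on the structured fields, never on message text.
function describe(err: unknown): string {
  if (!isTypedError(err)) return "unclassified";
  if (err.errorClass === "ProviderTransient" && err.code === "RateLimited") {
    return "rate-limited: back off and retry";
  }
  if (err.errorClass === "ToolTerminal") {
    return `tool failed terminally (${err.code})`;
  }
  return `${err.errorClass}/${err.code}`;
}
```

Note that a plain `Error` whose message happens to contain `RateLimited` is reported as unclassified here, which is the point of matching on fields rather than substrings.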
- Context preservation. Wrapping an error preserves the original class and code; the wrapper's message is additive.
- No silent swallow. Empty `catch` is non-conformant. An extension that intentionally ignores an error emits a `SuppressedError` observability event with the reason.
- User-visible vs internal. The UI sees a message tuned for the user (from the Interaction Protocol or a typed render). The audit trail sees the full error (class, code, context).
- Never expose internal details to the model by default. A tool that returns `OutputMalformed` sends the model a typed error shape, not a full stack trace. An SM may explicitly allow richer model-facing messages for debugging workflows.
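The rules above can be sketched as three small helpers. Everything here is an assumption made for illustration: the `ErrorInfo` fields, the `wrap`, `toModelFacing`, and `suppress` functions, and the event shape are hypothetical, and only the `SuppressedError` event name comes from the text.

```typescript
// Hypothetical shapes; field names are assumptions for illustration.
interface ErrorInfo {
  readonly errorClass: string;
  readonly code: string;
  readonly message: string;
  readonly cause?: ErrorInfo;
  readonly stack?: string;
}

// Context preservation: class and code are copied; only the message grows.
function wrap(inner: ErrorInfo, note: string): ErrorInfo {
  return {
    errorClass: inner.errorClass,
    code: inner.code,
    message: `${note}: ${inner.message}`,
    cause: inner,
  };
}

// Model-facing view: the typed shape only, without stack or cause chain.
function toModelFacing(err: ErrorInfo): { errorClass: string; code: string; message: string } {
  return { errorClass: err.errorClass, code: err.code, message: err.message };
}

interface SuppressedErrorEvent {
  readonly type: "SuppressedError";
  readonly errorClass: string;
  readonly code: string;
  readonly reason: string;
}

// No silent swallow: ignoring an error still emits an event with the reason.
function suppress(
  err: ErrorInfo,
  reason: string,
  emit: (event: SuppressedErrorEvent) => void,
): void {
  emit({ type: "SuppressedError", errorClass: err.errorClass, code: err.code, reason });
}
```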
See the Error Handling standard for the broader principles this page aligns with.
Retryable classes declare a retry policy in their contract:
| Field | Meaning |
|---|---|
| `retryable` | Whether the class is retryable. |
| `maxAttempts` | Upper bound. |
| `backoff` | `fixed`, `linear`, or `exponential` with jitter. |
| `classifier` | For tool errors: which codes are retryable vs terminal. |
Retries are audited. A retried tool call lands in the audit trail with the attempt count so operators see when providers or tools become flaky.
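As a rough illustration, the four fields could be typed as below. The field names follow the table; the concrete types, the `shouldRetry` helper, and the `fetchToolPolicy` example are assumptions, not the contract's actual definition.

```typescript
// Field names follow the table above; the types are assumptions.
type Backoff = "fixed" | "linear" | "exponential";

interface RetryPolicy {
  retryable: boolean;
  maxAttempts: number;
  backoff: Backoff;
  // For tool errors: decides whether a given code is retryable or terminal.
  classifier?: (code: string) => "retryable" | "terminal";
}

// Returns true when another attempt is allowed after `attemptsSoFar` attempts.
function shouldRetry(policy: RetryPolicy, code: string, attemptsSoFar: number): boolean {
  if (!policy.retryable) return false;
  if (attemptsSoFar >= policy.maxAttempts) return false;
  if (policy.classifier !== undefined && policy.classifier(code) === "terminal") {
    return false;
  }
  return true;
}

// Hypothetical policy a tool might declare in its contract.
const fetchToolPolicy: RetryPolicy = {
  retryable: true,
  maxAttempts: 3,
  backoff: "exponential",
  classifier: (code) =>
    code === "ExecutionTimeout" || code === "ResourceBusy" ? "retryable" : "terminal",
};
```

With this policy, an `ExecutionTimeout` is retried until three attempts have been made, while a `Forbidden` is never retried.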
A provider adapter may override, but core ships a default for ProviderTransient / RateLimited:
| Signal | Core response |
|---|---|
| Upstream `Retry-After` header present | Honor it exactly. Sleep the given duration, then retry. |
| No `Retry-After` header | Exponential backoff: 5 → 10 → 20 → 40 → 80 → 160 seconds, with ±20% jitter. |
| Cap on attempts | 6 attempts total (matching the six-step schedule above). After the sixth failure, surface RateLimited as a terminal error on the class. |
Rationale: users ran out of patience before they ran out of retries in pre-v1 experiments — an explicit, bounded default is easier to reason about than an open-ended one. No token budget or cost cap is imposed on top of the retry schedule in v1.
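The default schedule can be sketched as a pure function. This is an illustration under stated assumptions, not core's implementation: `delayBeforeRetry` and its parameters are hypothetical, the six steps are read as six retries following the initial call (the text does not say whether the first call counts as an attempt), and the cap is assumed to apply even when `Retry-After` is present.

```typescript
const BASE_SECONDS = 5; // first step of the documented schedule
const MAX_RETRIES = 6;  // six-step schedule: 5, 10, 20, 40, 80, 160
const JITTER = 0.2;     // ±20%

// Returns the wait in seconds before retry number `retryNumber` (1-based),
// or null once the cap is exceeded and RateLimited should surface as terminal.
// `random` is injectable so the jitter can be pinned in tests.
function delayBeforeRetry(
  retryNumber: number,
  retryAfterSeconds?: number,
  random: () => number = Math.random,
): number | null {
  if (retryNumber < 1) throw new RangeError("retryNumber is 1-based");
  if (retryNumber > MAX_RETRIES) return null;
  // An upstream Retry-After value is honored exactly, with no jitter.
  if (retryAfterSeconds !== undefined) return retryAfterSeconds;
  const nominal = BASE_SECONDS * Math.pow(2, retryNumber - 1);
  const factor = 1 + JITTER * (2 * random() - 1); // within [0.8, 1.2)
  return nominal * factor;
}
```

Keeping the function free of timers makes the schedule easy to audit: the caller sleeps for the returned duration and records the attempt count.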
Bundled provider adapters may declare their own retry policy that overrides this default. See OpenAI-Compatible and Anthropic for adapter-specific overrides.
| Class | Default |
|---|---|
| `ProviderTransient` / `NetworkTimeout`, `Provider5xx` | Three attempts; exponential backoff 1 → 2 → 4 seconds with jitter. |
| `ToolTransient` / `ExecutionTimeout`, `ResourceBusy` | Policy set by the tool's contract. Core does not impose a default here — tools know their own cost. |
| `Session` / `StoreUnavailable` | Three attempts; linear backoff 2 → 4 → 6 seconds. After failure, the session errors. |
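The two numeric schedules in the table follow from a base delay and a growth rule. The sketch below reproduces only the nominal delays and applies no jitter; `schedule` and the constant names are hypothetical.

```typescript
type BackoffKind = "exponential" | "linear";

// Nominal delays in seconds for `steps` consecutive waits.
// Exponential doubles each step; linear adds the base each step.
function schedule(kind: BackoffKind, baseSeconds: number, steps: number): number[] {
  const delays: number[] = [];
  for (let i = 0; i < steps; i++) {
    delays.push(
      kind === "exponential" ? baseSeconds * Math.pow(2, i) : baseSeconds * (i + 1),
    );
  }
  return delays;
}

const providerTransientDefault = schedule("exponential", 1, 3); // [1, 2, 4]
const storeUnavailableDefault = schedule("linear", 2, 3);       // [2, 4, 6]
```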
The SM's stage-level retryPolicy (see SM Stage Lifecycle) sits on top of these defaults — the SM decides whether a failed tool call re-enters Act, not the class-level retry. The class-level retry governs transport-level redo; the SM's retry governs workflow-level redo.
A user may be asked to choose a recovery path:
- "Retry the tool call?" → `Approve` with yes/no.
- "Provide a replacement input?" → `Ask`.
- "Authenticate again?" → `Auth.DeviceCode`.
Every Interaction Protocol exchange on an error path is audited — see Audit Trail.
A tool that returns a partial result with an error (e.g., fetched 3 of 5 URLs) returns:
- The partial payload in its declared output shape.
- An `errors[]` field enumerating per-item failures with typed errors.
Tools must not blend errors into a success-shaped output. Correcting the LLM is cheapest when typed error signals are available.
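A sketch of the shape for the URL-fetching example. The `FetchResult` and `ItemError` types and the `collect` helper are hypothetical; the only requirements taken from the text are that the partial payload keeps its declared output shape and that failures are listed per item in `errors[]`.

```typescript
// Hypothetical output shape for a URL-fetching tool; names are assumptions.
interface ItemError {
  readonly item: string;
  readonly errorClass: "ToolTransient" | "ToolTerminal";
  readonly code: string;
}

interface FetchResult {
  readonly pages: { url: string; body: string }[];
  readonly errors: ItemError[];
}

// Collects successes and per-item failures separately, so a failure is
// never encoded as a success-shaped entry.
function collect(
  urls: string[],
  fetchOne: (url: string) => string | ItemError,
): FetchResult {
  const pages: { url: string; body: string }[] = [];
  const errors: ItemError[] = [];
  for (const url of urls) {
    const outcome = fetchOne(url);
    if (typeof outcome === "string") {
      pages.push({ url, body: outcome });
    } else {
      errors.push(outcome);
    }
  }
  return { pages, errors };
}
```

A caller that fetches five URLs and loses two gets three entries in `pages` and two typed entries in `errors`, which the model can act on individually.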