Error Model

How failures propagate through stud-cli — which errors are recoverable, which are terminal, and who owns the next move.

Failure classes

flowchart TD
    Err[failure observed] --> Class{class?}
    Class -->|validation| Fatal1[session does not start]
    Class -->|provider transient| Retry1[retry with backoff]
    Class -->|provider capability| Fail1[fail fast; user picks new choice]
    Class -->|tool transient| Retry2[retry per tool policy]
    Class -->|tool terminal| Fail2[surface; model may react]
    Class -->|session-store| Fatal2[snapshot fails; retry; escalate]
    Class -->|sm decision| SM[SM owns next move]
    Class -->|cancel| Clean[cancellation path]
    Class -->|crash| Crash[session close; resume possible]

Class	Recoverable?	Handler
Validation	No within this run	Session refuses to start; operator fixes config.
Provider transient (5xx, network blip)	Yes	Retry per the provider's retry policy; escalate after N.
Provider capability (required feature unavailable)	No	Capability Negotiation rejects the choice; user picks another.
Tool transient (timeout, resource lock)	Sometimes	Per-tool retry policy; otherwise surface to the model.
Tool terminal (schema violation, auth failure, logical failure)	No	Surface to the model as a typed error; model may react.
Session store	Depends	Retry snapshot; after N, error the session or trigger operator intervention.
State Machine decision	N/A	The SM owns the next move on a failed tool call or a user-caused error within a gated turn.
Cancellation	N/A	A cooperative exit, not an error. Landed into audit.
Crash	Recoverable via resume	See Persistence and Recovery.

SM ownership on failure

When a State Machine is attached, the stage pipeline governs failure recovery:

A failed tool call surfaces into the stage's Act loop as a typed error; the stage may Assert a failed attempt and let Next() route to a repair stage or a retry.
A failed COMPOSE_REQUEST inside a stage execution fails the attempt; the stage's retryPolicy decides whether to retry from Setup or surface the failure to Next().
An unrecoverable stage returns a terminal failure from Exit; Next() resolves per the stage's resolution policy — skip, retry-later, or abort-workflow.

Without an attached SM, core falls back to default class behavior:

Provider transient → retry.
Tool terminal → surface to model; the model decides to retry, give up, or ask the user.
Persistent unrecoverable → end the turn with a user-visible error.

Typed errors

Errors carry a class and a code:

Class	Example codes
`Validation`	`ShapeInvalid`, `ContractVersionMismatch`, `ConfigSchemaViolation`
`ProviderTransient`	`NetworkTimeout`, `Provider5xx`, `RateLimited`
`ProviderCapability`	`MissingStreaming`, `MissingToolCalling`, `ContextWindowTooSmall`
`ToolTransient`	`ExecutionTimeout`, `ResourceBusy`
`ToolTerminal`	`InputInvalid`, `OutputMalformed`, `Forbidden`, `NotFound`
`Session`	`ManifestDrift`, `StoreUnavailable`, `ResumeMismatch`
`Cancellation`	`SessionCancelled`, `TurnCancelled`, `ToolCancelled`
`ExtensionHost`	`LifecycleFailure`, `DependencyCycle`, `DependencyMissing`

Extensions emit typed errors through the Host API; callers match on class/code, not on message substrings.

Propagation rules

Context preservation. Wrapping an error preserves the original class and code; the wrapper's message is additive.
No silent swallow. Empty catch is non-conformant. An extension that intentionally ignores an error emits a SuppressedError observability event with the reason.
User-visible vs internal. The UI sees a message tuned for the user (from the Interaction Protocol or a typed render). The audit trail sees the full error (class, code, context).
Never expose internal details to the model by default. A tool that returns OutputMalformed sends the model a typed error shape — not a full stack trace. An SM may explicitly allow richer model-facing messages for debugging workflows.

See the Error Handling standard for the broader principles this page aligns with.

Retry policy

Retryable classes declare a retry policy in their contract:

Field	Meaning
`retryable`	Whether the class is retryable.
`maxAttempts`	Upper bound.
`backoff`	`fixed`, `linear`, or `exponential` with jitter.
`classifier`	For tool errors: which codes are retryable vs terminal.

Retries are audited. A retried tool call lands in the audit trail with the attempt count so operators see when providers or tools become flaky.

Core default on `RateLimited`

A provider adapter may override, but core ships a default for ProviderTransient / RateLimited:

Signal	Core response
Upstream `Retry-After` header present	Honor it exactly. Sleep the given duration, then retry.
No `Retry-After` header	Exponential backoff: `5 → 10 → 20 → 40 → 80 → 160` seconds, with ±20% jitter.
Cap on attempts	6 attempts total (matching the six-step schedule above). After the sixth failure, surface `RateLimited` as a terminal error on the class.

Rationale: users ran out of patience before they ran out of retries in pre-v1 experiments — an explicit, bounded default is easier to reason about than an open-ended one. No token budget or cost cap is imposed on top of the retry schedule in v1.

Bundled provider adapters may declare their own retry policy that overrides this default. See OpenAI-Compatible and Anthropic for adapter-specific overrides.

Other retryable classes

Class	Default
`ProviderTransient / NetworkTimeout`, `Provider5xx`	Three attempts; exponential backoff `1 → 2 → 4` seconds with jitter.
`ToolTransient / ExecutionTimeout`, `ResourceBusy`	Policy set by the tool's contract. Core does not impose a default here — tools know their own cost.
`Session / StoreUnavailable`	Three attempts; linear backoff `2 → 4 → 6` seconds. After failure, the session errors.

The SM's stage-level retryPolicy (see SM Stage Lifecycle) sits on top of these defaults — the SM decides whether a failed tool call re-enters Act, not the class-level retry. The class-level retry governs transport-level redo; the SM's retry governs workflow-level redo.

Interaction Protocol on error

A user may be asked to choose a recovery path:

"Retry the tool call?" → Approve with yes/no.
"Provide a replacement input?" → Ask.
"Authenticate again?" → Auth.DeviceCode.

Every Interaction Protocol exchange on an error path is audited — see Audit Trail.

Partial tool failure

A tool that returns a partial result with an error (e.g., fetched 3 of 5 URLs) returns:

The partial payload in its declared output shape.
An errors[] field enumerating per-item failures with typed errors.

Tools must not blend errors into a success-shaped output. The LLM is cheapest to correct when typed error signals are available.

Related pages

Introduction

Reading

Core runtime

Contracts

Category contracts

Context

Security

Runtime behavior

Operations

Providers (bundled)

Integrations

MCP

Reference extensions

Tools

UI

Session Stores

Filesystem

Loggers

File

Providers

CLI-Wrapper

Hooks

Context Providers

System Prompt File

Commands

Bundled

Case studies

Flows

Maintainers

CLAUDE.md

Error Model

Error Model

Failure classes

SM ownership on failure

Typed errors

Propagation rules

Retry policy

Core default on RateLimited

Other retryable classes

Interaction Protocol on error

Partial tool failure

Related pages

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Core default on `RateLimited`