Error Model

Z-M-Huang edited this page Apr 27, 2026 · 3 revisions

How failures propagate through stud-cli — which errors are recoverable, which are terminal, and who owns the next move.


Failure classes

```mermaid
flowchart TD
    Err[failure observed] --> Class{class?}
    Class -->|validation| Fatal1[session does not start]
    Class -->|provider transient| Retry1[retry with backoff]
    Class -->|provider capability| Fail1[fail fast; user picks new choice]
    Class -->|tool transient| Retry2[retry per tool policy]
    Class -->|tool terminal| Fail2[surface; model may react]
    Class -->|session-store| Fatal2[snapshot fails; retry; escalate]
    Class -->|sm decision| SM[SM owns next move]
    Class -->|cancel| Clean[cancellation path]
    Class -->|crash| Crash[session close; resume possible]
```

| Class | Recoverable? | Handler |
| --- | --- | --- |
| Validation | No (within this run) | Session refuses to start; operator fixes config. |
| Provider transient (5xx, network blip) | Yes | Retry per the provider's retry policy; escalate after N. |
| Provider capability (required feature unavailable) | No | Capability Negotiation rejects the choice; user picks another. |
| Tool transient (timeout, resource lock) | Sometimes | Per-tool retry policy; otherwise surface to the model. |
| Tool terminal (schema violation, auth failure, logical failure) | No | Surface to the model as a typed error; model may react. |
| Session store | Depends | Retry snapshot; after N, error the session or trigger operator intervention. |
| State Machine decision | N/A | The SM owns the next move on a failed tool call or a user-caused error within a gated turn. |
| Cancellation | N/A | A cooperative exit, not an error. Recorded in the audit trail. |
| Crash | Recoverable via resume | See Persistence and Recovery. |

SM ownership on failure

When a State Machine is attached, the stage pipeline governs failure recovery:

  • A failed tool call surfaces into the stage's Act loop as a typed error; the stage may Assert a failed attempt and let Next() route to a repair stage or a retry.
  • A failed COMPOSE_REQUEST inside a stage execution fails the attempt; the stage's retryPolicy decides whether to retry from Setup or surface the failure to Next().
  • An unrecoverable stage returns a terminal failure from Exit; Next() resolves per the stage's resolution policy: skip, retry-later, or abort-workflow.

Without an attached SM, core falls back to default class behavior:

  • Provider transient → retry.
  • Tool terminal → surface to model; the model decides to retry, give up, or ask the user.
  • Persistent unrecoverable → end the turn with a user-visible error.
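
The fallback above can be sketched as a small decision function. This is an illustration rather than core's actual code: `DefaultAction`, `ObservedFailure`, and the `retriesExhausted` flag are hypothetical names, and treating "persistent unrecoverable" as "whatever remains once retries are exhausted" is an assumption.

```typescript
// Hypothetical names throughout; the page does not define these types.
type DefaultAction = "retry" | "surface-to-model" | "end-turn";

interface ObservedFailure {
  errorClass: string;        // e.g. "ProviderTransient" or "ToolTerminal"
  retriesExhausted: boolean; // assumption: set by the class-level retry layer
}

// Default class behavior when no State Machine is attached.
function defaultAction(failure: ObservedFailure): DefaultAction {
  if (failure.errorClass === "ProviderTransient" && !failure.retriesExhausted) {
    return "retry";
  }
  if (failure.errorClass === "ToolTerminal") {
    // The model decides whether to retry, give up, or ask the user.
    return "surface-to-model";
  }
  // Assumption: everything else counts as persistent and unrecoverable.
  return "end-turn";
}
```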

Typed errors

Errors carry a class and a code:

| Class | Example codes |
| --- | --- |
| Validation | ShapeInvalid, ContractVersionMismatch, ConfigSchemaViolation |
| ProviderTransient | NetworkTimeout, Provider5xx, RateLimited |
| ProviderCapability | MissingStreaming, MissingToolCalling, ContextWindowTooSmall |
| ToolTransient | ExecutionTimeout, ResourceBusy |
| ToolTerminal | InputInvalid, OutputMalformed, Forbidden, NotFound |
| Session | ManifestDrift, StoreUnavailable, ResumeMismatch |
| Cancellation | SessionCancelled, TurnCancelled, ToolCancelled |
| ExtensionHost | LifecycleFailure, DependencyCycle, DependencyMissing |

Extensions emit typed errors through the Host API; callers match on class/code, not on message substrings.
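
A minimal sketch of what "match on class/code" might look like. The class and code names come from the table above; the `TypedError` shape, the `typedError` factory, and the `matches` helper are hypothetical, since this page does not show the Host API's real types.

```typescript
// Class names taken from the table above; the type itself is an assumption.
type ErrorClass =
  | "Validation"
  | "ProviderTransient"
  | "ProviderCapability"
  | "ToolTransient"
  | "ToolTerminal"
  | "Session"
  | "Cancellation"
  | "ExtensionHost";

// Hypothetical error shape: a class, a code, and a human-readable message.
interface TypedError {
  readonly errorClass: ErrorClass;
  readonly code: string;
  readonly message: string;
}

function typedError(errorClass: ErrorClass, code: string, message: string): TypedError {
  return { errorClass, code, message };
}

// Callers compare class and (optionally) code; the message is never inspected.
function matches(err: TypedError, errorClass: ErrorClass, code?: string): boolean {
  return err.errorClass === errorClass && (code === undefined || err.code === code);
}
```

With this shape, a caller would write `matches(err, "ProviderTransient", "RateLimited")` rather than searching `err.message` for "rate limit", so rewording a message cannot break error handling.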


Propagation rules

  • Context preservation. Wrapping an error preserves the original class and code; the wrapper's message is additive.
  • No silent swallow. Empty catch is non-conformant. An extension that intentionally ignores an error emits a SuppressedError observability event with the reason.
  • User-visible vs internal. The UI sees a message tuned for the user (from the Interaction Protocol or a typed render). The audit trail sees the full error (class, code, context).
  • Never expose internal details to the model by default. A tool that returns OutputMalformed sends the model a typed error shape — not a full stack trace. An SM may explicitly allow richer model-facing messages for debugging workflows.
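
Two of the rules above, context preservation and the model-facing shape, can be illustrated together. This is a sketch under assumptions: the `TypedError` fields, `wrap`, and `toModelShape` are hypothetical names, not the Host API.

```typescript
// Hypothetical error shape; `context` stands in for audit-only detail.
interface TypedError {
  errorClass: string;
  code: string;
  message: string;
  context?: Record<string, unknown>; // e.g. a stack trace; audit trail only
  cause?: TypedError;
}

// Wrapping keeps the original class and code; only the message is extended.
function wrap(original: TypedError, note: string): TypedError {
  return { ...original, message: `${note}: ${original.message}`, cause: original };
}

// Default model-facing shape: class, code, and message, with no context or cause.
function toModelShape(err: TypedError): { errorClass: string; code: string; message: string } {
  return { errorClass: err.errorClass, code: err.code, message: err.message };
}
```

The audit trail would receive the full `TypedError`, including `context` and the `cause` chain, while the model receives only the output of `toModelShape` unless an SM opts into richer messages.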

See the Error Handling standard for the broader principles this page aligns with.


Retry policy

Retryable classes declare a retry policy in their contract:

| Field | Meaning |
| --- | --- |
| retryable | Whether the class is retryable. |
| maxAttempts | Upper bound. |
| backoff | fixed, linear, or exponential with jitter. |
| classifier | For tool errors: which codes are retryable vs terminal. |

Retries are audited. A retried tool call lands in the audit trail with the attempt count so operators see when providers or tools become flaky.
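
One way to picture the contract fields is as a small type plus two helpers. The four field names come from the table above; `baseDelayMs`, the classifier's function signature, `delayBeforeRetry`, and `shouldRetry` are assumptions made for this sketch, and jitter is left out to keep the arithmetic visible.

```typescript
type Backoff = "fixed" | "linear" | "exponential";

interface RetryPolicy {
  retryable: boolean;
  maxAttempts: number;
  backoff: Backoff;
  baseDelayMs: number; // hypothetical: the page does not name a base-delay field
  classifier?: (code: string) => "retryable" | "terminal"; // hypothetical signature
}

// Delay before retry number `retry` (1-based), without jitter.
function delayBeforeRetry(policy: RetryPolicy, retry: number): number {
  switch (policy.backoff) {
    case "fixed":
      return policy.baseDelayMs;
    case "linear":
      return policy.baseDelayMs * retry;
    case "exponential":
      return policy.baseDelayMs * 2 ** (retry - 1);
  }
}

// `attemptsSoFar` counts attempts already made, including the first one.
function shouldRetry(policy: RetryPolicy, code: string, attemptsSoFar: number): boolean {
  if (!policy.retryable || attemptsSoFar >= policy.maxAttempts) return false;
  return policy.classifier ? policy.classifier(code) === "retryable" : true;
}
```

With `baseDelayMs: 1000`, the exponential branch yields 1000, 2000, 4000 ms for retries 1 to 3, and the linear branch with `baseDelayMs: 2000` yields 2000, 4000, 6000 ms.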

Core default on RateLimited

A provider adapter may override, but core ships a default for ProviderTransient / RateLimited:

| Signal | Core response |
| --- | --- |
| Upstream Retry-After header present | Honor it exactly. Sleep the given duration, then retry. |
| No Retry-After header | Exponential backoff: 5 → 10 → 20 → 40 → 80 → 160 seconds, with ±20% jitter. |
| Cap on attempts | 6 attempts total (matching the six-step schedule above). After the sixth failure, surface RateLimited as a terminal error on the class. |

Rationale: users ran out of patience before they ran out of retries in pre-v1 experiments — an explicit, bounded default is easier to reason about than an open-ended one. No token budget or cost cap is imposed on top of the retry schedule in v1.
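
The default schedule can be written down as a small pure function. This is a sketch, not core's implementation. It rests on two assumptions: each of the six attempts is a retry preceded by one step of the schedule, so the initial request is not counted, and the cap applies even when Retry-After is present. The function name, its parameters, and the injectable `random` source are all hypothetical.

```typescript
const FIRST_DELAY_SECONDS = 5; // first step of the 5 → 160 schedule
const MAX_RETRIES = 6;         // six-step schedule
const JITTER = 0.2;            // ±20%

// Returns seconds to wait before the next retry, or null once the cap is hit
// and RateLimited should be surfaced as terminal.
function rateLimitDelaySeconds(
  retriesSoFar: number,                 // retries already made, starting at 0
  retryAfterSeconds?: number,           // parsed Retry-After value, if upstream sent one
  random: () => number = Math.random,   // injectable for deterministic tests
): number | null {
  if (retriesSoFar >= MAX_RETRIES) return null;
  // Retry-After is honored exactly: no jitter, no backoff arithmetic.
  if (retryAfterSeconds !== undefined) return retryAfterSeconds;
  const base = FIRST_DELAY_SECONDS * 2 ** retriesSoFar; // 5, 10, 20, 40, 80, 160
  const factor = 1 + (random() * 2 - 1) * JITTER;       // between 0.8 and 1.2
  return base * factor;
}
```

Passing `() => 0.5` as `random` removes the jitter, which makes the schedule easy to inspect: retries 0 through 5 wait 5, 10, 20, 40, 80, and 160 seconds, and the next call returns `null`.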

Bundled provider adapters may declare their own retry policy that overrides this default. See OpenAI-Compatible and Anthropic for adapter-specific overrides.

Other retryable classes

| Class | Default |
| --- | --- |
| ProviderTransient / NetworkTimeout, Provider5xx | Three attempts; exponential backoff 1 → 2 → 4 seconds with jitter. |
| ToolTransient / ExecutionTimeout, ResourceBusy | Policy set by the tool's contract. Core does not impose a default here; tools know their own cost. |
| Session / StoreUnavailable | Three attempts; linear backoff 2 → 4 → 6 seconds. After failure, the session errors. |

The SM's stage-level retryPolicy (see SM Stage Lifecycle) sits on top of these defaults — the SM decides whether a failed tool call re-enters Act, not the class-level retry. The class-level retry governs transport-level redo; the SM's retry governs workflow-level redo.
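
The layering can be pictured as two nested loops: the inner one re-issues the same call, the outer one re-runs the stage from Setup. This is a simplified, synchronous sketch with backoff omitted. `Result`, `withTransportRetry`, and `runStage` are hypothetical names, and a real stage would let Next() route the failure rather than loop in place.

```typescript
type Result<T> = { ok: true; value: T } | { ok: false; code: string };

// Transport-level redo: the class-level policy re-issues the same call.
function withTransportRetry<T>(call: () => Result<T>, maxAttempts: number): Result<T> {
  let last: Result<T> = call();
  for (let attempt = 2; attempt <= maxAttempts && !last.ok; attempt++) {
    last = call();
  }
  return last;
}

// Workflow-level redo: the stage's retryPolicy restarts from Setup and
// re-enters Act; each Act still gets its own transport-level retries.
function runStage<T>(
  setup: () => void,
  act: () => Result<T>,
  stageAttempts: number,
  transportAttempts: number,
): Result<T> {
  setup();
  let last: Result<T> = withTransportRetry(act, transportAttempts);
  for (let n = 2; n <= stageAttempts && !last.ok; n++) {
    setup();
    last = withTransportRetry(act, transportAttempts);
  }
  return last;
}
```

With two stage attempts and two transport attempts, a call that fails three times and then succeeds completes on the second stage attempt: Setup runs twice and the call is issued four times.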


Interaction Protocol on error

A user may be asked to choose a recovery path:

  • "Retry the tool call?" → Approve with yes/no.
  • "Provide a replacement input?" → Ask.
  • "Authenticate again?" → Auth.DeviceCode.

Every Interaction Protocol exchange on an error path is audited — see Audit Trail.


Partial tool failure

A tool that returns a partial result with an error (e.g., fetched 3 of 5 URLs) returns:

  • The partial payload in its declared output shape.
  • An errors[] field enumerating per-item failures with typed errors.

Tools must not blend errors into a success-shaped output. Correcting the model is cheapest when it receives typed error signals.
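
A sketch of the fetch example, assuming a hypothetical multi-URL tool. Only the `errors[]` field comes from this page; `FetchOutcome`, `ItemError`, `FetchAllOutput`, and `fetchAll` are illustrative names, and the per-item fetcher is passed in as a parameter so the example stays self-contained.

```typescript
// Hypothetical per-item outcome returned by whatever performs a single fetch.
type FetchOutcome =
  | { ok: true; body: string }
  | { ok: false; errorClass: string; code: string; message: string };

interface ItemError {
  item: string; // which input failed
  errorClass: string;
  code: string;
  message: string;
}

// Declared output shape: successes and typed failures live in separate fields.
interface FetchAllOutput {
  pages: { url: string; body: string }[];
  errors: ItemError[];
}

function fetchAll(urls: string[], fetchOne: (url: string) => FetchOutcome): FetchAllOutput {
  const output: FetchAllOutput = { pages: [], errors: [] };
  for (const url of urls) {
    const outcome = fetchOne(url);
    if (outcome.ok) {
      output.pages.push({ url, body: outcome.body });
    } else {
      // Failures never masquerade as pages; each gets a typed entry.
      output.errors.push({
        item: url,
        errorClass: outcome.errorClass,
        code: outcome.code,
        message: outcome.message,
      });
    }
  }
  return output;
}
```

For 3 of 5 URLs fetched, `pages` holds three entries and `errors` holds two, so the model can see exactly which inputs to retry or drop.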

