The reliability layer your AI agents are missing.
Your agent is processing 1,000 customer records. It reaches record 847 — and the process dies.
Without Aetheris: start over from record 1. Re-run 847 LLM calls. Pay twice. Pray nothing was written twice.
With Aetheris: restart. It resumes from recorded progress. Runtime Tool side effects are not repeated when the production Ledger and Effect Store are configured.
Every production AI agent eventually hits the same three walls:
| Failure mode | What happens today |
|---|---|
| Process crash mid-task | Restart from the beginning; re-run all LLM calls |
| Retry after tool failure | Email sent twice, order created twice, payment charged twice |
| "Why did the AI do that?" | No visibility, no audit trail, no replay |
Aetheris is an open-source runtime that solves all three — without requiring you to rewrite your agent.
Requirements: Go 1.26.1+, Git
git clone https://github.com/Colin4k1024/Aetheris.git
cd Aetheris
make run-embedded # starts with embedded SQLite, no external services
curl http://localhost:8080/api/health # {"status":"ok", ...}
From Python (pip install aetheris):
from aetheris import AetherisClient
client = AetherisClient("http://localhost:8080")
job = client.run("my-agent", "Summarize the Q3 earnings report")
result = job.wait()
print(result.output)
From any language: Aetheris exposes a REST API. Wrap your existing agent with two config lines:
# configs/api.embedded.yaml
agents:
  agents:
    my_python_agent:
      type: "external_http"
      external:
        url: "http://localhost:9000/invoke"
        timeout: "120s"
Then submit a job:
curl -X POST http://localhost:8080/api/agents/my_python_agent/message \
-H "Idempotency-Key: task-001" \
-H "Content-Type: application/json" \
-d '{"message": "Process customer batch #42"}'
Every job step is checkpointed. If the worker dies, the next worker picks up from the last checkpoint, not the beginning.
Job progress: ████████████████████░░░░░░░░░░ (step 16/25)
Worker crash! 💀
Restart: ████████████████████ (resumes at step 16)
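The resume behaviour pictured above can be illustrated with a minimal sketch. It is not Aetheris's implementation: `run_with_checkpoints`, the JSON checkpoint file, and the simulated crash are all hypothetical names chosen for illustration. The idea it shows is that progress is persisted after every step, so a restart skips the work that was already recorded.

```python
import json
import os
import tempfile


def run_with_checkpoints(steps, checkpoint_path):
    """Run steps in order, persisting progress after each one.

    Hypothetical helper for illustration; not part of the Aetheris API.
    """
    # Load recorded progress if a previous run left a checkpoint behind.
    state = {"next_step": 0, "results": []}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path) as f:
            state = json.load(f)

    for index in range(state["next_step"], len(steps)):
        state["results"].append(steps[index]())
        state["next_step"] = index + 1
        # Write to a temp file, then rename, so a crash cannot leave
        # a half-written checkpoint behind.
        tmp_path = checkpoint_path + ".tmp"
        with open(tmp_path, "w") as f:
            json.dump(state, f)
        os.replace(tmp_path, checkpoint_path)
    return state["results"]


executed = []                   # which step indices actually ran
crash_once = {"armed": True}    # makes step 3 fail on the first attempt only


def make_step(i):
    def step():
        if i == 3 and crash_once["armed"]:
            crash_once["armed"] = False
            raise RuntimeError("simulated worker crash")
        executed.append(i)
        return i * 10
    return step


steps = [make_step(i) for i in range(5)]
path = os.path.join(tempfile.mkdtemp(), "job.checkpoint.json")

try:
    run_with_checkpoints(steps, path)
except RuntimeError:
    pass  # the "worker" died at step 3

# Restart: steps 0-2 are read from the checkpoint, only 3 and 4 run.
results = run_with_checkpoints(steps, path)
```

In this sketch each of the five steps runs exactly once across both attempts, which is the property the progress bar above is describing.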
External API calls implemented as Aetheris Runtime Tools (payments, emails, order creation) are wrapped in an invocation ledger and Effect Store. If a step is retried after a recorded commit, the runtime injects the recorded result instead of repeating the side effect.
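As a rough sketch of the ledger idea, assuming an in-memory dictionary in place of the real Ledger and Effect Store: `InvocationLedger`, `invoke`, and the key format below are hypothetical and are not the Aetheris API. A retry that arrives after a recorded commit gets the stored result back, and the side effect does not run again.

```python
class InvocationLedger:
    """Toy in-memory ledger keyed by idempotency key (illustrative only)."""

    def __init__(self):
        self._committed = {}

    def invoke(self, idempotency_key, side_effect):
        # A committed key means the effect already happened:
        # return the recorded result instead of calling it again.
        if idempotency_key in self._committed:
            return self._committed[idempotency_key]
        result = side_effect()
        self._committed[idempotency_key] = result
        return result


sent = []  # stands in for the outside world (an email provider)


def send_email():
    sent.append("welcome@example.com")
    return {"message_id": len(sent)}


ledger = InvocationLedger()
first = ledger.invoke("job-1:step-4:send_email", send_email)
retry = ledger.invoke("job-1:step-4:send_email", send_email)  # retried step
```

The sketch only covers retries that happen after the commit is recorded, which matches the condition stated above. A crash between the side effect and the commit is outside what this toy version handles.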
# Without Aetheris: retry → email sent twice
# With Aetheris Runtime Tool: retry → ledger returns recorded result, email send is not repeated
Every LLM call, tool invocation, and checkpoint is appended to an immutable event log. You can replay any job from any point without re-calling LLMs or external APIs.
aetheris trace <job-id> # view the full decision timeline
aetheris replay <job-id> # replay without side effects
Guarantees are configuration-dependent. See the guarantee matrix for the exact boundary between embedded mode, external_http, native Runtime Tools, and production Postgres deployments.
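The record-then-replay idea behind the event log can be sketched in a few lines. This is an assumption-laden illustration, not how the runtime is built: `Recorder`, `Replayer`, `workflow`, and the fake LLM and tool functions are hypothetical. The same workflow code runs in both modes. In replay mode every call is answered from the log, so no LLM or external API is contacted.

```python
live_calls = []  # tracks real invocations of the "LLM" and the "tool"


def fake_llm(question):
    live_calls.append("llm")
    return f"plan for: {question}"


def fake_send(plan):
    live_calls.append("tool")
    return f"sent[{plan}]"


class Recorder:
    """Live mode: call the real function and append its result to the log."""

    def __init__(self, log):
        self.log = log

    def call(self, name, fn, *args):
        result = fn(*args)
        self.log.append({"name": name, "result": result})
        return result


class Replayer:
    """Replay mode: return recorded results in order, never calling fn."""

    def __init__(self, log):
        self._recorded = iter(log)

    def call(self, name, fn, *args):
        event = next(self._recorded)
        if event["name"] != name:
            raise RuntimeError(f"log mismatch: expected {event['name']}, got {name}")
        return event["result"]


def workflow(ctx, question):
    plan = ctx.call("llm.plan", fake_llm, question)
    return ctx.call("tool.send_report", fake_send, plan)


log = []
live = workflow(Recorder(log), "Q3 summary")      # real calls, recorded
replayed = workflow(Replayer(log), "Q3 summary")  # answered from the log
```

The name check in `Replayer.call` is one possible way to detect a workflow that has drifted from its recorded history. Whether Aetheris performs a comparable check is not stated here.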
Aetheris works with any agent, in any language. You don't need to change your agent code.
For split API/Worker deployments, load the same external_http agent definition into both processes so the API can accept /api/agents/:id/message and the Worker can execute the job.
# Your existing LangChain agent — unchanged
from langchain_openai import ChatOpenAI
from langchain.agents import create_react_agent
# tools and prompt are the ones your agent already defines
agent = create_react_agent(ChatOpenAI(), tools, prompt)
# Expose it as an HTTP endpoint (one function)
from aetheris.integrations.langchain import serve
serve(agent, port=9000) # Aetheris will call this endpoint durably
→ Full LangChain integration guide
# Add to configs/api.embedded.yaml
agents:
  agents:
    my_agent:
      type: "external_http"
      external:
        url: "http://your-agent:9000/invoke"
Your agent receives a job envelope with message, job_id, and idempotency_key. It returns {"answer": "...", "final": true}.
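A minimal sketch of such an endpoint, using only the Python standard library, might look like the following. The three envelope fields and the response shape come from the description above. Everything else is an assumption made for illustration: that the envelope arrives as a JSON POST body, plus the names `handle_envelope`, `InvokeHandler`, and `run_server`.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer


def handle_envelope(envelope):
    """Turn a job envelope into the response shape described above."""
    message = envelope.get("message", "")
    # job_id and idempotency_key are available for logging or deduplication.
    job_id = envelope.get("job_id")
    idempotency_key = envelope.get("idempotency_key")
    # Replace this line with a call into your existing agent.
    answer = f"echo: {message}"
    return {"answer": answer, "final": True}


class InvokeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/invoke":
            self.send_error(404)
            return
        # Assumption: the envelope is sent as a JSON request body.
        length = int(self.headers.get("Content-Length", 0))
        envelope = json.loads(self.rfile.read(length) or b"{}")
        body = json.dumps(handle_envelope(envelope)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def run_server(port=9000):
    """Block and serve /invoke on the given port."""
    HTTPServer(("0.0.0.0", port), InvokeHandler).serve_forever()
```

Calling `run_server()` starts the listener on port 9000, matching the `url` in the config above. Keeping the agent logic in `handle_envelope` makes it testable without opening a socket.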
# Built-in via AgentFactory, config-driven
# configs/agents.yaml
agents:
  my_eino_agent:
    type: "react"
    llm: "default"
    tools: ["web_search", "calculator"]
Your Agent (Python/JS/Go/any)
│
▼
Aetheris API ──── idempotency key ──▶ Invocation Ledger
│ (Runtime Tool boundary)
▼
Durable Worker ──── checkpoint ──────▶ Event Store
│ (crash recovery)
▼
Trace & Replay API ───────────────────────────────▶ Audit
The runtime is event-sourced: every state transition is an append-only event. With the production Ledger and Effect Store configured, replay injects recorded LLM and Runtime Tool results instead of re-calling them.
| | Aetheris | LangGraph Platform | Temporal |
|---|---|---|---|
| Open source + self-hosted | ✅ | ❌ (cloud only) | ✅ |
| No infrastructure for local dev | ✅ (embedded SQLite) | ❌ | ❌ (requires server) |
| At-most-once Runtime Tool boundary | ✅ with Ledger + Effect Store | | |
| Works with any agent framework | ✅ | ❌ LangGraph only | ❌ requires SDK |
| LLM decision audit trail | ✅ | ✅ | ❌ |
| Replay without re-calling recorded LLM/Tool effects | ✅ with Effect Store | ✅ for recorded workflow history |
See the current black-box adapter boundary in 2 minutes:
cd examples/crash_recovery
pip install aetheris
python demo.py
# Starts a local external_http demo agent and submits one durable batch job
The example shows durable submission and trace visibility around one external HTTP call. For true per-step checkpoint resume inside the work itself, use native Aetheris tools/workflows instead of a single external_http call.
| Path | Purpose |
|---|---|
| cmd/api | HTTP API service |
| cmd/worker | Background job worker |
| cmd/cli | CLI: aetheris trace/replay/jobs/chat |
| configs | Runtime configs (embedded, Docker, production) |
| examples | Working examples for each integration pattern |
| sdk/python | Python SDK (pip install aetheris) |
| docs | Guides, API reference, design notes |
| internal/agent | Core runtime engine |
Apache 2.0 — free to use, self-host, and modify.