Releases: dep0we/atomic-agents-stack
v0.13.0 — LLMBackend Protocol arc closure
[0.13.0] - 2026-05-13
This release closes the LLMBackend Protocol arc (#87) — the framework now has two backend protocols in production (Memory + LLM) and three reference LLM backends (Anthropic, OpenAI direct, Moonshot) registered at framework import. Drop-in upgrade: existing agents and model.md configs keep working unchanged; ambiguous registrations get a new provider: field for disambiguation.
Operators upgrading from v0.12.0 — three behavior changes worth knowing:
- Moonshot MOONSHOT_BASE_URL is now read once at backend construction (was per-call). Set the env var before importing atomic_agents, or restart the process to pick up a change.
- Anthropic tool errors now propagate is_error: True on tool_result blocks. This is a real improvement (the model gets a proper API-level error signal), but it is a new wire shape that downstream eval harnesses comparing transcripts before and after the migration will see.
- The format_tool_results Protocol signature changed from single-arg (PR 1) to three-arg (tool_uses + tool_results + assistant_text, PR 2). No external consumers existed when this changed; surfaced only for completeness.
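The MOONSHOT_BASE_URL change above is easiest to see in a small sketch. The class name and the full URLs below are assumptions made for illustration (the notes only name the api.moonshot.cn and api.moonshot.ai hosts); the point is only that a value read in the constructor does not follow later changes to the environment.

```python
import os

# Assumed default URL; the release notes only name the api.moonshot.cn host.
DEFAULT_BASE_URL = "https://api.moonshot.cn/v1"


class MoonshotBackendSketch:
    """Hypothetical stand-in for a backend that resolves its endpoint at construction."""

    def __init__(self) -> None:
        # Read the override once; later environment changes are not observed.
        self.base_url = os.environ.get("MOONSHOT_BASE_URL", DEFAULT_BASE_URL)


os.environ["MOONSHOT_BASE_URL"] = "https://api.moonshot.ai/v1"
backend = MoonshotBackendSketch()

# Changing the variable afterwards does not affect the existing instance.
os.environ["MOONSHOT_BASE_URL"] = "https://example.invalid/v1"
print(backend.base_url)
```

Under this model, an operator who changes the variable has to construct a new backend, which in practice means restarting the process.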
Added
- atomic-agents review --backend kimi subcommand (#134): cross-family adversarial code review via Moonshot. Closes the gap that surfaced during the PR #117 + #118 spec reviews, when Codex hung mid-round and the only fallback was an Opus subagent (same model family as the author). New atomic_agents/review.py module with a ReviewRequest / ReviewResult dataclass surface and a built-in system prompt that enforces CLAUDE.md rule #12 (verify before claim: every finding must quote file:line evidence). The CLI accepts --prompt or --prompt-file, plus optional --target (primary file under review), --read-files (comma-separated grounding context), --working-dir, --model, and --max-tokens (default 16000, because reasoning-style Moonshot models like Kimi K2.x consume a large slice for internal reasoning_content). All operator-supplied paths flow through the framework's canonical _io.safe_resolve_under guard: .. segments and absolute paths that escape --working-dir raise PathTraversalError (matches the discipline used everywhere else paths cross a trust boundary). Empty or whitespace-only --prompt and --prompt-file are rejected with exit 1 before the LLM call (no silent paid no-op). When a reviewer returns empty visible content (the documented K2.x thinking-model failure mode; see #146), a WARNING precedes the cost summary so operators don't pay for blank reviews silently. Review output writes to stdout and the cost summary to stderr, so piping to a file doesn't pollute the artifact. The default model is moonshot/moonshot-v1-128k (non-thinking, produces visible content reliably); Kimi K2.6 / K2.5 are priced in the table but require #146's reasoning_content extraction work before they become the recommended reviewer. 26 unit tests cover prompt assembly, path-traversal refusal (relative, absolute, target + read-files), file resolution, backend dispatch, model overrides, cost summary stream routing, empty-content warning, empty-prompt guards, and CLI integration. Live-validated against open PR #145 ($0.005 / 21s per review). docs/methodology.md §"Codex as a real outside voice" gains a new "Reviewer roster" sub-section explaining when to use Codex vs Opus subagent vs Kimi, including the honest empirical note that today's Kimi default model is a weaker reviewer than Opus or Codex (use as a third opinion alongside, not as a substitute) and a security caveat that MOONSHOT_BASE_URL determines where the API key + prompt + read-files contents are sent.
- _llm._call_moonshot reads the MOONSHOT_BASE_URL / ATOMIC_AGENTS_MOONSHOT_BASE_URL env vars for endpoint override. Operators with keys issued via the international portal (api.moonshot.ai) can now use the framework; default behavior is unchanged (api.moonshot.cn). Stopgap until proper per-region routing lands with the LLMBackend protocol (#87).
- _costs.PRICING extended with Moonshot model entries: moonshot/moonshot-v1-{8k,32k,128k} (non-thinking), moonshot/kimi-k2.6 + moonshot/kimi-k2.5 (thinking, api.moonshot.ai naming), and moonshot/kimi-k2-0905-preview + moonshot/kimi-k2-0711-preview (thinking, api.moonshot.cn naming; also referenced by the existing tests/test_llm_tool_uses.py fixture). All entries are at placeholder rates of $0.30 / $1.20 per Mtok in/out; verify against Moonshot's current published pricing before depending on dashboard cost totals.
- README hero diagram: at-a-glance SVG at the top of the README showing the three core value claims (agent-as-folder, stateless cost-capped runtime, grepable JSONL audit trail) and the four shipped runtime shapes (cron · launchd · Claude Code skill · embedded Python). Light + dark variants are wired via a <picture> element so each viewer's system color scheme is matched. Source SVGs at docs/assets/atomic-agents-hero.svg and docs/assets/atomic-agents-hero-dark.svg. Every in-diagram claim (Response field names, JSONL top-level log shape, the runtime list) was verified against shipped code (atomic_agents/types.py for the dataclass surface, docs/samples/caldwell/log/ for real log shape, extras/ for the runtime ports) per CLAUDE.md taste rules #12 + #13; an Opus subagent stood in for the rate-limited Codex round and caught four drift claims (invented run_id / captures JSONL fields, mistyped Response(text, cost, run_id), aspirational MCP server / HTTP service runtime chips, drifted "21 locked spec docs" count) before commit.
- docs/spec/30-responsibility-audit.md: design spec for the responsibility audit primitive, a scheduled or on-demand offline-reflection surface that reads cross-cutting state (tools.md + judges.md + mandates.md + recent run logs + escalation queue) and produces a structured per-action-class coverage report with gap analysis as the primary output. Status: RFC. Origin: #116. Implementation pending in follow-up issues filed after spec merges. Defines the six-row coverage model (Discovery / Authorization / Action execution / Evidence / Reversibility / Escalation) generalized from commerce; per-agent vs project-level scope; CLI surface (atomic-agents audit responsibility) with on-demand + scheduled + doctor-triggered modes; audit-output file format and frontmatter schema; rule-engine vs LLM enrichment two-mode operation; new cost_source: "audit" ledger value (sibling of actor and judge); 4 audit event types (audit_started / audit_completed / audit_failed / audit_budget_exhausted); composition rules with the eval framework (sibling, not collapsed), the dream pipeline (sibling shape, different layer), the doctor (bidirectional cross-reference), and the future PolicyBackend (#89) (fleet-scale composition); 6 doctor checks (check_responsibility_audit_age, check_responsibility_audit_gap_count, check_responsibility_audit_stale_policy, check_responsibility_audit_unused_mandates, check_responsibility_audit_escalation_drift, check_audit_budget_exhausted); backward compat as opt-in (audit never runs automatically; operators schedule explicitly); 5 open questions documented for impl-PR resolution.
- docs/spec/29-mandates.md: design spec for the mandate primitive (durable, operator-granted scoped authority records that live in mandates.md, are referenced by side-effectful action proposals via mandate_id, and are validated by the judge layer's new MandateCheck specialist). Status: RFC. Origin: #115. Implementation pending in follow-up issues filed after spec merges. Defines the Mandate + MandateConstraints + TargetPattern + TimeWindow dataclasses; the mandates.md file format with parser rules; per-agent vs project-root resolution (with can-only-tighten discipline mirroring spec/28's judge-policy floors); the Authorization shape extension (granted_by: "mandate:<id>" + mandate_id field); the MandateCheck judge specialist's 8-step validation order; cost-event ledger split (cost_source: "mandate:<id>") with reservation pattern for concurrent-action TOCTOU defense; 7 mandate lifecycle event shapes; judges.md's new ## Mandates configuration section; 5 doctor checks (check_mandate_health, check_mandate_no_expiry, check_mandate_id_collisions, check_mandate_relaxation_violations, check_mandate_source_hash_drift); read-only CLI surface (atomic-agents mandate list / show / usage; no grant / revoke subcommands because the file IS the operator's grant); backward compat as opt-in.
- docs/spec/01-anatomy.md §"Graduated autonomy": new framework-level section naming graduated autonomy as a stated property of the framework, with the four-class action taxonomy table from spec/28 and the principle that the same agent definition runs at every scale by configuring tools.md / judges.md / mandates.md rather than re-shaping the agent. The docs/spec/28-judge-layer.md Overview gains a one-paragraph cross-link to the new section, framing the judge layer as the mechanism that encodes the principle.
- docs/spec/28-judge-layer.md: design spec for the judge layer, a pre-action validation surface that gates side-effectful tool calls behind a JudgeBackend returning one of four outcomes (allow / block / revise / escalate). Defines action proposal schema, four-outcome model, JudgeBackend protocol surface + capability advertisement, default LLMJudgeBackend reference implementation, action classification (read-only / reversible-write / external-side-effect / high-ris...
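The path guard described for the review subcommand above can be sketched as follows. This is a hypothetical reimplementation, not the framework's _io.safe_resolve_under: the argument order and the local PathTraversalError class are assumptions. It shows one common way to reject both .. segments and absolute paths that land outside a working directory.

```python
from pathlib import Path


class PathTraversalError(ValueError):
    """Local stand-in for the framework exception of the same name."""


def safe_resolve_under(base: Path, candidate: str) -> Path:
    """Resolve candidate against base and refuse any result outside base."""
    base_resolved = base.resolve()
    # Joining with an absolute candidate discards base entirely, so the
    # containment check below also covers absolute paths.
    resolved = (base_resolved / candidate).resolve()
    if not resolved.is_relative_to(base_resolved):  # Python 3.9+
        raise PathTraversalError(f"{candidate!r} escapes {base_resolved}")
    return resolved
```

Resolving before comparing matters here: a purely textual prefix check would accept `notes/../../outside.md`.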
v0.12.0
Public-flip-readiness Minor. Documentation-only — no framework behavior changes from v0.11.0; runtime ships unchanged. Drop-in upgrade from v0.11.0 — no ### BREAKING callouts.
This is the launch shape release: the README rewrite that anchors the public-flip narrative, the repo-surface kit standard for an OSS project (CONTRIBUTING / CODE_OF_CONDUCT / SECURITY / issue + PR templates), and a LICENSE consistency fix that closes the final maintainer-name drift left from the personal-references scrub. Codex adversarial review caught six factual errors / overclaims in the original README draft before merge; all six were applied.
Changed
- README rewritten for the public flip. Framework-first positioning replaces the prior feature-list opener. New tagline lands as the hero: "AI agents that live in your folder, not someone else's database" with subtitle "Vault-native, MIT-licensed, Markdown-source-of-truth." New "Why this exists" opener names "agent state ends up in app databases / vector stores / hosted trace systems / bespoke glue code" as the precise enemy rather than naming specific competitors as universally hosted. New "Current limits" section makes the alpha / single-maintainer / macOS-Linux-primary / only-MemoryBackend-shipped / log-only-alerts state explicit before the comparison matrix. New honest comparison matrix names Letta, Mem0, LangGraph + LangSmith, and direct-SDK with narrower defensible claims (Markdown-source-of-truth, no required server, spec-level file layout), and a "Where the alternatives win" paragraph names where each does better than Atomic Agents. Backend-protocol scaling section now labels org-scale-over-Postgres as "v1 direction" rather than implying shipped today. 6 deployment runbooks linked. Spec count corrected (13 → 21). Caldwell description sharpened to surface the 5 days of real JSONL logs, the helper-pattern day with ~76% cost savings, and the evals across happy/edge/adversarial/decline categories. Caldwell appears as one sample among future samples, not as the headline.
- README badge URLs corrected from github.com/user/* (broken) to github.com/dep0we/*. Version badge added.
- README default ATOMIC_AGENTS_ROOT corrected from ~/agents/agents (duplicated path, typo) to ~/docs/agents per atomic_agents/_platform.py:DEFAULT_AGENTS_ROOT.
- README "spec docs are not aspirational" softened to "spec docs separate shipped behavior from explicit future/deferred boundaries", which is closer to ground truth (some spec sections mark future work explicitly).
- Status section updated v0.10 → v0.11.0; protocol-pattern v1.0 expectation named.
Added
- CONTRIBUTING.md: reading order before opening a PR (CLAUDE.md / TENSIONS.md / methodology.md), branch + commit shape, test expectations, review-in-rounds practice, what lands cleanly vs what needs an issue first.
- CODE_OF_CONDUCT.md: condensed Contributor-Covenant-shaped policy with the project's actual maintainer contact and an enforcement table; not vendored boilerplate.
- SECURITY.md: 90-day disclosure window, in-scope vs documented-honest-limitations (best-effort path-traversal check in MCP args, advisory-only cost guardrails without the shared helper, plain-markdown-no-encryption-at-rest), operator hygiene checklist.
- .github/ISSUE_TEMPLATE/{bug,feature,question}.md: structured templates matching the project's existing issue conventions (title prefixes, env/repro/scope sections).
- .github/pull_request_template.md: mirrors the project's existing PR shape (Summary / Why / Test plan / Design alignment self-check against the 14 CLAUDE.md design rules).
Fixed
- LICENSE copyright line flipped from "Copyright (c) 2026 Dan Powers" to "Copyright (c) 2026 atomic-agents-stack contributors". Matches the pyproject.toml author field from the personal-references scrub (issue #77 / PR #92). The Codex adversarial review of the launch README flagged the inconsistency between LICENSE and pyproject.toml as small but visible drift: it was the only place the maintainer name still surfaced after the scrub.
Process notes (operator-visible context, not behavior changes)
- Codex adversarial review run against the launch README before the public flip (per the codex_reviews_mandatory operator preference). Six findings caught and applied: factual errors in the comparison matrix (Letta has self-hosted Docker; Mem0 OSS exists; LangGraph has filesystem-backed memory via Deep Agents and LangSmith for observability; the original matrix overclaimed all four), "runs anywhere markdown does" softened to "Markdown-source-of-truth" because atomic_agents/_locks.py imports POSIX fcntl unconditionally, "spec docs are not aspirational" softened to "separate shipped behavior from explicit future/deferred boundaries", dangling "Your first agent below" reference removed, and the LICENSE consistency fix above. Recommendation: "Revise before public flip" (adopted).
- Public-flip launch shape work shipped (closes #93). The launch-shape design doc from the /office-hours session lands as the README + repo-surface kit above.
v0.11.0
Documentation-heavy Minor focused on public-flip readiness. Closes the operator-doc gaps that were blockers for a credible public release: every supported deployment shape now has its own runbook, every public exception is cataloged, and the Caldwell sample is unambiguously fictional. No new framework features; the runtime ships unchanged from v0.10.0. Drop-in upgrade from v0.10.0 — no ### BREAKING callouts.
Added
- docs/deployment/obsidian.md: Obsidian-backed deployment guide (833 lines, closes #67). The operator runbook for running an Atomic Agent on top of an Obsidian-synced vault. Covers recommended vault layout, .obsidian/sync.ignore patterns with per-line rationale, .obsidian/ config dir handling, AgentLock race conditions vs Sync writes, _dashboard/index.html self-containment, conflict copy recovery, a 9-step worked first-run example, and cross-platform read/edit-vs-run boundaries. Cites concrete code paths (_locks.py:54-77, _io.atomic_write, _platform.py:get_agents_root, mcp.py:616, memory/filesystem.py walker, migrate.py:find_content_files, dashboard/render.py:7-9).
- docs/deployment/programmatic.md: Programmatic invocation guide + complete public exception table (761 lines, closes #69). The Python-embedded path for using the framework inside another app. Covers when to use programmatic vs CLI, the Agent + call() public surface, cost-guardrail handling, memory API (canonical agent.memory.* + deprecated re-exports), helper/delegate semantics with the one-level constraint, concurrency model, three worked examples (single-shot cron embed, custom orchestrator, subprocess-safe gunicorn-shaped worker), and a complete public exception table covering 18 exported classes plus a "raised but not yet in __all__" subsection for 9 internal ones (follow-up tracked in #99).
- docs/deployment/disaster-recovery.md: Disaster recovery runbook (741 lines, closes #72). Symptom-organized runbook for first-response when something goes wrong. Nine scenarios covering stale-lock recovery (lsof diagnosis, why .lock persists after flock release), mid-run crash recovery via atomic-write guarantees, corrupted INDEX repair via FilesystemBackend.list_orphans(), migration rollback flow, memory-write races, Obsidian Sync conflict recovery, .versions/ snapshot management, doctor failure-mode mapping, and the git-as-canonical-backup pattern. Every recovery shows the exact command to run plus how to verify the fix.
- docs/deployment/cost-guardrail-sizing.md: Cost guardrail sizing guidance (521 lines, closes #73). How to pick numbers for daily/monthly caps and cap action. Includes a current pricing-per-MTok snapshot from _costs.py, the 14-day observe-then-apply pattern from spec/09, and seven role archetypes with recommended starting caps: personal financial advisor, daily-brief cron, interactive skill-mode, helper-heavy summarizer, goal-driven autonomous, high-stakes single-call modeling, and multi-role coordinator. Names a real schema gap (no per_call_usd field today) that's tracked as a follow-up in #100.
- Cross-references between the four new deployment docs so a reader landing on any one of them can fan out to the others without scrolling back to a directory listing.
Changed
- Caldwell sample model.md cost guardrails backfilled to real Archetype A values: replaced the placeholder daily_cap_usd: 5.00 / monthly_cap_usd: 100.00 / enabled: false block with the recommended personal-financial-advisor archetype from the new sizing guide (daily_cap_usd: 0.50, monthly_cap_usd: 7.00, daily_cap_action: fallback, enabled: true). Operators copying the sample now see a real worked example matching documented recommendations rather than a placeholder block.
- Personal-reference scrub for public release (issue #77). Stripped maintainer-identifying details ahead of the public flip:
  - Maintainer-name references removed from code and tests: the pyproject.toml author field, the _platform.py env-var docstring, two test-fixture description strings (tests/test_capture.py, tests/test_schema.py), and three doc-side mentions (CLAUDE.md, docs/TENSIONS.md, docs/methodology.md). One narrative sentence in the Caldwell financial-modeling sample skill that anchored on the maintainer was rewritten to be context-neutral. The package author field now reads atomic-agents-stack contributors.
  - Sample-persona name genericized in spec docs. Every reference to the Caldwell sample's user-persona name outside docs/samples/caldwell/ was rewritten as "the operator" / "the user" / "you" / placeholder, covering docs/architecture.md, docs/README.md, docs/GOVERNANCE.md, docs/appendix/portability.md, docs/methodology.md, and every spec doc (spec/01–spec/13). The persona name still appears inside the Caldwell sample, where the framing as "the sample's fictional user" is correct. The example USER.md content in spec/01-anatomy.md is now generic so the spec no longer depends on the sample's persona.
  - Caldwell sample reshape. The sample's surface details were rewritten so the persona reads as unambiguously synthetic: Director of Operations at Atlas Logistics (was Head of IT at a fictional industrial conglomerate), freelance technical editing on the side (drops the prior consulting-practice angle), married + 2 kids in school (was 4 kids + 5 grandkids), Madison, WI (was a Tennessee location). The spouse name was removed throughout; the spouse-side-business project was replaced with project_freelance_editing_growth.md plus a renamed project_freelance_retainers.md. Downstream sample artifacts (journal entries, evals, dashboard sample data, log JSONL summaries) updated to match.
  - Canonical example agent names genericized. Two named example-agents in the docs (placeholders for the maintainer's actual advisor agents) were replaced with agent-a / agent-b placeholders across ~15 doc files (docs/architecture.md, docs/README.md, docs/appendix/portability.md, every spec doc that listed example agents, every implementation guide).
  - Real project-name leaks removed. Personal-vault example paths in spec/01-anatomy.md and samples/caldwell/tools.md were rewritten with generic placeholders. Folder-name examples in implementation/claude-skill-agent.md and implementation/chatgpt-skill-agent.md that had used the maintainer's real project names were replaced with agent-a/ / agent-b/. Also caught and fixed three "Maya 2026" → "May 2026" date typos in spec/07 and implementation/chatgpt-skill-agent.md.
  - Acceptance criteria from the issue verified: case-insensitive greps for the maintainer name, the sample-persona name outside docs/samples/, and the four real project-name patterns all return zero hits across .md / .py / .toml. Full test suite (720 tests) passes. Sample remains internally consistent.
Fixed
- README "What's shipped" table refreshed to add v0.10.0 rows (MCP client, MemoryBackend, doctor, deployment docs) and consolidate the inflated v0.2–v0.8 labels. Those modules all actually shipped in v0.9.0; the leading-zero v0.X labels were aspirational milestone numbers from the build sequence. Versions in the table are now real release tags.
- README "What's shipped" table swept again post-cluster to add four new deployment-doc rows (obsidian.md, programmatic.md, disaster-recovery.md, cost-guardrail-sizing.md) and broaden the docs/deployment/ description in the Repository structure block. The four rows ship as ✅ v0.11.0.
- Public-API audit (filed as follow-up #99): discovered while documenting the programmatic invocation path that 9 exception classes are raised inside the package but not in atomic_agents/__init__.py's __all__. The new programmatic.md documents current behavior honestly and surfaces the gap; the actual __all__ promotion + CLI exit-code parity work is queued for a future Minor.
v0.10.0 — MCP, MemoryBackend, doctor, release docs
First Minor release after the spec-completion v0.9.0. Adds the MCP client, the
MemoryBackend protocol (the watershed for the protocol-pattern scaling
roadmap), the atomic-agents doctor preflight CLI, and the SemVer/upgrade
documentation that turns "what version are you running" into an answerable
question. No ### BREAKING changes — drop-in upgrade from v0.9.0.
Added
atomic-agents doctor preflight CLI (issue #66, PR #75)
- New CLI subcommand: atomic-agents doctor [--agent <name>] [--agents-root <path>] [--json] [--no-mcp]. Runs nine independent checks (env, python, vault, provider-keys, model, mcp, locks, memory-backend, write-paths) and reports each as pass / fail / skip.
- Each failing check emits a fix_hint containing the literal command needed to resolve it (e.g. security add-generic-password ... -s atomic-agents-anthropic -w '<key>' for a missing Keychain entry).
- Provider-keys check reuses the production lookup chain (_llm._get_key()) so doctor's verdict can never disagree with runtime behaviour. Provider inference follows _costs.PRICING keys: claude-* → anthropic, gpt-* → openai, moonshot/* → moonshot. Also verifies the optional provider SDK is importable: gpt-* and moonshot/* selections require the openai extra, and doctor fails fast instead of letting the runtime hit ImportError on first call.
- MCP check exercises the real stdio handshake (session.initialize + list_tools) per declared server, threading tools.md read_paths through parse_mcp_md so the same PathTraversalError that runtime would raise surfaces at install time. Bounded by a 10-second default wall-clock timeout: a server that starts but never replies fails the check instead of hanging the CLI. Skipped via --no-mcp or when mcp.md is absent.
- Cascade-aware: when the agent path matches <system>/projects/<project>/agents/<role> (spec/06), model.md and tools.md are resolved via _cascade.resolve_* so role-level config satisfies the vault check and downstream parsers see the same config the runtime would.
- Locks check uses flock(LOCK_NB) to distinguish a lingering lock file (normal) from an actively-held lock (problem); flags stale when held + mtime > 300s.
- Write-paths check verifies the agent's memory/ directory falls inside at least one write_path, is NOT shadowed by a read_only_path, and is itself os.W_OK writable on disk. FilesystemBackend.write_note() enforces all three at runtime, so a misalignment would otherwise fail after the agent has already spent tokens.
- Malformed config (bad YAML in model.md, etc.) is reported as a FAIL CheckResult, not as an exit-2 doctor crash. Exit-2 is reserved for genuine bugs in doctor itself.
- Output formats: human-readable aligned table by default, machine-readable JSON via --json (intended for Cloud Run liveness probes / launchd preflight).
- Exit codes: 0 all-pass, 1 any-fail, 2 doctor itself crashed.
- Spec: docs/spec/27-doctor.md. Getting-started gains a "Verify your install" step (§9).
- Codex review across three pre-merge rounds: 9 P2 findings closed (cascade resolution, parse-error containment, optional SDK detection, MCP read_paths enforcement, memory-in-write_paths verification, YAML-syntax detection, empty-write_paths-in-agent-scope FAIL, MCP handshake timeout, direct memory-dir os.W_OK).
- 54 new tests in tests/test_doctor.py covering each check's PASS + FAIL paths, every codex-fix scenario, and CLI integration (exit codes, JSON shape, crash → exit 2).
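The locks check described above (non-blocking flock plus an mtime threshold) can be sketched roughly as follows. The function name and return labels are hypothetical, and the sketch is POSIX-only because it relies on the standard-library fcntl module; it is not the doctor's actual implementation.

```python
import fcntl
import os
import time


def classify_lock(path: str, stale_after_s: float = 300.0) -> str:
    """Return 'absent', 'lingering', 'held', or 'stale' for a lock file."""
    if not os.path.exists(path):
        return "absent"
    fd = os.open(path, os.O_RDWR)
    try:
        try:
            # Non-blocking attempt: succeeds only if nobody else holds the lock.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            # Someone holds it; an old mtime suggests the holder may be stuck.
            age_s = time.time() - os.path.getmtime(path)
            return "stale" if age_s > stale_after_s else "held"
        fcntl.flock(fd, fcntl.LOCK_UN)
        return "lingering"  # file exists but is not locked, which is normal
    finally:
        os.close(fd)
```

The key design point is that the mere presence of the lock file says nothing; only a failed non-blocking acquire shows that a live process still holds it.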
SemVer policy + upgrade runbook (issue #68, PR #76)
- New: docs/deployment/versioning.md, the full SemVer policy with project-specific Major/Minor/Patch definitions (schema break vs new feature vs bug fix), pre-1.0 caveat, and the release-cutting procedure (extract CHANGELOG section via awk, tag with annotation, create GitHub Release with --notes-file).
- New: docs/deployment/upgrading.md, the operator runbook: read release notes → pull → copy migration script(s) into <vault>/_migrations/ → python -m atomic_agents.migrate --status → --to vN --dry-run → --to vN → verify (atomic-agents doctor in v0.10.0+; pre-v0.10 falls back to an info + run smoke check) → restart LaunchAgents.
- Updated: README.md gains a "Versioning & upgrades" section linking both docs.
- Updated: CHANGELOG.md header now documents Keep-a-Changelog section conventions (Added / Changed / Deprecated / Removed / Fixed / Security) and the ### BREAKING callout convention for any change that forces operator work to upgrade.
- Tagged historical releases v0.1.0 and v0.9.0 retroactively at the commits where their CHANGELOG entries landed, so git tag -l and the GitHub Releases page now match the CHANGELOG history.
- Codex review across five pre-merge rounds: 11 P2 findings closed (pre-1.0 bump-rule consistency, migrate --to vN requirement, migration scripts location (<vault>/_migrations/), GitHub Release notes from CHANGELOG (not --notes-from-tag), doctor reference gating to v0.10.0+, no-op migrate behavior, rollback semantics, single-snapshot-per---to, CURRENT_SCHEMA_VERSION lives in the package not the vault).
MCP (Model Context Protocol) client support (PR #55, follow-up #56)
- New module atomic_agents/mcp/ enables agents to consume tools from external MCP servers (stdio transport).
- Server registry parsed from <agent>/mcp.md; tool collision detection across MCP + custom tools (ToolNameCollision).
- Validator integration end-to-end so server-side schema rejections surface as MCPValidationError to the agent.
- Env merge semantics: agent-level + per-server env vars compose without leaking parent process env.
- Codex review (6 findings) closed before merge; covers env merge, validator wiring, collision detection, server lifecycle.
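Tool collision detection across MCP and custom tools, as listed above, might look something like the sketch below. The function name, the data shapes, and the local ToolNameCollision class are illustrative assumptions; the real registry's structure is not described in these notes.

```python
class ToolNameCollision(Exception):
    """Local stand-in for the framework exception of the same name."""


def merge_tool_names(custom: list[str], mcp_by_server: dict[str, list[str]]) -> dict[str, str]:
    """Map each tool name to its source, raising on the first duplicate name."""
    owners: dict[str, str] = {name: "custom" for name in custom}
    for server, names in mcp_by_server.items():
        for name in names:
            if name in owners:
                raise ToolNameCollision(
                    f"tool {name!r} from MCP server {server!r} "
                    f"collides with one already provided by {owners[name]!r}"
                )
            owners[name] = server
    return owners
```

Failing at registration time, rather than letting a later definition silently shadow an earlier one, keeps it unambiguous which implementation a model-issued tool call will reach.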
MemoryBackend protocol + FilesystemBackend default (PR #57, follow-up #58/#59)
- New atomic_agents.memory package with the MemoryBackend Protocol, Note / NoteRef / VersionRef / WritePolicy / StagedMemory / MemoryStats dataclasses, and FilesystemBackend as the default registered backend.
- Boil-the-lake refactor of the memory layer: 9 call sites (agent.py, dream.py, tuning.py, both dashboards, cli.py, _capture.py, _versioning.py) route through agent.memory instead of direct filesystem operations.
- WritePolicy enforced at write_note, apply_staging, and inside staged writes. Security-equivalent to the prior write-path enforcement, now backend-pluggable.
- Atomic dir-swap in apply_staging is rollback-safe (microsecond-precision archive name + restore-on-failure).
- New exceptions: BackendNotRegistered, VersionNotFound, StagingNotApplied.
- Spec doc: docs/spec/20-memory-backend.md.
- Test count: 626 → 668 (35 conformance tests in test_memory_protocol_conformance.py + 10 fs-specific in test_memory_filesystem_backend.py + 4 live-agent integration in test_memory_integration.py).
- Two scoped follow-ups deferred to issues #58 (dream → staging.write_note(capture, policy)) and #59 (CLI → agent.memory instead of direct backend).
- Codex reviewed the scope (10 findings → rev 2) and the implementation diff (4 P1 + 7 P2 + 3 P3 → all closed in fix commit).
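The rollback-safe directory swap mentioned above can be illustrated with a minimal sketch. The function name and archive naming scheme are assumptions; only the two ideas from the notes (a microsecond-precision archive name and restore-on-failure) are taken from the source.

```python
import os
from datetime import datetime, timezone


def swap_in_staging(live_dir: str, staging_dir: str) -> str:
    """Replace live_dir with staging_dir; return the archive path of the old tree."""
    # %f is microseconds, so two swaps in the same second get distinct names.
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%S.%f")
    archive_dir = f"{live_dir}.archive-{stamp}"
    os.rename(live_dir, archive_dir)      # step 1: move the current tree aside
    try:
        os.rename(staging_dir, live_dir)  # step 2: promote the staged tree
    except OSError:
        os.rename(archive_dir, live_dir)  # restore so live_dir is never left missing
        raise
    return archive_dir
```

Each rename is atomic on a single POSIX filesystem, but the pair is not; the restore branch is what covers the window between the two steps.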
Issue tracking convention
All scoped follow-ups, codex-deferred items, and future enhancements now go to GitHub Issues at dep0we/atomic-agents-stack with label conventions: enhancement, documentation, infrastructure, polish, backend (new — protocol abstractions), deployment (new — install / upgrade / runbooks), spec, bug. Title prefix [scope] (e.g. [backend], [deployment], [v0.X], [polish]).
Roadmap as live backlog
- 6 backend-protocol scaling issues filed (#60–#65): LockBackend (urgent — multi-process cliff), LogBackend, PersonaBackend, AgentProfileBackend, ToolRegistryBackend, CorpusBackend.
- 8 deployment-readiness issues filed (#66–#73): doctor CLI, Obsidian guide, SemVer + release pipeline, programmatic API docs, cost alert webhook, launchd template stamper, recovery runbook, cost guardrail sizing.
- Filter via gh issue list --label backend or gh issue list --label deployment.
v0.9.0 — Spec-completion release
Spec-completion release. The full v0.x build sequence is landed: every deferred spec module from v0.1 plus operational extras and an in-repo copy of the spec.
Added
Eval runner (atomic_agents.eval, was issue #1, PR #12)
- EvalRunner class with run_test, run_suite, and category/test filters.
- Cross-family LLM-as-judge: Claude scores OpenAI agents, OpenAI scores Claude agents, and an agent is never judged by its own model. Same-family fallback when no cross-family judge is available; raises NoJudgeAvailable if none.
- Rubric weighting: per-dimension weights from evals/rubric.md frontmatter; weighted score in [0,5]; threshold-based pass/fail.
- Hard-fail override: any rubric dimension marked hard_fail: true in the rubric forces a failed verdict regardless of weighted score.
- Malformed-judge-JSON retry: one retry with stricter "JSON only" reminder before recording judge_error.
- Run logs land in evals/runs/YYYY-MM-DD.jsonl; long agent responses persisted separately under evals/runs/responses/ and referenced from the JSONL line.
- CLI: python -m atomic_agents.eval <agent> [--category|--test|--all|--summary-only|--no-write].
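The rubric weighting and hard-fail override above combine roughly as in this sketch. The dataclass, its field names, and in particular the `violated` flag are assumptions: the notes do not say how a hard-fail dimension is judged to have tripped, so the sketch assumes the judge reports it explicitly.

```python
from dataclasses import dataclass


@dataclass
class DimensionScore:
    name: str
    weight: float
    score: float             # judge score in [0, 5]
    hard_fail: bool = False  # marked hard_fail: true in the rubric
    violated: bool = False   # assumption: the judge flags a violation explicitly


def verdict(dims: list[DimensionScore], threshold: float) -> tuple[float, bool]:
    """Return (weighted score in [0, 5], passed)."""
    total_weight = sum(d.weight for d in dims)
    weighted = sum(d.weight * d.score for d in dims) / total_weight
    # A tripped hard-fail dimension overrides an otherwise passing score.
    hard_failed = any(d.hard_fail and d.violated for d in dims)
    return weighted, (weighted >= threshold and not hard_failed)
```

For example, scores of 5 and 2 with weights 2 and 1 give (2·5 + 1·2) / 3 = 4.0, which passes a 3.5 threshold unless a hard-fail dimension was violated.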
Tuning analyzer (atomic_agents.tuning, was issue #2, PR #22)
- Eval-driven self-improvement per spec/11. Detects four pattern types from recent eval runs: recurring persona-fidelity miss, recurring hard-fail, stale memory reference, promotable hot memory.
- EditProposal dataclass: each detected pattern emits a concrete proposed edit with the eval evidence inline.
- Optional LLM polish (~$0.02 per proposal) to improve report wording without changing recommendations.
- Reports land in evals/tuning_reports/YYYY-MM-DD_proposal.md. Operator approves/rejects in the report file.
- --apply writes approved diffs to the target persona/memory/tools files via atomic_write, respecting tools.md write_paths. Diffs that are instructional (multi-step, comment-only) are flagged as manual-apply with a skip reason; all decisions (applied, skipped, rejected, deferred) land in evals/tuning_history.jsonl. Use --dry-run with --apply to preview what would change without writing.
- CLI: python -m atomic_agents.tuning <agent> [--since|--apply|--polish|--dry-run].
Goal manager (atomic_agents.goal, was issue #3, PR #14)
- Goal + sub-goal lifecycle for goal-driven and hybrid agents per spec/12.
- GoalManager: load/save <agent>/goal.md, dispatch logic (next_sub_goal filters by blocked_by chain), status transitions with sanity enforcement, history JSONL.
- Pacing analysis in progress_report: planned vs. elapsed days, on-track / behind / ahead verdict.
- Non-destructive abandon and complete (archives goal.md to goal_archive/<date>_<slug>.md).
- Operating modes: reactive, goal-driven, hybrid. The manager works the same across all three.
- CLI: python -m atomic_agents.goal <agent> {status|next|advance|abandon|complete|report}.
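As a rough illustration of dispatch that filters on blocked_by, the sketch below returns the first pending sub-goal whose direct blockers are all done. The SubGoal fields and the "pending" / "done" status names are assumptions for the example; the real goal.md schema and GoalManager API are not shown in these notes, and a real implementation may walk the blocker chain transitively.

```python
from dataclasses import dataclass, field


@dataclass
class SubGoal:
    """Illustrative sub-goal record; the real goal.md schema may differ."""
    id: str
    status: str = "pending"  # assumed states: "pending" or "done"
    blocked_by: list = field(default_factory=list)


def next_sub_goal(sub_goals):
    """Return the first pending sub-goal whose direct blockers are all done."""
    status = {g.id: g.status for g in sub_goals}
    for goal in sub_goals:
        if goal.status != "pending":
            continue
        # An unknown blocker id never equals "done", so the goal stays blocked.
        if all(status.get(blocker) == "done" for blocker in goal.blocked_by):
            return goal
    return None


plan = [
    SubGoal("outline", status="done"),
    SubGoal("draft", blocked_by=["outline"]),
    SubGoal("publish", blocked_by=["draft"]),
]
chosen = next_sub_goal(plan)
```

Here "draft" is dispatched because its only blocker is done, while "publish" stays blocked until "draft" completes.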
Schema migration runner (atomic_agents.migrate, was issue #4, PR #16)
- Vault-wide schema migrations with mandatory snapshot + automatic rollback on validation failure.
- MigrationScript Protocol: declares FROM_VERSION, TO_VERSION, applies_to, and a pure migrate(content_dict) function.
- Snapshot format: gzipped tarball under <vault>/_migrations/snapshots/<timestamp>.tar.gz; small for typical vaults, restorable with --rollback.
- Migration plan walks the script chain from_current → to_target; refuses to skip versions.
- Post-validation re-parses every changed file against the target schema; any failure rolls back the entire batch.
- Safety property: the package's CURRENT_SCHEMA_VERSION and the migration ladder ship together. Until both are present, post-validation rejects new-schema files and rolls back, so the vault can never silently land in an unsupported state.
- CLI: python -m atomic_agents.migrate [--to|--dry-run|--status|--rollback|--list-snapshots].
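A minimal sketch of a script chain that refuses to skip versions might look like this. The Protocol members mirror the names listed above, but their types (integer versions, applies_to as a method taking a path) are assumptions, and build_plan plus the two example scripts are hypothetical helpers written for this illustration.

```python
from typing import Protocol


class MigrationScript(Protocol):
    """Assumed shape: integer versions and a method-style applies_to."""
    FROM_VERSION: int
    TO_VERSION: int

    def applies_to(self, path: str) -> bool: ...
    def migrate(self, content_dict: dict) -> dict: ...


class AddTagsField:
    FROM_VERSION = 1
    TO_VERSION = 2

    def applies_to(self, path):
        return path.endswith(".md")

    def migrate(self, content_dict):
        out = dict(content_dict)  # pure: work on a copy, never mutate the input
        out.setdefault("tags", [])
        out["schema_version"] = self.TO_VERSION
        return out


class RenameKindToType:
    FROM_VERSION = 2
    TO_VERSION = 3

    def applies_to(self, path):
        return path.endswith(".md")

    def migrate(self, content_dict):
        out = dict(content_dict)
        if "kind" in out:
            out["type"] = out.pop("kind")
        out["schema_version"] = self.TO_VERSION
        return out


def build_plan(scripts, current, target):
    """Chain scripts from current to target, failing on any gap in the ladder."""
    by_from = {s.FROM_VERSION: s for s in scripts}
    plan, version = [], current
    while version < target:
        step = by_from.get(version)
        if step is None or step.TO_VERSION <= version:
            raise LookupError(f"no migration from version {version}; refusing to skip")
        plan.append(step)
        version = step.TO_VERSION
    if version != target:
        raise LookupError(f"chain overshoots target {target}")
    return plan


plan = build_plan([AddTagsField(), RenameKindToType()], current=1, target=3)
migrated = {"kind": "note", "schema_version": 1}
for step in plan:
    migrated = step.migrate(migrated)
```

Keeping migrate pure means a failed post-validation only has to discard the new dictionaries and restore the snapshot; nothing was mutated in place.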
Tool-call captures (Path 1) (atomic_agents._capture, was issue #5, PR #16)
- Structured tool-call extraction alongside the existing fenced-JSON parser per spec/05. Provider SDKs validate inputs against schema before they reach the helper, eliminating the malformed-JSON failure mode.
- CAPTURE_TOOL_SCHEMA: shared JSON Schema; same taxonomy and required fields as the fenced-block validator.
- anthropic_tool_definition() and openai_tool_definition(): provider-specific format wrappers.
- extract_tool_call_captures() and extract_all_captures(): combined Path 1 + Path 2 extractor with priority-aware dedup (tool calls win on collisions).
- _RawLLMResponse.tool_uses field, normalized across Anthropic and OpenAI/Moonshot.
- AtomicAgent._capture_tool_definitions(model) picks the right per-provider formatter; agent.call() passes the capture tool to every LLM call and extracts captures from both paths.
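To show what "provider-specific format wrappers" around one shared schema can look like, here is a hedged sketch. The schema fields, tool name, and description are placeholders invented for the example (the real CAPTURE_TOOL_SCHEMA is not reproduced in these notes). The two envelope shapes follow the public Anthropic Messages and OpenAI Chat Completions tool formats.

```python
# Hypothetical stand-in for the shared schema; the real field list is not shown here.
EXAMPLE_CAPTURE_SCHEMA = {
    "type": "object",
    "properties": {
        "type": {"type": "string"},
        "title": {"type": "string"},
        "body": {"type": "string"},
    },
    "required": ["type", "title", "body"],
}

TOOL_NAME = "atomic_capture"  # assumed tool name
TOOL_DESCRIPTION = "Record a memory note for the agent."  # assumed wording


def anthropic_tool_definition():
    """Anthropic tools carry the JSON Schema under 'input_schema'."""
    return {
        "name": TOOL_NAME,
        "description": TOOL_DESCRIPTION,
        "input_schema": EXAMPLE_CAPTURE_SCHEMA,
    }


def openai_tool_definition():
    """OpenAI-style tools nest the schema under function.parameters."""
    return {
        "type": "function",
        "function": {
            "name": TOOL_NAME,
            "description": TOOL_DESCRIPTION,
            "parameters": EXAMPLE_CAPTURE_SCHEMA,
        },
    }


anthropic_tool = anthropic_tool_definition()
openai_tool = openai_tool_definition()
```

Both wrappers point at the same schema object, so a change to the taxonomy or required fields reaches every provider at once.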
Multi-agent project cascade loader (atomic_agents._cascade, was issue #6, PR #23)
- Three-layer cascade per spec/06: role / project / instance. When an agent path resolves like <system>/projects/<project>/agents/<role>/, the loader walks up to find the role and project layers.
- CascadePaths dataclass; detect_cascade(agent_root) returns None for single-agent layouts (full backwards compat).
- Layer-1: <role>/PROMPT.md. Layer-2: <project>/{canon.md, style_guide.md, goal.md, policy/*.md}. Layer-3: instance persona/memory/wiki/journal/log + optional tools.md/tools.override.md/model.md overrides.
- tools.md resolution: tools.override.md (additive merge with role) > instance tools.md (replaces role) > role tools.md (base).
- Queue mechanics: claim_next_queued (atomic POSIX rename), release_claim, move_to_dead_letter (with reason file), recover_stale_claims (mtime-based lease expiry).
- assemble_system_prompt() extended for cascade order: role PROMPT → instance persona → tools → project canon/goal/style_guide/policy → memory/wiki/notes/journal.
- parse_tools_md and parse_model_md split into path-based wrappers + text-based core (so cascade-merged content can be parsed without writing to disk).
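The "atomic POSIX rename" claim step can be illustrated with the sketch below. It borrows the claim_next_queued name from the notes, but the signature, the queue/claimed directory layout, and the worker-id suffix are assumptions made for this example rather than the framework's actual behaviour.

```python
import os
import tempfile
from pathlib import Path


def claim_next_queued(queue_dir: Path, claimed_dir: Path, worker_id: str):
    """Claim the first queued file (by name) via rename; return its new path or None."""
    claimed_dir.mkdir(parents=True, exist_ok=True)
    for item in sorted(queue_dir.iterdir()):
        if not item.is_file():
            continue
        target = claimed_dir / f"{item.name}.{worker_id}"
        try:
            # rename() is atomic on POSIX when both paths share a filesystem,
            # so only one of several racing workers can move a given file.
            os.rename(item, target)
        except FileNotFoundError:
            continue  # another worker claimed this item first; try the next one
        return target
    return None


with tempfile.TemporaryDirectory() as tmp:
    queue, claimed = Path(tmp) / "queue", Path(tmp) / "claimed"
    queue.mkdir()
    (queue / "001-task.md").write_text("first")
    (queue / "002-task.md").write_text("second")

    first = claim_next_queued(queue, claimed, "worker-a")
    second = claim_next_queued(queue, claimed, "worker-b")
    third = claim_next_queued(queue, claimed, "worker-c")
    claimed_names = (first.name, second.name, third)
```

Because the claim is a rename rather than a read-then-delete, there is no window in which two workers can both believe they own the same item. The guarantee only holds when queue and claimed directories sit on one filesystem.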
Helper provenance preservation (was issue #7, PR #20)
- Per spec/10 Wave 8: helper output must preserve attribution back to source so the parent can cite it.
- helper_call(..., sources=...) and helper_call_parallel(..., sources=... | sources_per_prompt=...). When sources are passed, a citation instruction and a source bullet list are prepended to the helper's system prompt.
- _detect_provenance(text, sources) heuristic: bracketed citations ([§2, p3], [page 5]), inline phrases (according to, per memo, §3), or verbatim source-basename mention. Conservative: prefers false-positive over false-negative.
- HelperResult.sources echoes the input list; HelperResult.provenance_preserved reports the heuristic verdict.
- Run record JSONL gains sources and provenance_preserved fields when sources are passed; omitted otherwise (log shape unchanged for backwards compat).
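The three signals named for the provenance heuristic could be approximated as below. This is a sketch, not the framework's _detect_provenance: the regular expressions are guesses that cover only the example patterns quoted above, and the function name and return type are assumptions.

```python
import os
import re

# Bracketed citations such as [§2, p3] or [page 5] (patterns are illustrative).
BRACKETED = re.compile(r"\[(?:§\s*\d+|p(?:age)?\.?\s*\d+)[^\]]*\]", re.IGNORECASE)

# Inline attribution phrases and bare section marks such as §3.
INLINE = re.compile(r"\baccording to\b|\bper memo\b|§\s*\d+", re.IGNORECASE)


def detect_provenance(text, sources):
    """Return True when the text appears to attribute its claims to a source."""
    if not sources:
        return False
    if BRACKETED.search(text) or INLINE.search(text):
        return True
    lowered = text.lower()
    for src in sources:
        base = os.path.basename(src)
        # Guard against empty basenames, which would match any text.
        if base and base.lower() in lowered:
            return True
    return False


sources = ["memos/q3_review.pdf"]
bracketed_hit = detect_provenance("Revenue rose 4% [§2, p3].", sources)
phrase_hit = detect_provenance("According to the memo, revenue rose.", sources)
basename_hit = detect_provenance("See q3_review.pdf for the figures.", sources)
miss = detect_provenance("Revenue rose 4%.", sources)
```

A loose phrase match such as "according to" will sometimes fire on text that cites nothing, which is consistent with the stated preference for false positives over false negatives.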
Research integrity Layers 2 + 3 (was issue #8, PR #21)
- Layer 2: source-grounded eval. When a golden test declares expected_facts, _build_judge_prompt appends a "Factual accuracy check" section instructing the judge to verify each fact (stated_in_response, value_correct, cited) and emit a factual_checks array. compute_factual_accuracy_from_checks derives a 1–5 dimension score from the checks (full credit when verified + cited, half credit when stated correctly but uncited). When the rubric weights factual_accuracy but the judge omits a numeric score for it, the runner derives one from the checks; the judge's numeric score takes priority when present.
- Layer 3: research log per response. _helpers_this_run rollup tracks helper calls during a parent run; agent.call() embeds it as helper_provenance in the parent's run log record. The field is omitted when no helpers were called, so log shape stays unchanged for reactive agents.
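One way to turn the per-fact checks into a 1–5 score is sketched here. The credit rule (full when verified and cited, half when correct but uncited) comes from the description above; the linear mapping of mean credit onto the 1–5 range, the rounding, and the function name are assumptions, since the notes do not give the actual formula.

```python
def factual_accuracy_from_checks(checks):
    """Map judge fact checks to a 1-5 score; return None when there are no checks."""
    if not checks:
        return None
    credit = 0.0
    for check in checks:
        correct = bool(check.get("stated_in_response")) and bool(check.get("value_correct"))
        if correct and check.get("cited"):
            credit += 1.0  # verified and cited: full credit
        elif correct:
            credit += 0.5  # stated correctly but uncited: half credit
    fraction = credit / len(checks)
    # Assumption: linear mapping, 0 credit -> 1.0 and full credit -> 5.0.
    return round(1.0 + 4.0 * fraction, 2)


checks = [
    {"stated_in_response": True, "value_correct": True, "cited": True},
    {"stated_in_response": True, "value_correct": True, "cited": False},
    {"stated_in_response": False, "value_correct": False, "cited": False},
]
score = factual_accuracy_from_checks(checks)
```

With one fully credited fact, one half-credited fact, and one miss, the mean credit is 0.5, which lands at the midpoint of the scale under this mapping.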
Spec import (docs/, was issue #9, PR #18)
- All 13 spec docs (docs/spec/01-anatomy → 13-research-integrity), architecture.md, docs/README.md, the 7 implementation guides, appendix/portability.md, and the complete Caldwell sample agent (persona, memory, wiki, journal, log, evals/rubric+judge+5 golden tests) imported from the source vault.
- 122 Obsidian wikilinks converted to relative markdown links across 27 files.
- 38 dangling cross-references (filename examples like [[feedback_communication_style]]) converted to inline code so the intent reads correctly. Zero broken markdown links remain.
- Stale lib/atomic_agents.py references updated to atomic_agents (the package name in this repo) across 6 files.
Operational extras (extras/, was issue #11, PR #19)
- Seven Claude Code skill wrappers: atomic-agents-{run,info,eval,tune,goal,dashboard,migrate}. Each is a portable SKILL.md with action-oriented instructions, invocation, output reading, and troubleshooting.
- Three macOS LaunchAgent plist templates: daily run, daily eval suite, hourly dashboard refresh. All three validate with plutil -lint. README walks through substitution, loading, and the Keychain alternative for keys.
- Linux cron templates: crontab.example + run-atomic-agent.sh portable shell wrapper handling env loading, key sourcing from a chmod-600 file, and per-command logging.
- __KEY__ placeholder syntax (double-underscore) for textual placeholders so plist templates remain valid XML during review.
Changed
- Top-level README's "What's shipped" table refreshed to mark every shipped module, including the test count (296).
- docs/README.md status table refreshed to show all shipped modules with their module names.
- Repository structure section in the top-level README expanded to surface docs/ and extras/ trees.
Tests
- 296 total (was 67 in v0.1). New tests by module: eval +27, tuning +25, goal +39, migrate +32, tool-call captures +32, cascade +35, helper provenance +23, research in...
v0.1.0 — Initial release
Initial release. Core framework + cost dashboard.
Added
Core framework (atomic_agents/)
- AtomicAgent class: canonical agent runtime per spec/04. Loads persona (IDENTITY/SOUL/USER), tools.md, model.md, memory INDEX + recent + pinned notes, wiki INDEX, and recent journal entries; calls the LLM with cost-guardrail enforcement; extracts captures; logs every run to JSONL.
- Helper-mediated atomic captures: parses fenced ```atomic_capture JSON blocks (incl. quad-backtick fence), validates against schema, writes new memory notes with INDEX updates using atomic temp+fsync+rename pattern.
- Multi-tier cost guardrails: 50% / 80% / 100% thresholds with skip/fallback/alert actions per model.md.
- Helper functions: helper_call (sequential) and helper_call_parallel (ThreadPoolExecutor fan-out, default 5 concurrent) per spec/10.
- Provider routing: Anthropic primary, OpenAI and Moonshot Kimi as optional extras.
- Per-agent file locking: flock-based with stale-lock recovery on process death.
- Frontmatter validation per spec/03, including Wave 6 date-suffix filename pattern.
- Secrets loading via env vars, macOS Keychain, or ~/.config/atomic_agents/keys.json.
- CLI: atomic-agents run <agent> and atomic-agents info <agent>.
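The temp+fsync+rename pattern mentioned for memory-note writes is a standard technique; a minimal sketch follows. The function name atomic_write_text and its signature are hypothetical and are not taken from the package's own I/O helpers.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_text(path, text):
    """Write text so readers see either the old file or the complete new one."""
    path = Path(path)
    # The temp file lives in the destination directory so the final rename
    # stays on one filesystem and therefore remains atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            tmp.write(text)
            tmp.flush()
            os.fsync(tmp.fileno())  # force the bytes to disk before the swap
        os.replace(tmp_name, path)  # atomic swap into place
    except BaseException:
        # On any failure, remove the partial temp file and re-raise.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp) / "note.md"
    atomic_write_text(note, "v1\n")
    atomic_write_text(note, "v2\n")
    final_text = note.read_text(encoding="utf-8")
    leftovers = [p.name for p in Path(tmp).iterdir() if p.name.endswith(".tmp")]
```

A crash before os.replace leaves the previous file untouched, and a crash after it leaves the complete new file, so a reader never observes a half-written note. Stricter durability would also fsync the containing directory, which this sketch omits.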
Cost & observability dashboard (atomic_agents.dashboard/)
- HTML dashboard renderer per spec/09 — global view (all agents) + per-agent drilldowns.
- Aggregations: per-agent costs, model breakdown, helper savings, cache savings, top expensive runs, daily cost chart, monthly trend (12-month rolling), provider breakdown.
- Suggested cap calculator: after 14 days of observed usage, surfaces recommended daily_cap_usd and monthly_cap_usd for model.md cost_guardrails.
- Self-contained HTML output (inline CSS, no external assets, no JavaScript dependencies).
- Optional local web server (python -m atomic_agents.dashboard serve, port 8765) with /regenerate endpoint for the Refresh button.
- Pure Python aggregation: no LLM calls, no external services, ~30 sec for typical scale.
Tests (67 total)
- Atomic file I/O (write, append, cleanup, crash recovery)
- Per-agent flock (acquire/release, busy + wait scenarios)
- Schema validation (all required fields, type taxonomy, date-suffix filenames)
- Capture parsing (fenced JSON, dedup, multi-block, quad-backtick fence, write-path enforcement)
- Cost calculation (cache hits, period sums, malformed line handling)
- tools.md + model.md parsers
- Dashboard aggregation (load, summarize, helper savings, cache savings, suggested caps)
- Dashboard rendering (HTML output, per-agent + global, edge cases)
Notes
- The Atomic Agents specification (docs/) describes a layered system: spec docs, implementation guides, sample agents, portability appendix. The spec is the central artifact; this repo is the reference implementation.
- This release contains core + dashboard. Eval, tuning, goals, and migration runners ship in subsequent releases.
- Designed as an open standard — anyone can build agents to the spec, with or without using this Python implementation.